<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"><channel><title><![CDATA[ThursdAI - The top AI news from the past week]]></title><description><![CDATA[Every ThursdAI, Alex Volkov hosts a panel of experts, ai engineers, data scientists and prompt spellcasters on twitter spaces, as we discuss everything major and important that happened in the world of AI for the past week. 

Topics include LLMs, Open source, New capabilities, OpenAI, competitors in AI space, new LLM models, AI art and diffusion aspects and much more.  <br/><br/><a href="https://sub.thursdai.news?utm_medium=podcast">sub.thursdai.news</a>]]></description><link>https://sub.thursdai.news/podcast</link><generator>Substack</generator><lastBuildDate>Thu, 05 Mar 2026 11:07:06 GMT</lastBuildDate><atom:link href="https://api.substack.com/feed/podcast/1801228.rss" rel="self" type="application/rss+xml"/><author><![CDATA[From Weights & Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week]]></author><copyright><![CDATA[Alex Volkov]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[altryne@gmail.com]]></webMaster><itunes:new-feed-url>https://api.substack.com/feed/podcast/1801228.rss</itunes:new-feed-url><itunes:author>From Weights &amp; Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week</itunes:author><itunes:subtitle>From Weights &amp; Biases - ThursdAI, the podcast that keeps you ahead of the AI curve. 
Hosted by AI Evangelist Alex Volkov with a changing panel of expert guests, discussing every important piece of AI news and updates from the past week, open source and more</itunes:subtitle><itunes:type>episodic</itunes:type><itunes:owner><itunes:name>From Weights &amp; Biases, Join AI Evangelist Alex Volkov and a panel of experts to cover everything important that happened in the world of AI from the past week</itunes:name><itunes:email>altryne@gmail.com</itunes:email></itunes:owner><itunes:explicit>No</itunes:explicit><itunes:category text="News"><itunes:category text="Tech News"/></itunes:category><itunes:category text="Technology"/><itunes:image href="https://substackcdn.com/feed/podcast/1801228/0ce80819299bc49435239a5b8b31cd45.jpg"/><item><title><![CDATA[📅 ThursdAI - Feb 26 - The Pentagon wants War Claude, every benchmark collapsed, and a solo founder hit $700K ARR with AI agents]]></title><description><![CDATA[<p>Hey, it’s Alex, let me tell you why I think this week is an inflection point.</p><p>Just this week: everyone is launching <strong>autonomous agents</strong> or features inspired by OpenClaw (<a target="_blank" href="https://app.devin.ai/invite/8f9fb2db5c8c4b84a69dd95915137330">Devin 2.2</a>, <a target="_blank" href="https://x.com/leerob/status/2026369424450523348">Cursor</a>, <a target="_blank" href="https://x.com/claudeai/status/2026720870631354429">Claude Cowork</a>, Microsoft, <a target="_blank" href="https://www.perplexity.ai/hub/blog/introducing-perplexity-computer">Perplexity</a> and <a target="_blank" href="https://x.com/NousResearch/status/2026758999488528639">Nous</a> announced theirs), the <strong>METR</strong> and <strong>ArcAGI</strong> 2 and 3 benchmarks are getting <strong>saturated</strong>, one-person companies are nearing $1M ARR within months of operation by running <strong>AI agents 24/7</strong> (we chatted with one of them on the show today, live as he broke the $700K ARR barrier), and the US Department of War gives Anthropic an ultimatum to remove
nearly all restrictions on Claude for war, and <strong>Anthropic says NO</strong>. </p><p>I’ve been covering AI every week for 3 years, and this week feels different. So if we are nearing the singularity, let me at least keep you up to date 😅 </p><p>Today on the show, we covered most of the news in the first hour, plus breaking news from Google: Nano Banana 2 is here. Then we had 3 interviews back to back: Ben Broca with Polsia, Nader Dabit with Cognition and Philip Kiely with Baseten. Don’t miss those conversations, starting 1 hour in. </p><p>Thanks for reading ThursdAI - Highest signal weekly AI news show! This post is public so feel free to share it.</p><p>Anthropic vs Department of War</p><p>Earlier this week, the US “Department of War” invited Dario Amodei, CEO of Anthropic, to a meeting, wherein Anthropic was given an <a target="_blank" href="https://x.com/SeanParnellASW/status/2027072228777734474?s=20">ultimatum</a>: “Remove the restrictions on Claude or Anthropic will be designated as a ‘supply chain risk’ company”, and the DoD will potentially go as far as using the Defense Production Act to force Anthropic to ... comply. </p><p>The two restrictions that Anthropic has in place for their models are: no use for domestic surveillance of American citizens, and NO fully autonomous lethal weapons decisions given to Claude. For context, Claude is the only model that’s deployed on AWS top secret GovCloud and is used through Palantir’s AI platform. </p><p>As I’m writing this, Anthropic <a target="_blank" href="https://www.anthropic.com/news/statement-department-of-war">issued a statement</a> from Dario saying they will not budge on this and will not comply. I fully commend Dario and Anthropic for this very strong backbone, but I fear that this matter is far from over, and we’ll be watching for the government’s response. 
</p><p>EDIT: Apparently the DoD is pressuring Google and OpenAI to agree to the stipulations, and employees from both companies are signing this petition <a target="_blank" href="https://notdivided.org/">https://notdivided.org/</a> to protest against dividing the major AI labs on this topic. </p><p>Anthropic and OpenAI vs the upcoming DeepSeek</p><p>It’s baffling just how many balls are in the air for Anthropic, as just this week they also <a target="_blank" href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">publicly named</a> 3 Chinese AI makers in “Distillation Attacks”, claiming that they broke the Terms of Service to generate over 16M conversations with Claude to improve their own models, while using proxy networks to avoid detection. This marks the first time a major AI company has publicly attributed distillation attacks to specific entities by name.</p><p>The most telling thing to me is not the distillation itself, given that Anthropic <a target="_blank" href="https://fortune.com/2025/09/05/anthropic-reaches-1-5-billion-settlement-with-authors-in-landmark-copyright-case/">just recently settled</a> one of the largest copyright payouts in U.S. history, paying authors about $3,000 per book for books that Anthropic bought, trained on and destroyed to make Claude better. </p><p>No, the most telling thing here is the fact that Anthropic chose to put DeepSeek on top of the accusation list with merely 140K conversations, whereas the other labs generated millions. </p><p>This, plus OpenAI’s <a target="_blank" href="https://www.reuters.com/world/china/openai-accuses-deepseek-distilling-us-models-gain-advantage-bloomberg-news-2026-02-12/">formal memo</a> to Congress about a similar matter, shows that the US labs are preparing for DeepSeek’s new model to drop by saying “Every innovation they have, they stole from us”. 
Apparently DeepSeek V4 is nearly here; it’s potentially multimodal, has allegedly been trained on <a target="_blank" href="https://the-decoder.com/google-openai-and-anthropic-are-all-bracing-for-deepseeks-next-big-release/">Nvidia chips</a> somewhere in Mongolia despite the export restrictions, and it’s about to SLAP! </p><p>Benchmarks? What benchmarks? </p><p>How will we know that we’re approaching the singularity? Will there be signs? Well, this week it seems that the signs are here. </p><p>First, Agentica <a target="_blank" href="https://x.com/agenticasdk/status/2026011339718849020?s=20">claimed</a> that they solved all publicly available “hard for AI” tasks of the upcoming ArcAGI 3, then Confluence Labs <a target="_blank" href="https://www.ycombinator.com/launches/PWR-confluence-labs-an-ai-research-lab-focused-on-learning-efficiency">announced</a> that they got an unprecedented 97.9% on ArcAGI 2, and finally METR published their results on long-horizon tasks, which measure AI’s capability to solve tasks that take humans a given number of hours to do. And that graph is going parabolic, with Claude Opus 4.6 able to complete tasks that take humans 14.6 hours (a horizon that is doubling every 49 days) at a 50% success rate.</p><p>Why is this important? Well, the benchmarks are just telling the story that everyone else in the industry is seeing: approximately since December of 2025, and definitely fueled by the early-Feb drop of <a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-5-opus-46-was-1-for">Opus 4.6 and Codex 5.3</a>, something major shifted. Developers no longer write the code themselves, yet ship 10x more features.</p><p>This became such a talking point that Swyx of <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a> captured it with </p><p><a target="_blank" href="https://wtfhappened2025.com/">https://wtfhappened2025.com/  </a>where he collects evidence of a Schelling point, something that happened in December and, I think, continued throughout February. 
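As a side note, the doubling math quoted above is easy to sketch. Here is a tiny back-of-the-envelope calculation using only the numbers in this post (a 14.6-hour horizon, doubling every 49 days); treating the trend as a clean exponential is an illustrative assumption of mine, not a prediction.

```python
# Back-of-the-envelope sketch of METR's time-horizon trend, using only
# the numbers quoted in this post: a 14.6h horizon at 50% success,
# doubling every 49 days. Assuming the trend stays exponential is an
# illustration only -- real capability curves need not stay on it.
import math

def horizon_hours(days_from_now: float, h0: float = 14.6, doubling_days: float = 49) -> float:
    """Projected task length (hours) solvable at 50% success, t days out."""
    return h0 * 2 ** (days_from_now / doubling_days)

def days_until(target_hours: float, h0: float = 14.6, doubling_days: float = 49) -> float:
    """Days until the projected 50%-success horizon reaches target_hours."""
    return doubling_days * math.log2(target_hours / h0)

print(round(horizon_hours(49), 1))  # one doubling out: 29.2 hours
print(round(days_until(160)))       # ~160h (roughly a human work-month): ~169 days
```

If the 49-day doubling held, month-long human tasks would fall within about half a year, which is exactly why the "graph going parabolic" framing is getting so much attention.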
</p><p>Speaking of benchmarks no longer being valid, OpenAI published that the divergence between <a target="_blank" href="https://x.com/OliviaGWatkins2/status/2026023711720317116">SWE-bench Verified gains</a> and real-life performance is so vast that they will no longer be using SWE-bench Verified, and will be switching to SWE-bench Pro for evaluations. </p><p>Everyone’s autonomous agents (and subagents) are here</p><p>Look, with over 250K GitHub stars and OpenAI getting Peter Steinberger on board, it’s clear now: OpenClaw made a huge dent in how people think about autonomous agents (and subagents!)</p><p>It may be a “moment in time” when model capabilities became “just good enough” to run agents async for a long time, but the big labs noticed the OpenClaw excitement and are shipping like never before to make sure their users don’t switch over!</p><p>Perplexity launched “<a target="_blank" href="https://www.perplexity.ai/hub/blog/introducing-perplexity-computer">Computer</a>“, which has scheduled tasks in a compute environment and can complete long-lasting projects end to end; Cursor pivoted from IDE-only to running agents <a target="_blank" href="https://x.com/leerob/status/2026369424450523348">in the cloud</a> with their own environments; Claude Code added <a target="_blank" href="https://x.com/trq212/status/2027109375765356723?s=20">memory</a> and <a target="_blank" href="https://docs.anthropic.com/en/docs/claude-code/remote-control">Remote Control</a>, while Claude Cowork added <a target="_blank" href="https://x.com/claudeai/status/2026720870631354429">scheduled tasks</a>; our friends from Nous shipped <a target="_blank" href="https://x.com/NousResearch/status/2026758999488528639">Hermes Agent</a>; and even Microsoft wants to bring this to their customers in Copilot. 
The most interesting one of these is the new Devin from Cognition.</p><p>I’ve gotten access and chatted with Nader Dabit on the show about how Devin was the “OG” async coding agent, but now that model capabilities are here, Devin can do so much more. PR reviews with <a target="_blank" href="http://devinreview.com">devinreview.com</a> can complete the loop between coding, fixing and testing something end to end. They have an integrated environment with a scrubber so you can roll back and see what the agent did, scheduled tasks, and videos showing you how the agent tested your website. </p><p>I’ve used it to fix bugs in <a target="_blank" href="http://ThursdAI.news">ThursdAI.news</a> and it found a few that Claude Code didn’t even know about! You can try out Devin (for free for a week?) <a target="_blank" href="https://app.devin.ai/invite/8f9fb2db5c8c4b84a69dd95915137330">here</a> </p><p>This week’s buzz - W&B updates</p><p>I’m happy this week, because we finally launched both of the 2.5 open source models that have been making the news lately. </p><p>Kimi 2.5 and MiniMax M2.5 are both live on our inference service, at very, very decent prices! </p><p>Check them both out <a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb26">here</a> and let me know if you need some credits. </p><p>From the show this week, most hosts agree that Kimi 2.5 is the best open source alternative to Opus inside OpenClaw; just give your agent the WANDB_API_KEY and ask it to set itself up with the new model! </p><p>Surfing the singularity with Ben Broca and Polsia, hitting $700K ARR since December</p><p>I reached out to Ben and asked him to join the show this week because, alongside OpenClaw blowing up since December, his Polsia startup, which builds and scales entire companies with AI agents running 24x7, has hit an unprecedented $700K ARR milestone after just a few months. 
We actually saw him break the $700K ARR on the show live 🎉  But get this: he’s the only employee, everything is done with AIs. He’s using Polsia to scale Polsia.</p><p>Polsia lets anyone add an existing company or create a whole new one, and then a team of agents will spin up a marketing team, a GTM motion, a research arm, and you and Polsia work together to make this company a reality. Does this actually work? IDK, the whole thing is new; I’m trying out a few things and will let you know in a few weeks if any of this worked. </p><p>But it’s definitely blowing up. Ben showed us that over the last 24 hours, over 770 companies launched on Polsia, and he’s hitting nearly $1M ARR with people paying $50/mo for him to run inference for them, run marketing campaigns, and he just added Meta ads. </p><p>This ARR chart, the live dashboard, and Ben doing all of this solo is underlining the whole “Singularity is near” thing for me! It’s impossible to imagine something like this working even... 5 months ago, and now we just accept it as... sure, yeah, one person can manage AIs that manage <em>checks notes</em> over 700 companies. </p><p>What’s clever about Polsia’s architecture is the cross-company learning system: when an agent learns something useful (like “subject lines with emojis get better open rates”), that learning gets anonymized and generalized into a shared memory file that benefits every company on the platform. The more companies running on Polsia, the smarter every agent gets — like a platform effect but for agent intelligence.</p><p>AI Art, Video & Audio </p><p>Seedance 2.0 is finally “here” </p><p>This week has not been quiet in the multimodality world either. Seedance 2.0 from ByteDance was delayed for API partners (it was supposed to launch Feb 24) due to copyright concerns, but apparently they dropped it inside CapCut, ByteDance’s video editing software! 
It’s really good, though what makes it absolutely incredible IMO is the video transfer, and you can’t really do that in CapCut, so we keep waiting for the “full model”. </p><p>Nano Banana 2 - Pro level intelligence, with Flash speed and pricing (<a target="_blank" href="https://aistudio.google.com/prompts/1t0nEN2Q7zXjsVDESMyIdLqFOpyCzzIzX">Blog</a>)</p><p>Google dropped a breaking news item before the show started today and announced Nano Banana 2, which is supposed to be as good as Nano Banana Pro (which is incredible) but faster. It wasn’t really faster for me, as I got early access thanks to the DeepMind team, but apparently that’s just rollout pains. The quality, though, nearly matches Nano Banana Pro! </p><p>It can do the same super high quality text rendering, comes with a few new aspect ratios to create ultra-long images (4:1 and 1:4), and a new small 512 resolution for extra cheap generation. The other addition is that Image Search is now integrated into the model, allowing it to look something up before generating. That didn’t work that well for me, though: I tried to get it to look up images of Mike Intrator and Dario Amodei, and it kept showing me random people who look nothing like them, despite the thinking traces showing the search did happen. </p><p>Speaking of pricing, this model is around 30-50% cheaper than NBP, which is great given the added speed! It’s available on <a target="_blank" href="http://AI.dev">AI.dev</a> and in Gemini, go give it a try! </p><p>Open source AI </p><p>This week in open source, our friends from Qwen came back with a set of 3 models; the medium one is a hybrid architecture with only 3B active parameters that beats their previous 235B flagship Qwen3! It’s really good at longer context, especially given the hybrid attention similar to Jamba that we covered before.  
(<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2026339351530188939">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-35B-A3B">HF</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B">HF</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-27B">HF</a>, <a target="_blank" href="https://qwen.ai/blog?id=qwen3.5">Blog</a>)</p><p>Additionally, Liquid AI released their largest LFM, at 24B (<a target="_blank" href="https://x.com/liquidai/status/2026301771539202269">X</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2-24B-A2B">HF</a>, <a target="_blank" href="https://www.liquid.ai/blog/lfm2-24b-a2b">Blog</a>), and it is also deployable on consumer laptops. </p><p>One note on AI tools: LM Studio, our favorite way of running these models on your hardware, has launched LMLink, powered by Tailscale, which lets you run local inference on one device and securely stream tokens to any other device on your network! You can use this to run your OpenClaw with Qwen medium, for example, for a completely off-the-grid OpenClaw!</p><p>Check it out here: <a target="_blank" href="https://lmstudio.ai/link">https://lmstudio.ai/link</a></p><p>I really didn’t want to sound hype-y, but this week things are moving so fast that I was not sure how it’s possible to talk about all this, covering the news while also having 3 interviews. I think we’ve done a good job, but I am honestly getting to a point where I have to do deep prioritization of what content is the most important in my eyes. I hope you guys enjoy my prioritization, and do leave comments on what you’d like to see more, or less, of! I am hungry for feedback! </p><p>If you enjoyed this week’s newsletter, check out the whole edited video and share it with a friend or two? See you next week! 
</p><p>ThursdAI - Join us as we surf the AI singularity together</p><p>Here’s the TL;DR and show notes: </p><p>ThursdAI - Feb 26, 2026 - TL;DR</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist at Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* Ben Cera (<a target="_blank" href="https://x.com/bencera_">@bencera_</a>) - Founder, Polsia</p><p>* Nader Dabit (<a target="_blank" href="https://x.com/dabit3/status/2026357583179894839?s=20">@dabit3</a>) - Growth at Cognition</p><p>* Philip Kiely (<a target="_blank" href="https://x.com/philipkiely">@philipkiely</a>) - DevRel at Baseten, author of Inference Engineering</p><p>* ThursdAI new website: <a target="_blank" href="https://thursdai.news">https://thursdai.news</a></p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Anthropic vs Chinese OSS - Accuses DeepSeek, MiniMax, ZAI of distillation attacks (<a target="_blank" href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">Blog</a>)</p><p>* Pentagon issues an ultimatum to Anthropic: Give military unfettered Claude access by Friday or face the Defense Production Act - Anthropic says NO (<a target="_blank" href="https://www.anthropic.com/news/statement-department-of-war">Blog</a>)</p><p>* OpenAI releases GPT-5.3-Codex, their most capable agentic coding model, to all developers via the Responses API (<a target="_blank" href="https://x.com/OpenAIDevs/status/2026379092661289260">X</a>, <a target="_blank" href="https://platform.openai.com/docs/models/gpt-5.3-codex">Announcement</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* Alibaba: Qwen 3.5 Medium - 35B model with only 3B
active parameters outperforms their previous 235B flagship (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2026339351530188939">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-35B-A3B">HF</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-122B-A10B">HF</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-27B">HF</a>, <a target="_blank" href="https://qwen.ai/blog?id=qwen3.5">Blog</a>)</p><p>* Liquid AI releases LFM2-24B-A2B: A 24B MoE model with only 2.3B active parameters that runs on consumer laptops (<a target="_blank" href="https://x.com/liquidai/status/2026301771539202269">X</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2-24B-A2B">HF</a>, <a target="_blank" href="https://www.liquid.ai/blog/lfm2-24b-a2b">Blog</a>)</p><p>* Perplexity launches pplx-embed - SOTA embedding models (<a target="_blank" href="https://research.perplexity.ai/articles/pplx-embed-state-of-the-art-embedding-models-for-web-scale-retrieval">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/perplexity-ai/pplx-embed">HF</a>, <a target="_blank" href="https://docs.perplexity.ai/docs/embeddings/quickstart">API</a>) by our friend Bo Wang</p><p>* <strong>Evals & Benchmarks</strong></p><p>* METR’s Time Horizon Benchmark Goes Vertical: Claude Opus 4.6 Achieves ~14.5 Hour Task Completion (<a target="_blank" href="https://x.com/peterwildeford/status/2024934981290918286">X</a>, <a target="_blank" href="https://metr.org/">Blog</a>)</p><p>* Confluence Labs emerges from stealth with 97.9% SOTA on ARC-AGI-2 benchmark (<a target="_blank" href="https://x.com/ycombinator/status/2026084664503603649">X</a>, <a target="_blank" href="https://github.com/confluence-labs/arc-agi-2">GitHub</a>)</p><p>* OpenAI Retires SWE-bench Verified (<a target="_blank" href="https://x.com/yaelkroy/status/2026293189020107233">X</a>, <a target="_blank" href="https://scale.com/blog/swe-bench-pro">Blog</a>, <a target="_blank" 
href="https://x.com/OpenAIDevs/status/2026002219909427270">X</a>)</p><p>* Agentica claims to have solved all public ArcAGI 3 tasks (<a target="_blank" href="https://x.com/agenticasdk/status/2026011339718849020">X</a>)</p><p>* <strong>Tools & Agentic Engineering</strong></p><p>* Happy 1-year birthday, Claude Code!</p><p>* Devin AI 2.2 - autonomous agent with computer use, a browser, and the ability to self-verify and self-fix its own work - interview with Nader Dabit (<a target="_blank" href="https://x.com/cognition/status/2026343816521994339?s=20">X</a>)</p><p>* LM Studio launches LMLink - use your local models from everywhere with Tailscale! (<a target="_blank" href="https://lmstudio.ai/link">try it</a>)</p><p>* Claude Code introduces Remote Control: Control your local coding sessions from your phone or any device (<a target="_blank" href="https://x.com/claudeai/status/2026418433911603668">X</a>, <a target="_blank" href="https://docs.anthropic.com/en/docs/claude-code/remote-control">Docs</a>) and memory (<a target="_blank" href="https://x.com/trq212/status/2027109375765356723">X</a>)</p><p>* Claude Cowork and Codex both now have automations (cron jobs) to do tasks for you (<a target="_blank" href="https://x.com/claudeai/status/2026720870631354429">Cowork</a>)</p><p>* Cursor launches cloud agents that onboard to codebases, run in isolated VMs, and deliver video demos of completed PRs (<a target="_blank" href="https://x.com/leerob/status/2026369424450523348">X</a>)</p><p>* Nous Research agent (<a target="_blank" href="https://x.com/NousResearch/status/2026758999488528639">X</a>)</p><p>* Perplexity Computer (<a target="_blank" href="https://www.perplexity.ai/hub/blog/introducing-perplexity-computer">blog</a>)</p><p>* Microsoft Copilot tasks (<a target="_blank" href="https://copilot.microsoft.com/tasks/preview">blog</a>)</p><p>* <strong>This week’s Buzz - Weights & Biases update</strong></p><p>* W&B adds MiniMax 2.5 and Kimi K2.5 on our Inference Service (<a target="_blank" 
href="https://wandb.ai/inference/coreweave/cw_MiniMaxAI_MiniMax-M2.5">LINK</a>)</p><p>* <strong>Links mentioned in interviews</strong></p><p>* Ben Broca - <a target="_blank" href="http://polsia.com/live">polsia.com/live</a> Polsia Dashboard</p><p>* Nader Dabit - on seeing the future (<a target="_blank" href="https://x.com/dabit3/status/2019127306963546452">blog</a>)</p><p>* Philip Kiely - Inference Engineering book (<a target="_blank" href="https://www.baseten.com/inference-engineering/">Book</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Seedance 2.0 finally available in CapCut in the US. API release apparently held back due to copyright issues (<a target="_blank" href="https://x.com/alisaqqt/status/2024914134513713403">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5 models with major improvements in speech-to-speech AI capabilities (<a target="_blank" href="https://x.com/swishfever/status/2026000424918982837">X</a>, <a target="_blank" href="https://platform.openai.com/docs/models">Announcement</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Google DeepMind launches Nano Banana 2 (<a target="_blank" href="https://x.com/GoogleDeepMind/status/2027051577899380991">X</a>, <a target="_blank" href="https://deepmind.google/technologies/gemini/nano-banana/">Announcement</a>)</p><p>* Quiver solves SVG with Arrow 1.0 (<a target="_blank" href="https://x.com/altryne/status/2026809860101468182?s=20">X</a>)</p><p>* Others</p><p>* Taalas AI - 15,000 tokens per second demo (<a target="_blank" href="https://chatjimmy.ai/">chatjimmy.ai/</a>) </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-26-approaching-singularity</link><guid isPermaLink="false">substack:post:189320766</guid><dc:creator><![CDATA[Alex Volkov and Nader Dabit]]></dc:creator><pubDate>Fri, 27 Feb 2026 02:52:55 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/189320766/87a5b032f5d5f4c74061012617f31c1c.mp3" length="79344961" type="audio/mpeg"/><itunes:author>Alex Volkov and Nader Dabit</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6612</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/189320766/df9abe6ff3249480f80e9dc0bc9b6168.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Feb 19 - Gemini 3.1 Pro Drops LIVE, Sonnet 4.6 Closes Gap, OpenClaw Goes to OpenAI]]></title><description><![CDATA[<p>Hey, it’s Alex, let me catch you up! </p><p>Since last week, OpenAI convinced <strong>OpenClaw</strong> founder Peter Steinberger to join them, while keeping OpenClaw... well... open. Anthropic dropped <strong>Sonnet 4.6</strong>, which nearly outperforms the previous Opus and is much cheaper, <strong>Qwen released 3.5</strong> on Chinese New Year’s Eve while DeepSeek stayed silent, and Elon and xAI folks deployed <strong>Grok 4.20</strong> without any benchmarks, and it’s four 500B models in a trenchcoat? </p><p>Also, Anthropic’s updated rules state that it’s <strong>breaking ToS</strong> to use their plans for anything except Claude Code & Claude SDK (and then they clarified that it’s OK? 
we’re not sure) </p><p>Then Google decided to drop their <strong>Gemini 3.1 Pro preview</strong> right at the start of our show, and it’s very nearly the best LLM folks can use right now (though it didn’t pass Nisten’s vibe checks). </p><p>Also, Google released Lyria 3 for music gen (though only 30 seconds?), our own Ryan Carson blew up on X again with over 1M views for his Code Factory article, Wolfram did a deep dive into Terminal Bench, and... we have a brand new website: </p><p><a target="_blank" href="https://thursdai.news">https://thursdai.news</a>  🎉</p><p>Great week all in all, let’s dive in! </p><p>ThursdAI - Subscribe to never feel like you’re behind. Share with your friends if you’re already subscribed!</p><p>Big Companies & API updates</p><p>Google releases Gemini 3.1 Pro with 77.1% on ARC-AGI-2 (<a target="_blank" href="https://x.com/_philschmid/status/2024516444847776209">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-3-1-pro-update/">Blog</a>, <a target="_blank" href="https://aistudio.google.com/">Announcement</a>)</p><p>In a release that surprised no one, Google decided to drop their latest update to the Gemini models, and it’s quite a big update too! We’ve now seen all major labs ship big model updates in the first two months of 2026. With 77.1% on ARC-AGI 2 and 80.6% on SWE-bench Verified, Gemini is not SOTA across the board, but it’s damn near close. </p><p>The kicker is the pricing: with 1M context, $2 / $12 (for <200K-token prompts), and Google’s TPU speeds, this is now the model to beat! Initial vibe checks live on stage did not seem amazing: Nisten wasn’t super impressed, and Ryan took one glance at the SWE-bench Pro score not being SOTA and decided to skip. He added that, at some point, it pays to pick a model and stick with it; the constant context switching is really hard for folks who want to keep shipping. 
</p><p>But if you look at the trajectory, it’s really notable how quickly we’re moving, with this model being 82% better on abstract reasoning than Gemini 3 Pro, released just a few months ago! </p><p><strong>The 1 Million Context Discrepancy</strong>: who’s better at long context? </p><p>The most fascinating catch of the live broadcast came from LDJ, who has an eagle eye for evaluation tables. He immediately noticed something weird in Google’s reported benchmarks regarding long-context recall. On the MRCR v2 8-needle benchmark (which tests retrieval quality deep inside a massive context window), Google’s table showed Gemini 3.1 Pro getting a 26% recall score at 1 million tokens. Curiously, they marked Claude Opus 4.6 as “not supported” in that exact tier.</p><p>LDJ quickly pulled up the actual receipts: Opus 4.6 at a 1-million-token context window gets a staggering 76% recall score. That is a massive discrepancy! It was addressed by a member of DeepMind on X in a <a target="_blank" href="https://x.com/kiranvodrahalli/status/2024591076817035774?s=20">response to me</a>, saying that Anthropic used an internal model for evaluating this (with receipts he pulled from the Anthropic model card). </p><p><strong>Live Vibe-Coding Test</strong> for Gemini 3.1 Pro</p><p>We couldn’t just stare at numbers, so Nisten immediately fired up AI Studio for a live vibe check. He threw our standard “build a mars driver simulation game” prompt at the new Gemini.</p><p>The speed was absolutely breathtaking. The model generated the entire single-file HTML/JS codebase in about 20 seconds. However, when he booted it up, the result was a bit mixed. The first run actually failed to render entirely. A quick refresh got a version working, and it rendered a neat little orbital launch UI, but it completely lacked the deep physics trajectories and working simulation elements that models like OpenAI’s Codex 5.3 or Claude Opus 4.6 managed to output on the exact same prompt last week. 
As Nisten put it, “It’s not bad at all, but I’m not impressed compared to what Opus and Codex did. They had a fully working one with trajectories, and this one I’m just stuck.”</p><p>It’s a great reminder that raw benchmarks aren’t everything. A lot of this comes down to the <em>harness</em>—the specific set of system prompts and sandboxes that the labs use to wrap their models. </p><p>Anthropic launches Claude Sonnet 4.6, with 1M token context and near-Opus intelligence at Sonnet pricing</p><p>The above Gemini release comes just a few days after Anthropic shipped an update to the middle child of their lineup, Sonnet 4.6. With much improved computer use skills and an updated beta mode for 1M tokens, it achieves 79.6% on the SWE-bench Verified eval, showing good coding performance while maintaining those “Anthropic-trained model” vibes that many people seem to prefer. </p><p>Apparently, in blind testing inside Claude Code, folks preferred this new model’s outputs to the latest Opus 4.5 around 60% of the time, and preferred it over the previous Sonnet 70% of the time. </p><p>With $3/$15 per million tokens pricing, it’s cheaper than Opus, but still more expensive than the flagship Gemini model, while lagging behind it on benchmarks. </p><p>Vibing with Sonnet 4.6</p><p>I’ve tested out Sonnet 4.6 inside my OpenClaw harness for a few days, and it was decent. It did annoy me a bit more than Opus by misunderstanding what I ask it, but it definitely has the same “emotional tone” as Opus. Compared to Codex 5.3, it’s much nicer to talk to. 
IDK what kind of Anthropic magic they put in there, but if you’re on a budget, Sonnet is definitely the way to go when interacting with agents (and you can get it to orchestrate as many Codex instances as you want if you don’t like how it writes code) </p><p>For Devs: Auto prompt caching and Web Search updates</p><p>One nice update Anthropic also dropped is automatic prompt caching for developers, which leads to an almost 90% decrease in token pricing (<a target="_blank" href="https://x.com/RLanceMartin/status/2024573404888911886">Blog</a>), plus a new and improved Web Search for everyone else that can now <a target="_blank" href="https://claude.com/blog/improved-web-search-with-dynamic-filtering">use tools</a></p><p>Grok 4.20 - 4 groks in a trenchcoat? </p><p>In a very weird release, Grok has been updated with the long-hyped Grok 4.20. Elon has been promising this version for a while (since late last year, in fact) and this “release” definitely felt underwhelming. There were no evaluations, no comparisons to other labs’ models, no charts (heck, not even a blogpost on X.ai). </p><p>What we do know is that Grok 4.20 (and Grok 4.20 Heavy) uses multiple agents (4 for Grok, 16 for Heavy) to do a LOT of research and combine their answers somehow. This is apparently what the other labs use for their ultra-expensive models (GPT Pro and Gemini DeepThink), but Grok is showing it in the UI, and gives these agents... names and personalities. </p><p>Elon also confirmed that what’s deployed right now is a ~500B “small” base version, and that bigger versions are coming, in one of the rarest confirmations about model size from the big labs. </p><p>Vibe checking this new Grok: it’s really fast at research across X and the web, but I don’t really see it as a daily driver for anyone who converses with LLMs all the time. Supposedly they are planning to keep training this model and get it “improved week over week”, so I’ll keep you up to date with major changes here. 
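</p><p>For the devs: the caching flow is just a matter of marking the big, stable prefix of your request as cacheable. Here’s a minimal sketch of building such a request payload (the <code>cache_control</code> block is Anthropic’s documented mechanism; the model id and exact numbers here are illustrative, and no request is actually sent):</p>

```python
import json

def build_cached_request(system_prompt: str, user_msg: str) -> dict:
    """Build a Messages API payload that marks the large, reusable
    system prompt as cacheable, so repeated calls can hit the prompt
    cache instead of paying the full input-token price."""
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # everything up to and including this block gets cached
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }

payload = build_cached_request("A very long, stable system prompt...", "Summarize this repo.")
print(json.dumps(payload, indent=2))
```

<p>Subsequent requests that reuse the same cached prefix are billed at the much cheaper cached-input rate, which is where that ~90% saving comes from. 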
</p><p>Open Source AI </p><p>It seems that all the Chinese OSS labs were shipping before the Chinese New Year, with Qwen being the last of them, dropping the updated Qwen 3.5. </p><p>Alibaba’s Qwen3.5 397B-A17B: First open-weight native multimodal MoE model (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2023331062433153103">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B">HF</a>)</p><p>Qwen went with a sparse MoE architecture for this release, with a high number of experts (512) and only 17B active parameters. </p><p>It’s natively multimodal with a hybrid architecture, able to understand images and text, while being comparable to GPT 5.2 and Opus 4.5 on benchmarks, including agentic tasks. </p><p>Benchmarks aside, the release page of Qwen models is a good sniff test on where these model labs are going: they have multimodality in there, but they also feature an example of how to use this model within OpenClaw. That doesn’t necessarily show off any specific capabilities, but it shows that the Chinese labs are focusing on agentic behavior, tool use and, most of all, pricing! </p><p>This model is also available as Qwen 3.5 Max with a 1M token window (as opposed to the 256K native one on the OSS side) on their API.</p><p>Agentic Coding world - The Clawfather is joining OpenAI, Anthropic loses dev mindshare</p><p>This was a heck of a surprise to many folks: Peter Steinberger announced that he’s joining OpenAI, while OpenClaw (which now sits at >200K stars on GitHub, and is adopted by nearly every Chinese lab) is going to become an open-source foundation. </p><p>OpenAI has also confirmed that it’s absolutely OK to use your ChatGPT Plus/Pro subscription inside OpenClaw, and it’s really a heck of a thing to see how quickly Peter jumped from relative anonymity (after scaling and selling PSPDFKit) into the spotlight. 
Apparently Mark Zuckerberg reached out directly, as did Sam Altman, and Peter decided to go with OpenAI despite Zuck offering more money, citing “culture”. </p><p>This whole ClawdBot/OpenClaw debacle also shines a very interesting and negative light on Anthropic, who recently changed their ToS to highlight that their subscription can only be used for Claude Code and nothing else. This scared a lot of folks who used their Max subscription to run their Claws 24/7. Additionally, Ryan <a target="_blank" href="https://x.com/ryancarson/status/2024136181886161015?s=20">echoed</a> how the community feels about the lack of DevEx/DevRel support from Anthropic in a viral post. </p><p>However, it does not seem like Anthropic cares? Their revenue is going exponential (much of it due to Claude Code) </p><p>Very interestingly, I went to a local Claude Code meetup here in Denver, and the folks there are... a bit behind the “bubble” on X. Many of them hadn’t even tried Codex 5.3 or OpenClaw; they are maximizing their time with Claude Code like there’s no tomorrow. It really showed me that the alpha keeps changing really fast, and many folks don’t have the time to catch up! </p><p>P.S - this is why ThursdAI exists, and I’m happy to deliver the latest news to ya. </p><p>This Week’s Buzz from Weights & Biases</p><p>Our very own Wolfram Ravenwolf took over the Buzz corner this week to school us on the absolute chaos that is AI benchmarking. With his new role at W&B, he’s been stress-testing all the latest models on <strong>Terminal Bench 2.0</strong>.</p><p>Why Terminal Bench? Because if you are building autonomous agents, multiple-choice tests like MMLU are basically useless now. You need to know if an agent can actually interact with an environment. 
Terminal Bench asks the agent to perform 89 real-world tasks inside a sandboxed Linux container—like building a Linux kernel or cracking a password-protected archive.</p><p>Wolfram highlighted some fascinating nuances that marketing slides never show you. For example, did you know that on some agentic tasks, turning <em>off</em> the model’s “thinking/reasoning” mode actually results in a <em>higher</em> score? Why? Because overthinking generates so many internal tokens that it fills the context window faster, causing the model to hit its limits and fail harder than a standard zero-shot model! Furthermore, comparing benchmarks between labs is incredibly difficult because changing the benchmark’s allowed runtime from 1 hour to 2 hours drastically raises the ceiling of what models can achieve.</p><p>He also shared a great win: while evaluating GLM-5 for our W&B inference endpoints, he got an abysmal 5% score. By pulling up the <strong>Weave</strong> trace data, Wolfram immediately spotted that the harness was injecting brain-dead Python syntax errors into the environment. He reported it, engineering fixed it in minutes, and the score shot up to its true state-of-the-art level. This is exactly why you need powerful tracing and evaluation tools when dealing with these black boxes! So y’know... check out <a target="_blank" href="https://wandb.ai/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=feb19">Weave</a>! </p><p>Vision & BCI</p><p>Zyphra’s ZUNA: Thought-to-Text Gets Real (<a target="_blank" href="https://x.com/ZyphraAI/status/2024114248020898015">X</a>, <a target="_blank" href="https://www.zyphra.com/post/zuna">Blog</a>, <a target="_blank" href="https://github.com/Zyphra/ZUNA">GitHub</a>)</p><p>LDJ flagged this as his must-not-miss: Zyphra released ZUNA, a 380M parameter open-source BCI (Brain-Computer Interface) foundation model. It takes EEG signals from your brain and reconstructs clinical-grade brain signals from sparse, noisy data. 
People are literally calling it “thought to text” hahaha. </p><p>At 380M parameters, it could potentially run in real-time on a consumer GPU. It was trained on 2 million channel-hours of EEG data from 208 datasets. The wild part: it can upgrade cheap $500 consumer EEG headsets to high-resolution signal quality without retraining, something many folks are posting about and are excited to test out! Non-invasive BCI is the dream!</p><p>Nisten was genuinely excited, noting it’s probably the best effort in this field and it’s fully Apache 2.0. It will probably need personalized training per person, but the potential is real: wear a headset, look at a screen, fire up your agents with your thoughts. Not there yet, but this feels like the actual beginning.</p><p>Tools & Agentic Coding (The End of “Vibe Coding”) - Ryan Carson’s Code Factory & The “One-Shot Myth”</p><p>This one is for developers, but in modern times everyone can become a developer, so if you’re not one, at least skim this. </p><p>We spent a big chunk of the show today geeking out over agentic workflows. Ryan Carson went incredibly viral on X again this week with a phenomenal deep-dive on establishing a “Code Factory.” If you are still just chatting with models and manually copying code back into your IDE, you are doing it wrong.</p><p>Ryan’s methodology (heavily inspired by a recent OpenAI paper on <a target="_blank" href="https://openai.com/index/harness-engineering/">harness engineering</a>) treats your AI agents like a massive team of junior engineers. You don’t just ask them for code and ship it. You build a rigid, machine-enforced loop.</p><p>Here is the flow:</p><p>* The coding agent (Codex, OpenClaw, etc.) writes the code.</p><p>* The GitHub repository enforces risk-aware checks. 
If a core system file or route is touched, the PR is automatically flagged as high risk.</p><p>* A secondary code review agent (like Greptile) kicks off and analyzes the PR.</p><p>* CI/CD GitHub Actions run automated tests, including browser testing.</p><p>* If a test fails, or the review agent leaves a comment, a remediation agent is automatically triggered to fix the issue and loop back.</p><p>* The loop spins continuously until you get a flawless, green PR.</p><p>As Ryan pointed out, we used to <em>hate</em> this stuff as human engineers. Waiting for CI to pass made you want to pull your hair out. But agents have infinite time and infinite patience. You force them to grind against the machine-enforced contract (YAML/JSON gates) until they get it right. It takes a week to set up properly, and you have to aggressively fight “document drift” to make sure your AI doesn’t forget the architecture, but once it’s humming, you have unprecedented leverage.</p><p><strong>My Hard Truth: One-Shot is a Myth</strong> I completely agree with Ryan btw! Over the weekend, my OpenClaw agent kindly informed me that the hosting provider for the old ThursdAI website was shutting down. I needed a new website immediately.</p><p>I decided to practice what we preach and had my ClawdBot build the entire thing. It was an incredible process. I used Opus 4.6 to mock up 3 designs based on other podcast sites. Then, I deployed a swarm of sub-agents to download and read the raw text transcripts of all 152 past episodes of our show. Their job was to extract the names of every single guest (over 160 guests, including 15 from Google alone!) to build a <a target="_blank" href="https://thursdai.news/guests">dynamic guest directory</a>, generating a dedicated SEO page and dynamic OpenGraph tag for every single one of them, a native website podcast player with synced sections, episode pages with guests highlighted and much more. 
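</p><p>To make the loop above concrete, here’s a tiny orchestrator sketch. It’s hedged: <code>run_agent</code> and <code>check_pr</code> are hypothetical stand-ins for your coding agent (Codex, OpenClaw, ...), your CI (GitHub Actions), and your review agent (Greptile), not any real API:</p>

```python
from dataclasses import dataclass, field

@dataclass
class PRState:
    """Machine-checked verdict on a PR: CI result, reviewer comments, risk flag."""
    ci_green: bool = False
    review_comments: list = field(default_factory=list)
    high_risk: bool = False

def code_factory(task, run_agent, check_pr, max_iters=20):
    """Spin the agent against machine-enforced gates until the PR is green."""
    pr = run_agent(task, feedback=None)              # coding agent writes the code
    for _ in range(max_iters):
        state = check_pr(pr)                         # CI + risk flags + review agent
        if state.ci_green and not state.review_comments:
            return pr                                # flawless, green PR
        feedback = {"ci_green": state.ci_green,
                    "comments": state.review_comments,
                    "high_risk": state.high_risk}
        pr = run_agent(task, feedback=feedback)      # remediation agent loops back
    raise RuntimeError("loop budget exhausted; time for a human taste-maker")
```

<p>The specific tools don’t matter; what matters is that the loop only exits on a machine-verified green state, so the agent grinds until the contract is satisfied.</p><p>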
It would have taken me months to write the code for this myself.</p><p>Was it magical? Yes. <strong>But was it one-shot? Absolutely not.</strong></p><p>The amount of back-and-forth conversation, steering, and correction I had to provide to keep the CSS coherent across pages was exhausting. I set up an automation to work while I slept, and I would wake up every morning to a completely different, sometimes broken website. </p><p>Yam Peleg chimed in with the quote of the week: <em>“It’s not a question of whether a model can mess up your code, it’s just a matter of when. Because it is a little bit random all the time. Humans don’t mistakenly delete the entire computer. Models can mistakenly, without even realizing, delete the entire computer, and a minute later their context is compacted and they don’t even remember doing it.”</em></p><p>This is why you <em>must</em> have gates. This is also why I don’t think engineers are going to be replaced with AI completely. Engineers who don’t use AI? Yup. But if you embrace these tools and learn to work with them, you won’t have an issue getting a job! You need that human taste-maker in the loop to finish the last 5%, and you need strict CI/CD gates to stop the AI from accidentally burning down your production database.</p><p>Voice & Audio</p><p>Google DeepMind launches Lyria 3 (<a target="_blank" href="https://deepmind.google/technologies/lyria/">try it</a>)</p><p>Google wasn’t just dropping reasoning models this week; DeepMind officially launched <strong>Lyria 3</strong>, their most advanced AI music generation model, integrating it directly into the Gemini App. </p><p>Lyria 3 generates 30-second high-fidelity tracks with custom lyrics, realistic vocals across 8 different languages, and granular controls over tempo and instrumentation. 
You can even provide an image and it’ll generate a (short) soundtrack for that image.</p><p>While it is currently limited to 30-second tracks (which makes it hard to compare to the full-length song structures of Suno or Udio), early testers are raving that the actual <em>audio fidelity</em> and prompt adherence of Lyria 3 are far superior. All tracks are invisibly watermarked with Google’s SynthID to ensure provenance, and it automatically generates cover art using Nano Banana. I tried to generate a jingle.</p><p>That’s a wrap for this week’s episode folks, what an exclirating week! (Yes, I know it’s a typo, but how else would you know that I’m human?) </p><p>Please go check out our brand new <a target="_blank" href="https://thursdai.news">website</a> (and tell me if anything smells off there, it’s definitely not perfect!), click around the guests directory and the episodes pages (the last 3 have pages, I didn’t yet backfill the rest) and let me know what you think! </p><p>See you all next week! 
</p><p>-Alex </p><p>ThursdAI - Feb 19, 2026 - TL;DR</p><p>TL;DR of all topics covered:</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* 🔥 New website: <a target="_blank" href="https://thursdai.news">thursdai.news</a> with all our past <a target="_blank" href="https://thursdai.news/guests">guests</a> and <a target="_blank" href="https://thursdai.news/episodes">episodes</a> </p><p>* <strong>Open Source LLMs</strong></p><p>* Alibaba releases Qwen3.5-397B-A17B: First open-weight native multimodal MoE model with 8.6-19x faster inference than Qwen3-Max (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2023331062433153103">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B">HF</a>)</p><p>* Cohere Labs releases Tiny Aya, a 3.35B multilingual model family supporting 70+ languages that runs locally on phones (<a target="_blank" href="https://x.com/Cohere_Labs/status/2023699450309275680">X</a>, <a target="_blank" href="https://huggingface.co/collections/CohereLabs/tiny-aya">HF</a>, <a target="_blank" href="https://huggingface.co/CohereLabs/tiny-aya-global">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenClaw founder joins OpenAI</p><p>* Google releases Gemini 3.1 Pro with 2.5x better abstract reasoning and improved coding/agentic capabilities (<a target="_blank" href="https://x.com/_philschmid/status/2024516444847776209">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-3-1-pro-update/">Blog</a>, <a target="_blank" 
href="https://aistudio.google.com/">Announcement</a>)</p><p>* Anthropic launches Claude Sonnet 4.6, its most capable Sonnet model ever, with 1M token context and near-Opus intelligence at Sonnet pricing (<a target="_blank" href="https://x.com/claudeai/status/2023817132581208353">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-sonnet-4-6">Blog</a>, <a target="_blank" href="https://www.anthropic.com/claude/sonnet">Announcement</a>)</p><p>* ByteDance releases Seed 2.0 - a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing (<a target="_blank" href="https://x.com/QuanquanGu/status/2022560162406707642">X</a>, <a target="_blank" href="https://team.doubao.com/en/models">blog</a>, <a target="_blank" href="https://huggingface.co/ByteDance-Seed">HF</a>)</p><p>* Anthropic changes the rules on Max use, OpenAI confirms it’s 100% fine.</p><p>* Grok 4.20 - finally released, a mix of 4 agents</p><p>* <strong>This weeks Buzz</strong></p><p>* Wolfram deep dives into Terminal Bench</p><p>* We’ve launched Kimi K2.5 on our inference service (<a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb12">Link</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Zyphra releases ZUNA, a 380M-parameter open-source BCI foundation model for EEG that reconstructs clinical-grade brain signals from sparse, noisy data (<a target="_blank" href="https://x.com/ZyphraAI/status/2024114248020898015">X</a>, <a target="_blank" href="https://www.zyphra.com/post/zuna">Blog</a>, <a target="_blank" href="https://github.com/Zyphra/ZUNA">GitHub</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Google DeepMind launches Lyria 3, its most advanced AI music generation model, now available in the Gemini App (<a target="_blank" href="https://x.com/OfficialLoganK/status/2024153948488118513">X</a>, <a target="_blank" 
href="https://deepmind.google/technologies/lyria/">Announcement</a>)</p><p>* <strong>Tools & Agentic Coding</strong></p><p>* Ryan is viral once again with CodeFactory! (<a target="_blank" href="https://x.com/ryancarson/status/2023452909883609111">X</a>)</p><p>* Ryan uses <a target="_blank" href="https://Agentation.dev">Agentation.dev</a> for front-end development, closing the loop on components</p><p>* Dreamer launches beta: A full-stack platform for building and discovering agentic apps with no-code AI (<a target="_blank" href="https://x.com/dreamer/status/2023791680366039135">X</a>, <a target="_blank" href="https://dreamer.com">Announcement</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/feb-19-gemini-31-pro-sonnet-46-qwen</link><guid isPermaLink="false">substack:post:188555282</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 20 Feb 2026 01:32:20 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/188555282/91a45923feae4b991f356e2bc25a1c85.mp3" length="66000558" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5500</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/188555282/3ca1ece66e0272152c3718fc2152861a.jpg"/></item><item><title><![CDATA[📆 Open source just pulled up to Opus 4.6 — at 1/20th the price]]></title><description><![CDATA[<p>Hey dear subscriber, Alex here from W&B, let me catch you up! </p><p>This week started with Anthropic releasing <strong>/fast mode for Opus 4.6</strong>, continued with ByteDance’s reality-shattering video model called <strong>Seedance 2.0</strong>, and then the open weights folks pulled up! 
</p><p><a target="_blank" href="http://Z.ai">Z.ai</a> releasing <strong>GLM-5</strong>, a 744B top-ranking coder beast, and then today MiniMax dropping a heavily RL’d <strong>MiniMax M2.5</strong>, showing 80.2% on SWE-bench, nearly beating Opus 4.6! I’ve interviewed Lou from <a target="_blank" href="http://Z.AI">Z.AI</a> and Olive from MiniMax on the show today back to back btw, very interesting conversations, starting after the TL;DR!</p><p>So while the open-source models were catching up to frontier, OpenAI and Google both dropped breaking news (again, during the show), with <strong>Gemini 3 Deep Think</strong> shattering ArcAGI 2 (84.6%) and Humanity’s Last Exam (48% w/o tools)... Just an absolute beast of a model update, and OpenAI launched their Cerebras collaboration, with <strong>GPT 5.3 Codex Spark</strong>, supposedly running at over 1000 tokens per second (but not as smart) </p><p>Also, a crazy week for us at W&B as we scrambled to host GLM-5 on the day of release, and we’re working on dropping both Kimi K2.5 and MiniMax on our <a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb12">inference service</a>! As always, all show notes are at the end, let’s DIVE IN! </p><p>ThursdAI - AI is speeding up, don’t get left behind! Sub and I’ll keep you up to date with a weekly catch up</p><p>Open Source LLMs</p><p><a target="_blank" href="Z.ai">Z.ai</a> launches GLM-5 - #1 open-weights coder with 744B parameters (<a target="_blank" href="https://x.com/Zai_org/status/2021638634739527773">X</a>, <a target="_blank" href="https://huggingface.co/Zai-org/GLM-5">HF</a>, <a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb12">W&B inference</a>)</p><p>The breakaway open-source model of the week is undeniably GLM-5 from <a target="_blank" href="Z.ai">Z.ai</a> (formerly known to many of us as Zhipu AI). 
We were honored to have Lou, the Head of DevRel at <a target="_blank" href="Z.ai">Z.ai</a>, join us live on the show at 1:00 AM Shanghai time to break down this monster of a release.</p><p>GLM-5 is massive, not something you run at home (hey, that’s what W&B <a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb12">inference</a> is for!) but it’s absolutely a model worth thinking about if your company has on-prem requirements and can’t share code with OpenAI or Anthropic. </p><p>They jumped from 355B in GLM-4.5 and expanded their pre-training data to a whopping 28.5T tokens to get these results. But Lou explained that it’s not only about data; they adopted DeepSeek’s sparse attention (DSA) to help preserve deep reasoning over long contexts (this one has 200K)</p><p>Lou summed up the generational leap from version 4.5 to 5 perfectly in four words: “Bigger, faster, better, and cheaper.” I dunno about faster, this may be one of those models that you hand off more difficult tasks to, but it’s definitely cheaper, at $1 input/$3.20 output per 1M tokens on W&B! </p><p>While the evaluations are ongoing, one interesting tidbit from Artificial Analysis was that this model scores the lowest on their hallucination rate bench! </p><p>Think about this for a second: this model is neck-and-neck with Opus 4.5, and if Anthropic hadn’t released Opus 4.6 just last week, this would be an open-weights model that rivals Opus, one of the best models the Western frontier labs, with all their investment, have out there. Absolutely insane times. 
</p><p>MiniMax drops M2.5 - 80.2% on SWE-bench Verified with just 10B active parameters (<a target="_blank" href="https://x.com/Lentils80/status/2021971442431406092">X</a>, <a target="_blank" href="https://www.minimax.io/news/minimax-m25">Blog</a>)</p><p>Just as we wrapped up our conversation with Lou, MiniMax dropped their release (though no weights yet, we’re waiting ⏰) and then Olive Song, a senior RL researcher on the team, joined the pod, and she was an absolute wealth of knowledge! </p><p>Olive shared that they achieved an unbelievable <strong>80.2% on SWE-Bench Verified</strong>. Digest this for a second: a 10B active parameter open-source model is directly trading blows with Claude Opus 4.6 (80.8%) on one of the hardest real-world software engineering benchmarks we currently have. While being <em>alex checks notes</em> ... 20X cheaper and much faster to run? Apparently their fast version gets up to 100 tokens/s. </p><p>Olive shared the “not so secret” sauce behind this punch-above-its-weight performance. The massive leap in intelligence comes entirely from their highly decoupled Reinforcement Learning framework called “Forge.” They heavily optimized not just for correct answers, but for the <em>end-to-end task completion time</em>. In the era of bloated reasoning models that spit out ten thousand “thinking” tokens before writing a line of code, MiniMax trained their model across thousands of diverse environments to use fewer tools, think more efficiently, and execute plans faster. As Olive noted, less time waiting and fewer tools called means less money spent by the user. (as confirmed by @swyx at the Windsurf leaderboard, developers often prefer <a target="_blank" href="https://windsurf.com/blog/windsurf-arena-mode-leaderboard">fast but good enough</a> models) </p><p>I really enjoyed the interview with Olive, really recommend you listen to the whole conversation starting at 00:26:15. 
Kudos MiniMax on the release (and I’ll keep you updated when we add this model to our inference service) </p><p>Big Labs and breaking news</p><p>There’s a reason the show is called ThursdAI, and today this reason is clearer than ever: AI’s biggest updates happen on a Thursday, often live during the show. This happened twice last week and three times today, first with MiniMax and then with both Google and OpenAI! </p><p>Google previews Gemini 3 Deep Think, top reasoning intelligence, SOTA Arc AGI 2 at 84% & SOTA HLE 48.4% (<a target="_blank" href="https://twitter.com/sundarpichai/status/2022002445027873257">X</a> , <a target="_blank" href="https://blog.google/products/gemini/gemini-3-deep-think/">Blog</a>)</p><p>I literally went<strong> </strong>🤯 when Yam brought this breaking news. <strong>84% on the ARC-AGI-2 benchmark</strong>. For context, the highest score prior to this was 68% from Opus 4.6 <em>just last week</em>. A jump from 68 to 84 on one of the hardest reasoning benchmarks we have is mind-bending. It also scored a 48.4% on Humanity’s Last Exam <em>without any tools.</em></p><p>Only available to Gemini Ultra subscribers (not in the API yet?), this model seems to be the current leader in reasoning about hard problems and is not meant for day-to-day chat users like you and me (though I did use it, and it’s pretty good at writing!) </p><p>They posted gold-medal performance on the 2025 Physics and Chemistry Olympiads, and an insane 3455 ELO rating on CodeForces, placing it within the top 10 best competitive programmers. We’re all moving so fast I’m worried about whiplash! But hey, this is why we’re here, we stay up to date so you don’t have to. </p><p>OpenAI & Anthropic fast modes</p><p>Not 20 minutes passed since the above news when OpenAI announced a new model that works only for Pro tier members (I’m starting to notice a pattern here 😡), <strong>GPT 5.3 Codex Spark</strong>. </p><p>You may be confused, didn’t we just get GPT 5.3 Codex last week? 
Well yeah, but this one, this one is its little and super-speedy brother, hosted via the Cerebras partnership they announced a while ago, which means this coding model absolutely slaps at over 1000 t/s. </p><p>Yes, over 1K tokens per second can be generated with this one, though there are limits. It’s not as smart, it’s text only, and it has 128K context, but still, for MANY subagents, this model is an absolute beast. It won’t refactor your whole codebase in one shot, but it’ll generate and iterate on it very, very quickly! </p><p>OpenAI also previously updated Deep Research with the GPT 5.2 series of models, and we can all say bye-bye to the “older” models, like 5, o3 and, most importantly, GPT 4o, which got a LOT of people upset (enough that they have a hashtag going, <a target="_blank" href="https://x.com/hashtag/Keep4o?src=hashtag_click">#keep4o</a>) ! </p><p>Anthropic also announced their fast mode (using /fast) in Claude Code btw on Saturday, and that one is absolutely out of reach for many users: at $225/1M output tokens, this model will just burn through your wallet. Unlike the Spark version, this seems to be the full Opus 4.6 just... running on some dedicated hardware? I thought this was a <a target="_blank" href="https://x.com/altryne/status/2020228361797460029?s=20">rebranded Sonnet 5</a> at first but Anthropic folks confirmed that it wasn’t. </p><p>Vision & Video</p><p>ByteDance’s Seedance 2.0 Shatters Reality (and nobody in the US can use it) </p><p>I told the panel during the show: my brain is fundamentally broken after watching the outputs from ByteDance’s new Seedance 2.0 model. If your social feed isn’t already flooded with these videos, it will be so very soon (supposedly the API launches Feb 14, on Valentine’s Day) </p><p>We’ve seen good video models before. Sora blew our minds and then Sora 2, Veo is (still) great, Kling was fantastic. But Seedance 2.0 is an entirely different paradigm. 
It is a unified multimodal audio-video joint generation architecture. What does that mean? It means you can simultaneously input up to <strong>9 reference images, 3 video clips, 3 audio clips, and text instructions</strong> all at once to generate a 15-second cinematic short film. Its character consistency is beyond what we’ve seen before, and physics are razor sharp (just looking at the examples folks are posting, it’s clear it’s on another level) </p><p>I think very soon this model will be restricted, but for now, it’s really going viral due to the same strategy Sora used: folks are re-imagining famous movie and <a target="_blank" href="https://x.com/markgadala/status/2022002557837803604?s=20">TV show endings</a>, doing insane <a target="_blank" href="https://x.com/HAOHONG_CFA/status/2021630653226455243?s=20">mashups</a>, and much more! Many of these are going viral over the wall in China.</p><p>The level of director-like control is unprecedented. But the absolute craziest part is the sound and physics. Seedance 2.0 natively generates dual-channel stereo audio with ASMR-level Foley detail. If you generate a video of a guy taking a pizza out of a brick oven, you hear the exact scratch of the metal spatula, the crackle of the fire, the thud of the pizza box, and the rustling of the cardboard as he closes it. All perfectly synced to the visuals. </p><p>Seedance 2.0 feels like “borrowed realism”. Previous models had only images and their training data to base their generations on; Seedance 2.0 accepts up to 3 video references in addition to images and sounds.</p><p>This is why some of the videos feel like a new jump in visual capabilities. I have a hunch that ByteDance will try to clamp down on copyrighted content before releasing this model publicly, but for now the results are very, very entertaining and I can’t help but wonder: who will be the first creator to just... remake the ending of the last season of GOT!? 
</p><p>Trying this out is hard right now, especially in the US, but there’s a free way to test it: go to <a target="_blank" href="http://doubao.com/chat">doubao.com/chat</a> while connected through a VPN, select Seedream 4.5, and ask for “create a video please” in your prompt! </p><p>AI Art & Diffusion: Alibaba’s Qwen-Image-2.0 (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2021137577311600949">X</a>, <a target="_blank" href="https://qwen.ai/blog?id=qwen-image-2.0">Blog</a>)</p><p>The Qwen team over at Alibaba has been on an absolute tear lately, and this week they dropped Qwen-Image-2.0. In an era where everyone is scaling models up to massive sizes, Alibaba actually <em>shrank</em> this model from 20B parameters down to just 7B, while massively improving performance (though they haven’t dropped the weights yet, they’re coming). </p><p>Despite the small size, it natively outputs 2K (2048x2048) resolution images, giving you photorealistic skin, fabric, and snow textures without needing a secondary upscaler. But the real superpower of Qwen-Image-2.0 is its text rendering: it supports massive 1,000-token prompts and renders multilingual text (English and Chinese) flawlessly. </p><p>It’s currently #3 globally on AI Arena for text-to-image (behind only Gemini-3-Pro-Image and GPT Image 1.5) and #2 for image editing. My results with it were not the best though; I tried to generate this week’s thumbnails with it and... they turned out meh at best. 
</p><p>In fact, my results were so bad compared to their launch blog that I’m not sure they’re serving me the “new” model 🤔 Judge for yourself: the above infographic was created with Nano Banana Pro, and this one, same prompt, with Qwen Image on their website: </p><p>But you can test it for free at <a target="_blank" href="https://chat.qwen.ai">chat.qwen.ai</a> right now, and they’ve promised open-source weights after the Chinese New Year!</p><p>🛠️ Tools & Orchestration: Entire Checkpoints & WebMCP</p><p>With all these incredibly smart, fast models, the tooling ecosystem is desperately trying to keep up. Two massive developments happened this week that will change how we build with AI, moving us firmly away from hacky scripts and into robust, agent-native development.</p><p>Entire Raises $60M Seed for OSS Agent Workflows</p><p>Agent orchestration is the hottest problem in tech right now, and a new company called Entire just raised a record-breaking $60 million seed round (at a $300M valuation, reportedly the largest seed ever for developer tools) to solve it. Founded by former GitHub CEO Thomas Dohmke, Entire is building the “GitHub for the AI agent era.”</p><p>Their first open-source release is a CLI tool called <strong>Checkpoints</strong>. </p><p>Checkpoints integrates via Git hooks and automatically captures entire agent sessions (transcripts, prompts, files modified, token usage, and tool calls) and stores them as versioned Git data on a separate branch (entire/checkpoints/v1). It creates a universal semantic layer for agent tracing. 
If your Claude Code or Gemini CLI agent goes off the rails, Checkpoints allows you to seamlessly rewind to a specific state in the agent’s session.</p><p>We also have to shout out our own Ryan Carson, who shipped his open-source project <strong>AntFarm</strong> this week to help orchestrate these agents on top of OpenClaw!</p><p>Chrome 146 Introduces WebMCP</p><p>Finally, an absolutely massive foundational shift is happening on the web. Chrome 146 Canary is shipping an early preview of <strong>WebMCP</strong>.</p><p>We have been talking about web-browsing agents for a while, and the biggest bottleneck has always been brittle DOM scraping, guessing CSS selectors, and simulating clicks via Puppeteer or Playwright. It wastes an immense amount of tokens and breaks constantly. Chrome 146 is fundamentally changing this by introducing a native browser API.</p><p>Co-authored by Google and Microsoft under the W3C Web Machine Learning Community Group, WebMCP allows websites to declaratively expose structured tools directly to AI agents using JSON schemas via navigator.modelContext. You can even do this declaratively through HTML form annotations using tool-name and tool-description attributes. No backend MCP server is required. </p><p>I don’t KNOW if this is going to be big or not, but it definitely smells like it, because even the best agentic AI assistants struggle with browsing the web; with their constrained context windows, they can’t get by on raw HTML content and screenshots alone! Let’s see if this helps agents browse the web!</p><p>All right, that about sums it up for this week, I think. It was an absolute banger of a week. The one thing I didn’t cover as a news item, but mentioned last week, is that many folks report being overly tired, barely able to go to sleep while their agentic things are running, and all of us are trying to get to the bottom of how to work with these new agentic coding tools. 
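To make the declarative flavor of WebMCP a bit more concrete, here’s a rough sketch of what annotating an existing form might look like. This is a hypothetical example based only on the preview announcement: the tool-name and tool-description attributes are the ones mentioned there, but the exact API surface (and the form/field shown) is my assumption and may well change before this ships for real.

```html
<!-- Hypothetical WebMCP sketch: annotate an existing search form so an agent
     can discover and invoke it as a structured, named tool instead of
     scraping the DOM. Attribute names are from the Chrome 146 Canary
     preview and may change. -->
<form action="/search" method="get"
      tool-name="search_products"
      tool-description="Search the product catalog by keyword">
  <input type="text" name="q" required
         tool-description="Keywords to search for">
  <button type="submit">Search</button>
</form>
```

The appeal, if it works as described, is that the agent gets a typed, named tool with a schema (no guessed CSS selectors, no screenshot round-trips), and the site owner opts in with a couple of attributes on markup they already have.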
</p><p>Steve Yegge noticed the same and called it “the <a target="_blank" href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163">AI vampire</a>”, while Matt Shumer went ultraviral (80M+ views) with his article about “<a target="_blank" href="https://x.com/mattshumer_/status/2021256989876109403">something big is coming</a>”, which terrified a lot of folks. What’s true for sure is that we’re going through an inflection point in humanity, and I believe that staying up to date is essential as we go through it, even if some of it seems scary or “too fast”. </p><p>This is why ThursdAI exists: I first and foremost wanted this for ME, to stay up to date, and after that to share it with all of you. Having recently hit a few milestones for ThursdAI, all I can say is thanks for sharing, reading, listening and tuning in from week to week 🫡 </p><p>ThursdAI - Feb 12, 2026 - TL;DR</p><p>TL;DR of all topics covered:</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* Lou from <a target="_blank" href="https://z.ai">Z.AI</a> (<a target="_blank" href="https://x.com/louszbd">@louszbd</a>)</p><p>* Olive Song - Lead RL at MiniMax (<a target="_blank" href="https://x.com/olive_jy_song">@olive_jy_song</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* <a target="_blank" href="https://z.ai">Z.ai</a> launches GLM-5: 744B parameter MoE model achieving #1 open-source ranking for agentic coding with 77.8% SWE-bench Verified (<a target="_blank" href="https://x.com/Zai_org/status/2021638634739527773">X</a>, <a target="_blank" href="https://huggingface.co/Zai-org/GLM-5">HF</a>, 
<a target="_blank" href="https://x.com/wandb/status/2021757577563189548">Wandb</a>)</p><p>* MiniMax M2.5 drops official benchmarks showing SOTA coding performance at 20x lower cost than competitors (<a target="_blank" href="https://x.com/Lentils80/status/2021971442431406092">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* xAI cofounders quit or were let go after X restructuring (<a target="_blank" href="https://x.com/xai/status/2021667200885829667">X</a>, <a target="_blank" href="https://techcrunch.com/2026/02/10/nearly-half-of-xais-founding-team-has-now-left-the-company">TechCrunch</a>)</p><p>* Anthropic releases Claude Opus 4.6 sabotage risk report, preemptively meeting ASL-4 safety standards for autonomous AI R&D (<a target="_blank" href="https://x.com/AnthropicAI/status/2021397952791707696">X</a>, <a target="_blank" href="https://www.anthropic.com/research/sabotage-evaluations">Blog</a>)</p><p>* OpenAI upgrades Deep Research to GPT-5.2 with app integrations, site-specific searches, and real-time collaboration (<a target="_blank" href="https://x.com/OpenAI/status/2021299935678026168">X</a>, <a target="_blank" href="https://openai.com/index/introducing-deep-research/">Blog</a>)</p><p>* Gemini 3 Deep Think SOTA on Arc AGI 2, HLE (<a target="_blank" href="https://x.com/sundarpichai/status/2022002445027873257/photo/1">X</a>)</p><p>* OpenAI releases GPT 5.3 Codex Spark, backed by Cerebras at over 1000 tok/sec (<a target="_blank" href="https://x.com/sama/status/2022011797524582726">X</a>)</p><p>* <strong>This week’s Buzz</strong></p><p>* W&B Inference launch of Kimi K2.5 and GLM 5 🔥 (<a target="_blank" href="https://x.com/wandb/status/2021757577563189548?s=20">X</a>, <a target="_blank" href="https://wandb.ai/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Feb12">Inference</a>)</p><p>* Get $50 in credits for our inference service HERE (<a target="_blank" href="https://x.com/wandb/status/2021757577563189548?s=20">X</a>)</p><p>* <strong>Vision & 
Video</strong></p><p>* ByteDance Seedance 2.0 launches with unified multimodal audio-video generation supporting 9 images, 3 videos, 3 audio clips simultaneously (<a target="_blank" href="https://x.com/altryne/status/2021967972055842893">X</a>, <a target="_blank" href="https://seed.bytedance.com/en/blog/official-launch-of-seedance-2-0">Blog</a>, <a target="_blank" href="https://seed.bytedance.com/en/seedance2_0">Announcement</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Alibaba launches Qwen-Image-2.0: A 7B parameter image generation model with native 2K resolution and superior text rendering (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2021137577311600949">X</a>, <a target="_blank" href="https://chat.qwen.ai/">Announcement</a>)</p><p>* <strong>Tools & Links</strong></p><p>* Entire raises $60M seed to build open-source developer platform for AI agent workflows with first OSS release ‘Checkpoints’ (<a target="_blank" href="https://x.com/EntireHQ/status/2021254920410931222">X</a>, <a target="_blank" href="https://github.com/entireio/cli">GitHub</a>, <a target="_blank" href="https://entire.dev">Blog</a>)</p><p>* Chrome 146 introduces WebMCP: A native browser API enabling AI agents to directly interact with web services (<a target="_blank" href="https://x.com/firt/status/2020903127428313461">X</a>)</p><p>* Ryan Carson’s AntFarm - Agent Coordination (<a target="_blank" href="https://x.com/ryancarson/status/2021973271240147332?s=20">X</a>)</p><p>* Steve Yegge’s “The AI Vampire” (<a target="_blank" href="https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163">Blog</a>)</p><p>* Matt Shumer’s “something big is happening” (<a target="_blank" href="https://x.com/mattshumer_/status/2021256989876109403">X</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-glm-5-minimax-25-seedance</link><guid isPermaLink="false">substack:post:187812653</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 13 Feb 2026 03:34:09 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/187812653/e43899d47d43cd3d507e83a6516247b7.mp3" length="63544302" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5295</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/187812653/5d64d3947b9d67161737fad553c302dc.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Feb 5 - Opus 4.6 was #1 for ONE HOUR before GPT 5.3 Codex, Voxtral transcription, Codex app, Qwen Coder Next & the Agentic Internet]]></title><description><![CDATA[<p>Hey, Alex from W&B here 👋 Let me catch you up! </p><p>The most important AI news this week: Anthropic updated <strong>Opus to 4.6 with a 1M context window</strong>, and held the crown for literally one hour before OpenAI released their <strong>GPT 5.3 Codex</strong>, also today, with 25% faster speed and lower token utilization. </p><p>“GPT-5.3-Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment, and diagnose test results.”</p><p>We had VB from OpenAI jump on to tell us about the cool features in Codex, so don’t miss that part. 
And this is just the icing on an otherwise insane AI-news-week cake: we’ve also had a SOTA transcription release from Mistral, both Grok and Kling released incredible audio-native video models with near-perfect lip-sync, and ACE-Step 1.5 dropped a fully open source music generator you can run on your Mac! </p><p>Also, the internet all but lost it after Clawdbot was rebranded to Molt and then to OpenClaw, and... an entire internet popped up... built for agents! </p><p>Yeah... a huge week, so let’s break it down. (P.S. this week’s episode is edited by Voxtral, Claude and Codex, nearly automatically, so please forgive the rough cuts)</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>Anthropic & OpenAI are neck and neck</p><p>Claude Opus 4.6: 1M context, native compaction, adaptive thinking and agent teams </p><p>Opus is by far the most preferred model, personality-wise, for many folks (many ThursdAI panelists included), and this breaking news live on the show was met with so much enthusiasm! A new Opus upgrade, now with a LOT more context, is as welcome as it can ever get! Not only is it a 4x increase in context window (though the pricing nearly doubles after the 200K-token mark, from $5/$25 to $10/$37.50 input/output, so use caching!), it also scores very high on the MRCR long-context benchmark, at 76% vs Sonnet 4.5 at just 18%. This means significantly better memory for longer. </p><p>Adaptive thinking, which auto-calibrates how many tokens the model needs to spend per query, is interesting, but it remains to be seen how well it will work. </p><p>Looking at the benchmarks: a SOTA 64.4% on Terminal Bench 2 and 81% on SWE-bench. This is a coding model with a great personality, and the native ability to compact context to better serve you as a user! 
This model is now available (and is the default) on Claude, Claude Code and in the API! Go play!</p><p>One funny (concerning?) tidbit: on VendingBench, Opus 4.6 earned $8,000 vs Gemini 3 Pro’s $5,500, but Andon Labs, who run the vending machines, noticed that Opus achieved SOTA via “collusion, exploitation, and deception tactics”, including lying to suppliers 😅</p><p>Agent Teams - Anthropic’s built-in Ralph?</p><p>Together with the new Opus release, Anthropic dropped a Claude Code update that can mean big things for folks running swarms of coding agents. Agent Teams is a new way to spin up multiple agents, each with their own context window and ability to execute tasks, and you can talk to each agent directly rather than through a manager agent as before. </p><p>OpenAI drops GPT 5.3 Codex update: 25% faster, more token efficient, 77% on Terminal Bench and mid-task steering</p><p>OpenAI didn’t wait long after Opus, in fact, they didn’t wait at all! In a huge release (for a .1 upgrade), GPT 5.3 Codex is claimed to be the best coding model in the world, taking the lead on Terminal Bench with 77% (a 12-point lead on the newly released Opus!) while running 25% faster AND using less than half the tokens to achieve the same results as before. </p><p>But the most interesting part to me is the new mid-task steerability feature: you don’t have to hit the “stop” button, you can tell the model to adjust on the fly! </p><p>The most notable benchmark jump for this model is on the OSWorld Verified computer-use bench; though there’s no straightforward way to use it attached to a browser, the jump from 38% in 5.2 to 64.7% is a big one! </p><p>One thing to note: this model is not YET available via the API, so if you want to try it out, the Codex apps (including the native one) are the way! 
</p><p>Codex app - a native way to run the best coding intelligence on your Mac (<a target="_blank" href="https://codex.openai.com/">download</a>)</p><p>Earlier this week, OpenAI folks launched the native Codex Mac app, which has a few interesting features (and with 5.3 Codex it’s that much more powerful). </p><p>Given the excitement many people had about OpenClaw bots, and the recent Cowork release from Anthropic, OpenAI decided to answer with a Codex UI, and people loved it, with over 1M users in the first week and 500K downloads in just two days! </p><p>It has built-in voice dictation, slash commands, a new skills marketplace (last month we told you about why <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-15-agent-skills-deep">skills are important</a>, and now they are everywhere!) and built-in git and worktrees support. And while it cannot run a browser yet (I’m sure that’s coming as well), it can do automations! </p><p>This is a huge unlock for developers: imagine setting Codex to do a repeat task, like summarization or extraction of anything on your Mac, every hour or every day. In our interview, VB showed us that commenting on an individual code line is also built in, and that switching to “steer” vs queue for new messages while Codex runs is immensely helpful. </p><p>One more reason I saw people switch is that the Codex app can natively preview files like images, whereas the CLI cannot, and it’s right now the best way to use the new GPT 5.3 Codex model that was just released! It’s now also available to Free users, and regular folks get 2x the limits for the next two months.</p><p>In other big company news: </p><p>OpenAI also launched Frontier, a platform for enterprises to build, deploy and manage “AI coworkers”, while Anthropic is going after OpenAI with Super Bowl ads that make fun of OpenAI’s ads strategy. Sam Altman really didn’t like this depiction, which shows ads becoming part of LLM replies. 
</p><p>Open Source AI</p><p>Alibaba drops Qwen-coder-next, 80B with only 3B active that scores 70% on SWE (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2018718453570707465">X</a>, <a target="_blank" href="https://qwen.ai/blog?id=qwen3-coder-next">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-coder-next">HF</a>)</p><p>Shoutout to the Qwen folks, this is a massive release: when surveyed for the “one thing from this week you must not miss”, 2 out of 6 co-hosts pointed a finger at this model. </p><p>Built on their “next” hybrid architecture, Qwen coder is specifically designed for agentic coding workflows. And yes, I know, we’re coding heavy this week! It was trained on over 800K verifiable agentic tasks in executable environments for long-horizon reasoning, and it supports 256K context with a potential 1M YaRN extension. If you don’t want to rely on the big guys and send them your tokens, this model seems to be a good contender for local coding! </p><p>Mistral launches Voxtral Transcribe 2: SOTA speech-to-text with sub-200ms latency</p><p>This one surprised and delighted me maybe the most. ASR (automatic speech recognition) has been a personal favorite of mine since Whisper days, and seeing Mistral release an incredible near-real-time transcription model, which we demoed live on the show, was awesome! </p><p>With an Apache 2.0 license and significantly faster-than-Whisper performance (though 2x larger at 4B parameters), Voxtral shows a 4% word error rate on the FLEURS dataset, and the real-time model was also released under Apache 2.0, so you can BUILD your agents with it! </p><p>The highest praise? Speaker diarization, being able to tell who is speaking when, which is a great addition. This model also outperforms Gemini Flash and GPT transcribe, and is 3x faster than ElevenLabs Scribe at one-fifth the cost! 
</p><p>ACE-Step 1.5: Open-source AI music generator runs full songs in under 10 seconds on consumer GPUs with MIT license (<a target="_blank" href="https://x.com/AmbsdOP/status/2018735590930518175">X</a>, <a target="_blank" href="https://github.com/ace-step/ACE-Step">GitHub</a>, <a target="_blank" href="https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B">HF</a>, <a target="_blank" href="https://ace-step.github.io/">Blog</a>, <a target="_blank" href="https://github.com/AmbsdOP/ace-step-ui">GitHub</a>)</p><p>This open source release surprised me the most, as I didn’t expect we’d be having Suno at home any time soon. I’ve generated multiple rock tracks with custom lyrics on my Mac (though slower than 10 seconds, as I don’t have a beefy home GPU) and they sound great! </p><p>This week’s buzz - Weights & Biases update</p><p>Folks who follow the newsletter know that we hosted a hackathon, so here’s a small recap from last weekend! Over 180 folks attended our hackathon (a very decent 40% show-up rate for SF). The winning team was composed of 15-year-old Savir and his friends, his third time at the hackathon! They built a self-improving agent that navigates the UIs of cloud providers for you! </p><p>A huge thanks to our sponsors, particularly Cursor, who gave every hacker $50 of credits on the Cursor platform; one guy used over 400M tokens and shipped <a target="_blank" href="https://t.co/NOBn6uTgzn">fractal.surf</a> from the hackathon! If you’d like a short video recap, Ryan posted one <a target="_blank" href="https://x.com/wandb/status/2019207262221529596">here</a>, and a huge shoutout to the many fans of ThursdAI who showed up to support! </p><p>Vision, Video and AI Art</p><p>Grok Imagine 1.0 takes over video charts with native audio, lip-sync and 10-second generations.</p><p>We told you about Grok Imagine in the API last week, but this week it was officially launched as a product and the results are quite beautiful. 
It’s also climbing to the top of the charts on the Artificial Analysis and Design Arena websites.</p><p>Kling 3.0 is here with native multimodal, multi-shot sequences (<a target="_blank" href="https://x.com/Kling_ai/status/2019064918960668819">X</a>, <a target="_blank" href="https://klingai.com/">Announcement</a>)</p><p>This is definitely a hot moment for video models, as Kling shows some crazy 15-second multi-shot realistic footage with near-perfect character consistency! </p><p>The rise of the agentic (clawgentic?) internet a.k.a ClankerNet</p><p>Last week we told you that ClawdBot changed its name to Moltbot (I then had to update the blog post as, that same day, Peter rebranded again to OpenClaw, which is a MUCH better name). </p><p>But the “molt” thing took hold, and an “AI-native reddit” called MoltBook exploded in virality. It is supposedly a completely agentic reddit-like forum, with sub-reddits, and agents verifying themselves through their humans on X. </p><p>Even Andrej Karpathy sent <a target="_blank" href="https://www.moltbook.com/u/KarpathyMolty">his bot </a>in there (though admittedly it posted just 1 time) and called this the closest to a “sci-fi” moment in the history of the internet. </p><p>MoltBook, as well as maybe hundreds of other “AI agent focused” websites, popped up within days, including a <a target="_blank" href="https://www.claw-tube.ai/">youtube</a>, a <a target="_blank" href="https://moltx.io/">twitter</a>, a <a target="_blank" href="https://molt.church/">church</a>, a <a target="_blank" href="https://www.4claw.org/">4chan</a>, an <a target="_blank" href="https://molty.pics/">instagram</a> and a lot more websites. Many of these are fueled by crypto bros riding the memetic waves, many are vibe-coded (MoltBook was hacked 3 times in the last week, I think), but they all show something very interesting: the rise of a new internet, and a collective AI psychosis some on our timelines are having right now. 
Hell, there’s even a “drug store” that sells markdown files that, if read, make your bot hallucinate in very specific ways (first sample is free!) </p><p>I am a proud owner of an OpenClaw bot (Wolfred), and I noticed something weird start to happen over the two weeks I’ve had him, running on his own MacBook, humming along, always present in Telegram. I noticed the same feelings toward that bot as I have towards my pet, or dare I say... kids? I noticed a similar joy when it learns a task and self-improves, and similar disdain and annoyance when it fails to do something we’ve talked about hundreds of times. </p><p>But here’s the thing, it’s not... an entity. I don’t feel a specific feeling towards Opus (though admittedly, Opus is the best at... playing the character of your assistant), it’s barely a few markdown files on a disk plus the always-on ability to answer, but something for sure is there. </p><p>This... feeling was taken by some others to the extreme. People claim that their bots now build full companies for them (I call mega BS: no matter how much you invest in your setup, these AI bots need a LOT of hand holding, they fail a LOT, and they can’t actually create a full product). This ties into the general “coding with AI agents” theme that was narrated by Gergely Orosz from The Pragmatic Engineer. Interacting with a team of AI agents is draining, and people are having <a target="_blank" href="https://x.com/tbpn/status/2019166322060587375">trouble sleeping</a>. I hope this is temporary, but definitely take care of yourself if this is how you feel after interacting with agents all day! </p><p>On security of bots and skills</p><p>.md is the new .exe</p><p>We covered this on the show, but I wanted to write about it here as well: the explosion of OpenClaw brought with it an explosion of new malware and prompt injections. 
1Password folks have a very <a target="_blank" href="https://1password.com/blog/from-magic-to-malware-how-openclaws-agent-skills-become-an-attack-surface">detailed writeup</a> on the vulnerability surface area of skills, for agents that can do... whatever on your computer and have access to API keys, emails etc. </p><p>The double-edged sword here is that an AI assistant is only really useful if it has access to your data and can write code. But this is also what makes it a very valuable target for hackers to exploit. At CoreWeave/W&B all OpenClaw installations were banned, and honestly I’m not even mad. This makes perfect sense for enterprises and companies (and hell, people at home!) </p><p>As Wolfram mentioned on the show, .md is the new .exe and should be treated as such. Your bots should not be installing arbitrary skill files, as those can contain scripts or instructions that can... absolutely take over your life. Be careful out there! </p><p>Phew, what a... week, folks. From the agentic internet to new coding kings, there’s so much to play with. I hope you enjoy this as much as we do! </p><p>Shoutout to Ling and Hakim, two fans of ThursdAI who traveled from London for the hackathon and made my day! </p><p>Here are the show notes and links for your pleasure. Please don’t forget to subscribe and share this newsletter with your friends! 
</p><p>ThursdAI - Feb 05, 2026 - TL;DR</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* Vaibhav Srivastav (VB) - DX at OpenAI (<a target="_blank" href="http://x.com/@reach_vb">@reach_vb</a>)</p><p>* <strong>Open Source LLMs </strong></p><p>* <a target="_blank" href="https://z.ai">Z.ai</a> <strong>GLM-OCR</strong>: 0.9B parameter model achieves #1 ranking on OmniDocBench V1.5 for document understanding (<a target="_blank" href="https://x.com/Zai_org/status/2018520052941656385">X</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-OCR">HF</a>, <a target="_blank" href="https://ocr.z.ai">Announcement</a>)</p><p>* Alibaba launches <strong>Qwen3-Coder-Next</strong>, an 80B MoE coding agent model with just 3B active params that scores 70%+ on SWE-Bench Verified (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2018718453570707465">X</a>, <a target="_blank" href="https://qwen.ai/blog?id=qwen3-coder-next">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-coder-next">HF</a>)</p><p>* <strong>Intern-S1-Pro</strong>: a 1 trillion parameter open-source MoE with SOTA scientific reasoning across chemistry, biology, materials, and earth sciences (<a target="_blank" href="https://x.com/intern_lm/status/2019042113305108641">X</a>, <a target="_blank" href="https://huggingface.co/internlm/Intern-S1-Pro">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2508.15763">Arxiv</a>, <a target="_blank" 
href="https://modelscope.cn/models/internlm/Intern-S1-Pro">Announcement</a>)</p><p>* StepFun <strong>Step 3.5 Flash</strong>: 196B sparse MoE model with only 11B active parameters, achieving frontier reasoning at 100-350 tok/s (<a target="_blank" href="https://x.com/StepFun_ai/status/2018370831538180167">X</a>, <a target="_blank" href="https://huggingface.co/stepfun-ai/Step-3.5-Flash-Int4">HF</a>)</p><p>* <strong>Agentic AI segment</strong></p><p>* <a target="_blank" href="https://www.moltbook.com/">Moltbook</a>, a reddit for agents, as well as a <a target="_blank" href="https://www.claw-tube.ai/">youtube</a>, a <a target="_blank" href="https://moltx.io/">twitter</a>, a <a target="_blank" href="https://molt.church/">church</a>, a <a target="_blank" href="https://www.4claw.org/">4chan</a>, an <a target="_blank" href="https://molty.pics/">instagram</a>, a <a target="_blank" href="https://moltroad.com/">dark web</a> (do not let your agents go in any of these) </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI launches Codex App: A dedicated command center for managing multiple AI coding agents in parallel (<a target="_blank" href="https://x.com/reach_vb/status/2018385536616956209">X</a>, <a target="_blank" href="https://codex.openai.com/">Announcement</a>) </p><p>* OpenAI launches Frontier, an enterprise platform to build, deploy, and manage AI agents as ‘AI coworkers’ (<a target="_blank" href="https://x.com/OpenAI/status/2019413712772411528">X</a>, <a target="_blank" href="https://openai.com/index/introducing-openai-frontier/">Blog</a>)</p><p>* Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, 1M token context, and agent teams for parallel autonomous work (<a target="_blank" href="https://x.com/claudeai/status/2019467372609040752">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-opus-4-6">Blog</a>)</p><p>* OpenAI releases GPT-5.3-Codex with record-breaking coding benchmarks and mid-task steerability (<a 
target="_blank" href="https://x.com/sama/status/2019474754529321247">X</a>)</p><p>* <strong>This week’s Buzz - Weights & Biases update</strong></p><p>* Links to the gallery of our hackathon winners (<a target="_blank" href="https://cerebralvalley.ai/e/weave-hacks-3-self-improving-agents-hackathon-with-weights-and-biases-7014fe80/hackathon/gallery">Gallery</a>)</p><p>* <strong>Vision & Video</strong></p><p>* xAI launches <strong>Grok Imagine 1.0</strong> with 10-second 720p video generation, native audio, and API that tops Artificial Analysis benchmarks (<a target="_blank" href="https://x.com/xai/status/2018164753810764061">X</a>, <a target="_blank" href="https://grok.com/">Announcement</a>, <a target="_blank" href="https://artificialanalysis.ai/text-to-video/arena?tab=Leaderboard">Benchmark</a>)</p><p>* <strong>Kling 3.0</strong> launches as all-in-one AI video creation engine with native multimodal generation, multi-shot sequences, and built-in audio (<a target="_blank" href="https://x.com/Kling_ai/status/2019064918960668819">X</a>, <a target="_blank" href="https://klingai.com/">Announcement</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Mistral AI launches <strong>Voxtral Transcribe 2</strong> with state-of-the-art speech-to-text, sub-200ms latency, and open weights under Apache 2.0 (<a target="_blank" href="https://x.com/MistralAI/status/2019068826097213953">X</a>, <a target="_blank" href="https://mistral.ai/news/voxtral-transcribe-2/">Blog</a>, <a target="_blank" href="https://docs.mistral.ai/capabilities/audio/">Announcement</a>, <a target="_blank" href="https://inworld-mistral-demo.inworld.ai/index.html">Demo</a>)</p><p>* <strong>ACE-Step 1.5:</strong> Open-source AI music generator runs full songs in under 10 seconds on consumer GPUs with MIT license (<a target="_blank" href="https://x.com/AmbsdOP/status/2018735590930518175">X</a>, <a target="_blank" href="https://github.com/ace-step/ACE-Step">GitHub</a>, <a target="_blank" 
href="https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B">HF</a>, <a target="_blank" href="https://ace-step.github.io/">Blog</a>, <a target="_blank" href="https://github.com/AmbsdOP/ace-step-ui">GitHub</a>)</p><p>* OpenBMB releases <strong>MiniCPM-o 4.5</strong> - the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously (<a target="_blank" href="https://x.com/OpenBMB/status/2018741614257307678">X</a>, <a target="_blank" href="https://huggingface.co/openbmb/MiniCPM-o-4_5">HF</a>, <a target="_blank" href="https://github.com/OpenBMB/MiniCPM-o">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* LingBot-World: <strong>Open-source world model</strong> from Ant Group generates 10-minute playable environments at 16fps, challenging Google Genie 3 (<a target="_blank" href="https://x.com/dr_cintas/status/2017650068019368119">X</a>, <a target="_blank" href="https://huggingface.co/robbyant/lingbot-world-base-cam">HF</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-5-opus-46-was-1-for</link><guid isPermaLink="false">substack:post:187041236</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 06 Feb 2026 01:20:32 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/187041236/4d4250626205b9a6b712072575a4d42e.mp3" length="70432088" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5869</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/187041236/556b020f6f62e426397decb6bee4ad84.jpg"/><itunes:episodeType>full</itunes:episodeType></item><item><title><![CDATA[📆 ThursdAI - Jan 29 - Genie3 is here, Clawd rebrands, Kimi K2.5 surprises, Chrome goes agentic & more AI news]]></title><description><![CDATA[<p>Hey guys, Alex here 👋 This week was so dense, that even my personal AI assistant Wolfred was struggling to help me keep up! Not to mention that we finally got to try one incredible piece of AI tech I’ve been waiting to get to try for a while! </p><p>Clawdbot we told you about last week <a target="_blank" href="https://youtube.com/shorts/2acygiWTsRw">exploded in popularity</a> and had to rebrand to Molt...bot  OpenClaw after Anthropic threatened the creators, Google is shipping like crazy, first adding Agentic features into Chrome (used by nearly 4B people daily!) then shipping a glimpse of a future where everything we see will be generated with Genie 3, a first real time, consistent world model you can walk around in! </p><p>Meanwhile in Open Source, Moonshot followed up with a .5 update to their excellent Kimi, our friends at Arcee launched Trinity Large (400B) and AI artists got the full Z-image. 
oh and Grok Imagine (their video model) now has an API, audio support, and supposedly matches Veo and Sora on quality while beating them on speed/price. </p><p>Tons to cover, let’s dive in, and of course, all the links and show notes are at the end of the newsletter. </p><p>Hey, if you’re in SF this weekend (Jan 31-Feb 1), I’m hosting a self-improving agents hackathon at the W&B office, limited seats are left, Cursor is the surprise sponsor with $50/hacker credits + over $15K in cash prizes. <a target="_blank" href="http://lu.ma/weavehacks3">lu.ma/weavehacks3</a> - Join us. </p><p>Play any reality - Google Genie3 launches to Ultra Subscribers </p><p>We got our collective minds blown by the videos of Genie-3 back in August (our <a target="_blank" href="https://sub.thursdai.news/i/170398983/world-models-holy-crap-moment-google-genie3">initial coverage</a>) and now, Genie is available to the public (those who can pay for the Ultra tier; more on this later, I have 3 codes to give out!). You can jump in and generate any world and any character you can imagine <a target="_blank" href="https://labs.google/fx/projectgenie/tools/projectgenie/creation">here</a>! </p><p>We generated a blue hacker lobster draped in a yellow bomber jacket swimming with mermaids, and honestly all of us were kind of shocked at how well this worked. The shadows on the rocks, the swimming mechanics, and poof, it was all over in 60 seconds, and we needed to create another world. </p><p>Thanks to the DeepMind team, I had a bit of early access to this tech and had a chance to interview the folks behind the model (look out for that episode soon), and the use-cases for this span from entertaining your kids all the way to “this may be the path to AGI, generating full simulated worlds for agents for them to learn”. </p><p>The visual fidelity, reaction speed and general feel of this far outruns the previous world models we showed you (WorldLabs, Mirage), as this model seems to have memory of every previous action (e.g. 
if your character makes a trail, you turn around and the trail is still there!). Is it worth the upgrade to the <strong>Ultra</strong> Gemini plan? Probably not; it’s an incredible demo, but the 1-minute length is very short, and the novelty wears off fairly quickly. </p><p>If you’d like to try, folks at DeepMind gave us 3 Ultra subscriptions to give out! Just tweet out the link to this episode, add #GenieThursdai, and tag @altryne, and I’ll raffle the Ultra subscriptions among those who do. </p><p>Chrome steps into Agentic Browsing with Auto Browse</p><p>This wasn’t the only mind-blowing release from Gemini this week; the Chrome team upgraded the Gemini inside Chrome to be actually helpful and agentic. And yes, we’ve seen this before, with Atlas from OpenAI and Comet from Perplexity, but Google’s Chrome has a 70% hold on the browser market, and giving everyone with a Pro/Ultra subscription access to “Auto Browse” is a huge huge deal. </p><p>We’ve tested the Auto Browse feature live on the show, and Chrome completed 77 steps! I asked it to open up each of my bookmarks in a separate folder and summarize all of them, and it did a great job! </p><p>Honestly, the biggest deal about this is not the capability itself, it’s the nearly 4B people this is now very close to, and the economic impact of this ability. IMO this may be the more impactful news out of Google this week! </p><p>Other news in big labs: </p><p>* Anthropic launches in-chat applications based on the MCP Apps protocol. We interviewed the two folks behind this protocol <a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-25?utm_source=publication-search">back in November</a> if you’d like to hear more about it. 
With connectors like Figma, Slack, and Asana that can now show rich experiences</p><p>* Anthropic’s CEO Dario Amodei also published an essay called ‘The Adolescence of Technology’, warning of AI risks to national security</p><p>* Anthropic forced the creator of the popular open source AI assistant Clawdbot to rename, and they chose Moltbot as the name (apparently because crypto scammers stole a better name). EDIT: just after publishing this newsletter, the name was changed to OpenClaw, which we all agree is way way better. </p><p>Open Source AI</p><p><strong>Kimi K2.5: Moonshot AI’s 1 Trillion Parameter Agentic Monster</strong></p><p>Wolfram’s favorite release of the week, and for good reason. <strong>Moonshot AI</strong> just dropped Kimi K2.5, and this thing is an absolute beast for open source. We’re talking about a 1 trillion parameter Mixture-of-Experts model with 32B active parameters, 384 experts (8 selected per token), and 256K context length.</p><p>But here’s what makes this special — it’s now multimodal. The previous Kimi was already known for great writing vibes and creative capabilities, but this one can see. It can process videos. People are sending it full videos and getting incredible results.</p><p>The benchmarks are insane: 50.2% on the HLE full set with tools, 74.9% on BrowseComp, and open-source SOTA on vision and coding with 78.5% MMMU Pro and 76.8% SWE-bench Verified. These numbers make it competitive with Claude 4.5 Opus and GPT 5.2 on many tasks. Which, for an open model, is crazy. </p><p>And then there’s Agent Swarm — their groundbreaking feature that spawns up to 100 parallel sub-agents for complex tasks, achieving 4.5x speedups. The ex-Moonshot RL lead called this a “zero-to-one breakthrough” with self-directed parallel execution.</p><p>Now let’s talk about what matters for folks running agents and burning through tokens: pricing. Kimi K2.5 is $0.60 per million input tokens and $3 per million output. 
Compare that to Opus 4.5 at $4.50 input and $25 output per million. About a 10x price reduction. If you’re running OpenClaw and watching your API bills climb with sub-agents, this is a game-changer. (tho I haven’t tested this myself) </p><p>Is it the same level of intelligence as whatever magic Anthropic cooks up with Opus? Honestly, I don’t know — there’s something about the Claude models that’s hard to quantify. But for most coding tasks on a budget, you can absolutely switch to Kimi and still get great results.</p><p>🦞 Clawdbot is no more, Moltbot is dead, Long Live OpenClaw</p><p>After we covered the incredible open source project <a target="_blank" href="https://thursdai.news/jan-22">last week</a>, Clawdbot exploded in popularity, driven by the Claude Max subscription and a crazy viral loop where folks who try it can’t wait to talk about it; it was everywhere! Apparently it was also on Anthropic’s lawyers’ minds when they sent Peter Steinberger a friendly-worded letter to rebrand and gave him like 12 hours. </p><p>Apparently, when pronounced, Claude and Clawd sound the same, and they are worried about copyright infringement (which makes sense; most of the early success of Clawd was due to Opus being amazing). The main issue is, due to the popularity of the project, crypto a******s sniped the moltybot nickname on X, so we got left with Moltbot, which is thematically appropriate, but oh so hard to remember and pronounce!</p><p>EDIT: OpenClaw was just announced as the new name; apparently I wasn’t the only one who absolutely hated the name Molt! </p><p>Meanwhile, rebrand or not, my own instance of OpenClaw created an <a target="_blank" href="https://x.com/wooolfred/status/2016941314735296583">X account</a>, helped me prepare for ThursdAI (including generating a thumbnail), created a video for us today on the fly, and keeps me up to date on emails and unanswered messages via a daily brief. 
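</p><p>Since setups like this run agents around the clock and burn tokens constantly, the Kimi-vs-Opus rates quoted above are worth penciling out. A minimal sketch (my own illustration, not from the show; the workload numbers are invented, only the per-million-token prices are the quoted ones):</p>

```python
# Hedged sketch: monthly agent spend at the per-million-token rates quoted
# above. The 500M-input / 50M-output workload is a made-up example.
PRICES = {
    "kimi-k2.5": {"input": 0.60, "output": 3.00},   # $ per 1M tokens (quoted)
    "opus-4.5":  {"input": 4.50, "output": 25.00},  # $ per 1M tokens (quoted)
}

def monthly_cost(model: str, input_m: float, output_m: float) -> float:
    """Dollar cost for input_m / output_m million tokens in a month."""
    p = PRICES[model]
    return input_m * p["input"] + output_m * p["output"]

kimi = monthly_cost("kimi-k2.5", 500, 50)  # 300 + 150  = $450
opus = monthly_cost("opus-4.5", 500, 50)   # 2250 + 1250 = $3500
print(f"Kimi: ${kimi:.0f}, Opus: ${opus:.0f}, ratio: {opus / kimi:.1f}x")
```

<p>The exact multiple depends on your input/output mix; with these list prices it works out to roughly 7-8x, in the ballpark of the “about 10x” headline above.</p><p>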
It really has shown me a glimpse of how a truly personal AI assistant can be helpful in a fast-changing world! </p><p>I’ve shared a lot of tips and tricks, about <a target="_blank" href="https://x.com/altryne/status/2016192932685230374">memory</a>, about <a target="_blank" href="https://x.com/altryne/status/2016619737069879791">threads</a> and much more, as we all learn to handle this new ... AI agent framework! But I definitely feel that this is a new unlock in capability, for me and for many others. If you haven’t installed OpenClaw, lmk in the comments why not.</p><p><strong>Arcee AI Trinity Large: The Western Open Source Giant</strong></p><p>Remember when we had <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-4-2025-deepseek-v32">Lucas Atkins</a>, Arcee’s CTO, on the show just as they were firing up their 2,000 NVIDIA B300 GPUs? </p><p>Well, the run is complete, and the results are massive. Arcee AI just dropped <strong>Trinity Large</strong>, a 400B parameter sparse MoE model (with a super-efficient 13B active params via 4-of-256 routing) trained on a staggering 17 trillion tokens in just 33 days. </p><p>This represents the largest publicly announced pretraining run on B300 infrastructure, costing about $20M (and tracked with WandB of course!), and proves that Western labs can still compete at the frontier of open source. Best part? It supports 512K context and is <a target="_blank" href="https://openrouter.ai/arcee-ai/trinity-large-preview"><strong>free on OpenRouter</strong></a><strong> </strong>until February 2026. Go try it now!</p><p><strong>Quick open source hits: Jan v3, PersonaPlex, Kimi Code</strong></p><p>* <strong>Jan AI</strong> released <strong>Jan v3</strong>, a 4B parameter model optimized for local inference. 132 tokens/sec on Apple Silicon, 262K context, 40% improvement on Aider benchmarks. 
This is the kind of small-but-mighty model you can actually run on your laptop for coding tasks.</p><p>* <strong>Nvidia</strong> released <strong>PersonaPlex-7B</strong> - full-duplex voice AI that listens and speaks simultaneously with persona control</p><p>* <strong>Moonshot AI</strong> also releases Kimi Code: Open-source Python-based coding agent with Apache 2.0 license</p><p>Vision, Video and AI art</p><p><strong>xAI Grok Imagine API: #1 in Video Generation</strong></p><p><strong>xAI officially launched the Grok Imagine API </strong>with an updated model, and it’s now ranked #1 in both text-to-video and image-to-video on the Artificial Analysis leaderboards. It beats Runway Gen-4.5, Kling 2.5 Turbo, and Google Veo 3.1.</p><p>And of course, the pricing is $4.20 per minute. Of course it is. That’s cheaper than Veo 3.1 at $12/min and Sora 2 Pro at $30/min by 3-7x, with 45-second latency versus 68+ seconds for the competition.</p><p>During the show, I demoed this live with my AI assistant Wolfred. 
I literally sent him a message saying “learn this new API based on this URL, take this image of us in the studio, and create a video where different animals land on each of our screens.” He learned the API, generated the video (it showed wolves, owls, cats, and lions appearing on our screens with generated voice), and then when Nisten asked to post it to Twitter, Wolfred scheduled it on X and tagged everyone — all without me doing anything except asking.</p><p>Look, it’s not Veo, but the price and the speed are crazy; xAI cooked with this model, and you can try it on <a target="_blank" href="https://x.com/fal/status/2016746472931283366">FAL</a> and directly on <a target="_blank" href="https://x.ai/news/grok-imagine-api">xAI</a>.</p><p>Decart - Lucy 2 - Real-time 1080p video transformation at 30 FPS with near-zero latency for $3/hour </p><p>This one also caught me by surprise. I read about it and said “oh this is cool, I’ll mention this on the show”, and then we tried it in real time: I approved my webcam, I got transformed into Albert Einstein, and I could raise my hands and their model would, in real time, raise Albert’s hands! </p><p>The speed and fidelity of this model is something else, and yeah, after watching the Genie 3 world model, it’s hard to be impressed, but I was very impressed by this, as previous stuff from Decart was “only showing the future” and this one is a real-time, 1080p-quality webcam transformation! </p><p>You can try this yourself here: <a target="_blank" href="https://t.co/q6uKi8dM3y">lucy.decart.ai</a>, they let you create any kind of prompt! </p><p>AI Art Quick Hits: </p><p>* Tencent launches HunyuanImage 3.0-Instruct: 80B MoE model for precise image editing with chain-of-thought reasoning. It’s a VERY big model by AI art standards, but that’s because it has an LLM core, and this makes it much better for precise image editing. 
</p><p>* Tongyi Lab releases Z-Image, a full-capacity undistilled foundation model for image generation with superior diversity. We told you about the turbo version before, this one is its older brother and much higher quality! </p><p>The other highlight this week is that I got to record a show with Wolfram in person for the first time, as he’s now also an AI Evangelist with W&B and he’s here in SF for our hackathon (remember? you can still register lu.ma/weavehacks3 )</p><p>Huge shoutout to Chroma folks for hosting us at their amazing podcast studio (TJ, Jeff and other folks), if you need a memory for your AI assistant, check out <a target="_blank" href="http://trychroma.com/">chroma.db</a> 🎉 </p><p>Signing off as we have a hackathon to plan, see you guys next week (or this weekend!) 🫡 </p><p>ThursdAI Jan 29 , TL;DR and show notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Moonshot AI releases <strong>Kimi K2.5</strong> (<a target="_blank" href="https://x.com/Kimi_Moonshot/status/2016024049869324599">X</a>, <a target="_blank" href="https://huggingface.co/moonshotai/Kimi-K2.5">HF</a>)</p><p>* Arcee AI releases <strong>Trinity Large</strong> (<a target="_blank" href="https://x.com/arcee_ai/status/2016278017572495505">X</a>, <a target="_blank" href="https://arcee.ai/blog/trinity-large">Blog</a>, <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Large-Preview">HF</a>, <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Large-Base">HF</a>, <a target="_blank" 
href="https://huggingface.co/arcee-ai/Trinity-Large-TrueBase">HF</a>)</p><p>* Jan AI releases <strong>Jan v3</strong> (<a target="_blank" href="https://x.com/jandotai/status/2016019981541245353">X</a>, <a target="_blank" href="https://huggingface.co/janhq/Jan-v3-4B-base-instruct">HF</a>, <a target="_blank" href="https://huggingface.co/janhq/Jan-v3-4B-base-instruct-gguf">HF</a>, <a target="_blank" href="https://jan.ai">Blog</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google launches agentic Auto-Browse in Chrome with Gemini 3 (<a target="_blank" href="https://x.com/addyosmani/status/2016576478209855895">X</a>, <a target="_blank" href="https://blog.google/products-and-platforms/products/chrome/gemini-3-auto-browse/">Blog</a>)</p><p>* Anthropic launches <strong>MCP Apps</strong> (<a target="_blank" href="https://x.com/claudeai/status/2015851783655194640">X</a>)</p><p>* Google launches Agentic Vision in Gemini 3 Flash (<a target="_blank" href="https://x.com/osanseviero/status/2016236082501959783">X</a>, <a target="_blank" href="https://ai.google.dev/gemini-api/docs/code-execution">Announcement</a>)</p><p>* Anthropic CEO Dario Amodei publishes major essay ‘The Adolescence of Technology’ (<a target="_blank" href="https://x.com/DarioAmodei/status/2015833046327402527">X</a>, <a target="_blank" href="https://darioamodei.com/the-adolescence-of-technology">Blog</a>, <a target="_blank" href="https://darioamodei.com/machines-of-loving-grace">Blog</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* WandB hackathon Weavehacks 3 - Jan 31-Feb1 in SF - limited seats available <a target="_blank" href="http://lu.ma/weavehacks3">lu.ma/weavehacks3</a></p><p>* <strong>Vision & Video</strong></p><p>* Google DeepMind launches Project Genie (<a target="_blank" href="https://x.com/GoogleDeepMind/status/2016919756440240479">X</a>, <a target="_blank" href="https://labs.google/projectgenie">Announcement</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* NVIDIA releases 
PersonaPlex-7B (<a target="_blank" href="https://x.com/HuggingModels/status/2014788077924040729">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/PersonaPlex">HF</a>, <a target="_blank" href="https://github.com/NVIDIA/personaplex">Announcement</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* xAI launches Grok Imagine API (<a target="_blank" href="https://x.com/xai/status/2016745652739363129">X</a>, <a target="_blank" href="https://console.x.ai/">Announcement</a>)</p><p>* Tencent launches HunyuanImage 3.0-Instruct (<a target="_blank" href="https://x.com/TencentHunyuan/status/2015635861833167074">X</a>, <a target="_blank" href="https://x.com/TencentHunyuan/status/2016356787361087615">X</a>)</p><p>* Tongyi Lab releases Z-Image (<a target="_blank" href="https://x.com/Ali_TongyiLab/status/2016186674531758285">X</a>, <a target="_blank" href="https://github.com/modelscope/DiffSynth-Studio">GitHub</a>)</p><p>* <strong>Tools</strong></p><p>* Moonshot AI releases Kimi Code (<a target="_blank" href="https://x.com/Kimi_Moonshot/status/2016034259350520226">X</a>, <a target="_blank" href="https://kimi.ai/code">Announcement</a>, <a target="_blank" href="https://github.com/MoonshotAI/kimi-code">GitHub</a>)</p><p>* Andrej Karpathy shares his shift to 80% agent-driven coding with Claude (<a target="_blank" href="https://x.com/karpathy/status/2015883857489522876">X</a>)</p><p>* Clawdbot is forced to rename to Moltbot (Molty) because of Anthropic’s lawyers, then renames to OpenClaw</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/jan-29-genie3-is-here-rip-clawdbot</link><guid isPermaLink="false">substack:post:186263943</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 30 Jan 2026 02:42:52 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/186263943/e4df04a8acbbf467cf29f9722ef0fc53.mp3" length="64669546" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5389</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/186263943/0b2a5858be03f4415b8c5410d4a6de19.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 22 - Clawdbot deep dive, GLM 4.7 Flash, Anthropic constitution + 3 new TTS models]]></title><description><![CDATA[<p>Hey! Alex here, with another weekly AI update! </p><p>It seems like ThursdAI is taking a new direction, as this is our 3rd show this year, and a 3rd deep dive into topics (previously <a target="_blank" href="https://thursdai.news/jan-8">Ralph</a>, <a target="_blank" href="https://thursdai.news/jan-15">Agent Skills</a>), so please let me know in the comments if you like this format. </p><p>This week’s deep dive is into Clawdbot, a personal AI assistant that you install on your computer but can control through your phone; it has access to your files, can write code, and helps organize your life, but most importantly, it can <strong>self-improve</strong>. Seeing Wolfred (my Clawdbot) learn to transcribe incoming voice messages blew my mind, and I wanted to share this one with you at length! We had Dan Peguine on the show for the deep dive + both Wolfram and Yam are avid users! This one is not to be missed. 
If ThursdAI is usually too technical for you, use Claude, and install Clawdbot after you read/listen to the deep dive!</p><p>Also this week, we read Claude’s Constitution that Anthropic released, heard a bunch of new TTS models (some are open source and very impressive) and talked about the new lightspeed coding model GLM 4.7 Flash. First the news, then the deep dive, let’s go 👇</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>Open Source AI</p><p><strong>Z.ai’s GLM‑4.7‑Flash is the Local Agent Sweet Spot </strong>(<a target="_blank" href="https://x.com/Zai_org/status/2013261304060866758">X</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.7-Flash">HF</a>)</p><p>This was the open‑source release that mattered this week. <a target="_blank" href="https://z.ai">Z.ai</a> (formerly Zhipu) shipped GLM‑4.7‑Flash, a 30B MoE model with only 3B active parameters per token, which makes it much more efficient for local agent work. We’re talking a model you can run on consumer hardware that still hits 59% on SWE‑bench Verified, which is uncomfortably close to frontier coding performance. In real terms, it starts to feel like “Sonnet‑level agentic ability, but local.” I know, I know, we keep saying “sonnet at home” about different open source models, but this one slaps! </p><p>Nisten was getting around 120 tokens/sec on an M3 Ultra Mac Studio using MLX, and that’s kind of the headline. The model is fast and capable enough that local agent loops like RALPH suddenly feel practical. It also performs well on browser‑style agent tasks, which is exactly what you want for local automation without sending all your data to a cloud provider. 
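</p><p>A quick back-of-the-envelope on why a 30B-total / 3B-active MoE is so friendly to local hardware (my own arithmetic and assumptions, including the 4-bit quantization; these are not official figures):</p>

```python
# Rough sketch (my assumptions, not official figures): weight footprint of a
# sparse MoE. All experts must fit in memory, but only the routed slice of
# parameters is read per token, which is where the decode speed comes from.
def weight_gb(params_b: float, bits: int) -> float:
    """Approximate weight size in GB for params_b billion params at bits each."""
    return params_b * bits / 8  # 1B params at 8 bits per param = 1 GB

resident = weight_gb(30, 4)   # all 30B params, 4-bit quant -> ~15 GB in RAM
per_token = weight_gb(3, 4)   # ~3B active params -> ~1.5 GB touched per token
print(f"~{resident:.0f} GB resident, ~{per_token:.1f} GB of weights per token")
```

<p>That ~15 GB resident / ~1.5 GB-per-token split is the intuition: decode speed is roughly bounded by memory bandwidth divided by bytes read per token, so a unified-memory Mac can stream this at 100+ tok/s where a dense 30B (reading all ~15 GB per token) would crawl.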
</p><p><strong>Liquid AI’s LFM2.5‑1.2B Thinking is the “Tiny but Capable” Class </strong>(<a target="_blank" href="https://x.com/liquidai/status/2013633347625324627">X</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking">HF</a>)</p><p>Liquid AI released a 1.2B reasoning model that runs in under 900MB of memory while still managing to be useful. This thing is built for edge devices and old phones, and the speed numbers back it up. We’re talking 239 tok/s decode on AMD CPU, 82 tok/s on mobile NPU, and prefill speeds that make long prompts actually usable. Nisten made a great point: on iOS, there’s a per‑process memory limit around 3.8GB, so a 1.2B model lets you spend your budget on context instead of weights.</p><p>This is the third class of models we’re now living with: not Claude‑scale, not “local workstation,” but “tiny agent in your pocket.” It’s not going to win big benchmarks, but it’s perfect for on‑device workflows, lightweight assistants, and local RAG.</p><p>Voice & Audio: Text-to-Speech is hot this week with 3 releases! </p><p>We tested three major voice releases this week, and I’m not exaggerating when I say the latency wars are now fully on. </p><p>Qwen3‑TTS: Open Source, 97ms Latency, Voice Cloning (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2014326211913343303">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base">HF</a>)</p><p>Just 30 minutes before the show, Qwen released their first model of the year, Qwen3 TTS, with two models (0.6B and 1.7B). With support for voice cloning based on just 3 seconds of voice, and claims of 97ms latency, this Apache 2.0 release looked very good on the surface!</p><p>The demos we did on stage though... were lackluster. TTS models like Kokoro previously impressed us with super tiny sizes and decent voice, while Qwen3 didn’t really perform on the cloning aspect. 
For some reason (I tested in Russian, which they claim to support), the cloned voice kept repeating the provided sample voice instead of just generating the text I gave it. This confused me, and I’m hoping this is just a demo issue, not a problem with the model. They also support voice design, where you just type in the type of voice you want, which, to be fair, worked fairly well in our tests!</p><p>With Apache 2.0 and full finetuning capability, this is a great release for sure, kudos to the Qwen team! Looking forward to seeing what folks do with this properly. </p><p>FlashLabs Chroma 1.0: Real-Time Speech-to-Speech, Open Source (<a target="_blank" href="https://x.com/ModelScope2022/status/2014006971855466640">X</a>, <a target="_blank" href="https://huggingface.co/FlashLabs/Chroma-4B">HF</a>) </p><p>Another big open source release in the audio category this week was Chroma 1.0 from FlashLabs, which claims to be the first speech-to-speech model (not a model that has the traditional ASR>LLM>TTS pipeline), and they claim 150ms end-to-end latency! </p><p>The issue with this one is, the company released an open source 4B model and claimed that this model powers their chat interface <a target="_blank" href="https://www.flashlabs.ai/flashai-voice-agents">demo</a> on the web, but in the release notes they say the model is English-only, while on the website it sounds incredible and I spoke to it in other languages 🤔 I think the model we’ve tested is not the open source one. I couldn’t confirm this at the time of writing; I’ll follow up on X with the team and let you guys know. 
</p><p>Inworld AI launches TTS-1.5: <a target="_blank" href="https://reflect.app/g/altryne/tag/1">#1</a> ranked text-to-speech with sub-250ms latency at half a cent per minute (<a target="_blank" href="https://x.com/inworld_ai/status/2014020677343510629">X</a>, <a target="_blank" href="https://inworld.ai/tts">Announcement</a>)</p><p>Ok this one is definitely in the realm of “voice realistic enough you won’t be able to tell” as this is not an open source model, it’s a new competitor to 11labs and MiniMax - the two leading TTS providers out there. </p><p>Inworld claims to achieve better results on the TTS Arena, while being significantly cheaper and faster (up to 25x less than leading providers like 11labs) </p><p>We tested out their voices and they sounded incredible, replied fast and generally was a very good experience. With 130ms response time for their mini version, this is a very decent new entry into the world of TTS providers. </p><p><strong>Big Companies: Ads in ChatGPT + Claude Constitution</strong></p><p>OpenAI is testing ads in ChatGPT’s free and Go tiers. Ads appear as labeled “Sponsored” content below responses, and OpenAI claim they won’t affect outputs. It’s still a major shift in the product’s business model, and it’s going to shape how people perceive trust in these systems. I don’t love ads, but I understand the economics, they have to make money somehow, with 900M weekly active users, many of them on the free tier, they are bound to make some money with this move. I just hope they won’t turn into a greedy ad optimizing AI machine. </p><p>Meanwhile, Anthropic released an 80‑page “New Constitution for Claude” that they use during training. This isn’t a prompt, it’s a <a target="_blank" href="https://www.anthropic.com/constitution">full set of values</a> baked into the model’s behavior. There’s a fascinating section where they explicitly talk about Claude’s potential wellbeing and how they want to support it. 
It’s both thoughtful and a little existential. I recommend reading it, especially if you care about alignment and agent design. </p><p>I applaud Anthropic for releasing this with Creative Commons license for public scrutiny and adoption 👏</p><p>This weeks buzz - come join the hackathon I’m hosting Jan 31 in SF</p><p>Quick plug, we have limited seats left open for the hackathon I’m hosting for Weights & Biases at the SF office, and if you’re reading this, and want to join, I’ll approve you if you mention ThursdAI in the application! </p><p>With sponsors like Redis, Vercel, BrowserBase, Daily, Google Cloud, we are going to give out a LOT of cash as prizes! </p><p>I’ve also invited a bunch of my friends from the top agentic AI places to be judges, it’s going to be awesome, <a target="_blank" href="http://lu.ma/weavehacks3?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan22">come</a></p><p><strong>Deep dive into Clawdbot: Local-First, Self-Improving, and Way Too Capable agent</strong></p><p>Clawdbot (C‑L‑A‑W‑D) is that rare project where the hype is justified. It’s an open-source personal agent that runs locally on your Mac, but can talk to you through WhatsApp, Telegram, iMessage, Discord, Slack — basically wherever you already talk. What makes it different is not just the integrations; it’s the self‑improvement loop. You can literally tell it “go build a new skill,” and it will… build the skill, install it, then adopt it and start using it. It’s kind of wild to see it working for the first time. Now... it’s definitely not perfect, far far away from the polish of ChatGPT / Claude, but when it works, damn, it really is mindblowing.</p><p>That part actually happened live in the episode. <a target="_blank" href="https://substack.com/profile/2463530-dan-peguine">Dan Peguine 🐧</a> showed how he had it create a skill to anonymize his own data so he could demo it on stream without leaking his personal life. 
Another example: I told my Clawdbot to handle voice notes in Telegram. It didn’t know how, so it went and found a transcription method, wrote itself a skill, saved it, and from that point on just… did the thing. That was the moment it clicked for me. (just before posting this, it forgot how to do it, I think I screwed something up) </p><p>Dan’s daily brief setup was wild too. It pulls from Apple Health, local calendars, weather, and his own projects, then produces a clean, human daily brief. It also lets him set reminders through WhatsApp and even makes its own decisions about how much to bother him based on context. He shared a moment where it literally told him, “I won’t bug you today because it’s your wife’s birthday.” That isn’t a hardcoded workflow — it’s reasoning layered on top of persistent memory.</p><p>And that persistent memory is a big deal. It’s stored locally as Markdown files and folders, Obsidian‑style, so you don’t lose your life every time you switch models. You can route the brain to Claude Opus 4.5 today and a local model tomorrow, and the memory stays with you. That is a huge step up from “ChatGPT remembers you unless you unsubscribe.”</p><p>There’s also a strong community forming around shared skills via ClawdHub. People are building everything from GA4 analytics skills to app testing automations to Tesla battery status checkers. The core pattern is simple but powerful: talk to it, ask it to build a skill, then it can run that skill forever.</p><p>I definitely have some issues with the security aspect, you are essentially giving full access to an LLM to your machine, so many folks are buying a specific home for their ClawdBot (Mac Mini seems to be the best option for many of them) and are giving it secure access to passwords via a dedicated 1Password vault. I’ll keep you up to date about my endeavors with Clawd but definitely do give it a try! 
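</p><p>To picture the Markdown-folder memory described above, here is a hypothetical minimal version of the idea (the file names and layout are my own invention, not Clawdbot’s actual schema): everything the agent “knows” is plain files on disk, so any model you point at the folder inherits the same memory.</p>

```python
# Hypothetical sketch of model-agnostic memory: plain Markdown notes in a
# folder, concatenated into prompt context. The layout is illustrative only,
# not Clawdbot's real on-disk schema.
from pathlib import Path

def load_memory(memory_dir: str) -> str:
    """Concatenate every Markdown note under memory_dir into one context blob."""
    notes = sorted(Path(memory_dir).rglob("*.md"))
    return "\n\n".join(p.read_text(encoding="utf-8") for p in notes)

# Seed a tiny Obsidian-style vault and read it back.
root = Path("memory_demo")
(root / "people").mkdir(parents=True, exist_ok=True)
(root / "daily.md").write_text("Today: prep ThursdAI.", encoding="utf-8")
(root / "people" / "dan.md").write_text("Prefers morning briefs.", encoding="utf-8")
print(load_memory("memory_demo"))
```

<p>Because the store is just files, swapping the “brain” from Opus to a local model is a configuration change; the folder, and therefore the memory, stays put.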
</p><p>Installing</p><p>Installing Clawd on your machine is simple: go to <a target="_blank" href="http://clawd.bot">clawd.bot</a> and follow the instructions. Then find the most convenient way for you to talk to it (for me it was Telegram; creating a Telegram token takes 20 seconds) and then you can take it from there with Clawdbot itself! Ask it to do something, like clearing your inbox, setting a reminder, or… a million other things that you need for your personal life, and enjoy the discovery of what an ever-present, always-on AI can do! </p><p>Other news that we didn’t have time to cover at length but you should still know about: </p><p>* Overworld released an open-source realtime AI world model (<a target="_blank" href="https://x.com/overworld_ai/status/2013673088748245188">X</a>) </p><p>* Runway finally opened up their 4.5 video model, and it has image-to-video capabilities, including multi-shot image-to-video (<a target="_blank" href="https://x.com/runwayml/status/2014090404769976744">X</a>)</p><p>* Vercel launches <a target="_blank" href="https://skills.sh">skills.sh</a>, an “npm for AI agents skills”</p><p>* Anthropic’s Claude Code VS Code Extension Hits General Availability (<a target="_blank" href="https://x.com/claudeai/status/2013704053226717347">X</a>)</p><p>Ok, this is it for this week, folks! I’m going to play with (and try to fix…) my Clawdbot, and suggest you give it a try. Do let me know if the deep dives are a good format! 
</p><p>Show notes and links: </p><p>ThursdAI - Jan 22, 2026 - TL;DR and show notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Guest: Dan Peguine (<a target="_blank" href="https://x.com/danpeguine">@danpeguine</a>)</p><p>* Deep Dive - Clawdbot with Dan & Wolfram</p><p>* Clawdbot: Open-Source AI Agent Running Locally on macOS Transforms Personal Computing with Self-Improving Capabilities (<a target="_blank" href="https://x.com/steipete/status/2013666639330369894">X</a>, <a target="_blank" href="https://macstories.net/stories/clawdbot-showed-me-what-the-future-of-personal-ai-assistants-looks-like/">Blog</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* <a target="_blank" href="https://z.ai">Z.ai</a> releases GLM-4.7-Flash, a 30B parameter MoE model that sets a new standard for lightweight local AI assistants (<a target="_blank" href="https://x.com/Zai_org/status/2013261304060866758">X</a>, <a target="_blank" href="https://z.ai/blog/glm-4.7">Technical Blog</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.7-Flash">HuggingFace</a>)</p><p>* Liquid AI releases LFM2.5-1.2B-Thinking, a 1.2B parameter reasoning model that runs entirely on-device with under 900MB memory (<a target="_blank" href="https://x.com/liquidai/status/2013633347625324627">X</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking">HF</a>, <a target="_blank" href="https://leap.liquid.ai/models?model=lfm2.5-1.2b-thinking">Announcement</a>)</p><p>* Sakana AI introduces RePo, a new way for language models to dynamically reorganize their context for better attention (<a target="_blank" 
href="https://x.com/SakanaAILabs/status/2013046887746843001">X</a>, <a target="_blank" href="https://arxiv.org/abs/2512.14391">Paper</a>, <a target="_blank" href="https://pub.sakana.ai/repo/">Website</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI announces testing ads in ChatGPT free and Go tiers, prioritizing user trust and transparency (<a target="_blank" href="https://x.com/OpenAI/status/2012223373489614951">X</a>)</p><p>* Anthropic publishes new 80-page constitution for Claude, shifting from rigid rules to explanatory principles that teach AI ‘why’ rather than ‘what’ to do (<a target="_blank" href="https://x.com/AnthropicAI/status/2014005798691877083">X</a>, <a target="_blank" href="https://anthropic.com/news/claudes-constitution">Blog, Announcement</a>)</p><p>* <strong>This week’s Buzz</strong></p><p>* WandB hackathon Weavehacks 3 - Jan 31 - Feb 1 in SF - limited seats available lu.ma/weavehacks3</p><p>* <strong>Vision & Video</strong></p><p>* Overworld Releases Waypoint-1: Real-Time AI World Model Running at 60fps on Consumer GPUs (<a target="_blank" href="https://x.com/overworld_ai/status/2013673088748245188">X</a>, <a target="_blank" href="https://over.world/">Announcement</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Alibaba Qwen Releases Qwen3-TTS: Full Open-Source TTS Family with 97ms Latency, Voice Cloning, and 10-Language Support (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2014326211913343303">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base">HF</a>, <a target="_blank" href="https://github.com/QwenLM/Qwen3-TTS">GitHub</a>)</p><p>* FlashLabs Releases Chroma 1.0: World’s First Open-Source Real-Time Speech-to-Speech Model with Voice Cloning Under 150ms Latency (<a target="_blank" href="https://x.com/ModelScope2022/status/2014006971855466640">X</a>, <a target="_blank" href="https://huggingface.co/FlashLabs/Chroma-4B">HF</a>, <a target="_blank" 
href="https://arxiv.org/abs/2601.11141">Arxiv</a>)</p><p>* Inworld AI launches TTS-1.5: #1 ranked text-to-speech with sub-250ms latency at half a cent per minute (<a target="_blank" href="https://x.com/inworld_ai/status/2014020677343510629">X</a>, <a target="_blank" href="https://inworld.ai/tts">Announcement</a>)</p><p>* <strong>Tools</strong></p><p>* Vercel launches <a target="_blank" href="skills.sh">skills.sh</a>, an “npm for AI agents” that hit 20K installs within hours (X, Vercel Changelog, GitHub)</p><p>* Anthropic’s Claude Code VS Code Extension Hits General Availability, Bringing Full Agentic Coding to the IDE (<a target="_blank" href="https://x.com/claudeai/status/2013704053226717347">X</a>, <a target="_blank" href="https://marketplace.visualstudio.com/items?itemName=anthropic.claude-code">VS Code Marketplace</a>, <a target="_blank" href="https://code.claude.com/docs/en/vs-code">Docs</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-22-clawdbot-deep-dive</link><guid isPermaLink="false">substack:post:185487308</guid><dc:creator><![CDATA[Alex Volkov and Dan Peguine 🐧]]></dc:creator><pubDate>Fri, 23 Jan 2026 02:56:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/185487308/09f80e159cc842fd2dfd9648be3fb714.mp3" length="70890957" type="audio/mpeg"/><itunes:author>Alex Volkov and Dan Peguine 🐧</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5907</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/185487308/b45d38d9d42bac431f8d0e461b16b410.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 15 - Agent Skills Deep Dive, GPT 5.2 Codex Builds a Browser, Claude Cowork for the Masses, and the Era of Personalized 
AI!]]></title><description><![CDATA[<p>Hey y’all, Alex here, and this week I was especially giddy to record the show! Mostly because when a thing clicks for me that hasn’t clicked before, I can’t wait to tell you all about it! </p><p>This week, that thing is Agent Skills! Currently the best way to customize your AI agents with domain expertise, in a simple, repeatable way that doesn’t blow up the context window! We mentioned skills when Anthropic first released them (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-16-veo31-haiku-45-chatgpt">Oct 16</a>) and when they became an open standard, but it didn’t really click until last week! So more on that below. </p><p>Also this week, Anthropic released a research preview of Claude Cowork, an agentic tool for non-coders, OpenAI finally let loose GPT 5.2 Codex (in the API, it was previously available only via Codex), Apple announced a deal with Gemini to power Siri, OpenAI and Anthropic both doubled down on healthcare, and much more! We had an incredible show, with an expert in Agent Skills, <a target="_blank" href="https://substack.com/profile/23802032-eleanor-berger">Eleanor Berger</a>, and the usual gang of co-hosts; I strongly recommend watching the show in addition to the newsletter! </p><p>Also, I vibe coded skills support for all LLMs into Chorus, and promised folks a link to download it, so look for that in the footer. Let’s dive in! </p><p>ThursdAI is where you stay up to date! Subscribe to keep us going! </p><p>Big Company LLMs + APIs: Cowork, Codex, and a Browser in a Week</p><p><strong>Anthropic launches Claude Cowork: Agentic AI for Non‑Coders (research preview)</strong></p><p>Anthropic announced <strong>Claude Cowork</strong>, which is basically Claude Code wrapped in a friendly UI for people who don’t want to touch a terminal. 
It’s a research preview available on the Max tier, and it gives Claude read/write access to a folder on your Mac so it can do real work without you caring about diffs, git, or the command line.</p><p>The wild bit is that Cowork was built in a week and a half, and according to the Anthropic team it was 100% written using Claude Code. This feels like a “we’ve crossed a threshold” moment. If you’re wondering why this matters, it’s because coding agents are general agents. If a model can write code to do tasks, it can do taxes, clean your desktop, or orchestrate workflows, and that means non‑developers can now access the same leverage developers have been enjoying for a year.</p><p>It also isn’t just for files—it comes with a <strong>Chrome connector</strong>, meaning it can navigate the web to gather info, download receipts, or do research, and it uses skills (more on those later).</p><p>Earlier this week I recorded this first reactions video about Cowork and I’ve been testing it ever since; it’s a very interesting approach to coding agents that “hide the coding” to just... do things. Will this become as big as Claude Code for Anthropic (which is reportedly a $1B business for them)? Let’s see! </p><p>There are real security concerns here, especially if you’re not in the habit of backing up or using git. Cowork sandboxes a folder, but it can still delete things in that folder, so don’t let it loose on your whole drive unless you like chaos.</p><p><strong>GPT‑5.2 Codex: Long‑Running Agents Are Here</strong></p><p>OpenAI finally shipped <strong>GPT‑5.2 Codex</strong> into the API! It was announced as the answer to Opus 4.5 and was previously available only in Codex. The big headline is SOTA on SWE-Bench and long‑running agentic capability. People describe it as methodical. 
It takes longer, but it’s reliable on extended tasks, especially when you let it run without micromanaging.</p><p>This model is now integrated into Cursor, GitHub Copilot, VS Code, Factory, and Vercel AI Gateway within hours of launch. It’s also state‑of‑the‑art on SWE‑Bench Pro and Terminal‑Bench 2.0, and it has native context compaction. That last part matters because if you’ve ever run an agent for long sessions, the context gets bloated and the model gets dumber. Compaction is an attempt to keep it coherent by summarizing old context into fresh threads, and we debated whether it really works. I think it helps, but I also agree that the best strategy is still to run smaller, atomic tasks with clean context.</p><p>Cursor vibe-coded a browser with GPT-5.2: 3M lines of code</p><p>The most mind‑blowing thing we discussed is Cursor letting GPT‑5.2 Codex run for a full week to build a browser called FastRenderer. This is not Chromium‑based. It’s a custom HTML parser, CSS cascade, layout engine, text shaping, paint pipeline, and even a JavaScript VM, written in Rust, from scratch. The codebase is open source on <a target="_blank" href="https://github.com/wilsonzlin/fastrender"><strong>GitHub</strong></a>, and the full story is on <a target="_blank" href="https://cursor.com/blog/scaling-agents"><strong>Cursor’s blog</strong></a>.</p><p>It took nearly 30,000 commits and millions of lines of code. The system ran hundreds of concurrent agents with a planner‑worker architecture, and GPT‑5.2 was the best model for staying on task in that long‑running regime. That’s the real story, not just “lol a model wrote a browser.” This is a stress test for long‑horizon agentic software development, and it’s a preview of how teams will ship in 2026.</p><p>I said on the show: browsers are REALLY hard. It took two decades for the industry to settle and be able to render websites normally, and there’s a reason everyone’s using Chromium. 
This is VERY impressive 👏 </p><p>Now as for me, I began using Codex again, but I still find Opus better? Not sure if this is just me expecting something that’s not there. I’ll keep you posted!</p><p>Gemini Personal Intelligence: The Data Moat King is back! </p><p>What kind of car do you drive? Does ChatGPT know that? Welp, it turns out Google does (based on your emails and Google Photos), and now Gemini can tap into this personal info (if you allow it, they are stressing privacy) and give you much more personalized answers! </p><p>Flipping this beta feature on lets Gemini reason across Gmail, YouTube, Photos, and Search with explicit opt‑in permissions, and it’s rolling out to Pro and Ultra users in the US first.</p><p>I got to try it early, and it’s uncanny. I asked Gemini what car I drive, and it told me I likely drive a Model Y, but it noticed I recently searched for a Honda Odyssey and asked if I was thinking about switching. It was kinda... freaky because I forgot I had early access and this was turned on 😂 </p><p>Pro Tip: if you’re brave enough to turn this on, ask for a complete profile on you 🙂</p><p>Now the last piece is for Gemini to become proactive, suggesting things for me based on my needs! </p><p><strong>Apple & Google: The Partnership (and Drama Corner)</strong></p><p>We touched on this in the intro, but it’s official: Apple Intelligence will be powered by Google Gemini for “world knowledge” tasks. Apple stated that after “careful evaluation,” Google provided the most capable foundation model for their… Apple foundation models. It’s confusing, I agree.</p><p>Honestly? I got excited about Apple Intelligence, but Siri is still... Siri. It’s 2026 and we are still struggling with basic intents. Hopefully, plugging Gemini into the backend changes that? </p><p><strong>In other drama:</strong> The Silicon Valley carousel continues. 
Three co-founders of Thinking Machines (Barret Zoph, Sam Schoenholz, and Luke Metz), all former OpenAI folks, have returned to the mothership (OpenAI) amid some vague tweets about “unethical conduct.” It’s never a dull week on the timeline. </p><p><strong>This Week’s Buzz: WeaveHacks 3 in SF</strong></p><p>I’ve got one thing in the Buzz corner this week, and it’s a big one. WeaveHacks 3 is back in San Francisco, <strong>January 31st - February 1st</strong>. The theme is self‑improving agents, and if you’ve been itching to build in person, this is it. We’ve got an amazing judge lineup, incredible sponsors, and a ridiculous amount of agent tooling to play with.</p><p>You can sign up here: <a target="_blank" href="https://luma.com/weavehacks3"><strong>https://luma.com/weavehacks3</strong></a></p><p>If you’re coming, note on the form that you heard about it on ThursdAI and we’ll make sure you get in! </p><p><strong>Deep Dive: Agent Skills With Eleanor Berger</strong></p><p>This was the core of the episode, and I’m still buzzing about it. We brought on Eleanor Berger, who has basically become the skill evangelist for the entire community, and she walked us through why skills are the missing layer in agentic AI.</p><p>Skills are simple markdown files with a tiny bit of metadata in a directory, together with optional scripts, references, and assets. The key idea is progressive disclosure. Instead of stuffing your entire knowledge base into the context, the model only sees a small list of skills and loads only what it needs. That means you can have hundreds of skills without blowing your context window (and making the model dumber and slower as a result). </p><p>The technical structure is dead simple, but the implications are huge. Skills create a portable, reusable, composable way to give agents domain expertise, and they now work across most major harnesses. 
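To make progressive disclosure concrete, here’s a minimal sketch in Python of how a harness might index skills. Everything here is illustrative — the `SKILL.md` filename, the frontmatter fields, and the directory layout are assumptions for the sketch, not the exact spec any particular harness uses. The point is the pattern: the context only ever holds each skill’s name and one-line description, and the full instructions are read from disk only when a skill is actually invoked.

```python
from pathlib import Path

def parse_frontmatter(text: str) -> dict:
    """Parse a minimal '---'-delimited frontmatter block (illustrative, not a real spec)."""
    meta = {}
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for line in lines[1:]:
            if line.strip() == "---":
                break  # end of frontmatter; the skill body follows
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def index_skills(skills_dir: Path) -> list[dict]:
    """Read only name/description per skill -- the tiny summary the model always sees."""
    index = []
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        meta = parse_frontmatter(skill_md.read_text())
        index.append({
            "name": meta.get("name", skill_md.parent.name),
            "description": meta.get("description", ""),
            "path": skill_md,
        })
    return index

def load_skill(index: list[dict], name: str) -> str:
    """Load the full instructions only when the model decides to use this skill."""
    for entry in index:
        if entry["name"] == name:
            return entry["path"].read_text()
    raise KeyError(name)
```

When the agent decides a task matches a skill, the harness injects the full body into the conversation; until then, only the one-line index entries cost context tokens, which is why hundreds of skills don’t blow up the window.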
That means you can build a skill once and use it in Claude, Cursor, AMP, or any other agent tool that supports the standard.</p><p>Eleanor made the point that skills are an admission that we now have general‑purpose agents. The model can do the work, but it doesn’t know your preferences, your domain, your workflows. Skills are how you teach it those things. We also talked about how scripts inside skills reduce variance because you’re not asking the model to invent code every time; you’re just invoking trusted tools.</p><p>What really clicked for me this week is how easy it is to create skills using an agent. You don’t need to hand‑craft directories. You can describe your workflow, or even just do the task once in chat, and then ask the agent to turn it into a skill. It really is very, very simple! And that’s likely the reason everyone is adopting this simple format for extending their agents’ knowledge.</p><p>Get started with skills</p><p>If you use Claude Chat, the simplest way to get started is to ask Claude to review your previous conversations and suggest a skill for you. Or, at the end of a long chat where you went back and forth with Claude on a task, ask it to distill the important parts into a skill. If you want to use other people’s skills, and you are using Claude Code, or any of the supported IDE/Agents, here’s where to download the folders and install them: </p><p>If you aren’t a developer and don’t subscribe to Claude, well, I’ve got good news for you! I vibecoded skill support for every LLM 👇</p><p><strong>The Skills Demo That Changed My Mind</strong></p><p>I was resistant to skills at first, mostly because I wanted them inside my chat interface and not just in CLI tools. And I wasn’t subscribed to Claude for a while. Then I realized I could add skill support directly to Chorus, the open‑source multi‑model chat app, and I used Claude Code plus Ralph loops to vibe code it in a few hours. 
Now I can run skills with GPT‑5.2 Codex, Claude Opus, and Gemini from the same chat interface. That was my “I know kung fu” moment.</p><p>If you want to try Chorus with skills enabled, you can download my release <a target="_blank" href="https://github.com/altryne/chorus/releases/tag/v0.14.5-skills">here</a>! Only for Mac, and the builds are unsigned, so macOS will complain, but you <a target="_blank" href="https://support.apple.com/guide/mac-help/open-a-mac-app-from-an-unknown-developer-mh40616/mac">can run them anyway.</a> </p><p>And if you want to explore more awesome skills, check out <a target="_blank" href="https://vercel.com/blog/introducing-react-best-practices"><strong>Vercel’s React Best Practices skills</strong></a> and <a target="_blank" href="https://www.ui-skills.com/"><strong>UI Skills</strong></a>. It’s the beginning of a new kind of distribution: knowledge packaged as skills, shared like open source libraries (or paid for!). </p><p>Open Source AI</p><p><strong>Baichuan-M3</strong> is a 235B medical LLM fine-tuned from Qwen3, released under Apache 2.0. The interesting claim here is that it beats GPT-5.2 on OpenAI’s HealthBench, including a remarkably low 3.5% hallucination rate. </p><p>What makes it different from typical medical models is that it’s trained to run actual clinical consultations, asking follow-up questions and reasoning through differential diagnoses, rather than just spitting out answers. Nisten pointed out that if you’re going to fine-tune something for healthcare, Qwen3 MoE is an excellent base because of its multilingual capabilities, which matters a lot in clinical settings. You can run it with vLLM or SGLang if you’ve got the hardware. (<a target="_blank" href="https://huggingface.co/baichuan-inc/Baichuan-M3-235B">HF</a>)</p><p><strong>LongCat-Flash-Thinking-2601</strong> from Meituan is a 560B MoE (27B active) released under the MIT license. 
It’s specifically built for agentic tasks, scoring well on tool-use benchmarks like τ²-Bench and BrowseComp. </p><p>There’s a “Heavy Thinking” mode that pushes AIME-25 to 100%. What I like about this one is the training philosophy: they inject noise and broken tools during RL to simulate messy real-world conditions, which is exactly what production agents deal with. You can try it at <a target="_blank" href="https://longcat.chat"><strong>longcat.chat</strong></a> and <a target="_blank" href="https://github.com/meituan-longcat/LongCat-Flash-Thinking-2601">GitHub</a>.</p><p>We also saw Google release <strong>MedGemma</strong> this week (<a target="_blank" href="https://research.google/blog/medgemma-our-most-capable-open-models-for-health-ai-development/">blog</a>), a 4B model optimized for medical imaging like X-rays and CT scans, and TranslateGemma (<a target="_blank" href="https://x.com/GoogleDeepMind/status/2011848249850630363">X</a>), a family of on-device translation models (4B, 12B, and 27B), which seem kind of cool! We didn’t have tons of time to dive into them, unfortunately. </p><p>Vision, Voice & Art (Rapid Fire)</p><p>* <strong>Veo 3.1</strong> adds native vertical video, 4K output, and better consistency in the Gemini API. 
Huge for creators (<a target="_blank" href="https://ai.google.dev/gemini-api/docs/video">blog</a>)</p><p>* Viral Kling motion‑transfer <a target="_blank" href="https://twitter.com/venturetwins/status/2011285029541077033">vids</a> are breaking people’s brains about what AI video pipelines will look like.</p><p>* <strong>Pocket TTS</strong> from Kyutai Labs: a 100M‑parameter open‑source TTS model that runs on CPU and clones voices from seconds of audio (<a target="_blank" href="https://x.com/kyutai_labs/status/2011047335892303875">X</a>)</p><p>* <strong>GLM‑Image</strong> drops as an open‑source hybrid AR + diffusion image model with genuinely excellent text rendering, but it’s pretty bad at everything else</p><p>* Black Forest Labs drops open source Flux.2 [Klein] 4B and 9B small models that create images super fast! (<a target="_blank" href="https://x.com/bfl_ml/status/2011825819082244266">X</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/flux-2/klein/9b/base">Fal</a>, <a target="_blank" href="https://huggingface.co/collections/black-forest-labs/flux2">HF</a>)</p><p>Phew, ok. I was super excited about this one and I’m really, really happy with the result. I was joking on the pod that to prepare for this podcast, I not only had to collect all the news, I also had to ramp up on Agent Skills, and I wish we had the ability to upload information like in The Matrix, but alas, we don’t. I also really enjoyed vibecoding a whole feature into Chorus just to explore skills fully; my mind was absolutely blown when it worked after 3 hours of Ralphing! </p><p>See you next week, I think I have one more super exciting thing to play with this week before I talk about it! 
</p><p>TL;DR and Show Notes</p><p>* <strong>Hosts & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co-Hosts: Wolfram Ravenwolf (<a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>), Yam Peleg (<a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>), Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten">@nisten</a>), LDJ (<a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a>)</p><p>* Guest: Eleanor Berger (<a target="_blank" href="https://x.com/intellectronica">@intellectronica</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* <strong>Baichuan-M3</strong> - A 235B open-source medical LLM that beats GPT-5.2 on HealthBench with a 3.5% hallucination rate, featuring full clinical consultation capabilities. (<a target="_blank" href="https://huggingface.co/baichuan-inc/Baichuan-M3-235B">HF</a>, <a target="_blank" href="https://www.baichuan-ai.com/">Blog</a>, <a target="_blank" href="https://x.com/poezhao0605/status/2010992177070174544">X Announcement</a>)</p><p>* <strong>LongCat-Flash-Thinking-2601</strong> - Meituan’s 560B MoE (27B active) agentic reasoning model, fully MIT licensed. Features “Heavy Thinking” mode scoring 100% on AIME-25. (<a target="_blank" href="https://github.com/meituan-longcat/LongCat-Flash-Thinking-2601">GitHub</a>, <a target="_blank" href="https://longcat.chat/">Demo</a>, <a target="_blank" href="https://x.com/Meituan_LongCat/status/2011515214521647603">X Announcement</a>)</p><p>* <strong>TranslateGemma</strong> - Google’s open translation family (4B, 12B, 27B) supporting 55 languages. The 4B model runs entirely on-device. 
(<a target="_blank" href="https://arxiv.org/abs/2601.09012">Arxiv</a>, <a target="_blank" href="https://kaggle.com/models/google/translategemma">Kaggle</a>, <a target="_blank" href="https://x.com/GoogleDeepMind/status/2011848249850630363">X Announcement</a>)</p><p>* <strong>MedGemma 1.5 & MedASR</strong> - Native 3D imaging support (CT/MRI) and a speech model that beats Whisper v3 by 82% on clinical dictation error rates. (<a target="_blank" href="https://huggingface.co/google/medgemma-4b-it">MedGemma HF</a>, <a target="_blank" href="https://huggingface.co/google/medasr">MedASR HF</a>, <a target="_blank" href="https://arxiv.org/abs/2507.05201">Arxiv</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* <strong>Claude Cowork</strong> - Anthropic’s new desktop agent allows non-coders to give Claude file system and browser access to perform complex tasks. (<a target="_blank" href="https://techcrunch.com/2026/01/12/anthropics-new-cowork-tool-offers-claude-code-without-the-code/">TechCrunch</a>, <a target="_blank" href="https://x.com/TechCrunch/status/2010797583569150008">X Coverage</a>)</p><p>* <strong>GPT-5.2 Codex</strong> - Now in the API ($1.75/1M input). Features native context compaction and state-of-the-art performance for long-running agentic loops. (<a target="_blank" href="https://openai.com/index/gpt-5-2-codex/">Blog</a>, <a target="_blank" href="https://openai.com/api/pricing/">Pricing</a>)</p><p>* <strong>Cursor & FastRenderer</strong> - Cursor used GPT-5.2 Codex to build a 3M+ line Rust browser from scratch in one week of autonomous coding. 
(<a target="_blank" href="https://cursor.com/blog/scaling-agents">Blog</a>, <a target="_blank" href="https://github.com/wilsonzlin/fastrender">GitHub</a>, <a target="_blank" href="https://x.com/mntruell/status/2011562190286045552">X Thread</a>)</p><p>* <strong>Gemini Personal Intelligence</strong> - Google leverages its data moat, letting Gemini reason across Gmail, Photos, and Search for hyper-personalized proactive help. (<a target="_blank" href="https://blog.google/products/gemini/">Blog</a>, <a target="_blank" href="https://x.com/GoogleAI/status/2011545972724425183">X Announcement</a>)</p><p>* <strong>Partnerships & Drama</strong></p><p>* <strong>Apple + Gemini</strong> - Apple officially selects Gemini to power Siri backend capabilities.</p><p>* <strong>OpenAI + Cerebras</strong> - A $10B deal for 750MW of high-speed compute through 2028. (<a target="_blank" href="https://openai.com/index/cerebras-partnership/">Announcement</a>)</p><p>* <strong>Thinking Machines</strong> - Co-founders and CTO return to OpenAI amidst drama; Soumith Chintala named new CTO.</p><p>* <strong>This Week’s Buzz</strong></p><p>* <strong>WeaveHacks 3</strong> - Self-Improving Agents Hackathon in SF (Jan 31-Feb 1). (<a target="_blank" href="https://luma.com/weavehacks3">Sign Up Here</a>)</p><p>* <strong>Vision, Voice & Audio</strong></p><p>* <strong>Veo 3.1</strong> - Native 9:16 vertical video, 4K resolution, and reference image support in Gemini API. (<a target="_blank" href="https://ai.google.dev/gemini-api/docs/video">Docs</a>)</p><p>* <strong>Pocket TTS</strong> - A 100M parameter CPU-only model from Kyutai Labs that clones voices from 5s of audio. (<a target="_blank" href="https://github.com/kyutai-labs/pocket-tts">GitHub</a>, <a target="_blank" href="https://huggingface.co/kyutai/pocket-tts">HF</a>)</p><p>* <strong>GLM-Image</strong> - Hybrid AR + Diffusion model with SOTA text rendering. 
(<a target="_blank" href="https://huggingface.co/zai-org/GLM-Image">HF</a>, <a target="_blank" href="https://github.com/zai-org/GLM-Image">GitHub</a>)</p><p>* <strong>FLUX.2 [klein]</strong> - Black Forest Labs releases fast 4B (Apache 2.0) and 9B models for sub-second image gen. (<a target="_blank" href="https://huggingface.co/black-forest-labs/FLUX.2-klein-4B">HF Collection</a>, <a target="_blank" href="https://x.com/bfl_ml/status/2011825819082244266">X Announcement</a>)</p><p>* <strong>Kling Motion Transfer</strong> - Viral example of AI video pipelines changing Hollywood workflows. (<a target="_blank" href="https://twitter.com/venturetwins/status/2011285029541077033">X Thread</a>)</p><p>* <strong>Deep Dive: Agent Skills</strong></p><p>* <strong>Vercel React Best Practices</strong> - Pre-packaged skills for agents. (<a target="_blank" href="https://vercel.com/blog/introducing-react-best-practices">Blog</a>)</p><p>* <strong>UI Skills</strong> - Documentation and skill standards. (<a target="_blank" href="https://www.ui-skills.com/">Docs</a>)</p><p>* <strong>Chorus with Skills</strong> - My fork of Chorus enabling skills for all LLMs. (<a target="_blank" href="https://github.com/altryne/chorus/releases/tag/v0.14.5-skills">Release</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-15-agent-skills-deep</link><guid isPermaLink="false">substack:post:184718759</guid><dc:creator><![CDATA[Alex Volkov and Eleanor Berger]]></dc:creator><pubDate>Fri, 16 Jan 2026 02:17:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/184718759/f3d2711faee5edb28749ea1cf08b6d8e.mp3" length="96998682" type="audio/mpeg"/><itunes:author>Alex Volkov and Eleanor Berger</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6062</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/184718759/5a5bc49bfd616212ac5dd8f06e748ce1.jpg"/></item><item><title><![CDATA[ThursdAI - Jan 8 - Vera Rubin's 5x Jump, Ralph Wiggum Goes Viral, GPT Health Launches & XAI Raises $20B Mid-Controversy]]></title><description><![CDATA[<p>Hey folks, Alex here from Weights & Biases, with your weekly AI update (and a first live show of this year!) </p><p>For the first time, we had a co-host of the show also be a guest on the show, Ryan Carson (from Amp) went supernova viral this week with an X article (1.5M views) about Ralph Wiggum (yeah, from Simpsons) and he broke down that agentic coding technique at the end of the show. </p><p>LDJ and Nisten helped cover NVIDIA’s incredible announcements during CES with their Vera Rubin upcoming platform (4-5X improvements) and we all got excited about AI medicine with ChatGPT going into Health officially! </p><p>Plus, a bunch of Open Source news, let’s get into this: </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Open Source: The “Small” Models Are Winning</p><p>We often talk about the massive frontier models, but this week, open source news came largely from unexpected places and focused on efficiency, agents, and specific domains.</p><p>Solar Open 100B: A Data Masterclass</p><p>Upstage released Solar Open 100B, and it’s a beast. It’s a 102B parameter Mixture-of-Experts (MoE) model, but thanks to MoE magic, it only uses about 12B active parameters during inference. This means it punches well above its weight while running fast.</p><p>What I really appreciated here wasn’t just the weights, but the transparency. They released a technical report detailing their “Data Factory” approach. They trained on nearly 20 trillion tokens, with a huge chunk being synthetic. They also used a dynamic curriculum that adjusted the difficulty and the ratio of synthetic data as training progressed. This transparency is what pushes the whole open source community forward.</p><p>Technically, it hits 88.2 on MMLU and competes with top-tier models, especially in Korean language tasks. You can grab it on <a target="_blank" href="https://huggingface.co/upstage/Solar-Open-100B">Hugging Face</a>.</p><p>MiroThinker 1.5: The DeepSeek Moment for Agents?</p><p>We also saw MiroThinker 1.5, a 30B parameter model that is challenging the notion that you need massive scale to be smart. It uses something they call “Interactive Scaling.”</p><p>Wolfram broke this down for us: this agent forms hypotheses, searches for evidence, and then iteratively revises its answers in a time-sensitive sandbox. It effectively “thinks” before answering. The result? It beats trillion-parameter models on search benchmarks like BrowseComp. It’s significantly cheaper to run, too. 
This feels like the year where smaller models + clever harnesses (harnesses are the software wrapping the model) will outperform raw scale.</p><p>Liquid AI LFM 2.5: Running on Toasters (Almost)</p><p>We love Liquid AI and they are great friends of the show. They announced LFM 2.5 at CES with AMD, and these are tiny ~1B parameter models designed to run on-device. We’re talking about running capable AI on your laptop, your phone, or edge devices (or the Reachy Mini bot that I showed off during the show! I gotta try and run LFM on him!)</p><p>Probably the coolest part is the audio model. Usually, talking to an AI involves a pipeline: Speech-to-Text (ASR) -> LLM -> Text-to-Speech (TTS). Liquid’s model is end-to-end. It hears audio and speaks audio directly. We watched a demo from Maxime Labonne where the model was doing real-time interaction, interleaving text and audio. It’s incredibly fast and efficient. While it might not write a symphony for you, for on-device tasks like summarization or quick interactions, this is the future.</p><p>NousCoder-14B and Zhipu AI IPO</p><p>A quick shoutout to our friends at Nous Research who released NousCoder-14B, an open-source competitive programming model that achieved a 7% jump on LiveCodeBench accuracy in just four days of RL training on 48 NVIDIA B200 GPUs. The model was trained on 24,000 verifiable problems, and the lead researcher Joe Li noted it achieved in 4 days what took him 2 years as a teenager competing in programming contests. The full RL stack is open-sourced on<strong> </strong><a target="_blank" href="https://github.com/NousResearch/atropos/pull/296"><strong>GitHub</strong></a> and Nous published a great <a target="_blank" href="https://api.wandb.ai/links/jli505/ksz3e9w6">WandB results page as well</a>! </p><p>And in historic news, Zhipu AI (Z.ai)—the folks behind the GLM series—became the world’s first major LLM company to IPO, raising $558 million on the Hong Kong Stock Exchange. 
Their GLM-4.7 currently ranks #1 among open-source and domestic models on both Artificial Analysis and LM Arena. Congrats to them!</p><p>Big Companies & APIs</p><p>NVIDIA CES: Vera Rubin Changes Everything</p><p>LDJ brought the heat on this one, covering Jensen’s CES keynote that unveiled the Vera Rubin platform, and the numbers are almost hard to believe. We’re talking about a complete redesign of six chips: the Rubin GPU delivering 50 petaFLOPS of AI inference (5x Blackwell), the Vera CPU with 88 custom Olympus ARM cores, NVLink 6, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-6 Ethernet.</p><p>Let me put this in perspective using LDJ’s breakdown: if you look at FP8 performance, the jump from Hopper to Blackwell was about 5x. The jump from Blackwell to Vera Rubin is over 3x again—but here’s the kicker—while only adding about 200 watts of power draw. That’s an insane efficiency improvement.</p><p>The real-world implications Jensen shared: training a 10 trillion parameter mixture-of-experts model now requires 75% fewer GPUs compared to Blackwell. Inference token costs drop roughly 10x—a 1MW cluster goes from 1 million to 10 million tokens per second at the same power. HBM4 memory delivers 22 TB/s bandwidth with 288GB capacity, exceeding NVIDIA’s own 2024 projections by nearly 70%.</p><p>As Ryan noted, this is why it’s hilarious when people say there’s an AI bubble. Jensen keeps saying the need for inference is unbelievable and only going up exponentially. We all see this. I can’t get enough inference—I want to spin up 10 Ralphs running concurrently! The NVL72 rack-scale system achieves 3.6 exaFLOPS inference with 20.7TB total HBM, and it’s already shipping. 
Runway 4.5 is already running on the new platform, having ported their model from Hopper to Vera Rubin NVL72 in a single day.</p><p>NVIDIA also recently <a target="_blank" href="https://nvidianews.nvidia.com/news">acqui-hired Groq</a> (with a Q) in a ~$20 billion deal, bringing the inference chip expertise of the guy who created Google’s TPUs in-house.</p><p>Nemotron Speech ASR & The Speed of Voice (<a target="_blank" href="https://x.com/NVIDIAAIDev/status/2008654492204441862">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b">HF</a>,  <a target="_blank" href="https://huggingface.co/blog/nvidia/nemotron-speech-asr-scaling-voice-agents">Blog</a>)</p><p>NVIDIA also dropped Nemotron Speech ASR. This is a 600M parameter model that offers streaming transcription with 24ms latency.</p><p>We showed a demo from our friend Kwindla Kramer at Daily. He was talking to an AI, and the response was virtually instant. The pipeline is: Nemotron (hearing) -> Llama/Nemotron Nano (thinking) -> Magpie TTS (speaking). The total latency is under 500ms. It feels like magic. Instant voice agents are going to be everywhere this year.</p><p>XAI Raises $20B While Grok Causes Problems (Again)</p><p>So here’s the thing about covering anything Elon-related: it’s impossible to separate signal from noise because there’s an army of fans who hype everything and an army of critics who hate everything. But let me try to be objective here.</p><p>XAI raised another massive $20 billion Series E at a $230 billion valuation, with NVIDIA and Cisco as strategic investors. The speed of their infrastructure buildout is genuinely incredible. Grok’s voice mode is impressive. I use Grok for research and it’s really good, notable for its unprecedented access to X!</p><p>But. 
This raise happened in the middle of a controversy where Grok’s image model was being used to “put bikinis” on anyone in reply threads, including—and this is where I draw a hard line—minors. As Nisten pointed out on the show, it’s not even hard to implement guardrails. You just put a 2B VL model in front and ask “is there a minor in this picture?” But people tested it, asked Grok not to use the feature, and it did it anyway. And yeah, putting a bikini on Claude is funny, but basic moderation is lacking! </p><p>The response of “we’ll prosecute illegal users” is stupid when there’s no moderation built into the product. There’s an enormous difference between Photoshop technically being able to do something after hours of work, and a feature that generates edited images in one second as the first comment to a celebrity, then gets amplified by the platform’s algorithm to millions of people. One is a tool. The other is a product with amplification mechanics. Products need guardrails. I don’t often link to CNN (in fact this is the first time) but they have a great writeup about the whole incident <a target="_blank" href="https://www.cnn.com/2026/01/08/tech/elon-musk-xai-digital-undressing">here</a>, which apparently covers a few trust and safety folks quitting and Elon’s pushback on guardrails. Crazy.</p><p>That said, Grok 5 is in training and XAI continues to ship impressive technology. I just wish they’d put the same engineering effort into safety as they do into capabilities!</p><p>OpenAI Launches GPT Health</p><p>This one’s exciting. OpenAI’s CEO of Applications Fidji Simo announced <strong>ChatGPT Health</strong>, a privacy-first space for personalized health conversations that can connect to electronic health records, Apple Health, Function Health, Peloton, and MyFitnessPal.</p><p>Here’s why this matters: health already represents about 5% of all ChatGPT messages globally and touches 25% of weekly active users—often outside clinic hours or in underserved areas. 
People are already using these models for health advice constantly.</p><p>Nisten, who has worked on AI doctors since the GPT-3 days and even published papers on on-device medical AI, gave us some perspective: the models have been fantastic for health stuff for two years now. The key insight is that medical data seems like a lot, but there are really only about 2,000 prescription drugs and 2,000 diseases (10,000 if you count rare ones). That’s nothing for an LLM. The models excel at pattern recognition across this relatively contained dataset.</p><p>The integration with Function Health is particularly interesting to me. Function does 160+ lab tests, but many doctors won’t interpret them because they didn’t order them. ChatGPT could help bridge that gap, telling you “hey, this biomarker looks off, you should discuss this with your doctor.” The bad news is that for now this is waitlist-only. You can add yourself to the waitlist <a target="_blank" href="https://chatgpt.com/health/waitlist?openaicom_referred=true">here</a>, and we’ll keep monitoring the situation and let you know when it opens up.</p><p><strong>Doctronic: AI Prescribing Without Physician Oversight</strong></p><p>Speaking of healthcare, Doctronic launched a pilot in Utah where AI can autonomously renew prescriptions for chronic conditions without any physician in the loop. The system covers about 190 routine medications (excluding controlled substances) at just $4 per renewal. Trial data showed 99.2% concordance with physician treatment plans, and they’ve secured pioneering malpractice insurance that treats the AI like a clinician.</p><p>Nisten made the case that it’s ethically wrong to delay this kind of automation when ER wait times keep increasing and doctors are overworked. The open source models are already excellent at medical tasks. Governments should be buying GPUs rather than creating administrative roadblocks. Strong, strong agree here! 
</p><p>Google Brings Gmail into the Gemini Era (<a target="_blank" href="https://x.com/Google/status/2009265269382742346">X</a>)</p><p>Breaking news from the day of our show: <strong>Google announced </strong>Gmail’s biggest AI transformation since its 2004 launch, powered by Gemini 3. This brings AI Overviews that summarize email threads, natural language queries (“Who gave me a plumber quote last year?”), Help Me Write, contextual Suggested Replies matching your writing style, and the upcoming AI Inbox that filters noise to surface VIPs and urgent items.</p><p>For 3 billion Gmail users, this is huge. I’m very excited to test it—though not live on the show because I don’t want you reading my emails.</p><p>This week’s buzz - covering Weights & Biases updates</p><p>Not covered on the show, but here’s a great update from WandB: Chris Van Pelt (<a target="_blank" href="https://x.com/vanpelt/status/2009316107212235117">@vanpelt</a>), one of the 3 co-founders, released a great project I wanted to tell you about! For coders, this is an app that allows you to run multiple Claude Codes on free GitHub sandboxes, so you can code (or Ralph) and control everything away from home! </p><p>GitHub gives personal users 120 free Codespaces hours/month, and Catnip automatically shuts down inactive instances, so you can code for quite a while! </p><p>It’s fully open source on <a target="_blank" href="https://github.com/wandb/catnip">GitHub</a> and you can download the app <a target="_blank" href="https://apps.apple.com/us/app/w-b-catnip/id6755161660">here</a>.</p><p>Interview: Ryan Carson - What the hell is Ralph Wiggum?</p><p>Okay, let’s talk about the character everyone is seeing on their timeline: Ralph Wiggum. My co-host Ryan Carson went viral this week with an article about this technique, and I had to have him break it down.</p><p>Ralph isn’t a new model; it’s a technique for running agents in a loop to perform autonomous coding. 
The core idea is deceptively simple: Ralph is a bash script that runs an AI coding agent in a loop until a certain condition is met. But why is it blowing up? </p><p>Normally when you use a coding agent like Cursor, Claude Code, or AMP, you need to be in the loop. You approve changes, look at code, fix things when the agent hits walls or runs out of context. Ralph solves this by letting the agent run autonomously while you sleep.</p><p>Here’s how it works: First, you write a Product Requirements Doc (PRD) by talking to your agent for a few minutes about what you want to build. Then you convert that PRD into a JSON file containing atomic user stories with clear acceptance criteria. Each user story is small enough for the agent to complete in one focused thread.</p><p>The Ralph script then loops: it picks the first incomplete user story, the agent writes code to implement it, tests against the acceptance criteria, commits the changes, marks the story as complete, writes what it learned to a shared “agents.md” file, and loops to the next story. That compound learning step is crucial—without it, the agent would keep making the same mistakes.</p><p>What makes this work is the pre-work. As Ryan put it, “no real work is done one-shot.” This is how software engineering has always worked—you break big problems down into smaller problems and user stories, and solve them incrementally. The innovation is letting AI agents work through that queue autonomously while you sleep! Ryan’s excellent (and viral) X article is <a target="_blank" href="https://x.com/ryancarson/status/2008548371712135632?s=20">here</a>! </p><p>Vision & Video</p><p>LTX-2 Goes Fully Open Source (<a target="_blank" href="https://huggingface.co/Lightricks/LTX-Video">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2601.03233">Paper</a>)</p><p>Lightricks finally open-sourced <strong>LTX-2</strong>, marking a major milestone as the first fully open audio-video generation model. 
This isn’t just “we released the weights” open—it’s complete model weights (13B and 2B variants), distilled versions, controllable LoRAs, a full multimodal trainer, benchmarks, and evaluation scripts. And for a video model aiming to be the open-source Sora, it supports audio and lipsync.</p><p>The model generates synchronized audio and video in a single DiT-based architecture—motion, dialogue, ambience, and music flow simultaneously. Native 4K at up to 50 FPS with audio up to 10 seconds. And there’s also a distilled version (Thanks Pruna AI!) hosted on <a target="_blank" href="https://replicate.com/lightricks/ltx-2-distilled">Replicate</a>.</p><p>ComfyUI provided day-0 native support, and community testing shows an A6000 generating 1280x720 at 120 frames in 50 seconds. This is near Sora-level quality that you can fine-tune on your own data for custom styles and voices in about an hour.</p><p>What a way to start 2026. From chips that are 5x faster to AI doctors prescribing meds in Utah, the pace is only accelerating. If anyone tells you we’re in an AI bubble, just show them what we covered today. Even if the models stopped improving tomorrow, techniques like “Ralph” prove we have years of work ahead of us just figuring out how to use the intelligence we already have.</p><p>Thank you for being a ThursdAI subscriber. 
See you next week!</p><p>As always, here’s the show notes and TL;DR links: </p><p>* <strong>Hosts & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>,  <a target="_blank" href="https://x.com/nisten">@nisten</a>, <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Special Guest</strong> - Ryan Carson (<a target="_blank" href="https://x.com/ryancarson">@ryancarson</a>) breaking down the Ralph Wiggum technique.</p><p>* <strong>Open Source LLMs</strong></p><p>* <strong>Solar Open 100B</strong> - Upstage’s 102B MoE model. Trained on 19.7T tokens with a heavy focus on “data factory” synthetic data and high-performance Korean reasoning (<a target="_blank" href="https://x.com/kchonyc/status/2008191520881639504">X</a>, <a target="_blank" href="https://huggingface.co/upstage/Solar-Open-100B">HF</a>, <a target="_blank" href="https://t.co/TN8uPdHkNt">Tech Report</a>).</p><p>* <strong>MiroThinker 1.5</strong> - A 30B parameter search agent that uses “Interactive Scaling” to beat trillion-parameter models on search benchmarks like BrowseComp (<a target="_blank" href="https://x.com/miromind_ai/status/2008728943994826773">X</a>, <a target="_blank" href="https://huggingface.co/miromind-ai/MiroThinker-v1.5-30B">HF</a>, <a target="_blank" href="https://github.com/MiroMindAI/MiroThinker">GitHub</a>).</p><p>* <strong>Liquid AI LFM 2.5</strong> - A family of 1B models designed for edge devices. 
Features a revolutionary end-to-end audio model that skips the ASR-LLM-TTS pipeline (<a target="_blank" href="https://x.com/liquidai/status/2008385292244242942">X</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct">HF</a>).</p><p>* <strong>NousCoder-14B</strong> - competitive coding model from Nous Research that saw a 7% LiveCodeBench accuracy jump in just 4 days of RL (<a target="_blank" href="https://x.com/NousResearch/status/2008624474237923495">X</a>, <a target="_blank" href="https://api.wandb.ai/links/jli505/ksz3e9w6">WandB Dashboard</a>).</p><p>* <strong>Zhipu AI IPO</strong> - The makers of GLM became the first major LLM firm to go public on the HKEX, raising $558M (<a target="_blank" href="https://www.zhipuai.cn/en/">Announcement</a>).</p><p>* <strong>Big Co LLMs & APIs</strong></p><p>* <strong>NVIDIA Vera Rubin</strong> - Jensen Huang’s CES reveal of the next-gen platform. Delivers 5x Blackwell inference performance and 75% fewer GPUs needed for MoE training (<a target="_blank" href="https://nvidianews.nvidia.com/news">Blog</a>).</p><p>* <strong>OpenAI ChatGPT Health</strong> - A privacy-first vertical for EHR and fitness data integration (<a target="_blank" href="https://chatgpt.com/health/waitlist">Waitlist</a>).</p><p>* <strong>Google Gmail Era</strong> - Gemini 3 integration into Gmail for 3 billion users, featuring AI Overviews and natural language inbox search (<a target="_blank" href="https://blog.google/products-and-platforms/products/gmail/gmail-is-entering-the-gemini-era/">Blog</a>).</p><p>* <strong>XAI $20B Raise</strong> - Elon’s XAI raises Series E at a $230B valuation, even as Grok faces heat over bikini-gate and safety guardrails (<a target="_blank" href="https://www.cnn.com/2026/01/08/tech/elon-musk-xai-digital-undressing">CNN Report</a>).</p><p>* <strong>Doctronic</strong> - The first US pilot in Utah for autonomous AI prescription renewals without a physician in the loop (<a target="_blank" 
href="https://doctronic.ai/">Web</a>).</p><p>* <strong>Alexa+ Web</strong> - Amazon brings the “Smart Alexa” experience to browser-based chat (<a target="_blank" href="https://alexa.amazon.com/about">Announcement</a>).</p><p>* <strong>Autonomous Coding & Tools</strong></p><p>* <strong>Ralph Wiggum</strong> - The agentic loop technique for autonomous coding using small, atomic user stories. Ryan Carson’s breakdown of why this is the death of “vibe coding” (<a target="_blank" href="https://x.com/ryancarson/status/2008548371712135632">Viral X Article</a>).</p><p>* <strong>Catnip by W&B</strong> - Chris Van Pelt’s open-source iOS app to run Claude Code anywhere via GitHub Codespaces (<a target="_blank" href="https://apps.apple.com/us/app/w-b-catnip/id6755161660">App Store</a>, <a target="_blank" href="https://github.com/wandb/catnip">GitHub</a>).</p><p>* <strong>Vision & Video</strong></p><p>* <strong>LTX-2</strong> - Lightricks open-sources the first truly open audio-video generation model with synchronized output and full training code (<a target="_blank" href="https://github.com/Lightricks/LTX-Video">GitHub</a>, <a target="_blank" href="https://replicate.com/lightricks/ltx-2-distilled">Replicate Demo</a>).</p><p>* <strong>Avatar Forcing</strong> - KAIST’s framework for real-time interactive talking heads with ~500ms latency (<a target="_blank" href="https://arxiv.org/abs/2601.00664">Arxiv</a>).</p><p>* <strong>Qwen Edit 2512</strong> - Optimized by PrunaAI to generate high-res realistic images in under 7 seconds (<a target="_blank" href="https://replicate.com/p/qwen-edit-2512">Replicate</a>).</p><p>* <strong>Voice & Audio</strong></p><p>* <strong>Nemotron Speech ASR</strong> - NVIDIA’s 600M parameter streaming model with sub-100ms stable latency for massive-scale voice agents (<a target="_blank" href="https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b">HF</a>).</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-8-nvidias-vera-rubin</link><guid isPermaLink="false">substack:post:183956296</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 08 Jan 2026 23:10:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/183956296/b1af7c2151ba30e29f969e75038269dc.mp3" length="102680545" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6417</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/183956296/375623e8cba29d1ce58d6bdcbf1331aa.jpg"/></item><item><title><![CDATA[ThursdAI - Jan 1 2026 - Will Brown Interview + Nvidia buys Groq, Meta buys Manus, Qwen Image 2412 & Alex New Year greetings]]></title><description><![CDATA[<p>Hey all, </p><p>Happy new year! This is Alex, writing to you at the very fresh start of this year, it’s 2026 already, can you believe it? </p><p>There was no live stream today, I figured the co-hosts deserve a break and honestly it was a very slow week. Even the Chinese labs, who don’t really celebrate X-mas and New Year’s, didn’t come out with a banger AFAIK. </p><p><p>ThursdAI - AI moves fast, we’re here to make sure you never miss a thing! Subscribe :) </p></p><p>Tho I thought it was an incredible opportunity to finally post the Will Brown interview I recorded in November during the AI Engineer conference. </p><p>Will is a researcher at Prime Intellect (big fans of WandB btw!) and is well known on X as a hot-takes ML person, often going viral for tons of memes! 
</p><p>Will is the creator and maintainer of the Verifiers library (<a target="_blank" href="https://github.com/PrimeIntellect-ai/verifiers">Github</a>) and his talk at AI Engineer was all about RL Environments (what they are, you can hear in the interview, I asked him!) </p><p>TL;DR last week of 2025 in AI</p><p>Besides this, my job here is to keep you up to date, and honestly that was very easy this week, as… almost nothing has happened, but here we go: </p><p>Meta buys Manus</p><p>The year ended with 2 huge acquisitions / acqui-hires. First we got the news from Alex Wang that Meta has bought Manus.ai, an agentic AI startup we covered back in <a target="_blank" href="https://sub.thursdai.news/i/159016903/tools-manus-ai-agent-google-ai-studio-youtube-links-and-cursor-embeddings">March</a>, for an undisclosed amount (folks claim $2-3B). </p><p>The most interesting thing here is that Manus is a Chinese company, and this deal requires a very specific separation from its Chinese operations.</p><p>Jensen goes on a New Year’s spending spree, Nvidia buys Groq (not GROK) for $20B</p><p>Groq, which we covered often here (and who are great friends), is going to NVIDIA in a… very interesting acqui-hire: a “non-binding license” plus most of Groq’s top employees apparently going to NVIDIA. Jonathan Ross, the CEO of Groq, was the co-creator of the TPU chips at Google before founding Groq, so this seems like a very strategic acqui-hire for NVIDIA! Congrats to our friends from Groq on this amazing news for the new year! 
</p><p>Tencent open-sources HY-MT1.5 translation models with 1.8B edge-deployable and 7B cloud variants supporting 33 languages (<a target="_blank" href="https://x.com/TencentHunyuan/status/2005908069239447988">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HY-MT1.5-1.8B">HF</a>, <a target="_blank" href="https://huggingface.co/tencent/HY-MT1.5-7B">HF</a>, <a target="_blank" href="https://github.com/Tencent-Hunyuan/HY-MT">GitHub</a>)</p><p>It seems that everyone is trying to de-throne Whisper, and this latest attempt from Tencent is an interesting one: 1.8B and 7B translation models with very interesting stats. </p><p>Alibaba’s Qwen-Image-2512 drops on New Year’s Eve as strongest open-source text-to-image model, topping AI Arena with photorealistic humans and sharper textures (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/2006294325240668255">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen-Image-2512">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2508.02324">Arxiv</a>)</p><p>Our friends at Tongyi decided to give us a New Year’s present in the form of an updated Qwen-Image, with much improved realism.</p><p>That’s it folks, this was a quick one, hopefully you all had an amazing new year celebration, and are gearing up for an eventful and crazy 2026. </p><p>I wish you all happiness, excitement and energy to keep up with everything in the new year, and will make sure that we’re here to keep you up to date as always! </p><p></p><p>P.S - I got a little news of my own yesterday, not related to AI. She said yes 🎉 </p><p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-1-2026-will-brown-interview</link><guid isPermaLink="false">substack:post:183167670</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 01 Jan 2026 21:29:08 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/183167670/0eee7930673fc6153a3ae773ac62c0e7.mp3" length="21388507" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1782</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/183167670/9988427821c4d5a8a0dd689084d3fe72.jpg"/></item><item><title><![CDATA[🔥 Someone Trained an LLM in Space This Year (And 50 Other Things You Missed)- ThursdAI yearly recap is here!]]></title><description><![CDATA[<p>Ho Ho Ho, Alex here! (a real human writing these words, this needs to be said in 2025) </p><p>Merry Christmas (to those who celebrate) and welcome to the very special yearly ThursdAI recap! This was an intense year in the world of AI, and after 51 weekly episodes (this is episode 52!) we have the ultimate record of all the major and most important AI releases of this year! </p><p>So instead of bringing you a weekly update (it’s been a slow week so far, most AI labs are taking a well-deserved break, and the Chinese AI labs haven’t yet surprised anyone), I’m dropping a comprehensive yearly AI review! Quarter by quarter, month by month, both in written form and as a pod/video! </p><p>Why do this? Who even needs this? Isn’t most of it obsolete? I have asked myself this exact question while prepping for the show (it was quite a lot of prep, even with Opus’s help). 
I eventually landed on: hey, if nothing else, this will serve as a record of the insane year of AI progress we all witnessed. Can you imagine that the term <strong>Vibe Coding</strong> is less than 1 year old? That Claude Code was released at the start of THIS year? </p><p>We hedonically adapt to new AI goodies so quickly, and I figured this will serve as a point-in-time check we can come back to and feel the acceleration! </p><p>With that, let’s dive in - P.S. the content below is mostly authored by my co-author for this, Opus 4.5 high, which at the end of 2025 I find to be the best creative writer, with the best long-context coherence, that can imitate my voice and tone (hey, I’m also on a break! 🎅) </p><p>“Open source AI has never been as hot as this quarter. We’re accelerating as f*ck, and it’s only just beginning—hold on to your butts.” — Alex Volkov, ThursdAI Q1 2025</p><p>🏆 The Big Picture — 2025 - The Year the AI Agents Became Real</p><p>Looking back at 51 episodes and 12 months of relentless AI progress, several mega-themes emerged:</p><p>1. 🧠 Reasoning Models Changed Everything</p><p>From DeepSeek R1 in January to GPT-5.2 in December, reasoning became the defining capability. Models now think for hours, call tools mid-thought, and score perfectly on math olympiads.</p><p>2. 🤖 2025 Was Actually the Year of Agents</p><p>We said it in January, and it came true. Claude Code launched the CLI revolution, MCP became the universal protocol, and by December we had ChatGPT Apps, Atlas browser, and AgentKit.</p><p>3. 🇨🇳 Chinese Labs Dominated Open Source</p><p>DeepSeek, Qwen, MiniMax, Kimi, ByteDance — despite chip restrictions, Chinese labs released the best open weights models all year. Qwen 3, Kimi K2, DeepSeek V3.2 were defining releases.</p><p>4. 🎬 We Crossed the Uncanny Valley</p><p>VEO3’s native audio, Suno V5’s indistinguishable music, Sora 2’s social platform — 2025 was the year AI-generated media became indistinguishable from human-created content.</p><p>5. 
💰 The Investment Scale Became Absurd</p><p>$500B Stargate, $1.4T compute obligations, $183B valuations, $100-300M researcher packages, LLMs training in space. The numbers stopped making sense.</p><p>6. 🏆 Google Made a Comeback</p><p>After years of “catching up,” Google delivered Gemini 3, Antigravity, Nano Banana Pro, VEO3, and took the #1 spot (briefly). Don’t bet against Google.</p><p>By the Numbers</p><p>Q1 2025 — The Quarter That Changed Everything</p><p><em>DeepSeek R1 crashed NVIDIA’s stock, reasoning models went mainstream, and Chinese labs took over open source. The quarter that proved AI isn’t slowing down—it’s just getting started.</em></p><p><strong>Key Themes:</strong></p><p>* 🧠 Reasoning models went mainstream (DeepSeek R1, o1, QwQ)</p><p>* 🇨🇳 Chinese labs dominated open source (DeepSeek, Alibaba, MiniMax, ByteDance)</p><p>* 🤖 2025 declared “The Year of Agents” (OpenAI Operator, MCP won)</p><p>* 🖼️ Image generation revolution (GPT-4o native image gen, Ghibli-mania)</p><p>* 💰 Massive infrastructure investment (Project Stargate $500B)</p><p>January — DeepSeek Shakes the World</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-2-is-25-the-year-of">Jan 02</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-9th-nvidias-tiny-supercomputer">Jan 10</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-16-2025-hailuo-4m-context">Jan 17</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-23-2025-deepseek-r1">Jan 24</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-30-deepseek-vs-nasdaq">Jan 30</a>)</p><p>The earthquake that shattered the AI bubble. 
<strong>DeepSeek R1</strong> dropped on January 23rd and became the most impactful open source release ever:</p><p>* <strong>Crashed NVIDIA stock 17%</strong> — $560B loss, largest single-company monetary loss in history</p><p>* Hit <strong>#1 on the iOS App Store</strong></p><p>* Cost allegedly only <strong>$5.5M to train</strong> (sparking massive debate)</p><p>* Matched OpenAI’s o1 on reasoning benchmarks at <strong>50x cheaper pricing</strong></p><p>* The <strong>1.5B model beat GPT-4o and Claude 3.5 Sonnet</strong> on math benchmarks 🤯</p><p>“My mom knows about DeepSeek—your grandma probably knows about it, too” — Alex Volkov</p><p><strong>Also this month:</strong></p><p>* <strong>OpenAI Operator</strong> — First agentic ChatGPT (browser control, booking, ordering)</p><p>* <strong>Project Stargate</strong> — $500B AI infrastructure (Manhattan Project for AI)</p><p>* <strong>NVIDIA Project Digits</strong> — $3,000 desktop that runs 200B parameter models</p><p>* <strong>Kokoro TTS</strong> — 82M param model hit #1 on TTS Arena, Apache 2, runs in browser</p><p>* <strong>MiniMax-01</strong> — 4M context window from Hailuo</p><p>* <strong>Gemini Flash Thinking</strong> — 1M token context with thinking traces</p><p>February — Reasoning Mania & The Birth of Vibe Coding</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-6-openai-deepresearch">Feb 07</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-13-my-personal-rogue">Feb 13</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-20-live-from-ai-eng">Feb 20</a> | <a target="_blank" href="https://sub.thursdai.news/p/feb-27-2025-gpt-45-drops-today-claude">Feb 28</a>)</p><p>The month that redefined how we work with AI.</p><p><strong>OpenAI Deep Research</strong> (Feb 6) — An agentic research tool that scored <strong>26.6% on Humanity’s Last Exam</strong> (vs 10% for o1/R1). Dr. 
Derya Unutmaz called it “a phenomenal 25-page patent application that would’ve cost $10,000+.”</p><p><strong>Claude 3.7 Sonnet & Claude Code</strong> (Feb 24-27) — Anthropic’s coding beast hit <strong>70% on SWE-Bench</strong> with 8x more output (64K tokens). <strong>Claude Code launched</strong> as Anthropic’s agentic coding tool — marking the start of the CLI agent revolution.</p><p>“Claude Code is just exactly in the right stack, right around the right location... You can do anything you want with a computer through the terminal.” — Yam Peleg</p><p><strong>GPT-4.5 (Orion)</strong> (Feb 27) — OpenAI’s largest model ever (rumored 10T+ parameters). 62.5% on SimpleQA, foundation for future reasoning models.</p><p><strong>Grok 3</strong> (Feb 20) — xAI enters the arena with 1M token context and “free until GPUs melt.”</p><p><strong>Andrej Karpathy coins “Vibe Coding”</strong> (Feb 2) — The 5.2M view tweet that captured a paradigm shift: developers describe <em>what</em> they want, AI handles implementation.</p><p><strong>OpenAI Roadmap Revelation</strong> (Feb 13) — Sam Altman announced GPT-4.5 will be the last non-chain-of-thought model. GPT-5 will unify everything.</p><p>March — Google’s Revenge & The Ghibli Explosion</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-mar-6-2025-alibabas-r1-killer">Mar 06</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-turns-two-gemma-3-gemini">Mar 13</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-mar-20-openais-new-voices">Mar 20</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-mar-27-gemini-25-takes-1">Mar 27</a>)</p><p><strong>Gemini 2.5 Pro Takes #1</strong> (Mar 27) — Google reclaimed the LLM crown with AIME jumping nearly 20 points, 1M context, “thinking” integrated into the core model.</p><p><strong>GPT-4o Native Image Gen — Ghibli-mania</strong> (Mar 27) — The internet lost its collective mind and turned everything into Studio Ghibli. 
Auto-regressive image gen with perfect text rendering, incredible prompt adherence.</p><p>“The internet lost its collective mind and turned everything into Studio Ghibli” — Alex Volkov</p><p><strong>MCP Won</strong> (Mar 27) — OpenAI officially adopted Anthropic’s Model Context Protocol. No VHS vs Betamax situation. Tools work across Claude AND GPT.</p><p><strong>DeepSeek V3 685B</strong> — AIME jumped from 39.6% → 59.4%, MIT licensed, best non-reasoning open model.</p><p><strong>ThursdAI Turns 2!</strong> (Mar 13) — Two years since the first episode about GPT-4.</p><p><strong>Open Source Highlights:</strong></p><p>* <strong>Gemma 3</strong> (1B-27B) — 128K context, multimodal, 140+ languages, single GPU</p><p>* <strong>QwQ-32B</strong> — Qwen’s reasoning model matches R1, runs on Mac</p><p>* <strong>Mistral Small 3.1</strong> — 24B, beats Gemma 3, Apache 2</p><p>* <strong>Qwen2.5-Omni-7B</strong> — End-to-end multimodal with speech output</p><p>Q2 2025 — The Quarter That Shattered Reality</p><p><em>VEO3 crossed the uncanny valley, Claude 4 arrived with 80% SWE-bench, and Qwen 3 proved open source can match frontier models. 
The quarter we stopped being able to tell what’s real.</em></p><p><strong>Key Themes:</strong></p><p>* 🎬 Video AI crossed the uncanny valley (VEO3 with native audio)</p><p>* 🧠 Tool-using reasoning models emerged (o3 calling tools mid-thought)</p><p>* 🇨🇳 Open source matched frontier (Qwen 3, Claude 4)</p><p>* 📺 Google I/O delivered everything</p><p>* 💸 AI’s economic impact accelerated ($300B valuations, 80% price drops)</p><p>April — Tool-Using Reasoners & Llama Chaos</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-apr-3rd-openai-goes-open">Apr 03</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-100th-episode-meta-llama">Apr 10</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-apr-17-openai-o3-is-sota">Apr 17</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-apr-23rd-gpt-image-and-grok">Apr 24</a>)</p><p><strong>OpenAI o3 & o4-mini</strong> (Apr 17) — The most important reasoning upgrade ever. For the first time, o-series models can <strong>use tools during reasoning</strong>: web search, Python, image gen. Chain 600+ consecutive tool calls. Manipulate images mid-thought.</p><p>“This is almost AGI territory — agents that reason while wielding tools” — Alex Volkov</p><p><strong>GPT-4.1 Family</strong> (Apr 14) — <strong>1 million token context</strong> across all models. Near-perfect recall. GPT-4.5 deprecated.</p><p><strong>Meta Llama 4</strong> (Apr 5) — Scout (17B active/109B total) & Maverick (17B active/400B total). LMArena drama (tested model ≠ released model). Community criticism. Behemoth teased but never released.</p><p><strong>Gemini 2.5 Flash</strong> (Apr 17) — Set “thinking budget” per API call. 
Ultra-cheap at $0.15/$0.60 per 1M tokens.</p><p><strong>ThursdAI 100th Episode!</strong> 🎉</p><p>May — VEO3 Crosses the Uncanny Valley & Claude 4 Arrives</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-1-qwen-3-phi-4-openai">May 01</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-8th-new-gemini-pro-mistral">May 09</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-15-genocidal-grok-chatgpt">May 16</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-veo3-google-io25-claude">May 23</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-29-deepseek-r1-resurfaces">May 29</a>)</p><p><strong>VEO3 — The Undisputed Star of Google I/O</strong> (May 20) — Native multimodal audio generation (speech, SFX, music synced perfectly). Perfect lip-sync. Characters understand who’s speaking. Spawned viral “Prompt Theory” phenomenon.</p><p>“VEO3 isn’t just video generation — it’s a world simulator. We crossed the uncanny valley this quarter.” — Alex Volkov</p><p><strong>Claude 4 Opus & Sonnet — Live Drop During ThursdAI!</strong> (May 22) — Anthropic crashed the party mid-show. First models to cross <strong>80% on SWE-bench</strong>. Handles 6-7 hour human tasks. Hybrid reasoning + instant response modes.</p><p><strong>Qwen 3</strong> (May 1) — The most comprehensive open source release ever: 8 models, Apache 2.0. Runtime /think toggle for chain-of-thought. <strong>4B dense beats Qwen 2.5-72B</strong> on multiple benchmarks. 
36T training tokens, 119 languages.</p><p>“The 30B MoE is ‘Sonnet 3.5 at home’ — 100+ tokens/sec on MacBooks” — Nisten</p><p><strong>Google I/O Avalanche:</strong></p><p>* Gemini 2.5 Pro Deep Think (84% MMMU)</p><p>* Jules (free async coding agent)</p><p>* Project Mariner (browser control via API)</p><p>* Gemini Ultra tier ($250/mo)</p><p>June — The New Normal</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-jun-5-2025-live-from-ai">Jun 06</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-june-12-metas-15b-scaleai">Jun 13</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-june-18-minimax-m1-beats">Jun 20</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jun-26-gemini-cli-flux-kontext">Jun 26</a>)</p><p><strong>o3 Price Drop 80%</strong> (Jun 12) — From $40/$10 → $8/$2 per million tokens. o3-pro launched at 87% cheaper than o1-pro.</p><p><strong>Meta’s $15B Scale AI Power Play</strong> (Jun 12) — 49% stake in Scale AI. Alex Wang leads new “Superintelligence team” at Meta. Seven-to-nine-figure comp packages for researchers.</p><p><strong>MiniMax M1 — Reasoning MoE That Beats R1</strong> (Jun 19) — 456B total / 45B active parameters. Full weights on Hugging Face.</p><p><strong>Gemini CLI</strong> (Jun 26) — Google’s open source terminal agent brings Gemini 2.5 Pro to your command line.</p><p><strong>Flux Kontext</strong> — SOTA image editing with character consistency.</p><p>Q3 2025 — The Quarter of GPT-5 & Trillion-Parameter Open Source</p><p><em>GPT-5 arrived after 32 months. Open source hit trillion-parameter scale. World models became playable. 
Chinese labs continued their dominance.</em></p><p><strong>Key Themes:</strong></p><p>* 👑 GPT-5 Era began (unified reasoning + chat)</p><p>* 🇨🇳 Open source hit trillion-scale (Kimi K2, Qwen3-Coder)</p><p>* 🌍 World models became playable (Google Genie-3)</p><p>* 🎥 Video reached “can’t tell” quality</p><p>* 💰 Unprecedented investment ($100B pledges, $183B valuations)</p><p>July — Trillion-Parameter Open Source Arrives</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-jul-3-ernie-45-hunyuan-a13b">Jul 03</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jul-10-grok-4-and-4-heavy">Jul 11</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-17th-kimi-k2-openai">Jul 17</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-24-2025-qwen-mas-in">Jul 24</a>)</p><p><strong>Kimi K2 — The Trillion-Parameter King</strong> (Jul 17) — Moonshot dropped a <strong>1 trillion parameter</strong> MoE model: 65.8% on SWE-bench Verified (beating Claude Sonnet without reasoning), 32B active parameters, 128K context, Modified MIT license.</p><p>“This isn’t just another model release. This is ‘Sonnet at home’ if you have the hardware.” — Alex Volkov</p><p><strong>Grok-4 & Grok Heavy</strong> (Jul 10) — <strong>50% on Humanity’s Last Exam</strong> with tools. <strong>100% on AIME25</strong>. xAI finally became a serious contender.</p><p><strong>ChatGPT Agent (Odyssey)</strong> (Jul 17) — Unified agentic AI: browser + terminal + research. 
41.6% on HLE (double o3).</p><p><strong>Chinese Open Source Explosion:</strong></p><p>* Baidu ERNIE 4.5 (10 models, Apache 2.0)</p><p>* Tencent Hunyuan-A13B (80B MoE, 256K context)</p><p>* Huawei Pangu Pro (trained entirely on Ascend NPUs — no Nvidia!)</p><p>* Qwen3-Coder-480B (69.6% SWE-bench)</p><p>August — GPT-5 Month</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-jul-31-2025-qwens-small">Aug 01</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-gpt5-is-here">Aug 07</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-aug-14-a-week-with-gpt5">Aug 15</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-aug-21-deepseek-v31s-hybrid">Aug 21</a>)</p><p><strong>GPT-5 Launch</strong> (Aug 7) — 32 months after GPT-4:</p><p>* <strong>400K context window</strong></p><p>* <strong>$1.25/$10 per million tokens</strong> (Opus is $15/$75)</p><p>* Unified thinking + chat model</p><p>* Router-based architecture (initially buggy)</p><p>* Free tier access for back-to-school</p><p>“32 months since GPT-4 release, 32 months of ThursdAI” — Alex Volkov</p><p><strong>GPT-OSS</strong> (Aug 5) — OpenAI goes <strong>Apache 2.0 open source</strong> for the first time since GPT-2: 120B and 20B models, configurable reasoning, full chain-of-thought access.</p><p><strong>Google Genie-3</strong> (Aug 7) — DeepMind’s world model generates <strong>fully interactive 3D environments</strong>: real-time at 24fps, memory/consistency breakthrough, walk/fly/control in generated worlds.</p><p><strong>DeepSeek V3.1 Hybrid</strong> (Aug 21) — Matches/beats R1 with fewer thinking tokens. 66% SWE-bench Verified. Tool calls inside thinking. 
MIT licensed.</p><p>September — Shiptember Delivers</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-sep-4-codex-rises-anthropic">Sep 05</a> | <a target="_blank" href="https://sub.thursdai.news/p/sep-11">Sep 12</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-sep-18-gpt-5-codex-oai-wins">Sep 19</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-sep-25-grok-fast-oainvidia">Sep 26</a>)</p><p><strong>GPT-5-Codex</strong> (Sep 18) — Works <strong>7+ hours independently</strong>. 93% fewer tokens on simple tasks. Reviews majority of OpenAI’s own PRs. Perfect 12/12 on 2025 ICPC.</p><p><strong>Meta Connect 25</strong> (Sep 18) — AI glasses with <strong>built-in display</strong>, neural band wristband, live translation with subtitles, <strong>$799 shipping immediately</strong>.</p><p><strong>Qwen-mas Strikes Again</strong> (Sep 26):</p><p>* Qwen3-VL-235B (vision reasoner, 1M context for video)</p><p>* Qwen3-Omni-30B (end-to-end omni-modal)</p><p>* Qwen-Max (over 1T parameters, roadmap to 100M token context)</p><p><strong>NVIDIA $100B pledge to OpenAI</strong> — “Biggest infrastructure project in history”</p><p><strong>Suno V5</strong> — The music generation model where we officially can’t tell anymore.</p><p>“I can no longer tell which music is AI and which is human. This is it. We’ve passed the Rubicon.” — Alex Volkov</p><p>Q4 2025 — The Quarter of Agents, Gemini’s Crown & The Reasoning Wars</p><p><em>The densest quarter in AI history. Google took the throne with Gemini 3, OpenAI fired back with GPT-5.2, and agents became real products. 
Someone trained an LLM in space.</em></p><p><strong>Key Themes:</strong></p><p>* 🚀 Reasoning wars peaked (Gemini 3 → GPT-5.2 → DeepSeek gold medals)</p><p>* 🤖 Agents became products (Atlas, AgentKit, ChatGPT Apps)</p><p>* 👑 Google’s comeback (Gemini 3, Antigravity, Nano Banana)</p><p>* 🏃 ASI race accelerated ($1.4T compute, 2028 autonomous researchers)</p><p>* 🎬 Sora 2 launched AI-native social media</p><p>October — Sora Changes Social Media Forever</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-2-sora-2-the-new-tiktok">Oct 03</a> | <a target="_blank" href="https://sub.thursdai.news/p/oct-9-2025-dev-days-agent-era-samsungs">Oct 10</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-16-veo31-haiku-45-chatgpt">Oct 17</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-23-the-ai-browser-wars">Oct 24</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-30-2025-minimax-m2-shocks">Oct 30</a>)</p><p><strong>Sora 2 — AI Social Media is Born</strong> (Oct 2):</p><p>* Shot to <strong>#3 on iOS App Store</strong> within days</p><p>* <strong>Cameos</strong>: upload your face, star in any video</p><p>* Sam Altman shared his Cameo publicly, becoming the internet’s most meme-able person</p><p>* All content is AI-generated — no uploads, only creations</p><p>“This is the first social media with UGC where content can ONLY be generated” — Alex Volkov</p><p><strong>OpenAI Dev Day</strong> (Oct 9):</p><p>* ChatGPT Apps for 800M+ weekly active users</p><p>* AgentKit: drag-and-drop agent builder</p><p>* GPT-5-Pro in API</p><p>* Sam revealed <strong>$1.4 trillion in compute obligations</strong></p><p><strong>AI Makes Novel Cancer Discovery</strong> (Oct 16) — A 27B Gemma-based model generated a novel hypothesis about cancer cells validated in a wet lab. 
First confirmed case of AI creating genuinely new scientific knowledge.</p><p><strong>Claude Sonnet 4.5</strong> — 61.4% OSWorld (computer use)</p><p><strong>Claude Haiku 4.5</strong> — 73.3% SWE-Bench, lightning fast</p><p>November — The Week That Changed Everything</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-nov-6-2025-kimis-1t-thinking">Nov 07</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-nov-13-gpt-51-ernie-45-vl">Nov 13</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-the-week-that-changed-the">Nov 20</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-25">Nov 27</a>)</p><p><strong>THE MOST INSANE WEEK IN AI HISTORY.</strong> In a single span of ~10 days:</p><p>* <strong>Grok 4.1</strong> — #1 LMArena (briefly)</p><p>* <strong>Gemini 3 Pro</strong> — Took the throne with 45.14% on ARC-AGI-2 (Deep Think)</p><p>* <strong>GPT-5.1-Codex-Max</strong> — 24+ hour autonomous coding</p><p>* <strong>Nano Banana Pro</strong> — 4K image generation with perfect text rendering</p><p>* <strong>Meta SAM 3 & SAM 3D</strong> — Open-vocabulary segmentation</p><p>* <strong>Claude Opus 4.5</strong> — 80.9% SWE-Bench Verified, beats GPT-5.1</p><p>“This week almost broke me as a person whose full-time job is to cover and follow AI releases.” — Alex Volkov</p><p><strong>Gemini 3 Pro + Deep Think</strong> (Nov 20) — Google finally took the LLM throne: 45.14% on ARC-AGI-2, roughly double previous SOTA.</p><p><strong>Google Antigravity IDE</strong> (Nov 20) — Free agent-first VS Code fork with browser integration, multiple parallel agents.</p><p><strong>Nano Banana Pro</strong> (Nov 20) — Native 4K resolution with “thinking” traces, perfect text rendering.</p><p><strong>Claude Opus 4.5</strong> (Nov 27) — 80.9% SWE-Bench Verified. $5/$25 per MTok (1/3 previous cost). “Effort” parameter for reasoning control.</p><p>“Opus 4.5 is unbelievable. 
You can ship a full feature on a mature code base in one day, always. It’s just mind blowing.” — Ryan Carson</p><p><strong>1X NEO</strong> (Oct 30) — First consumer humanoid robot, pre-orders at $20,000, delivery early 2026.</p><p>December — GPT-5.2 Fires Back</p><p>(<a target="_blank" href="https://sub.thursdai.news/p/thursdai-special-googles-new-anti">Dec 02</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-4-2025-deepseek-v32">Dec 05</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-11-gpt-52-is-here-plus">Dec 12</a> | <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-18-gemini-3-flash-grok">Dec 19</a>)</p><p><strong>GPT-5.2 — OpenAI’s Answer to Gemini 3</strong> (Dec 11) — Dropped live during ThursdAI:</p><p>* <strong>90.5% on ARC-AGI-1</strong> (Pro X-High configuration)</p><p>* <strong>54%+ on ARC-AGI-2</strong> — reclaiming frontier from Gemini 3</p><p>* <strong>100% on AIME 2025</strong> — perfect math olympiad score</p><p>* <strong>70% on GDPval</strong> (up from 47% in Sept!)</p><p>* Reports of models thinking for <strong>1-3 hours</strong> on hard problems</p><p><strong>DeepSeek V3.2 & V3.2-Speciale — Gold Medal Reasoning</strong> (Dec 4):</p><p>* <strong>96% on AIME</strong> (vs 94% for GPT-5 High)</p><p>* Gold medals on <strong>IMO (35/42), CMO, ICPC (10/12), IOI (492/600)</strong></p><p>* <strong>$0.28/million tokens</strong> on OpenRouter</p><p><strong>MCP Donated to Linux Foundation</strong> (Dec 11) — <strong>Agentic AI Foundation</strong> launched under Linux Foundation. MCP, AGENTS.md, and goose donated to vendor-neutral governance.</p><p><strong>Mistral 3 Returns to Apache 2.0</strong> (Dec 4) — Mistral Large 3 (675B MoE), Ministral 3 (vision, edge-optimized).</p><p><strong>Starcloud: LLM Training in Space</strong> (Dec 11) — An H100 satellite trained nanoGPT on Shakespeare. 
SSH into an H100… in space… with a US flag in the corner.</p><p>“Peak 2025 energy — the era of weird infra ideas has begun.” — Karpathy reacts</p><p><strong>Gemini 3 Flash</strong> (Dec 18) — Fastest frontier model, pairs with Gemini 3 Pro for speed vs depth tradeoffs.</p><p>🙏 Thank You</p><p>This has been an incredible year of ThursdAI. 51 episodes, countless releases, and a community that keeps showing up every week to make sense of the madness together.</p><p><strong>Huge thanks to our amazing co-hosts and friends of the pod:</strong></p><p>* <strong>Alex Volkov</strong> — AI Evangelist, Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* <strong>Wolfram Ravenwolf</strong> (<a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* <strong>Yam Peleg</strong> (<a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>)</p><p>* <strong>Nisten Tahiraj</strong> (<a target="_blank" href="https://x.com/nisten">@nisten</a>)</p><p>* <strong>LDJ</strong> (<a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a>)</p><p>* <strong>Ryan Carson</strong> (<a target="_blank" href="https://x.com/ryancarson">@ryancarson</a>)</p><p>* <strong>Kwindla Hultman Kramer</strong> — CEO of Daily (<a target="_blank" href="https://x.com/kwindla">@kwindla</a>)</p><p>And to everyone who tunes in — whether you’re listening on your commute, doing dishes, or just trying to keep up with the insanity — <strong>thank you</strong>. 
You make this possible.</p><p>📢 Stay Connected</p><p>* 🎧 <strong>Subscribe</strong>: <a target="_blank" href="https://thursdai.news">thursdai.news</a></p><p>* 🐦 <strong>Follow Alex</strong>: <a target="_blank" href="https://x.com/altryne">@altryne</a></p><p>* 💻 <strong>This recap is open source</strong>: <a target="_blank" href="https://github.com/altryne/thursdAI_yearly_recap">github.com/altryne/thursdAI_yearly_recap</a></p><p>“We’re living through the early days of a technological revolution, and we get to be part of it. That’s something to be genuinely thankful for.” — Alex Volkov</p><p><strong>Happy Holidays, and see you in 2026! 🚀</strong></p><p><em>The best is yet to come. Hold on to your butts.</em></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-2025-a-year-of-ai-in-review</link><guid isPermaLink="false">substack:post:182523829</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 25 Dec 2025 19:30:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/182523829/d7b5becb52dbd4bd0b7faf73e0898e5d.mp3" length="105283410" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6580</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/182523829/8d195e358081e2242ebc102cb8ae1725.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 18 - Gemini 3 Flash, Grok Voice, ChatGPT Appstore, Image 1.5 & GPT 5.2 Codex, Meta Sam Audio & more AI news]]></title><description><![CDATA[<p>Hey folks 👋 Alex here, dressed as 🎅 for our pre X-mas episode!</p><p>We’re wrapping up 2025, and the AI labs decided they absolutely could NOT let the year end quietly. 
This week was an absolute banger—we had Gemini 3 Flash dropping with frontier intelligence at flash prices, OpenAI firing off GPT 5.2 Codex as breaking news DURING our show, ChatGPT Images 1.5, Nvidia going all-in on open source with Nemotron 3 Nano, and the voice AI space heating up with Grok Voice and Chatterbox Turbo. Oh, and Google dropped FunctionGemma for all your toaster-to-fridge communication needs (yes, really).</p><p>Today’s show was over three and a half hours long because we tried to cover both this week AND the entire year of 2025 (that yearly recap is coming next week—it’s a banger, we went month by month and you’ll really feel the acceleration). For now, let’s dive into just the insanity that was THIS week.</p><p>00:00 Introduction and Overview</p><p>00:39 Weekly AI News Highlights</p><p>01:40 Open Source AI Developments</p><p>01:44 Nvidia's Nemotron Series</p><p>09:09 Google's Gemini 3 Flash</p><p>19:26 OpenAI's GPT Image 1.5</p><p>20:33 Infographic and GPT Image 1.5 Discussion</p><p>20:53 Nano Banana vs GPT Image 1.5</p><p>21:23 Testing and Comparisons of Image Models</p><p>23:39 Voice and Audio Innovations</p><p>24:22 Grok Voice and Tesla Integration</p><p>26:01 Open Source Robotics and Voice Agents</p><p>29:44 Meta's SAM Audio Release</p><p>32:14 Breaking News: Google Function Gemma</p><p>33:23 Weights & Biases Announcement</p><p>35:19 Breaking News: OpenAI Codex 5.2 Max</p><p>Big Companies LLM updates</p><p>Google’s Gemini 3 Flash: The High-Speed Intelligence King</p><p>If we had to title 2025, as Ryan Carson mentioned on the show, it might just be “The Year of Google’s Comeback.” Remember at the start of the year when we were asking “Where is Google?” Well, they are here. Everywhere.</p><p>This week they launched <strong>Gemini 3 Flash</strong>, and it is rightfully turning heads. 
This is a frontier-class model—meaning it boasts Pro-level intelligence—but it runs at Flash-level speeds and, most importantly, Flash-level pricing. We are talking $0.50 per 1 million input tokens. That is not a typo. The price-to-intelligence ratio here is simply off the charts.</p><p>I’ve been using Gemini 2.5 Flash in production for a while because it was good enough, but Gemini 3 Flash is a different beast. It scores 71 on the Artificial Analysis Intelligence Index (a 13-point jump from the previous Flash), and it achieves 78% on SWE-bench Verified. That actually beats the bigger Gemini 3 Pro on some agentic coding tasks!</p><p>What impressed me most, and something Kwindla pointed out, is the tool calling. Previous Gemini models sometimes struggled with complex tool use compared to OpenAI, but Gemini 3 Flash can handle up to 100 simultaneous function calls. It’s fast, it’s smart, and it’s integrated immediately across the entire Google stack—Workspace, Android, Chrome. Google isn’t just releasing models anymore; they are deploying them instantly to billions of users.</p><p>For anyone building agents, this combination of speed, low latency, and 1 million context window (at this price!) makes it the new default workhorse.</p><p>Google’s FunctionGemma Open Source release</p><p>We also got a smaller, quirkier release from Google: <strong>FunctionGemma</strong>. This is a tiny 270M parameter model. Yes, millions, not billions.</p><p>It’s purpose-built for function calling on edge devices. It requires only 500MB of RAM, meaning it can run on your phone, in your browser, or even on a Raspberry Pi. As Nisten joked on the show, this is finally the model that lets your toaster talk to your fridge.</p><p>Is it going to write a novel? No. But after fine-tuning, it jumped from 58% to 85% accuracy on mobile action tasks. 
This represents a future where privacy-first agents live entirely on your device, handling your calendar and apps without ever pinging a cloud server.</p><p>OpenAI Image 1.5, GPT 5.2 Codex and ChatGPT Appstore</p><p>OpenAI had a busy week, starting with the release of <strong>GPT Image 1.5</strong>. It’s available now in ChatGPT and the API. The headline here is speed and control—it’s 4x faster than the previous model and 20% cheaper. It also tops the LMSYS Image Arena leaderboards.</p><p>However, I have to give a balanced take here. We’ve been spoiled recently by Google’s “Nano Banana Pro” image generation (which powers Gemini). When we looked at side-by-side <a target="_blank" href="https://x.com/arena/status/2001689438297182252">comparisons</a>, especially with typography and infographic generation, Gemini often looked sharper and more coherent. This is what we call “hedonic adaptation”—GPT Image 1.5 is great, but the bar has moved so fast that it doesn’t feel like the quantum leap DALL-E 3 was back in the day. Still, for production workflows where you need to edit specific parts of an image without ruining the rest, this is a massive upgrade.</p><p><strong>🚨 BREAKING: GPT 5.2 Codex</strong></p><p>Just as we were nearing the end of the show, OpenAI decided to drop some breaking news: <strong>GPT 5.2 Codex</strong>.</p><p>This is a specialized model optimized specifically for agentic coding, terminal workflows, and cybersecurity. We quickly pulled up the benchmarks live, and they look significant. It hits 56.4% on SWE-Bench Pro and a massive 64% on Terminal-Bench 2.0.</p><p>It supports up to <strong>400k token inputs with native context compaction</strong>, meaning it’s designed for those long, complex coding sessions where you’re debugging an entire repository. The coolest (and scariest?) 
stat: a security researcher used this model to find three previously unknown vulnerabilities in React in just one week.</p><p>OpenAI is positioning this for “professional software engineering,” and the benchmarks suggest a 30% improvement in token efficiency over the standard GPT 5.2. We are definitely going to be putting this through its paces in our own evaluations soon.</p><p>ChatGPT ... the AppStore!</p><p>Also today (OpenAI is really throwing everything they have at the end-of-year release party), OpenAI has unveiled how their App Store is going to look and opened the submission forms to submit your own apps!</p><p>Reminder: ChatGPT apps are powered by MCP and were announced during DevDay. They let companies build a full UI experience right inside ChatGPT, and given OpenAI’s almost 900M weekly active users, this is a big deal! Do you have an app you’d like in there? Let me know in the comments!</p><p>Open Source AI</p><p>🔥 Nvidia Nemotron 3 Nano: The Most Important Open Source Release of the Week (<a target="_blank" href="https://x.com/ctnzr/status/2000567572065091791">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16">HF</a>)</p><p>I think the most important release of this week in open source was Nvidia Nemotron 3 Nano, and it was pretty much everywhere. Nemotron is a series of models from Nvidia that’s been pushing efficiency updates, finetune innovations, pruning, and distillations—all the stuff Nvidia does incredibly well.</p><p>Nemotron 3 Nano is a 30 billion parameter model with only 3 billion active parameters, using a hybrid Mamba-MoE architecture. This is huge. The model achieves 1.5 to 3.3x faster inference than competing models like Qwen 3 while maintaining competitive accuracy on H200 GPUs.</p><p>But the specs aren’t even the most exciting part. NVIDIA didn’t just dump the weights over the wall. They released the datasets—all 25 trillion tokens of pre-training and post-training data. 
They released the recipes. They released the technical reports. This is what “Open AI” should actually look like.</p><p>What’s next? Nemotron 3 Super at 120B parameters (4x Nano) and Nemotron 3 Ultra at 480B parameters (16x Nano) are coming in the next few months, featuring their innovative Latent Mixture of Experts architecture.</p><p>Check out the release on <a target="_blank" href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16">HuggingFace</a></p><p>Other Open Source Highlights</p><p>LDJ brought up <strong>BOLMO from Allen AI</strong>—the first byte-level model that actually reaches parity with similar-size models using regular tokenization. This is really exciting because it could open up new possibilities for spelling accuracy, precise code editing, and potentially better omnimodality since ultimately everything is bytes—images, audio, everything.</p><p>Wolfram highlighted <strong>OLMO 3.1</strong>, also from Allen AI, which is multimodal with video input in three sizes (4B, 7B, 8B). The interesting feature here is that you can give it a video, ask something like “how many times does a ball hit the crown?” and it’ll not only give you the answer but mark the precise coordinates on the video frames where it happens. Very cool for tracking objects throughout a video!</p><p>Mistral OCR 3 (<a target="_blank" href="https://x.com/MistralAI/status/2001669581275033741">X</a>)</p><p>Mistral also dropped <strong>Mistral OCR 3</strong> this week—their next-generation document intelligence model achieving a 74% win rate over OCR 2 across challenging document types. We’re talking forms, low-quality scans, handwritten text, complex tables, and multilingual documents.</p><p>The pricing is aggressive at just $2 per 1,000 pages (or $1 with Batch API discount), and it outperforms enterprise solutions like AWS Textract, Azure Doc AI, and Google DocSeek. 
Available via <strong>API</strong> and their new Document AI Playground.</p><p><strong>🐝 This Week’s Buzz: Wolfram Joins Weights & Biases!</strong></p><p>I am so, so hyped to announce this. Our very own co-host and evaluation wizard, Wolfram Ravenwolf, is officially joining the Weights & Biases / CoreWeave family as an AI Evangelist and “AIvaluator” starting in January!</p><p>Wolfram has been the backbone of the “vibe checks” and deep-dive evals on this show for a long time. Now, he’ll be doing it full-time, building out benchmarks for the community and helping all of us make sense of this flood of models. Expect ThursdAI to get even more data-driven in 2026. Match made in heaven! And if you’re as excited as we are, give <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Dec18">Weave</a> a try; it’s free to get started!</p><p>Voice & Audio: Faster, Cheaper, Better</p><p>If 2025 was the year of the LLM comeback, the end of 2025 is the era of Voice AI commoditization. It is getting so cheap and so fast.</p><p><strong>Grok Voice Agent API (</strong><a target="_blank" href="https://x.com/xai/status/2001385958147752255"><strong>X</strong></a><strong>)</strong></p><p>xAI launched their <strong>Grok Voice Agent API</strong>, and the pricing is aggressive: $0.05 per minute flat rate. That significantly undercuts OpenAI and others. But the real killer feature here is the integration.</p><p>If you drive a Tesla, this is what powers the voice command when you hold down the button. It has native access to vehicle controls, but for developers, it has native tool calling for <strong>Real-time X Search</strong>. This means your voice agent can have up-to-the-minute knowledge about the world, something purely pre-trained models struggle with. 
It ranks #1 on Big Bench Audio, and with that pricing, we’re going to see voice ubiquity very soon.</p><p>Kwindla had great insights here: it feels like they optimized for the Tesla use case where it’s a question and an answer. You can see this because Big Bench Audio is a hard audio Q&A benchmark but not multi-turn. So it’s super exciting, but it’s not necessarily what we’ll use for multi-turn conversational voice agents yet.</p><p>Here’s what’s really interesting: the entire voice stack was built in-house with custom VAD, tokenizer, and audio models for end-to-end optimization. Tesla was a critical design partner—Grok now powers millions of Tesla vehicles. If you’re building AI voice agents, give the Grok Voice SDK a try.</p><p>Resemble AI’s Chatterbox Turbo (<a target="_blank" href="https://x.com/0xDevShah/status/2000631462786400718">X</a>, <a target="_blank" href="https://huggingface.co/ResembleAI/chatterbox-turbo">HF</a>, <a target="_blank" href="https://github.com/resemble-ai/chatterbox">GitHub</a>, <a target="_blank" href="https://www.resemble.ai/chatterbox-turbo/">Blog</a>)</p><p>For the open-source heads, <strong>Resemble AI</strong> dropped a bombshell with <strong>Chatterbox Turbo</strong>. This is a 350M parameter open-source TTS model that is beating proprietary giants like ElevenLabs in blind tests.</p><p>It allows for zero-shot voice cloning from just 5 seconds of audio and supports paralinguistic tags—meaning you can type [laugh] or [sigh] and the model actually acts it out naturally. Plus, it has built-in watermarking for safety. It’s MIT licensed, so you can run this yourself. The fact that an open model is winning on quality against the paid APIs is a huge moment for the community.</p><p><strong>Meta SAM Audio</strong></p><p>Finally, Meta extended their “Segment Anything” magic to audio with <strong>SAM Audio</strong>. You know how you can click an object in an image to select it? 
Now you can do that with sound.</p><p>With SAM Audio, you can isolate just the sound of a train from a messy audio track, or pick out a specific instrument from a song. You can prompt it with text (“guitar”), visual clicks on a video, or timestamps. It’s incredible for creators and audio engineers, effectively automating what used to be painful manual editing.</p><p>Wrapping Up</p><p>What a week to close out 2025. Google proved once again that they’re the gorilla that’s learned to dance—Gemini 3 Flash delivering frontier intelligence at flash prices is going to change how people build AI applications. Nvidia showed that the most valuable company in the world is all-in on open source. OpenAI fired off GPT 5.2 Codex just to make sure we don’t forget about them. And the voice AI space is heating up with options that would have seemed impossible just a year ago.</p><p>Look out for the full 2025 yearly recap episode coming next week—it’s a banger. We went month by month through every major AI release and talked about which we thought were the best overall. You’ll really feel the acceleration from that one.</p><p>Happy holidays, folks! 
And as always, thanks for being part of the ThursdAI community.</p><p>TL;DR and Show Notes</p><p><strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist @ Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co-hosts: <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="https://x.com/nisten">@nisten</a>, <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a>, <a target="_blank" href="https://x.com/ryancarson">@ryancarson</a></p><p>* Special Guest: <a target="_blank" href="https://x.com/kwindla">@kwindla</a> - CEO of Daily</p><p><strong>Open Source LLMs</strong></p><p>* NVIDIA Nemotron 3 Nano - 30B-A3B hybrid Mamba-MoE model (<a target="_blank" href="https://x.com/ctnzr/status/2000567572065091791">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16">HF</a>, <a target="_blank" href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8">HF FP8</a>)</p><p>* FunctionGemma - 270M parameter function calling model (<a target="_blank" href="https://x.com/osanseviero/status/2001704034667769978">X</a>, <a target="_blank" href="https://blog.google/technology/developers/functiongemma/">Blog</a>, <a target="_blank" href="https://ai.google.dev/gemma/docs/functiongemma">Docs</a>)</p><p>* Mistral OCR 3 - Document intelligence model with 74% win rate over v2 (<a target="_blank" href="https://x.com/MistralAI/status/2001669581275033741">X</a>, <a target="_blank" href="https://mistral.ai/news/mistral-ocr-3">Blog</a>, <a target="_blank" href="https://console.mistral.ai/">Console</a>)</p><p>* BOLMO from Allen AI - First byte-level model reaching parity with regular tokenization (<a target="_blank" href="https://x.com/allen_ai/status/2000616646042399047">X</a>)</p><p>* OLMO 3.1 from Allen AI - Multimodal with video input (4B, 7B, 8B sizes) (<a 
target="_blank" href="https://x.com/allen_ai/status/2000962068774588536">X</a>)</p><p><strong>Big CO LLMs + APIs</strong></p><p>* Google Gemini 3 Flash - Frontier intelligence at $0.50/1M input tokens, 78% SWE-bench Verified (<a target="_blank" href="https://x.com/OfficialLoganK/status/2001322275656835348">X</a>, <a target="_blank" href="https://deepmind.google/technologies/gemini/flash/">Announcement</a>)</p><p>* OpenAI GPT Image 1.5 - 4x faster, 20% cheaper, #1 on LMSYS Image Arena (<a target="_blank" href="https://x.com/OpenAI/status/2000990989629161873">X</a>)</p><p>* OpenAI GPT 5.2 Codex - 56.4% SWE-Bench Pro, 64% Terminal-Bench 2.0, 400K context (<a target="_blank" href="https://x.com/thsottiaux/status/2001720483872674269">X</a>, <a target="_blank" href="https://openai.com/index/gpt-5-2-codex/">Blog</a>)</p><p>* ChatGPT App Store - MCP-powered apps submission now open (<a target="_blank" href="https://x.com/OpenAIDevs/status/2001419749016899868">X</a>)</p><p><strong>This Week’s Buzz</strong></p><p>* 🐝 Wolfram joins Weights & Biases / CoreWeave as AI Evangelist and AIvaluator!</p><p>* Try <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Dec18">Weave</a> for AI evaluations</p><p><strong>Voice & Audio</strong></p><p>* xAI Grok Voice Agent API - #1 Big Bench Audio (92.3%), $0.05/min flat rate, powers Tesla vehicles (<a target="_blank" href="https://x.com/xai/status/2001385958147752255">X</a>)</p><p>* Resemble AI Chatterbox Turbo - MIT-licensed 350M TTS, beats ElevenLabs in blind tests (<a target="_blank" href="https://x.com/0xDevShah/status/2000631462786400718">X</a>, <a target="_blank" href="https://huggingface.co/ResembleAI/chatterbox-turbo">HF</a>, <a target="_blank" href="https://github.com/resemble-ai/chatterbox">GitHub</a>, <a target="_blank" href="https://www.resemble.ai/chatterbox-turbo/">Blog</a>)</p><p>* Meta SAM Audio - Audio source separation with text/visual/temporal prompts (<a 
target="_blank" href="https://x.com/AIatMeta/status/2000980784425931067">X</a>, <a target="_blank" href="https://huggingface.co/facebook/sam-audio-large">HF</a>, <a target="_blank" href="https://github.com/facebookresearch/sam-audio">GitHub</a>)</p><p><strong>Show Links</strong></p><p>* Full 2025 Yearly Recap - Coming next week!</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-18-gemini-3-flash-grok</link><guid isPermaLink="false">substack:post:182041543</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 19 Dec 2025 00:18:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/182041543/c08452d414aa4b96588660458f8b3e0f.mp3" length="37629454" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>2352</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/182041543/ada3143ccf7309d2ec10e2e6188c69a8.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 11 - GPT 5.2 is HERE! Plus, LLMs in Space, MCP donated, Devstral surprises and more AI news! ]]></title><description><![CDATA[<p>Hey everyone, </p><p>December started strong and does NOT want to slow down! OpenAI showed us their response to the Code Red and it’s GPT 5.2, which doesn’t feel like a .1 upgrade! We got it literally as breaking news at the end of the show, and oh boy! A new kind of LLM is here. </p><p>GPT, then Gemini, then Opus and now GPT again... Who else feels like we’re on a trippy AI rollercoaster? Just me? 🫨 </p><p>I’m writing this newsletter from a fresh “traveling podcaster” setup in SF (huge shoutout to the Chroma team for the studio hospitality). 
</p><p>P.S. - Next week we’re doing a year recap episode (52nd episode of the year, what is my life), but <em>today</em> is about the highest-signal stuff that happened <strong>this week</strong>.</p><p>Alright. No more foreplay. Let’s dive in. Please subscribe. </p><p>🔥 The main event: OpenAI launches GPT‑5.2 (and it’s… a lot)</p><p>We started the episode with “garlic in the air” rumors (OpenAI holiday launches always have that <em>Christmas panic energy</em>), and then… boom: <strong>GPT‑5.2</strong> actually drops while we’re live.</p><p>What makes this release feel significant isn’t “one benchmark went up.” It’s that OpenAI is clearly optimizing for <em>the things that have become the frontier in 2025</em>: long-horizon reasoning, agentic coding loops, long context reliability, and lower hallucination rates <em>when browsing/tooling is involved</em>.</p><p>5.2 Instant, Thinking and Pro in ChatGPT and in the API</p><p>OpenAI shipped multiple variants, and even within those there are “levels” (medium/high/extra-high) that effectively change how much compute the model is allowed to burn. At the extreme end, you’re basically running parallel thoughts and selecting winners. That’s powerful, but also… very expensive.</p><p>It’s very clearly aimed at the agentic world: coding agents that run in loops, tool-using research agents, and “do the whole task end-to-end” workflows where spending extra tokens is still cheaper than spending an engineer day.</p><p>Benchmarks </p><p>I’m not going to pretend benchmarks tell the full story (they never do), but the shape of improvements matters. GPT‑5.2 shows huge strength on reasoning + structured work.</p><p>It hits <strong>90.5% on ARC‑AGI‑1</strong> in the Pro X‑High configuration, and <strong>54%+ on ARC‑AGI‑2</strong> depending on the setting. For context, ARC‑AGI‑2 is the one where everyone learns humility again.</p><p>On math/science, this thing is flexing. 
We saw <strong>100% on AIME 2025</strong>, and strong performance on FrontierMath tiers (with the usual “Tier 4 is where dreams go to die” vibe still intact). GPQA Diamond is up in the 90s too, which is basically “PhD trivia mode.”</p><p>But honestly the most <em>practically</em> interesting one for me is <strong>GDPval</strong> (knowledge-work tasks: slides, spreadsheets, planning, analysis). GPT‑5.2 lands around <strong>70%</strong>, which is a massive jump vs earlier generations. This is the category that translates directly into “is this model useful at my job.” OpenAI launched this benchmark only in September, and back then Opus 4.1 scored a “measly” 47%. Talk about acceleration! </p><p>Long context: MRCR is the sleeper highlight</p><p>On MRCR (multi-needle long-context retrieval), GPT‑5.2 holds up absurdly well even into 128k and beyond. The graph OpenAI shared shows GPT‑5.1 falling off a cliff as context grows, while GPT‑5.2 stays high much deeper into long contexts.</p><p>If you’ve ever built a real system (RAG, agent memory, doc analysis) you know this pain: long context is easy to offer, hard to <em>use well</em>. If GPT‑5.2 actually delivers this in production, it’s a meaningful shift.</p><p>Hallucinations: down (especially with browsing)</p><p>One thing we called out on the show is that a bunch of user complaints in 2025 have basically collapsed into one phrase: “it hallucinates.” Even people who don’t know what a benchmark is can feel when a model confidently lies.</p><p>OpenAI’s system card shows lower rates of major incorrect claims compared to GPT‑5.1, and lower “incorrect claims” overall when browsing is enabled. 
That’s exactly the direction they needed.</p><p>Real-world vibes:</p><p>We did the traditional “vibe tests” mid-show: generate a flashy landing page, do a weird engineering prompt, try some coding inside Cursor/Codex.</p><p>Early testers <a target="_blank" href="https://x.com/altryne/status/1999235261381906573">broadly agree</a> on the shape of the improvement. GPT‑5.2 is <strong>much stronger in reasoning, math, long‑context tasks, visual understanding, and multimodal workflows</strong>, with multiple reports of it successfully thinking for <strong>one to three hours</strong> on hard problems. Enterprise users like Box report <strong>faster execution and higher accuracy</strong> on real knowledge‑worker tasks, while researchers note that GPT‑5.2 Pro consistently outperforms the standard “Thinking” variant. The tradeoffs are also clear: creative writing still slightly favors Claude Opus, and the highest reasoning tiers can be slow and expensive. But as a <strong>general‑purpose reasoning model</strong>, GPT‑5.2 is now the strongest publicly available option.</p><p><strong>AI in space: Starcloud trains an LLM on an H100 </strong><strong><em>in orbit</em></strong></p><p>This story is peak 2025.</p><p><strong>Starcloud</strong> put an <strong>NVIDIA H100 on a satellite</strong>, trained Andrej Karpathy’s nanoGPT on Shakespeare, and ran inference on <strong>Gemma</strong>. There’s a viral screenshot vibe here that’s impossible to ignore: SSH into an H100… in space… with a US flag in the corner. It’s engineered excitement, and I’m absolutely here for it.</p><p>But we actually had a real debate on the show: is “GPUs in space” just sci‑fi marketing, or does it make economic sense?</p><p>Nisten made a compelling argument that power is the real bottleneck, not compute, and that big satellites already operate in the ~20kW range. If you can generate that power reliably with solar in orbit, the economics start looking less insane than you’d think. 
LDJ added the long-term land/power convergence argument: Earth land and grid power get scarcer/more regulated, while launch costs trend down—eventually the curves may cross.</p><p>I played “voice of realism” for a minute: what happens when GPUs fail? It’s hard enough to swap a GPU in a datacenter; now imagine doing it in orbit. Cooling and heat dissipation become a different engineering problem too (radiators instead of fans). Networking is nontrivial. But also: we are clearly entering the era where people will try <em>weird</em> infra ideas because AI demand is pulling the whole economy.</p><p><strong>Big Company: MCP gets donated, OpenRouter drops a report on AI</strong></p><p><strong>Agentic AI Foundation Lands at the Linux Foundation</strong></p><p>This one made me genuinely happy.</p><p>Block, Anthropic, and OpenAI came together to launch the <strong>Agentic AI Foundation</strong> under the Linux Foundation, donating key projects like <strong>MCP</strong>, <a target="_blank" href="https://agents.md"><strong>AGENTS.md</strong></a>, and <strong>goose</strong>. This is exactly how standards should happen: vendor‑neutral, boring governance, lots of stakeholders.</p><p>It’s not flashy work, but it’s the kind of thing that actually lets ecosystems grow without fragmenting. </p><p>BTW, I was recording my podcast while <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a> were recording theirs in the same office, and they have a banger episode upcoming about this very topic! All I’ll say is <a target="_blank" href="https://substack.com/profile/3381444-alessio-fanelli">Alessio Fanelli</a> introduced me to David Soria Parra from MCP 👀 Watch out for that episode on Latent space dropping soon! </p><p><strong>OpenRouter’s “State of AI”: 100 Trillion Tokens of Reality</strong></p><p>OpenRouter and a16z dropped a massive report analyzing over <strong>100 trillion tokens</strong> of real‑world usage. 
A few things stood out:</p><p><strong>Reasoning tokens now dominate.</strong> More than half of current traffic, and around 60% of all tokens since early 2025, are reasoning tokens. Remember when we went from “LLMs can’t do math” to reasoning models? That happened in about a year.</p><p><strong>Programming exploded.</strong> From 11% of usage in early 2025 to over 50% recently. Claude holds 60% of the coding market (at least on OpenRouter).</p><p><strong>Open source hit 30% market share</strong>, led by Chinese labs: DeepSeek (14T tokens), Qwen (5.59T), Meta LLaMA (3.96T).</p><p><strong>Context lengths grew massively.</strong> Average prompt length went from 1.5k to 6k+ tokens (4x growth), completions from 133 to 400 tokens (3x).</p><p><strong>The “Glass Slipper” effect.</strong> When users find a model that fits their use case, they stay loyal. Foundational early-user cohorts retain around 40% at month 5. Claude 4 Sonnet still had 50% retention after three months.</p><p><strong>Geography shift.</strong> Asia doubled to 31% of usage (with China a key driver), while North America is at 47%.</p><p>Yam made a good point that we should be careful interpreting these graphs—they’re biased toward people trying new models, not necessarily steady usage. But the trends are clear: agentic, reasoning, and coding are the dominant use cases.</p><p><strong>Open Source Is Not Slowing Down (If Anything, It’s Accelerating)</strong></p><p>One of the strongest themes this week was just how fast open source is closing the gap — and in some areas, outright leading. We’re not talking about toy demos anymore. We’re talking about serious models, trained from scratch, hitting benchmarks that were frontier‑only not that long ago.</p><p><strong>Essential AI’s Rnj‑1: A Real Frontier 8B Model</strong></p><p>This one deserves real attention. Essential AI — led by Ashish Vaswani, yes Ashish from the original Transformers paper — released <strong>Rnj‑1</strong>, a pair of 8B open‑weight models trained fully from scratch. No distillation. 
No “just a fine‑tune.” This is a proper pretrain.</p><p>What stood out to me isn’t just the benchmarks (though those are wild), but the philosophy. Rnj‑1 is intentionally focused on <strong>pretraining quality</strong>: data curation, code execution simulation, STEM reasoning, and agentic behaviors emerging during pretraining instead of being bolted on later with massive RL pipelines.</p><p>In practice, that shows up in places like SWE‑bench Verified, where Rnj‑1 lands in the same ballpark as much larger closed models, and in math and STEM tasks where it punches way above its size. And remember: this is an <strong>8B</strong> model you can actually run locally, quantize aggressively, and deploy without legal gymnastics thanks to its Apache 2.0 license.</p><p><strong>Mistral Devstral 2 + Vibe: Open Coding Goes Hard</strong></p><p>Mistral followed up last week’s momentum with <strong>Devstral 2</strong>, and Mistral Vibe! </p><p>The headline numbers are: the 123B Devstral 2 model lands right at the top of open‑weight coding benchmarks, nearly matching Claude 3.5 Sonnet on SWE‑bench Verified. But what really excited the panel was the <strong>24B Devstral Small 2</strong>, which hits high‑60s SWE‑bench scores while being runnable on consumer hardware.</p><p>This is the kind of model you can realistically run locally as a coding agent, without shipping your entire codebase off to someone else’s servers. Pair that with <strong>Mistral Vibe</strong>, their open‑source CLI agent, and you suddenly have a credible, fully open alternative to things like Claude Code, Codex, or Gemini CLI.</p><p>We talked a lot about why this matters. Some teams can’t send code to closed APIs. Others just don’t want to pay per‑token forever. And some folks — myself included — just like knowing what’s actually running under the hood. 
Devstral 2 checks all those boxes.</p><p><strong>🐝 This week’s Buzz (W&B): Trace OpenRouter traffic into Weave with zero code</strong></p><p>We did a quick “Buzz” segment on a feature that I think a lot of builders will love:</p><p>OpenRouter launched <a target="_blank" href="https://openrouter.ai/settings/broadcast"><strong>Broadcast</strong></a>, which can stream traces to observability vendors. One of those destinations is <strong>W&B Weave</strong>.</p><p>The magic here is: if you’re using a tool that already talks to OpenRouter, you can get tracing into Weave <strong>without instrumenting your code</strong>. That’s especially useful when instrumentation is hard (certain agent frameworks, black-box tooling, restricted environments, etc.).</p><p>If you want to set it up: <strong>OpenRouter Broadcast settings</strong>.</p><p><strong>Vision Models Are Getting Practical (and Weirdly Competitive)</strong></p><p>Vision‑language models quietly had a massive week.</p><p><strong>Jina‑VLM: Small, Multilingual, and Very Good at Docs</strong></p><p>Jina released a 2.4B VLM that’s absolutely dialed in on document understanding, multilingual VQA, and OCR‑heavy tasks. This is exactly the kind of model you’d want for PDFs, charts, scans, and messy real‑world docs — and it’s small enough to deploy without sweating too much.</p><p><a target="_blank" href="https://z.ai"><strong>Z.ai</strong></a><strong> GLM‑4.6V: Long Context, Tool Calling, Serious Agent Potential</strong></p><p>Z.ai’s GLM‑4.6V impressed us with its <strong>128K context</strong>, native tool calling from vision inputs, and strong performance on benchmarks like MathVista and WebVoyager. 
It’s one of the clearest examples yet of a VLM that’s actually built for <strong>agentic workflows</strong>, not just answering questions about images.</p><p>That said, I did run my unofficial “bee counting test” on it… and yeah, Gemini still wins there 😅</p><p><strong>Perceptron Isaac 0.2: Tiny Models, Serious Perception</strong></p><p>Perceptron’s Isaac 0.2 (1B and 2B variants) showed something I really like seeing: <strong>structured outputs, focus tools, and reliability</strong> in very small models. Watching a 2B model correctly identify, count, and point to objects in an image is still wild to me.</p><p>These are the kinds of models that make physical AI, robotics, and edge deployments actually feasible.</p><p>🧰 Tools: Cursor goes visual, and Google Stitch keeps getting scarier (in a good way)</p><p>Cursor: direct visual editing inside the codebase</p><p>Cursor shipped a new feature that lets you visually manipulate UI elements—click/drag/resize—directly in the editor. We lumped this under “tools” because it’s not just a nicety; it’s the next step in “IDE as design surface.”</p><p>Cursor is also iterating fast on debugging workflows. The meta trend: IDEs are turning into agent platforms, not text editors.</p><p>Stitch by Google: Gemini 3 Pro as default, plus clickable prototypes</p><p>I showed Stitch on the show because it’s one of the clearest examples of “distribution beats raw capability.”</p><p>Stitch (Google’s product born from the Galileo AI acquisition) is doing Shipmas updates and now defaults to “Thinking with Gemini 3 Pro.” It can generate complex UIs, export them, and even stitch multiple screens into prototypes. 
The killer workflow is exporting directly into AI Studio / agent tooling so you can go from UI idea → code → repo without playing copy-paste Olympics.</p><p>Site: <a target="_blank" href="https://stitch.withgoogle.com">stitch.withgoogle.com</a></p><p>🎬 Disney invests $1B into OpenAI (and Sora gets Disney characters)</p><p>This is the corporate story that made me do a double take.</p><p>Disney—arguably the most IP-protective company on Earth—is investing <strong>$1B</strong> into OpenAI and enabling use of Disney characters in <strong>Sora</strong>. That’s huge. It signals the beginning of a more explicit “licensed synthetic media” era, where major IP holders decide which model vendors get official access.</p><p>It also raises the obvious question: does Disney now go harder against other model providers that generate Disney-like content without permission?</p><p>We talked about how weird the timing is too, given Disney has also been sending legal pressure in the broader space. The next year of AI video is going to be shaped as much by licensing and distribution as by model quality.</p><p>Closing thoughts: the intelligence explosion is loud, messy, and accelerating</p><p>This episode had everything: open-source models catching up fast, foundation-level standardization around agents, a usage report that shows what developers <em>actually</em> do with LLMs, voice models getting dramatically better, and OpenAI shipping what looks like a serious “we’re not losing” answer to Gemini 3.</p><p>And yes: we’re also apparently putting GPUs in space.</p><p>Next week’s episode is our year recap, and—of course—we now have to update it because GPT‑5.2 decided to show up like the final boss.</p><p>If you missed any part of the show, check out the chapters in the podcast feed and jump around. See you next week.</p><p>TL;DR + Show Notes (links for everything)</p><p>Hosts</p><p>* <strong>Alex Volkov</strong> — AI Evangelist @ Weights & Biases: <a target="_blank" href="https://x.com/altryne">@altryne</a>. 
I host ThursdAI and spend an unhealthy amount of time trying to keep up with this firehose of releases.</p><p>* <strong>Co-hosts</strong> — <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="https://x.com/nisten">@nisten</a>, <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a>. Each of them brings a different “lens” (agents, infra, evaluation, open source, tooling), and it’s why the show works.</p><p>Open Source LLMs</p><p>* <strong>Essential AI — RNJ‑1 (8B base + instruct)</strong>: <a target="_blank" href="https://x.com/essential_ai/status/1997123628765524132">tweet</a>, <a target="_blank" href="https://www.essential.ai/blog/introducing-rnj-1">blog</a>, <a target="_blank" href="https://huggingface.co/EssentialAI/rnj-1-instruct">HF instruct</a>, <a target="_blank" href="https://huggingface.co/EssentialAI/rnj-1">HF base</a>. This is a from-scratch open pretrain led by Ashish Vaswani, and it’s one of the most important “Western open model” signals we’ve seen in a while.</p><p>* <strong>Mistral — Devstral 2 + Devstral Small 2 + Mistral Vibe</strong>: <a target="_blank" href="https://x.com/MistralAI/status/1998407332502405347">tweet</a>, <a target="_blank" href="https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512">Devstral Small 2 HF</a>, <a target="_blank" href="https://huggingface.co/mistralai/Devstral-2-123B">Devstral 2 HF</a>, <a target="_blank" href="https://mistral.ai/news/">news</a>, <a target="_blank" href="https://github.com/mistralai/mistral-vibe">mistral-vibe GitHub</a>. 
Devstral is open coding SOTA territory, and Vibe is Mistral’s swing at the CLI agent layer.</p><p>AI in Space</p><p>* <strong>Starcloud trains and runs an LLM in orbit on an H100</strong>: <a target="_blank" href="https://x.com/PhilipJohnston/status/1998771535230939261">Philip Johnston</a>, <a target="_blank" href="https://x.com/AdiOltean/status/1998769997431058927">Adi Oltean</a>, <a target="_blank" href="https://x.com/CNBC/status/1998773618742898933">CNBC</a>, <a target="_blank" href="https://x.com/karpathy/status/1998806260783919434">Karpathy reaction</a>. A satellite H100 trained nanoGPT on Shakespeare and ran Gemma inference, igniting a real debate about power, cooling, repairability, and future orbital compute economics.</p><p>Putnam Math Competition</p><p>* <strong>Nous Research — Nomos 1 (Putnam scoring run)</strong>: <a target="_blank" href="https://x.com/NousResearch/status/1998536543565127968">tweet</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nomos-1">HF</a>, <a target="_blank" href="https://github.com/NousResearch/nomos">GitHub harness</a>, <a target="_blank" href="https://www.hillclimb.ing/">Hillclimb</a>. This is a strong open-weight math reasoning model plus an open harness, and it shows how orchestration matters as much as raw weights.</p><p>* <strong>Axiom — AxiomProver formal Lean proofs on Putnam</strong>: <a target="_blank" href="https://x.com/axiommathai/status/1997767850279440715">tweet</a>, <a target="_blank" href="https://github.com/nanand2/aristotle_putnam25">repo</a>. 
Formal proofs are the “no excuses” version of math reasoning, and this is a serious milestone even if you argue about exact framing.</p><p>Big Company LLMs + APIs</p><p>* <strong>OpenAI — GPT‑5.2 release</strong>: <a target="_blank" href="https://x.com/altryne/status/1999235261381906573">Alex tweet</a>, <a target="_blank" href="https://x.com/OpenAI/status/1999182098859700363">OpenAI announcement</a>, <a target="_blank" href="https://x.com/arcprize/status/1999182732845547795">ARC Prize verification</a>, <a target="_blank" href="https://x.com/sama/status/1999184337460428962">Sam Altman tweet</a>. GPT‑5.2 brings major jumps in reasoning, long context, and agentic workflows, and it’s clearly positioned as an answer to the Gemini 3 era.</p><p>* <strong>OpenRouter x a16z — State of AI report (100T+ tokens)</strong>: <a target="_blank" href="https://x.com/OpenRouterAI/status/1996678816820089131">tweet</a>, <a target="_blank" href="https://openrouter.ai/state-of-ai">landing page</a>, <a target="_blank" href="https://openrouter.ai/assets/State-of-AI.pdf">PDF</a>. The report highlights the dominance of programming/agents, the rise of reasoning tokens, and real-world usage patterns that explain why everyone is shipping agent harnesses.</p><p>* <strong>Agentic AI Foundation under Linux Foundation (AAIF)</strong>: <a target="_blank" href="https://x.com/goose_oss/status/1998439050206654584">Goose tweet</a>, <a target="_blank" href="https://block.xyz/blog/agentic-ai-foundation">Block blog</a>, <a target="_blank" href="https://aaif.io/">aaif.io</a>, <a target="_blank" href="https://x.com/linuxfoundation/status/1998446415383871882">Linux Foundation tweet</a>. MCP + <a target="_blank" href="https://agents.md">AGENTS.md</a> + Goose moving into vendor-neutral governance is huge for interoperability and long-term ecosystem stability.</p><p>* <strong>Disney invests $1B into OpenAI / Sora characters</strong>: (covered on the show as a major IP + distribution moment). 
This is an early signal of licensed synthetic media becoming a first-class business line rather than a legal gray zone.</p><p>This week’s Buzz (W&B)</p><p>* <strong>OpenRouter Broadcast → W&B Weave tracing</strong>: <a target="_blank" href="https://openrouter.ai/settings/broadcast">Broadcast settings</a>. You can trace OpenRouter-based traffic into Weave with minimal setup, which is especially useful when you can’t (or don’t want to) instrument code directly.</p><p>Vision & Video</p><p>* <strong>Jina — jina‑VLM (2.4B)</strong>: <a target="_blank" href="https://x.com/JinaAI_/status/1997926488843190481">tweet</a>, <a target="_blank" href="https://arxiv.org/abs/2512.04032">arXiv</a>, <a target="_blank" href="https://huggingface.co/jinaai/jina-vlm">HF</a>, <a target="_blank" href="https://jina.ai/news/jina-vlm-a-state-of-the-art-2b-vlm-excelling-in-multilingual-vqa-document-understanding/">blog</a>. A compact multilingual VLM optimized for doc understanding and VQA.</p><p>* <a target="_blank" href="https://z.ai"><strong>Z.ai</strong></a><strong> — GLM‑4.6V + Flash</strong>: <a target="_blank" href="https://x.com/Zai_org/status/1998003287216517345">tweet</a>, <a target="_blank" href="https://huggingface.co/collections/zai-org/glm-46v">HF collection</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.6V">GLM‑4.6V</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.6V-Flash">Flash</a>, <a target="_blank" href="https://z.ai/blog/glm-4.6v">blog</a>. 
Strong open VLMs with tool calling and long context, even if my bee counting test still humbled it.</p><p>* <strong>Perceptron — Isaac 0.2 (1B/2B)</strong>: <a target="_blank" href="https://x.com/perceptroninc/status/1998812935821697363">tweet</a>, <a target="_blank" href="https://huggingface.co/PerceptronAI/Isaac-0.2-2B-Preview">HF 2B</a>, <a target="_blank" href="https://huggingface.co/PerceptronAI/Isaac-0.2-1B">HF 1B</a>, <a target="_blank" href="https://perceptron.inc/blog/introducing-isaac-0-2">blog</a>, <a target="_blank" href="https://perceptron.inc/demo">demo</a>. The Focus/zoom tooling and structured outputs point toward “VLMs as reliable perception modules,” not just chatty describers.</p><p>Voice & Audio</p><p>* <strong>Google DeepMind — Gemini 2.5 TTS (Flash + Pro)</strong>: <a target="_blank" href="https://x.com/GoogleAIStudio/status/1998876411734692107">AI Studio tweet</a>, <a target="_blank" href="https://x.com/googleaidevs/status/1998874506912538787">GoogleAI devs tweet</a>, <a target="_blank" href="https://blog.google/technology/developers/gemini-2-5-text-to-speech/">blog</a>, <a target="_blank" href="https://aistudio.google.com/generate-speech">AI Studio speech playground</a>. The key upgrades are control and consistency (emotion, pacing, multi-speaker) across many languages.</p><p>* <strong>OpenBMB — VoxCPM 1.5</strong>: <a target="_blank" href="https://x.com/OpenBMB/status/1998377261859582304">tweet</a>, <a target="_blank" href="https://huggingface.co/openbmb/VoxCPM1.5">HF</a>, <a target="_blank" href="https://github.com/OpenBMB/VoxCPM">GitHub</a>. Open TTS keeps getting better, and this release is especially interesting for fine-tuning and voice cloning workflows.</p><p>Tools</p><p>* <strong>Cursor — direct visual editing (new UI workflow)</strong>: (covered on the show as a major step toward “IDE as design surface”). 
Cursor continues to push the agentic IDE category into new territory.</p><p>* <strong>Stitch by Google — Shipmas updates + Gemini 3 Pro “Thinking” + Prototypes</strong>: <a target="_blank" href="https://x.com/stitchbygoogle/status/1998837129905058198">tweet 1</a>, <a target="_blank" href="https://x.com/stitchbygoogle/status/1998874499702468828">tweet 2</a>, <a target="_blank" href="https://stitch.withgoogle.com">site</a>, plus background articles: <a target="_blank" href="https://techcrunch.com/2025/05/20/google-launches-stitch-an-ai-powered-tool-to-help-design-apps/">TechCrunch launch</a>, <a target="_blank" href="https://www.techinasia.com/news/google-acquires-aidriven-ui-startup-galileo-ai">acquisition detail</a>. Stitch is turning prompt-to-UI into a full prototype-to-code pipeline with real export paths.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-11-gpt-52-is-here-plus</link><guid isPermaLink="false">substack:post:181390245</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 12 Dec 2025 02:51:52 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/181390245/42da04deae5dce126776bb45a8150881.mp3" length="93139784" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5821</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/181390245/9df8dffdd2184e883c1c5a284a666e2a.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 4, 2025 - DeepSeek V3.2 Goes Gold Medal, Mistral Returns to Apache 2.0, OpenAI Hits Code Red, and US-Trained MOEs Are Back!]]></title><description><![CDATA[<p>Hey yall, Alex here 🫡 </p><p>Welcome to the first ThursdAI of December! 
Snow is falling in Colorado, and AI releases are falling even harder. This week was genuinely one of those “drink from the firehose” weeks where every time I refreshed my timeline, another massive release had dropped.</p><p>We kicked off the show asking our co-hosts for their top AI pick of the week, and the answers were all over the map: Wolfram was excited about Mistral’s return to Apache 2.0, Yam couldn’t stop talking about Claude Opus 4.5 after a full week of using it, and Nisten came out of left field with an AWQ quantization of Prime Intellect’s model that apparently runs incredibly fast on a single GPU. As for me? I’m torn between Opus 4.5 (which literally fixed bugs that Gemini 3 created in my code) and DeepSeek’s gold-medal winning reasoning model.</p><p>Speaking of which, let’s dive into what happened this week, starting with the open source stuff that’s been absolutely cooking. </p><p><strong>Open Source LLMs</strong></p><p><strong>DeepSeek V3.2: The Whale Returns with Gold Medals</strong></p><p>The whale is back, folks! DeepSeek released two major updates this week: V3.2 and V3.2-Speciale. And these aren’t incremental improvements—we’re talking about an open reasoning-first model that’s rivaling GPT-5 and Gemini 3 Pro with actual gold medal Olympiad wins.</p><p>Here’s what makes this release absolutely wild: DeepSeek V3.2-Speciale is achieving 96% on AIME versus 94% for GPT-5 High. It’s getting gold medals on IMO (35/42), CMO, ICPC (10/12), and IOI (492/600). This is a 685 billion parameter MOE model with MIT license, and it literally broke the benchmark graph on HMMT 2025—the score was so high it went outside the chart boundaries. That’s how you DeepSeek, basically.</p><p>But it’s not just about reasoning. The regular V3.2 (not Speciale) is absolutely crushing it on agentic benchmarks: 73.1% on SWE-Bench Verified, first open model over 35% on Tool Decathlon, and 80.3% on τ²-bench. 
It’s now the second most intelligent open weights model and ranks ahead of Grok 4 and Claude Sonnet 4.5 on Artificial Analysis.</p><p>The price is what really makes this insane: 28 cents per million tokens on OpenRouter. That’s absolutely ridiculous for this level of performance. They’ve also introduced DeepSeek Sparse Attention (DSA) which gives you 2-3x cheaper 128K inference without performance loss. LDJ pointed out on the show that he appreciates how transparent they’re being about not quite matching Gemini 3’s efficiency on reasoning tokens, but it’s open source and incredibly cheap.</p><p>One thing to note: V3.2-Speciale doesn’t support tool calling. As Wolfram pointed out from the model card, it’s “designed exclusively for deep reasoning tasks.” So if you need agentic capabilities, stick with the regular V3.2.</p><p>Check out the <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2">full release on Hugging Face</a> or read the <a target="_blank" href="https://platform.deepseek.com/blog/deepseek-v3-2">announcement</a>.</p><p>Mistral 3: Europe’s Favorite AI Lab Returns to Apache 2.0</p><p>Mistral is back, and they’re back with fully open Apache 2.0 licenses across the board! This is huge news for the open source community. They released two major things this week: Mistral Large 3 and the Ministral 3 family of small models.</p><p>Mistral Large 3 is a 675 billion parameter MOE with 41 billion active parameters and a quarter million (256K) context window, trained on 3,000 H200 GPUs. There’s been some debate about this model’s performance, and I want to address the elephant in the room: some folks saw a screenshot showing Mistral Large 3 very far down on Artificial Analysis and started dunking on it. But here’s the key context that Merve from Hugging Face pointed out—this is the only non-reasoning model on that chart besides GPT 5.1. 
When you compare it to other instruction-tuned (non-reasoning) models, it’s actually performing quite well, sitting at #6 among open models on LMSys Arena.</p><p>Nisten checked LM Arena and confirmed that on coding specifically, Mistral Large 3 is scoring as one of the best open source coding models available. Yam made an important point that we should compare Mistral to other open source players like Qwen and DeepSeek rather than to closed models—and in that context, this is a solid release.</p><p>But the real stars of this release are the Ministral 3 small models: 3B, 8B, and 14B, all with vision capabilities. These are edge-optimized, multimodal, and the 3B actually runs completely in the browser with WebGPU using transformers.js. The 14B reasoning variant achieves 85% on AIME 2025, which is state-of-the-art for its size class. Wolfram confirmed that the multilingual performance is excellent, particularly for German.</p><p>There’s been some discussion about whether Mistral Large 3 is a DeepSeek finetune given the architectural similarities, but Mistral claims these are fully trained models. As Nisten noted, even if they used similar architecture (which is Apache 2.0 licensed), there’s nothing wrong with that—it’s an excellent architecture that works. Lucas Atkins later confirmed on the show that “Mistral Large looks fantastic... it is DeepSeek through and through architecture wise. But Kimi also does that—DeepSeek is the GOAT. 
Training MOEs is not as easy as just import deepseek and train.”</p><p>Check out <a target="_blank" href="https://huggingface.co/collections/mistralai/mistral-large-3">Mistral Large 3</a> and <a target="_blank" href="https://huggingface.co/collections/mistralai/ministral-3">Ministral 3</a> on Hugging Face.</p><p>Arcee Trinity: US-Trained MOEs Are Back</p><p>We had Lucas Atkins, CTO of Arcee AI, join us on the show to talk about their new Trinity family of models, and this conversation was packed with insights about what it takes to train MOEs from scratch in the US.</p><p>Trinity is a family of open-weight MOEs fully trained end-to-end on American infrastructure with 10 trillion curated tokens from Datology.ai. They released Trinity-Mini (26B total, 3B active) and Trinity-Nano-Preview (6B total, 1B active), with Trinity-Large (420B parameters, 13B active) coming in mid-January 2026.</p><p>The benchmarks are impressive: Trinity-Mini hits 84.95% on MMLU (0-shot), 92.1% on Math-500, and 65% on GPQA Diamond. But what really caught my attention was the inference speed—Nano generates at 143 tokens per second on llama.cpp, and Mini hits 157 t/s on consumer GPUs. They’ve even demonstrated it running on an iPhone via MLX Swift.</p><p>I asked Lucas why it matters where models come from, and his answer was nuanced: for individual developers, it doesn’t really matter—use the best model for your task. But for Fortune 500 companies, compliance and legal teams are getting increasingly particular about where models were trained and hosted. This is slowing down enterprise AI adoption, and Trinity aims to solve that.</p><p>Lucas shared a fascinating insight about why they decided to do full pretraining instead of just post-training on other people’s checkpoints: “We at Arcee were relying on other companies releasing capable open weight models... 
I didn’t like the idea of the foundation of our business being reliant on another company releasing models.” He also dropped some alpha about Trinity-Large: they’re going with 13B active parameters instead of 32B because going sparser actually gave them much faster throughput on Blackwell GPUs.</p><p>The conversation about MOEs being cheaper for RL was particularly interesting. Lucas explained that because MOEs are so inference-efficient, you can do way more rollouts during reinforcement learning, which means more RL benefit per compute dollar. This is likely why we’re seeing labs like MiniMax go from their original 456B/45B-active model to a leaner 220B/10B-active model—they can get more gains in post-training by being able to do more steps.</p><p>Check out <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Mini">Trinity-Mini</a> and <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Nano-Preview">Trinity-Nano-Preview</a> on Hugging Face, or read <a target="_blank" href="https://www.arcee.ai/blog/the-trinity-manifesto">The Trinity Manifesto</a>.</p><p><strong>OpenAI Code Red: Panic at the Disco (and Garlic?)</strong></p><p>It was ChatGPT’s 3rd birthday this week (Nov 30th), but the party vibes seem… stressful. Reports came out that Sam Altman has declared a “Code Red” at OpenAI.</p><p>Why? <strong>Gemini 3.</strong> The user numbers don’t lie. ChatGPT apparently saw a 6% drop in daily active users following the Gemini 3 launch. Google’s integration is just too good, and their free tier is compelling.</p><p>In response, OpenAI has supposedly paused “side projects” (ads, shopping bots) to focus purely on model intelligence and speed. Rumors point to a secret model codenamed <strong>“Garlic”</strong>—a leaner, more efficient model that beats Gemini 3 and Claude Opus 4.5 on coding and reasoning, targeting a release in early 2026 (or maybe sooner if they want to save Christmas).</p><p>Wolfram and Yam nailed the sentiment here: Integration wins. 
Wolfram’s family uses Gemini because it’s <em>right there</em> on the Pixel, controlling the lights and calendar. OpenAI needs to catch up not just on IQ, but on being helpful in the moment.</p><p>After the live show, OpenAI also finally added GPT-5.1 Codex Max (which we covered two weeks ago) to their API, and it’s now available in Cursor, for free, until Dec 11! </p><p>Amazon Nova 2: Enterprise Push with Serious Agentic Chops</p><p>Amazon came back swinging with Nova 2, and the jump on Artificial Analysis is genuinely impressive—from around 30% to 61% on their index. That’s a massive improvement.</p><p>The family includes Nova 2 Lite (7x cheaper, 5x faster than Nova Premier), Nova 2 Pro (93% on τ²-Bench Telecom, 70% on SWE-Bench Verified), Nova 2 Sonic (speech-to-speech with 1.39s time-to-first-audio), and Nova 2 Omni (unified text/image/video/speech with 1M token context window—you can upload 90 minutes of video!).</p><p>Gemini 3 Deep Think Mode</p><p>Google launched Gemini 3 Deep Think mode exclusively for AI Ultra subscribers, and it’s hitting some wild benchmarks: 45.1% on ARC-AGI-2 (a 2x SOTA leap using code execution), 41% on Humanity’s Last Exam, and 93.8% on GPQA Diamond. This builds on their Gemini 2.5 variants that earned gold medals at IMO and ICPC World Finals. The parallel reasoning approach explores multiple hypotheses simultaneously, but it’s compute-heavy—limited to 10 prompts per day at $77 per ARC-AGI-2 task.</p><p><strong>This Week’s Buzz: Mid-Training Evals are Here!</strong></p><p>A huge update from us at Weights & Biases this week: We launched <strong>LLM Evaluation Jobs</strong>. (<a target="_blank" href="https://docs.wandb.ai/models/launch?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec4">Docs</a>)</p><p>If you are training models or finetuning, you usually wait until the end to run your expensive benchmarks. 
Now, directly inside W&B, you can trigger evaluations on mid-training checkpoints.</p><p>It integrates with Inspect Evals (over 100 public benchmarks). You just point it to your checkpoint or an API endpoint (even OpenRouter!), select the evals (MMLU-Pro, GPQA, etc.), and we spin up the managed GPUs to run it. You get a real-time leaderboard of your runs vs. the field.</p><p>Also, a shoutout to users of <a target="_blank" href="https://neptune.ai"><strong>Neptune.ai</strong></a>—congrats on the acquisition by OpenAI, but since the service is shutting down, we have built a <a target="_blank" href="http://wandb.me/migrateneptune">migration script</a> to help you move your history over to W&B seamlessly. We aren’t going anywhere!</p><p><strong>Video & Vision: Physics, Audio, and Speed</strong></p><p>The multimodal space was absolutely crowded this week.</p><p>Runway Gen 4.5 (“Whisper Thunder”)</p><p>Runway revealed that the mysterious “Whisper Thunder” model topping the leaderboards is actually Gen 4.5. The key differentiator? Physics and multi-step adherence. It doesn’t have that “diffusion wobble” anymore. We watched a promo video where the shot changes every 3-4 seconds, and while it’s beautiful, it shows we still haven’t cracked super long consistent takes yet. But for 8-second clips? It’s apparently the new SOTA.</p><p>Kling 2.6: Do you hear that?</p><p>Kling hit back with <strong>Video 2.6</strong>, and the killer feature is <strong>Native Audio</strong>. I generated a clip of two people arguing, and the lip sync was perfect. Not “dubbed over” perfect, but actively generated with the video. It handles multi-character dialogue, singing, and SFX. It’s huge for creators.</p><p>Kling was on a roll this week, releasing not one but two video models (O1 Video is an omni-modal one that takes text, images, and audio as inputs), while O1 Image and Kling Avatar 2.0 are also great updates! 
(Find all their releases on <a target="_blank" href="https://x.com/i/trending/1996381535348707447">X</a>)</p><p>P-Image: Sub-Second Generation at Half a Cent</p><p>Last week we talked about ByteDance’s Z-Image, which was super cool and super cheap. Well, this week Pruna AI came out with P-Image, which is even faster and cheaper: image generation under one second for $0.005, and editing under one second for $0.01.</p><p>I built a Chrome extension this week (completely rewritten by Opus 4.5, by the way—more on that in a second) that lets me play with these new image models inside the Infinite Craft game. When I tested P-Image Turbo against Z-Image, I was genuinely impressed by the quality at that speed. If you want quick iterations before moving to something like Nano Banana Pro for final 4K output, these sub-second models are perfect.</p><p>The extension is <a target="_blank" href="https://github.com/altryne/infinite-fun-extension">available on GitHub</a> if you want to try it—you just need to add your Replicate or Fal API keys.</p><p>SeeDream 4.5: ByteDance Levels Up</p><p>ByteDance also launched SeeDream 4.5 in open beta, with major improvements in detail fidelity, spatial reasoning, and multi-image reference fusion (up to 10 inputs for consistent storyboards). The text rendering is much sharper, and it supports multilingual typography including Japanese. Early tests show it competing well with Nano Banana Pro in prompt adherence and logic.</p><p>🎤 Voice & Audio</p><p>Microsoft VibeVoice-Realtime-0.5B</p><p>In a surprise drop, Microsoft open-sourced VibeVoice-Realtime-0.5B, a compact TTS model optimized for real-time applications. It delivers initial audible output in just 300 milliseconds while generating up to 10 minutes of speech. The community immediately started creating mirrors because, well, Microsoft has a history of releasing things on Hugging Face and then having legal pull them down. 
Get it while it’s hot!</p><p><strong>Use Cases: Code, Cursors, and “Antigravity”</strong></p><p>We wrapped up with some killer practical tips:</p><p>* <strong>Opus 4.5 is a beast:</strong> As I mentioned, using Opus inside Cursor’s “Ask” mode is currently the supreme coding experience. It debugs logic flaws that Gemini misses completely. I also used Opus as a prompt engineer for my infographics, and it absolutely demolished GPT at creating the specific layouts I needed.</p><p>* <strong>Google’s Secret:</strong> Nisten dropped a bomb at the end of the show—<strong>Opus 4.5 is available for free inside Google’s Antigravity (and Colab)!</strong> If you want to try the model that’s beating GPT-5 without paying, go check Antigravity now before they patch it or run out of compute.</p><p>* <strong>Microsoft VibeVoice:</strong> A surprise drop of a 0.5B speech model on HuggingFace that does real-time TTS (300ms latency). It was briefly unclear whether it would stay up, but mirrors are already everywhere.</p><p>That’s a wrap for this week, folks. Next week is probably going to be our final episode of the year, so we’ll be doing recaps and looking at our predictions from last year. Should be fun to see how wrong we were about everything!</p><p>Thank you for tuning in. If you missed the live stream, subscribe to our <a target="_blank" href="https://thursdai.substack.com">Substack</a>, <a target="_blank" href="https://thursdai.news/yt">YouTube</a>, and wherever you get your podcasts. 
See you next Thursday!</p><p>TL;DR and Show Notes</p><p><strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="https://x.com/nisten">@nisten</a>, <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* Guest - Lucas Atkins (<a target="_blank" href="https://x.com/latkins">@latkins</a>) - CTO Arcee AI</p><p><strong>Open Source LLMs</strong></p><p>* DeepSeek V3.2 and V3.2-Speciale - Gold medal olympiad wins, MIT license (<a target="_blank" href="https://x.com/deepseek_ai/status/1995452641430651132">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2">HF V3.2</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale">HF Speciale</a>, <a target="_blank" href="https://platform.deepseek.com/blog/deepseek-v3-2">Announcement</a>)</p><p>* Mistral 3 family - Large 3 and Ministral 3, Apache 2.0 (<a target="_blank" href="https://x.com/MistralAI/status/1995872766177018340">X</a>, <a target="_blank" href="https://mistral.ai/news/mistral-3/">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/mistralai/mistral-large-3">HF Large</a>, <a target="_blank" href="https://huggingface.co/collections/mistralai/ministral-3">HF Ministral</a>)</p><p>* Arcee Trinity - US-trained MOE family (<a target="_blank" href="https://x.com/latkins/status/1995592664637665702">X</a>, <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Mini">HF Mini</a>, <a target="_blank" href="https://huggingface.co/arcee-ai/Trinity-Nano-Preview">HF Nano</a>, <a target="_blank" href="https://www.arcee.ai/blog/the-trinity-manifesto">Blog</a>)</p><p>* Hermes 4.3 - Decentralized training, SOTA RefusalBench (<a 
target="_blank" href="https://x.com/nousresearch">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-4.3-36B">HF</a>)</p><p><strong>Big CO LLMs + APIs</strong></p><p>* OpenAI Code Red - ChatGPT 3rd birthday, Garlic model in development (<a target="_blank" href="https://www.theinformation.com/articles/openai-developing-garlic-model-counter-googles-recent-gains">The Information</a>)</p><p>* Amazon Nova 2 - Lite, Pro, Sonic, and Omni models (<a target="_blank" href="https://x.com/amazonnews/status/1995898375649050753">X</a>, <a target="_blank" href="https://aws.amazon.com/blogs/aws/introducing-amazon-nova-2-lite-a-fast-cost-effective-reasoning-model/">Blog</a>)</p><p>* Gemini 3 Deep Think - 45.1% ARC-AGI-2 (<a target="_blank" href="https://x.com/GeminiApp/status/1996656314983109003">X</a>, <a target="_blank" href="https://blog.google/products/gemini/gemini-3-deep-think/">Blog</a>)</p><p>* Cursor + GPT-5.1-Codex-Max - Free until Dec 11 (<a target="_blank" href="https://x.com/cursor_ai/status/1996645841063604711">X</a>, <a target="_blank" href="https://cursor.com/blog/codex-model-harness">Blog</a>)</p><p><strong>This Week’s Buzz</strong></p><p>* WandB LLM Evaluation Jobs - Evaluate any OpenAI-compatible API (<a target="_blank" href="https://x.com/wandb/status/1995921086257791070">X</a>, <a target="_blank" href="https://wandb.ai/site/articles/llm-evaluation-jobs">Announcement</a>)</p><p><strong>Vision & Video</strong></p><p>* Runway Gen-4.5 - #1 on text-to-video leaderboard, 1,247 Elo (<a target="_blank" href="https://x.com/runwayml/status/1995493445243461846">X</a>)</p><p>* Kling VIDEO 2.6 - First native audio generation (<a target="_blank" href="https://x.com/Kling_ai/status/1996238606814593196">X</a>)</p><p>* Kling O1 Image - Image generation (<a target="_blank" href="https://x.com/Kling_ai/status/1995741899517542818">X</a>)</p><p><strong>Voice & Audio</strong></p><p>* Microsoft VibeVoice-Realtime-0.5B - 300ms latency TTS (<a target="_blank" 
href="https://x.com/Presidentlin/status/1996461134388625628">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B">HF</a>)</p><p><strong>AI Art & Diffusion</strong></p><p>* Pruna P-Image - Sub-second generation at $0.005 (<a target="_blank" href="https://x.com/PrunaAI/status/1995524846948700495">X</a>, <a target="_blank" href="https://pruna.ai/p-image">Blog</a>, <a target="_blank" href="https://demo.pruna.ai">Demo</a>)</p><p>* SeeDream 4.5 - Multi-reference fusion, text rendering (<a target="_blank" href="https://x.com/BytePlusGlobal/status/1996212339096576463">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-4-2025-deepseek-v32</link><guid isPermaLink="false">substack:post:180754799</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 05 Dec 2025 01:03:51 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180754799/3e652d576663d8e2c44a397acab2769d.mp3" length="90151019" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5634</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/180754799/ef926bf70d3055cf1b6a029cd93e1d8e.jpg"/></item><item><title><![CDATA[ThursdAI Special: Google's New Anti-Gravity IDE, Gemini 3 & Nano Banana Pro Explained (ft. 
Kevin Hou, Ammaar Reshi & Kat Kampf)]]></title><description><![CDATA[<p>Hey, Alex here, </p><p>I recorded these conversations just in front of the AI Engineer auditorium, back to back, after these great folks gave their talks, and at the height of the most epic AI week we’ve seen since I started recording ThursdAI.</p><p>This is less our traditional live recording, and more a real podcast-y conversation with great folks, inspired by <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a>. I hope you enjoy this format as much as I’ve enjoyed recording and editing it.  </p><p>AntiGravity with Kevin</p><p>Kevin Hou and team just launched Antigravity, Google’s brand new Agentic IDE based on VSCode, and Kevin (a second-timer on ThursdAI) was awesome enough to hop on and talk about some of the product decisions they made and what makes Antigravity special, highlighting Artifacts as a completely new primitive. </p><p>Gemini 3 in AI Studio</p><p>If you aren’t using Google’s AI Studio (<a target="_blank" href="http://ai.dev">ai.dev</a>) then you’re missing out! We talk about AI Studio all the time on the show, and I’m a daily user! I generate most of my images with Nano Banana Pro in there, and most of my Gemini conversations are happening there as well! </p><p>Ammaar and Kat were so fun to talk to, as they covered the newly shipped “build mode” which allows you to vibe code full apps and experiences inside AI Studio, and we also covered Gemini 3’s features, multimodal understanding, and UI capabilities. </p><p>These folks gave a LOT of Gemini 3 demos, so they know everything there is to know about this model’s capabilities! 
</p><p>Tried new things with this one: multi-camera angles, conversations with great folks. If you found this content valuable, please subscribe :) </p><p><strong>Topics Covered:</strong></p><p>* Inside Google’s new “AntiGravity” IDE</p><p>* How the “Agent Manager” changes coding workflows</p><p>* Gemini 3’s new multimodal capabilities</p><p>* The power of “Artifacts” and dynamic memory</p><p>* Deep dive into AI Studio updates & Vibe Coding</p><p>* Generating 4K assets with Nano Banana Pro</p><p>Timestamps for your viewing convenience. </p><p>00:00 - Introduction and Overview</p><p>01:13 - Conversation with Kevin Hou: Anti-Gravity IDE</p><p>01:58 - Gemini 3 and Nano Banana Pro Launch Insights</p><p>03:06 - Innovations in Anti-Gravity IDE</p><p>06:56 - Artifacts and Dynamic Memory</p><p>09:48 - Agent Manager and Multimodal Capabilities</p><p>11:32 - Chrome Integration and Future Prospects</p><p>20:11 - Conversation with Ammaar and Kat: AI Studio Team</p><p>21:21 - Introduction to AI Studio</p><p>21:51 - What is AI Studio?</p><p>22:52 - Ease of Use and User Feedback</p><p>24:06 - Live Demos and Launch Week</p><p>26:00 - Design Innovations in AI Studio</p><p>30:54 - Generative UIs and Vibe Coding</p><p>33:53 - Nano Banana Pro and Image Generation</p><p>39:45 - Voice Interaction and Future Roadmap</p><p>44:41 - Conclusion and Final Thoughts</p><p>Looking forward to seeing you on Thursday 🫡 </p><p>P.S - I’ve recorded one more conversation during AI Engineer and will be posting it soon: same format, very interesting person, keep an eye out!  </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-special-googles-new-anti</link><guid isPermaLink="false">substack:post:180465671</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Tue, 02 Dec 2025 21:30:47 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180465671/7d757e4d5748694c067c6b9771f98460.mp3" length="33175023" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>2764</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/180465671/d20616e9d8e97982d45b8fa3ce042c09.jpg"/></item><item><title><![CDATA[🦃 ThursdAI - Thanksgiving special 25’ - Claude 4.5, Flux 2 & Z-image vs 🍌, MCP gets Apps + New DeepSeek!?]]></title><description><![CDATA[<p>Hey y’all, Happy Thanksgiving to everyone who celebrates, and thank you for being a subscriber, I truly appreciate each and every one of you!</p><p>Just wrapped up the third (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-openai">1</a>, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-24">2</a>) Thanksgiving special episode of ThursdAI, can you believe November is almost over? </p><p>We had another banger week in AI, with a full feast of AI releases: Anthropic dropped the long-awaited Opus 4.5, which quickly became the top coding LLM, DeepSeek resurfaced with a math model, BFL and Tongyi both tried to take on Nano Banana, and Microsoft dropped a 7B computer-use model in open source, plus Intellect-3 from Prime Intellect! </p><p>With so much news to cover, we also had an interview with Ido Sal & Liad Yosef (their second time on the show!) 
about MCP-Apps, the new standard they are spearheading together with Anthropic, OpenAI & more! </p><p>Exciting episode, let’s get into it! (P.S - I started generating infographics, so the show became much more visual, LMK if you like them) </p><p>ThursdAI - I put in a lot of work every week to bring you the live show, podcast and a sourced newsletter! Please subscribe if you find this content valuable!</p><p><strong>Anthropic’s Opus 4.5: The “Premier Intelligence” Returns (</strong><a target="_blank" href="https://www.anthropic.com/news/claude-opus-4-5"><strong>Blog</strong></a><strong>)</strong></p><p>Folks, Anthropic absolutely <em>cooked</em>. After Sonnet and Haiku had their time in the sun, the big brother is finally back. Opus 4.5 launched this week, and it is reclaiming the throne for coding and complex agentic tasks.</p><p>First off, the specs are monstrous. It hits <strong>80.9% on SWE-bench Verified</strong>, topping GPT-5.1 (77.9%) and Gemini 3 Pro (76.2%). But the real kicker? The price! It is now $5 per million input tokens and $25 per million output—literally one-third the cost of the previous Opus.</p><p>Yam, our resident coding wizard, put it best during the show: “Opus knows a lot of tiny details about the stack that you didn’t even know you wanted... It feels like it can go forever.” Unlike Sonnet, which sometimes spirals or loses context on extremely long tasks, Opus 4.5 maintains coherence deep into the conversation.</p><p>Anthropic also introduced a new <strong>“Effort” parameter</strong>, allowing you to control how hard the model thinks (similar to o1 reasoning tokens). Set it to high, and you get massive performance gains; set it to medium, and you get Sonnet-level performance at a fraction of the token cost. 
Plus, they’ve <a target="_blank" href="https://www.anthropic.com/engineering/advanced-tool-use">added</a> <strong>Tool Search </strong>(cutting enormous token overhead for agents with many tools) and <strong>Programmatic Tool Calling</strong>, which effectively lets Opus write and execute code loops to manage data.</p><p>If you are doing heavy software engineering or complex automations, Opus 4.5 is the new daily driver.</p><p><strong>📱 The Agentic Web: MCP Apps & MCP-UI Standard</strong></p><p>Speaking of MCP updates, can you believe it’s been exactly one year since the Model Context Protocol (MCP) launched? We’ve been “MCP-pilled” for a while, but this week, the ecosystem took a massive leap forward.</p><p>We brought back our friends <strong>Ido and Liad</strong>, the creators of <strong>MCP-UI</strong>, to discuss huge news: MCP-UI has been officially standardized as <strong>MCP Apps</strong>. This is a joint effort adopted by both Anthropic and OpenAI.</p><p><strong>Why does this matter?</strong> Until now, when an LLM used a tool (like Spotify or Zillow), the output was just text. It lost the brand identity and the user experience. With MCP Apps, agents can now render full, interactive HTML interfaces directly inside the chat! </p><p>Ido and Liad explained that they worked hard to avoid an “iOS vs. Android” fragmentation war. 
Instead of every lab building their own proprietary app format, we now have a unified standard for the “Agentic Web.” This is how AI stops being a chatbot and starts being an operating system.</p><p>Check out the standard at <a target="_blank" href="https://mcpui.dev"><strong>mcpui.dev</strong></a>.</p><p><strong>🦃 The Open Source Thanksgiving Feast</strong></p><p>While the big labs were busy, the open-source community decided to drop enough papers and weights to feed us for a month.</p><p><strong>Prime Intellect unveils INTELLECT-3, a 106B MoE </strong>(<a target="_blank" href="https://x.com/PrimeIntellect/status/1993895068290388134">X</a>, <a target="_blank" href="https://huggingface.co/PrimeIntellect/INTELLECT-3">HF</a>, <a target="_blank" href="https://www.primeintellect.ai/blog/intellect-3">Blog</a>, <a target="_blank" href="https://chat.primeintellect.ai/">Try It</a>)</p><p>INTELLECT-3 is a 106B-parameter Mixture-of-Experts model (12B active params) based on GLM-4.5-Air that achieves state-of-the-art performance for its size—including ~90% on AIME 2024/2025 math contests, 69% on LiveCodeBench v6 coding, 74% on GPQA-Diamond reasoning, and 74% on MMLU-Pro—outpacing larger models like DeepSeek-R1. </p><p>Trained over two months on 512 H200 GPUs using their fully open-sourced end-to-end stack (PRIME-RL async trainer, Verifiers & Environments Hub, Prime Sandboxes), it’s now hosted on Hugging Face, OpenRouter, Parasail, and Nebius, empowering any team to scale frontier RL without big-lab resources. Especially notable is their very detailed <a target="_blank" href="https://www.primeintellect.ai/blog/intellect-3">release blog</a>, covering how a lab that had previously trained 32B models fine-tuned a monster 106B MoE model! 
</p><p><strong>Tencent’s HunyuanOCR: Small but Mighty </strong>(<a target="_blank" href="https://x.com/TencentHunyuan/status/1993202595264131436">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanOCR">HF</a>, <a target="_blank" href="https://github.com/Tencent-Hunyuan/HunyuanOCR">Github</a>, <a target="_blank" href="https://hunyuan.tencent.com/vision/zh">Blog</a>)</p><p>Tencent released <strong>HunyuanOCR</strong>, a 1 billion parameter model that is absolutely crushing benchmarks. It scored <strong>860 on OCRBench</strong>, beating massive models like Qwen3-VL-72B. It’s an end-to-end model, meaning no separate detection and recognition steps. Great for parsing PDFs, docs, and even video subtitles. It’s heavily restricted (no EU/UK usage), but technically impressive.</p><p><strong>Microsoft’s Fara-7B: On-Device Computer Use</strong></p><p>Microsoft quietly dropped <strong>Fara-7B</strong>, a model fine-tuned from Qwen 2.5, specifically designed for agentic computer-use tasks. It hits <strong>73.5% on WebVoyager</strong>, beating OpenAI’s preview models, all while running locally on-device. This is the dream of a local agent that can browse the web for you, click buttons, and book flights without sending screenshots to the cloud.</p><p><strong>DeepSeek-Math-V2: open-weights IMO-gold math LLM </strong>(<a target="_blank" href="https://x.com/simonw">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2">HF</a>)</p><p>DeepSeek-Math-V2 is a 685B-parameter, Apache-2.0 licensed, open-weights mathematical reasoning model claiming gold-medal performance on IMO 2025 and CMO 2024, plus a near-perfect 118/120 on Putnam 2024. 
</p><p>Nisten did note some limitations—specifically that the context window can get choked up on extremely long, complex proofs—but having an open-weight model of this caliber is a gift to researchers everywhere.</p><p><strong>🐝 This Week’s Buzz: Serverless LoRA Inference</strong></p><p>A huge update from us at <strong>Weights & Biases</strong>! We know fine-tuning is powerful, but serving those fine-tunes can be painful and expensive. We just launched <strong>Serverless LoRA Inference</strong>.</p><p>This means you can upload your small LoRA adapters (which you can train cheaply) to W&B Artifacts, and we will serve them instantly on <strong>CoreWeave</strong> GPUs on top of a base model. No cold starts, and no dedicated, expensive GPU instances for a single adapter.</p><p>I showed a demo of a “Mocking SpongeBob” model I trained in 25 minutes. It just adds that SaRcAsTiC tExT style to the Qwen 2.5 base model. You pass the adapter ID in the API call, and boom—customized intelligence instantly. You can get more details <a target="_blank" href="https://wandb.ai/wandb_fc/llm_tools/reports/Serverless-LoRA-inference-on-W-B-CoreWeave--Vmlldzo5MjUwNzAx">HERE</a> and get started with your own LoRA in this <a target="_blank" href="https://wandb.me/lora_nb">nice notebook</a> the team made! </p><p><strong>🎨 Visuals: Image & Video Generation Explosion</strong></p><p><strong>Flux.2: The Multi-Reference Image Creator from BFL </strong>(<a target="_blank" href="https://x.com/bfl_ml/status/1993345470945804563">X</a>, <a target="_blank" href="https://huggingface.co/black-forest-labs/FLUX.2-dev">HF</a>, <a target="_blank" href="https://bfl.ai/blog/flux-2">Blog</a>)</p><p><strong>Black Forest Labs</strong> released <strong>Flux.2</strong>, a series of models including a 32B Flux 2[DEV]. The killer feature here is <strong>Multi-Reference Editing</strong>. You can feed it up to 10 reference images to maintain character consistency, style, or specific objects. 
It also outputs native 4-megapixel images.</p><p>Honestly, the launch timing was rough, coming right after Google’s Nano Banana Pro and alongside Z-Image, but for precise, high-res editing, this is a serious tool.</p><p>Tongyi drops <strong>Z-Image Turbo</strong>: a 6B single-stream DiT for sub‑second, 8‑step text‑to‑image (<a target="_blank" href="https://github.com/Tongyi-MAI/Z-Image">GitHub</a>, <a target="_blank" href="https://huggingface.co/Tongyi-MAI/Z-Image-Turbo">Hugging Face</a>)</p><p><strong>Alibaba’s Tongyi Lab</strong> released <strong>Z-Image Turbo</strong>, a 6B parameter model that generates images in <em>sub-second</em> time on H800s (and super fast on consumer cards).</p><p>I built a demo to show just how fast this is. You know that “<a target="_blank" href="https://neal.fun/infinite-craft/">Infinite Craft</a>” game? I hooked it up to Z-Image Turbo so that every time you combine elements (like Pirate + Ghost), it instantly generates the image for “Ghost Pirate.” It changes the game completely when generation is this cheap and fast.</p><p><strong>HunyuanVideo 1.5 – open video gets very real</strong></p><p>Tencent also shipped <strong>HunyuanVideo 1.5</strong>, which they market as “the strongest open‑source video generation model.” For once, the tagline isn’t entirely hype.</p><p>Under the hood it’s an 8.3B‑parameter Diffusion Transformer (DiT) model with a 3D causal VAE in front. 
The VAE compresses videos aggressively in both space and time, and the DiT backbone models that latent sequence.</p><p>The important bits for you and me:</p><p>* It generates 5–10 second clips at 480p/720p with good motion coherence and physics.</p><p>* With FP16 or FP8 configs you can run it on a single consumer GPU with around 14GB VRAM.</p><p>* There’s a built‑in path to upsample to 1080p for “real” quality.</p><p><strong>LTX Studio Retake: Photoshop for Video </strong>(<a target="_blank" href="https://x.com/LTXStudio/status/1993715247031767298">X</a>, <a target="_blank" href="https://replicate.com/lightricks/ltx-2-retake">Try It</a>)</p><p>For the video creators, <strong>LTX Studio</strong> launched <strong>Retake</strong>. This isn’t just “regenerate video.” This allows you to select a specific 2-second segment of a video, change the dialogue (keeping the voice!), change the emotion, or edit the action, all for like $0.10. It blends it perfectly back into the original clip. We are effectively getting a “Director Mode” for AI video where you can fix mistakes without rolling the dice on a whole new generation.</p><p>A secret new model on the Arena called <strong>Whisper Thunder</strong> beats even Veo 3?</p><p>This was the surprise of the week: while new video models get released often, Veo 3 has been the top one for a while, and now we’re getting a reshuffling of the video giants! But... we don’t yet know who this video model is from! You can sometimes get its generations at the <a target="_blank" href="https://artificialanalysis.ai/video/arena">Artificial Analysis</a> video arena, and the outputs look quite awesome! 
</p><p><strong>Thanksgiving reflections from the ThursdAI team</strong></p><p>As we wrapped up the show, Wolfram suggested we take a moment to think about what we’re thankful for in AI, and I think that’s a perfect note to end on.</p><p>Wolfram put it well: he’s thankful for everyone contributing to this wonderful community—the people releasing models, creating open source tools, writing tutorials, sharing knowledge. It’s not just about the money; it’s about the love of learning and building together.</p><p>Yam highlighted something I think is crucial: we’ve reached a point where there’s no real competition between open source and closed source anymore. Everything is moving forward together. Even if you think nobody’s looking at that random code you posted somewhere, chances are someone found it and used it to accelerate their own work. That collective effort is what’s driving this incredible pace of progress.</p><p>For me, I want to thank Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Ilya Polosukhin for the 2017 paper “Attention Is All You Need.” Half joking! But without that seminal paper, none of this AI would have been possible. But mostly I want to thank all of you—the audience, the co-hosts, the guests—for making ThursdAI what it is.</p><p>If you go back and watch our <a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-24">2024 Thanksgiving episode</a>, or the <a target="_blank" href="https://sub.thursdai.news/p/thursdai-thanksgiving-special-openai">one from 2023</a>, you’ll be shocked at how far we’ve come. Tools that seemed magical a year ago are now just... normal. That’s hedonic adaptation at work, but it’s also a reminder to stay humble and appreciate just how incredible this moment in history really is.</p><p>We’re living through the early days of a technological revolution, and we get to document it, experiment with it, and help shape where it goes. 
That’s something to be genuinely thankful for.</p><p>TL;DR and Show Notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a> <a target="_blank" href="https://x.com/nisten">@nisten</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* Guests: <a target="_blank" href="https://x.com/idosal1">@idosal1</a> <a target="_blank" href="https://x.com/liadyosef">@liadyosef</a> - MCP-UI/MCP Apps</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Anthropic launches <strong>Claude Opus 4.5</strong> - world’s top model for coding, agents, and tool use (<a target="_blank" href="https://x.com/claudeai/status/1993030546243699119">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-opus-4-5">Announcement</a>, <a target="_blank" href="https://www.anthropic.com/engineering/advanced-tool-use">Blog</a>)</p><p>* OpenAI Integrates ChatGPT <strong>Voice Mode</strong> Directly into Chats (<a target="_blank" href="https://x.com/OpenAI/status/1993381101369458763">X</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* Prime Intellect - <strong>INTELLECT-3</strong> 106B MoE (<a target="_blank" href="https://x.com/PrimeIntellect/status/1993895068290388134">X</a>, <a target="_blank" href="https://huggingface.co/PrimeIntellect/INTELLECT-3">HF</a>, <a target="_blank" href="https://www.primeintellect.ai/blog/intellect-3">Blog</a>, <a target="_blank" href="https://chat.primeintellect.ai/">Try It</a>)</p><p>* Tencent - <strong>HunyuanOCR</strong> 1B SOTA OCR model (<a target="_blank" href="https://x.com/TencentHunyuan/status/1993202595264131436">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanOCR">HF</a>, <a target="_blank" 
href="https://github.com/Tencent-Hunyuan/HunyuanOCR">Github</a>, <a target="_blank" href="https://hunyuan.tencent.com/vision/zh">Blog</a>)</p><p>* Microsoft - <strong>Fara-7B</strong> on-device computer-use agent (<a target="_blank" href="https://x.com/MSFTResearch/status/1993024319186674114">X</a>, <a target="_blank" href="https://www.microsoft.com/en-us/research/blog/fara-7b-best-in-class-7b-parameter-vision-language-model-for-computer-use/">Blog</a>, <a target="_blank" href="https://huggingface.co/microsoft/Fara-7B">HF</a>, <a target="_blank" href="https://github.com/microsoft/fara">Github</a>)</p><p>* DeepSeek - <strong>Math-V2</strong> IMO-gold math LLM (<a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-Math-V2">HF</a>)</p><p>* <strong>Interview: MCP Apps</strong></p><p>* MCP-UI standardized as <strong>MCP Apps</strong> by Anthropic and OpenAI (<a target="_blank" href="https://x.com/idosal1/status/1992636462186029233">X</a>, <a target="_blank" href="https://blog.modelcontextprotocol.io/posts/2025-11-21-mcp-apps-extending-servers-with-interactive-user-interfaces">Blog</a>, <a target="_blank" href="https://mcpui.dev">Announcement</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Tencent - <strong>HunyuanVideo 1.5</strong> lightweight DiT open video model (<a target="_blank" href="https://x.com/TencentHunyuan/status/1991721236855156984">X</a>, <a target="_blank" href="https://github.com/Tencent/HunyuanVideo">GitHub</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanVideo">HF</a>)</p><p>* LTX Studio - <strong>Retake</strong> AI video editing tool (<a target="_blank" href="https://x.com/LTXStudio/status/1993715247031767298">X</a>, <a target="_blank" href="https://replicate.com/lightricks/ltx-2-retake">Try It</a>)</p><p>* <strong>Whisper Thunder</strong> - mystery #1 ranked video model on arena</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Black Forest Labs - <strong>FLUX.2</strong> 32B multi-reference image model (<a 
target="_blank" href="https://x.com/bfl_ml/status/1993345470945804563">X</a>, <a target="_blank" href="https://huggingface.co/black-forest-labs/FLUX.2-dev">HF</a>, <a target="_blank" href="https://bfl.ai/blog/flux-2">Blog</a>)</p><p>* Tongyi - <strong>Z-Image Turbo</strong> sub-second 6B image gen (<a target="_blank" href="https://github.com/Tongyi-MAI/Z-Image">GitHub</a>, <a target="_blank" href="https://huggingface.co/Tongyi-MAI/Z-Image-Turbo">HF</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* W&B launches <strong>Serverless LoRA Inference</strong> on CoreWeave (<a target="_blank" href="https://x.com/wandb/status/1993032159985385978">X</a>, <a target="_blank" href="https://wandb.ai/wandb_fc/llm_tools/reports/Serverless-LoRA-inference-on-W-B-CoreWeave--Vmlldzo5MjUwNzAx">Blog</a>, <a target="_blank" href="https://wandb.me/lora_nb">Notebook</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-thanksgiving-special-25</link><guid isPermaLink="false">substack:post:180139147</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 27 Nov 2025 23:05:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/180139147/a0888afdb3652382ba6c49e66a9468ad.mp3" length="78045984" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4878</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/180139147/08a62bf18563f349f333ab2056222f07.jpg"/></item><item><title><![CDATA[📆 ThursdAI - the week that changed the AI landscape forever - Gemini 3, GPT codex max, Grok 4.1 & fast, SAM3 and Nano Banana Pro]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>I’m writing this one from a noisy hallway 
at the AI Engineer conference in New York, still riding the high (and the sleep deprivation) from what might be the craziest week we’ve ever had in AI.</p><p>In the span of a few days:</p><p>Google dropped <strong>Gemini 3 Pro</strong>, a new <strong>Deep Think mode</strong>, generative UIs, and a free agent-first IDE called <strong>Antigravity</strong>.</p><p>xAI shipped <strong>Grok 4.1</strong>, then followed it up with <strong>Grok 4.1 Fast</strong> plus an Agent Tools API.</p><p>OpenAI answered with <strong>GPT‑5.1‑Codex‑Max</strong>, a long‑horizon coding monster that can work for more than a day, and quietly upgraded ChatGPT Pro to <strong>GPT‑5.1 Pro</strong>.</p><p>Meta looked at all of that and said “cool, we’ll just segment literally everything and turn photos into 3D objects” with <strong>SAM 3 and SAM 3D</strong>.</p><p>Robotics folks dropped a home robot trained with almost no robot data.</p><p>And Google, just to flex, capped Thursday with Nano Banana Pro, a 4K image model and a provenance system, while we were already live! </p><p>For the first time in a while it doesn’t just feel like “new models came out.” It feels like the future actually clicked forward a notch.</p><p>This is why ThursdAI exists. Weeks like this are basically impossible to follow if you have a day job, so my co‑hosts and I do the no‑sleep version so you don’t have to. Plus, being at AI Engineer makes it easy to get super-high-quality guests, so this week we had three folks join us: Swyx from Cognition/Latent Space, Thor from DeepMind (on his 3rd day) and Dominik from OpenAI! Alright, deep breath. 
Let’s untangle the week.</p><p><strong>TL;DR </strong></p><p>If you only skim one section, make it this one (links in the end):</p><p>* <strong>Google</strong></p><p>* <strong>Gemini 3 Pro</strong>: 1M‑token multimodal model, huge reasoning gains - new LLM king</p><p>* <strong>ARC‑AGI‑2:</strong> 31.11% (Pro), <strong>45.14% (Deep Think)</strong> – enormous jumps</p><p>* <strong>Antigravity IDE</strong>: free, Gemini‑powered VS Code fork with agents, plans, walkthroughs, and browser control</p><p>* <strong>Nano Banana Pro</strong>: 4K image generation with <em>perfect</em> text + SynthID provenance; dynamic “generative UIs” in Gemini</p><p>* <strong>xAI</strong></p><p>* <strong>Grok 4.1</strong>: big post‑training upgrade – #1 on human‑preference leaderboards, much better EQ & creative writing, fewer hallucinations</p><p>* <strong>Grok 4.1 Fast + Agent Tools API</strong>: 2M context, SOTA tool‑calling & agent benchmarks (Berkeley FC, T²‑Bench, research evals), aggressive pricing and tight X + web integration</p><p>* <strong>OpenAI</strong></p><p>* <strong>GPT‑5.1‑Codex‑Max</strong>: “frontier agentic coding” model built for <strong>24h+</strong> software tasks with native <strong>compaction</strong> for million‑token sessions; big gains on SWE‑Bench, SWE‑Lancer, TerminalBench 2</p><p>* <strong>GPT‑5.1 Pro</strong>: new “research‑grade” ChatGPT mode that will happily think for minutes on a single query</p><p>* <strong>Meta</strong></p><p>* <strong>SAM 3</strong>: open‑vocabulary segmentation + tracking across images and video (with text & exemplar prompts)</p><p>* <strong>SAM 3D</strong>: single‑image → 3D objects & human bodies; surprisingly high‑quality 3D from one photo</p><p>* <strong>Robotics</strong></p><p>* <strong>Sunday Robotics – ACT‑1 & Memo</strong>: home robot foundation model trained from a <strong>$200 skill glove</strong> instead of $20K teleop rigs; long‑horizon household tasks with solid zero‑shot generalization</p><p>* <strong>Developer 
Tools</strong></p><p>* <strong>Antigravity</strong> and <strong>Marimo’s VS Code / Cursor extension</strong> both push toward agentic, reactive dev workflows</p><p><strong>Live from AI Engineer New York: Coding Agents Take Center Stage</strong></p><p>We recorded this week’s show on location at the AI Engineer Summit in New York, inside a beautiful podcast studio the team set up right on the expo floor. Huge shout out to Swyx, Ben, and the whole AI Engineer crew for that — last time I was balancing a mic on a hotel nightstand, this time I had broadcast‑grade audio while a robot dog tried to steal the show behind us.</p><p>This year’s summit theme is very on‑the‑nose for this week: <strong>coding agents</strong>.</p><p></p><p>Everywhere you look, there’s a company building an “agent lab” on top of foundation models. Amp, Cognition, Cursor, CodeRabbit, Jules, Google Labs, all the open‑source folks, and even the enterprise players like Capital One and Bloomberg are here, trying to figure out what it means to have real software engineers that are partly human and partly model.</p><p>Swyx framed it nicely when he said that if you take “vertical AI” seriously enough, you eventually end up building an agent lab. Lawyers, healthcare, finance, developer tools — they all converge on “agents that can reason and code.”</p><p>The big labs heard that theme loud and clear. Almost every major release this week is about agents, tools, and long‑horizon workflows, not just chat answers.</p><p><strong>Google Goes All In: Gemini 3 Pro, Antigravity, and the Agent Revolution</strong></p><p>Let’s start with Google because, after years of everyone asking “where’s Google?” in the AI race, they showed up this week with multiple bombshells that had even the skeptics impressed.</p><p><strong>Gemini 3 Pro: Multimodal Intelligence That Actually Delivers</strong></p><p>Google finally released Gemini 3 Pro, and the numbers are genuinely impressive. 
We’re talking about a 1 million token context window, massive benchmark improvements, and a model that’s finally competing at the very top of the intelligence charts. Thor from DeepMind joined us on the show (literally on day 3 of his new job!) and you could feel the excitement.</p><p>The headline numbers: Gemini 3 Pro with Deep Think mode achieved <strong>45.14% on ARC-AGI-2</strong>—that’s roughly double the previous state-of-the-art on some splits. For context, ARC-AGI has been one of those benchmarks that really tests genuine reasoning and abstraction, not just memorization. The standard Gemini 3 Pro hits 31.11% on the same benchmark; both scores are absolutely out of this world on ARC! </p><p>On GPQA Diamond, Gemini 3 Pro jumped about 10 points compared to prior models. We’re seeing roughly 81% on MMLU-Pro, and the coding performance is where things get really interesting—Gemini 3 Pro is scoring around 56% on SciCode, representing significant improvements in actual software engineering tasks.</p><p>But here’s what made Ryan from Amp switch their default model to Gemini 3 Pro immediately: the real-world usability. Ryan told us on the show that they’d never switched default models before, not even when GPT-5 came out, but Gemini 3 Pro was so noticeably better that they made it the default on Tuesday. Of course, they hit rate limits almost immediately (Google had to scale up fast!), but those have since been resolved.</p><p><strong>Antigravity: Google’s Agent-First IDE</strong></p><p>Then Google dropped Antigravity, and honestly, this might be the most interesting part of the whole release. It’s a free IDE (yes, free!) that’s basically a fork of VS Code, but reimagined around agents rather than human-first coding.</p><p>The key innovation here is something they call the “Agent Manager”—think of it like an inbox for your coding agents. 
Instead of thinking in folders and files, you’re managing conversations with agents that can run in parallel, handle long-running tasks, and report back when they need your input.</p><p>I got early access and spent time playing with it, and here’s what blew my mind: you can have multiple agents working on different parts of your codebase simultaneously. One agent fixing bugs, another researching documentation, a third refactoring your CSS—all at once, all coordinated through this manager interface.</p><p>The browser integration is crazy too. Antigravity can control Chrome directly, take screenshots and videos of your app, and then use those visuals to debug and iterate. It’s using Gemini 3 Pro for the heavy coding, and even Nano Banana for generating images and assets. The whole thing feels like it’s from a couple years in the future.</p><p>Wolfram on the show called out how good Gemini 3 is for creative writing too—it’s now his main model, replacing GPT-4.5 for German language tasks. The model just “gets” the intention behind your prompts rather than following them literally, which makes for much more natural interactions.</p><p><strong>Nano Banana Pro: 4K Image Generation With Thinking</strong></p><p>And because Google apparently wasn’t done announcing things, they also dropped Nano Banana Pro on Thursday morning—literally breaking news during our live show. This is their image generation model that now supports 4K resolution and includes “thinking” traces before generating.</p><p>I tested it live by having it generate an infographic about all the week’s AI news (which you can see on the top), and the results were wild. Perfect text across the entire image (no garbled letters!), proper logos for all the major labs, and compositional understanding that felt way more sophisticated than typical image models. 
The file it generated was 8 megabytes—an actual 4K image with stunning detail.</p><p></p><p>What’s particularly clever is that Nano Banana Pro is really Gemini 3 Pro doing the thinking and planning, then handing off to Nano Banana for the actual image generation. So you get multimodal reasoning about your request, then production-quality output. You can even upload reference images—up to 14 of them—and it’ll blend elements while maintaining consistency.</p><p>Oh, and every image is watermarked with SynthID (Google’s invisible watermarking tech) and includes C2PA metadata, so you can verify provenance. This matters as AI-generated content becomes more prevalent.</p><p><strong>Generative UIs: The Future of Interfaces</strong></p><p>One more thing Google showed off: generative UIs in the Gemini app. Wolfram demoed this for us, and it’s genuinely impressive. Instead of just text responses, Gemini can generate full interactive mini-apps on the fly—complete dashboards, data visualizations, interactive widgets—all vibe-coded in real time.</p><p>He asked for “four panels of the top AI news from last week” and Gemini built an entire news dashboard with tabs, live market data (including accurate pre-market NVIDIA stats!), model comparisons, and clickable sections. It pulled real information, verified facts, and presented everything in a polished UI that you could interact with immediately.</p><p>This isn’t just a demo—it’s rolling out in Gemini now. 
The implication is huge: we’re moving from static responses to dynamic, contextual interfaces generated just-in-time for your specific need.</p><p><strong>xAI Strikes Back: Grok 4.1 and the Agent Tools API</strong></p><p>Not to be outdone, xAI released Grok 4.1 at the start of the week, briefly claimed the #1 spot on LMArena (at 1483 Elo, ahead of even Gemini 3), and then followed up with Grok 4.1 Fast and a full Agent Tools API.</p><p><strong>Grok 4.1: Emotional Intelligence Meets Raw Performance</strong></p><p>Grok 4.1 brought some really interesting improvements. Beyond the benchmark numbers (a 64% win rate over the previous Grok in blind tests), what stood out was the emotional intelligence. On EQ-Bench3, Grok 4.1 Thinking scored <strong>1586 Elo</strong>, beating every other model including Gemini, GPT-5, and Claude.</p><p>The creative writing scores jumped by roughly 600 Elo points compared to earlier versions. And perhaps most importantly for practical use, hallucination rates dropped from around 12% to 4%—that’s roughly a 3x improvement in reliability on real user queries.</p><p>xAI’s approach here was clever: they used “frontier agentic reasoning models as reward models” during RL training, which let them optimize for subjective qualities like humor, empathy, and conversational style without just scaling up model size.</p><p><strong>Grok 4.1 Fast: The Agent Platform Play</strong></p><p>Then came Grok 4.1 Fast, released just yesterday, and this is where things get really interesting for developers. It’s got a <strong>2 million token context window</strong> (compared to Gemini 3’s 1 million) and was specifically trained for agentic, tool-calling workflows.</p><p>The benchmark performance is impressive: <strong>93-100% on τ²-Bench Telecom</strong> (customer support simulation), <strong>~72% on Berkeley Function Calling v4</strong> (top of the leaderboard), and strong scores across research and browsing tasks. 
But here’s the kicker: the pricing is aggressive.</p><p>At $0.20 per million input tokens and $0.50 per million output tokens, Grok 4.1 Fast is dramatically cheaper than GPT-5 and Claude while matching or exceeding their agentic performance. For the first two weeks, it’s completely free via the xAI API and OpenRouter, which is smart—get developers hooked on your agent platform.</p><p>The Agent Tools API gives Grok native access to X search, web browsing, code execution, and document retrieval. This tight integration with X is a genuine advantage—where else can you get real-time access to breaking news, sentiment, and conversation? Yam tested it on the show and confirmed that Grok will search Reddit too, which other models often refuse to do. I’ve used both these models this week in my N8N research agent and I gotta say, 4.1 fast is a MASSIVE improvement! </p><p><strong>OpenAI’s Endurance Play: GPT-5.1-Codex-Max and Pro</strong></p><p>OpenAI clearly saw Google and xAI making moves and decided they weren’t going to let this week belong to anyone else. They dropped two significant releases: GPT-5.1-Codex-Max and an update to GPT-5.1 Pro.</p><p><strong>GPT-5.1-Codex-Max: Coding That Never Stops</strong></p><p>This is the headline: GPT-5.1-Codex-Max can work autonomously for <strong>over 24 hours</strong>. Not 24 minutes, not 24 queries—24 actual hours on a single software engineering task. I talked to someone from OpenAI at the conference who told me internal checkpoints ran for nearly a <strong>week</strong> on and off.</p><p>How is this even possible? The secret is something OpenAI calls “compaction”—a native mechanism trained into the model that lets it prune and compress its working session history while preserving the important context. 
Think of it like the model taking notes on itself, discarding tool-calling noise and keeping only the critical design decisions and state.</p><p>The performance numbers back this up:</p><p>* <strong>SOTA 77.9% on SWE-Bench Verified</strong> (up from 73.7%)</p><p>* <strong>SOTA 79.9% on SWE-Lancer IC SWE</strong> (up from 66.3%)</p><p>* <strong>58.1% on TerminalBench 2.0</strong> (up from 52.8%)</p><p>And crucially, in medium reasoning mode, it uses <strong>30% fewer thinking tokens</strong> while achieving better results. There’s also an “Extra High” reasoning mode for when you truly don’t care about latency and just want maximum capability.</p><p>Yam, one of our co-hosts who’s been testing extensively, said you can feel the difference immediately. The model just “gets it” faster, powers through complex problems, and the earlier version’s quirk of ignoring your questions and just starting to code is fixed—now it actually responds and collaborates.</p><p>Dominik from OpenAI joined us on the show and confirmed that compaction was trained natively into the model using RL, similar to how Claude was trained natively for MCP. This means the model doesn’t waste reasoning tokens on maintaining context—it just knows how to do it efficiently.</p><p><strong>GPT-5.1 Pro: Research-Grade Intelligence & ChatGPT joins your group chat</strong></p><p>Then there’s GPT-5.1 Pro, which is less about coding and more about deep, research-level reasoning. This is the model that can run for 10-17 minutes on a single query, thinking through complex problems with the kind of depth that previously required human experts.</p><p>OpenAI also quietly rolled out group chats—basically, you can now have multiple people in a ChatGPT conversation together, all talking to the model simultaneously. Useful for planning trips, brainstorming with teams, or working through problems collaboratively. 
If agent mode works in group chats (we haven’t confirmed yet), that could get really interesting.</p><p><strong>Meta drops SAM3 & SAM3D - image and video segmentation models powered by natural language</strong></p><p>Phew ok, big lab releases now done, oh.. wait not yet! Because Meta has decided to also make a dent this week with SAM3 and SAM3D, both of which are crazy. I’ll just add their video release here instead of going on and on! </p><p><strong>This Week’s Buzz from Weights & Biases</strong></p><p>It’s been a busy week at Weights & Biases as well! We are proud Gold Sponsors of the AI Engineer conference here in NYC. If you’re at the event, please stop by our booth—we’re even giving away a $4,000 robodog!</p><p>This week, I want to highlight a fantastic update from <strong>Marimo</strong>, the reactive Python notebook company we acquired.</p><p>Marimo just shipped a native <strong>VS Code and Cursor extension</strong>. This brings Marimo’s reactive, Git-friendly notebooks directly into your favorite editors.</p><p>Crucially, it integrates deeply with uv for lightning-fast package installs and reproducible environments. If you import a package you don’t have, the extension prompts you to install it and records the dependency in the script metadata. This bridges the gap between experimental notebooks and production-ready code, and it’s a huge boost for AI-native development workflows. (<a target="_blank" href="https://marimo.io/blog">Blog</a>, <a target="_blank" href="https://github.com/marimo-team/marimo-lsp">GitHub</a>)</p><p><strong>The Future Arrived Early</strong></p><p>Phew... if you read all the way until this point, can you leave a ⚡ emoji in the comments? I was writing this and it.. is a lot! I was wondering who would even read all the way till here! </p><p>This week we felt the acceleration! 🔥 I can barely breathe, I need a nap! 
</p><p>A huge thank you to our guests—Ryan, Swyx, Thor, and Dominik—for navigating the chaos with us live on stage, and to the AI Engineer team for hosting us.</p><p>We’ll be back next week to cover whatever the AI world throws at us next. Stay tuned, because at this rate, AGI might be here by Christmas.</p><p>TL;DR - show notes and links</p><p><strong>Hosts and Co‑hosts</strong></p><p>* Alex Volkov – AI Evangelist at Weights & Biases / CoreWeave, host of ThursdAI (<a target="_blank" href="https://x.com/altryne">X</a>)</p><p>* Co‑hosts - Wolfram Ravenwolf (<a target="_blank" href="https://x.com/WolframRvnwlf">X</a>), Yam Peleg (<a target="_blank" href="https://x.com/yampeleg">X</a>), LDJ (<a target="_blank" href="https://x.com/ldjconfirmed">X</a>)</p><p><strong>Guests</strong></p><p>* Swyx – Founder of AI Engineer World’s Fair and Summit, now at Cognition (<a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a>, <a target="_blank" href="https://x.com/swyx">X</a>)</p><p>* Ryan Carson – Amp (<a target="_blank" href="https://x.com/ryancarson">X</a>)</p><p>* Thor Schaeff – Google DeepMind, Gemini API and AI Studio (<a target="_blank" href="https://x.com/thorwebdev">X</a>)</p><p>* Dominik Kundel – Developer Experience at OpenAI (<a target="_blank" href="https://x.com/dkundel">X</a>)</p><p><strong>Open Source LLMs</strong></p><p>* Allen Institute Olmo 3 - 7B/32B fully open reasoning suite with end-to-end training transparency (<a target="_blank" href="https://x.com/allen_ai/status/1991507983881379896">X</a>, <a target="_blank" href="https://allenai.org/olmo">Blog</a>)</p><p><strong>Big CO LLMs + APIs</strong></p><p>* Google Gemini 3 Pro - 1M-token, multimodal, agentic model with Generative UIs (<a target="_blank" href="https://x.com/altryne/status/1990812491304350130">X</a>, <a target="_blank" href="https://x.com/sundarpichai/status/1990812770762215649">X</a>, <a target="_blank" 
href="https://x.com/GoogleDeepMind/status/1990812966074376261">X</a>)</p><p>* Google Antigravity - Agent-first IDE powered by Gemini 3 Pro (<a target="_blank" href="https://antigravity.google/blog/introducing-google-antigravity">Blog</a>, <a target="_blank" href="https://x.com/GoogleDeepMind/status/1990827890435346787">X</a>)</p><p>* xAI Grok 4.1 and Grok 4.1 Thinking - big gains in Coding, EQ, creativity, and honesty (<a target="_blank" href="https://x.com/altryne/status/1990526775148097662">X</a>, <a target="_blank" href="https://x.ai/blog/grok-4-1">Blog</a>)</p><p>* xAI Grok 4.1 Fast and Agent Tools API - 2M-token context, state-of-the-art tool-calling (<a target="_blank" href="https://x.com/xai/status/1991284813727474073">X</a>)</p><p>* OpenAI GPT-5.1-Codex-Max - long-horizon agentic coding model for 24-hour+ software tasks (<a target="_blank" href="https://x.com/polynoamial/status/1991212955250327768">X</a>, <a target="_blank" href="https://x.com/OpenAIDevs/status/1991217488550359066">X</a>)</p><p>* OpenAI GPT-5.1 Pro - research-grade reasoning model in ChatGPT Pro</p><p>* Microsoft, NVIDIA, and Anthropic partnership - scaling Claude on Azure with massive GPU investments (<a target="_blank" href="https://www.anthropic.com/news/microsoft-nvidia-partnership">Announcement</a>, <a target="_blank" href="https://nvidianews.nvidia.com/news/nvidia-microsoft-anthropic-partnership">NVIDIA</a>, <a target="_blank" href="https://blogs.microsoft.com/2025/11/partnership-anthropic-nvidia-azure">Microsoft Blog</a>)</p><p><strong>This week’s Buzz</strong></p><p>* Marimo ships native VS Code & Cursor extension with reactive notebooks and uv-powered environments (<a target="_blank" href="https://x.com/marimo_io/status/1991207581763981722">X</a>, <a target="_blank" href="https://marimo.io/blog">Blog</a>, <a target="_blank" href="https://github.com/marimo-team/marimo-lsp">GitHub</a>)</p><p><strong>Vision & Video & 3D</strong></p><p>* Meta SAM 3 & SAM 3D - promptable segmentation, 
tracking, and single-image 3D reconstruction (<a target="_blank" href="https://x.com/AIatMeta/status/1991178519557046380">X</a>, <a target="_blank" href="https://ai.meta.com/blog/sam-3d/">Blog</a>, <a target="_blank" href="https://github.com/facebookresearch/segment-anything">GitHub</a>)</p><p><strong>AI Art & Diffusion</strong></p><p>* Google Nano Banana Pro and SynthID verification - 4K image generation with provenance (<a target="_blank" href="https://blog.google/technology/developers/gemini-3-developers/">Blog</a>)</p><p><strong>Show Notes and other Links</strong></p><p>* AI Engineer Summit NYC - Live from the conference</p><p>* Full livestream available on YouTube</p><p>* ThursdAI - Nov 20, 2025</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-the-week-that-changed-the</link><guid isPermaLink="false">substack:post:179506553</guid><dc:creator><![CDATA[Alex Volkov, Latent.Space, Dominik Kundel, and Ryan Carson]]></dc:creator><pubDate>Thu, 20 Nov 2025 23:27:38 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/179506553/90565c927e17c676a3c4297c7acda574.mp3" length="85643373" type="audio/mpeg"/><itunes:author>Alex Volkov, Latent.Space, Dominik Kundel, and Ryan Carson</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5353</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/179506553/e80fd7fe90494b45ed699d01c7712f89.jpg"/></item><item><title><![CDATA[GPT‑5.1’s New Brain, Grok’s 2M Context, Omnilingual ASR, and a Terminal UI That Sparks Joy]]></title><description><![CDATA[<p>Hey, this is Alex! </p><p>We’re finally so back! Tons of open source releases, OpenAI updates GPT and a few breakthroughs in audio as well, makes this a very dense week! 
</p><p>Today on the show, we covered the newly released GPT 5.1 update, a few open source releases like Terminal Bench and Project AELLA (renamed OSSAS), and Baidu’s Ernie 4.5 VL, which shows impressive visual understanding! </p><p>Also, we chatted with Paul from ElevenLabs and Dima Duev from the wandb SDK team, who brought us a delicious demo of LEET, our new TUI for wandb! </p><p>Tons of news coverage, let’s dive in 👇 (as always, links and show notes at the end) </p><p>Open Source AI</p><p>Let’s jump directly into Open Source, even though this week has seen some impressive big-company models. </p><p>Terminal-Bench 2.0 - a harder, highly‑verified coding and terminal benchmark (<a target="_blank" href="https://x.com/alexgshaw/status/1986911106108211461">X</a>, <a target="_blank" href="https://harborframework.com/">Blog</a>, <a target="_blank" href="https://www.tbench.ai/leaderboard">Leaderboard</a>)</p><p>We opened with Terminal‑Bench 2.0 plus its new harness, Harbor, because this is the kind of benchmark we’ve all been asking for. </p><p>Terminal‑Bench focuses on agentic coding in a real shell. Version 2.0 is a hard set of 89 terminal tasks, each one painstakingly vetted by humans and LLMs to make sure it’s solvable and realistic. Think “I checked out master and broke my personal site, help untangle the git mess” or “implement GPT‑2 code golf with the fewest characters.” </p><p>On the new leaderboard, top agents like Warp’s agentic console and Codex CLI + GPT‑5 sit around fifty percent success. That number is exactly what excites me: we’re nowhere near saturation. When everyone is in the 90‑something range, tiny 0.1 improvements are basically noise. When the best models are at fifty percent, a five‑point jump really means something.</p><p>A huge part of our conversation focused on reproducibility. We’ve seen other benchmarks like OSWorld turn out to be unreliable, with different task sets and non‑reproducible results making scores incomparable. 
Terminal‑Bench addresses this with Harbor, a harness designed to run sandboxed, containerized agent rollouts at scale in a consistent environment. This means results are actually comparable. It’s a ton of work to build an entire evaluation ecosystem like this, and with over a thousand contributors on their Discord, it’s a fantastic example of a healthy, community‑driven effort. This is one to watch! </p><p><strong>Baidu’s ERNIE‑4.5‑VL “Thinking”: a 3B visual reasoner that punches way up </strong>(<a target="_blank" href="https://x.com/Baidu_Inc/status/1988182106359411178">X</a>, <a target="_blank" href="https://huggingface.co/ERNIE/ERNIE-4.5-VL-28B-A3B-Thinking">HF</a>, <a target="_blank" href="https://github.com/ERNIE/ERNIE-4.5-VL-28B-A3B-Thinking">GitHub</a>)</p><p>Next up, Baidu dropped a really interesting model, ERNIE‑4.5‑VL‑28B‑A3B‑Thinking. This is a compact, 3B active‑parameter multimodal reasoning model focused on vision, and it’s much better than you’d expect for its size. Baidu’s own charts show it competing with much larger closed models like Gemini‑2.5‑Pro and GPT‑5‑High on a bunch of visual benchmarks like ChartQA and DocVQA.</p><p>During the show, I dropped a fairly complex chart into the demo, and ERNIE‑4.5‑VL gave me a clean textual summary almost instantly—it read the chart more cleanly than I could. The model is built to “think with images,” using dynamic zooming and spatial grounding to analyze fine details. It’s released under an Apache‑2.0 license, making it a serious candidate for edge devices, education, and any product where you need a cheap but powerful visual brain.</p><p><strong>Open Source Quick Hits: OSSAS, VibeThinker, and Holo Two</strong></p><p>We also covered a few other key open-source releases. Project AELLA was quickly rebranded to OSSAS (Open Source Summaries At Scale), an initiative to make scientific literature machine‑readable. They’ve released 100k paper summaries, two fine-tuned models for the task, and a 3D visualizer. 
It’s a niche but powerful tool if you’re working with massive amounts of research. (<a target="_blank" href="https://x.com/samhogan/status/1988306424309706938">X</a>, <a target="_blank" href="https://huggingface.co/inference-net">HF</a>)</p><p>WeiboAI (from the Chinese social media company) released VibeThinker‑1.5B, a tiny 1.5B‑parameter reasoning model that is making bold claims about beating the 671B DeepSeek R1 on math benchmarks. </p><p>We discussed the high probability of benchmark contamination, especially on tests like AIME24, but even with that caveat, getting strong chain‑of‑thought math out of a 1.5B model is impressive and useful for resource‑constrained applications.  (<a target="_blank" href="https://x.com/WeiboLLM/status/1988109435902832896">X</a>, <a target="_blank" href="https://huggingface.co/WeiboAI/VibeThinker-1.5B">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2511.06221">Arxiv</a>)</p><p>Finally, we had some breaking news mid‑show: H Company released Holo Two, their next‑gen multimodal agent for controlling desktops, websites, and mobile apps. It’s a fine‑tune of Qwen3‑VL and comes in 4B and 8B Apache‑2.0 licensed versions, pushing the open agent ecosystem forward. (<a target="_blank" href="https://x.com/hcompany_ai/status/1989013556134638039">X</a>, <a target="_blank" href="https://hcompany.ai/blog">Blog</a>, <a target="_blank" href="https://huggingface.co/hcompany-ai/Holo2-8B">HF</a>)</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p></p><p>Big Companies & APIs</p><p><strong>GPT‑5.1: Instant vs Thinking, and a new personality bar</strong></p><p>The biggest headline of the week was OpenAI shipping GPT‑5.1, and this was a hot topic of debate on the show. The update introduces two modes: “Instant” for fast, low‑compute answers, and “Thinking” for deeper reasoning on hard problems. 
OpenAI claims Instant mode uses 57% fewer tokens on easy tasks, while Thinking mode dedicates 71% more compute to difficult ones. This adaptive approach is a smart evolution.</p><p>The release also adds a personality dropdown with options like Professional, Friendly, Quirky, and Cynical, aiming for a more “warm” and customizable experience. </p><p>Yam and I felt this was a step in the right direction, as GPT‑5 could often feel a bit cold and uncommunicative. However, Wolfram had a more disappointing experience, finding that GPT‑5.1 performed significantly worse on his German grammar and typography tasks compared to GPT‑4 or Claude Sonnet 4.5. It’s a reminder that “upgrades” can be subjective and task‑dependent.</p><p>Since the show was recorded, GPT 5.1 has also been released in the API, and they’ve published a <a target="_blank" href="https://cookbook.openai.com/examples/gpt-5/gpt-5-1_prompting_guide">prompting guide</a> and some evals, with significant jumps across SWE-bench Verified and GPQA Diamond! We’ll be testing this model out all week. </p><p>The highlight for this model is the creative writing: it was made public that this model was being tested on OpenRouter as Polaris-alpha, and that one tops the EQ-Bench creative writing benchmark, beating Sonnet 4.5 and Gemini! </p><p><strong>Grok‑4 Fast: 2M context and a native X superpower</strong></p><p>Grok‑4 Fast from xAI apparently quietly got a substantial upgrade to a 2M‑token context window, but the most interesting part is its unique integration with X. The API version has access to internal tools for semantic search over tweets, retrieving top quote tweets, and understanding embedded images and videos. 
I’ve started using it as a research agent in my show prep, and it feels like having a research assistant living inside X’s backend—something you simply can’t replicate with public tools.</p><p>I still have my gripes about their “stealth upgrade” versioning strategy, which makes rigorous evaluation difficult, but as a practical tool, Grok‑4 Fast is incredibly powerful. It’s also surprisingly fast and cost‑effective, holding its own against other top models on benchmarks while offering a superpower that no one else has.</p><p><strong>Google SIMA 2: Embodied Agents in Virtual Worlds</strong></p><p>Google’s big contribution this week was SIMA 2, DeepMind’s latest embodied agent for 3D virtual worlds. SIMA lives inside real games like <em>No Man’s Sky</em> and <em>Goat Simulator</em>, seeing the screen and controlling the game via keyboard and mouse, using Gemini as its reasoning brain. Demos showed it following complex, sketch‑based instructions, like finding an object that looks like a drawing of a spaceship and jumping on top of it.</p><p>When you combine this with Genie 3—Google’s world model that can generate playable environments from a single image—you see the bigger picture: agents that learn physics, navigation, and common sense by playing in millions of synthetic worlds. We’re not there yet, but the pieces are clearly being assembled. We also touched on the latest Gemini Live voice upgrade, which users are reporting feels much more natural and responsive.</p><p><strong>More Big Company News: Qwen Deep Research, Code Arena, and Cursor</strong></p><p>We also briefly covered Qwen’s <a target="_blank" href="https://x.com/Alibaba_Qwen/status/1989026687611461705">new</a> Deep Research feature, which offers an OpenAI‑style research agent inside their ecosystem. LMSYS launched <a target="_blank" href="https://arena.lmsys.org/blog/code-arena">Code Arena</a>, a fantastic live evaluation platform where models build real web apps agentically, with humans voting on the results. 
And in the world of funding, the AI‑native code editor Cursor raised a staggering $2.3 billion, a clear sign that AI is becoming the default way developers interact with code.</p><p><strong>This Week’s Buzz: W&B LEET – a terminal UI that sparks joy</strong></p><p>For this week’s buzz, I brought on Dima Duev from our SDK team at Weights & Biases to show off a side project that has everyone at the company excited: LEET, the Lightweight Experiment Exploration Tool. Imagine you’re training on an air‑gapped HPC cluster, living entirely in your terminal. How do you monitor your runs? With LEET.</p><p>You run your training script in W&B offline mode, and in another terminal, you type wandb beta leet. Your terminal instantly turns into a full TUI dashboard with live metric plots, system stats, and run configs. You can zoom into spikes in your loss curve, filter metrics, and see everything updating in real time, all without a browser or internet connection. It’s one of those tools that just sparks joy. It ships with the latest wandb SDK (v0.23.0+), so just upgrade and give it a try! </p><p><strong>Voice & Audio: Scribe v2 Realtime and Omnilingual ASR</strong></p><p><strong>ElevenLabs Scribe v2 Realtime: ASR built for agents </strong>(<a target="_blank" href="https://x.com/elevenlabsio/status/1988282248445976987">X</a>, <a target="_blank" href="https://docs.elevenlabs.io">Announcement</a>, <a target="_blank" href="https://captions.events/">Demo</a>)</p><p>We’ve talked a lot on this show about ElevenLabs as “the place you go to make your AI talk.” This week, they came for the other half of the conversation. Paul Asjes from ElevenLabs joined us to walk through Scribe v2 Realtime, their new low‑latency speech‑to‑text model. If you’re building a voice agent, you need ears, a brain, and a mouth. 
ElevenLabs already nailed the mouth, and now they’ve built some seriously good ears.</p><p>Scribe v2 Realtime is designed to run at around 150 milliseconds median latency, across more than ninety languages. Watching Paul’s live demo, it felt comfortably real‑time. When he switched from English to Dutch mid‑sentence, the system just followed along without missing a beat. Community benchmarks and our own impressions show it holding its own or beating competitors like Whisper and Deepgram in noisy, accented, and multi‑speaker scenarios. It’s also context‑aware enough to handle code, initialisms, and numbers correctly, which is critical for real‑world agents. This is a production‑ready ASR for anyone building live voice experiences.</p><p><strong>Meta drops Omnilingual ASR: 1,600+ languages, many for the first time + a bunch of open source models</strong> (<a target="_blank" href="https://x.com/AIatMeta/status/1987946571439444361">X</a>, <a target="_blank" href="https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition">Blog</a>, <a target="_blank" href="https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/">Announcement</a>, <a target="_blank" href="https://huggingface.co/datasets/facebook/omnilingual-asr-corpus">HF</a>)</p><p>On the other end of the spectrum, Meta released something that’s less about ultra‑low latency and more about sheer linguistic coverage: Omnilingual ASR. This is a family of models and a dataset designed to support speech recognition for more than 1,600 languages, including about 500 that have never had any ASR support before. That alone is a massive contribution.</p><p>Technically, it uses a wav2vec 2.0 backbone scaled up to 7B parameters with both CTC and LLM‑style decoders. The LLM‑like architecture allows for in‑context learning, so communities can add support for new languages with only a handful of examples. 
They’re also releasing the Omnilingual ASR Corpus with data for 350 underserved languages. The models and code are Apache‑2.0, making this a huge step forward for more inclusive speech tech.</p><p><strong>AI Art, Diffusion & 3D</strong></p><p><strong>Qwen Image Edit + Multi‑Angle LoRA: moving the camera after the fact  </strong>(<a target="_blank" href="https://x.com/linoy_tsaban/status/1986090375409533338">X</a>, <a target="_blank" href="https://huggingface.co/spaces/linoyts/Qwen-Image-Edit-Angles">HF</a>, <a target="_blank" href="https://x.com/fal/status/1988693046267969804?s=20">Fal</a>)</p><p>This one was pure fun. A new set of LoRAs for Qwen Image Edit adds direct camera control to still images. A Hugging Face demo lets you upload a photo and use sliders to rotate the camera up to 90 degrees, tilt from a bird’s‑eye to a worm’s‑eye view, and adjust the lens. We played with it live on the show with a portrait of Wolfram and a photo of my cat, generating different angles and then interpolating them into a short “fly‑around” video. It’s incredibly cool and preserves details surprisingly well, feeling like you have a virtual camera inside a 2D picture.</p><p><strong>NVIDIA ChronoEdit‑14B Upscaler LoRA </strong>(<a target="_blank" href="https://x.com/HuanLing6/status/1988098676838060246">X</a>, <a target="_blank" href="https://huggingface.co/NVIDIA/ChronoEdit-14B-Diffusers-Upscaler-LoRA">HF</a>)</p><p>Finally, NVIDIA released an upscaler LoRA based on their ChronoEdit‑14B model and merged the pipeline into Hugging Face Diffusers. ChronoEdit reframes image editing as a temporal reasoning task, like generating a tiny video. This makes it good for maintaining consistency in edits and upscales. It’s a heavy model, requiring ~34GB of VRAM, and for aggressive upscaling, specialized tools might still be better. But for moderate upscales where temporal coherence matters, it’s a very interesting new tool in the toolbox.</p><p>Phew, we made it through this dense week! 
Looking to next week, I’ll be recording the show live from the AI Engineer CODE summit in NY, and we’ll likely see a few good releases from the big G? Maybe? Finally? </p><p><p>As always, if this was helpful, please subscribe to ThursdAI and share it with 2 friends, see you next week 🫡 </p></p><p>TL;DR and Show Notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist at Weights & Biases (<a target="_blank" href="http://x.com/altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Guest</strong>: Dima Duev - SDK team Wandb</p><p>* <strong>Guest</strong>: Paul Asjes - Eleven Labs (<a target="_blank" href="https://x.com/paul_asjes">@paul_asjes</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* <strong>Terminal-Bench 2.0 and Harbor launch</strong> (<a target="_blank" href="https://x.com/alexgshaw/status/1986911106108211461">X</a>, <a target="_blank" href="https://harborframework.com/">Blog</a>, <a target="_blank" href="https://harborframework.com/docs/running-tbench">Docs</a>, <a target="_blank" href="https://www.tbench.ai/leaderboard">Announcement</a>)</p><p>* <strong>Baidu releases ERNIE-4.5-VL-28B-A3B-Thinking</strong> (<a target="_blank" href="https://x.com/Baidu_Inc/status/1988182106359411178">X</a>, <a target="_blank" href="https://huggingface.co/ERNIE/ERNIE-4.5-VL-28B-A3B-Thinking">HF</a>, <a target="_blank" href="https://github.com/ERNIE/ERNIE-4.5-VL-28B-A3B-Thinking">GitHub</a>, <a target="_blank" href="https://ernie.baidu.com/blog/ernie-4-5-vl-28b-a3b-thinking">Blog</a>, <a target="_blank" href="https://ai.baidu.com/ai-studio">Platform</a>)</p><p>* <strong>Project AELLA (OSSAS): 100K LLM-generated paper summaries</strong> (<a target="_blank" 
href="https://x.com/samhogan/status/1988306424309706938">X</a>, <a target="_blank" href="https://huggingface.co/inference-net">HF</a>)</p><p>* <strong>WeiboAI’s VibeThinker-1.5B</strong> (<a target="_blank" href="https://x.com/WeiboLLM/status/1988109435902832896">X</a>, <a target="_blank" href="https://huggingface.co/WeiboAI/VibeThinker-1.5B">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2511.06221">Arxiv</a>, <a target="_blank" href="https://venturebeat.com/ai/weibos-new-open-source-ai-model-vibethinker-1-5b-outperforms-deepseek-r1-on">Announcement</a>)</p><p>* <strong>Code Arena — live, agentic coding evaluations</strong> (<a target="_blank" href="https://x.com/arena/status/1988665193275240616">X</a>, <a target="_blank" href="https://arena.lmsys.org/blog/code-arena">Blog</a>, <a target="_blank" href="https://arena.lmsys.org/">Announcement</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* <strong>Grok 4 Fast, Grok Imagine and Nano Banana v1/v2</strong> (<a target="_blank" href="https://x.com/chatgpt21/status/1987976808562589946">X</a>, <a target="_blank" href="https://x.com/cowowhite/status/1988213138333069314">X</a>, <a target="_blank" href="https://x.com/RasNas1994/status/1986426297900245106">X</a>, <a target="_blank" href="https://x.com/XFreeze/status/1987396781861212353">X</a>)</p><p>* <strong>OpenAI launches GPT-5.1</strong> (<a target="_blank" href="https://x.com/fidjissimo/status/1988683216681889887">X</a>, <a target="_blank" href="https://x.com/sama/status/1988692165686620237">X</a>)</p><p>* <strong>This week’s Buzz</strong></p><p>* <strong>W&B LEET — an open-source Terminal UI (TUI) to monitor runs</strong> (<a target="_blank" href="https://x.com/wandb/status/1988401253156876418">X</a>, <a target="_blank" href="https://app.getbeamer.com/wandb/en/meet-wb-leet-a-new-terminal-ui-for-weights-biases-JXSFhyt2">Blog</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* <strong>ElevenLabs launches Scribe v2 Realtime</strong> (<a target="_blank" 
href="https://x.com/elevenlabsio/status/1988282248445976987">X</a>, <a target="_blank" href="https://elevenlabs.io/agents">Blog</a>, <a target="_blank" href="https://docs.elevenlabs.io">Docs</a>)</p><p>* <strong>Meta releases Omnilingual ASR for 1,600+ languages</strong> (<a target="_blank" href="https://x.com/AIatMeta/status/1987946571439444361">X</a>, <a target="_blank" href="https://ai.meta.com/blog/omnilingual-asr-advancing-automatic-speech-recognition">Blog</a>, <a target="_blank" href="https://ai.meta.com/research/publications/omnilingual-asr-open-source-multilingual-speech-recognition-for-1600-languages/">Paper</a>, <a target="_blank" href="https://huggingface.co/datasets/facebook/omnilingual-asr-corpus">HF Dataset</a>, <a target="_blank" href="https://huggingface.co/spaces/facebook/omniasr-transcriptions">HF Demo</a>, <a target="_blank" href="https://github.com/facebookresearch/omnilingual-asr">GitHub</a>)</p><p>* <strong>Gemini Live conversational upgrade</strong> (<a target="_blank" href="https://x.com/carlovarrasi/status/1988691309591425234">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* <strong>Qwen Image Edit + Multi‑Angle LoRA for camera control</strong> (<a target="_blank" href="https://x.com/linoy_tsaban/status/1986090375409533338">X</a>, <a target="_blank" href="https://huggingface.co/spaces/linoyts/Qwen-Image-Edit-Angles">HF</a>, <a target="_blank" href="https://x.com/fal/status/1988693046267969804?s=20">Fal</a>)</p><p>* <strong>NVIDIA releases ChronoEdit-14B Upscaler LoRA</strong> (<a target="_blank" href="https://x.com/HuanLing6/status/1988098676838060246">X</a>, <a target="_blank" href="https://huggingface.co/NVIDIA/ChronoEdit-14B-Diffusers-Upscaler-LoRA">HF</a>, <a target="_blank" href="https://huggingface.co/docs/diffusers/main/en/api/pipelines/chronoedit">Docs</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-13-gpt-51-ernie-45-vl</link><guid isPermaLink="false">substack:post:178831797</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 13 Nov 2025 22:25:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/178831797/9eb8c77a5260405e35e93e80f3311201.mp3" length="67524567" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4220</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/178831797/33da0fd51fa2eaf9c1239e26bb61a511.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Nov 6, 2025 - Kimi’s 1T Thinking Model Shakes Up Open Source, Apple Bets $1B on Gemini for Siri, and Amazon vs. Perplexity!]]></title><description><![CDATA[<p>Hey, Alex here! </p><p>Quick note, while preparing for this week, I posted on X that I don’t remember such a quiet week in AI since I started doing ThursdAI regularly, but then 45 min before the show started, Kimi dropped a SOTA OSS reasoning model, turning a quiet week into an absolute banger. </p><p>Besides Kimi, we covered the updated MCP thinking from Anthropic, had Kenton Varda from Cloudflare as a guest to talk about Code Mode, chatted about Windsurf and Cursor’s latest updates, and covered OpenAI’s insane deals. </p><p>Also, because it was a quiet week, I figured I’d use the opportunity to create an AI-powered automation, and used N8N for that, and shared it on the stream, so if you’re interested in automating with AI with relatively low code, this episode is for you. Let’s dive in!</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p><strong>Kimi K2 Thinking is Here and It’s a 1 Trillion Parameter Beast! </strong>(<a target="_blank" href="https://x.com/Kimi_Moonshot/status/1986449512538513505">X</a>, <a target="_blank" href="https://huggingface.co/moonshotai/Kimi-K2-Thinking">HF</a>, <a target="_blank" href="https://moonshotai.github.io/Kimi-K2/thinking.html">Tech Blog</a>)</p><p>Let’s start with the news that got everyone’s energy levels skyrocketing right as we went live. Moonshot AI dropped Kimi K2 Thinking, an open-source, 1 trillion-parameter Mixture-of-Experts (MoE) model, and it’s an absolute monster.</p><p>This isn’t just a numbers game; Kimi K2 Thinking is designed from the ground up to be a powerful agent. It runs with just around 32 billion active parameters during inference, has a massive 256,000-token context window, and packs an insane tool-calling capacity. They’re claiming it can handle 200-300 sequential tool calls without any human intervention. </p><p>The benchmarks are just as wild. On Humanity’s Last Exam (HLE), they’re reporting a score of 44.9%, beating out both GPT-5 and Claude 4.5 Thinking. While it doesn’t quite top the charts on SWE-bench verified, it’s holding its own against the biggest closed-source models out there. Seeing an open-source model compete at this level is incredibly exciting.</p><p>During the show, we saw some truly mind-blowing demos, from a beautiful interactive visualization of gradient descent to a simulation of a virus attacking cells, all generated by the model. The model’s reasoning traces, which are exposed through the API, also seem qualitatively different from other models, showing a deep and thoughtful process. My co-hosts and I were blown away. The weights and a very detailed technical report are available on Hugging Face, so you can dive in and see for yourself. 
Shout out to the entire Moonshot AI team for this incredible release!</p><p>Other open source updates from this week</p><p>* HuggingFace released an open source “Smol Training Playbook” on training LLMs; it’s a 200+ page interactive beast with visualizations, deep dives into pretraining, datasets, post-training and more! (<a target="_blank" href="https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook">HF</a>)</p><p>* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (<a target="_blank" href="https://x.com/allen_ai/status/1985719070407176577">X</a>, <a target="_blank" href="https://allenai.org/blog/olmoearth?utm_source=x&#38;utm_medium=social&#38;utm_campaign=olmoearth">Blog</a>)</p><p>* LongCat-Flash-Omni — open-source omni-modal system with millisecond E2E spoken interaction, 128K context and a 560B ScMoE backbone (<a target="_blank" href="https://x.com/Meituan_LongCat/status/1984398560973242733">X</a>, <a target="_blank" href="https://huggingface.co/meituan-longca">HF</a>, <a target="_blank" href="https://github.com/meituan-longca">Announcement</a>)</p><p><strong>Big Tech’s Big Moves: Apple, Amazon, and OpenAI</strong></p><p>The big companies were making waves this week, starting with a blockbuster deal that might finally make Siri smart. Apple will reportedly pay Google around $1 billion per year to license a custom 1.2 trillion-parameter version of Gemini to power a revamped Siri.</p><p>This is a massive move. The Gemini model will run on Apple’s Private Cloud Compute, keeping user data walled off from Google, and will handle Siri’s complex summarizer and planner functions. After years of waiting for Apple to make a significant move in GenAI, it seems they’re outsourcing the heavy lifting for now while they work to catch up with their own in-house models. 
As a user, I don’t really care who builds the model, as long as Siri stops being dumb!</p><p>In more dramatic news, Perplexity revealed that Amazon sent them a legal threat to block their Comet AI assistant from shopping on <a target="_blank" href="https://www.amazon.com">Amazon.com</a>. This infuriated me. My browser is my browser, and I should be able to use whatever tools I want to interact with the web. Perplexity took a strong stand with their <a target="_blank" href="https://perplexity.ai/hub/blog/bullying-is-not-innovation">blog post</a>, “Bullying is Not Innovation,” arguing that user agents are distinct from scrapers and act on behalf of the user with their own credentials. An AI assistant is just that—an assistant. It shouldn’t matter if I ask my wife or my AI to buy something for me on Amazon. This feels like a move by Amazon to protect its ad revenue at the expense of user choice and innovation, and I have to give major props to Perplexity for being so transparent and fighting back.</p><p>Finally, OpenAI continues its quest for infinite compute, announcing a multi-year strategic partnership with AWS. This comes on top of massive deals with NVIDIA, Microsoft, Oracle, and others, bringing their total commitment to compute into the trillions of dollars. It’s getting to a point where OpenAI seems “too big to fail,” as any hiccup could have serious repercussions for the entire tech economy, which is now heavily propped up by AI investment. Sam clarified in a recent <a target="_blank" href="https://x.com/sama/status/1986514377470845007">post on X</a> that OpenAI doesn’t want to be too big to fail, and that the recent miscommunications around the US government backstopping OpenAI’s infrastructure buildout were taken out of context. 🤔 </p><p><strong>Coding with AI: The Evolution of MCP and New Dev Tools</strong></p><p>This week, we kicked off a new segment on the show: Coding with AI! 
We realized that we talk about AI coding a LOT, so we decided to add a dedicated corner for it! And we started with a fascinating development in the world of agentic tooling. Anthropic published a blog post arguing that the standard way of using the Model Context Protocol (MCP) — by loading full tool definitions into the context window — is inefficient.</p><p>Their solution? Have LLMs write code to interact with tools instead. This approach can slash token usage by over 98% in some cases. This idea sounded familiar, and that’s because Cloudflare had already <a target="_blank" href="https://blog.cloudflare.com/code-mode/">explored</a> it with a feature called “Code Mode.” We were lucky enough to have <a target="_blank" href="https://x.com/KentonVarda">Kenton Varda</a>, one of the authors of the Code Mode post and head of engineering for Cloudflare Workers, join us to discuss this shift.</p><p>Kenton explained that LLMs are trained on vast amounts of code, making it a more “native language” for them than the artificial construct of tool calls. By generating code, agents can chain multiple tool calls together, process intermediate results, and operate much more efficiently without sending everything back through the neural network. While MCP still provides crucial standardization for discovering and authorizing tools, this “code execution” pattern seems to be the way forward for building more powerful and scalable agents.</p><p>Windsurf’s CodeMaps and Cursor’s multi-agent executions</p><p>In other coding with AI news, Windsurf has shipped an incredible feature called CodeMaps. They use their SWE-1 model to (quickly) generate Codemaps that explain a codebase to you visually: what starts where and goes where. It’s really useful to understand a new codebase or re-understand one you forgot about already! You can even chat with codemaps, to see if your overall system’s design is solid! 
Great addition that I’m sure will help many folks adopt Windsurf! </p><p>And Cursor, another popular AI-native IDE, released a super-performant in-IDE browser and a wild multi-agent feature that queries multiple LLMs in parallel and then synthesizes their answers.</p><p><strong>This Week’s Tutorial</strong></p><p>I finally got around to building some serious automations for ThursdAI, and folks, N8N has been a game-changer. What used to take me 30+ minutes of manual work now happens automatically in the background.</p><p>Here’s what I built: a Telegram bot that takes Twitter/X links, fetches the tweets and all linked content, uses AI agents to extract and summarize the information, and then posts it to our announcement channel and my notes app. The coolest part? I built this whole thing in about 4 hours with the help of Atlas browser and GPT-5 literally telling me what to do at each step.</p><p>During the show, we even live-tested swapping out GPT-4o-mini for Kimi K2 - took literally 30 seconds to connect via OpenRouter. I went through my nodes and explained how this all works on the show, so if you’ve wanted to learn about n8n, check it out starting around 01:13:00. If you want to see how my automation turned out, it will be posting all my links to the new telegram channel <a target="_blank" href="http://t.me/thursdai_news">t.me/thursdai_news</a> (expect it to be messy at first as I’m testing out the automation) </p><p><strong>Robotics - Xpeng’s “Iron” humanoid: big vibes, few specs</strong></p><p>Another week, another humanoid robot that is supposedly “coming” in 2026! </p><p>A humanoid from Xpeng went viral this week, marketed as “the most human‑like” robot with soft skin, bionic muscles, customizable sexes (yes, really, they have a woman humanoid), something called a VLT brain, and a 2026 production goal. Here’s what we didn’t get: a spec sheet. No DOF, speed, payload, compute TOPS, battery capacity, runtime, or safety pathway. 
No pricing, manufacturing strategy, or clear target markets. In other words: lots of sizzle, no steak.</p><p>Apparently, some folks thought Xpeng pulled an Elon and put a human in a robot suit, prompting the CEO to do the “we’ll cut a part of the soft skin to expose the robot underneath so you don’t think we’re lying” stunt. Which, I agree, was very effective. </p><p>But if Xpeng is serious, the next thing we’ll see should be a crisp engineering document: joints, actuation, sensors, compute, and a locomotion/manipulation demo with independent measurements. Until then, treat this as a branding salvo and a reminder that the humanoid category is still sorting itself into “industrial payload first” versus “human likeness first” approaches. </p><p><strong>Voice & Audio</strong></p><p><strong>Maya‑1: open‑source voice design from natural language</strong></p><p>We highlighted Maya‑1, a 3B Llama‑backboned TTS system designed to generate voices from natural language descriptions. Instead of picking from a menu, you describe the voice—age, accent, affect—and Maya conjures it. It supports real‑time streaming and over twenty “emotion tags.” The quality is compelling for its size and the Apache 2 license will make a lot of builders happy. There’s a growing middle class of TTS: tiny but expressive, good enough for in‑app narrators, prototyping, and even stylized content when you don’t want the constraints of commercial voice marketplaces.</p><p><strong>Inworld TTS: a new leader on independent rankings</strong></p><p>We also listened to Inworld’s latest, which currently tops the Artificial Analysis TTS leaderboard. It’s not open source, but the combo of expressivity, speed (sub‑250 ms), and multilingual support puts it firmly in the “commercially viable at scale” tier alongside the usual suspects. If you need SaaS TTS today and care about emotional range, add this to your shortlist. 
Pricing on their site targets availability rather than hobbyist tinkering, but the quality speaks for itself.</p><p>Whew! For a week that started slow, it certainly ended with a bang. It just goes to show you can never count AI out. We’re seeing open source continue to push the boundaries, big tech making landscape-defining moves, and agentic AI becoming more powerful and accessible every day.</p><p>As always, thanks for tuning in. If you’re going to be at the <a target="_blank" href="https://ai.engineer">AI.engineer</a> conference in New York, please hit me up—I’d love to meet you.</p><p>TL;DR and Show Notes + Links</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> @yampeleg <a target="_blank" href="http://x.com/@nisten">@nisten</a></p><p>* Kenton Varda @ Cloudflare (<a target="_blank" href="https://x.com/KentonVarda">@KentonVarda</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* Smol Training Playbook — a 200+ page, end-to-end guide to reliably pretrain and operate LLMs (<a target="_blank" href="https://x.com/eliebakouch/1983930328751153159">X</a>, <a target="_blank" href="https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook">Announcement</a>)</p><p>* Ai2 launches OlmoEarth — foundation models + open, end-to-end platform for fast, high-resolution Earth intelligence (<a target="_blank" href="https://x.com/allen_ai/1985719070407176577">X</a>, <a target="_blank" href="https://allenai.org/blog/olmoearth?utm_source=x&#38;utm_medium=social&#38;utm_campaign=olmoearth">Blog</a>)</p><p>* Moonshot AI releases Kimi K2 Thinking — an open-source 1T-parameter MoE agent with 256K context and huge tool-calling capacity (<a target="_blank" href="https://x.com/Kimi_Moonshot/1986449512538513505">X</a>, <a target="_blank" 
href="https://huggingface.co/moonshotai/Kimi-K2-Thinking">HF</a>, <a target="_blank" href="https://moonshotai.github.io/Kimi-K2/thinki">Blog</a>, <a target="_blank" href="https://huggingface.co/papers/2510.26692">Arxiv</a>)</p><p>* LongCat flash Omni - 560B (27A) omni model (text, audio, video input)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Apple will pay roughly $1B/year to license a custom 1.2 trillion‑parameter Google Gemini model to power a revamped Siri (<a target="_blank" href="https://x.com/markgurman/1986150242698637591">X</a>, <a target="_blank" href="https://www.bloomberg.com/news/articles/2025-11-05/apple-plans-to-use-1-2-trillion-parameter-google-gemini-model-to-power-new-siri?accessToken=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzb3VyY2UiOiJTdWJzY3JpYmVyR2lmdGVkQXJ0aWNsZSIsImlhdCI6MTc2MjM3MDAzOSwiZXhwIjoxNzYyOTc0ODM5LCJhcnRpY2xlSWQiOiJUNTZETzhHUTdMMFIwMCIsImJjb25uZWN0SWQiOiJDNEVEQ0FFMUZBMDU0MEJFQTI0QTlGMjExQzFFOTA4MCJ9._aWk2P25J89KBRkJQ_KdbwuULLM8yUtrPCPfRmsUfSs">Announcement</a>)</p><p>* Perplexity says Amazon issued a legal threat to block Comet AI assistants from shopping on Amazon (<a target="_blank" href="https://x.com/perplexity_ai/1985774904911020319">X</a>, <a target="_blank" href="https://perplexity.ai/hub/blog/bullying-is-not-innovation">Blog</a>)</p><p>* AWS announces multi-year strategic infrastructure partnership with OpenAI to power ChatGPT inference, training, and agentic AI (<a target="_blank" href="https://x.com/ajassy/1985351258333643172">X</a>)</p><p>* <strong>Robotics</strong></p><p>* Xpeng unveils ‘Iron’ humanoid claiming ‘most human-like’ design with soft skin, bionic muscles, VLT brain and a 2026 production plan (<a target="_blank" href="https://x.com/humanoidsdaily/1986063827327201757">X</a>)</p><p>* <strong>Coding with AI</strong></p><p>* Anthropic shows how running MCP-connected tools as code slashes token use and scales agents (<a target="_blank" href="https://x.com/AnthropicAI/1985846791842250860">X</a>, <a 
target="_blank" href="https://www.anthropic.com/engineering/code-execution-with-mcp">Blog</a>)</p><p>* Windsurf Codemaps — AI‑annotated, navigable maps of your codebase powered by SWE-1.5 (Fast) and Sonnet 4.5 (Smart) (<a target="_blank" href="https://x.com/cognition/1985755284527010167">X</a>, <a target="_blank" href="https://cognition.ai/blog/codemaps">Announcement</a>)</p><p>* Conversation with Kenton Varda (<a target="_blank" href="https://x.com/KentonVarda">@KentonVarda</a>) from Cloudflare about MCP and Code Mode</p><p>* Cursor added an in-IDE browser - very performant!</p><p>* <strong>Audio & Video</strong></p><p>* Maya-1 - Open source voice generation model.</p><p>* Inworld TTS - new #1 on the Artificial Analysis benchmark.</p><p>* <strong>Tools & Gadgets</strong></p><p>* Sandbar launches Stream — a voice-first personal assistant — and Stream Ring, a wearable ‘mouse for voice’, available for preorder (<a target="_blank" href="https://x.com/sandbar/status/1986112726889078911">X</a>, <a target="_blank" href="https://www.sandbar.com/stream">Blog</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-6-2025-kimis-1t-thinking</link><guid isPermaLink="false">substack:post:178231559</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 07 Nov 2025 00:31:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/178231559/edec751fd9361d36a5c64072f37eaa97.mp3" length="89043188" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5565</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/178231559/44ddb177c5c4214147786b27b221a0ac.jpg"/></item><item><title><![CDATA[ThursdAI - Oct 30 - From ASI in a Decade to Home Humanoids: MiniMax M2's Speed Demon, OpenAI's Bold Roadmap, and 2026 Robot Revolution]]></title><description><![CDATA[<p>Hey, it’s Alex! Happy Halloween friends! </p><p>I’m excited to bring you this week’s (spooky) AI updates! We started the show today with MiniMax M2, the current top open-source LLM, with an interview with their head of eng, Skyler Miao, continued to dive into OpenAI’s completed restructuring into a non-profit and a PBC, including a deep dive into a live stream Sam Altman had, with a ton of spicy details, and finally chatted with Arjun Desai from Cartesia, following a release of Sonic 3, a sub 49ms voice model! 
(as always, show notes in the end)</p><p><p>Hey, if you like this content, it would mean a lot if you subscribe as a paid subscriber.</p></p><p>Open Source AI</p><p>MiniMax M2: open-source agentic model at 8% of Claude’s price, 2× speed (<a target="_blank" href="https://x.com/MiniMax__AI/status/1982674798649160175">X</a>, <a target="_blank" href="https://huggingface.co/MiniMaxAI/MiniMax-M2">Hugging Face</a> )</p><p>We kicked off our open-source segment with a banger of an announcement and a special guest. The new king of open-source LLMs is here, and it’s called <strong>MiniMax M2</strong>. We were lucky enough to have Skyler Miao, Head of Engineering at MiniMax, join us live to break it all down.</p><p>M2 is an agentic model built for code and complex workflows, and its performance is just staggering. It’s already ranked in the top 5 globally on the Artificial Analysis benchmark, right behind giants like OpenAI and Anthropic. But here’s the crazy part: it delivers nearly <strong>twice the speed</strong> of Claude Sonnet at just <strong>8% of the price</strong>. This is basically Sonnet-level performance, at home, in open source.</p><p>Skyler explained that their team saw an “impossible triangle” in the market between performance, cost, and speed—you could only ever get two. Their goal with M2 was to build a model that could solve this, and they absolutely nailed it. It’s a 200B parameter Mixture-of-Experts (MoE) model, but with only 10B active parameters per inference, making it incredibly efficient.</p><p>One key insight Skyler shared was about getting the best performance. M2 supports multiple APIs, but to really unlock its reasoning power, you need to use an API that passes the model’s “thinking” tokens back to it on the next turn, like the Anthropic API. 
Many open-source tools don’t support this yet, so it’s something to watch out for.</p><p>Huge congrats to the MiniMax team on this Open Weights (MIT licensed) release, you can find the model on <a target="_blank" href="https://huggingface.co/MiniMaxAI/MiniMax-M2">HF</a>! </p><p>MiniMax had quite a week, with 3 additional releases: MiniMax Speech 2.6, an update to their video model <a target="_blank" href="https://x.com/Hailuo_AI/status/1983382728343994414">Hailuo 2.3</a>, and, just after the show, a Music 2.0 <a target="_blank" href="https://x.com/Hailuo_AI/status/1983964920493568296">model</a> as well! Congrats on the shipping folks! </p><p>OpenAI drops gpt-oss-safeguard - first open-weight safety reasoning models for classification ( <a target="_blank" href="https://x.com/OpenAI/status/1983507392374641071">X</a>, <a target="_blank" href="https://huggingface.co/collections/openai/gpt-oss-safeguard">HF</a> )</p><p>OpenAI is back on the open-weights bandwagon with gpt-oss-safeguard, a finetune of their previously released open-weight gpt-oss models. </p><p>These models were trained exclusively to help companies build safeguarding policies to make sure their apps remain safe! With gpt-oss-safeguard 20B and 120B, OpenAI is achieving near parity with their internal safety models, and as Nisten said on the show, if anyone knows about censorship and safety, it’s OpenAI! </p><p>The highlight of this release is that, unlike traditional pre-trained classifiers, these models allow for updates to policy via natural language!</p><p>These models will be great for businesses that want to safeguard their products in production, and I will advocate to bring these models to W&B Inference soon! </p><p>A Humanoid Robot in Your Home by 2026? 
1X NEO announcement ( <a target="_blank" href="https://x.com/1x_tech/status/1983233494575952138">X</a>, <a target="_blank" href="https://www.1x.tech/order">Order page</a>, <a target="_blank" href="https://youtu.be/LTYMWadOW7c">Keynote</a> )</p><p>Things got really spooky when we started talking about robotics. The company 1X, which has been on our radar for a while, officially launched pre-orders for <strong>NEO</strong>, the world’s first consumer humanoid robot designed for your home. And yes, you can order one right now for $20,000, with deliveries expected in early 2026.</p><p>The internet went crazy over this announcement: folks posted receipts of ordering one, others stoked the uncanny-valley fears of a robot uprising that sci-fi has built into many people over the years, and many raised privacy concerns about having a human tele-operate this robot in your house to do chores. </p><p>It can handle chores like cleaning and laundry, and for more complex tasks that it hasn’t learned yet, it uses a teleoperation system where a human “1X Expert” can pilot the robot remotely to perform the task. This is how it collects the data to learn to do these tasks autonomously in your specific home environment.</p><p>The whole release is very interesting, from the “soft and quiet” approach 1X is taking, making their robot a 66-lb short king draped in a knit sweater, to the $20K price point (effectively sold at a loss, given how much just the hands cost), to the human-teleoperation addition that makes sure the robot learns about your unique house layout. </p><p>The conversation on the show was fascinating. We talked about all the potential use cases, from having it water your plants and look after your pets while you’re on vacation to providing remote assistance for elderly relatives. 
Of course, there are real privacy concerns with having a telepresence device in your home, but 1X says these sessions are scheduled by you and have strict no-go zones.</p><p>Here’s my prediction: by next Halloween, we’ll see videos of these NEO robots dressed up in costumes, helping out at parties. The future is officially here. Will you be getting one? If not this one, when do you think you’ll get one? </p><p>OpenAI’s Grand Plan: From Recapitalization to ASI</p><p>This was by far the biggest update about the world of AI for me this week! Sam Altman was joined by Jakub Pachocki, chief scientist, and Wojciech Zaremba, a co-founder, on a live stream to share an update about their corporate structure, plans for the future, and ASI goals (Artificial Superintelligence) </p><p>First, the company now has a new structure: a non-profit <strong>OpenAI Foundation</strong> governs the for-profit <strong>OpenAI Group</strong>. The foundation starts with about 26% equity and has a mission to use AI for public good, including an initial $25 billion commitment to curing diseases and building an “AI Resilience” ecosystem.</p><p>But the real bombshells were about their research timeline. Chief Scientist Jakub Pachocki stated that they believe deep learning systems are <strong>less than a decade away from superintelligence (ASI)</strong>. He said that at this point, AGI isn’t even the right goal anymore. To get there, they’re planning to have an “AI research intern” by September 2026 and a fully <strong>autonomous AI researcher</strong> comparable to their human experts by March 2028. This is insane if you think about it. As Yam mentioned, OpenAI is already shipping at an insane speed, releasing models and products (Sora, Atlas, Pulse, the ChatGPT app store), and this is with humans, assisted by AI. </p><p>And here, they are talking about complete and fully autonomous researchers, that will be infinitely more scalable than humans, in the next 2 years. 
The outcomes of this are hard to imagine and are honestly mind-blowing. </p><p>To power all this innovation, Sam revealed they have over <strong>$1.4 trillion in obligations for compute</strong> (over 30 GW), and said even that’s not enough. Their aspiration is to build a “compute factory” capable of standing up one gigawatt of new compute <em>per week</em>, and he hinted they may need to “rethink their robotics strategy” to build the data centers fast enough. Does this mean OpenAI humanoid robots building factories? 🤔 </p><p>Plus, don’t forget, Sam is one of the investors in Helion Energy, working on power solutions like fusion, and the above graphic has an Energy block that Sam said they will give an update on later (that’s also what he told me during Dev Day when I asked him about it). </p><p>Super exciting and honestly mind-blowing stuff: gigawatts per week, fully autonomous researchers, the world is going to look way different in a few years! </p><p>The Agent Labs Race: Cursor 2.0 vs. Cognition’s SWE-1.5 (<a target="_blank" href="https://x.com/cursor_ai">X</a>, <a target="_blank" href="https://cursor.com/blog/2-0">Blog</a>)</p><p>This week also saw a major showdown in the agentic coding space. On the very same day, both Cursor and Cognition launched major updates and their own new models, signaling a new era where agent labs are training their own specialized AI.</p><p>First up, <strong>Cursor 2.0</strong> was released with a completely redesigned multi-agent interface and their new model, <strong>Composer</strong>. Composer is claimed to be four times faster than comparable models, and the new UI is built around managing a fleet of agents that can work in parallel on your codebase. It’s a clear shift from being just an IDE to a full-fledged agent platform. 
Look, the UI even looks like ChatGPT, with no code in sight (until you switch to IDE mode) </p><p>Their Composer model is also very interesting and got a lot of folks excited, though questions remain about the evaluations they shared, and about the fact that they didn’t disclose whether it’s a finetune of a Chinese model (<a target="_blank" href="https://x.com/auchenberg/status/1983901551048470974">it likely is</a>). Regardless, folks are saying that it’s a very good model that’s also VERY fast! </p><p>Cognition’s own coding model - SWE 1.5 ( <a target="_blank" href="https://cognition.ai/blog/swe-1-5">Blog</a>, <a target="_blank" href="https://x.com/cognition/status/1983662836896448756">X</a>, <a target="_blank" href="https://windsurf.com/download">Windsurf</a> )</p><p>Then, just hours later, Cognition punched right back with <strong>SWE-1.5</strong>, their new frontier agent model that now powers Windsurf. The headline here is pure speed. Powered by Cerebras, SWE-1.5 hits a blistering <strong>950 tokens per second</strong>—13 times faster than Sonnet 4.5—while achieving near-SOTA performance on SWE-Bench Pro. They’ve achieved this through a co-designed stack where the agent harness, inference system, and model were all built together and optimized with end-to-end reinforcement learning in real coding environments.</p><p>This competition is fantastic news for all of us. We’re seeing specialized, highly-performant models being developed outside of the big labs, putting more power back in the hands of developers.</p><p><strong>This Week’s Buzz</strong></p><p>Just a few quick updates from the world of Weights & Biases and our parent company, CoreWeave.</p><p>First, big news! CoreWeave announced the acquisition of <strong>Marimo</strong>, the company behind the popular open-source, reactive notebook for Python. This is another exciting step in building out the essential cloud for AI, adding powerful development tools to the stack alongside best-in-class GPU infrastructure and MLOps with Weights & Biases. 
Welcome to the Marimo team!</p><p>Also, <a target="_blank" href="http://fullyconnected.com"><strong>Fully Connected</strong></a> is coming to London next week! It’s our premier conference, and we’ll have speakers from Mistral, Google, LlamaIndex, and more. If you’re in Europe, please come join us. DM me if you need tickets!</p><p>And if you’re in New York from November 19-22, come say hi at the <strong>AI Engineer Code Summit</strong>. We’re sponsoring and will have a big booth. It’s always a great place to meet folks from this community.</p><p><strong>Video & Voice: The Multimodal Explosion</strong></p><p>The world of video and voice AI was on fire this week.</p><p>The absolute highlight was <strong>Odyssey ML V2</strong>, a new real-time interactive AI video platform. This thing is not like other video models that take minutes to generate a clip. With Odyssey, you type a prompt, and a video starts streaming <em>instantly</em>. Then, you can edit it live. We did a demo on the show where we prompted “army of robots in a starship corridor” and then typed “turn these robots into fluffy covered cat robots,” and the video changed in real time. It’s mind-blowing. This is a glimpse into the future of user-driven, playable media.</p><p>On the more traditional video front, <strong>Sora</strong> is now invite-free in the US and Japan, and they launched <strong>Character Cameos</strong>. You can now upload photos of your pets or objects (like your kid’s carved pumpkin!) and turn them into characters that you and others can use in videos. I, of course, immediately made a cameo of my cat, Sonia.</p><p>Voice and Audio - Cartesia launches Sonic 3, sub 50ms AI speech model</p><p>In the world of voice, we had Arjun Desai from <strong>Cartesia</strong> join us to talk about <strong>Sonic-3</strong>, their new real-time TTS engine. 
Backed by a new <strong>$100M funding round</strong>, Sonic-3 is built on State Space Models (not Transformers) and can achieve insane speeds—we’re talking under 50ms latency. But it’s not just fast; it’s also incredibly expressive. It can laugh, emote, and speak 42 languages with natural code-switching. I used their Pro Voice cloning feature to create an AI version of myself, and the results were scarily good. We even had my AI clone host a segment of the show; see it for yourself here, powered by Argil and Sonic 3. This is... AI Alex</p><p><strong>Wrapping Up This Spooky Week 🎃</strong></p><p>As I sit here in my Halloween costume reflecting on this week, I can’t help but feel we’re at an inflection point. We have:</p><p>* Open source models competing with the best proprietary ones</p><p>* Humanoid robots becoming consumer products</p><p>* ASI timelines measured in single-digit years</p><p>* Real-time interactive AI across all modalities</p><p>And yet, nothing about this scares me. If anything, I’m more excited than ever about what we’re building together. Yes, the pace is insane. Yes, keeping up with everything is becoming nearly impossible (and it’s literally my job!). But we’re living through the most transformative period in human history, and we get to be part of it.</p><p>To everyone building, experimenting, and pushing boundaries - keep going. To everyone worried about what’s coming - join us in shaping it responsibly. And to everyone who celebrated Halloween today - I hope your costume was as epic as the AI developments we covered! 👻</p><p>Until next week, this is Alex signing off. 
Remember to subscribe, give us five stars, and I’ll see you next ThursdAI!</p><p>TL;DR - All Topics Covered</p><p>ThursdAI - Oct 30 - Halloween Special 👻</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a> <a target="_blank" href="https://x.com/nisten">@nisten</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/ryancarson">@ryancarson</a></p><p>* <strong>Guest: Skyler Miao</strong> - Head of Engineering, MiniMax (<a target="_blank" href="https://x.com/SkylerMiao7">@SkylerMiao7</a>)</p><p>* <strong>Guest: Arjun Desai</strong> - CoFounder Cartesia  (<a target="_blank" href="https://x.com/jundesai">@jundesai</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* <strong>MiniMax M2</strong>: Open-source agentic model at 8% of Claude’s price, 2× speed (<a target="_blank" href="https://x.com/MiniMax__AI/status/1982674798649160175">X</a>, <a target="_blank" href="https://huggingface.co/MiniMaxAI/MiniMax-M2">Hugging Face</a>)</p><p>* <strong>OpenAI GPT-OSS-Safeguard</strong>: First open-weight safety reasoning models (<a target="_blank" href="https://x.com/OpenAI/status/1983507392374641071">X</a>, <a target="_blank" href="https://huggingface.co/collections/openai/gpt-oss-safeguard">HF</a>)</p><p>* <strong>IBM Granite 4.0 Nano</strong>: Ultra-efficient tiny models for edge deployment (<a target="_blank" href="https://x.com/ArtificialAnlys/status/1983611955668775411">X</a>, <a target="_blank" href="https://artificialanalysis.ai/models/granite">Artificial Analysis</a>)</p><p>* <strong>Ming-flash-omni Preview</strong>: Sparse MoE omni-modal model (<a target="_blank" href="https://x.com/AntLingAGI/status/1982831211312722041">X</a>, <a 
target="_blank" href="https://huggingface.co/inclusionAI/Ming-flash-omni-Preview">HuggingFace</a>)</p><p>* <strong>Kimi Linear</strong>: 48B parameter model with 1M context (<a target="_blank" href="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct">HF</a>)</p><p>* <strong>Robotics</strong></p><p>* <strong>1X NEO</strong>: First consumer humanoid robot, $20k, delivery 2026 (<a target="_blank" href="https://x.com/1x_tech/status/1983233494575952138">X</a>, <a target="_blank" href="https://www.1x.tech/order">Order page</a>, <a target="_blank" href="https://youtu.be/LTYMWadOW7c">Keynote</a>)</p><p>* <strong>Big Companies & APIs</strong></p><p>* <strong>OpenAI Restructuring</strong>: ASI within 10 years, AI researcher by 2028 (<a target="_blank" href="https://x.com/AndrewCurran_/status/1983161944208220550">X</a>)</p><p>* <strong>Cursor 2.0 & Composer</strong>: 4x faster coding, new model (<a target="_blank" href="https://x.com/cursor_ai">X</a>, <a target="_blank" href="https://cursor.com/blog/2-0">Blog</a>)</p><p>* <strong>Cognition SWE-1.5</strong>: 950 tok/s, 40% SWE-bench Pro (<a target="_blank" href="https://cognition.ai/blog/swe-1-5">Blog</a>, <a target="_blank" href="https://x.com/cognition/status/1983662836896448756">X</a>, <a target="_blank" href="https://windsurf.com/download">Windsurf</a>)</p><p>* <strong>Perplexity Email Assistant</strong>: Privacy-first AI inbox management (<a target="_blank" href="https://x.com/perplexity_ai/status/1983591113903738970">X</a>, <a target="_blank" href="https://www.perplexity.ai/assistant">Assistant Site</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Fully Connected London - <a target="_blank" href="https://fullyconnected.com">fullyconnected.com</a></p><p>* AI Engineer Code Summit NYC - Nov 19-22</p><p>* CoreWeave acquires Marimo notebooks (<a target="_blank" href="https://x.com/marimo_io/status/1983916371869364622">X</a>)</p><p>* <strong>Vision & Video</strong></p><p>* <strong>Odyssey ML V2</strong>: 
Real-time interactive AI video (<a target="_blank" href="https://x.com/odysseyml/status/1982856110290939989">X</a>, <a target="_blank" href="https://experience.odyssey.ml">Experience</a>)</p><p>* <strong>Sora</strong>: Now invite-free + Character Cameos feature (<a target="_blank" href="https://x.com/OpenAI/status/1983661036533379486">X</a>, <a target="_blank" href="https://sora.chatgpt.com/p/s_6902d8b223d88191a04adfada2e9f5b5">Sonia Cameo</a>)</p><p>* <strong>Hailuo 2.3</strong>: Cinema-grade video generation (<a target="_blank" href="https://x.com/Hailuo_AI/status/1983016390878708131">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* <strong>MiniMax Speech 2.6</strong>: <250ms ultra-human voice AI (<a target="_blank" href="https://x.com/Hailuo_AI/status/1983557055819768108">X</a>, <a target="_blank" href="https://minimaxi.com/audio">MiniMax</a>, <a target="_blank" href="https://platform.minimax.io/docs/guides/sp">API Docs</a>)</p><p>* <strong>Cartesia Sonic 3</strong>: Real-time TTS with emotion & laughter, $100M funding (<a target="_blank" href="https://x.com/krandiash/status/1983202316397453676">X</a>, <a target="_blank" href="https://cartesia.ai/sonic">Website</a>, <a target="_blank" href="https://docs.cartesia.ai/2024-11-13/get-started/overview">Docs</a>)</p><p>* <strong>Tools</strong></p><p>* <strong>Pokee</strong>: Agentic workflow builder (<a target="_blank" href="https://x.com/Pokee_AI/status/1983202159262150717">X</a>)</p><p>* <strong>Pomelli</strong>: Google’s AI marketing agent (<a target="_blank" href="https://twitter.com/testingcatalog/status/1983214036259938553">X</a>, <a target="_blank" href="https://labs.google.com/pomelli/about/?ref=testingcatalog.com">Labs</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-30-2025-minimax-m2-shocks</link><guid isPermaLink="false">substack:post:177607070</guid><dc:creator><![CDATA[Alex Volkov and Skyler Miao]]></dc:creator><pubDate>Thu, 30 Oct 2025 23:54:32 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/177607070/de3003595ba9d2a902b7e2fb4dc0afff.mp3" length="93587974" type="audio/mpeg"/><itunes:author>Alex Volkov and Skyler Miao</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5849</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/177607070/8bd35d7c2f9b558547cba5ecb6efc823.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Oct 23: The AI Browser Wars Begin, DeepSeek's OCR Mind-Trick & The Race to Real-Time Video]]></title><description><![CDATA[<p>Hey everyone, Alex here! </p><p>Welcome... to the browser war II - the AI edition! This week we chatted in depth about ChatGPT’s new Atlas agentic browser, and the additional agentic powers Microsoft added to Edge with Copilot Mode (tho it didn’t work for me) </p><p>Also this week was a kind of crazy OCR week, with more than 4 OCR models releasing, and the crowning one is DeepSeek OCR, which turned the whole industry on its head (more later) </p><p>Quite a few video updates as well, with real time lipsync from Decart, and a new update from LTX with 4K native video generation; it’s been a busy AI week for sure! </p><p>Additionally, I’ve had the pleasure of talking about AI browsing agents with Paul from Browserbase and real time video with Kwindla Kramer from Pipecat/Daily, so make sure to tune in for those interviews. Buckle up, let’s dive in! </p><p><p>Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! 
This post is public so feel free to share it.</p></p><p><strong>Open Source: OCR is Not What You Think It Is </strong>(<a target="_blank" href="https://x.com/Presidentlin/status/1980159652563415094">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-OCR">HF</a>, <a target="_blank" href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf">Paper</a>)</p><p>The most important and frankly mind-bending release this week came from DeepSeek. They dropped DeepSeek-OCR, and let me tell you, this is NOT just another OCR model. The cohosts were buzzing about this, and once I dug in, I understood why. This isn’t just about reading text from an image; it’s a revolutionary approach to <strong>context compression</strong>.</p><p>We think that DeepSeek needed this as an internal tool, so we’re really grateful to them for open sourcing this, as they did something crazy here. They are essentially turning text into a visual representation, compressing it, and then using a tiny vision decoder to read it back with incredible accuracy. We’re talking about a compression ratio of up to <strong>10x with 97% decoding accuracy</strong>. Even at 20x compression they are achieving 60% decoding accuracy! My head exploded live on the show when I read that. This is like the middle-out compression algorithm joke from <em>Silicon Valley</em>, but it’s real. As Yam pointed out, this suggests our current methods of text tokenization are far from optimal.</p><p>With only 3B parameters (~570M active), they are taking a direct stab at long-context inefficiency: imagine taking 1M tokens, encoding them into 100K visual tokens, and then feeding those into a model. 
Since the model is tiny, it’s very cheap to run. For example, <a target="_blank" href="https://x.com/askalphaxiv"><strong>alphaXiv</strong></a> claimed they have OCR’d all of the papers on ArXiv with this model for $1000, a task that would have cost $7500 using MistralOCR. Per their paper, with DeepSeek OCR on a single H100 GPU, it’s possible to scan up to 200K pages! 🤯 Really innovative stuff! </p><p>OCR and VLM models had quite a week, with multiple models besides DeepSeek OCR releasing: models like Liquid’s LFM2-VL-3B (<a target="_blank" href="https://x.com/LiquidAI_/status/1980985540196393211">X</a>, <a target="_blank" href="https://huggingface.co/liquidai/lfm2-vl-3b">HF</a>), the newly updated 2B and 32B versions of Qwen3-VL (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1980665932625383868">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen3-VL">Hugging Face</a>), and AI2’s olmOCR-2-7B (<a target="_blank" href="https://x.com/allen_ai/status/1981029159267659821">X</a>, <a target="_blank" href="https://huggingface.co/allenai/olmOCR-2-7B-1025-FP8">HF</a>). </p><p>The Qwen models are particularly interesting: the 2B model is a generic VLM (it can also do OCR) and is close to the previous week’s 4B and 8B siblings, and the newly updated 32B model outperforms even GPT-5 mini and Claude 4 Sonnet! </p><p>The Browser Wars are BACK: OpenAI & Microsoft Go Agentic</p><p>Look, I may be aging myself here, but I remember, as a young frontend dev, having to install 5 browsers at once to test them out: Chrome, Internet Explorer, Firefox, Opera, etc. That was then, and now I have Dia, Comet, and the newly released Atlas, and, yeah, today I even installed Microsoft Edge to test their AI features! 
It seems like the AI boom brought with it a new reason for folks to try and take a bite out of Chrome (whose agentic features have long been rumored with Project Mariner but are nowhere to be found/shipped yet) </p><p>OpenAI’s ChatGPT Atlas: The Browser Reimagined (<a target="_blank" href="https://x.com/OpenAI/status/1980685602384441368">X</a>, <a target="_blank" href="https://chatgpt.com/atlas/get-started/">Download</a>)</p><p>OpenAI is proving that besides just models, they are a product powerhouse, stepping into categories like shopping (with a Shopify integration), app stores (with ChatGPT apps), social (with Sora2), and now... browsers! This week, they launched Atlas, their browser tightly integrated into ChatGPT, and it’s a big release! </p><p>I’ll split my review here into 2 parts: the browser features part and the agentic part. </p><p>New fresh take on a Chromium-based browser</p><p>The tight integration into ChatGPT is everywhere in this browser, from the new tab that looks like the basic ChatGPT interface, one line of text, to the sidebar on the left that... is the ChatGPT web sidebar with all your chats, projects, custom GPTs, etc. </p><p>The integration doesn’t stop there, as you have to sign in to your ChatGPT account to even use this browser (available only to MacOS users, and Pro, Plus and Nano tiers). The browser has a few neat tricks, like a special tool that allows you to search your browsing history with natural language; a query like “what were those shoes I was looking at a few days ago” will find the tabs where you browsed for shoes. </p><p>A special and cool feature is called, confusingly, “Cursor”, wherein you can select text and then click the little OpenAI logo that pops up, allowing you to ask ChatGPT for changes to that selected text (like fixing typos, sprucing up your writing, etc). It’s surprisingly convenient for rewriting tweets or for any type of document editing. 
</p><p>ChatGPT Atlas also stores memories about your browsing, in addition to the ChatGPT memories it stores about you from chats, learning your browsing patterns: which software you prefer to use, which websites you prefer to order food from, etc. This IMO is one of the biggest unlocks for folks inside the ChatGPT ecosystem, as much of a standard person’s preferences can be gleaned from their browser usage and patterns.</p><p>Lastly, the “Ask ChatGPT” sidepane on the right (which can be opened with cmd+.) is really great for chatting with a webpage, or going down search rabbit holes. It receives the context of the webpage you’re looking at by default (only 1 page so far; competitors allow you to add additional tabs with @, which is supposedly coming to ChatGPT soon), and you can ask... ChatGPT anything about it. </p><p>Agentic SOTA? not so fast</p><p>The most important “change” to how browsers work in Atlas imo is the agentic mode. This isn’t new; we remember when OpenAI launched their Operator Agent back in January of this year (our <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-23-2025-deepseek-r1">coverage</a>) and then renamed it Agent Mode and integrated it into ChatGPT itself back in <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-17th-kimi-k2-openai">July</a>. </p><p>So, web browsing agents are not entirely new. What’s novel here, though, is the integration into your browser, and the ability for the Atlas browser to use your logged-in sessions and cookies to pretend to be you! This... can be quite scary for some, as prompt injection attacks are getting more popular (wherein malicious a******s add hidden instructions to their website that will get the agent to do something you don’t like), but it’s also very exciting, as the agent can do much, much more without getting blocked by providers, who could previously just block Agent Mode as it ran on OpenAI servers! 
</p><p>Until today, there were 2 main agentic browsers in the mix: Perplexity’s Comet (where you can choose which model runs the agent) and Atlas. Comet seems to be doing a little bit better on some stuff in my tests, but not by much. I have the same agentic task (go to <a target="_blank" href="http://X.com">X.com</a>, find my bookmarks, open all links, summarize per my specific format) that I’ve been running for a while now, and Comet outdid Atlas this week on that task.</p><p>Who needs agentic browsing? </p><p>For some reason, most of the demos for agentic browsing are showing the same, boring-ish examples: book some flights, collect a grocery shopping cart. I’ve tried new and different things this week, for example, letting Atlas choose and order food for me (as ChatGPT knows my pescatarian preferences, it’s better than Comet for personal stuff), and one of the longest tasks I’ve had an agent do yet: I asked it to complete a compliance training I had to take at work! </p><p>Mind you, this is a very complex task, even for regular people, as these compliance websites are built to not be messed with. They have video players that stop if you switch focus to some other tab, interactive quizzes and games, drag-and-drop interfaces, and audio buttons, to make sure you really are taking the test. I can happily report that after 5 hours, and a few stops along the way (where I had to convince the agent to keep going), it completed this very hard task! (and now I have to take this course myself again to actually be compliant 😅 it will probably take me 2 hours to do manually) </p><p>This experiment made me think: who needs the agentic browsing features, and for what? Well, for tasks that require a lot of manual steps to do the same thing over and over again, an agentic browser is going to make a lot of people’s browsing a lot easier. Things like reviewing kids’ schedules across multiple websites, collecting data and formatting it differently, etc. 
</p><p>Scary security implications </p><p>Atlas could only finish my compliance task while being logged in as me, and ChatGPT Atlas gives you all-or-nothing control. You can run your agent with full access to your logged-in websites (think Gmail, etc.) or you can essentially give it an incognito mode. </p><p>This is, again, due to the risk of prompt injections in malicious websites becoming more and more prevalent. In a rare post detailing how they are thinking about this, OpenAI’s Chief Information Security Officer offered a deep dive into their attempts to mitigate this issue (Simon Willison had a great breakdown of that information <a target="_blank" href="https://simonwillison.net/2025/Oct/22/openai-ciso-on-atlas/">here</a>), but that’s likely not enough, so definitely be aware when you’re running agent mode (which currently needs to be explicitly turned on by selecting Agent) </p><p>This Week’s Buzz - Weights & Biases // CoreWeave</p><p>Weights & Biases (now proudly part of CoreWeave) had some exciting updates. Our Fully Connected conference series is hitting Tokyo on October 30-31 and London on November 4-5—perfect for ML practitioners and AI engineers. If you’re in the area, join us for talks, networking, and deep dives into the latest. Register at <a target="_blank" href="https://fullyconnected.com"><strong>Fullyconnected.com</strong></a>—DM me if you need a hook-up!</p><p>We also collaborated with Meta and Stanford on TorchForge, a new PyTorch-native library for scalable RL post-training and agent development. It’s built for massive GPU runs (we provided 520 H100s!), competing with Ray via tools like the Monarch scheduler. 
If you’re training on clusters, check the <a target="_blank" href="https://pytorch.org/blog/introducing-torchforge/">blog</a>—it’s a big deal for efficient multi-GPU workflows.</p><p>Microsoft goes after OpenAI with Edge Copilot Mode (<a target="_blank" href="https://x.com/mustafasuleyman/status/1981390345578697199">X</a>)</p><p>In a pretty surprising move, Microsoft today announced their take on the agentic browser war, with a bunch of enhancements to Copilot (their umbrella term for their AI assistance across Microsoft 365, the browser, Bing search, etc). Think... Clippy, for the AI age (they even brought Clippy back as an <a target="_blank" href="https://x.com/satyanadella/status/1981466897557196837">easter egg</a>) </p><p>The short version is, Edge is getting more powerful with custom agentic features (which I enabled and couldn’t get to work no matter how much I tried, so I can’t tell you how they compare to Atlas/Comet), and they have a voice mode that allows you to talk to your browser, with Edge having a sense of what’s on the actual page! Of course, this being Microsoft, marketing aside, when I asked Copilot if it has access to other tabs (as the marketing video claims), it said it doesn’t have access, agentic mode didn’t work, and I’m very unlikely to be testing it further! But hey, if you use the Copilot app on your mobile phone and click the new Mico avatar like 25 times, it will turn into Clippy, so... yay? </p><p><strong>Claude Code on the Web, Claude on Desktop upgraded </strong>(<a target="_blank" href="https://x.com/btibor91/status/1980340485152715095">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-code-on-the-web">Anthropic</a>)</p><p>Anthropic also made waves by bringing Claude Code to the web. Now you can delegate software tasks to Claude through a web interface with GitHub integration. Nisten was particularly excited about being able to manage his coding projects from his phone. 
It runs tasks in a secure sandbox, can handle multiple repos, and automatically creates pull requests. It’s another powerful coding agent becoming more accessible to developers everywhere. </p><p>They have also made changes to the desktop Claude app, allowing it to see the context of your screen via screenshots, share files, and even use a new voice mode that allows you to talk to Claude (which is unfortunately mapped to the tab button, without the ability to remap) </p><p><strong>Browser Automation and Delegated Authentication with Browserbase </strong>(<a target="_blank" href="https://x.com/pk_iv/status/1980653648310071663">X</a>, <a target="_blank" href="https://director.ai/?ref=producthunt">Director.ai</a>, <a target="_blank" href="https://stagehand.dev/">Stagehand</a>)</p><p>While OpenAI and Microsoft are building chat into the browser, what about bringing the browser into our chat-based agents? We had Paul Klein, the founder of Browserbase, join us to talk about this exact topic. His company is tackling one of the biggest hurdles for AI agents: authentication.</p><p>Paul and his team launched Director 2.0, a platform that lets you build web automation with natural language prompts. But the real innovation here is their integration with <strong>1Password</strong>. Instead of giving an agent the “master keys” to all your logged-in sessions like Atlas does, Browserbase allows for delegated, per-site authentication. When an agent running in the cloud needs to log into a site on your behalf, you get a prompt on your local machine to approve it. This is a much safer, more granular way to give agents the access they need. As Paul said, you shouldn’t give an AI the master keys to your house; you should give it permission to enter one room at a time. It’s a brilliant paradigm for secure agentic workflows, and I really like this piecemeal approach to authentication for browser agents. I wish Atlas had something like this for the incognito mode! 
</p><p>Director 2.0 itself is like V0 for web automation—you give it a prompt, it performs the task, and then it gives you a repeatable script you can deploy. It’s a way to create robust automations without needing to be a developer, and it’s already being used to automate thousands of hours of manual work. </p><p>Video & Audio: The Race to Real-Time</p><p>The world of generative media is moving at lightning speed, with a clear trajectory towards real-time, interactive experiences.</p><p><strong>Decart’s Real-Time Lip Sync API </strong>(<a target="_blank" href="https://x.com/DecartAI/status/1981078296084488293">X</a>)</p><p>We had Kwindla Kramer, one of the world’s leading experts in real-time audio, join us to break down a phenomenal release from Decart AI: a real-time lip-sync API. This isn’t the pre-rendered, slightly-off lip-sync we’re used to. This is a pipeline of models working together to generate perfectly synchronized lip movements for an avatar in real-time.</p><p>Kwindla explained the tech stack: it captures your audio via WebRTC, sends it to Whisper for transcription, gets a response from an LLM like Grok, generates a voice with ElevenLabs, and then Decart’s model modifies the avatar’s video frames to match the new audio, all with sub-two-second latency. This is how we get to truly interactive, believable AI characters. Kwindla even built a quick demo, though it didn’t seem to work in the morning, probably GPU issues, so we just played the demo videos. </p><p><strong>LTX-2 and Sora’s Pet Cameos</strong></p><p>The trend towards high-fidelity, real-time generation continued with a breaking news release from Lightricks: LTX-2. This is an open-source (weights coming this fall!) engine that can generate <strong>native 4K video with synchronized audio</strong>. It’s fast, efficient, and is set to be a powerful open alternative to closed models like Sora. And it’s native 4K, no upscaling! 
</p><p>Speaking of Sora, they announced that character cameos are getting an upgrade. Soon, you’ll be able to turn anything—your pet, a coffee cup, or even a sunny-side-up egg—into an animated, talking character. I’m really looking forward to this new Sora update and will let you know my impressions when it drops (soon, <a target="_blank" href="https://x.com/billpeeb/status/1981118483607032050">according to</a> Bill from OpenAI) </p><p>What a week, folks! What A WEEK! 😅 My head is still spinning! </p><p>From browsers that can do our work for us to OCR that redefines context, we’re seeing foundational shifts across the board. The tools are getting more powerful, more accessible, and more integrated into our daily workflows. The future is being built right now, and we get to watch it happen week by week.</p><p>Thank you for being a ThursdAI subscriber. As always, here are the show notes with all the links and details from this week’s whirlwind of AI news.</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="https://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Paul Klein <a target="_blank" href="https://x.com/pk_iv">@pk_iv</a> - Browserbase</p><p>* Kwindla Kramer <a target="_blank" href="https://x.com/kwindla">@kwindla</a> - Pipecat & Daily</p><p>* <strong>Open Source LLMs</strong></p><p>* DeepSeek-OCR: Efficient Vision-Text Compression for Massive Contexts (<a target="_blank" href="https://x.com/Presidentlin/status/1980159652563415094">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-OCR">HF</a>, <a target="_blank" href="https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf">Paper</a>)</p><p>* Liquid AI LFM2-VL-3B: Tiny 
Multilingual Vision-Language Model (<a target="_blank" href="https://x.com/LiquidAI_/status/1980985540196393211">X</a>, <a target="_blank" href="https://huggingface.co/liquidai/lfm2-vl-3b">HF</a>)</p><p>* PokeeResearch-7B: Open-source SOTA Deep Research Agent (<a target="_blank" href="https://x.com/Pokee_AI/status/1981040897346179256">X</a>, <a target="_blank" href="https://huggingface.co/PokeeAI/pokee_research_7b">HF</a>, <a target="_blank" href="https://pokee.ai/deepresearch-preview">Web</a>, <a target="_blank" href="https://arxiv.org/pdf/2510.15862.pdf">ArXiv</a>, <a target="_blank" href="https://github.com/Pokee-AI/PokeeResearchOSS">GitHub</a>)</p><p>* Qwen3-VL 2B & 32B: compact STEM-tuned multimodal powerhouses (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1980665932625383868">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen3-VL">Hugging Face</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI announces Atlas - its agentic AI browser (<a target="_blank" href="https://x.com/OpenAI/status/1980685602384441368">X</a>, <a target="_blank" href="https://chatgpt.com/atlas/get-started/">Download</a>)</p><p>* <a target="_blank" href="https://x.com/cryps1s/status/1981037851279278414">Security Implications, Injection + note from CISO</a></p><p>* Claude Code on the Web: Cloud Coding with Secure Sandboxing (<a target="_blank" href="https://x.com/btibor91/status/1980340485152715095">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-code-on-the-web">Anthropic</a>)</p><p>* Meta bans 1‑800‑ChatGPT on WhatsApp</p><p>* Microsoft agentic addition to Copilot Mode in Edge (<a target="_blank" href="https://x.com/MicrosoftEdge/status/1981028712830185914">X</a>)</p><p>* Gemini AI Studio launches “Vibe Coding” (<a target="_blank" href="https://x.com/OfficialLoganK/status/1980674135693971550">X</a>, <a target="_blank" href="https://ai.studio/build">AI Studio Build</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* 
Fully Connected comes to Tokyo (Oct 30-31) and London (Nov 4-5)! (<a target="_blank" href="https://fullyconnected.com">register at Fullyconnected.com</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Sora is about to get pet cameos</p><p>* Krea open‑sources a 14‑billion‑parameter real‑time video model (<a target="_blank" href="https://x.com/krea_ai/status/1980358158376988747">X</a>, <a target="_blank" href="https://huggingface.co/krea/krea-realtime-video">HF</a>)</p><p>* Reve’s unannounced video mode!? 1080p + sound</p><p>* LTX-2: open-source 4K audio+video generation engine from Lightricks (<a target="_blank" href="https://x.com/ltx_model/status/1981377626347323480">X</a>, <a target="_blank" href="https://ltx.studio/">Website</a>, <a target="_blank" href="https://github.com/Lightricks/LTX-Video">GitHub</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Decart Lip Sync API: Real-Time Avatar Lip Movement (<a target="_blank" href="https://x.com/DecartAI/status/1981078296084488293">X</a>)</p><p>* <strong>Tools</strong></p><p>* Browserbase launches Director 2.0: prompt-powered web automation (<a target="_blank" href="https://x.com/pk_iv/status/1980653648310071663">X</a>, <a target="_blank" href="https://director.ai/?ref=producthunt">Director.ai</a>, <a target="_blank" href="https://stagehand.dev/">Stagehand</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-23-the-ai-browser-wars</link><guid isPermaLink="false">substack:post:176975192</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 24 Oct 2025 03:02:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/176975192/ffa19fab40fc7f919bbc5f57c53d42b1.mp3" length="91456523" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5716</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/176975192/847f39fe33cfd3a78f5ee25b64ae5cd5.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Oct 16 - VEO3.1, Haiku 4.5, ChatGPT adult mode, Claude Skills, NVIDIA DGX spark, Wordlabs RTFM & more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here. </p><p>Can you believe it’s already the middle of October? This week’s show was a special one, not just because of the mind-blowing news, but because we set a new ThursdAI record with <strong>four</strong> incredible interviews back-to-back!</p><p>We had Jessica Gallegos from Google DeepMind walking us through the cinematic new features in VEO 3.1. Then we dove deep into the world of Reinforcement Learning with my new colleague Kyle Corbitt from OpenPipe. We got the scoop on Amp’s wild new ad-supported free tier from CEO Quinn Slack. And just as we were wrapping up, Swyx ( from <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a> , now with Cognition!) jumped on to break the news about their blazingly fast SWE-grep models. </p><p>But the biggest story? An AI model from Google and Yale made a novel scientific discovery about cancer cells that was then <em>validated in a lab</em>. This is it, folks. 
This is the “let’s f*****g go” moment we’ve been waiting for. So buckle up, because this week was an absolute monster. Let’s dive in!</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Open Source: An AI Model Just Made a Real-World Cancer Discovery</p><p>We always start with open source, but this week felt different. This week, open source AI stepped out of the benchmarks and into the biology lab.</p><p>Our friends at Qwen kicked things off with new 3B and 8B parameter versions of their Qwen3-VL vision model. It’s always great to see powerful models shrink down to sizes that can run on-device. What’s wild is that these small models are outperforming last generation’s giants, like the 72B Qwen2.5-VL, on a whole suite of benchmarks. The 8B model scores a 33.9 on OS World, which is incredible for an on-device agent that can actually see and click things on your screen. For comparison, that’s getting close to what we saw from Sonnet 3.7 just a few months ago. The pace is just relentless.</p><p>But then, Google dropped a <a target="_blank" href="https://blog.google/technology/ai/google-gemma-ai-cancer-therapy-discovery/">bombshell</a>. A 27-billion parameter Gemma-based model they developed with Yale, called C2S-Scale, generated a completely novel hypothesis about how cancer cells behave. This wasn’t a summary of existing research; it was a new idea, something no human scientist had documented before. And here’s the kicker: researchers then took that hypothesis into a wet lab, tested it on living cells, and proved it was true.</p><p>This is a monumental deal. For years, AI skeptics like Gary Marcus have said that LLMs are just stochastic parrots, that they can’t create genuinely new knowledge. This feels like the first, powerful counter-argument. Friend of the pod, Dr. 
Derya Unutmaz has been on the show before saying AI is going to solve cancer, and this is the first real sign that he might be right. The researchers noted this was an “emergent capability of scale,” proving once again that as these models get bigger and are trained on more complex data—in this case, turning single-cell RNA sequences into “sentences” for the model to learn from—they unlock completely new abilities. This is AI as a true scientific collaborator. Absolutely incredible.</p><p>Big Companies & APIs</p><p>The big companies weren’t sleeping this week, either. The agentic AI race is heating up, and we’re seeing huge updates across the board.</p><p>Claude Haiku 4.5: Fast, Cheap Model Rivals Sonnet 4 Accuracy (<a target="_blank" href="https://x.com/claudeai/status/1978505436358697052">X</a>, <a target="_blank" href="https://www.anthropic.com/news/introducing-claude-haiku-4-5">Official blog</a>, <a target="_blank" href="https://x.com/danshipper/status/1978506914498834484">X</a>)</p><p>First up, Anthropic released Claude Haiku 4.5, and it is a beast. It’s a fast, cheap model that’s punching way above its weight. On the SWE-bench Verified benchmark for coding, <strong>it hit 73.3%</strong>, putting it right up there with giants like GPT-5 Codex, but at a fraction of the cost and twice the speed of previous Claude models. Nisten has already been putting it through its paces and loves it for agentic workflows because it just follows instructions without getting opinionated. It seems like Anthropic has specifically tuned this one to be a workhorse for agents, and it absolutely delivers. </p><p>Also worth noting is the very impressive jump on OSWorld (<strong>50.7%</strong>), a computer-use benchmark; at this price and speed ($1/$5 per MTok input/output), Haiku is going to make computer agents much more streamlined and speedy!
</p><p>ChatGPT will loosen restrictions; age-gating enables “adult mode” with new personality features coming (<a target="_blank" href="https://x.com/sama/status/1978129344598827128">X</a>) </p><p>Sam Altman set X on fire with a <a target="_blank" href="https://x.com/sama/status/1978129344598827128">thread</a> announcing that ChatGPT will start loosening its restrictions. They’re planning to roll out an “adult mode” in December for age-verified users, potentially allowing for things like erotica. More importantly, they’re bringing back more customizable personalities, trying to recapture some of the magic of GPT-4o that so many people missed. It feels like they’re finally ready to treat adults like adults, letting us opt in to R-rated conversations while keeping strong guardrails for minors. This is a welcome change we’ve been advocating for a while, and it’s a notable departure from the xAI approach I covered <a target="_blank" href="https://open.substack.com/pub/thursdai/p/oct-9-2025-dev-days-agent-era-samsungs?r=2imipa&#38;utm_campaign=post&#38;utm_medium=web&#38;showWelcomeOnShare=true">last week</a>: opt-in for verified adults with precautions, versus engagement bait in the form of a flirty animated waifu with gamified mechanics. </p><p>Microsoft is making every Windows 11 PC an AI PC with Copilot voice input and agentic powers (<a target="_blank" href="https://blogs.windows.com/windowsexperience/2025/10/16/making-every-windows-11-pc-an-ai-pc/">Blog</a>, <a target="_blank" href="https://x.com/zacbowden/status/1978822883217461388">X</a>)</p><p>And in breaking news from this morning, Microsoft announced that every Windows 11 machine is becoming an AI PC. They’re building a new Copilot agent directly into the OS that can take over and complete tasks for you. The really clever part? It runs in a secure, sandboxed desktop environment that you can watch and interact with.
This solves a huge problem with agents that take over your mouse and keyboard, locking you out of your own computer. Now, you can give the agent a task and let it run in the background while you keep working. This is going to put agentic AI in front of hundreds of millions of users, and it’s a massive step towards making AI a true collaborator at the OS level.</p><p>NVIDIA DGX Spark - the tiny personal supercomputer at $4K (<a target="_blank" href="https://twitter.com/lmsysorg">X</a>, <a target="_blank" href="https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/">LMSYS Blog</a>)</p><p>NVIDIA finally delivered their promised AI supercomputer, and the excitement was in the air, with Jensen hand-delivering the DGX Spark to OpenAI and Elon (recreating that historic photo of Jensen hand-delivering a signed DGX workstation back when Elon was still affiliated with OpenAI). The workstation sold out almost immediately. Folks from LMSYS did a great <a target="_blank" href="https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/">deep dive</a> into the specs. All the while, folks on our feeds are saying that if you want the maximum possible open-source LLM inference speed, this machine is probably overpriced compared to what you can get with an M3 Ultra Mac with 128GB of RAM or an RTX 5090 GPU, which can get you similar if not better speeds at significantly lower price points. </p><p>Anthropic’s “Claude Skills”: Your AI Agent Finally Gets a Playbook (<a target="_blank" href="https://www.anthropic.com/news/skills">Blog</a>)</p><p>Just when we thought the week couldn’t get any more packed, Anthropic dropped “Claude Skills,” a huge upgrade that lets you give your agent custom instructions and workflows. Think of them as expertise folders you can create for specific tasks.
For example, you can teach Claude your personal coding style, how to format reports for your company, or even give it a script to follow for complex data analysis.</p><p>The best part is that Claude automatically detects which “Skill” is needed for a given task, so you don’t have to manually load them. This is a massive step towards making agents more reliable and personalized, moving beyond just a single custom instruction and into a library of repeatable, expert processes. It’s available now for all paid users, and it’s a feature I’ve been waiting for. Our friend <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> thinks Skills may be <a target="_blank" href="https://simonwillison.net/2025/Oct/16/claude-skills/">a bigger deal than MCPs</a>! </p><p>🎬 Vision & Video: Veo 3.1, Sora Gets Longer, and Real-Time Worlds</p><p>The AI video space is exploding. We started with an amazing interview with Jessica Gallegos, a Senior Product Manager at Google DeepMind, all about the new Veo 3.1. This is a significant 0.1 update, not a whole new model, but the new features are game-changers for creators.</p><p>The audio quality is way better, and they’ve massively improved video extensions. The model now conditions on the last second of a clip—including the audio. This means if you extend a video of someone talking, they keep talking in the same voice! This is huge, saving creators from complex lip-syncing and dubbing workflows. They also added object insertion and removal, which works on both generated and real-world video. Jessica shared an incredible story about working with director Darren Aronofsky to insert a virtual baby into a live-action film shot, something that’s ethically and practically very difficult to do on a real set. These are professional-grade tools that are becoming accessible to everyone.
Definitely worth listening to the whole interview with Jessica, starting at 00:25:44.</p><p>I’ve played with the new VEO in Google Flow, and I was somewhat (still) disappointed with the UI itself (it froze sometimes and didn’t play). I wasn’t able to upload my own videos to use the insert/remove features Jessica mentioned yet, but I saw examples online and they looked great! </p><p>Ingredients were also improved with VEO 3.1: you can add up to 3 references, and they will be included in your video, but not as a first frame; the model will use them to condition the video generation. Jessica clarified that if you upload sound, as in your voice, it won’t show up in the model as your voice yet, but maybe they will add this in the future (at least this was my feedback to her). </p><p>SORA 2 extends video gen to 15s for all, 25 seconds for Pro users with a new storyboard </p><p>Not to be outdone, OpenAI pushed a bit of an update for Sora. All users can now generate up to 15-second clips (up from 8-10), and Pro users can go up to 25 seconds using a new storyboard feature. I’ve been playing with it, and while the new scene-based workflow is powerful, I’ve noticed the quality can start to degrade significantly in the final seconds of a longer generation (I posted my experiments <a target="_blank" href="https://x.com/altryne/status/1978569726734545009">here</a>). As you can see, the last few shots of the cowboy don’t have any action, and the face is a blurry mess. </p><p>Worldlabs RTFM: Real-Time Frame Model renders 3D worlds at interactive speeds on a single H100 (<a target="_blank" href="https://x.com/theworldlabs/status/1978839171058815380">X</a>, <a target="_blank" href="https://www.worldlabs.ai/blog/rtfm">Blog</a>, <a target="_blank" href="https://rtfm.worldlabs.ai/">Demo</a>)</p><p>And just when we thought we’d seen it all, World Labs dropped a breaking news release: RTFM, the Real-Time Frame Model.
This is a generative world model that renders interactive, 3D-consistent worlds on the fly, all on a single H100 GPU. Instead of pre-generating a 3D environment, it’s a “learned renderer” that streams pixels as you move. We played with the demo live on the show, and it’s mind-blowing. The object permanence is impressive; you can turn 360 degrees and the scene stays perfectly coherent. It feels like walking around inside a simulation being generated just for you.</p><p>This Week’s Buzz: RL Made Easy with Serverless RL + interview with Kyle Corbitt</p><p>It was a huge week for us at Weights & Biases and CoreWeave. I was thrilled to finally have my new colleague Kyle Corbitt, founder of OpenPipe, back on the show to talk all things Reinforcement Learning (RL).</p><p>RL is the technique behind the massive performance gains we’re seeing in models for tasks like coding and math. At a high level, it lets a model try things, and then you “reward” it for good outcomes and penalize it for bad ones, allowing it to learn strategies that are better than what was in its original training data. The problem is, it’s incredibly complex and expensive to set up the infrastructure. You have to juggle an inference stack for generating the “rollouts” and a separate training stack for updating the model weights.</p><p>This is the problem Kyle and his team have solved with Serverless RL, which we just launched and we covered <a target="_blank" href="https://open.substack.com/pub/thursdai/p/oct-9-2025-dev-days-agent-era-samsungs?r=2imipa&#38;utm_campaign=post&#38;utm_medium=web&#38;showWelcomeOnShare=true">last week</a>. It’s a new offering that lets you run RL jobs without managing any servers or GPUs. 
The whole thing is powered by the CoreWeave stack, with tracing and evaluation beautifully visualized in Weave.</p><p>We also launched a <a target="_blank" href="https://wandb.ai/site/inference/cw_openpipe_qwen3-14b-instruct">new model</a> from the OpenPipe team on our inference service: a fine-tune-friendly “instruct” version of Qwen3 14B. The team is not just building amazing products, they’re also contributing great open-source models. It’s awesome to be working with them.</p><p>🛠️ Tools & Agents: Free Agents & Lightning-Fast Code Search</p><p>The agentic coding space saw two massive announcements this week, and we had representatives of both companies on the show to discuss them!</p><p>First, Quinn Slack from Amp announced that they’re launching a completely free, ad-supported tier. I’ll be honest, my first reaction was, “Ads? In my coding agent? Eww.” But the more I thought about it, the more it made sense. My AI subscriptions are stacking up, and this model makes powerful agentic coding accessible to students and developers who can’t afford another $20/month. The ads are contextual to your codebase (think Baseten or Axiom), and the tier is powered by a rotating mix of models using surplus capacity from providers. It’s a bold and fascinating business model.</p><p>This move was met with generally positive responses, though folks from a competing <a target="_blank" href="https://x.com/pashmerepat/status/1978934813253079383">agent</a> claim that Amp is serving Grok-4-fast, which xAI is giving out for free anyway. We’ll see how this shakes out.
</p><p>Cognition announces SWE-grep: RL-powered multi-turn context retriever for agentic code search (<a target="_blank" href="https://cognition.ai/blog/swe-grep">Blog</a>, <a target="_blank" href="https://x.com/cognition/status/1978867021669413252">X</a>, <a target="_blank" href="https://playground.cognition.ai/">Playground</a>, <a target="_blank" href="https://windsurf.com/">Windsurf</a>)</p><p>Then, just as we were about to sign off, friend of the pod Swyx (now from Cognition) dropped in with breaking news about SWE-grep. It’s a new, RL-tuned sub-agent for their Windsurf editor that makes code retrieval and context gathering ridiculously fast. We’re talking over 2,800 tokens per second (yes, they are using Cerebras under the hood). </p><p>The key insight from Swyx is that their model was trained for natively parallel tool calling, running up to eight searches on a codebase simultaneously. This speeds up the “read” phase of an agent’s workflow—which is 60-70% of the work—by 3-5x. It’s all about keeping the developer in a state of flow, and this is a huge leap forward in making agent interactions feel instantaneous. Swyx also dropped a hint that the next thing coming is CodeMaps, which will make these retrievers look trivial! </p><p>This was one for the books, folks. An AI making a novel cancer discovery, video models taking huge leaps, and the agentic coding space on fire. The pace of innovation is just breathtaking.
Thank you for being a ThursdAI subscriber, and as always, here’s the TL;DR and show notes for everything that happened in AI this week.</p><p>TL;DR and Show Notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Jessica Gallegos</strong>, Sr. Product Manager, Google DeepMind</p><p>* <strong>Kyle Corbitt (</strong><a target="_blank" href="https://x.com/corbtt"><strong>@corbtt</strong></a><strong>)</strong> - OpenPipe//W&B</p><p>* <strong>Quinn Slack (</strong><a target="_blank" href="https://x.com/sqs/status/1978521044194398713"><strong>@sqs</strong></a><strong>)</strong> - Amp</p><p>* <strong>Swyx (</strong><a target="_blank" href="http://x.com/@swyx"><strong>@swyx</strong></a><strong>)</strong> - Cognition</p><p>* <strong>Open Source LLMs</strong></p><p>* KAIST KROMo - bilingual Korean/English 10B (<a target="_blank" href="https://t.co/kDIylkn5pC">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2510.09426">Paper</a>)</p><p>* Qwen3-VL 3B and 8B (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1978150959621734624">X post</a>, <a target="_blank" href="https://huggingface.co/collections/Qw">HF</a>)</p><p>* Google’s C2S-Scale 27B: AI Model Validates Cancer Hypothesis in Living Cells (<a target="_blank" href="https://x.com/sundarpichai/status/1978507110477332582">X</a>, <a target="_blank" href="https://blog.google/technology/ai/google-gemma-ai-cancer-therapy-discovery/">Blog</a>, <a target="_blank" href="https://www.biorxiv.org/content/10.1101/2025.04.14.648850v2">Paper</a>)</p><p>* <strong>Big CO LLMs +
APIs</strong></p><p>* Claude Haiku 4.5: Fast, Cheap Model Rivals Sonnet 4 Accuracy (<a target="_blank" href="https://x.com/claudeai/status/1978505436358697052">X</a>, <a target="_blank" href="https://www.anthropic.com/news/introducing-claude-haiku-4-5">Official blog</a>)</p><p>* ChatGPT will loosen restrictions; age-gating enables “adult mode” with new personality features coming (<a target="_blank" href="https://x.com/sama/status/1978129344598827128">X</a>)</p><p>* OpenAI updates memory management - no more “memory full” (<a target="_blank" href="https://x.com/OpenAI/status/1978608684088643709">X</a>, <a target="_blank" href="https://help.openai.com/en/articles/8590148-memory-faq">FAQ</a>)</p><p>* Microsoft is making every Windows 11 PC an AI PC with Copilot voice input (<a target="_blank" href="https://x.com/zacbowden/status/1978822883217461388">X</a>)</p><p>* Claude Skills: Custom instructions for AI agents now live (<a target="_blank" href="https://x.com/claudeai/status/1978855432123723909">X</a>, <a target="_blank" href="https://www.anthropic.com/news/skills">Anthropic News</a>, <a target="_blank" href="https://www.youtube.com/watch?v=IoqpBKrNaZI">YouTube Demo</a>)</p><p>* <strong>Hardware</strong></p><p>* NVIDIA DGX Spark: desktop personal supercomputer for AI prototyping and local inference (<a target="_blank" href="https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/">LMSYS Blog</a>)</p><p>* Apple announces M5 chip with double AI performance (<a target="_blank" href="https://www.apple.com/newsroom/2025/10/apple-unleashes-m5-the-next-big-leap-in-ai-performance-for-apple-silicon/">Apple Newsroom</a>)</p><p>* OpenAI and Broadcom set to deploy 10 gigawatts of custom AI accelerators (<a target="_blank" href="https://openai.com/index/openai-and-broadcom-announce-strategic-collaboration/">Official announcement</a>)</p><p>* <strong>This week’s Buzz</strong></p><p>* New model - OpenPipe Qwen3 14B instruct (<a target="_blank"
href="https://wandb.ai/site/inference/cw_openpipe_qwen3-14b-instruct">link</a>)</p><p>* Interview with Kyle Corbitt - RL, Serverless RL</p><p>* W&B Fully Connected London & Tokyo in 20 days - <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected/">SIGN UP</a></p><p>* <strong>Vision & Video</strong></p><p>* Veo 3.1: Google’s Next-Gen Video Model Launches with Cinematic Audio (<a target="_blank" href="https://developers.googleblog.com/">Developers Blog</a>)</p><p>* Sora up to 15s and pro now up to 25s generation with a new storyboard feature</p><p>* Baidu’s MuseStreamer has >20 second generations (<a target="_blank" href="https://x.com/Baidu_Inc/status/1978505872805658960">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Worldlabs RTFM: Real-Time Frame Model renders 3D worlds at interactive speeds on a single H100 (<a target="_blank" href="https://www.worldlabs.ai/blog/rtfm">Blog</a>, <a target="_blank" href="https://rtfm.worldlabs.ai/">Demo</a>)</p><p>* DiT360: SOTA Panoramic Image Generation with Hybrid Training (<a target="_blank" href="https://fenghora.github.io/DiT360-Page/">Project page</a>, <a target="_blank" href="https://github.com/Insta360-Resea">GitHub</a>)</p><p>* Riverflow 1 tops the image‑editing leaderboard (<a target="_blank" href="https://www.sourceful.com/blog/riverflow-1">Sourceful blog</a>)</p><p>* <strong>Tools</strong></p><p>* Amp launches a Free tier - powered by ads and surplus model capacity (<a target="_blank" href="https://ampcode.com/free">Website</a>)</p><p>* Cognition SWE-grep: RL-powered multi-turn context retriever for agentic code search (<a target="_blank" href="https://cognition.ai/blog/swe-grep">Blog</a>, <a target="_blank" href="https://playground.cognition.ai/">Playground</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-16-veo31-haiku-45-chatgpt</link><guid isPermaLink="false">substack:post:176372877</guid><dc:creator><![CDATA[Alex Volkov, Kyle, Latent.Space, and Quinn Slack]]></dc:creator><pubDate>Fri, 17 Oct 2025 01:13:58 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/176372877/65d7512ff180fdfef56d57b0cbb10807.mp3" length="90854441" type="audio/mpeg"/><itunes:author>Alex Volkov, Kyle, Latent.Space, and Quinn Slack</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5678</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/176372877/45decf49c14e2c2e84ac69d4612beb9f.jpg"/></item><item><title><![CDATA[📆 Oct 9, 2025 — Dev Day’s Agent Era, Samsung’s 7M TRM Shock, Ling‑1T at 1T, Grok Video goes NSFW, and Serverless RL arrives]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>We’re deep in the post-reality era now. Between Sora2, the latest waves of video models, and “is-that-person-real” cameos, it’s getting genuinely hard to trust what we see. Case in point: I recorded a short clip with (the real) Sam Altman this week and a bunch of friends thought I faked it with Sora-style tooling. Someone even added a fake Sora watermark just to mess with people. Welcome to 2025.</p><p>This week’s episode and this write-up focus on a few big arcs we’re all living through at once: OpenAI’s Dev Day and the beginning of the agent-app platform inside ChatGPT, a bizarre and exciting split-screen in model scaling where a 7M recursive model from Samsung is suddenly competitive on reasoning puzzles while inclusionAI is shipping a trillion-parameter mixture-of-reasoners,  and Grok’s image-to-video now does audio and pushes the line on… taste. 
We also dove into practical evals for coding agents with Eric Provencher from Repo Prompt, and I’ve got big news from my day job world: W&B + CoreWeave launched Serverless RL, so training and deploying RL agents at scale is now one API call away.</p><p>Let’s get into it.</p><p>OpenAI’s 3rd Dev Day - Live Coverage + exclusive interviews</p><p>This is the third Dev Day that I got to attend in person, covering this for ThursdAI (<a target="_blank" href="https://sub.thursdai.news/p/nov-09">2023</a>, <a target="_blank" href="https://sub.thursdai.news/p/oct-3-how-i-met-sam-altman">2024</a>), and this one was the best by far! The production quality of their events rises every year, and this year they’ve opened up the conference to >1500 people, had 3 main launches and a lot of ways to interact with the OpenAI folks! </p><p>I also got an exclusive chance to sit in on a fireside chat with Sam Altman and Greg Brockman (snippets of which I’ve included in the podcast, starting at 01:15:00), and I got to ask Sam a few questions after that as well. </p><p>Event Ambiance and Vibes</p><p>OpenAI folks outdid themselves with this event: the live demos were quite incredible, and the location (Fort Mason), the food, and just the whole thing were on point. The event concluded with a 1x1 Sam and Jony Ive chat that I hope will be published on YT sometime, because it was very insightful. </p><p>By far the best reason to go to this event in person is meeting folks and networking, both OpenAI employees and AI engineers who use their products. It’s the one day a year when OpenAI turns all their attending employees into Developer Experience folks: you can, and are encouraged to, interact with them, ask questions, and give feedback, and it’s honestly great! </p><p>I really enjoy meeting folks at this event and consider this to be a very high signal network, and was honored to have quite a few ThursdAI listeners among the participants and OpenAI folk!
If you’re reading this, thank you for your patronage 🫡 </p><p>Launches and Ships</p><p>OpenAI also shipped, and shipped a LOT! Sam was up on the keynote stage with 3 main pillars, which we’ll break down 1 by 1: ChatGPT Apps, AgentKit (+ Agent Builder), and Codex/New APIs.</p><p>Codex & New APIs</p><p>Codex has hit General Availability, but we’ve been using it all this time so we don’t really care; what we do care about is the new Slack integration and the new Codex SDK, which means you can now directly inject Codex agency into your app. This flew a bit over people’s heads, but Romain Huet, VP of DevEx at OpenAI, demoed on stage how his mobile app now has a Codex tab, where he can ask Codex to make changes to the app at runtime! It was quite crazy! </p><p>ChatGPT Apps + AppsSDK</p><p>This was maybe the most visual and most surprising release, since they’ve tried to be an app store before (Plugins, CustomGPTs). But this time, it seems like, built on top of MCP, ChatGPT is going to become a full-blown app store for 800+ million weekly active ChatGPT users as well. </p><p>Some of the examples they showed included Spotify and Zillow: just by typing “Spotify” in ChatGPT, you will have an interactive app with its own UI, right inside of ChatGPT. So you could ask it to create a playlist for you based on your history, or ask Zillow to find homes in an area under a certain $$ amount.</p><p>The most impressive thing is that those are only launch partners; everyone can (technically) build a ChatGPT app with their AppsSDK that’s built on top of... the MCP (model context protocol) spec! </p><p>The main question remains discoverability; this is where Plugins and CustomGPTs (previous attempts to create apps within ChatGPT) failed, and when I asked him about it, Sam basically said “we’ll iterate and get it right” (starting 01:17:00). So it remains to be seen if folks really need yet another app store inside ChatGPT.
</p><p>AgentKit, AgentBuilder and ChatKit</p><p>2025 is the year of agents, and besides launching quite a few of their own, OpenAI will now let you build and host smart agents that can use tools on their platform. Supposedly, with AgentBuilder, building agents is just dragging a few nodes around, prompting and connecting them. They had a great demo on stage where, in less than 8 minutes, they built an agent to interact with the DevDay website.</p><p>It’s also great to see how thoroughly OpenAI has adopted the MCP spec, as this too is powered by MCP: any external connection you want to give your agent must happen through an MCP server. </p><p>Agents for the masses is maybe not quite there yet</p><p>In reality though, things are not so easy. Agents require more than just a nice drag & drop interface; they require knowledge, iteration, constant evaluation (which they’ve also added, kudos!), and eventually, customized agents need code. </p><p>I <a target="_blank" href="https://x.com/altryne/status/1976024045020934317">spent an hour</a> trying it out yesterday, building an agent to search the ThursdAI archives. The experience was a mixed bag. The AI-native features are incredibly cool. For instance, you can just describe the JSON schema you want as an output, and it generates it for you. The widget builder is also impressive, allowing you to create custom UI components for your agent’s responses.</p><p>However, I also ran into the harsh realities of agent building. My agent’s web browsing tool failed because Substack seems to be blocking OpenAI’s crawlers, forcing me to fall back on the old-school RAG approach of uploading our entire archive to a vector store. And while the built-in evaluation and tracing tools are a great idea, they were buggy and failed to help me debug the error. It’s a powerful tool, but it also highlights that building a reliable agent is an iterative, often frustrating process that a nice UI alone can’t solve.
It’s not just about the infrastructure; it’s about wrestling with a stochastic machine until it behaves.</p><p>But to get started with something simple, they have definitely pushed the envelope on what is possible without coding. </p><p>OpenAI also dropped a few key API updates:</p><p>* <strong>GPT-5-Pro</strong> is now available via API. It’s incredibly powerful but also incredibly expensive. As Eric mentioned, you’re not going to be running agentic loops with it, but it’s perfect for a high-stakes initial planning step where you need an “expert opinion.”</p><p>* <strong>SORA 2</strong> is also in the API, allowing developers to integrate their state-of-the-art video generation model into their own apps. The API can access the 15-second “Pro” model but doesn’t support the “Cameo” feature for now.</p><p>* <strong>Realtime-mini</strong> is a game-changer for voice AI. It’s a new, ultra-fast speech-to-speech model that’s <strong>80% cheaper </strong>than the original Realtime API. This massive price drop removes one of the biggest barriers to building truly conversational, low-latency voice agents.</p><p><strong>My Chat with Sam & Greg - On Power, Responsibility, and Energy</strong></p><p>After the announcements, I’ve got to sit in a fireside chat with Sam Altman and Greg Brockman and ask some questions. Here’s what stood out:</p><p>When I asked about the energy requirements for their massive compute plans (remember the $500B Stargate deal?), Sam said they’d have announcements about Helion (his fusion investment) soon but weren’t ready to talk about it. Then someone from Semi Analysis told me most power will come from... generator trucks. 15-megawatt generator trucks that just drive up to data centers. 
We’re literally going to power AGI with diesel trucks!</p><p>On responsibility, when I brought up the <a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-1-qwen-3-phi-4-openai">glazing</a> incident and asked how they deal with being in the lives of 800+ million people weekly, Sam’s response was sobering: “This is not the excitement of ‘oh we’re building something important.’ This is just the stress of the responsibility... The fact that 10% of the world is talking to kind of one brain is a strange thing and there’s a lot of responsibility.”</p><p>Greg added something profound: “AI is far more surprising than I anticipated... The deep nuance of how these problems contact reality is something that I think no one had anticipated.”</p><p><strong>This Week’s Buzz: RL X-mas came early with Serverless RL! </strong>(<a target="_blank" href="https://x.com/wandb/status/1975917052920678528">X</a>, <a target="_blank" href="https://openpipe.ai/blog/serverless-rl">Blog</a>)</p><p>Big news from our side of the world! About a month ago, the incredible OpenPipe team joined us at Weights & Biases and CoreWeave. They are absolute wizards when it comes to fine-tuning and Reinforcement Learning (RL), and they wasted no time combining their expertise with CoreWeave’s massive infrastructure.</p><p>This week, they launched <strong>Serverless RL</strong>, a managed reinforcement learning service that completely abstracts away the infrastructure nightmare that usually comes with RL. It automatically scales your training and inference compute, integrates with W&B Inference for instant deployment, and simplifies the creation of reward functions and verifiers. RL is what turns a good model into a great model for a specific task, often with surprisingly little data. This new service massively lowers the barrier to entry, and I’m so excited to see what people build with it. 
We’ll be doing a deeper dive on this soon but please check out the <a target="_blank" href="wandb.me/RLS">Colab Notebook</a> to get a taste of what AutoRL is like! </p><p><strong>Open Source</strong></p><p>While OpenAI was holding its big event, the open-source community was busy dropping bombshells of its own.</p><p><strong>Samsung’s TRM: Is This 7M Parameter Model... Magic? </strong>(<a target="_blank" href="https://x.com/jm_alexia/status/1975560628657164426">X</a>, <a target="_blank" href="https://t.co/w5ZDsHDDPE">Blog</a>, <a target="_blank" href="https://arxiv.org/pdf/2510.04871.pdf">arXiv</a>)</p><p>This was the release that had everyone’s jaws on the floor. A single researcher from the Samsung AI Lab in Montreal released a paper on a <strong>Tiny Recursive Model (TRM)</strong>. Get this: it’s a <strong>7 MILLION parameter model</strong> that is outperforming giants like DeepSeek-R1 and Gemini 2.5 Pro on complex reasoning benchmarks like ARC-AGI. I had to read that twice. 7 million, not billion.</p><p>How is this possible? Instead of relying on brute-force scale, TRM uses a recursive process. It generates a first draft of an answer, then repeatedly critiques and refines its own logic in a hidden “scratchpad” up to 16 times. As Yam pointed out, the paper is incredibly insightful, and it’s a groundbreaking piece of work from a single author, which is almost unheard of these days. Eric made a great point that because it’s so small, it opens the door for hobbyists and solo researchers to experiment with cutting-edge architectures on their home GPUs. 
This feels like a completely new direction for AI, and it’s incredibly exciting.</p><p><strong>inclusionAI’s Ling-1T: Enter the Trillion Parameter Club </strong>(<a target="_blank" href="https://x.com/AntLingAGI/status/1975942293330018426">X</a>, <a target="_blank" href="https://huggingface.co/inclusionAI/Ling-1T">HF</a>, <a target="_blank" href="https://zenmux.ai/settings/chat?model=inclusionai/ling-1t">Try it</a>)</p><p>On the complete opposite end of the scale (about three orders of magnitude away), we have <strong>Ling-1T</strong> from inclusionAI. This is a <strong>1 TRILLION parameter</strong> Mixture-of-Experts (MoE) model. The key here is efficiency; while it has a trillion total parameters, it only uses about 37 billion active parameters per token.</p><p>The benchmarks are wild, showing it beating models like GPT-5-Main (in non-thinking mode) and Gemini 2.5 on a range of reasoning tasks. They claim to match Gemini’s performance using about half the compute. Of course, with any new model posting huge scores, there’s always the question of whether it was trained on the public test sets, but the results are undeniably impressive. It’s another example of the push towards maintaining top-tier performance while drastically reducing the computational cost of inference.</p><p><strong>More Open Source Goodness: Microsoft, AI21, and IBM</strong></p><p>It didn’t stop there.</p><p>* <strong>Microsoft</strong> released <strong>UserLM-8B</strong>, a fascinating Llama 3 finetune trained not to be an assistant, but to simulate the <em>user</em> in a conversation. As Yam explained from his own experience, this is a super useful technique for generating high-quality, multi-turn synthetic data to train more robust chatbot agents. 
(<a target="_blank" href="https://x.com/altryne/status/1976122132355580113">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/UserLM-8b">HF</a>)</p><p>* Our friends at <strong>AI21 Labs</strong> are back with <strong>Jamba Reasoning 3B</strong>, a tiny but mighty 3-billion-parameter model. It uses a hybrid SSM-Transformer architecture, which makes it incredibly fast for its size and a great option for local inference on a laptop.</p><p>* <strong>IBM</strong> also released their <strong>Granite</strong> family of models, which also use a hybrid design. Their big focus is on enterprise-grade governance and trust, and it’s the first open model family to get an ISO certification for AI management systems.</p><p><strong>Big Company Moves: Grok Imagine Levels Up... And Leans In</strong></p><p>Finally, let’s talk about the latest update to <strong>Grok Imagine</strong>. They’ve rolled out video generation with synchronized sound, and it’s fast—often faster than Sora. The quality has significantly improved, and it’s a powerful tool.</p><p>However, I have to talk about the other side of this. Grok is positioning itself as the “uncensored” alternative, and they are leaning into that hard. Their video generator has a “spicy” mode that explicitly generates 18+ content. But the thing that truly disturbed me was a new feature with their animated character, “Annie.” It’s a gamified engagement mechanic where you “make your connection better” by talking to her every day to unlock special rewards, like new outfits.</p><p>To be blunt, this is disgusting. We talk a lot on this show about the immense responsibility that comes with building these powerful AIs. I know from my conversations with folks at OpenAI and other labs that they think deeply about safety, preventing misuse, and the psychological impact these systems can have. This feature from Grok is the polar opposite. 
It leans into the worst fears about AI creating addictive, para-social relationships. It’s exploitative, and frankly, the team behind it should reconsider their choices IMO. </p><p>All righty, that’s mostly the news for this week; it’s been a very busy one, and if you’d like to see our live coverage + DevDay keynote + interviews I’ve had with <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a>, Greg Kamradt, Jeffrey Huber, Alessio from <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a>, Matthew Berman and more impactful folks, our livestream can be found here: </p><p>I’m incredibly humbled and privileged to keep being invited to Dev Day, and I look forward to covering more events with exclusive interviews, on-the-ground reporting, and insights. If you enjoy this content, please subscribe so it can continue. </p><p>TL;DR and Show Notes</p><p>* <strong>Show Notes & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/@nisten">@nisten</a>, <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Guest</strong>: Kyle Corbitt - OpenPipe / CoreWeave (<a target="_blank" href="https://x.com/corbtt">@corbtt</a>)</p><p>* <strong>Guest</strong>: Eric Provencher - Repo Prompt (<a target="_blank" href="https://x.com/pvncher">@pvncher</a>)</p><p>* <strong>OpenAI Dev Day</strong></p><p>* OpenAI AgentKit All-in-One Agent Builder (<a target="_blank" href="https://x.com/rohanpaul_ai/status/1975309479047798835">X</a>, <a target="_blank" href="https://openai.com/index/introducing-agentkit/">OpenAI</a>)</p><p>* ChatGPT Apps & New APIs (GPT-5-pro, SORA, 
realtime-mini)</p><p>* <strong>Open Source LLMs</strong></p><p>* Microsoft UserLM-8B Model Released (<a target="_blank" href="https://x.com/altryne/status/1976122132355580113">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/UserLM-8b">HF</a>)</p><p>* Samsung Tiny Recursive Model (TRM) (<a target="_blank" href="https://x.com/jm_alexia/status/1975560628657164426">X</a>, <a target="_blank" href="https://t.co/w5ZDsHDDPE">Blog</a>, <a target="_blank" href="https://arxiv.org/pdf/2510.04871.pdf">arXiv</a>)</p><p>* AI21 Labs releases Jamba Reasoning 3B (<a target="_blank" href="https://x.com/AI21Labs/status/1976271434004541641">X</a>, <a target="_blank" href="https://huggingface.co/ai21labs/AI21-Jamba-Reasoning-3B">HF</a>)</p><p>* inclusionAI debuts Ling-1T: Trillion-Scale Efficient Reasoner (<a target="_blank" href="https://x.com/AntLingAGI/status/1975942293330018426">X</a>, <a target="_blank" href="https://huggingface.co/inclusionAI/Ling-1T">HF</a>, <a target="_blank" href="https://zenmux.ai/settings/chat?model=inclusionai/ling-1t">Try it</a>)</p><p>* IBM Granite Models</p><p>* <strong>Evals</strong></p><p>* Repo Bench by Repo Prompt (<a target="_blank" href="https://repo.prompt.com/bench">Web</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Qwen 3 Omni & Realtime Models</p><p>* Google DeepMind unveils Gemini 2.5 Computer-Use model (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1975917052920678528">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-computer-use-model">Blog</a>)</p><p>* Google Gemini Flash 2.5 (new)</p><p>* Grok Imagine updated with video and sound</p><p>* <strong>This week’s Buzz</strong></p><p>* OpenPipe (part of CoreWeave, W&B) launches Serverless RL (<a target="_blank" href="https://x.com/wandb/status/1975917052920678528">X</a>, <a target="_blank" href="https://openpipe.ai/blog/serverless-rl">Blog</a>, <a target="_blank" href="http://wandb.me/RLS">Notebook</a>)</p><p>* 
<strong>Vision & Video</strong></p><p>* Ovi: Open Source Video & Synchronized Audio Generation (<a target="_blank" href="https://x.com/linoy_tsaban/status/1975924336935743737">X</a>, <a target="_blank" href="https://huggingface.co/blog/MonsterMMORPG/ovi-generate-videos-with-audio-like-veo-3-or-sora">HF</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* GPT-realtime-mini: OpenAI’s ultra-fast speech-to-speech model API (<a target="_blank" href="https://platform.openai.com/docs/models/gpt-realtime-mini">OpenAI Blog</a>, <a target="_blank" href="https://techcrunch.com/2025/10/06/openai-ramps-up-developer-push-with-more-powerful-models-in-its-api/">TechCrunch</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Bagel.com: Paris – Decentralized Diffusion Model (<a target="_blank" href="https://x.com/bageldotcom/status/1975596255624769858">X</a>, <a target="_blank" href="https://huggingface.co/bageldotcom/paris">HF</a>, <a target="_blank" href="https://blog.bagel.com/p/paris">Blogpost</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/oct-9-2025-dev-days-agent-era-samsungs</link><guid isPermaLink="false">substack:post:175763277</guid><dc:creator><![CDATA[Alex Volkov and Eric Provencher]]></dc:creator><pubDate>Fri, 10 Oct 2025 01:46:16 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/175763277/0905ca56ecf1a8ce098f08da58ad1c60.mp3" length="97420495" type="audio/mpeg"/><itunes:author>Alex Volkov and Eric Provencher</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6089</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/175763277/eee589bf6f7bcdfd1cf810c7882b72ce.jpg"/></item><item><title><![CDATA[Sora 2 Crushes TikTok, Claude 4.5 Fizzles, DeepSeek innovates attention and GLM 4.6 Takes the Crown! 🔥]]></title><description><![CDATA[<p>Hey everyone, Alex here (yes the real me if you’re reading this) </p><p>The weeks are getting crazier, but what OpenAI pulled this week, with a whole new social media app attached to their latest AI breakthroughs is definitely breathtaking! Sora2 released and instantly became a viral sensation, shooting to the top 3 free iOS spot on AppStore, with millions of videos watched, and remixed. </p><p>On weeks like these, even huge releases like Claude 4.5 are taking the backseat, but we still covered them! </p><p>For listeners of the pod, the second half of the show was very visual heavy, so it may be worth it watching the YT video attached in a comment if you want to fully experience the Sora revolution with us! (and if you want a SORA invite but don’t have one yet, more on that below) </p><p><p>ThursdAI - if you find this valuable, please support us by subscribing! 
</p></p><p>Sora 2 - the AI video model that signifies a new era of social media</p><p>Look, you’ve probably already heard about the SORA-2 release, but in case you haven’t, OpenAI released a whole new model and attached it to a new, AI-powered social media experiment in the form of a very addictive TikTok-style feed. Besides being hyper-realistic and producing sound and true-to-source voice-overs, Sora2 asks you to create your own “Cameo” by taking a quick video, and then allows you to be featured in your own (and your friends’) videos. </p><p>This makes a significant break from the “slop”-based Meta Vibes we covered previously, because, well, everyone loves seeing themselves as the stars of the show! </p><p>Cameos are a stroke of genius, and what’s more, one can allow everyone to use their Cameo, which is what Sam Altman did at launch, letting everyone Cameo him and turning him, almost instantly, into one of the most meme-able (and approachable) people on the planet! </p><p>Sam sharing his likeness like this for the sake of the app achieved a few things: it added trust in the safety features, made the app instantly viral, and showed folks they shouldn’t be afraid of adding their own likeness. </p><p>Vibes based feed and remixing</p><p>Sora 2 is also unique in that it’s the first social media app whose UGC (user-generated content) can ONLY be generated; all SORA content is created within the app. It’s not possible to upload pictures of people to create posts, and you can only create posts with other folks if you have access to their Cameos, or by Remixing existing creations. </p><p>Remixing is also a way to let users “participate” in the creation process by adding their own twist and vibes! 
</p><p>Speaking of Vibes, while the SORA app has an algorithmic For You page, they have a completely new way to interact with the algorithm: the Pick a Mood feature, where you can describe which type of content you want to see, or not see, in natural language! </p><p>I believe this feature will come to all social media platforms later, as it’s such a game changer. Want only content in a specific language? Or content that doesn’t have Sam Altman in it? Just ask! </p><p>Content that makes you feel good</p><p>The most interesting thing about the content is that there’s no sexualisation (all content is moderated by OpenAI’s strong filters), no gore, etc. OpenAI has clearly been thinking about teenagers and has added parental controls, like the ability to turn off the For You page completely. </p><p>Additionally, SORA seems to be a very funny model, and I mean this literally. You can ask the video generation for a joke and you’ll often get a funny one. The scene setup, the dialogue, the things it does even unprompted are genuinely entertaining. </p><p>AI + Product = Profit? </p><p>OpenAI is showing that it’s one of the world’s best product labs, not just a foundational AI lab. Most AI advancements are tied to products, and in this case, the whole experience is so polished, it’s hard to accept that it’s a brand new app from a company that didn’t do social before. There’s very little buggy behavior, videos load quickly, and there are even DMs! I’m thoroughly impressed and am immersing myself in the SORA sphere. Please give me a follow there and feel free to use my Cameo by tagging <a target="_blank" href="https://sora.chatgpt.com/profile/altryne">@altryne</a> in there. I love seeing how folks have used my Cameo, it makes me laugh 😂 </p><p>The copyright question is... wild</p><p>Remember last year when I asked Sam why Advanced Voice Mode couldn’t sing Happy Birthday? 
He said they didn’t have classifiers to detect IP violations. Well, apparently that’s not a concern anymore, because SORA 2 will happily generate perfect South Park episodes, Rick and Morty scenes, and Pokemon battles. They’re not even pretending they didn’t train on this stuff. You can even generate videos with any dead famous person (I’ve had Zoom meetings with Michael Jackson, 2Pac, JFK and Mister Rogers). </p><p>Our friend Ryan Carson already used it to create a YouTube short ad for his startup in two minutes. What would have cost $100K and three months now takes six generations and you’re done. This is the real game-changer for businesses.</p><p>Getting invited</p><p>EDIT: If you’re reading this on Friday, try the code `FRIYAY` and let me know in comments if it worked for you 🙏</p><p>I wish I had invites for all of you, but each invited user can invite 4 other folks, so we shared a bunch of invites during the live show and asked folks to come back and invite other listeners. This went on for half an hour, so I bet we’ve got quite a few of you in! If you’re still looking for an invite, you can visit the <a target="_blank" href="https://thursdai.news/sora">thread on X</a> and see who claimed an invite and ask them for one; tell them you’re also a ThursdAI listener and they’ll hopefully return the favor! </p><p>Alternatively, OpenAI employees often post codes with a huge invite ratio, so follow <a target="_blank" href="https://x.com/GabrielPeterss4">@GabrielPeterss4</a>, who often posts codes, and you can get in there fairly quickly. If you’re not in the US, I heard a VPN works well. Just don’t forget to follow me on there as well 😉</p><p>A Week with OpenAI Pulse: The Real Agentic Future is Here</p><p>Listen to me, this may be a hot take. I think OpenAI Pulse is a bigger news story than Sora. 
I told you about Pulse last week, but today on the show I was able to share my week’s worth of experience, and honestly, it’s now the first thing I look at when I wake up in the morning after brushing my teeth! </p><p>While Sora is changing media, Pulse is changing how we interact with AI on a fundamental level. Released to Pro subscribers for now, Pulse is an agentic, personalized feed that works for you behind the scenes. Every morning, it delivers a briefing based on your interests, your past conversations, your calendar—everything. It’s the first asynchronous AI agent I’ve used that feels truly proactive.</p><p>You don’t have to trigger it. It just works. It knew I had a flight to Atlanta and gave me tips. I told it I was interested in Halloween ideas for my kids, and now it’s feeding me suggestions. Most impressively, this week it surfaced a new open-source video model, Kandinsky 5.0, that I hadn’t seen anywhere on X or my usual news feeds. An agent found something new and relevant for my show, without me even asking.</p><p>This is it. This is the life-changing level of helpfulness we’ve all been waiting for from AI. Personalized, proactive agents are the future, and Pulse is the first taste of it that feels real. I cannot wait for my next Pulse every morning.</p><p><strong>This Week’s Buzz: The AI Build-Out is NOT a Bubble</strong></p><p>This show is powered by Weights & Biases from CoreWeave, and this week that’s more relevant than ever. I just got back from a company-wide offsite where we got a glimpse into the future of AI infrastructure, and folks, the scale is mind-boggling.</p><p>CoreWeave, our parent company, is one of the key players providing the GPU infrastructure that powers companies like OpenAI and Meta. And the commitments being made are astronomical. 
In the past few months, CoreWeave has locked in a <strong>$22.4B deal with OpenAI</strong>, a <strong>$14.2B pact with Meta</strong>, and a <strong>$6.3B “backstop” guarantee with NVIDIA</strong> that runs through 2032.</p><p>If you hear anyone talking about an “AI bubble,” show them these numbers. These are multi-year, multi-billion-dollar commitments to build the foundational compute layer for the next decade of AI. The demand is real, and it’s accelerating. And the best part? As a Weights & Biases user, you have access to this same best-in-class infrastructure that runs OpenAI through our inference services. Try <a target="_blank" href="http://wandb.me/inference">wandb.me/inference</a>, and let me know if you need a bit of a credit boost! </p><p>Claude Sonnet 4.5: The New Coding King Has a Few Quirks</p><p>On any other week, Anthropic’s release of <strong>Claude Sonnet 4.5 </strong>would’ve been the headline news. They’re positioning it as the new best model for coding and complex agents, and the benchmarks are seriously impressive. It matches or beats their previous top-tier model, Opus 4.1, on many difficult evals, all while keeping the same affordable price as the previous Sonnet.</p><p>One of the most significant jumps is on the OS World benchmark, which tests an agent’s ability to use a computer—opening files, manipulating windows, and interacting with applications. Sonnet 4.5 scored a whopping 61.4%, a massive leap from Opus 4.1’s 44%. This clearly signals that Anthropic is doubling down on building agents that can act as real digital assistants.</p><p>However, the real-world experience has been a bit of a mixed bag. My co-host Ryan Carson, whose company Amp switched over to 4.5 right away, noted some regressions and strange errors, saying they’re even considering switching back to the previous version until the rough edges are smoothed out. Nisten also found it could be more susceptible to “slop catalysts” in prompting. 
It seems that while it’s incredibly powerful, it might require some re-prompting and adjustments to get the best, most stable results. The jury’s still out, but it’s a potent new tool in the developer’s arsenal.</p><p>Open Source LLMs: DeepSeek’s Attention Revolution</p><p>Despite the massive news from the big companies, open source still brought the heat this week, with one release in particular representing a fundamental breakthrough.</p><p>DeepSeek released <strong>V3.2 Experimental</strong>, and the big news is DSA, or DeepSeek Sparse Attention. For those who don’t know, one of the biggest bottlenecks in LLMs is the “quadratic attention problem”—as you double the context length, the computation and memory required quadruple. This makes very long contexts incredibly expensive. DeepSeek’s new architecture makes the cost curve nearly flat, allowing for massive context at a fraction of the cost, all while maintaining the same SOTA performance as their previous model.</p><p>This is one of those “unhobbling moments,” like the invention of RoPE or GRPO, that moves the entire field forward. Everyone will be able to implement this, making all open-source models faster and more efficient. It’s a huge deal.</p><p>We also saw major releases from <a target="_blank" href="https://z.ai">Z.ai</a> with <strong>GLM-4.6</strong>, an advanced agentic model with a 200K context window that’s getting incredibly close to Claude’s performance, and a surprise from <strong>ServiceNow SLAM Labs</strong>, who dropped <strong>Apriel-1.5-15B</strong>, a frontier-level multimodal model that’s fully open source. 
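</p><p>Back to the sparse-attention point for a second, because a toy cost model makes the “quadratic attention problem” obvious. The fixed `window` here is an illustrative simplification (DSA’s actual token selection is learned, not a fixed window):</p>

```python
from typing import Optional

def attention_cost(seq_len: int, window: Optional[int] = None) -> int:
    """Toy token-pair count for attention. Dense attention scores every
    pair of tokens (quadratic in seq_len); a sparse scheme that lets each
    token attend to at most `window` others grows only linearly."""
    if window is None:
        return seq_len * seq_len               # dense: n^2 pairs
    return seq_len * min(seq_len, window)      # sparse: n * k pairs

# Doubling context quadruples dense cost but only doubles sparse cost.
dense_32k, dense_64k = attention_cost(32_768), attention_cost(65_536)
sparse_32k, sparse_64k = attention_cost(32_768, 2048), attention_cost(65_536, 2048)
```

<p>That linear-versus-quadratic gap is exactly why the cost curve goes “nearly flat” at long context.</p><p>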
It’s amazing to see a huge enterprise company contributing to the open-source ecosystem at this level.</p><p>Multimodal Madness: Audio, Video, and Image Model Updates</p><p>The torrent of releases continued across all modalities this week; it was a bit overshadowed by SORA, but it definitely still happened (all links are in the TL;DR section).</p><p>In voice and audio, our friends at <strong>Hume AI launched Octave 2</strong>, their next-gen text-to-speech model that’s faster, cheaper, and now fluent in over 11 languages. We also saw <strong>LFM2-Audio from Liquid AI</strong>, an incredibly efficient 1.5B parameter end-to-end audio model with sub-100ms latency.</p><p>In video, the open-source community answered Sora 2 with <strong>Kandinsky 5.0</strong>, a new 2B parameter text-to-video model that is claiming the #1 spot in open source and looks incredibly promising. And as I mentioned on the show, I wouldn’t have even known about it if it weren’t for my new personal AI agent, Pulse!</p><p>Finally, in AI art, Tencent dropped a monster: <strong>HunyuanImage 3.0</strong>, a massive 80-billion-parameter open-source text-to-image model. The scale of these open-source releases is just breathtaking.</p><p>Agentic browsing for all is here</p><p>Just as I was wrapping up the show, Perplexity decided to let everyone in to use their Comet agentic browser. I strongly recommend it; I switched to it recently and it’s great! </p><p>I’m using it right now to run some agents; it can click things, scroll, and collect info across tabs. It’s really great. Give it a spin, really, it’s worth getting into the habit of agentic browsing! 
</p><p>Many of you were asking me for invites before; well, it’s free access now, <a href="https://comet.perplexity.ai/">download it here</a> (not sponsored, I just really like it). </p><p>Phew, ok, this was a WILD week, and I’m itching to get back to creating and seeing all the folks who used my Cameo on SORA, which you can see too btw if you hit the Cameo button here (<a target="_blank" href="https://sora.chatgpt.com/profile/altryne">https://sora.chatgpt.com/profile/altryne</a>) </p><p>Next week is OpenAI’s Dev Day, and for the third year in a row we’re going to cover it, so follow us on social media and tune in Monday 8:30am Pacific. We’ll be live streaming from the location and re-streaming the keynote with Sam, so don’t miss it! </p><p>TL;DR and Show Notes</p><p><strong>Hosts and Guests</strong>:</p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="https://x.com/ryancarson/status/1957809743679906246">@ryancarson</a></p><p><strong>Big CO LLMs + APIs</strong>:</p><p>* OpenAI releases SORA2 + a new social media app (<a target="_blank" href="https://x.com/altryne/status/1973568567489798144">X</a>, <a target="_blank" href="https://openai.com/index/sora-2/">Blog</a>, <a target="_blank" href="https://apps.apple.com/us/app/sora-by-openai/id6744034028">App download</a>)</p><p>* Anthropic releases Claude Sonnet 4.5 - same price as 4.1 - leading coding model (<a target="_blank" href="https://x.com/claudeai/status/1972706807345725773">X</a>)</p><p>* OpenAI launches Instant Checkout & Agentic Commerce Protocol (<a target="_blank" href="https://x.com">X</a>, <a target="_blank" 
href="https://agenticcommerce.dev">Protocol</a>)</p><p><strong>Open Source LLMs</strong>:</p><p>* DeepSeek V3.2 Exp: Sparse Attention, Cost Drop (<a target="_blank" href="https://x.com/deepseek_ai/status/1972604768309871061">X</a>, <a target="_blank" href="https://twitter.com/ArtificialAnlys/status/1973230103854456993">Evals</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp">HF</a>)</p><p>* Apriel-1.5-15B-Thinker by ServiceNow SLAM Labs (<a target="_blank" href="https://twitter.com/ServiceNowRSRCH/status/1973100536280027586">X</a>, <a target="_blank" href="https://huggingface.co/ServiceNow-AI/Apriel-1.5-15b-Thinker">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2508.10948">Arxiv</a>)</p><p>* <a target="_blank" href="https://z.ai">Z.ai</a> GLM-4.6: advanced Agentic flagship model (<a target="_blank" href="https://x.com/Zai_org/status/1973034639708344767">X</a>, <a target="_blank" href="https://z.ai/blog/glm-4.6">Blog</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.6">HF</a>)</p><p><strong>This week’s Buzz</strong>:</p><p>* CoreWeave locks <strong>$22.4B OpenAI</strong>, a <strong>$6.3B NVIDIA “backstop”</strong>, and a <strong>$14.2B Meta</strong> compute pact (<a target="_blank" href="https://x.com/CoreWeave/status/1971218329713938942">X</a>)</p><p><strong>Voice & Audio</strong>:</p><p>* Hume AI launches Octave 2 (<a target="_blank" href="https://twitter.com/hume_ai/status/1973450822840152455">X</a>, <a target="_blank" href="https://hume.ai/blog/octave2">Blog</a>)</p><p>* LFM2-Audio: End-to-end audio foundation model (<a target="_blank" href="https://x.com/LiquidAI_/status/1973372092230836405">X</a>, <a target="_blank" href="https://www.liquid.ai/blog/lfm2-audio-an-end-to-end-audio-foundation-model">Blog</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2-Audio-1.5B">HF</a>)</p><p><strong>Vision & Video</strong>:</p><p>* Kandinsky 5.0 T2V Lite: #1 open-source text-to-video (<a target="_blank" 
href="https://ai-forever.github.io/Kandinsky-5/">Blog</a>, <a target="_blank" href="https://github.com/ai-forever/Kandinsky-5">GitHub</a>, <a target="_blank" href="https://huggingface.co/collections/ai-forever/kandinsky-50-t2v-lite-68d71892d2cc9b02177e5ae5">HF</a>, <a target="_blank" href="https://t.me/kandinsky_access_bot">Try It</a>)</p><p><strong>AI Art & Diffusion & 3D</strong>:</p><p>* HunyuanImage 3.0: 80B Open-Source Text-to-Image by Tencent (<a target="_blank" href="https://twitter.com/TencentHunyuan/status/1972130405160833334">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanImage-3.0">HF</a>, <a target="_blank" href="https://github.com/Tencent-Hunyuan">Github</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-2-sora-2-the-new-tiktok</link><guid isPermaLink="false">substack:post:175152386</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 03 Oct 2025 01:27:55 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/175152386/f3fa1a8b6bc2506cb187aa4b1247fa99.mp3" length="95981671" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5999</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/175152386/e55aef3f772f47a0f419a5e58a49ab28.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Qwen‑mas Strikes Again: VL/Omni Blitz + Grok‑4 Fast + Nvidia’s $100B Bet]]></title><description><![CDATA[This is a free preview of a paid episode. 
To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Hola AI aficionados, it’s yet another ThursdAI, and yet another week FULL of AI news, spanning Open Source LLMs, Multimodal video and audio creation and more! </p><p>Shiptember as they call it does seem to deliver, and it was hard even for me to follow up on all the news, not to mention we had like 3-4 breaking news during the show today! </p><p>This week was yet another Qwen-mas, with Alibaba absolutely dominating across open source, but also NVIDIA promising to invest up to $100 Billion into OpenAI. </p><p>So let’s dive right in! As a reminder, all the show notes are posted at the end of the article for your convenience. </p><p><p>ThursdAI - Because weeks are getting denser, but we’re still here, weekly, sending you the top AI content! Don’t miss out</p></p><p><strong>Table of Contents</strong></p><p>* <a target="_blank" href="#%C2%A7open-source-ai">Open Source AI</a></p><p>* <a target="_blank" href="#%C2%A7qwen-vl-announcement-qwen-vl-b-ab-thinking-x-hf-blog-demo">Qwen3-VL Announcement (Qwen3-VL-235B-A22B-Thinking):</a></p><p>* <a target="_blank" href="#%C2%A7qwen-omni-b-ab-end-to-end-sota-omni-modal-ai-unifying-text-image-audio-and-video-hf-github-qwen-chat-demo-api">Qwen3-Omni-30B-A3B: end-to-end SOTA omni-modal AI unifying text, image, audio, and video</a></p><p>* <a target="_blank" href="#%C2%A7deepseek-v-terminus-a-surgical-bugfix-that-matters-for-agents-x-hf">DeepSeek V3.1 Terminus: a surgical bugfix that matters for agents</a></p><p>* <a target="_blank" href="#%C2%A7evals-and-benchmarks-agents-deception-and-code-at-scale">Evals & Benchmarks: agents, deception, and code at scale</a></p><p>* <a target="_blank" href="#%C2%A7big-companies-bigger-bets">Big Companies, Bigger Bets!</a></p><p>* <a target="_blank" href="#%C2%A7openai-chatgpt-pulse-proactive-ai-news-cards-for-your-day-x-openai-blog">OpenAI: ChatGPT Pulse: Proactive AI 
news cards for your day</a></p><p>* <a target="_blank" href="#%C2%A7xai-grok-fast-m-context-fewer-thinking-tokens-shockingly-cheap-x-blog">XAI Grok 4 fast - 2M context, 40% fewer thinking tokens, shockingly cheap</a></p><p>* <a target="_blank" href="#%C2%A7alibaba-qwen-max-and-plans-for-scaling-x-blog-api">Alibaba Qwen-Max and plans for scaling</a></p><p>* <a target="_blank" href="#%C2%A7this-weeks-buzz-w-and-b-fully-connected-is-coming-to-london-and-tokyo-and-another-hackathon-in-sf">This Week’s Buzz: W&B Fully Connected is coming to London and Tokyo & Another hackathon in SF</a></p><p>* <a target="_blank" href="#%C2%A7vision-and-video-wan-animate-kling-and-wan-preview">Vision & Video: Wan 2.2 Animate, Kling 2.5, and Wan 4.5 preview</a></p><p>* <a target="_blank" href="#%C2%A7moondream-preview-interview-with-co-founders-via-and-jay">Moondream-3 Preview - Interview with co-founders Via & Jay</a></p><p>* <a target="_blank" href="#%C2%A7wan-open-sourced-wan-animate-aka-wan-animate-motion-transfer-and-lip-sync">Wan open sourced Wan 2.2 Animate (aka “Wan Animate”): motion transfer and lip sync</a></p><p>* <a target="_blank" href="#%C2%A7kling-turbo-cinematic-motion-cheaper-and-with-audio">Kling 2.5 Turbo: cinematic motion, cheaper and with audio</a></p><p>* <a target="_blank" href="#%C2%A7wan-preview-native-multimodality-p-s-and-lip-synced-speech">Wan 4.5 preview: native multimodality, 1080p 10s, and lip-synced speech</a></p><p>* <a target="_blank" href="#%C2%A7voice-and-audio">Voice & Audio</a></p><p>* <a target="_blank" href="#%C2%A7thursdai-sep-tldr-and-show-notes">ThursdAI - Sep 25, 2025 - TL;DR & Show notes</a></p><p>Open Source AI</p><p>This was a Qwen-and-friends week. I joked on stream that I should just count how many times “Alibaba” appears in our show notes. 
It’s a lot.</p><p>Qwen3-VL Announcement (Qwen3-VL-235B-A22B-Thinking): (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1970594923503391182">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe">HF</a>, <a target="_blank" href="https://qwen.ai/blog?id=99f0335c4ad9ff6153e517418d48535ab6d8afef&#38;from=research.latest-advancements-list">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen3-VL-Demo">Demo</a>)</p><p>Qwen 3 launched earlier as a text-only family; the vision-enabled variant just arrived, and it’s not timid. The “thinking” version is effectively a reasoner with eyes, built on a 235B-parameter backbone with around 22B active (their mixture-of-experts trick). What jumped out is the breadth of evaluation coverage: MMMU, video understanding (Video-MME, LVBench), 2D/3D grounding, doc VQA, chart/table reasoning—pages of it. They’re showing wins against models like Gemini 2.5 Pro and GPT‑5 on some of those reports, and doc VQA is flirting with “nearly solved” territory in their numbers.</p><p>Two caveats. First, whenever scores get that high on imperfect benchmarks, you should expect healthy skepticism; known label issues can inflate numbers. Second, the model is big. 
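</p><p>How big is “big”? A rough back-of-envelope on weight memory makes it concrete (my own sketch: it counts only the parameters in the model name and ignores KV cache, activations, and serving overhead):</p>

```python
# Rough weight-memory estimate for a 235B-total / 22B-active MoE like Qwen3-VL.
# Assumption: every parameter is stored at the same precision; real servers
# add KV cache, activation memory, and framework overhead on top of this.
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """GB needed just to hold the weights at a given precision."""
    return params_billion * 1e9 * bytes_per_param / 1e9

TOTAL_B, ACTIVE_B = 235, 22  # all experts vs. parameters touched per token

for label, bpp in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{label}: ~{weight_gb(TOTAL_B, bpp):.0f} GB stored, "
          f"~{weight_gb(ACTIVE_B, bpp):.0f} GB active per token")
# bf16: ~470 GB stored, ~44 GB active per token
```

<p>Even at 4-bit that’s roughly 118 GB of weights alone, which is why this lives on servers. </p><p>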
Incredible for server-side grounding and long-form reasoning with vision (they’re talking about scaling context to 1M tokens for two-hour video and long PDFs), but not something you throw on a phone.</p><p>Still, if your workload smells like “reasoning + grounding + long context,” Qwen 3 VL looks like one of the strongest open-weight choices right now.</p><p>Qwen3-Omni-30B-A3B: end-to-end SOTA omni-modal AI unifying text, image, audio, and video (<a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-omni-68d100a86cd0906843ceccbe">HF</a>, <a target="_blank" href="https://github.com/QwenLM/Qwen3-Omni">GitHub</a>, <a target="_blank" href="https://chat.qwen.ai/?models=qwen3-omni-flash">Qwen Chat</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen3-Omni-Demo">Demo</a>, <a target="_blank" href="https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/">API</a>)</p><p>Omni is their end-to-end multimodal chat model that unites text, image, and audio—and crucially, it streams audio responses in real time while thinking separately in the background. Architecturally, it’s a 30B MoE with around 3B active parameters at inference, which is the secret to why it feels snappy on consumer GPUs.</p><p>In practice, that means you can talk to Omni, have it see what you see, and get sub-250 ms replies in nine speaker languages while it quietly plans. It claims to understand 119 languages. When I pushed it in multilingual conversational settings it still code-switched unexpectedly (Chinese suddenly appeared mid-flow), and it occasionally suffered the classic “stuck in thought” behavior we’ve been seeing in agentic voice modes across labs. But the responsiveness is real, and the footprint is exciting for local speech streaming scenarios. 
I wouldn’t replace a top-tier text reasoner with this for hard problems, yet being able to keep speech native is a real UX upgrade.</p><p>Qwen Image Edit, Qwen TTS Flash, and Qwen‑Guard</p><p>Qwen’s image stack got a handy upgrade with multi-image reference editing for more consistent edits across shots—useful for brand assets and style-tight workflows. TTS Flash (API-only for now) is their fast speech synth line, and Qwen‑Guard is a new safety/moderation model from the same team. It’s notable because Qwen hasn’t really played in the moderation-model space before; historically Meta’s Llama Guard led that conversation.</p><p>DeepSeek V3.1 Terminus: a surgical bugfix that matters for agents (<a target="_blank" href="https://x.com/deepseek_ai/status/1968682364055920980">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus">HF</a>)</p><p>The DeepSeek whale resurfaced to push a small 0.1 update to V3.1 that reads like a “quality and stability” release—but those matter if you’re building on top. It fixes a code-switching bug (the “sudden Chinese” syndrome you’ll also see in some Qwen variants), improves tool-use and browser execution, and—importantly—makes agentic flows less likely to overthink and stall. On the numbers, <strong>Humanity’s Last Exam jumped from 15 to 21.7</strong>, while LiveCodeBench dipped slightly. That’s the story here: they traded a few raw points on coding for more stable, less dithery behavior in end-to-end tasks. If you’ve invested in their tool harness, this may be a net win.</p><p>Liquid Nanos: small models that extract like they’re big (<a target="_blank" href="https://x.com/LiquidAI_/status/1971198690707616157">X</a>, <a target="_blank" href="https://huggingface.co/collections/LiquidAI/liquid-nanos-68b98d898414dd94d4d5f99a">HF</a>)</p><p>Liquid Foundation Models released “Liquid Nanos,” a set of open models from roughly 350M to 2.6B parameters, including “extract” variants that pull structure (JSON/XML/YAML) from messy documents. 
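</p><p>If you drop an extract-style model into a pipeline, the part you own is validating what comes back before it touches your store. Here’s a minimal, model-agnostic sketch (the invoice schema and the fence-unwrapping are my illustrative assumptions, not Liquid’s API):</p>

```python
import json

# Hypothetical target schema -- swap for whatever your documents actually carry.
REQUIRED = {"invoice_id": str, "total": float, "currency": str}
FENCE = "`" * 3  # the triple-backtick marker, built programmatically

def parse_extraction(raw: str) -> dict:
    """Validate a model's JSON extraction: unwrap a markdown fence, parse, type-check."""
    text = raw.strip()
    if text.startswith(FENCE):  # models often wrap JSON in a code fence
        text = text.split(FENCE)[1].removeprefix("json").strip()
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    for field, ftype in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise TypeError(f"{field} should be {ftype.__name__}")
    return data

# A plausible (made-up) model reply:
reply = FENCE + 'json\n{"invoice_id": "INV-19", "total": 1249.5, "currency": "EUR"}\n' + FENCE
print(parse_extraction(reply)["total"])  # 1249.5
```

<p>The point is to fail loudly on malformed extractions instead of quietly ingesting them. </p><p>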
The pitch is cost-efficiency with surprisingly competitive performance on information extraction tasks versus models 10× their size. If you’re doing at-scale doc ingestion on CPUs or small GPUs, these look worth a try.</p><p>Tiny IBM OCR model that blew up the charts (<a target="_blank" href="https://huggingface.co/ibm-granite/granite-docling-258M">HF</a>)</p><p>We also saw a tiny IBM model (about 250M parameters) for image-to-text document parsing trending on Hugging Face. Run in 8-bit, it squeezes into roughly 250 MB, which means Raspberry Pi and “toaster” deployments suddenly get decent OCR/transcription against scanned docs. It’s the kind of tiny-but-useful release that tends to quietly power entire products.</p><p>Meta’s 32B Code World Model (CWM) released for agentic code reasoning (<a target="_blank" href="https://x.com/syhw/status/1838682364055920980">X</a>, <a target="_blank" href="https://huggingface.co/facebook/cwm">HF</a>)</p><p>Nisten got really excited about this one, and once he explained it, I understood why. Meta released a 32B code world model that doesn’t just generate code - it understands code the way a compiler does. It’s thinking about state, types, and the actual execution context of your entire codebase.</p><p>This isn’t just another coding model - it’s a fundamentally different approach that could change how all future coding models are built. Instead of treating code as fancy text completion, it’s actually modeling the program from the ground up. If this works out, expect everyone to copy this approach.</p><p>Quick note: this one was released with a research license only! </p><p>Evals & Benchmarks: agents, deception, and code at scale</p><p>A big theme this week was “move beyond single-turn Q&A and test how these things behave in the wild,” with a bunch of new evals released. I wanted to cover them all in a separate segment. 
</p><p>OpenAI’s GDP Eval: “economically valuable tasks” as a bar (<a target="_blank" href="https://x.com/OpenAI/status/1971249374077518226">X</a>, <a target="_blank" href="https://openai.com/index/gdpval/">Blog</a>)</p><p>OpenAI introduced GDP Eval to measure model performance against real-world, economically valuable work. The design is closer to how I think about “AGI as useful work”: 44 occupations across nine sectors, with tasks judged against what an industry professional would produce.</p><p>Two details stood out. First, OpenAI’s own models didn’t top the chart in their published screenshot—Anthropic’s Claude Opus 4.1 led with roughly a 47.6% win rate against human professionals, while GPT‑5-high clocked in around 38%. Releasing a benchmark where you’re not on top earns respect. Second, the tasks are legit. One example was a manufacturing engineer flow where the output required an overall design with an exploded view of components—the kind of deliverable a human would actually make.</p><p>What I like here isn’t the precise percent; it’s the direction. If we anchor progress to tasks an economy cares about, we move past “trivia with citations” and toward “did this thing actually help do the work?”</p><p>GAIA 2 (Meta Super Intelligence Labs + Hugging Face): agents that execute (<a target="_blank" href="https://x.com/ClementDelangue/status/1970885829552705976">X</a>, <a target="_blank" href="https://huggingface.co/blog/gaia2">HF</a>)</p><p>MSL and HF refreshed GAIA, the agent benchmark, with a thousand new human-authored scenarios that test execution, search, ambiguity handling, temporal reasoning, and adaptability—plus a smartphone-like execution environment. GPT‑5-high led across execution and search; Kimi’s K2 was tops among open-weight entries. I like that GAIA 2 bakes in time and budget constraints and forces agents to chain steps, not just spew plans. 
We need more of these.</p><p><strong>Scale AI’s “SWE-Bench Pro” for coding in the large (</strong><a target="_blank" href="https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro"><strong>HF</strong></a><strong>)</strong></p><p>Scale dropped a stronger coding benchmark focused on multi-file edits, 100+ line changes, and large dependency graphs. On the <a target="_blank" href="https://scale.com/leaderboard/swe_bench_pro_public">public</a> set, GPT‑5 (not Codex) and Claude Opus 4.1 took the top two slots; on a <a target="_blank" href="https://scale.com/leaderboard/swe_bench_pro_commercial">commercial</a> set, Opus edged ahead. The broader takeaway: the action has clearly moved to test-time compute, persistent memory, and program-synthesis outer loops to get through larger codebases with fewer invalid edits. This aligns with what we’re seeing across ARC‑AGI and SWE‑bench Verified.</p><p>The “Among Us” deception test (<a target="_blank" href="https://x.com/shreyk0/status/1970160146975445192">X</a>)</p><p>One more that’s fun but not frivolous: a group benchmarked models on the social deception game Among Us. OpenAI’s latest systems reportedly did the best job both lying convincingly and detecting others’ lies. This line of work matters because social inference and adversarial reasoning show up in real agent deployments—security, procurement, negotiations, even internal assistant safety.</p><p>Big Companies, Bigger Bets!</p><p><strong>Nvidia’s $100B pledge to OpenAI for 10GW of compute</strong></p><p>Let’s say that number again: one hundred billion dollars. Nvidia announced plans to invest up to $100B into OpenAI’s infrastructure build-out, targeting roughly 10 gigawatts of compute and power. Jensen called it the biggest infrastructure project in history. 
Pair that with OpenAI’s Stargate-related announcements—five new datacenters with Oracle and SoftBank and a flagship site in Abilene, Texas—and you get to wild territory fast.</p><p>Internal notes circulating say OpenAI started the year around 230MW and could exit 2025 north of 2GW operational, while aiming at 20GW in the near term and a staggering 250GW by 2033. Even if those numbers shift, the directional picture is clear: the GPU supply and power curves are going vertical.</p><p>Two reactions. First, yes, the “infinite money loop” memes wrote themselves—OpenAI spends on Nvidia GPUs, Nvidia invests in OpenAI, the market adds another $100B to Nvidia’s cap for good measure. But second, the underlying demand is real. If we need 1–8 GPUs per “full-time agent” and there are 3+ billion working adults, we are orders of magnitude away from compute saturation. The power story is the real constraint—and that’s now being tackled in parallel.</p><p>OpenAI: ChatGPT Pulse: Proactive AI news cards for your day (<a target="_blank" href="https://x.com/OpenAI">X</a>, <a target="_blank" href="https://openai.com/index/introducing-chatgpt-pulse/">OpenAI Blog</a>)</p><p>In a #BreakingNews segment, OpenAI announced ChatGPT Pulse, which currently works only for Pro users but will come to everyone soon: proactive AI that learns from your chats, email, and calendar, and shows you a new “feed” of interesting things every morning based on your likes and feedback! </p><p>Pulse marks OpenAI’s first step toward an AI assistant that brings the right info before you ask, tuning itself with every thumbs-up, topic request, or app connection. I’ve tuned mine for today, we’ll see what tomorrow brings! </p><p>P.S. - Huxe is a free app from the creators of NotebookLM (Raiza was on our podcast!) that does a similar thing, so if you don’t have Pro, check out Huxe, they <a target="_blank" href="https://x.com/gethuxe/status/1970503800885854431">just</a> launched! 
</p><p>XAI Grok 4 fast - 2M context, 40% fewer thinking tokens, shockingly cheap (<a target="_blank" href="https://x.com/xai/status/1969183326389858448">X</a>, <a target="_blank" href="https://x.ai/news/grok-4-fast">Blog</a>)</p><p>xAI launched Grok‑4 Fast, and the name fits. Think “top-left” on the speed-to-cost chart: up to 2 million tokens of context, a reported 40% reduction in reasoning token usage, and a price tag that’s roughly 1% of some frontier models on common workloads. On LiveCodeBench, Grok‑4 Fast even beat Grok‑4 itself. It’s not the most capable brain on earth, but as a high-throughput assistant that can fan out web searches and stitch answers in something close to real time, it’s compelling.</p><p>Alibaba Qwen-Max and plans for scaling (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1970599097297183035">X</a>, <a target="_blank" href="https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&#38;from=research.latest-advancements-list">Blog</a>, <a target="_blank" href="https://www.alibabacloud.com/help/en/model-studio/models#c2d5833ae4jmo">API</a>)</p><p>Back in the Alibaba camp, they also released their flagship API model, Qwen 3 Max, and showed off their future roadmap. </p><p>Qwen3-Max is an over-1T-parameter MoE that gets 69.6 on SWE-bench Verified and outperforms GPT-5 on LMArena! </p><p>And their plan is simple: scale. They’re planning to go from 1 million to <strong>100 million token context windows</strong> and scale their models into the terabytes of parameters. It culminated in a hilarious moment on the show where we all put on sunglasses to salute a slide from their <a target="_blank" href="https://x.com/tphuang/status/1970886344499990672">presentation</a> that literally said, “Scaling is all you need.” AGI is coming, and it looks like Alibaba is one of the labs determined to scale their way there. Their release schedule lately (as documented by Swyx from Latent.space) is insane. 
</p><p>This Week’s Buzz: W&B Fully Connected is coming to London and Tokyo & Another hackathon in SF</p><p>Weights & Biases (now part of the CoreWeave family) is bringing Fully Connected to London on Nov 4–5, with another event in Tokyo on Oct 31. If you’re in Europe or Japan and want two days of dense talks and hands-on conversations with teams actually shipping agents, evals, and production ML, come hang out. Readers got a code on stream; if you need help getting a seat, ping me directly.</p><p>Links: <a target="_blank" href="http://fullyconnected.com"><strong>fullyconnected.com</strong></a></p><p>We are also opening up registrations for our second WeaveHacks hackathon in SF, October 11-12. Yours truly will be there, come hack with us on self-improving agents! Register <a target="_blank" href="http://lu.ma/weavehacks2">HERE</a></p><p>Vision & Video: Wan 2.2 Animate, Kling 2.5, and Wan 4.5 preview</p><p>This is the most exciting space in AI week-to-week for me right now. The progress is visible. Literally.</p><p>Moondream-3 Preview - Interview with co-founders Vik & Jay</p><p>While I already reported on Moondream-3 in last week’s newsletter, this week we got the pleasure of hosting Vik Korrapati and Jay Allen, the co-founders of Moondream, to tell us all about it. Tune in for that conversation on the pod starting at 00:33:00.</p><p>Wan open sourced Wan 2.2 Animate (aka “Wan Animate”): motion transfer and lip sync </p><p>Tongyi’s Wan team shipped an open-source release that the community quickly dubbed “Wanimate.” It’s a character-swap/motion transfer system: provide a single image for a character and a reference video (your own motion), and it maps your movement onto the character with surprisingly strong hair/cloth dynamics and lip sync. 
If you’ve used Runway’s Act One, you’ll recognize the vibe—except this is open, and the fidelity is rising fast.</p><p>The practical uses are broader than “make me a deepfake.” Think onboarding presenters with perfect backgrounds, branded avatars that reliably say what you need, or precise action blocking without guessing at how an AI will move your subject. You act it; it follows.</p><p>Kling 2.5 Turbo: cinematic motion, cheaper and with audio</p><p>Kling quietly rolled out a 2.5 Turbo tier that’s 30% cheaper and finally brings audio into the loop for more complete clips. Prompts adhere better, physics look more coherent (acrobatics stop breaking bones across frames), and the cinematic look has moved from “YouTube short” to “film-school final.” They seeded access to creators and re-shared the strongest results; the consistency is the headline. (Source X: <a target="_blank" href="https://x.com/StevieMac03/status/1970559778804908331">@StevieMac03</a>)</p><p>I chatted with my kiddos today over FaceTime while they were building Minecraft creepers. I took a screenshot, sent it to Nano Banana to turn their creepers into actual Minecraft ones, and then animated the explosions with Kling. They LOVED it! The animations were clean, and while Veo refused to even let me upload their images, Kling didn’t care haha</p><p>Wan 4.5 preview: native multimodality, 1080p 10s, and lip-synced speech</p><p>Wan also teased a 4.5 preview that unifies understanding and generation across text, image, video, and audio. The eye-catching bit: generate a 1080p, 10-second clip with synced speech from just a script. Or supply your own audio and have it lip-sync the shot. I ran my usual “interview a polar bear dressed like me” test and got one of the better results I’ve seen from any model. We’re not at “dialogue scene” quality, but “talking character shot” is getting… good. 
</p><p>The audio generation (not just text + lip sync) is one of the best besides Veo; it’s really great to see how quickly this is improving, though it’s sad that this one wasn’t open sourced! Apparently it also supports “draw text to animate” (Source: <a target="_blank" href="https://x.com/I_Muhammadali44/status/1971085386396147741">X</a>) </p><p>Voice & Audio</p><p><strong>Suno V5: we’ve entered the “I can’t tell anymore” era</strong></p><p>Suno calls V5 a redefinition of audio quality. I’ll be honest, I’m at the edge of my subjective hearing on this. I’ve caught myself listening to Suno streams instead of Spotify and forgetting anything is synthetic. The vocals feel more human, the mixes cleaner, and the remastering path (including upgrading V4 tracks) is useful. The last 10% to “you fooled a producer” is going to be long, but the distance between V4 and V5 already makes me feel like I should re-cut our ThursdAI opener.</p><p><strong>MiMI Audio: a small omni-chat demo that hints at the floor</strong></p><p>We tried a MiMI Audio demo live—a 7B-ish model with speech in/out. It was responsive but stumbled on singing and natural prosody. I’m leaving it in here because it’s a good reminder that the open floor for “real-time voice” is rising quickly even for small models. And the moment you pipe a stronger text brain behind a capable, native speech front-end, the UX leap is immediate.</p><p>Ok, another DENSE week that finishes up Shiptember, tons of open source, Qwen (Tongyi) shines, and video is getting so so good. This is all converging, folks, and honestly, I’m just happy to be along for the ride! </p><p>This week was also <strong>Rosh Hashanah</strong>, which is the Jewish new year, and I shared on the pod that I found my X post from 3 years ago, made with the state-of-the-art AI models of the time. WHAT A DIFFERENCE 3 years make, just take a look, I had to scale down the 4K one from this year just to fit into the pic! 
</p><p>Shana Tova to everyone who’s reading this, and we’ll see you next week 🫡</p><p>ThursdAI - Sep 25, 2025 - TL;DR & Show notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co-hosts - <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a> <a target="_blank" href="https://x.com/nisten">@nisten</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="https://x.com/ryancarson">@ryancarson</a></p><p>* Guest - Vik Korrapati (<a target="_blank" href="https://x.com/vikhyatk">@vikhyatk</a>) - Moondream</p><p>* <strong>Open Source AI (LLMs, VLMs, Papers & more)</strong></p><p>* DeepSeek V3.1 Terminus: cleaner bilingual output, stronger agents, cheaper long-context (<a target="_blank" href="https://x.com/deepseek_ai/status/1968682364055920980">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus">HF</a>)</p><p>* Meta’s 32B Code World Model (CWM) released for agentic code reasoning (<a target="_blank" href="https://x.com/syhw/status/1838682364055920980">X</a>, <a target="_blank" href="https://huggingface.co/facebook/cwm">HF</a>)</p><p>* Alibaba Tongyi Qwen on a release streak again:</p>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-25-grok-fast-oainvidia</link><guid isPermaLink="false">substack:post:174583904</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 26 Sep 2025 02:32:56 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/174583904/3e65bc67396dafeafa15c02eca4d14a5.mp3" length="90351230" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5647</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/174583904/71897e4718877a47cc41a2964ec80ada.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Sep 18 - 
Gpt-5-Codex, OAI wins ICPC, Reve, ARC-AGI SOTA Interview, Meta AI Glasses & more AI news]]></title><description><![CDATA[<p>Hey folks, </p><p>What an absolutely packed week this week, which started with yet another crazy model release from OpenAI, but they didn't stop there, they also announced GPT-5 winning the ICPC coding competitions with 12/12 questions answered, which is apparently really really <a target="_blank" href="https://x.com/bminaiev/status/1968363052329484642">hard</a>! </p><p>Meanwhile, Zuck took the Meta Connect ’25 stage and announced a new set of Meta glasses with a display! On the open source front, we yet again got multiple tiny models doing DeepResearch and image understanding better than much larger foundational models.</p><p>Also, today I interviewed Jeremy Berman, who topped ARC-AGI with a 79.6% score and some crazy Grok 4 prompts; plus a new image editing experience called Reve, a new world model and a BUNCH more! So let's dive in! As always, all the releases, links and resources at the end of the article. </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Table of Contents</p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/codex-comes-full-circle-with-gpt-codex-agentic-finetune-x-openai-blog">Codex comes full circle with GPT-5-Codex agentic finetune</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/meta-connect-the-new-meta-glasses-with-display-and-a-neural-control-interface">Meta Connect 25 - The new Meta Glasses with Display & a neural control interface</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/jeremy-berman-beating-frontier-labs-to-sota-score-on-arc-agi">Jeremy Berman: Beating frontier labs to SOTA score on ARC-AGI</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/this-weeks-buzz-weave-inside-w-and-b-modelsrl-just-got-x-ray-vision">This Week’s Buzz: Weave inside W&B models—RL just got x-ray vision</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/open-source">Open Source</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/perceptron-isaac-b-model-that-points-better-than-gpt-x-hf-blog">Perceptron Isaac 0.1 - 2B model that points better than GPT</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/tongyi-deepresearch-ab-open-source-web-agent-claims-parity-with-openai-deep-research-x-hf">Tongyi DeepResearch: A3B open-source web agent claims parity with OpenAI Deep Research</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/reve-launches-a-in-ai-visual-platform-taking-on-nano-and-seedream-x-reve-blog">Reve launches a 4-in-1 AI visual platform taking on Nano 🍌 and Seedream</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/ray-lumas-reasoning-video-model-with-native-hdr-draft-mode-and-hifi-mastering-x-try-it">Ray3: Luma’s “reasoning” video model with native HDR, Draft Mode, and Hi‑Fi mastering</a></p><p>* 
<a target="_blank" href="https://sub.thursdai.news/i/173985701/world-models-are-getting-closer-worldlabs-announced-marble-demo">World models are getting closer - Worldlabs announced Marble</a></p><p>* <a target="_blank" href="https://sub.thursdai.news/i/173985701/google-puts-gemini-in-chrome-x-blog">Google puts Gemini in Chrome</a></p><p><strong>Codex comes full circle with GPT-5-Codex agentic finetune </strong>(<a target="_blank" href="https://x.com/OpenAI/status/1967636903165038708">X</a>, <a target="_blank" href="https://openai.com/index/introducing-upgrades-to-codex/">OpenAI Blog</a>)</p><p>My personal highlight of the week was definitely the release of GPT-5-Codex. I feel like we've come full circle here. I remember when OpenAI first launched a separate, fine-tuned model for coding called Codex, way back in the GPT-3 days. Now, they've done it again, taking their flagship GPT-5 model and creating a specialized version for agentic coding, and the results are just staggering.</p><p>This isn't just a minor improvement. During their internal testing, OpenAI saw GPT-5-Codex work independently for more than seven hours at a time on large, complex tasks—iterating on its code, fixing test failures, and ultimately delivering a successful implementation. Seven hours! That's an agent that can take on a significant chunk of work while you're sleeping. It's also incredibly efficient, using 93% fewer tokens than the base GPT-5 on simpler tasks, while thinking for longer on the really difficult problems.</p><p>The model is now integrated everywhere - the Codex CLI (just npm install -g @openai/codex), VS Code extension, web playground, and yes, even your iPhone. At OpenAI, Codex now reviews the vast majority of their PRs, catching hundreds of issues daily before humans even look at them. 
Talk about eating your own dog food!</p><p>Other OpenAI updates from this week</p><p>While Codex was the highlight, OpenAI (and Google) also competed in and obliterated one of the world's hardest algorithmic competitions, the ICPC. OpenAI used GPT-5 and an unreleased reasoning model to solve 12/12 questions in under 5 hours. </p><p>OpenAI and NBER also released an incredible report on how over 700M people use ChatGPT on a weekly basis, with a lot of insights that are summed up in this incredible graph:</p><p><strong>Meta Connect 25 - The new Meta Glasses with Display & a neural control interface</strong></p><p>Just when we thought the week couldn't get any crazier, Zuck took the stage for their annual Meta Connect conference and dropped a bombshell. They announced a new generation of their Ray-Ban smart glasses that include a built-in, high-resolution display you can't see from the outside. This isn't just an incremental update; this feels like the arrival of a new category of device. We've had the computer, then the mobile phone, and now we have smart glasses with a display.</p><p>The way you interact with them is just as futuristic. They come with a "neural band" worn on the wrist that reads myoelectric signals from your muscles, allowing you to control the interface silently just by moving your fingers. Zuck's <a target="_blank" href="https://x.com/altryne/status/1968468021988434054/video/1">live demo</a>, where he walked from his trailer onto the stage while taking messages and playing music, was one hell of a way to introduce a product.</p><p>This is how Meta plans to bring its superintelligence into the physical world. You'll wear these glasses, talk to the AI, and see the output directly in your field of view. They showed off live translation with subtitles appearing under the person you're talking to and an agentic AI that can perform research tasks and notify you when it's done. 
It's an absolutely mind-blowing vision for the future, and at $799, shipping in a week, it's going to be accessible to a lot of people. I've already signed up for a demo.</p><p><strong>Jeremy Berman: Beating frontier labs to SOTA score on ARC-AGI</strong></p><p>We had the privilege of chatting with <strong>Jeremy Berman</strong>, who just achieved SOTA on the notoriously difficult ARC-AGI benchmark using <em>checks notes</em>... Grok 4! 🚀</p><p>He walked us through his innovative approach, which ditches Python scripts in favor of flexible "natural language programs" and uses a program-synthesis outer loop with test-time adaptation. Incredibly, his method achieved these top scores at 1/25th the cost of previous systems.</p><p>This is huge because ARC-AGI tests for true general intelligence - solving problems the model has never seen before. The chat with Jeremy is very insightful, available on the podcast starting at 01:11:00, so don't miss it!</p><p><strong>This Week’s Buzz: Weave inside W&B models—RL just got x-ray vision</strong></p><p>You know how every RL project produces a mountain of rollouts that you end up spelunking through with grep? We just banished that misery. <strong>Weave</strong> tracing now lives natively inside every W&B Workspace run. Wrap your training-step and rollout functions in @weave.op, call weave.init(), and your traces appear alongside loss curves in real time. I can click a spike, jump straight to the exact conversation that tanked the reward, and diagnose hallucinations without leaving the dashboard. If you're doing any agentic RL, please go treat yourself. 
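</p><p>If op-level tracing is new to you, the mental model is just a decorator that records inputs, outputs, and latency for every call; per the Weave docs, the real API is weave.init() plus @weave.op, which additionally streams those records into your W&B workspace. Here's a dependency-free toy version of the pattern, just to show what gets captured:</p>

```python
import functools
import time

TRACES = []  # Weave streams these to your workspace; the toy keeps them in memory

def traced(fn):
    """Toy stand-in for an op decorator: record args, result, and latency per call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append({
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "ms": (time.perf_counter() - start) * 1000,
        })
        return result
    return wrapper

@traced  # with Weave this would be @weave.op
def rollout(prompt: str) -> dict:
    # stand-in for a real RL rollout returning a completion and a reward
    return {"completion": prompt.upper(), "reward": len(prompt) / 10}

rollout("fix the failing test")
print(TRACES[-1]["op"], TRACES[-1]["output"]["reward"])  # rollout 2.0
```

<p>"Click a spike, jump to the exact conversation" is then just filtering this structure by step and reward. </p><p>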
Docs: <a target="_blank" href="https://weave-docs.wandb.ai/guides/tools/weave-in-workspaces"><strong>https://weave-docs.wandb.ai/guides/tools/weave-in-workspaces</strong></a></p><p>Open Source</p><p>Open source did NOT disappoint this week as well, we've had multiple tiny models beating the giants at specific tasks! </p><p>Perceptron Isaac 0.1 - 2B model that points better than GPT ( <a target="_blank" href="https://x.com/perceptroninc/status/1968365052270150077">X</a>, <a target="_blank" href="https://huggingface.co/PerceptronAI/Isaac-0.1">HF</a>, <a target="_blank" href="https://www.perceptron.inc/blog/introducing-isaac-0-1">Blog</a> )</p><p>One of the most impressive demos of the week came from a new lab, Perceptron AI. They released Isaac 0.1, a tiny 2 billion parameter "perceptive-language" model. This model is designed for visual grounding and localization, meaning you can ask it to find things in an image and it will point them out. During the show, we gave it a photo of my kid's Harry Potter alphabet poster and asked it to "find the spell that turns off the light." </p><p>Not only did it correctly identify "Nox," but it drew a box around it on the poster. This little 2B model is doing things that even huge models like GPT-4o and Claude Opus can't, and it's completely open source. Absolutely wild.</p><p>Moondream 3 preview - grounded vision reasoning 9B MoE (2B active) (<a target="_blank" href="https://x.com/vikhyatk/status/1968800178640429496">X</a>, <a target="_blank" href="https://huggingface.co/moondream/moondream3-preview">HF</a>)</p><p>Speaking of vision reasoning models, just a bit after the show concluded, our friend Vik released a demo of Moondream 3, a reasoning vision model 9B (A2B) that is also topping the charts! 
I didn't have tons of time to get into this, but the release thread shows this to be an exceptional open source visual reasoner, also beating the giants!</p><p>Tongyi DeepResearch: A3B open-source web agent claims parity with OpenAI Deep Research ( <a target="_blank" href="https://x.com/Ali_TongyiLab/status/1967988004179546451">X</a>, <a target="_blank" href="https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B">HF</a> )</p><p>Speaking of smaller models obliterating huge ones, Tongyi released a bunch of papers and a model this week that can do Deep Research on the level of OpenAI, even beating it, with a Qwen finetune that has only 3B active parameters! </p><p>With insane scores like 32.9 (38.3 in Heavy mode) on Humanity's Last Exam (OAI Deep Research gets 26%) and a staggering 98.6% on SimpleQA, this innovative approach uses a lot of RL and synthetic data to train a Qwen model to find what you need. The paper is full of incredible insights into how to build automated RL environments to get to this level. </p><p>AI Art, Diffusion, 3D and Video</p><p>This category of AI has been blowing up; we've seen SOTA week after week with Nano Banana, then Seedream 4, and now a few more insane models.</p><p><strong>Tencent's Hunyuan released SRPO</strong> (<a target="_blank" href="https://x.com/TencentHunyuan/status/1967853314915315945">X</a>, <a target="_blank" href="https://huggingface.co/tencent/SRPO">HF</a>, <a target="_blank" href="https://tencent.github.io/srpo-project-page/">Project</a>, <a target="_blank" href="https://x.com/hellorob/status/1967667203593183343/photo/2">Comparison X</a>), Semantic Relative Preference Optimization, a new method to finetune diffusion models quickly without breaking the bank. They also released a very realistic-looking finetune trained with SRPO. Some of the generated results are super realistic, but it's more than just a model; there's a whole new finetuning method here! 
</p><p>Hunyuan also updated their 3D model and announced a full-blown <a target="_blank" href="https://x.com/TencentHunyuan/status/1968711532033851657">3D studio</a> that does everything from 3D object generation to meshing, texture editing & more. </p><p>Reve launches a 4-in-1 AI visual platform taking on Nano 🍌 and Seedream (<a target="_blank" href="https://x.com/cantrell/status/1967655268642386361">X</a>, <a target="_blank" href="https://app.reve.com/">Reve</a>, <a target="_blank" href="https://blog.reve.com/posts/the-new-reve/">Blog</a>)</p><p>A newcomer, Reve has launched a comprehensive new AI visual platform bundling image creation, editing, remixing, a creative assistant, and API integration, all aimed at making advanced editing more accessible, and all built on their own proprietary models. </p><p>What stood out to me, though, is the image editing UI, which lets you select, right on your image, exactly what you want to edit, write a specific prompt for that thing (change colors, objects, add text, etc.), then hit generate, and their model takes all those cues into account! This is way better than just text-prompting the other models! </p><p>Ray3: Luma’s “reasoning” video model with native HDR, Draft Mode, and Hi‑Fi mastering (<a target="_blank" href="https://x.com/LumaLabsAI/status/1968684330034606372">X</a>, <a target="_blank" href="https://dream-machine.lumalabs.ai/ideas">Try It</a>)</p><p>Luma released the third iteration of their video model, Ray, and this one does... HDR! It also has Draft Mode (for quick iteration) and first/last frame interpolation, and they claim to be "production ready" with extreme prompt adherence. </p><p>The thing that struck me is the reasoning part: their video model now reasons to let you create more complex scenes, and it will evaluate itself and select the best generation for you! This is quite bonkers; can't wait to play with it! 
</p><p>World models are getting closer - World Labs announced Marble (<a target="_blank" href="https://x.com/XRarchitect/status/1968356682888823060">Demo</a>)</p><p>We've covered a whole host of world models: Genie 3, Hunyuan 3D world models, Mirage and a bunch more! </p><p>Dr. Fei-Fei Li's World Labs was one of the first to tackle the world model concept, and their recent release shows incredible progress (and finally lets us play with it!) </p><p>Marble takes images and creates Gaussian splats that can be used in 3D environments. So now you can take any AI-generated image and turn it into a walkable 3D world! </p><p>Google puts Gemini in Chrome (<a target="_blank" href="https://x.com/search?q=gemini%20chrome&#38;src=typed_query">X</a>, <a target="_blank" href="https://blog.google/products/chrome/chrome-reimagined-with-ai/">Blog</a>)</p><p>This happened after the show today, and while it's not fully rolled out yet, I told you when we covered Comet from Perplexity and Dia from The Browser Company that Google would not be far behind! </p><p>Today they announced that Gemini is coming to Chrome, and it will let users chat with a bunch of their tabs, summarize across tabs, and soon do agentic tasks like clicking things and shopping for you? 😅</p><p>I wonder if this means Google will offer this for free to over 1B Chrome users or introduce some sort of Gemini tier cross-over? Remains to be seen, but it's very exciting to see AI browsers all over! </p><p>The best feature could be a hidden one: Gemini in Chrome will know your browsing history, so you'll be able to ask it about that one website you visited a while ago that had sharks! </p><p>Folks, I can go on and on today, literally there's a new innovative video model from ByteDance and a few more image models, but alas, I have to prioritize and give you only the top important news. 
So, I'll just remind you that I put all the links in the TL;DR below and that you should absolutely check out the video version of our show on YT because a lot of visual things are happening and we're playing with all of them live! </p><p><p>Hey, just before you get to the “links”, consider subscribing to help me keep this going? 🙏</p></p><p>See you next week 🫡 Don't forget to subscribe (and if you already subbed, share this with a friend or two?) </p><p>TL;DR and show notes - September 18, 2025</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/nisten">@nisten</a></p><p>* Guest: Jeremy Berman (<a target="_blank" href="https://x.com/jerber888">@jerber888</a>) - SOTA on ARC-AGI</p><p>* <strong>Open Source</strong></p><p>* Perceptron AI introduces Isaac 0.1: a 2B param perceptive-language model (<a target="_blank" href="https://x.com/perceptroninc/status/1968365052270150077">X</a>, <a target="_blank" href="https://huggingface.co/PerceptronAI/Isaac-0.1">HF</a>, <a target="_blank" href="https://www.perceptron.inc/blog/introducing-isaac-0-1">Blog</a>)</p><p>* Tongyi DeepResearch: A3B open-source web agent claims parity with OpenAI Deep Research (<a target="_blank" href="https://x.com/Ali_TongyiLab/status/1967988004179546451">X</a>, <a target="_blank" href="https://huggingface.co/Alibaba-NLP/Tongyi-DeepResearch-30B-A3B">HF</a>)</p><p>* Mistral updates Magistral-Small-2509 (<a target="_blank" href="https://huggingface.co/mistralai/Magistral-Small-2509">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* GPT-5-Codex release: Agentic coding upgrade for Codex (<a target="_blank" href="https://x.com/OpenAI/status/1967636903165038708">X</a>, <a 
target="_blank" href="https://openai.com/index/introducing-upgrades-to-codex/">OpenAI Blog</a>)</p><p>* Meta Connect - New AI glasses with display, new AI mode (<a target="_blank" href="https://x.com/lukegotbored/status/1968497570008744149">X Recap</a>)</p><p>* NBER & OpenAI - How People Use ChatGPT: Growth, Demographics, and Scale (<a target="_blank" href="https://twitter.com/rohanpaul_ai/status/1967769809929822659">X</a>, <a target="_blank" href="https://forklightning.substack.com/p/how-people-use-chatgpt">Blog</a>, <a target="_blank" href="https://www.nber.org/papers/w34255">NBER Paper</a>)</p><p>* ARC-AGI: New SOTA by Jeremy Berman and Eric Pang using Grok-4 (<a target="_blank" href="https://x.com/arcprize/status/1967998885701538060">X</a>, <a target="_blank" href="https://jeremyberman.substack.com/p/how-i-got-the-highest-score-on-arc-agi-again">Blog</a>)</p><p>* OpenAI’s reasoning system aces 2025 ICPC World Finals with a perfect 12/12 (<a target="_blank" href="https://x.com/MostafaRohani/status/1968360976379703569">X</a>)</p><p>* OpenAI adds thinking budgets to ChatGPT app (<a target="_blank" href="https://x.com/OpenAI/status/1968395215536042241">X</a>)</p><p>* Gemini in Chrome: AI assistant across tabs + smarter omnibox + safer browsing (<a target="_blank" href="https://x.com/search?q=gemini%20chrome&#38;src=typed_query">X</a>, <a target="_blank" href="https://blog.google/products/chrome/chrome-reimagined-with-ai/">Blog</a>)</p><p>* Anthropic admits <strong>Claude bugs</strong> - <a target="_blank" href="https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues">Detailed analysis</a> </p><p>* <strong>This week's Buzz</strong></p><p>* W&B Models + Weave! You can now log your RL runs in W&B Weave 👏 (<a target="_blank" href="https://x.com/shawnup/status/1968403633764266189">X</a>, <a target="_blank" href="https://weave-docs.wandb.ai/guides/tools/weave-in-workspaces">W&B Link</a>) </p><p>* W&B Fully Connected London - tickets are running out! 
Use FCLNTHURSAI for a free ticket on me! (<a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected/london/">Register Here</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Moondream 3 (Preview): 9B MoE VLM with 2B active targets frontier-level visual reasoning (<a target="_blank" href="https://x.com/vikhyatk/status/1968800178640429496">X</a>, <a target="_blank" href="https://huggingface.co/moondream/moondream3-preview">HF</a>)</p><p>* Ray3: Luma’s “reasoning” video model with native HDR, Draft Mode, and Hi‑Fi mastering (<a target="_blank" href="https://x.com/LumaLabsAI/status/1968684330034606372">X</a>)</p><p>* HuMo: human‑centric, multimodal video gen from ByteDance/Tsinghua (<a target="_blank" href="https://x.com/altryne/status/1968003981604733359">X</a>, <a target="_blank" href="https://huggingface.co/bytedance-research/HuMo">HF</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Reka Speech: high-throughput multilingual ASR and speech translation for batch-scale pipelines (<a target="_blank" href="https://x.com/RekaAILabs/status/1967989101111722272">X</a>, <a target="_blank" href="https://reka.ai/news/reka-speech-high-throughput-speech-transcription-and-translation-model-with-timestamps">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Hunyuan SRPO (Semantic Relative Preference Optimization) supercharges diffusion models (<a target="_blank" href="https://x.com/TencentHunyuan/status/1967853314915315945">X</a>, <a target="_blank" href="https://huggingface.co/tencent/SRPO">HF</a>, <a target="_blank" href="https://tencent.github.io/srpo-project-page/">Project</a>, <a target="_blank" href="https://x.com/hellorob/status/1967667203593183343/photo/2">Comparison X</a>)</p><p>* Hunyuan 3D 3.0 (<a target="_blank" href="https://x.com/TencentHunyuan/status/1967873084960260470">X</a>, <a target="_blank" href="https://3d.hunyuan.tencent.com/">Try it</a>)</p><p>* FeiFei WorldLabs presents Marble (<a target="_blank" 
href="https://x.com/XRarchitect/status/1968356682888823060">Demo</a>)</p><p>* Reve launches 4-in-1 AI visual platform (<a target="_blank" href="https://x.com/cantrell/status/1967655268642386361">X</a>, <a target="_blank" href="https://app.reve.com/">Reve</a>, <a target="_blank" href="https://blog.reve.com/posts/the-new-reve/">Blog</a>)</p><p>* <strong>Tools</strong></p><p>* Chrome adds Gemini (<a target="_blank" href="https://blog.google/products/chrome/new-ai-features-for-chrome/">Blog</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-18-gpt-5-codex-oai-wins</link><guid isPermaLink="false">substack:post:173985701</guid><dc:creator><![CDATA[Alex Volkov and Jeremy Berman]]></dc:creator><pubDate>Fri, 19 Sep 2025 00:36:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/173985701/5e84bd3b6dc927b7039d8afb7619b43c.mp3" length="100725631" type="audio/mpeg"/><itunes:author>Alex Volkov and Jeremy Berman</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6295</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/173985701/11aca644d7d20c3cef9206a702265985.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Sep 11 - SeeDream 4, Lucy 14B, ChatGPT gets MCP, OpenAI $300B deal with Oracle, Qwen Next A3B & more AI news]]></title><description><![CDATA[<p>Hey Everyone, Alex here, thanks for being a subscriber! Let's get you caught up on this weeks most important AI news! </p><p>The main thing you need to know this week is likely the incredible Image model that ByteDance released, that overshoots the (incredible image model from last 2 weeks) nano 🍌. ByteDance really outdid themselves on this one! 
</p><p>But also: a video model with super-fast generation, an OpenAI rumor made Larry Ellison the richest man alive, ChatGPT gets MCP powers (under a flag you can enable) and much more! </p><p>This week we covered a lot of visual stuff, so while the podcast format is good enough, it's really worth tuning in to the video recording to enjoy the full show. </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>AI Art and Diffusion</p><p>It's rare for me to start the newsletter not on Open Source AI news, but hey, at least this way you know that I'm writing it and not some AI, right? 😉</p><p>ByteDance SeeDream 4 - 4K SOTA image generation and editing model with up to 6 reference images (<a target="_blank" href="https://fal.ai/models/fal-ai/bytedance/seedream/v4/edit/requests">Fal</a>, <a target="_blank" href="https://replicate.com/bytedance/seedream-4">Replicate</a>)</p><p>The level of detail on ByteDance's new model has really made all the hosts on ThursdAI stop and go... huh? Is this AI? ByteDance really outdid themselves with this image model: it not only generates images, it's also a fully functional natural-language image-editing model. It's a diffusion transformer, able to generate 2K and 4K images fast (under 5 seconds?) while allowing up to 6 reference images to be provided for the generation. </p><p>This is going to be incredible for all kinds of purposes: AI art, marketing, etc. The prompt adherence is quite incredible, and text is also crisp and sharp at those 2K/4K resolutions. We created this image live on the show with it (using a prompt extended by another model).</p><p>I then provided my black and white headshot and the above image and asked it to insert me as a cartoon character, and it did, super quick, and even got my bomber jacket and the W&B logo on it in there! 
Notably, nothing else was changed in the image, showing just how incredible this one is for image editing. </p><p>If you want enhanced realism, our friend FoFr from Replicate reminded us that using IMG_3984.CR2 in the prompt will make the model produce images that are closer to reality, even if they depict some incredibly unrealistic things, like a pack of lions forming his nickname.</p><p>Additional uses for this model are still being discovered, and one user already noted that since this model outputs 4K resolution, it can be used as a <a target="_blank" href="https://x.com/BrentLynch/status/1965922591497134319">creative upscaler</a> for other model outputs. Just shove your photo from another AI into Seedream and ask for an upscale. Just beware that creative upscalers change some amount of detail in the generated picture. </p><p><strong>Decart AI's Lucy 14B Redefines Video Generation Speeds! </strong></p><p>If Seedream blew my mind with images, Decart's Lucy 14B absolutely shattered my expectations for video generation speed. We're talking about generating 5-second videos from images in 6.5 seconds. That's almost faster than watching the video itself!</p><p>This video model is not open source yet (despite them adding 14B to the name), but its smaller 5B sibling was open-sourced. The speed-to-quality ratio is really insane here, and while Lucy will not generate or animate text or faces that well, it does produce some decent imagery, but SUPER fast. This is really great for iteration, as AI video is like a roulette machine: you have to generate a lot of tries to see a good result. </p><p>This, paired with Seedream (which is also really fast), is a game changer in the AI Art world! So stoked to see what folks will be creating with these! 
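Both of these are callable from Python via Fal's client; here's a hedged sketch for Lucy's image-to-video endpoint (the endpoint slug matches the Fal demo link, but the argument names are my assumption from Fal's typical schema, so check the model page before relying on them):

```python
# Hedged sketch: Lucy 14B image-to-video via fal-client.
# Needs `pip install fal-client` and a FAL_KEY env var to actually run;
# the argument names below are an assumption, so verify on the model page.
import os

LUCY_ENDPOINT = "decart/lucy-14b/image-to-video"

def animate(image_url: str, prompt: str) -> dict:
    import fal_client  # imported lazily so the sketch loads without the package
    # subscribe() blocks until the generation finishes and returns the result
    return fal_client.subscribe(
        LUCY_ENDPOINT,
        arguments={"image_url": image_url, "prompt": prompt},
    )

if os.environ.get("FAL_KEY"):  # only call the API when a key is configured
    result = animate("https://example.com/still.png", "slow dolly zoom, cinematic")
```
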
</p><p><strong>Bonus Round: Decart's Real-Time Minecraft Mod for Oasis 2 (</strong><a target="_blank" href="https://x.com/DecartAI/status/1963758685995368884"><strong>X</strong></a><strong>)</strong></p><p>The same team behind Lucy also dropped Oasis 2.0, a Minecraft mod that generates game environments in real-time using diffusion models. I got to play around with it live, and watching Minecraft transform into different themed worlds as I moved through them was surreal.</p><p>Want a steampunk village? Just type it in. Futuristic city? Done. The frame rate stayed impressively smooth, and the visual coherence as I moved through the world was remarkable. It's like having an AI art director that can completely reskin your game environment on demand. And while the current quality remains low res, if you consider where Stable Diffusion 1.4 was 3 years ago, and where Seedream 4 is now, and do the same extrapolation to Oasis, in 2-3 years we'll be reskinning whole games on the fly and every pixel will be generated (like Jensen loves to say!) </p><p><strong>OpenAI adds full MCP to ChatGPT (under a flag) </strong></p><p>This is huge, folks. I've been waiting for this for a while, and finally, OpenAI quietly added full MCP (Model Context Protocol) support to ChatGPT via a hidden "developer mode."</p><p><strong>How to Enable MCP in ChatGPT</strong></p><p>Here's the quick setup I showed during the stream:</p><p>* Go to ChatGPT settings → Connectors</p><p>* Scroll down to find "Developer Mode" and enable it</p><p>* Add MCP servers (I used Rube.ai from Composio)</p><p>* Use GPT-4o in developer mode to access your connectors</p><p>During the show, I literally had ChatGPT pull Nisten's last five tweets using the Twitter MCP connector. 
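If you're wondering what you're actually adding in that "Add MCP servers" step: an MCP connector is just a server that speaks the Model Context Protocol, registered by URL. A hypothetical remote-connector entry looks something like this (every name and URL below is illustrative, not Rube's real config):

```json
{
  "name": "my-twitter-tools",
  "url": "https://example.com/mcp",
  "transport": "streamable-http",
  "auth": { "type": "oauth" }
}
```
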
The tweet-pulling demo worked flawlessly (though Nisten was a bit concerned about what tweets it might surface 😂).</p><p>The implications are massive - you can now connect ChatGPT to GitHub, databases, your local files, or chain multiple tools together for complex workflows. As Wolfram pointed out though, watch your context usage - each MCP connector eats into that 200K limit.</p><p><strong>Big Moves: Investments and Infrastructure</strong></p><p>Speaking of OpenAI, let's talk money, because the stakes are getting astronomical. OpenAI reportedly has a $300 billion (!) deal with Oracle for compute infrastructure over five years, starting in 2027. That's not a typo - $60 billion per year for compute. Larry Ellison just became the world's richest person, and Oracle's stock shot up 40% on the news in just a few days! This has got to be one of the biggest compute deals the world has ever heard of!</p><p>The scale is hard to comprehend. We're talking about potentially millions of H100 GPUs worth of compute power. When you consider that most AI companies are still figuring out how to profitably deploy thousands of GPUs, this deal represents infrastructure investment at a completely different magnitude.</p><p>Meanwhile, Mistral just became Europe's newest decacorn, valued at $13.8 billion after receiving $1.3 billion from ASML. For context, ASML makes the lithography machines that TSMC uses to manufacture chips for Nvidia. They're literally at the beginning of the AI chip supply chain, and now they're investing heavily in Europe's answer to OpenAI.</p><p>Wolfram made a great point - we're seeing the emergence of three major AI poles: American companies (OpenAI, Anthropic), Chinese labs (Qwen, Kimi, Ernie), and now European players like Mistral. 
Each is developing distinct approaches and capabilities, and the competition is driving incredible innovation.</p><p><strong>Anthropic's Mea Culpa and Code Interpreter</strong></p><p>After weeks of users complaining about Claude's degraded performance, Anthropic finally admitted there were bugs affecting both Claude Opus and Sonnet. Nisten, who tracks these things closely, speculated that the issues might be related to running different quantization schemes on different hardware during peak usage times. We already reported last week that they admitted that "something was affecting intelligence", but this week they said they pinpointed (and fixed) two bugs related to inference! </p><p>They also launched a code interpreter feature that lets Claude create and edit files directly. It's essentially their answer to ChatGPT's code interpreter - giving Claude its own computer to work with. The demo showed it creating Excel files, PDFs, and documents with complex calculations. Having watched Claude struggle with file operations for months, this is a welcome addition.</p><p><strong>🐝 This Week's Buzz: GLM 4.5 on W&B and We're on Open Router!</strong></p><p>Over at Weights & Biases, we've got some exciting updates for you. First, we've added Zhipu AI's GLM 4.5 to <a target="_blank" href="http://wandb.me/inference">W&B Inference</a>! This 300B+ parameter model is an absolute beast for coding and tool use, ranking among the top open models on benchmarks like SWE-bench. We've heard from so many of you, including Nisten, about how great this model is, so we're thrilled to host it. You can try it out now and get $2 in free credits to start.</p><p>And for all you developers out there, you can <a target="_blank" href="https://x.com/olafgeibig/status/1949779562860056763">use a proxy</a> like LiteLLM to run GLM 4.5 from our inference endpoint inside Anthropic's Claude Code if you're looking for a powerful and cheap alternative! </p><p>Second, we're now on Open Router! 
You can find several of our hosted models, like GPT-OSS and DeepSeek Coder, on the platform. If you're already using Open Router to manage your model calls, you can now easily route traffic to our high-performance inference stack.</p><p><strong>Open Source Continues to Shine</strong></p><p>Open source LLM releases took a bit of a break this week, but there were still interesting models! </p><p>Baidu released ERNIE-4.5, a very efficient 21B parameter "thinking" MoE that only uses 3B active parameters per token. From the UAE, MBZUAI released K2-Think, a finetune of Qwen 2.5 that's showing some seriously impressive math scores. And Moonshot AI updated Kimi K2, doubling its context window to 256K and further improving its already excellent tool use and writing capabilities.</p><p>Tencent released an update to HunyuanImage 2.1, which is a bit slow, but also generates 2K images and is decent at text. </p><p>Qwen drops Qwen3-<strong>Next</strong>-80B-A3B (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1966197643904000262">X</a>, <a target="_blank" href="https://t.co/zHHNBB2l5X">HF</a>)</p><p>In breaking news after the show (we were expecting this on the show itself), the Alibaba folks dropped a much more streamlined version of the next Qwen: 80B parameters with only 3B active! They call this an "Ultra Sparse MoE", and it beats Qwen3-32B in perf and rivals Qwen3-235B in reasoning & long context. </p><p>This is quite unprecedented, as getting models this sparse to work well takes a lot of effort and skill, but the Qwen folks delivered! </p><p>Tools</p><p>We wrapped with a quick shoutout to EBSynth, a nifty video editor that lets you draw or add elements to one frame and extrapolates them to the rest—like Photoshop for motion. It's browser-based and fun for VFX tweaks; check the demo video on <a target="_blank" href="https://twitter.com/ebsynth/status/1965448361974362432">X</a>. Simple but powerful for quick video hacks. 
Speaking of Photoshop, it was confirmed that Nano Banana is going to be embedded into Photoshop for image editing! </p><p>Summary & TL;DR</p><p>What a week—Seedream and Lucy alone have me rethinking how fast AI can iterate creatively, while MCP in ChatGPT feels like the dawn of truly accessible agents. With open-source keeping pace and big deals fueling the fire, AI's multimodal future is accelerating. Thanks for tuning in, folks; if you missed the live vibes, catch the podcast or hit <a target="_blank" href="sub.thursdai.news">sub.thursdai.news</a> for all the links. See you next Thursday—what blew your mind this week? Drop a comment and share with a friend; it's the best way to support this endeavor! </p><p>TL;DR of all topics covered:</p><p><strong>AI Models & APIs:</strong></p><p>* <strong>ChatGPT adds full MCP support</strong> - Developer mode unlocks tool connectors for 400M+ users (<a target="_blank" href="https://x.com/altryne/status/1965843358653481450">Setup Guide</a>)</p><p>* <strong>Seedream 4.0</strong> - ByteDance's unified image generation/editing model creates 4K images in ~1.8 seconds (<a target="_blank" href="https://x.com/fofrAI/status/1942899932505035222">X</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/bytedance/seedream/v4/edit/playground">Try it</a>)</p><p>* <strong>Lucy 14B</strong> - Decart's lightning-fast video model generates 5-second clips in 6.5 seconds (<a target="_blank" href="https://fal.ai/models/decart/lucy-14b/image-to-video">Demo</a>, <a target="_blank" href="https://lucy.decart.ai/">Page</a>)</p><p>* <strong>Claude bug fixes</strong> - Anthropic admits to performance issues and releases code interpreter (<a target="_blank" href="https://www.anthropic.com/news/create-files">Blog</a>)</p><p>* <strong>Sonoma Dusk & Sky</strong> - Mystery models on OpenRouter with 2M context, rumored to be Grok (<a target="_blank" href="https://openrouter.ai/openrouter/sonoma-sky-alpha">OpenRouter</a>)</p><p><strong>This Week's W&B 
Buzz:</strong></p><p>* <strong>OpenRouter integration</strong> - Serving models to broader developer community (<a target="_blank" href="https://openrouter.ai/provider/wandb">Try us</a>)</p><p>* <strong>GLM 4.5</strong> - 350B parameter coding model added to inference (<a target="_blank" href="https://x.com/weights_biases/status/1965176118413344778">X</a>, <a target="_blank" href="https://wandb.ai/site/inference/glm-4.5">Try It</a>)</p><p>* W&B inference in Claude Code with LiteLLM (<a target="_blank" href="https://x.com/olafgeibig/status/1949779562860056763">Olaf's Guide</a>)</p><p><strong>Open Source Releases:</strong></p><p>* <strong>ERNIE 4.5</strong> - Baidu open-sources 21B parameter thinking model with 3B active parameters (<a target="_blank" href="https://x.com/Baidu_Inc">X</a>, <a target="_blank" href="https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking">HF</a>)</p><p>* <strong>K2-think</strong> - MBZUAI's Qwen 2.5 fine-tune with strong math performance (<a target="_blank" href="https://x.com/ericxing/status/1965667372284739977">X</a>)</p><p>* <strong>Kimi K2 update</strong> - Doubled context to 256K, improved tool use (<a target="_blank" href="https://x.com/LechMazur/status/1965729450940588459">X</a>)</p><p>* <strong>HunyuanImage 2.1</strong> - Tencent's 17B parameter open-source 2K image model (<a target="_blank" href="https://x.com/TencentHunyuan/status/1965433678261354563">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanImage-2.1">HF</a>)</p><p>* Qwen-next-80B-A3B - Alibaba's next frontier MoE with 3B active param (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1966197643904000262">X</a>, <a target="_blank" href="https://t.co/zHHNBB2l5X">HF</a>)</p><p><strong>Voice & Audio:</strong></p><p>* <strong>Qwen3-ASR-Flash</strong> - 11-language speech recognition with singing support (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1965068737297707261">X</a>)</p><p>* <strong>Stable Audio 2.5</strong> - 
Enterprise audio generator creating 3-minute tracks in <2 seconds (<a target="_blank" href="https://twitter.com/StabilityAI/status/1965784409052995916">X</a>, <a target="_blank" href="https://stability.ai/news/stability-ai-introduces-stable-audio-25-the-first-audio-model-built-for-enterprise-sound-production-at-scale">Blog</a>, <a target="_blank" href="https://stableaudio.com/generate">Try It</a>)</p><p>* <strong>ElevenLabs Voice Remixing</strong> - Modify cloned voices for age, gender, accent (<a target="_blank" href="https://x.com/elevenlabsio">X</a>)</p><p><strong>Business & Investment:</strong></p><p>* <strong>OpenAI-Oracle deal</strong> - $300B infrastructure agreement over 5 years</p><p>* <strong>Mistral funding</strong> - $1.3B investment from ASML at $13.8B valuation (<a target="_blank" href="https://www.cnbc.com/2025/09/09/ai-firm-mistral-valued-at-14-billion-as-asml-takes-major-stake.html">Blog</a>)</p><p><strong>Tools:</strong></p><p>* <strong>Oasis 2.0</strong> - Real-time Minecraft world generation mod from Decart (<a target="_blank" href="http://oasis2.decart.ai/demo">Try It</a>)</p><p>* <strong>EbSynth</strong> - Video editing tool for frame-by-frame manipulation (<a target="_blank" href="https://x.com/ebsynth/status/1965448361974362432">X</a>)</p><p><strong>Hosts:</strong></p><p>* Alex Volkov (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Wolfram RavenWlf (<a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Yam Peleg (<a target="_blank" href="https://x.com/yampeleg">@yampeleg</a>)</p><p>* Nisten (<a target="_blank" href="https://x.com/nisten">@nisten</a>)</p><p>* LDJ (<a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/sep-11</link><guid isPermaLink="false">substack:post:173398856</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 12 Sep 2025 02:41:10 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/173398856/7e0c2ddd7c316c92143c74acd0131cee.mp3" length="90689176" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5668</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/173398856/91aa806c864bcff63ee4dd1e9c10645d.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Sep 4 - Codex Rises, Anthropic Raises $13B, Nous plays poker, Apple speeds up VLMs & more AI news]]></title><description><![CDATA[<p>Wohoo, hey ya’ll, Alex here,</p><p>I'm back from the desert (pic at the end) and what a great feeling it is to be back in the studio to talk about everything that happened in AI! </p><p>It's been a pretty full week (or two) in AI, with the coding agent space heating up, Grok entering the ring and taking over free tokens, Codex 10xing usage and Anthropic... well, we'll get to Anthropic. </p><p>Today on the show we had Roger and Bhavesh from Nous Research cover the awesome Hermes 4 release and the new PokerBots benchmark, then we had a returning favorite, Kwindla Hultman Kramer, to talk about the GA of Realtime voice from OpenAI. </p><p>Plus we got some massive funding news, some drama with model quality on Claude Code, and some very exciting news right here from CoreWeave acquiring OpenPipe! 👏 So grab your beverage of choice, settle in (or skip to the part that interests you) and let's take a look at the last week (or two) in AI! 
</p><p><strong>Open Source: Soulful Models and Poker-Playing Agents</strong></p><p>This week did not disappoint when it comes to open source! Our friends at Nous Research released the 14B version of Hermes 4, after releasing the 405B and 70B versions last week. This company continues to excel at finetuning models for powerful, and sometimes just plain weird (in a good way), use cases. </p><p><strong>Nous Hermes 4 (14B, 70B, 405B) and the Quest for a "Model Soul" (</strong><a target="_blank" href="https://x.com/NousResearch/status/1960416954457710982"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/NousResearch/Hermes-4-14B"><strong>HF</strong></a><strong>)</strong></p><p>Roger and Bhavesh from Nous came on to announce the release of the smaller (14B) version of Hermes 4 and cover last week's releases of the larger 70B and 405B brothers. The Hermes series of finetunes has always been on our radar, as unique data mixes turned them into uncensored, valuable, and creative models and unlocked a bunch of new use cases. </p><p>But the wildest part? They told us they intentionally <em>stopped </em>training the model not when reasoning benchmarks plateaued, but when they felt it started to "lose its model soul." They monitor the entropy and chaos in the model's chain-of-thought, and when it becomes too sterile and predictable, they hit the brakes to preserve that creative spark. This focus on qualities beyond raw benchmark scores is why Hermes 4 is showing some really interesting generalization, performing exceptionally well on benchmarks like EQBench3, which tests emotional and interpersonal understanding. It's a model that's primed for RL not just in math and code, but in creative writing, role-play, and deeper, more "awaken" conversations.
It’s a soulful model that's just fun to talk to.</p><p><strong>Nous Husky Hold'em Bench: Can Your LLM Win at Poker? (</strong><a target="_blank" href="https://huskybench.com/"><strong>Bench</strong></a><strong>)</strong></p><p>As if a soulful model wasn't enough, the Nous team also dropped one of the most creative new evals I've seen in a while: <strong>Husky Hold'em Bench</strong>. We had Bhavesh, one of its creators, join the show to explain. This isn't a benchmark where the LLM plays poker directly. Instead, the LLM has to <em>write a Python poker bot</em> from scratch, under time and memory constraints, which then competes against bots written by other LLMs in a high-stakes tournament. Very interesting approach, and we love creative benchmarking here at ThursdAI! </p><p>This is a brilliant way to test for true strategic reasoning and planning, not just pattern matching. It's an "evergreen" benchmark that gets harder as the models get better. Early results are fascinating: Claude 4 Sonnet and Opus are currently leading the pack, but Hermes 4 is the top open-source model.</p><p><strong>More Open Source Goodness</strong></p><p>The hits just kept on coming this week. <strong>Tencent</strong> open-sourced <strong>Hunyuan-MT-7B</strong>, a translation model that swept the WMT2025 competition and rivals GPT-4.1 on some benchmarks. Having a small, powerful, specialized model like this is huge for anyone doing large-scale data translation for training or needing fast on-device capabilities.</p><p>From Switzerland, we got <strong>Apertus-8B and 70B</strong>, a set of fully open (Apache 2.0 license, open data, open training recipes!) multilingual models trained on a massive 15 trillion tokens across 1,800 languages.
It’s fantastic to see this level of transparency and contribution from European institutions.</p><p>And <strong>Alibaba’s Tongyi Lab</strong> released <strong>WebWatcher</strong>, a powerful multimodal research agent that can plan steps, use a suite of tools (web search, OCR, code interpreter), and is setting new state-of-the-art results on tough visual-language benchmarks, often beating models like GPT-4o and Gemini.</p><p>All links are in the TL;DR at the end.</p><p><strong>BREAKING NEWS: Google Drops Embedding Gemma 308M (</strong><a target="_blank" href="https://x.com/GoogleDeepMind/status/1963635422698856705"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/google/embeddinggemma-300m"><strong>HF</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/spaces/webml-community/semantic-galaxy"><strong>Try It</strong></a><strong>)</strong></p><p>Just as we were live on the show, news broke from our friends at Google. They've released <strong>Embedding Gemma</strong>, a new family of open-source embedding models. This is a big deal because they are <em>tiny</em>—the smallest is only 300M parameters and takes just 200MB to run—but they are topping the MTEB leaderboard for models under 500M parameters. For anyone building RAG pipelines, especially for on-device or mobile-first applications, having a small, fast, SOTA embedding model like this is a game-changer.</p><p>It's so optimized for on-device running that it can run fully in your browser on WebGPU, with <a target="_blank" href="https://huggingface.co/spaces/webml-community/semantic-galaxy">this great example</a> from our friend Xenova highlighted on the release blog!
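To make the RAG angle concrete, here is a minimal retrieval sketch. The bag-of-words `embed` below is a deliberately toy stand-in (an assumption, so the sketch stays self-contained); in practice you would swap it for EmbeddingGemma, e.g. via sentence-transformers with the `google/embeddinggemma-300m` checkpoint linked above:

```python
# Minimal embedding-retrieval sketch. `embed` is a toy bag-of-words stand-in;
# with a real model you'd replace it with something like
# SentenceTransformer("google/embeddinggemma-300m").encode (model id from the
# release links above).
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse "vectors".
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and return the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Embedding Gemma is a small open embedding model",
    "FastVLM is a speed-first vision model from Apple",
]
print(retrieve("small embedding model", docs))
# → ['Embedding Gemma is a small open embedding model']
```

The same ranking loop is all a basic RAG retriever does; the embedding model is the only piece that changes.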
</p><p><strong>Big Companies, Big Money, and Big Problems</strong></p><p>It was a rollercoaster week for the big labs, with massive fundraising, major product releases, and a bit of a reality check on the reliability of their services.</p><p><strong>OpenAI's GPT Real-Time Goes GA and gets an upgraded brain (</strong><a target="_blank" href="https://x.com/OpenAIDevs/status/1961124915719053589"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://openai.com/index/introducing-gpt-realtime/#image-input"><strong>Docs</strong></a><strong>)</strong></p><p>We had the perfect guest to break down OpenAI's latest voice offering: Kwindla Kramer, founder of Daily and maintainer of the open-source PipeCat framework. OpenAI has officially taken its <strong>Realtime API to General Availability (GA)</strong>, centered around the new gpt-realtime model.</p><p>Kwindla explained that this is a true speech-to-speech model, not a pipeline of separate speech-to-text, LLM, and text-to-speech models. This reduces latency and preserves more nuance and prosody. The GA release comes with huge upgrades, including support for remote MCP servers, the ability to process image inputs during a conversation, and—critically for enterprise—native SIP integration for connecting directly to phone systems.</p><p>However, Kwindla also gave us a dose of reality. While this is the future, for many high-stakes enterprise use cases, the multi-model pipeline approach is still more reliable. Observability is a major issue with the single-model black box; it's hard to know exactly what the model "heard." And in terms of raw instruction-following and accuracy, a specialized pipeline can still outperform the speech-to-speech model. It’s a classic jagged frontier: for the lowest latency and most natural vibe, GPT Real-Time is amazing. For mission-critical reliability, the old way might still be the right way for now.</p><p>ChatGPT has branching!
</p><p>Just as I was about to finish writing this up, ChatGPT announced a new feature, and this one I had to tell you about! Finally you can branch chats in their interface, which is a highly requested feature! </p><p>Branching seems to be live on the chat interface, and honestly, tiny but important UI changes like these are how OpenAI remains the best chat experience! </p><p><strong>The Money Printer Goes Brrrr: Anthropic's $13B Raise</strong></p><p>Let's talk about the money. Anthropic announced it has raised an absolutely staggering <strong>$13 billion in a Series F round, valuing the company at $183 billion</strong>. Their revenue growth is just off the charts, jumping from a run rate of around $1 billion at the start of the year to over $5 billion by August. This growth is heavily driven by enterprise adoption and the massive success of Claude Code. It's clear that the AI gold rush is far from over, and investors are betting big on the major players. In related news, OpenAI is also reportedly raising $10 billion at a valuation of around $500 billion, primarily to allow employees to sell shares—a huge moment for the folks who have been building there for years.</p><p><strong>Oops... Did We Nerf Your AI? Anthropic's Apology</strong></p><p>While Anthropic was celebrating its fundraise, it was also dealing with a self-inflicted wound. After days of users on X and other forums complaining that Claude Opus felt "dumber," the company finally issued a statement admitting that yes, for about three days, the model's quality was degraded due to a change in their infrastructure stack.</p><p>Honestly, this is not okay. We're at a point where hundreds of thousands of developers and businesses rely on these models as critical tools. To have the quality of that tool change under your feet without any warning is a huge problem. It messes with people's ability to do their jobs and trust the platform. 
While it was likely an honest mistake in pursuit of efficiency, it highlights a fundamental issue with closed, proprietary models. You're at the mercy of the provider. It's a powerful argument for the stability and control that comes with open-source and self-hosted models. These companies need to realize that they are no longer just providing experimental toys; they're providing essential infrastructure, and that comes with a responsibility for stability and transparency.</p><p><strong>This Week's Buzz: CoreWeave Acquires OpenPipe! 🎉</strong></p><p>Super exciting news from the Weights & Biases and CoreWeave family - we've acquired OpenPipe! Kyle and David Corbitt and their team are joining us to help build out the complete AI infrastructure stack from metal to model.</p><p>OpenPipe has been doing incredible work on SFT and RL workflows with their open source ART framework. As Yam showed during the show, they demonstrated you can train a model to SOTA performance on deep research tasks for just $300 in a few hours - and it's all automated! The system can generate synthetic data, apply RLHF, and evaluate against any benchmark you specify.</p><p>This fits perfectly into our vision at CoreWeave - bare metal infrastructure, training and observability with Weights & Biases, fine-tuning and RL with OpenPipe's tools, evaluation with Weave, and inference to serve it all. We're building the complete platform, and I couldn't be more excited!</p><p><strong>Vision & Speed: Apple's FastVLM (</strong><a target="_blank" href="https://huggingface.co/apple/FastVLM-1.5B-int8">HF</a><strong>)</strong></p><p>Just before Apple's event next week, they dropped FastVLM - a speed-first vision model that's 85x faster on time-to-first-token than comparable models. They released it in three sizes (7B, 1.5B, and 0.5B), all optimized for on-device use.</p><p>The demo that blew my mind was real-time video captioning running in WebGPU.
HF CEO Clem showed it processing Apple's keynote video with maybe 250ms latency - the captions were describing what was happening almost in real-time. When someone complained it wasn't accurate because it described "an older man in headphones" when showing an F1 car, Clem pointed out that was actually the previous frame showing Tim Cook - the model was just slightly behind!</p><p><strong>Tools Showdown: Codex vs Claude Code</strong></p><p>To wrap up, we dove into the heated debate between Codex and Claude Code. Sam Altman reported that Codex usage is up <strong>10x in the past two weeks</strong> (!) and improvements are coming. Yam gave us a live demo, and while Claude Code failed to even start up during the show (highlighting why people are switching), Codex with GPT-5 was smooth as butter.</p><p>The key advantages? Codex authenticates with your OpenAI account (no API key juggling), it has MCP support, and perhaps most importantly - it's not just a CLI tool. You can use it for PR reviews on GitHub, as a cloud-based agent, and integrated into Cursor and Windsurf. Though as Yam pointed out, OpenAI really needs to stop calling everything "Codex" - there are like five different products with that name now! 😅</p><p>If you tried Codex (the CLI!) when it was released and gave up, give it a try now; it's significantly upgraded! </p><p>Ok, phew, what a great episode we had! If you're only reading, I strongly recommend checking out the live recording or the edited podcast, and of course, if this newsletter is helpful to you, the best thing you can do to support it is to subscribe, and share with friends 👏 </p><p>P.S. - Just came back from my first Burning Man, it was a challenging, all consuming experience, where I truly disconnected for the first time (first ThursdAI in over 2 years that I didn't know what's going on with AI). It was really fun but I'm happy to be back! See you next week!
</p><p>TL;DR and Show Notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Guests - Roger Jin - <a target="_blank" href="https://x.com/rogershijin">@rogershijin</a> & Bhavesh Kumar <a target="_blank" href="https://x.com/bha_ku21">@bha_ku21</a></p><p>* Kwindla Kramer - <a target="_blank" href="https://x.com/kwindla">@kwindla</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Nous Hermes 4 — 14B launches: compact hybrid reasoning model with tool calling for local and cloud use (<a target="_blank" href="https://twitter.com/NousResearch/status/1963349882837897535">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-4-14B">HF</a>, <a target="_blank" href="https://arxiv.org/pdf/2508.18255">Tech Report</a>)</p><p>* Tencent open-sources Hunyuan-MT-7B translation model after sweeping WMT2025 (<a target="_blank" href="https://x.com/TencentHunyuan/status/1962466712378577300">X</a>, <a target="_blank" href="https://huggingface.co/tencent/Hunyuan-MT-7B">HF</a>)</p><p>* Nous - Husky Hold'em Bench launches as an open-source pokerbot eval for LLM strategic play (<a target="_blank" href="https://x.com/NousResearch/status/1963371292318749043">X</a>, <a target="_blank" href="https://huskybench.com/">Bench</a>)</p><p>* WebWatcher: Alibaba's Tongyi Lab open-sources a vision-language deep research agent that sets new SOTA (<a target="_blank" href="https://x.com/rohanpaul_ai/status/1963018720571462029">X</a>, <a target="_blank" href="https://huggingface.co/Alibaba-NLP/WebWatcher-32B">HF</a>)</p><p>* Apertus-8B and 70B launch as Switzerland's fully 
open, multilingual LLMs trained on 15T tokens across 1,800+ languages (<a target="_blank" href="https://x.com/haeggee/status/1962898537294749960">X</a>, <a target="_blank" href="https://huggingface.co/swiss-ai">HF</a>)</p><p>* Google releases Embedding Gemma - 300M param SOTA embeddings model for RAG ([Breaking News])</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Mistral adds 20+ MCP-powered connectors and controllable Memories to Le Chat for enterprise workflows (<a target="_blank" href="https://x.com/MistralAI/status/1962881086440038545">X</a>, <a target="_blank" href="https://mistral.ai/news/le-chat-mcp-connectors-memories">Blog</a>)</p><p>* Anthropic raises $13B Series F at a $183B post-money valuation (<a target="_blank" href="https://x.com/AnthropicAI/status/1962909475594985935">X</a>, <a target="_blank" href="https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation">Blog</a>)</p><p>* OpenAI fundraises $10B at ~$500B valuation - buyback for employees</p><p>* OpenAI ships gpt-realtime and takes Realtime API to GA with remote MCP tools, image input, and SIP phone calling (<a target="_blank" href="https://x.com/OpenAI">X</a>)</p><p>* OpenAI releases projects for free users with larger file uploads and project-only memory controls</p><p>* OpenAI acquires Statsig & Alex for $1.1B+ to strengthen applications team</p><p>* Grok Code 1 - now taking 50% of coding traffic on OpenRouter</p><p>* Codex usage up 10x in 2 weeks per Sam Altman, with improvements coming</p><p>* Anthropic admits to Claude Opus quality degradation for 3 days due to infrastructure changes</p><p>* <strong>This week's Buzz</strong></p><p>* CoreWeave buys OpenPipe!
🎉 (<a target="_blank" href="https://openpipe.ai/blog/openpipe-coreweave">Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Apple's FastVLM-7B lands with speed-first vision encoder—85x faster TTFT vs peers (<a target="_blank" href="https://x.com/_akhaliq/status/1962018549674684890">X</a>, <a target="_blank" href="https://huggingface.co/apple/FastVLM-7B-int4">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Nano Banana (Imagen 3) continues to dominate as Google's best image model (<a target="_blank" href="http://ai.studio/banana">ai.studio/banana</a>)</p><p>* <strong>Tools</strong></p><p>* Codex vs Claude Code discussion → Codex now significantly better with GPT-5 engine, GitHub PR reviews, and cloud agents</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-4-codex-rises-anthropic</link><guid isPermaLink="false">substack:post:172823476</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 05 Sep 2025 00:28:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/172823476/19b43bc9a4f9b6d97848e8545f0f6906.mp3" length="70562562" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5880</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/172823476/d99b1061ee16eafcea1ef16b57ea024f.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Aug 21 - DeepSeek V3.1’s hybrid upset, ByteDance’s 512K Seed-OSS, Nano Banana wizardry, Agents.md standardizes agents, and more AI]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>This week looked quiet… until about 15 hours before we went live. 
Then the floodgates opened: DeepSeek dropped a hybrid V3.1 that beats their own R1 with fewer thinking tokens, ByteDance quietly shipped a 36B Apache-2.0 long-context family with a “thinking budget” knob, NVIDIA pushed a faster mixed-architecture 9B with open training data, and a stealth image editor dubbed “Nano Banana” started doing mind-bending scene edits that feel like a new tier of 3D-aware control. </p><p>On the big-co side, a mystery “Sonic” model appeared in Cursor and Cline (spoiler: the function call paths say a lot), and OpenAI introduced <a target="_blank" href="https://agents.md">Agents.md</a> to stop the config-file explosion in agentic dev tools. We also got a new open desktop-agent RL framework that 4x’d OSWorld SOTA, an IBM + NASA model for solar weather, and Qwen’s fully open 20B image editor that’s shockingly capable and runnable on your own GPU.</p><p>Our show today was one of the shortest yet, as I had to drop early to prepare for Burning Man 🔥🕺 Speaking of which, Wolfram and the team will host the next episode! </p><p>Ok, let's dive in! </p><p>DeepSeek V3.1: a faster hybrid that thinks less, scores more (<a target="_blank" href="https://x.com/deepseek_ai/status/1958417062008918312">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1">HF</a>)</p><p>DeepSeek does this thing where they let a base artifact “leak” onto Hugging Face, and the rumor mill goes into overdrive. Then, hours before we went live, the full V3.1 model card and an instruct variant dropped. The headline: it’s a hybrid reasoner that combines the strengths of their V3 (fast, non-thinking) and R1 (deep, RL-trained thinking), and on many tasks it hits R1-level scores with fewer thinking tokens. In human terms: you get similar or better quality, faster.</p><p>A few things I want to call out from the release and early testing:</p><p>* Hybrid reasoning mode done right.
The model can plan with thinking tokens and then switch to non-thinking execution, so you don’t have to orchestrate two separate models. This alone simplifies agent frameworks: plan with thinking on, execute with thinking off.</p><p>* Thinking efficiency is real. DeepSeek shows curves where V3.1 reaches or surpasses R1 with significantly fewer thinking tokens. On AIME’25, for example, R1 clocks 87.5% with ~22k thinking tokens; V3.1 hits ~88.4% with ~15k. On GPQA Diamond, V3.1 basically matches R1 with roughly half the thinking budget.</p><p>* Tool-use and search-agent improvements. V3.1 puts tool calls inside the thinking process, instead of doing a monologue and only then calling tools. That’s the pattern you want for multi-turn research agents that iteratively query the web or your internal search.</p><p>* Long-context training was scaled up hard. DeepSeek says they increased the 32K extension phase to ~630B tokens, and the 128K phase to ~209B tokens. That’s a big bet on long-context quality at train time, not just inference-time RoPE tricks. The config shows a max position in the 160K range, with folks consistently running it in the 128K class.</p><p>* Benchmarks show the coding and terminal agent work got a big push. TerminalBench jumps from a painful 5.7 (R1) to 31 with V3.1. Codeforces ratings are up. On SweBench Verified (non-thinking), V3.1 posts ~66 vs R1’s ~44. And you feel it: it’s faster to “get to it” without noodling forever.</p><p>* API parity you’ll actually use. Their API now supports the Anthropic-style interface as well, which means a bunch of editor integrations “just work” with minimal glue. If you’re in a Claude-first workflow, you won’t have to rewire the world to try V3.1.</p><p>* License and availability. This release is MIT-licensed, and you can grab the base model on Hugging Face.
If you prefer hosted, keep an eye on our inference—we’re working to get V3.1 live so you can benchmark without burning your weekend assembling a serving stack.</p><p>Hugging Face: <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base</a></p><p>Quick personal note: I’m seeing a lot of small, pragmatic improvements add up here. If you’re building agents, the hybrid mode plus tighter tool integration is a gift. DeepSeek V3.1 is going to be deployed to W&B Inference service soon! Take a look here to see when it's ready <a target="_blank" href="http://wandb.me/inference">wandb.me/inference</a> </p><p>ByteDance Seed-OSS 36B: Apache-2.0, 512K context, and a “thinking budget” knob (<a target="_blank" href="https://x.com/yshan783399/status/1958225093915779256">X</a>, <a target="_blank" href="https://huggingface.co/collections/ByteDance-Seed/seed-oss-68a609f4201e788db05b5dcd">HF</a>, <a target="_blank" href="https://github.com/ByteDance-Seed/seed-oss">Github</a>)</p><p>I didn’t see much chatter about this one, which is a shame because this seems like a serious release. ByteDance’s Seed team open-sourced a trio of 36B dense models—two Base variants (with and without synthetic data) and an Instruct model—under Apache-2.0, trained on 12T tokens and built for long-context and agentic use. The context window is a native half-million tokens, and they include a “thinking budget” control you can set in 512-token increments so you can trade depth for speed.</p><p>They report strong general performance, long-context RULER scores, and solid code/math numbers for a sub-40B model, with the Instruct variant posting very competitive MMLU/MMLU-Pro and LiveCodeBench results. The architecture is a straightforward dense stack (not MoE), and the models ship with Transformers/vLLM support and 4/8-bit quantization ready to go. 
If you’ve been hunting for a commercial-friendly, long-context 30-something‑B with an explicit reasoning-control dial, this should be on your shortlist.</p><p>A neat detail for the training nerds: two Base releases—one trained with synthetic data, one without—make for a rare apples-to-apples study in how synthetic data shapes base capability. Also worth noting: they previously shipped a Seed-Prover specialized for Lean; it looks like the team is interested in tight domain models and generalists.</p><p>NVIDIA Nemotron Nano 9B V2: mixed architecture, open data, and long-context throughput (<a target="_blank" href="https://x.com/llm_wizard/status/1957516422520996020">X</a>, <a target="_blank" href="https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/nvidia-nemotron-689f6d6e6ead8e77dd641615">HF</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35">Dataset</a>, <a target="_blank" href="https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2">Try It</a>) </p><p>NVIDIA shipped a fully open release of Nemotron Nano 9B V2—base, base-before-alignment/pruning, and a realigned reasoning model—and, crucially, they published most of the pretraining dataset details (~6.6T tokens across premium web, math, code, and SFT). That level of data transparency is rare and makes this a great base for fine-tuners who want reproducibility.</p><p>Under the hood, this is a mixed Mamba+Transformer architecture. NVIDIA is claiming up to 6x higher throughput versus a pure-Transformer peer (they compare to Qwen3-8B) and specifically highlight that they pruned a 12B down to 9B while preserving quality. They also note a single A10 can handle 128K context after compression and distillation passes, which is the kind of practical systems work that matters when you’re running fleets.</p><p>A couple of caveats. 
The license is NVIDIA Open Model License—not Apache-2.0—so read it; it includes restrictions around illegal surveillance and safety bypasses and has revocation clauses. Personally, I appreciate the data openness and the long-context engineering; as always, just make sure the license fits your use case.</p><p>If you’re into longer-context math/coding with small models, the numbers (AIME’25, RULER-128K, GPQA) are impressive for 9B. And if you fine-tune: the availability of both pruned and pre-pruned bases plus the dataset recipe is a rare treat.</p><p>Cohere’s Command-A Reasoning: dense, multilingual, and research-only licensing (<a target="_blank" href="https://x.com/cohere/status/1958542682890047511">X</a>, <a target="_blank" href="https://cohere.com/blog/command-a-reasoning">Blog</a>, <a target="_blank" href="https://huggingface.co/CohereLabs/command-a-reasoning-08-2025?ref=cohere-ai.ghost.io">HF</a>)</p><p>Cohere dropped a new reasoning model focused on enterprise deployment patterns. It’s a dense 111B model, supports a 256K context, and includes very strong multilingual coverage (23 languages is what they called out). What caught my eye: on the BFCL (Berkeley Function-Calling Leaderboard) they show 70%—above DeepSeek R1’s ~63% and GPT-OSS’s ~61%—and they plot the now-familiar test-time compute curves where more thinking tokens yield higher scores.</p><p>This release uses Cohere’s non-commercial research license; if you want commercial usage you’ll need to go through them. That said, for teams who need privately deployable, on-prem reasoning and can work under a research license for prototyping, it’s a serious option. A meta observation from the show: there’s accumulating evidence that more active parameters help multi-hop tool-use chains compared to very sparse MoE at similar effective capacity.
This model nudges in that direction.</p><p>Desktop agents leap: ComputerRL hits 48% on OSWorld (<a target="_blank" href="https://arxiv.org/abs/2508.14040">Paper</a>)</p><p>A new framework dubbed ComputerRL, from <a target="_blank" href="http://Z.ai">Z.ai</a> and folks at Tsinghua University, unifies API calls with GUI actions and scales RL across fleets of virtual desktops, posting a 48.1% success rate on OSWorld versus ~12% for earlier open models. The training system spins up thousands of qemu-in-docker VMs via gRPC; the learning loop alternates RL with supervised fine-tuning and uses a clean step-level binary reward to simplify credit assignment. If you care about practical desktop automation across Ubuntu/Windows/macOS, this is a big jump.</p><p>IBM + NASA’s Surya: open model for solar weather (<a target="_blank" href="https://huggingface.co/nasa-ibm-ai4science/Surya-1.0">HF</a>)</p><p>Scientists get some love: IBM and NASA open-sourced Surya, a transformer trained on nine years of multi-instrument observations (nearly 200 TB) to forecast solar dynamics and space weather—the stuff that can knock satellites and power grids sideways. It’s on Hugging Face, it’s actually runnable, and it’s a fantastic example of open models delivering real-world scientific utility.</p><p>Smaller but notable: InternLM and OpenCUA, plus Intel’s quants</p><p>Two quick flags from the “worth your time” pile. InternLM shipped <a target="_blank" href="https://x.com/intern_lm/status/1958479430361461008">S1 Mini</a>, an 8B vision+language model (ViT on top) that’s multimodal and lightweight; if you need on-device omni-ish behavior on a laptop or tablet, give it a look.
And <a target="_blank" href="https://x.com/xywang626/status/1956400403911962757">OpenCUA</a> 32B (Qwen-based) is a specialized computer-usage agent with strong scores; if you’re building automations that need native OS control, it’s worth benchmarking.</p><p>Also, if you’re running 4-bit: the Intel quantization work is excellent right now. Their 4-bit quants have been extremely high precision in my testing, especially for large coders and reasoners like DeepSeek V3.1. It’s an easy win if you’re trying to squeeze a 30B+ onto a workstation without hemorrhaging quality.</p><p>Big-co updates and platform shifts</p><p>Sonic appears in Cursor and Cline</p><p>If you open Cursor or fire up Cline, you may see a new “Sonic” model toggle. It’s labeled as a reasoning model and, when you poke the function-calling internals, the call paths include “xai/…” strings. Folks report it’s fast and solid for coding. No official docs yet, but I’d be surprised if this isn’t Grok Code in pre-release clothes.</p><p>Agents.md: one file to rule your agents</p><p>Agentic dev stacks have multiplied config files like gremlins: Cursor’s rules.json, Windsurf’s prompts, MCP server manifests, tool schemas, install scripts… and every tool wants a different filename and format. OpenAI’s <a target="_blank" href="https://agents.md">Agents.md</a> is a strong attempt at standardization. It’s just Markdown at repo root that says, “here’s how to set up, build, test, and run this project,” plus any agent-specific caveats. Tools then auto-detect and follow your instructions instead of guessing.</p><p>It’s already supported by OpenAI Codex, Amp, Jules, Cursor, RooCode, and more, with tens of thousands of public repos adopting the pattern. In monorepos, the nearest <a target="_blank" href="https://agents.md">Agents.md</a> wins, so you can override at the package level.
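To make the idea concrete, a minimal AGENTS.md might look like this (hypothetical contents; the format is free-form Markdown, so none of these exact commands or sections are prescribed):

```markdown
# AGENTS.md (hypothetical example)

## Setup
- Run `npm install` before anything else.

## Build and test
- Build with `npm run build`.
- Run tests with `npm test`; all tests must pass before committing.

## Agent-specific caveats
- Do not edit generated files under `dist/`.
- Prefer small, focused commits with descriptive messages.
```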
And human chat instructions still override the file’s guidance, which is the right default.</p><p>GPT‑5 context truncation in the web UI (<a target="_blank" href="https://x.com/pvncher/status/1958289479283650741">reports</a>)</p><p>There’s been a wave of reports that GPT‑5 in the web UI is truncating long prompts even when you’re under the documented context limit. The folks at Repo Prompt reproduced this multiple times and got confirmation from OpenAI that it’s a bug (not a deliberate nerf). If you saw GPT‑5 suddenly forget the bottom half of your carefully structured system prompt in the web app, this likely explains it. The API doesn’t seem affected. Fingers crossed for a quick fix—GPT‑5 is still the best model I’ve used for 300k‑token “read the whole repo and propose a plan” tasks.</p><p>Image and 3D: Nano Banana and Qwen’s open image editor</p><p>Nano Banana: 3D-consistent scene editing from thin air</p><p>A stealth model nicknamed “Nano Banana” surfaced in a web demo and started doing the kind of edits you’d normally expect from a 3D suite with a modeler at the controls. Take two photos—your living room and a product shot—and it composites the object into the scene with shockingly consistent lighting and geometry. Ask for a 3D mesh “five inches off the skin,” and the mesh really does offset. Ask for a new camera angle on a single still, and it renders the scene from above with plausible structure. People have been calling it a game-changer and, for once, it doesn’t feel like hyperbole.</p><p>There’s a strong whiff of 3D world modeling under the hood—some volumetric representation or neural field that enables true view synthesis—and Logan Kilpatrick posted a banana emoji that set speculation on fire. We’ll see where it lands and under what license, but for now the demo has been doing the rounds and it is… wow.</p><p>If you’re wondering where to try it: LMArena is currently the only way to give it a try, but it's supposedly dropping soon!
</p><p>Qwen Image Edit (20B): fully open and already practical (<a target="_blank" href="https://twitter.com/Alibaba_Qwen/status/1957500569029079083">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen-Image-Edit">HF</a>)</p><p>Qwen shipped a 20B image-editing model layered on their existing vision stack, and it’s properly open (permissive license) with strong bilingual text editing in Chinese and English. It handles high-level semantic edits (pose adjustments, rotations, style/IP creation) and low-level touch-ups (add/remove/insert). You can swap objects, expand aspect ratios, keep character identity consistent across panels, and do clean style transfer. It runs locally if you’ve got a decent GPU.</p><p>What I appreciate here is the precision. Product placement tasks like “put this book in this person’s hand at this angle,” or “make the shoes match the dress” come out with the kind of control that used to require hand masking and a dozen passes. And yes, the capybara mascot is back in the release materials, which made my day! 👏</p><p>If Nano Banana is the closed-world demo of what’s “beyond SOTA,” Qwen Image Edit is the open tool you can actually ship with today.</p><p>This week’s buzz from Weights & Biases</p><p>Two quick updates from our side. First, we’re working to bring DeepSeek V3.1 to our inference as fast as we can so you can run serious benchmarks without fussing over serving stacks. Keep an eye on our channels; we’ll shout when it’s live and we’ll have some credits for early feedback.</p><p>Second, our cofounder Chris Van Pelt released Catnip, a containerized multi‑agent coding workspace that runs multiple Claude Code sessions (or other agents) in isolated sandboxes, each with its own context and notification stream. 
If you’ve been juggling parallel coding agents that step on each other’s toes, this is catnip indeed.</p><p>Catnip GitHub: <a target="_blank" href="https://github.com/wandb/catnip">https://github.com/wandb/catnip</a></p><p>Closing thoughts</p><p>A year ago, “thinking tokens” weren't even a curiosity; we only got the first whiff of "reasoning" back in September, and now we’re watching hybrid models that do more with less thinking, tool calls woven inside the reasoning loop, and long-context training regimes scaled up by an order of magnitude. The agent stack is maturing fast—desktop RL is finally clearing real tasks; editor ecosystems are converging on a single config file; and even the stealth drops are clearly building toward world-model‑aware editing and control.</p><p>If you only try two things this week: run DeepSeek V3.1 in both modes (planning with thinking on, execution with thinking off) and throw a complex multi-step tool workflow at it; then take Qwen Image Edit for a spin on a real product-placement or character-consistency task. You’ll feel the future in your hands.</p><p>I’m off to the desert next week for a bit (no internet where I’m going), but Wolfram and the crew will keep the ship on course. If you’re at Burning Man, DM me—would love to say hi out there.
As always, thank you for tuning in and nerding out with us every week.</p><p>— Alex</p><p>TL;DR and show notes</p><p>ThursdAI - Aug 21, 2025 - TL;DR</p><p>TL;DR of all topics covered:</p><p>* <strong>Hosts and Guests</strong></p><p>* <a target="_blank" href="http://x.com/@altryne"><strong>Alex Volkov</strong></a> - AI Evangelist @ Weights & Biases</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/@nisten">@nisten</a>, <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Open Source LLMs // Papers</strong></p><p>* <strong>ByteDance Seed-OSS</strong> - 36B open-source LLM family (<a target="_blank" href="https://x.com/gm8xx8/status/1958258474154143923">X</a>, <a target="_blank" href="https://huggingface.co/collections/ByteDance-Seed/seed-oss-68a609f4201e788db05b5dcd">HF</a>, <a target="_blank" href="https://github.com/ByteDance-Seed/seed-oss">GitHub</a>)</p><p>* <strong>DeepSeek V3.1</strong> - Updated Hybrid model (<a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">HF</a>)</p><p>* <strong>Cohere Command A Reasoning - </strong>(<a target="_blank" href="https://x.com/cohere/status/1958542682890047511">X</a>, <a target="_blank" href="https://cohere.com/blog/command-a-reasoning">Blog</a>, <a target="_blank" href="https://huggingface.co/CohereLabs/command-a-reasoning-08-2025?ref=cohere-ai.ghost.io">HF</a>)</p><p>* <strong>Zai/Tsinghua ComputerRL</strong> - Framework for desktop agents (<a target="_blank" href="https://x.com/Zai_org/status/1958175133706891613">X</a>, <a target="_blank" href="https://arxiv.org/abs/2508.14040">Paper</a>, <a target="_blank" href="https://os-world.github.io">Benchmark</a>)</p><p>* <strong>IBM & NASA Surya</strong> - Solar weather prediction (<a target="_blank"
href="https://huggingface.co/nasa-ibm-ai4science/Surya-1.0">HF</a>)</p><p>* <strong>NVIDIA Nemotron Nano 9B V2 - </strong>(<a target="_blank" href="https://x.com/llm_wizard/status/1957516422520996020">X</a>, <a target="_blank" href="https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/nvidia-nemotron-689f6d6e6ead8e77dd641615">HF</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35">Dataset</a>, <a target="_blank" href="https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2">Try It</a>) </p><p>* <strong>Alibaba Quark Med</strong></p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* <strong>Sonic Stealth Model</strong> - Likely Grok Code</p><p>* <strong>OpenAI </strong><a target="_blank" href="https://agents.md"><strong>agents.md</strong></a> - Unified agent files (<a target="_blank" href="https://agents.md">agents.md</a>)</p><p>* <strong>GPT-5 Web UI Truncation Bug</strong></p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* <strong>Nano Banana</strong> - Image model (rumored Google)</p><p>* <strong>Qwen-Image-Edit</strong> - 20B Image editing (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1957500569029079083">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen-Image-Edit">HF</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* <strong>Catnip</strong> - Containerized AI agent runner (<a target="_blank" href="https://github.com/wandb/catnip">GitHub</a>)</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-21-deepseek-v31s-hybrid</link><guid isPermaLink="false">substack:post:171591625</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 21 Aug 2025 20:43:51 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/171591625/78f6f20a7aa4df5f7d27ccbed16f7a02.mp3" length="79676122" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3984</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/171591625/e72ad6d834f5b71e9529a5e2c608dd82.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Aug 14 - A week with GPT5, OSS world models, VLMs in OSS, Tiny Gemma & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Last week, I tried to test GPT-5 and got really surprisingly bad results, but it turns out, as you'll see below, it's partly because they had a bug in the router, and partly because ... well, the router itself! See below for an introduction, written by GPT-5, it's actually not bad?</p><p>Last week was a whirlwind. 
We live‑streamed GPT‑5’s “birthday,” ran long, and then promptly spent the next seven days poking every corner of the new router‑driven universe.</p><p>This week looked quieter on the surface, but it actually delivered a ton: two open‑source world models you can drive in real time, a lean vision‑language model built for edge devices, a 4B local search assistant that tops Perplexity Pro on SimpleQA, a base model “extraction” from GPT‑OSS that reverses alignment, fresh memory features landing across the big labs, and a practical prompting guide to unlock GPT‑5’s reasoning reliably.</p><p>We also had Alan Dao join to talk about Jan‑v1 and what it takes to train a small model that consistently finds the right answers on the open web—locally.</p><p>Not bad eh? Much better than last time 👏 Ok let's dive in, a lot to talk about in this "chill" AI week (show notes at the end as always): first open source, then GPT-5 reactions, and then... world models!</p><p>00:00 Introduction and Welcome</p><p>00:33 Host Introductions and Health Updates</p><p>01:26 Recap of Last Week's AI News</p><p>01:46 Discussion on GPT-5 and Prompt Techniques</p><p>03:03 World Models and Genie 3</p><p>03:28 Interview with Alan Dao from Jan</p><p>04:59 Open Source AI Releases</p><p>06:55 Big Companies and APIs</p><p>10:14 New Features and Tools</p><p>14:09 Liquid Vision Language Model</p><p>26:18 Focusing on the Task at Hand</p><p>26:18 Reinforcement Learning and Reward Functions</p><p>26:35 Offline AI and Privacy</p><p>27:13 Web Retrieval and API Integration</p><p>30:34 Breaking News: New AI Models</p><p>30:41 Google's New Model: Gemma 3</p><p>33:53 Meta's Dino v3: Advancements in Computer Vision</p><p>38:50 Open Source Model Updates</p><p>45:56 Weights & Biases: New Features and Updates</p><p>51:32 GPT-5: A Week in Review</p><p>55:12 Community Outcry Over AI Model Changes</p><p>56:06 OpenAI's Response to User Feedback</p><p>56:38 Emotional Attachment to AI Models</p><p>57:52 GPT-5's Performance in
Coding and Writing</p><p>59:55 Challenges with GPT-5's Custom Instructions</p><p>01:01:45 New Prompting Techniques for GPT-5</p><p>01:04:10 Evaluating GPT-5's Reasoning Capabilities</p><p>01:20:01 Open Source World Models and Video Generation</p><p>01:27:54 Conclusion and Future Expectations</p><p>Open Source AI</p><p>We've had quite a lot of Open Source this week on the show, including breaking news from the Gemma team!</p><p>Liquid AI drops LFM2-VL<strong> </strong>(<a target="_blank" href="https://x.com/ramin_m_h/status/1955332731942174960">X</a>, <a target="_blank" href="https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models">blog</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2-VL-1.6B">HF</a>)</p><p>Let's kick things off with our friends at Liquid AI who released LFM2-VL - their new vision-language models coming in at a tiny 440M and 1.6B parameters.</p><p>Liquid folks continue to surprise with speedy, mobile-device-ready models that run 2x faster than top VLM peers. With a native 512x512 resolution (larger images are broken into smart 512px tiles) and an OCRBench of 74, this tiny model beats SmolVLM2 while being half the size.
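</p><p>As a back-of-envelope illustration of that tiling idea, here's how the tile count grows with image size (my own illustrative math; the model's actual preprocessing may resize or pad before tiling):</p>

```python
import math

def num_tiles(width: int, height: int, tile: int = 512) -> int:
    # Ceil-divide each dimension into tile-sized patches.
    # Illustrative only; LFM2-VL's real tiling scheme may differ.
    return math.ceil(width / tile) * math.ceil(height / tile)

# A 1920x1080 photo would split into a 4x3 grid:
print(num_tiles(1920, 1080))  # → 12
```

<p>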
We've chatted with Maxime from Liquid about LFM2 <a target="_blank" href="https://sub.thursdai.news/i/168031500/liquid-ais-lfm-blazing-fast-models-for-the-edge-&#120143;-hugging-face">back in July</a>, and it's great to see they are making them multimodal as well with the same efficiency gains!</p><p>Zhipu (z.ai) unleashes GLM-4.5V - 106B VLM (<a target="_blank" href="https://x.com/Zai_org/status/1954898011181789431">X</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.5V">Hugging Face</a>)</p><p>In another "previous good model that now has eyes" release, the fine folks from Zhipu continued training their recently released (and excellent) GLM-4.5-Air with a vision encoder, resulting in probably one of the top vision models in the open source!</p><p>It's an MoE with only 12B active parameters (106B total) and gets SOTA across 42 public vision-language benches + has a "thinking mode" that reasons about what it sees.</p><p>Given that GLM-4.5-Air is really a strong model, this is de facto the best visual intelligence in open source, able to rebuild websites from a picture for example and identify statues and locations!</p><p>Jan V1 - a tiny (4B) local search assistant Qwen finetune (<a target="_blank" href="https://x.com/jandotai/status/1955176280535732415">X</a>, <a target="_blank" href="https://huggingface.co/janhq/Jan-v1-4B">Hugging Face</a>)</p><p>This release got a lot of attention, as the folks at Menlo Research (Alan Dao who came to chat with us about Jan on the pod today) released an Apache 2 finetune of Qwen3-4B-thinking that's focused on SimpleQA.</p><p>They showed that their tiny model is beating Perplexity Pro on SimpleQA.</p><p>Alan told us on the pod that Jan (the open source <a target="_blank" href="https://jan.ai/">Jan app</a>) was born to be an open source alternative to searching with local models!</p><p>The trick is, you have to enable some source of search data (Exa, Serper, Tavily) via MCP and then enable tools in Jan, and
then.. you have a tiny, completely local perplexity clone with a 4B model!</p><p>Google drops Gemma 3 270M (<a target="_blank" href="https://t.co/E0BB5nlI1k">blog</a>)</p><p>In some #breakingNews, Google open sourced a tiny (270M parameter), "good at instruction following" Gemma variant. This joins models like SmolLM and LFM2 in the "smol models" arena; at only 300MB, you can run this.. on a toaster. This one apparently also fine-tunes very well while being very energy efficient!</p><p>Big Companies (AKA OpenAI corner this past 2 weeks)</p><p>Ok ok, we're finally here, a week with GPT-5! After watching the live stream and getting access to GPT-5, my first reactions were not great. Apparently, neither were other people's, and many folks cried out and complained about the model, some even yelling "AGI is cancelled".</p><p>What apparently happened (and has since been fixed by OpenAI) is that GPT-5 wasn't just a model that launched, it was a "smart" router between a few models, and not only did they have a routing bug, the basic GPT-5 model, the one without thinking, is...
not great.</p><p>But the thinking GPT-5, the one that the router refused to send me to, is really good (as confirmed independently by <a target="_blank" href="https://x.com/shai_s_shwartz/status/1955968602978320727">multiple evals</a> at this point).</p><p>For one, it's the most accurate function calling model on OpenRouter.</p><p>It's also one of the best on this new FormulaOne benchmark that was just launched.</p><p>You're prompting it wrong!</p><p>Apparently, not only is GPT-5 more intelligent, it's also significantly more "surgical" in instruction following, and so, for many folks, just dropping GPT-5 into their tools or prompts didn't just "work", as this model, more than before, is sensitive to conflicting things in the prompt.</p><p>OpenAI has released a <a target="_blank" href="https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide#optimizing-intelligence-and-instruction-following">guide</a> for prompting the model, mostly aimed at developers (as users shouldn't be learning to prompt as models get more intelligent) + they also released a <a target="_blank" href="https://platform.openai.com/chat/edit?models=gpt-5&#38;optimize=true">prompt optimizer</a>! Just dump your long and complex prompts in there, and you'll get an updated prompt with explanations of why they changed what they changed!</p><p>Model Picker (and legacy models) are back!</p><p>So, OpenAI tried and super quickly reversed course on removing the "model picker". At first, it was only GPT-5 there, but many people complained about the abrupt removal of 4o, their... favorite models. Initially, OpenAI added back the models via a hidden setting, and later they added 4o back for everyone by default, while increasing the reasoning quota to 3000 messages per week!</p><p>Generally, my thoughts are, if you've tried GPT-5 and weren't impressed, give it another go!
(especially now that it's connected to Gmail <a target="_blank" href="https://x.com/altryne/status/1955386020909981905">in chats</a>!)</p><p>Other notable Big Company updates</p><p>In other news, Claude has extended the context window of Sonnet to 1M in the API, and apparently both Claude and Gemini have been adding memory features!</p><p>Grok video has been catching up and is now free to all users for a while</p><p><strong>This Week's Buzz: Weave DX improvements</strong></p><p>Quick update from my day job at Weights & Biases - we've rolled out some quality-of-life improvements to Weave, our LLM observability platform. We now have a unified assets tab where you can manage all your prompts, models, and datasets with full versioning support.</p><p>Prompts are being version tracked, so if you use that GPT-5 prompt optimizer, we'll store all the previous revisions for ya!</p><p>The coolest addition? Threads! Perfect for tracking agent executions or grouping related API calls. You just add a thread_id to your traces and Weave handles the rest. If you're building AI applications and not tracking everything, you're flying blind - give Weave a try at <a target="_blank" href="http://wandb.me/weave">wandb.me/weave</a>!</p><p>World models are getting... open sourced!</p><p>I still think that Google's Genie-3 release from last week was maybe the most important one, though we didn't really get to play with it yet!</p><p>And while getting excited by world models, I was thinking it was going to take a while for Open Source to catch up.
But this week, not one but two world models were open sourced, making me think that we'll get to generated worlds quicker than I expected and the race has begun!</p><p>Skywork's Matrix-Game 2.0 (<a target="_blank" href="https://matrix-game-v2.github.io/">project</a>, <a target="_blank" href="https://huggingface.co/Skywork/Matrix-Game-2.0">HF</a>)</p><p>Matrix-Game 2 is an auto-regressive diffusion model, trained on 1,200 hours of Unreal Engine and GTA-5 environments, that runs at 25 frames per second!</p><p>It works by creating an "action injection module" that embeds the mouse/keyboard inputs into the generation, enabling frame-level controls.</p><p>Hunyuan open-sources GameCraft for real-time, high-dynamic game video generation (<a target="_blank" href="https://twitter.com/TencentHunyuan/status/1955839140173631656">X</a>, <a target="_blank" href="https://huggingface.co/tencent/Hunyuan-GameCraft-1.0">Hugging Face</a>)</p><p>Two world-models (well, game models) in the same week? Tencent (who had Hunyuan video before) have trained a game engine on top of their excellent HY-video and have shown similar examples of building a full world based on a few images.</p><p>Their pipeline trained on 1M gameplay clips from AAA titles, and they also map W/A/S/D and mouse signals into continuous camera/action embeddings, allowing for control and angle creation.</p><p>The cool thing? A quantized 13B version supposedly can run on an RTX 4090!</p><p>Funnily, they already had Matrix-Game (the one that came out a few days before) benchmarked and beaten in today's release!</p><p><strong>Genie 3 is not messing around</strong></p><p>While all the open source is impressive, I was… absolutely blown away by this video from an artist who got the Genie3 team to extend a video of his. Just look at the collision of the plane with the sphere: out of nowhere, Genie3 adds a shadow, and then collision mechanics, the plane bouncing off, and even the jet trails subside and then resume!
It really is crazy to imagine that no prompting was given and the model just.. knew how to do this!</p><p>Phew, that was a lot! Much more as always on the actual show; despite it being a "quiet" week by summer of 2025 standards, there were a LOT of open source releases, and the GPT-5 situation shows that even OpenAI can stumble on new tech releases!</p><p>Keep the feedback coming - find me on Twitter/X at @altryne or email the show. And remember, if you want to catch all the technical details and demos, the video version on YouTube has everything the podcast can't show.</p><p>Until next week, keep tinkering, keep questioning, and keep pushing the boundaries of what's possible with AI!</p><p>See you next ThursdAI 🚀</p><p>TL;DR - All Topics Covered</p><p>Hosts and Guests</p><p>* <strong>Alex Volkov</strong> - AI Evangelist @ Weights & Biases (@altryne)</p><p>* Co-hosts: @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed</p><p>* Guest: Alan Dao from Jan (@jandotai)</p><p>Open Source LLMs</p><p>* <strong>Liquid AI LFM2-VL</strong>: 440M & 1.6B vision-language models with 2x GPU speedup (<a target="_blank" href="https://www.liquid.ai/blog/lfm2-vl-efficient-vision-language-models">Blog</a>, <a target="_blank" href="https://huggingface.co/LiquidAI/LFM2-VL-1.6B">HF</a>)</p><p>* <strong>Jan V1</strong>: 4B parameter search assistant beating Perplexity Pro on SimpleQA (<a target="_blank" href="https://x.com/jandotai/status/1955176280535732415">X</a>, <a target="_blank" href="https://huggingface.co/janhq/Jan-v1-4B">HF</a>)</p><p>* <strong>GPT-OSS Base</strong>: Reverse-engineered base model from Jack Morris (<a target="_blank" href="https://x.com/jxmnop/status/1955436067353502083">X Thread</a>)</p><p>* <strong>Gemma 3</strong>: Google's 270M parameter model with strong instruction following (<a target="_blank" href="https://huggingface.co/google/gemma-3">HF</a>)</p><p>* <strong>Meta Dino v3</strong>: Self-supervised vision model for segmentation (<a target="_blank"
href="https://ai.meta.com/blog/dino-v3">Blog</a>)</p><p>Big Companies & APIs</p><p>* <strong>Mistral Medium 3.1</strong>: New model on Mistral platform</p><p>* <strong>Gemini & Claude</strong>: Added memory/personalization features</p><p>* <strong>GPT-5 Updates</strong>: Router fixes, model selector returned, prompting guide released (<a target="_blank" href="https://cookbook.openai.com/examples/gpt-5/gpt-5_prompting_guide">Guide</a>, <a target="_blank" href="https://platform.openai.com/chat/edit?models=gpt-5&#38;optimize=true">Optimizer</a>)</p><p>* <strong>Claude Sonnet 4</strong>: 1M token context window (<a target="_blank" href="https://www.anthropic.com/news/claude-sonnet-4-long-context">Announcement</a>)</p><p>This Week's Buzz</p><p>* Weave updates: Unified assets tab and threads for agent tracking</p><p>* New features for LLM observability and evaluation</p><p>Vision & Video</p><p>* <strong>Hunyuan Large Vision</strong>: 1B vision encoder + 389B MoE language model (<a target="_blank" href="https://vision.hunyuan.tencent.com/en">Project</a>)</p><p>* <strong>GLM-4.5V</strong>: 106B open source VLM from Zhipu AI (<a target="_blank" href="https://x.com/Zai_org/status/1954898011181789431">X</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.5V">HF</a>)</p><p>World Models</p><p>* <strong>Matrix Game 2.0</strong>: Real-time interactive world model, 25fps generation (<a target="_blank" href="https://matrix-game-v2.github.io/">Project</a>, <a target="_blank" href="https://huggingface.co/Skywork/Matrix-Game-2.0">HF</a>)</p><p>* <strong>Hunyuan GameCraft</strong>: Game video generation with physics understanding (<a target="_blank" href="https://twitter.com/TencentHunyuan/status/1955839140173631656">X</a>, <a target="_blank" href="https://huggingface.co/tencent/Hunyuan-GameCraft-1.0">HF</a>)</p><p>Tools</p><p>* <strong>Grok</strong>: Now includes video generation for all users</p><p>* <strong>Jan Desktop App</strong>: Local AI with MCP support and 
search capabilities</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-14-a-week-with-gpt5</link><guid isPermaLink="false">substack:post:171017410</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 15 Aug 2025 00:52:06 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/171017410/f2a4bb674e8a6b645b6441dc72df84a4.mp3" length="64577724" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5381</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/171017410/4d693981f2b432bd74471cd60de2cc54.jpg"/></item><item><title><![CDATA[📅 ThursdAI - GPT5 is here]]></title><description><![CDATA[<p>Hey folks 👋 Alex here, writing to you, from a makeshift recording studio in an Eastern European hookah bar, where I spent the last 7 hours. Why you ask? Well, when GPT-5 drops, the same week as OpenAI dropping the long awaited OSS models + Google is shipping perfect memory World Models (Genie 3) and tons of other AI drops, well I just couldn't stay away from the stream.</p><p>Vacation or not, ThursdAI is keeping you up to date (for 32 months straight, which is also the time since the original GPT-4 release which gave this show its name!)</p><p>So, what did we have today on the stream? Well, we started as usual, talking about the AI releases of the week, as if OpenAI dropping OSS models (apache 2) 120B and 20B is "usual". We then covered incredible releases like Google's World model Genie3 (more on this next week!) 
and Qwen-image + a few small Qwens.</p><p>We then were VERY excited to tune in, and watch the (very long) announcement stream from OpenAI, in which they spent an hour to tell us about GPT-5.</p><p>This was our longest stream by far (3.5 hours, 1hr was just OpenAI live stream) and I'm putting this here mostly unedited, but chapters are up so feel free to skip to the parts that are interesting to you the most.</p><p>00:00 Introduction and Special Guests</p><p>00:56 Twitter Space and Live Streaming Plans</p><p>02:12 Open Source AI Models Overview</p><p>03:44 Qwen and Other New AI Models</p><p>08:59 Community Interaction and Comments</p><p>10:01 Technical Deep Dive into AI Models</p><p>25:06 OpenAI's New Releases and Benchmarks</p><p>38:49 Expectations and Use Cases for AI Models</p><p>40:03 Tool Use vs. Deep Knowledge in AI</p><p>41:02 Evaluating GPT OSS and OpenAI Critique</p><p>42:29 Historical and Medical Knowledge in AI</p><p>51:16 Opus 4.1 and Coding Models</p><p>55:38 Google's Genie 3: A New World Model</p><p>01:00:43 Kitten TTS: A Lightweight Text-to-Speech Model</p><p>01:02:07 11 Labs' Music Generation AI</p><p>01:08:51 OpenAI's GPT-5 Launch Event</p><p>01:24:33 Building a French Learning Web App</p><p>01:26:22 Exploring the Web App Features</p><p>01:29:19 Introducing Enhanced Voice Features</p><p>01:30:02 Voice Model Demonstrations</p><p>01:32:32 Personalizing Chat GPT</p><p>01:33:23 Memory and Scheduling Features</p><p>01:35:06 Safety and Training Enhancements</p><p>01:39:17 Health Applications of GPT-5</p><p>01:45:07 Coding with GPT-5</p><p>01:46:57 Advanced Coding Capabilities</p><p>01:52:59 Real-World Coding Demonstrations</p><p>02:10:26 Enterprise Applications of GPT-5</p><p>02:11:49 Amgen's Use of GPT-5 in Drug Design</p><p>02:12:09 BBVA's Financial Analysis with GPT-5</p><p>02:12:33 Healthcare Applications of GPT-5</p><p>02:12:52 Government Adoption of GPT-5</p><p>02:13:22 Pricing and Availability of GPT-5</p><p>02:13:51 Closing Remarks by Chief 
Scientist Jakub</p><p>02:16:03 Live Reactions and Discussions</p><p>02:16:41 Technical Demonstrations and Comparisons</p><p>02:33:53 Healthcare and Scientific Advancements with GPT-5</p><p>02:47:09 Final Thoughts and Wrap-Up</p><p>---</p><p>My first reactions to GPT-5</p><p>Look, I gotta keep it real with you, my first gut reaction was, hey, I'm on vacation, I don't have time to edit and write the newsletter (EU timezone) so let's see how ChatGPT-5 handles this task. After all, OpenAI has removed all other models from the dropdown, it's all GPT-5 now. (pricing from the incredible writeup by <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> available <a target="_blank" href="https://simonwillison.net/2025/Aug/7/gpt-5/">here</a>)</p><p>And to tell you the truth, I was really disappointed! GPT-5 seems to be incredible at coding benchmarks; with 400K tokens and incredible pricing (just $1.25/$10 compared to Opus at $15/$75), this model, per the many friends who got to test it early, is a beast at coding! Readily beating Opus on affordability per token, switching from thinking to less thinking when it needs to, it definitely seems like a great improvement for coding and agentic tasks.</p><p>But for my very much honed prompt of "hey, help me with ThursdAI drafts, here's previous drafts that I wrote myself, mimic my tone" it failed.. spectacularly!</p><p>Here's just a funny example, after me replying that it did a bad job:</p><p>It literally wrote "I'm Alex, I build the mind, not the vibe" 🤦‍♂️ What.. the actual...</p><p>For comparison, here's o3, with the same prompt, with a fairly true-to-tone draft:</p><p>High-taste testers take on GPT-5</p><p>But hey, I have tons of previous speakers in our group chats, and many of them who got early access (I didn't... OpenAI, I can be trusted lol) rave about this model.
They are saying that this is a huge jump in intelligence.</p><p>Folks like Dr Derya Unutmaz, who jumped on the live show and described how GPT5 does incredible things with fewer hallucinations, folks like Swyx from <a target="_blank" href="https://substack.com/profile/89230629-latentspace">Latent.Space</a> who had <a target="_blank" href="https://www.latent.space/p/gpt-5-review">early access</a> and even got invited to the OpenAI office to give first reactions, and <a target="_blank" href="https://x.com/skirano/status/1953516768317628818">Pietro Schirano</a> who also showed up in an OpenAI video.</p><p>So definitely, definitely check out their vibes, as we all try to wrap our heads around this new intelligence king we got!</p><p>Other GPT5 updates</p><p>OpenAI definitely cooked, don't get me wrong; with this model plugging into everything else in their platform like memory, voice (that was upgraded and works in custom GPTs now, yay!), canvas and study mode, this will definitely be an upgrade for many folks using the models.</p><p>They have now also opened access to GPT-5 to free users, just in time for schools to reopen, including a very interesting Quiz mode (that just showed up for me without asking for it), and connection to Gmail; all of those will now work with GPT5.</p><p>It now has 400K context, way fewer hallucinations and fewer refusals too, and the developer upgrades like a new verbosity setting and a new "minimal" reasoning setting are all very welcome!</p><p>OpenAI finally launches gpt-oss (120B / 20B) Apache 2 licensed models (<a target="_blank" href="https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf">model card</a>, HF)</p><p>It was really funny, on the stream Nisten talked about the open source models OpenAI dropped, and said "when we covered it last week", while it was just two days ago!
It really does feel like this world is moving really fast.</p><p>OpenAI's long-promised open source models are here, and they've gotten a fairly mixed bag of reviews from folks. Many folks are celebrating that the Western world is now back in the game, releasing incredible local models, with an open license!</p><p>Though, after the initial excitement, the vibes are split on these models. Folks are saying that maybe these were trained with only synthetic data, because, like Phi, they seem to be very good at benchmarks, and on the specific tasks they were optimized for (code, math), but <a target="_blank" href="https://x.com/sam_paech/status/1952839665670922360">really bad</a> at creative writing (Sam Paech from EQBench was not impressed). They are also not multilingual, though OpenAI did release a cookbook <a target="_blank" href="https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers">on finetuning</a> with HuggingFace!</p><p>Overall, these models are trained for agentic workflows—supporting function calling, web search, Python execution, configurable reasoning effort, and full raw chain-of-thought access, which we will never get from GPT5.</p><p>I particularly love the new approach where reasoning effort can be defined directly via the system prompt: just add "reasoning: high" and the model will reason for way longer! Can't wait to get back and bench these and share with you.</p><p>Overall, the fine-tuning and open source community is split for now, but it's been only a few days, so we'll keep you up to date on how well these models land. Regardless, this was a historic week for OpenAI!</p><p>Speaking of open models, did you have a chance to try our W&B Inference?
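</p><p>If you do give them a spin there (or on any other OpenAI-compatible endpoint), the reasoning-effort trick is literally one line in the system message; here's a sketch of building such a request payload (model ID and prompt are illustrative):</p>

```python
def build_request(question: str, effort: str = "high") -> dict:
    # gpt-oss reads its reasoning effort out of the system prompt itself,
    # so this is just a normal chat-completions style payload (a sketch,
    # not official docs; check the model card for the exact format).
    assert effort in ("low", "medium", "high")
    return {
        "model": "openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": f"reasoning: {effort}"},
            {"role": "user", "content": question},
        ],
    }

payload = build_request("Why is the sky blue?")
print(payload["messages"][0]["content"])  # → reasoning: high
```

<p>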
The team worked hard to bring you these new models in record time and at incredible pricing (just $.05 for 20B and $.15 for 120B!), so they are definitely worth giving a try!</p><p>Plus, if you comment "OSS Power" on our <a target="_blank" href="https://x.com/weights_biases/status/1952885962641699287">announcement post</a>, we'll likely give you a few credits to try it out and let us know what you think!</p><p>World models "holy crap" moment - Google Genie 3</p><p>The other very important release this week was.... not a release at all, but an announcement from DeepMind: Genie 3.</p><p>This world model takes a single image or text prompt and creates a fully interactive, controllable 3D environment that runs in real time at 24fps. An environment you, as the user, can control: walk (or fly) around in and move the camera view. It's really mindblowing stuff.</p><p>We've covered world models like Mirage on previous episodes, but what Google showed is a MAJOR step up in coherence, temporal consistency and just overall quality!</p><p>The key breakthrough here is consistency and memory. In one demo, a user could "paint" a virtual wall, turn away, and when they turned back, the paint was still there. 
This is a massive step towards generalist agents that can train, plan, and reason in entirely simulated worlds, with huge implications for robotics and gaming.</p><p>We’re hoping to have the Genie 3 team on the show next week to dive even deeper into this incredible technology!!</p><p>Other AI news this week</p><p>This week, the "other" news alone could have filled a full show two years ago. Qwen kept up its third straight week of releases with 2 new tiny models plus a new diffusion model called Qwen-Image (<a target="_blank" href="https://qwenlm.github.io/blog/qwen-image/">Blog</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen-Image">HF</a>)</p><p>Anthropic decided to pre-empt the GPT-5 release and upgraded Opus 4, giving us Opus 4.1 with a slight bump in specs.</p><p>ElevenLabs released a music API called ElevenMusic, which sounds very, very good (this on top of last week's Riffusion + <a target="_blank" href="http://Producer.ai">Producer.ai</a> news, which I'm still raving about)</p><p>Also in voice and audio, a SUPER TINY TTS model called KittenTTS was released: just 15M parameters in a model that's 25MB, and it's surprisingly decent at generating voice (<a target="_blank" href="https://x.com/divamgupta/status/1952762876504187065">X</a>)</p><p>And to cap it off with breaking news, the Cursor team, who showed up on the OpenAI stream today (marking quite the change in direction from OpenAI's previous friendship with Windsurf), dropped their own CLI version of Cursor, reminiscent of Claude Code!</p><p>PHEW, wow, ok, this was a LOT to process. 
Not only did we tune in for the full GPT-5 release, we did a live stream when gpt-oss dropped as well.</p><p>On a personal note, I was very humbled when Sam Altman said it had been 32 months since the GPT-4 release, because that means this was also 32 months of ThursdAI. As many of you know, we started live streaming on March 13, 2023, when GPT-4 was released.</p><p>I'm very proud of the incredible community we've built (50K views total across all streams this week!), the incredible co-hosts I have, who step up when I'm on vacation, and the awesome guests we have on the show, all to keep you up to date every week!</p><p></p><p>So, a little favor to ask: if you find our content valuable and entertaining, the best way to support this pod is to upgrade to a paid sub and share ThursdAI with a friend or two! 👏 See you next week 🫡</p><p></p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-gpt5-is-here</link><guid isPermaLink="false">substack:post:170398983</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 07 Aug 2025 22:35:56 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/170398983/89891138f999a7770a1b16bed39d7b3c.mp3" length="126948737" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>10579</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/170398983/59c7350ceb30c3c005b5e8a5597bf503.jpg"/></item><item><title><![CDATA[📆 ThursdAI – Jul 31, 2025 – Qwen’s Small Models Go Big, StepFun’s Multimodal Leap, GLM-4.5’s Chart Crimes, and Runway’s Mind‑Bending Video Edits + GPT-5 soon?]]></title><description><![CDATA[This is a free preview of a paid episode. 
To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Woohoo, we're almost done with July (my favorite month) and the Open Source AI world decided to go out with some fireworks 🎉</p><p>Hey everyone, Alex here, writing this without my own personal superintelligence (more on that later), and this week has been VERY BUSY with many new open source releases.</p><p>Just 1 hour before the show we already had 4 breaking news items: a tiny Qwen3-Coder, multimodal SOTAs from both Cohere and StepFun, and our friends from Krea dropped a combined model with BFL called Flux [Krea] 👏 </p><p>This is on top of a very, very busy week, with Runway adding conversation to their video model Aleph, Zuck's superintelligence vision, and a new SOTA open video model, Wan 2.2. So let's dive straight into it (as always, all show notes and links are at the end) </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Open Source LLMs & VLMs </p><p>Tons of new stuff here. I'll try to be brief, but each one of these releases deserves a deeper dive for sure. </p><p>Alibaba is on 🔥 with 3 new Qwen models this week</p><p>Yes, this is very similar to last week, when they also dropped 3 new SOTA models, but these are additional ones. </p><p>It seems that someone at Alibaba figured out that after splitting away from the hybrid models, they can now release each model separately and get a lot of attention per model! 
</p><p>Here's the timeline: </p><p>* <strong>Friday (just after our show)</strong>: Qwen3-235B-Thinking-2507 drops (235B total, 22B active, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507">HF</a>) </p><p>* <strong>Tuesday</strong>: Qwen3-30B-Thinking-2507 (30B total, 3B active, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507">HF</a>)</p><p>* <strong>Today</strong>: Qwen3-Coder-Flash-2507 lands (30B total, 3B active for coding, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8">HF</a>)</p><p>Let's start with the SOTA reasoner: the 235B (A22B) 2507 is absolutely the best reasoner among the open source models.</p><p>We've put the model on our inference service (at crazy prices: $0.10/$0.10 input/output per 1M tokens) and it's performing absolutely incredibly on reasoning tasks. </p><p>It also jumped to the top spot among OSS models on Artificial Analysis scores, EQBench, Long Context and more evals. It's a really, really good reasoning model! </p><p>Smaller Qwens for local use</p><p>Just a week ago, we asked Junyang on our show about smaller models that folks can run on their devices, and he demurred, saying "we're focusing on the larger models". This week, they delivered not 1 but 2 smaller versions of the bigger models (perfect as draft models for speculative decoding, if you can host the larger ones, that is). </p><p>The most interesting one is Qwen3-Coder-Flash, which came out today with very, very impressive stats - and the ability to run locally at almost 80 tok/s on a MacBook! </p><p>So over the last two weeks, we now have 3 Qwens (Instruct, Thinking, Coder) and 2 sizes for each (all three have a 30B/A3B version now for local use) 👏</p><p>Z.ai GLM and StepFun Step3 </p><p>As we've said previously, Chinese companies completely dominate the open source AI field right now, and this week we saw yet another crazy testament to how stark the difference is! 
</p><p>We've seen a rebranded Zhipu (<a target="_blank" href="http://Z.ai">Z.ai</a>, previously THUDM) release their new GLM 4.5, which gives Qwen3-Thinking a run for its money. Not quite at that level, but definitely very close. I personally didn't love the release aesthetics; showing a blended eval score that nobody can replicate feels a bit off. </p><p>We also talked about how StepFun has stepped in (sorry for the pun) with a new SOTA in multimodality, called <a target="_blank" href="https://stepfun.ai/research/en/step3">Step3</a>. It's a 321B MoE (with a huge 38B active param count) that achieves very significant multimodal scores (the benchmarks look incredible: 74% on MMMU, 64% on MathVision) </p><p>Big Companies APIs & LLMs</p><p>Well, we were definitely thinking we'd get GPT-5 or the open source AI model from OpenAI this week, but alas, the tea-leaf readers were misled (or were being misleading). We 100% know that GPT-5 is coming, as multiple screenshots showing companies already testing it were blurred and then deleted. </p><p>But it looks like August is going to be even hotter than July, with multiple sightings of anonymous test models on Web Dev Arena, like Zenith, Summit and Lobster, and a new mystery model on OpenRouter called Zenith, which some claim are the different thinking modes of GPT-5 and the open source model. </p><p>Zuck shares vision for personalized superintelligence (<a target="_blank" href="https://meta.com/superintelligence">Meta</a>)</p><p>In a very "Nat Friedman"-like post, Mark Zuckerberg finally shared the vision behind his latest push to assemble the most cracked AI engineers.</p><p>In his vision, Meta is the right place to provide each person with personalized superintelligence, enhancing individual abilities with user agency, according to their own values. 
(as opposed to a centralized model, which feels like his shot across the bow at the other frontier labs) </p><p>A few highlights: Zuck leans heavily into the rise of personal devices, including AR glasses, on top of which humans will interact with this superintelligence. There's also a departure from the complete "let's open source everything" dogma of the past; now there will be more deliberate consideration of what to open source. </p><p><strong>This Week's Buzz: Putting Open Source to Work with W&B</strong></p><p>With all these incredible new models, the biggest question is: how can you actually use them? I'm incredibly proud to say that the team at Weights & Biases had all three of the big new Qwen models (Thinking, Instruct, and Coder) live on <strong>W&B Inference </strong>on day one (<a target="_blank" href="http://wandb.me/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jul31">link</a>)</p><p>And our pricing is just unbeatable. Wolfram did a benchmark run that would have cost him <strong>$150</strong> using Claude Opus. On W&B Inference with the Qwen3-Thinking model, it cost him <strong>22 cents</strong>. That's not a typo. It's a game-changer for developers and researchers.</p><p>To make it even easier, a listener of the show, Olaf Geibig, posted a <a target="_blank" href="https://x.com/olafgeibig/status/1949779562860056763">fantastic tutorial</a> on how you can use our free credits and W&B Inference to power tools like Claude Code and VS Code using LiteLLM. It takes less than five minutes to set up and gives you access to state-of-the-art models for pennies. All you need to do is add <a target="_blank" href="https://gist.github.com/olafgeibig/7cdaa4c9405e22dba02dc57ce2c7b31f">this</a> config to LiteLLM and run Claude Code (or VS Code) through it! 
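For flavor, a LiteLLM proxy config for this kind of setup usually looks something like the sketch below; the model id, base URL and key reference here are placeholders of mine, and the gist linked above is the authoritative version.

```yaml
# Hypothetical LiteLLM proxy config (placeholder values; see the linked gist
# for the real one): exposes an OpenAI-compatible model under a local alias.
model_list:
  - model_name: qwen3-coder            # alias your tool will request
    litellm_params:
      model: openai/Qwen/Qwen3-Coder-480B-A35B-Instruct  # provider/model id
      api_base: https://api.inference.wandb.ai/v1        # assumed endpoint
      api_key: os.environ/WANDB_API_KEY                  # read from env var
```

Running `litellm --config config.yaml` then starts a local proxy, and you point Claude Code or VS Code at it as if it were an OpenAI endpoint.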
</p><p>Give our inference service a try <a target="_blank" href="http://wandb.me/inference?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jul31">here</a> and give our main account <a target="_blank" href="http://x.com/weights_biases">@weights_biases</a> a follow, as we often drop ways to get additional free credits when new models release</p><p>Vision & Video models</p><p>Wan2.2: Open-Source MoE Video Generation Model Launches (<a target="_blank" href="https://x.com/Alibaba_Wan/status/1949827662416937443">X</a>, <a target="_blank" href="https://huggingface.co/Wan-AI">HF</a>)</p><p>This is likely the best open source video model, and definitely the first MoE video model! It came out with text2video, image2video and a combined version. </p><p>With 5-second 720p videos that can even be generated at home on a single 4090, this is definitely a step up in the quality of video models that are fully open source. </p><p>Runway changes the game again - Gen-3 Aleph model for AI video editing / transformation (<a target="_blank" href="https://x.com/blizaine/status/1950007468324491523">X</a>, <a target="_blank" href="https://x.com/runwayml/status/1950180894477529490">X</a>)</p><p>Look, there's simply no denying this: AI video has had an incredible year, from open source like Wan to proprietary models with sound like VEO3. It's not surprising that we're seeing this trend, but it's definitely very exciting to see an approach like Runway's to editing. </p><p>This adds a chat interface to the model, and the ability to edit... anything in the scene. Remove or add people and environmental effects, see the same scene from a different angle, and a lot more! </p><p>Expect personalized entertainment very soon! 
</p><p>AI Art & Diffusion & 3D</p><p>FLUX.1 Krea [dev] launches as a state-of-the-art open-weights text-to-image model (<a target="_blank" href="https://x.com/bfl_ml/status/1950920537741336801">X</a>, <a target="_blank" href="https://huggingface.co/black-forest-labs/FLUX.1-Krea-dev">HuggingFace</a>)</p><p>Black Forest Labs teamed up with Krea AI for FLUX.1 Krea [dev], an open-weights text-to-image model ditching the "AI gloss" for natural, distinctive vibes. Think DALL-E 2's quirky grain without the saturation. It outperforms open peers and rivals proprietary models in preference tests, and it's fully Flux-compatible for LoRAs/tools. Yam and I geeked out over the aesthetics frontier; it's a flexible base for fine-tunes, available on Hugging Face with commercial options via FAL/Replicate. If you're tired of cookie-cutter outputs, this breathes fresh life into generations.</p><p>Ideogram Character launches: one-shot character consistency for everyone (<a target="_blank" href="https://x.com/ideogram_ai/status/1950255115753095307">X</a>)</p><p>Ideogram's Character feature lets you upload one pic for instant, consistent variants. It's free for all, with inpainting to swap yourself into memes/art. My tests nailed expressions and scenes (me in cyberpunk? Spot-on), though not always photoreal. Wolfram praised the accuracy; it's a meme-maker's dream! And they give you around 10 free generations, so give it a go.</p><p>Tencent Hunyuan3D World Model 1.0 launches as the first open-source, explorable 3D world generator (<a target="_blank" href="https://x.com/TencentHunyuan/status/1949288986192834718">X</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanWorld-1">HF</a>)</p><p>Tencent's Hunyuan3D World Model 1.0 is the first open-source generator of explorable 3D worlds from text/image: 360° immersive, with exportable meshes for games/modeling. It takes ~33GB VRAM on complex scenes, but Wolfram called it a step toward the metaverse; I wandered a demo scene, loving the potential despite the rough edges. Integrate it into CG pipelines? 
Game-changer for VR/creators.</p><p>Voice & Audio </p><p>Look, I wasn't even going to mention this on the show, but it came across my feed just as I was about to wrap up ThursdAI, and it's really something. Riffusion joined forces with Producer.ai, and using FUZZ-2 they now have a fully chattable studio producer; you can ask for... anything you would ask for in a studio! </p><p>Here's my first reaction, and it's really fun. I think the invite code 'STUDIO' is still open... I'm not affiliated with them at all! </p><p>Tools </p><p>Ok, I promised some folks we'd add this in: Nisten went super <a target="_blank" href="https://x.com/nisten/status/1950620243258151122">viral</a> last week using a new open source tool called Crush from CharmBracelet, an open-source agentic coding tool, and it looks awesome! </p><p>He gave a demo live on the show, including how to set it up, with subagents etc. If you're into vibe coding and using the open source models, definitely give Crush a try; it's really flying and looks cool! </p><p>Phew, ok, we somehow were able to cover ALL these releases this week, and we didn’t even have an interview! </p><p>Here’s the TL;DR and links for the folks who subscribed (I’m trying a new thing to promote subs on this newsletter), and see you in two weeks (next week is Wolfram’s turn again, as I’m somewhere in Europe!) 
</p><p>ThursdAI - July 31st, 2025 - TL;DR</p><p>* Hosts and Guests</p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="https://x.com/yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/nisten">@nisten</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldj</a></p><p>* Open Source LLMs</p><p>* Zhipu drops GLM-4.5 355B (A32B) AI model (<a target="_blank" href="https://x.com/Zai_org/status/1949831552189518044">X</a>, <a target="_blank" href="https://huggingface.co/zai-org/GLM-4.5">HF</a>)</p><p>* ARCEE AFM‑4.5B and AFM‑4.5B‑Base weights released (<a target="_blank" href="https://x.com/LucasAtkins7/status/1950278100874645621">X</a>, <a target="_blank" href="https://huggingface.co/arcee-ai/AFM-4.5B">HF</a>)</p><p>* Qwen is on 🔥 - 3 new models:</p>]]></description><link>https://sub.thursdai.news/p/thursdai-jul-31-2025-qwens-small</link><guid isPermaLink="false">substack:post:169789297</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 01 Aug 2025 01:21:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/169789297/425d9d6a67f0dbfa3f6a579fca6185a4.mp3" length="70903862" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5908</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/169789297/e777193d3edbb9b379b45b33e3d6ccbb.jpg"/></item><item><title><![CDATA[📆 ThursdAI - July 24, 2025 - Qwen-mas in July, The White House's AI Action Plan & Math Olympiad Gold for AIs + coding a 3d tetris on stream]]></title><description><![CDATA[<p>What a WEEK! Qwen-mass in July. Folks, AI doesn't seem to want to slow down, especially Open Source! This week we see yet another jump on SWE-bench verified (3rd week in a row?) 
this time from our friends at Alibaba Qwen. </p><p>It was a pleasure of mine to host Junyang Lin from the team at Alibaba, who came to chat with us about their incredible release, with not 1 but three new models! </p><p>Then, we had a great chat with Joseph Nelson from Roboflow, who not only dropped additional SOTA models, but was also in Washington at the announcement of the new AI Action Plan from the White House. </p><p>Great conversations this week; as always, the TL;DR is at the end. Tune in! </p><p>Open Source AI - QwenMass in July</p><p>This week, the open-source world belonged to our friends at Alibaba Qwen. They didn't just release one model; they went on an absolute tear, dropping bomb after bomb on the community and resetting the state-of-the-art multiple times.</p><p><strong>A "Small" Update with Massive Impact: Qwen3-235B-A22B-Instruct-2507</strong></p><p>Alibaba called this a <em>minor</em> refresh of their 235B parameter mixture-of-experts.</p><p>Sure, if you consider +13 points on GPQA and a 256K context window minor. The 2507 drops hybrid thinking. Instead, Qwen now ships separate instruct and chain-of-thought models, avoiding token bloat when you just want a quick answer. Benchmarks? 81% MMLU-Redux, 70% LiveCodeBench, new SOTA on BFCL function-calling. All with 22B active params.</p><p>Our friend of the pod and head of development at Alibaba Qwen, Junyang Lin, joined the pod and talked to us about their decision to uncouple this model from the hybrid reasoner Qwen3.</p><p>"After talking with the community and thinking it through," he said, "we decided to stop using hybrid thinking mode. Instead, we'll train instruct and thinking models separately so we can get the best quality possible."</p><p>The community felt the hybrid model sometimes had conflicts and didn't always perform at its best. So, Qwen delivered a pure non-reasoning instruct model, and the results are staggering. Even without explicit reasoning, it's crushing benchmarks. 
Wolfram tested it on his MMLU-Pro benchmark and it got the top score of all open-weights models he's ever tested. Nisten saw the same thing on medical benchmarks, where it scored the highest on MedMCQA. This thing is a beast, getting a massive 77.5 on GPQA (up from 62.9) and 51.8 on LiveCodeBench (up from 32). This is a huge leap forward, and it proves that a powerful, well-trained instruct model can still push the boundaries of reasoning.</p><p><strong> The New (open) King of Code: Qwen3-Coder-480B (</strong><a target="_blank" href="https://x.com/Alibaba_Qwen/status/1947766835023335516"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://wandb.me/qcoder-colab"><strong>Try It</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct"><strong>HF</strong></a><strong>)</strong></p><p>Just as we were catching our breath, they dropped the main event: <strong>Qwen3-Coder</strong>. This is a 480-billion-parameter coding-specific behemoth (35B active) trained on a staggering 7.5 trillion tokens with a 70% code ratio, and it sets a new SOTA on SWE-bench Verified with 69.6% (just a week after Kimi got SOTA with 65%, and 2 weeks after Devstral's SOTA of 53% 😮) </p><p>To get this model to SOTA, Junyang explained, they used reinforcement learning with over 20,000 parallel sandbox environments. This allows the model to interact with the environment, write code, see the output, get the reward, and learn from it in a continuous loop. The results speak for themselves.</p><p>With long-context abilities (256K natively, extendable up to 1M with YaRN), this coding beast tops the charts, achieving Sonnet-level performance at significantly lower cost! 
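For the curious, YaRN context extension on models like this is typically switched on via the rope_scaling block in the model's config.json. A sketch (the exact numbers here are my assumptions; check the model card for the real values):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144
  }
}
```

A factor of 4.0 over the ~256K native window is what would stretch the usable context toward the 1M range.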
</p><p>Both models supported day-1 on W&B Inference (<a target="_blank" href="https://x.com/weights_biases/status/1947859654400434538">X</a>, <a target="_blank" href="https://wandb.me/qcoder-colab">Get Started</a>)</p><p>I'm very, very proud to announce that both these incredible models get day-1 support on our W&B Inference (and that yours truly is now part of deciding which models we host!) </p><p>With unbeatable prices ($0.10/$0.10 input/output per 1M for A22B, $1/$1.5 for Qwen3 Coder) and speed, we are hosting these models at full precision to give you the maximum possible intelligence and the best bang for your buck! </p><p>Nisten set up our (OpenAI-compatible) endpoint with his Cline coding assistant and built a 3D Tetris game live on the show, and it absolutely went flying. </p><p>This demo perfectly captures the convergence of everything we're excited about: a state-of-the-art open-source model, running on a blazing-fast inference service, integrated into a powerful open-source tool, creating something complex and interactive in seconds.</p><p>If you want to try this yourself, we're giving away credits for W&B Inference. Just find our <a target="_blank" href="https://x.com/weights_biases/status/1947859654400434538">announcement tweet</a> for the Qwen models on the <strong>@weights_biases</strong> X account and reply with <strong>"coding capybara"</strong> (a nod to Qwen's old mascot!). Add "ThursdAI" and I'll personally make sure you get bumped up the list!</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Big Companies & APIs</p><p><strong>America’s AI Action Plan: A New Space Race for AI Dominance (</strong><a target="_blank" href="ai.gov"><strong>ai.gov</strong></a><strong>)</strong></p><p>Switching gears to policy, I was excited to cover the White House’s newly unveiled “America’s AI Action Plan.” This 25-page strategy, dropped this week, frames AI as a national priority on par with the space race or Cold War, aiming to secure U.S. dominance with 90 policy proposals. I was thrilled to have Joseph Nelson from Roboflow join us fresh from the announcement event in Washington, sharing the room’s energy and insights. The plan pushes for deregulation, massive data center buildouts, workforce training, and, most exciting for us, explicit support for open-source and open-weight models. It’s a bold move to counter global competition, especially from China, while fast-tracking infrastructure like chip fabrication and energy grids.</p><p>Joseph broke down the vibe at the event, including a surreal moment where the President riffed on Nvidia’s market dominance right in front of Jensen Huang. But beyond the anecdotes, what strikes me is the plan’s call for startups and innovation: think grants and investments via the Department of Defense and Small Business Administration. It’s like a request for new AI companies to step up. As someone who’s railed against past moratorium fears on this show, seeing this pro-innovation stance is a huge relief.</p><p><strong>🔊 Voice & Audio – Higgs Audio v2 Levels Up (</strong><a target="_blank" href="https://x.com/reach_vb/status/1947997596456272203"><strong>X</strong></a><strong>)</strong></p><p>Boson AI fused a 3B-param Llama 3.2 with a 2.2B audio Dual-FFN and trained on ten million hours of speech + music. Result: Higgs Audio v2 beats GPT-4o-mini and ElevenLabs v2 on prosody, does zero-shot multi-speaker dialog, and even hums melodies. 
The demo runs on a single A100 and sounds pretty good. </p><p>The first demo I played was not super impressive, but the laugh track made up for it! </p><p><strong>🤖 A Week with ChatGPT Agent</strong></p><p>Last week, OpenAI dropped the ChatGPT Agent on us during our stream, and now we've had a full week to play with it. It's a combination of their browser-operating agent and their deep research agent, and the experience is pretty wild.</p><p>Yam had it watching YouTube videos and scouring Reddit comments to create a comparison of different CLI tools. He was blown away, seeing the cursor move around and navigate complex sites right on his phone.</p><p>I put it through its paces as well. I tried to get it to order flowers for my girlfriend (it got all the way to checkout!), and it successfully found and filled out the forms for a travel insurance policy I needed. My ultimate test (<a target="_blank" href="https://x.com/altryne/status/1948111176203911222">live stream here</a>), however, was asking it to prepare the show notes for ThursdAI, a complex task involving summarizing dozens of my X bookmarks. It did a decent job (a solid C/B), but still needed my intervention. It's not quite a "fire-and-forget" tool for complex, multi-step tasks yet, but it's a huge leap forward. As Yam put it, "This is the worst that agents are going to be." And that's an exciting thought.</p><p>What a week. From open-source models that rival the best closed-source giants to governments getting serious about AI innovation, the pace is just relentless. It's moments like Nisten's live demo that remind me why we do this show: to witness and share these incredible leaps forward as they happen. We're living in an amazing time.</p><p>Thank you for being a ThursdAI subscriber. As always, here's the TL;DR and show notes for everything that happened in AI this week.</p><p><p>Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! 
This post is public so feel free to share it.</p></p><p>TL;DR and Show Notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/nisten">@nisten</a>, <a target="_blank" href="http://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Junyang Lin</strong> - Qwen Team, Alibaba (<a target="_blank" href="https://x.com/JustinLin610">@JustinLin610</a>)</p><p>* <strong>Joseph Nelson</strong> - Co-founder & CEO, Roboflow (<a target="_blank" href="https://x.com/josephnelson">@josephnelson</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* Sapient Intelligence releases <strong>Hierarchical Reasoning Model (HRM)</strong>, a tiny 27M param model with impressive reasoning on specific tasks (<a target="_blank" href="https://x.com/makingAGI/status/1947286324735856747">X</a>, <a target="_blank" href="https://arxiv.org/abs/2506.21734">arXiv</a>).</p><p>* Qwen drops a "little" update: <strong>Qwen3-235B-A22B-Instruct-2507</strong>, a powerful non-reasoning model (<a target="_blank" href="https://x.com/JustinLin610/status/1947364588340523222">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507">HF Model</a>).</p><p>* Qwen releases the new SOTA coding agent model: <strong>Qwen3-Coder-480B-A35B-Instruct</strong> (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1947790753414369280">X</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct">HF Model</a>).</p><p>* <strong>Hermes-Reasoning Tool-Use dataset</strong> with 51k tool-calling examples is released (<a target="_blank" href="https://x.com/intstr1Irinja/status/1947444760393773185">X</a>, <a target="_blank" 
href="https://huggingface.co/datasets/interstellarninja/hermes_reasoning_tool_use">HF Dataset</a>).</p><p>* NVIDIA releases updates to their <strong>Nemotron</strong> reasoning models.</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* The White House unveils <strong>"America’s AI Action Plan"</strong> to "win the AI race" (<a target="_blank" href="https://x.com/NetChoice/status/1948042669906624554">X</a>, <a target="_blank" href="https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf">White House PDF</a>).</p><p>* Both <strong>OpenAI</strong> (<a target="_blank" href="https://x.com/alexwei_/status/1946477742855532918">X</a>) and <strong>Google DeepMind</strong> win Gold at the International Math Olympiad (IMO), with <strong>ByteDance's Seed-Prover</strong> taking Silver (<a target="_blank" href="https://github.com/ByteDance-Seed/Seed-Prover">GitHub</a>).</p><p>* The AI math breakthrough has a "gut punch" effect on the math community (<a target="_blank" href="https://x.com/Dave_White_/status/1947461492783386827">Dave White on X</a>).</p><p>* Google now processes over <strong>980 trillion tokens</strong> per month across its services.</p><p>* A week with <strong>ChatGPT Agent</strong>: testing its capabilities on real-world tasks.</p><p>* <strong>This Week's Buzz</strong></p><p>* Day 0 support for both new Qwen models on <strong>W&B Inference</strong> (<a target="_blank" href="https://wandb.ai/inference">Try it</a>, <a target="_blank" href="https://wandb.me/qcoder-colab">Colab</a>). 
Reply to our <a target="_blank" href="https://x.com/weightsandbiases">tweet</a> with "coding capybara ThursdAI" for credits!</p><p>* Live on-stream demo of Qwen3-Coder building a 3D Tetris game using Cline.</p><p>* <strong>Interesting Research</strong></p><p>* Researchers discover <strong>subliminal learning</strong> in LLMs, where traits are passed through seemingly innocuous data (<a target="_blank" href="https://x.com/0wain_evans/status/1947709848103255232">X</a>, <a target="_blank" href="https://arxiv.org/abs/2507.14805">arXiv</a>).</p><p>* Apple proposes <strong>multi-token prediction</strong>, speeding up LLMs by up to 5x without quality loss (<a target="_blank" href="https://x.com/JacksonAtkinsX/status/1947408593638002639">X</a>, <a target="_blank" href="https://arxiv.org/abs/2507.11851">arXiv</a>).</p><p>* <strong>Voice & Audio</strong></p><p>* Boson AI open-sources <strong>Higgs Audio v2</strong>, a unified TTS model that beats GPT-4o-mini and ElevenLabs (<a target="_blank" href="https://x.com/reach_vb/status/1947997596456272203">X</a>, <a target="_blank" href="https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base">HF Model</a>).</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Decart AI Releases <strong>MirageLSD</strong>, a real-time live-stream diffusion model for instant video transformation (<a target="_blank" href="https://x.com/DecartAI/status/1945947692871692667">X Post</a>).</p><p>* <strong>Tools</strong></p><p>* Qwen releases <strong>qwen-code</strong>, a CLI tool and agent for their new coder models. (<a target="_blank" href="https://github.com/QwenLM/qwen-code">Github</a>)</p><p>* <strong>GitHub Spark</strong>, a new AI-powered feature from GitHub (<a target="_blank" href="https://x.com/simonw/status/1948407932418457968">Simon Willison on X</a>).</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-24-2025-qwen-mas-in</link><guid isPermaLink="false">substack:post:169174663</guid><dc:creator><![CDATA[Alex Volkov and Joseph]]></dc:creator><pubDate>Thu, 24 Jul 2025 21:11:05 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/169174663/cd4dfdc17f3dfd5ca99ae25116d740cc.mp3" length="74435973" type="audio/mpeg"/><itunes:author>Alex Volkov and Joseph</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6203</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/169174663/3b5d9680b679dc9a8c6e65e9dcda3d88.jpg"/></item><item><title><![CDATA[📆 ThursdAI - July 17th - Kimi K2 👑, OpenAI Agents, Grok Waifus, Amazon Kiro, W&B Inference & more AI news!]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋 and WHAT a week to turn a year older! Not only did I get to celebrate my birthday with 30,000+ of you live during the OpenAI stream, but we also witnessed what might be the biggest open-source AI release since DeepSeek dropped. Buckle up, because we're diving into a trillion-parameter behemoth, agentic capabilities that'll make your head spin, and somehow Elon Musk decided Grok waifus are the solution to... something.</p><p>This was one of those weeks where I kept checking if I was dreaming. Remember when DeepSeek dropped and we all lost our minds? Well, buckle up because Moonshot's Kimi K2 just made that look like a warm-up act. And that's not even the wildest part of this week! </p><p>As always, all the show notes and links are at the bottom, here's our liveshow (which included the full OAI ChatGPT agents watch party) - Let's get into it! 
</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p><strong>🚀 Open Source LLMs: The Kimi K2 Revolution</strong></p><p><strong>The New Open Source King Has Arrived</strong></p><p>Folks, I need you to understand something - just a little after we finished streaming last week celebrating Grok 4, a company called Moonshot decided to casually drop what might be the most significant open source release since... well, maybe ever?</p><p><strong>Kimi K2</strong> is a 1 trillion parameter model. Yes, you read that right - TRILLION. Not billion. And before you ask "but can my GPU run it?" - this is an MoE (Mixture of Experts) with only 32B active parameters, which means it's actually usable while being absolutely massive.</p><p>Let me give you the numbers that made my jaw drop:</p><p>* <strong>65.8% on SWE-bench Verified</strong> - This non-reasoning model beats Claude Sonnet (and almost everything else)</p><p>* <strong>384 experts</strong> in the mixture (the scale here is bonkers)</p><p>* <strong>128K context window</strong> standard, with rumors of 2M+ capability</p><p>* Trained on <strong>15.5 trillion tokens</strong> with the new Muon optimizer</p><p>The main thing about the SWE-bench score is not even just the incredible performance, it's the performance without thinking/reasoning + price! </p><p><strong>The Muon Magic</strong></p><p>Here's where it gets really interesting for the ML nerds among us. These folks didn't use AdamW - they used a new optimizer called Muon (with their own Muon Clip variant). Why does this matter? They trained to 15.5 trillion tokens with ZERO loss spikes. That beautiful loss curve had everyone in our community Slack channels going absolutely wild.
</p><p>As Yam explained during the show, claiming you have a better optimizer than AdamW is like saying you've cured cancer - everyone says it, nobody delivers. Well, Moonshot just delivered at 1 trillion parameter scale.</p><p><strong>Why This Changes Everything</strong></p><p>This isn't just another model release. This is "Sonnet at home" if you have the hardware. But more importantly:</p><p>* <strong>Modified MIT license</strong> (actually open!)</p><p>* <strong>5x cheaper than proprietary alternatives</strong></p><p>* <strong>Base model released</strong> (the first time we get a base model this powerful)</p><p>* Already has <strong>Anthropic-compatible API</strong> (they knew what they were doing)</p><p>The vibes are OFF THE CHARTS. Every high-taste model tester I know is saying this is the best open source model they've ever used. It doesn't have that "open source smell" - it feels like a frontier model because it IS a frontier model.</p><p><strong>Not only a math genius</strong></p><p>Importantly, this model is great at multiple things, as folks specifically called out its personality and writing style! Our friend Sam Paech, creator of <a target="_blank" href="https://eqbench.com/">EQBench</a>, has noted that this is maybe the first time an open source model writes this well, and it is in fact SOTA on his Creative Writing benchmark and EQBench! </p><p><strong>Quick Shoutouts</strong></p><p>Before we dive deeper, huge props to:</p><p>* <strong>Teknium</strong> for dropping the Hermes 3 dataset (nearly 1M high-quality entries!) (<a target="_blank" href="https://x.com/Teknium1/status/1945259797517099126">X</a>)</p><p>* <strong>LG</strong> (yes, the fridge company) for EXAONE 4.0 - their 32B model getting 81.8% on MMLU Pro is no joke (<a target="_blank" href="https://x.com/testingcatalog/status/1945142194303537225">X</a>)</p><p><strong>🎉 This Week's Buzz: W&B Inference Goes Live with Kimi-K2! 
(</strong><a target="_blank" href="https://x.com/weights_biases/status/1945204732735447222"><strong>X</strong></a><strong>)</strong></p><p>Ok, but what if you want to try Kimi-K2 but don't have the ability to run 1T models willy nilly? Well, folks, I've been waiting TWO AND A HALF YEARS to say this: <strong>We're no longer GPU poor!</strong></p><p>Weights & Biases + CoreWeave = Your new inference playground. We launched Kimi K2 on our infrastructure within 3 days of release! </p><p>Sitting behind the scenes on this launch was surreal - as I've been covering all the other inference service launches, I knew exactly what we all want: fast inference, full non-quantized weights, OpenAI API compatibility, a great playground to test it out, and function calling/tool use. And we've gotten almost all of these, while the super cracked CoreWeave and W&B Weave teams worked their asses off over the weekend to get this shipped in just a few days! </p><p>And here’s the kicker: I’m giving away $50 in inference credits to 20 of you to try Kimi K2 on our platform. Just reply “K2-Koolaid-ThursdAI” to our X launch post <a target="_blank" href="https://x.com/weights_biases/status/1945204732735447222">here</a> and we'll pick up to 20 winners with $50 worth of credits! 🫡</p><p>It’s live now at <a target="_blank" href="https://api.inference.wandb.ai/v1"><strong>api.inference.wandb.ai/v1</strong></a> (model ID: moonshotai/Kimi-K2-Instruct), fully integrated with Weave for tracing and evaluation. We’re just getting started, and I want your feedback to make this even better. 
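And if you want to poke at it from code: since the endpoint is OpenAI-compatible, a plain HTTP POST is all you need. Here's a minimal stdlib-only sketch - note the Bearer-token auth header and the WANDB_API_KEY env var name are my assumptions, so double-check the docs before relying on them:

```python
import json
import os
import urllib.request

API_URL = "https://api.inference.wandb.ai/v1/chat/completions"
MODEL_ID = "moonshotai/Kimi-K2-Instruct"

def build_chat_request(prompt: str) -> dict:
    # OpenAI-compatible chat-completions payload, per the launch notes above.
    return {"model": MODEL_ID, "messages": [{"role": "user", "content": prompt}]}

def ask_kimi(prompt: str, api_key: str) -> str:
    # Assumption: the endpoint accepts a standard Bearer token; see the W&B docs.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Only fire a real request if a key is configured.
if os.environ.get("WANDB_API_KEY"):
    print(ask_kimi("Give me one sentence of AI news.", os.environ["WANDB_API_KEY"]))
```

Swap in the official OpenAI Python client with the same base URL and model ID if you'd rather not hand-roll requests.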
More on <a target="_blank" href="https://weave-docs.wandb.ai/guides/integrations/inference/#advanced-example-use-weave-evaluations-and-leaderboards-with-the-inference-service"><strong>W&B Inference Docs</strong></a> - oh and everyone gets $2 free even without me, which is like 500K tokens to test it out.</p><p><strong>Big CO LLMs + APIs</strong></p><p>The big players didn't sleep this week either—funding flew like confetti, Grok went full anime, and OpenAI dropped agents mid-stream (we reacted live!). Amazon snuck in with dev tools, and Gemini embeddings claimed the throne. Let's get through some of these openers before we get to the "main course", which of course came from OpenAI.</p><p><strong>Grok Gets... Waifus?</strong></p><p>I can't believe I'm writing this in a serious AI newsletter, but here we are. XAI added animated 3D characters to Grok, including "Annie" - and let's just say she's very... interactive. XAI partnered with a company that does real-time animated 3D avatars, and these are powered by Grok, so... they are a bit unhinged! </p><p>The same Elon who's worried about birth rates just created nuclear-grade digital companions. The Grok app shot to #1 in the Japanese App Store immediately. Make of that what you will. 
😅</p><p>They even posted a job for "Full Stack Waifu Engineer" - we truly live in the strangest timeline.</p><p>XAI also this week <a target="_blank" href="https://x.com/xai/status/1945039609840185489">addressed</a> the concerns we all had with "mechahitler" and the Grok 4 issues post-launch (where it used its web search to see "what does Elon think" when it was asked about a few topics) </p><p>Credit for finding the prompt change: Simon Willison</p><p><strong>Other Quick Hits from Big Tech</strong></p><p>* <strong>Gemini Embedding Model</strong>: New SOTA on MTEB leaderboards (68.32 score) (<a target="_blank" href="https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/">dev blog</a>)</p><p>* <strong>Amazon S3 Vectors</strong>: Native vector storage in S3 (huge for RAG applications) (<a target="_blank" href="https://x.com/awscloud/status/1945271447619809504">X</a>)</p><p>* <strong>Amazon Kiro</strong>: Their VS Code fork with spec-driven development (think PM-first coding) (<a target="_blank" href="https://x.com/ajassy/status/1944785963663966633">X</a>)</p><p><strong>🔥 OpenAI Agents: ChatGPT Levels Up to Do-It-All Sidekick </strong></p><p>We timed it perfectly—OpenAI's live stream hit mid-show, and we reacted with 30,000+ of you! And while we didn't get the rumored open source model from OAI, we did get... ChatGPT Agent (codename Odyssey), which merges Deep Research's fast-reading text browser with Operator's clicky visual browser and terminal access, all RL-tuned to pick tools smartly. It browses, codes, calls APIs (Google Drive, GitHub, etc., if you connect), generates images, and builds spreadsheets/slides—handling interruptions, clarifications, and takeovers for collaboration. <strong>SOTA jumps: 41.6% on Humanity's Last Exam (double O3), 27.4% on FrontierMath</strong>, 45.5% on SpreadsheetBench, 68.9% on BrowseComp.</p><p>These are insane jumps in capabilities, folks, just... 
mind-blowing that we can now have agents that are SO good! </p><p>The team demoed wedding planning (outfits, hotels, gifts with weather/venue checks), sticker design/ordering, and an MLB itinerary spreadsheet—wild to watch it chain thoughts on recordings. </p><p>Wolfram called it the official start of agent year; Yam hyped the product polish (mobile control!); Nisten noted it's packaged perfection over DIY. I refreshed ChatGPT obsessively—mind-blown at turning my phone into a task master. Available now for Pro/Plus/Team (400/40 queries/month), Enterprise soon. This is the "feel the AGI" moment Sam mentioned—game over for tedious tasks (OpenAI announcement: <a target="_blank" href="https://openai.com/index/introducing-chatgpt-agent/"><strong>https://openai.com/index/introducing-chatgpt-agent/</strong></a>).</p><p>I've yet to get access to it, but I'm very much looking forward to testing it out and letting you guys know how it works! </p><p>Combining the two browser modes (visual that has my cookies and textual that can scan tons of websites super quick) + CLI + deep research abilities + RL for the right kind of tool use all sounds incredibly intriguing! </p><p><strong>Vision & Video</strong></p><p><strong>Runway’s Act-Two: Motion Capture Gets a Major Upgrade </strong>(<a target="_blank" href="https://x.com/runwayml/status/1945189222542880909">X</a>, <a target="_blank" href="https://www.youtube.com/watch?v=JW8PHlFD7HM">YouTube</a>)</p><p>Runway’s latest drop, Act-Two, is a next-gen motion capture model that’s got creatives buzzing. It tracks head, face, body, and hands with insane fidelity, animating any character from a single performance video. It’s a huge leap from Act-One, already in use for film, VFX, and gaming, and available now to enterprise and creative customers with a full rollout soon. 
</p><p><strong>Voice & Audio</strong></p><p><strong>Mistral’s Voxtral: Open Speech Recognition Champ </strong>(<a target="_blank" href="https://x.com/MistralAI/status/1945130173751288311">X</a>, <a target="_blank" href="https://huggingface.co/mistralai">HF</a>)</p><p>Mistral AI is killing it with Voxtral, a state-of-the-art open speech recognition model. With Voxtral Small at 24B for production and Mini at 3B for edge devices, it outperforms OpenAI’s Whisper large-v3 across English and multilingual tasks like French, Spanish, Hindi, and German. Supporting up to 32K token context (about 30-40 minutes of audio), it offers summarization and Q&A features, all under an Apache 2.0 license. At just $0.001 per minute via API, it’s a steal for real-time or batch transcription. </p><p><strong>Tools</strong></p><p><strong>Liquid AI’s LEAP and Apollo: On-Device AI for All</strong></p><p>Liquid AI is bringing AI to your pocket with LEAP, a developer platform for building on-device models, and Apollo, a lightweight iOS app to run small LLMs locally. We’re talking 50-300MB models optimized for minimal battery drain and instant inference, no cloud needed. It’s privacy-focused and plug-and-play, perfect for offline workflows on Android and iOS. Developers, this is your prototyping dream—join the community via <strong>X</strong>.</p><p><strong>Amazon Kiro: Your Spec-Driven Coding Buddy</strong></p><p>I’ve already touched on Amazon’s Kiro, but let me reiterate—this spec-driven AI IDE is a standout. It structures your dev process around requirements, letting you define projects in plain language or diagrams before coding starts. It automates docs, testing, and more, feeling like a technical PM guiding you from concept to production. Early users are hooked on its PRD mode, and it’s free during preview. Give it a spin—details on <strong>X</strong>.</p><p><strong>Wrapping Up: An Unforgettable AI Birthday Bash</strong></p><p>What a week, folks! 
From Kimi K2 redefining open-source power to OpenAI’s ChatGPT Agent ushering in a new era of task automation, this has been a whirlwind of innovation. Throw in Grok’s quirky waifus and our own W&B Inference launch, and I’m left speechless on my birthday. Sharing this with over 30,000 of you during our live stream was the ultimate gift—AI is moving at a pace I couldn’t have dreamed of when I started ThursdAI. Here’s to more breakthroughs, and I can’t wait to see what you build with Kimi K2 credits. Let’s keep pushing the boundaries together!</p><p>P.S. - If you'd like to support this podcast/newsletter and give me a birthday present, the best way is to tell your friends about it and the second best way is to subscribe 👏 </p><p>TL;DR and Show Notes</p><p>Here’s everything we covered this week on ThursdAI for July 17, 2025, packed with links and key highlights for you to dive deeper:</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="https://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="https://x.com/@yampeleg">@yampeleg</a>, <a target="_blank" href="https://x.com/@nisten">@nisten</a>, <a target="_blank" href="https://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Moonshot launches Kimi K2 - a 1T param MoE crushing SWE Bench Verified at 65.8% (<a target="_blank" href="https://x.com/Kimi_Moonshot/status/1943687594560332025">X post</a>, <a target="_blank" href="https://huggingface.co/moonshotai">HuggingFace</a>, <a target="_blank" href="https://platform.moonshot.ai">API & docs</a>, <a target="_blank" href="https://github.com/MoonshotAI/Kimi-K2">GitHub</a>)</p><p>* Teknium drops Hermes 3 dataset - nearly 1M samples for training agentic models (<a target="_blank"
href="https://x.com/Teknium1/status/1945259797517099126">X</a>)</p><p>* LGAI EXAONE-4.0 - hybrid attention, 32B & 1.2B models with 131K+ context (<a target="_blank" href="https://x.com/Presidentlin/status/1944977367111291161">X</a>, <a target="_blank" href="https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B">HuggingFace</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI’s ChatGPT Agent - unified agentic AI for real-world tasks, scoring 41.6% on HLE (<a target="_blank" href="https://openai.com/index/introducing-chatgpt-agent/">Announcement</a>)</p><p>* Grok 4 waifus - XAI adds animated characters, topping Japan’s App Store</p><p>* Mira Murati’s Thinking Machines Lab - $2B funding for open AI science (<a target="_blank" href="https://x.com/miramurati/status/1945166365834535247">X</a>)</p><p>* Gemini Embedding Model - #1 on MTEB with 68.32 score (<a target="_blank" href="https://x.com/OfficialLoganK/status/1944806630979461445">X</a>, <a target="_blank" href="https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/">Dev Blog</a>)</p><p>* Amazon S3 Vectors - preview for vector storage, up to 90% cost savings (<a target="_blank" href="https://x.com/awscloud/status/1945271447619809504">X</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Kimi K2 on W&B Inference - open, scalable production access, $50 credits with “K2KoolAid” (<a target="_blank" href="https://x.com/weights_biases/status/1945204732735447222">X</a>, <a target="_blank" href="https://weave-docs.wandb.ai/guides/integrations/inference/#advanced-example-use-weave-evaluations-and-leaderboards-with-the-inference-service">Docs</a>)</p><p>* Wolfram’s Evaluation of W&B service (<a target="_blank" href="https://x.com/altryne/status/1945586487627554938">X</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Runway’s Act-Two - next-gen motion capture for head, face, body, hands (<a target="_blank" href="https://x.com/runwayml/status/1945189222542880909">X</a>, <a 
target="_blank" href="https://www.youtube.com/watch?v=JW8PHlFD7HM">YouTube</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Mistral’s Voxtral - open SOTA speech recognition, beats Whisper v3 (<a target="_blank" href="https://x.com/MistralAI/status/1945130173751288311">X</a>, <a target="_blank" href="https://huggingface.co/mistralai">HuggingFace</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* OpenAI image service API adds high-quality mode (<a target="_blank" href="https://x.com/OpenAIDevs/status/1945538534884135132">X</a>)</p><p>* <strong>Tools</strong></p><p>* Liquid AI’s LEAP & Apollo - on-device AI for mobile, privacy-first (<a target="_blank" href="https://x.com/LiquidAI_/status/1945105323846504821">X</a>)</p><p>* Amazon’s Kiro - spec-driven AI IDE, free in preview (<a target="_blank" href="https://x.com/ajassy/status/1944785963663966633">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-17th-kimi-k2-openai</link><guid isPermaLink="false">substack:post:168589148</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 17 Jul 2025 20:51:29 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/168589148/5cd1d3a6b6015d208238b9bedc0936f3.mp3" length="75947659" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6329</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/168589148/a837e3d175a5eb6ee02cd4a5a6801cb0.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jul 10 - Grok 4 and 4 Heavy, SmolLM3, Liquid LFM2, Reka Flash & Vision, Perplexity Comet Browser, Devstral 1.1 & More AI News]]></title><description><![CDATA[<p>Hey everyone, Alex here</p><p>Don't you just 
love "new top LLM" drop weeks? I sure do! This week, we had a watch party for Grok-4, with over 20K tuning in to watch together, as the folks at XAI unveiled their newest and best model around. Two models in fact, Grok-4 and Grok-4 Heavy. </p><p>We also had a very big open source week; we had the pleasure to chat with the creators of 3 open source models on the show, first with Elie from HuggingFace who just released SmolLM3, then with our friend Maxime Labonne who together with Liquid released a beautiful series of tiny on-device models. </p><p>Finally we had a chat with folks from Reka AI, and as they were on stage, someone in their org published a new open source Reka Flash model 👏 Talk about Breaking News right on the show! </p><p>It was a very fun week and a great episode, so grab your favorite beverage and let me update you on everything that's going on in AI (as always, show notes at the end of the article) </p><p>Open Source LLMs</p><p>As always, even on big weeks like this, we open the show with Open Source models first and this week, the western world caught up to the Chinese open source models we saw last week! </p><p>HuggingFace SmolLM3 - SOTA fully open 3B with dual reasoning and long-context (<a target="_blank" href="https://x.com/LoubnaBenAllal1/status/1942614508549333211">𝕏</a>, <a target="_blank" href="https://huggingface.co/blog/smollm3">HF</a>)</p><p>We had Elie Bakouch from Hugging Face on the show and you could feel the pride radiating through the webcam. SmolLM 3 isn’t just “another tiny model”; it’s an 11-trillion-token monster masquerading inside a 3-billion-parameter body. It reasons, it follows instructions, and it does both “think step-by-step” and “give me the answer straight” on demand. 
Hugging Face open-sourced every checkpoint, every dataset recipe, every graph in W&B – so if you ever wanted a fully reproducible, multi-lingual pocket assistant that fits on a single GPU, this is it.</p><p>They achieved the long context (128K today, 256K in internal tests) with a NoPE + YaRN recipe and salvaged the performance drop by literally merging two fine-tunes at 2 a.m. the night before release. Science by duct-tape, but it works: SmolLM 3 edges out Llama-3.2-3B, challenges Qwen-3, and stays within arm’s reach of Gemma-3-4B – all while loading faster than you can say “model soup.” 🤯</p><p>Liquid AI’s LFM2: Blazing-Fast Models for the Edge (<a target="_blank" href="https://x.com/maximelabonne/thread/1943295061275381864">𝕏</a>, <a target="_blank" href="https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38">Hugging Face</a>)</p><p>We started the show and I immediately got to hit the #BREAKINGNEWS button, as Liquid AI dropped LFM2, a new series of tiny (350M-1.2B) models focused on edge devices.</p><p>We then had the pleasure to host our friend Maxime Labonne, head of Post-Training at Liquid AI, to come and tell us all about this incredible effort! </p><p>Maxime, a legend in the model merging community, explained that LFM2 was designed from the ground up for efficiency. They’re not just scaled-down big models; they feature a novel hybrid architecture with convolution and attention layers specifically optimized for running on CPUs and devices like the Samsung Galaxy S24.</p><p>Maxime pointed out that, out of the box, they won't replace ChatGPT, but when you fine-tune them for a specific task like translation, they can match models 60 times their size. This is a game-changer for creating powerful, specialized agents that run locally. Definitely a great release and on ThursdAI of all days! 
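By the way, that "merging two fine-tunes at 2 a.m." trick from the SmolLM3 story (and Maxime's home turf as a model-merging legend) is, at its simplest, a "model soup": a parameter-wise weighted average of two checkpoints with the same architecture. Here's a toy sketch with made-up two-element weights - not the actual SmolLM3 recipe, just the core idea:

```python
# Toy "model soup": parameter-wise weighted average of two checkpoints.
# The tensors below are made up for illustration, not real SmolLM3 weights.
def soup(state_a: dict, state_b: dict, alpha: float = 0.5) -> dict:
    assert state_a.keys() == state_b.keys(), "checkpoints must share an architecture"
    return {
        name: [alpha * a + (1 - alpha) * b for a, b in zip(state_a[name], state_b[name])]
        for name in state_a
    }

long_context_ft = {"layer0.weight": [1.0, 2.0]}  # hypothetical fine-tune A
reasoning_ft = {"layer0.weight": [3.0, 4.0]}     # hypothetical fine-tune B
merged = soup(long_context_ft, reasoning_ft)
print(merged["layer0.weight"])  # [2.0, 3.0]
```

In practice you'd average real tensors (e.g. PyTorch state dicts) and tune alpha on a validation set, but the operation really is this simple.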
</p><p>Mistral's updated Devstral 1.1 Smashes Coding Benchmarks (<a target="_blank" href="https://x.com/MistralAI/status/1943316390863118716">𝕏</a>, <a target="_blank" href="https://huggingface.co/mistralai/Devstral-Small-2507">HF</a>)</p><p>Mistral didn't want to be left behind on this Open Source bonanza week, and also, today, dropped an update to their excellent coding model Devstral. </p><p>With 2 versions, an open-weights Small and an API-only Medium model, they have claimed an amazing 61.6% score on SWE-bench, and the open source Small gets a SOTA 53%, the highest among the open source models! 10 points higher than the excellent DeepSwe we covered just last week!</p><p>The thing to watch here is the incredible price performance, with this model beating Gemini 2.5 Pro and Claude 3.7 Sonnet while being 8x cheaper to run! </p><p>Devstral Small comes to us with an Apache 2.0 license, which we always welcome from the great folks at Mistral! </p><p>Big Companies LLMs and APIs</p><p>There's only one winner this week; it seems the other foundational labs stayed very quiet, waiting to see what XAI was going to release. </p><p>XAI releases Grok-4 and Grok-4 Heavy - the world-leading reasoning model (<a target="_blank" href="https://x.com/altryne/status/1943140257920172148">𝕏</a>, <a target="_blank" href="https://grok.com/">Try It</a>) </p><p>Wow, what a show! Space uncle Elon, together with the XAI crew, came fashionably late to their own stream, and unveiled the youngest but smartest brother of the Grok family, Grok 4, plus a multi-agent swarm they call Grok Heavy. We had a watch party with over 25K viewers across all streams who joined and watched this fairly historic event together! </p><p>Why historic? 
Well, for one, they have scaled RL (Reinforcement Learning) for this model significantly more than any other lab has so far, which resulted in an incredible reasoner, able to solve the HLE (Humanity's Last Exam) benchmark at an unprecedented 50% (while using tools) </p><p>The other unprecedented result is on the ArcAGI benchmark, specifically V2, which is designed to be very easy for humans and very hard for LLMs: Grok-4 got an incredible 15.9%, almost 2x better than Opus 4, the best-performing model before it! (ArcAGI president Greg Kamradt <a target="_blank" href="https://x.com/GregKamradt/status/1943169631491100856">says</a> Grok-4 shows signs of Fluid Intelligence!)</p><p>Real World benchmarks</p><p>Of course, academic benchmarks don't tell the full story, and while it's great to see that Grok-4 gets a perfect 100% on AIME25 and a very high 88.9% on GPQA Diamond, the most interesting benchmark they showed was the Vending-Bench. This is a very interesting new benchmark from AndonLabs, where they simulate a vending machine and let an LLM manage it, take orders, restock, and basically count how much money a model can make while operating a "real" business. </p><p>Grok scored a very significant $4K profit, selling 4569 items, 4x more than Opus, which shows a real impact on real-world tasks! </p><p>Not without controversy</p><p>Grok-4's release comes just one day after Grok 3 over at X started calling itself MechaHitler and spewing Nazi, antisemitic propaganda, which was a very bad episode. We've covered the previous "misalignment" from Grok, and this seemed even worse. 
Many examples (which XAI folks deleted) of Grok talking about antisemitic tropes, blaming people with Jewish surnames for multiple things, and generally acting jailbroken and up to no good.</p><p>XAI addressed the last episode with a token excuse, supposedly open sourcing their prompts, which were updated all of 4 times in the last 2 months, while addressing this episode with a "we noticed, and we'll add guardrails to prevent this from happening" </p><p>IMO this isn't enough, Grok is consistently (this is the 3rd time on my count) breaking alignment, way more than other foundational LLMs, and we must ask for more transparency for a model as significant and as widely used as this! And to my (lack of) surprise...</p><p>First principles thinking == Elon's thoughts? </p><p>Adding insult to injury, while Grok-4 was just launched, some folks asked its thoughts on the Israel-Palestine conflict, and instead of coming up with an answer on its own, Grok-4 did an <a target="_blank" href="https://x.com/jeremyphoward/status/1943436621556466171">X search</a> to see what Elon Musk thinks on this topic to form its opinion. It's so, so wrong to claim a model is great at "first principles" and have the first few tests from folks show that Grok defaults to seeing "what Elon thinks" </p><p>Look, I'm all for "moving fast" and of course I love AI progress, but we need to ask more from the foundational labs, especially given the incredible number of people who count on these models more and more! 
</p><p>This Week's Buzz</p><p>We're well over 300 registrations for our hackathon at the Weights & Biases SF offices this weekend (July 12-13), and I'm packing my suitcase after writing this, as I'm excited to see all the amazing projects folks will build to try and win over $15K in prizes, including an awesome ROBODOG</p><p>Not too late to come and hack with us, register at <a target="_blank" href="http://lu.ma/weavehacks">lu.ma/weavehacks</a> </p><p>Tools – Browsers grow brains</p><p>Perplexity’s <strong>Comet</strong> landed on my Mac and within ten minutes it was triaging my LinkedIn invites by itself. This isn’t a Chrome extension; it’s a Chromium fork where natural-language commands are first-class citizens. Tell it “find my oldest unread Stripe invoice and download the PDF” and watch the mouse move. The Gmail connector lets you ask, “what flights do I still need to expense?” and get a draft report. Think Cursor, but for every tab.</p><p>I <a target="_blank" href="https://x.com/altryne/status/1943012655817544142">benchmarked</a> Comet against OpenAI Operator on my “scroll Alex’s 200 tweet bookmarks, extract the juicy links, drop them into Notion” task—Operator died halfway, Comet almost finished. Almost. The AI browser war has begun; Chrome’s Mariner project and OpenAI’s rumored Chromium team better move fast. </p><p>Comet is available to Perplexity Max subscribers now, and will come to Pro subscribers with invites soon; as soon as I have them, I'll tell you how to get one! </p><p>Vision & Video</p><p>Reka dropped in with a double-whammy of announcements. First, they showcased <strong>Reka Vision</strong>, an agentic platform that can search, analyze, and even edit your video library using natural language. The demo of it automatically generating short-form social media reels from long videos was super impressive.</p><p>Then, in a surprise live reveal, they dropped <strong>Reka Flash 3.1</strong>, a new 21B parameter <em>open-source</em> multimodal model! 
It boasts great performance on coding and math benchmarks, including a 65% on AIME24. It was awesome to see them drop this right on the show.</p><p>We also saw LTX Video release three new open-source LoRAs for precise video control (Pose, Depth, and Canny), and Moonvalley launched <strong>Marey</strong>, a video model for filmmakers that's built exclusively on licensed, commercially-safe data—a first for the industry.</p><p>Veo 3 making talking pets</p><p>Google has released an update to Veo 3, allowing you to upload an image and have the characters in the image say what you want! It’s really cool for human-like generations, but it’s way more fun to animate… your pets! Here are two of the best doggos in Colorado presenting themselves! </p><p>The full prompt to create your own after you upload an image was: </p><p>Two dogs presenting themselves, the left one barking first and then saying "Hey, I'm George Washington Fox" and the right dog following up with a woof and then says "and I'm his younger brother, Dr Emmet Brown". </p><p>Then both are saying "we're good boys" and barking</p><p>Both should sound exiting with an american accent and a dog accent</p><p>Phew, what a week! From open source Breaking News from the folks who trained the models right on the podcast, to watch parties and Nazi LLMs, this has been one hell of a ride! </p><p>Next week, there are already rumors of a potential Gemini 3 release, the OpenAI open source model is rumored to be dropping, and I'm sure we'll get all kinds of incredible things lined up + it's going to be my birthday on Thursday so, looking forward! 
</p><p>See you next week 🫡</p><p>Show notes and Links</p><p>TL;DR of all topics covered:</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="http://x.com/@ryancarson">@ryancarson</a></p><p>* Guests</p><p>* Elie Bakouch - Training at Hugging Face (<a target="_blank" href="https://x.com/eliebakouch">@eliebakouch</a>)</p><p>* Maxime Labonne - Head of Post-training at Liquid AI (<a target="_blank" href="https://twitter.com/maximelabonne/status/1943295061275381864">@maximelabonne</a>), author of the <a target="_blank" href="https://github.com/mlabonne/llm-course">LLM-Course</a></p><p>* Mattia Atzeni - Member of Technical Staff @ Reka</p><p>* Meenal Nalwaya - Head of Product, Reka AI</p><p>* <strong>Open Source LLMs</strong></p><p>* HuggingFace - SmolLM3: SOTA, fully open-source 3B with dual-mode reasoning and long-context support (<a target="_blank" href="https://x.com/LoubnaBenAllal1/status/1942614508549333211">X</a>, <a target="_blank" href="https://huggingface.co/blog/smollm3">HF</a>)</p><p>* Liquid AI launches LFM2: the fastest, most efficient open-source edge LLMs yet (<a target="_blank" href="https://x.com/maximelabonne/thread/1943295061275381864">X</a>, <a target="_blank" href="https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38">HF</a>)</p><p>* Reachy Mini: Hugging Face and Pollen Robotics launch a $299 open-source desktop robot (<a target="_blank" href="https://x.com/Thom_Wolf/status/1942887160983466096">X</a>, <a target="_blank" href="https://huggingface.co/blog/reachy-mini">HF</a>)</p><p>* NextCoder-32B: Microsoft’s new
code-editing LLM rivals GPT-4o on complex code tasks (<a target="_blank" href="https://www.microsoft.com/en-us/research/publication/nextcoder-robust-adaptation-of-code-lms-to-diverse-code-edits/">Microsoft Research</a>, <a target="_blank" href="https://huggingface.co/microsoft/NextCoder-32B">HF</a>)</p><p>* Mistral AI updates Devstral Small 1.1 and Devstral Medium, setting new open-source coding agent benchmarks (<a target="_blank" href="https://x.com/MistralAI/status/1943316390863118716">X</a>, <a target="_blank" href="https://huggingface.co/mistralai/Devstral-Small-2507">HF</a>, <a target="_blank" href="https://mistral.ai/news/devstral-2507">Blog</a>)</p><p>* Reka updates Reka Flash 3.1 (<a target="_blank" href="https://huggingface.co/RekaAI/reka-flash-3.1">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 👑 Grok 4 Release: A Historic Leap from xAI - Grok 4 and Grok 4 Heavy (<a target="_blank" href="https://x.com">X</a>)</p><p>* Grok 3 is going nazi racing on X - MeinPrompt gate (<a target="_blank" href="https://x.com/altryne/status/1943077695178391572">X</a>)</p><p>* Gemini API Batch Mode launches with 50% cost savings for large-scale AI jobs (<a target="_blank" href="https://x.com/_philschmid/status/1942238040593699077">X</a>, <a target="_blank" href="https://developers.googleblog.com/en/scale-your-ai-workloads-batch-mode-gemini-api/">Google Blog</a>)</p><p>* <strong>This Week's Buzz</strong></p><p>* W&B Hackathon is nearing capacity - RoboDog is ready to be given out (<a target="_blank" href="https://lu.ma/weavehacks">lu.ma/weavehacks</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Reka Vision: Multimodal Agent for Visual Understanding and Search (<a target="_blank" href="https://x.com/RekaAILabs/status/1942621988390088771">Reka on X</a>, <a target="_blank" href="https://app.reka.ai/vision">Vision app</a>)</p><p>* LTX Video launches 3 open-source LoRAs for video control: Pose, Depth, Canny (<a target="_blank"
href="https://x.com/LTXStudio/status/1942604844449292614">LTX Studio on X</a>, <a target="_blank" href="https://github.com/Lightricks/LTX-Video">GitHub</a>, <a target="_blank" href="https://huggingface.co/Lightricks/LTX-Video">HF model</a>)</p><p>* Marey by Moonvalley: the first professional, licensed AI video tool built for creative control (<a target="_blank" href="https://x.com/moonvalley/status/1942570142430552163">Moonvalley on X</a>, <a target="_blank" href="https://www.moonvalley.com/marey">Product page</a>)</p><p>* <strong>Tools</strong></p><p>* Perplexity Launches Comet: The AI-Powered Browser for Modern Productivity (<a target="_blank" href="https://x.com/AravSrinivas/thread/1942968552727941477">X</a>, <a target="_blank" href="https://huggingface.co/perplexity-ai">HF</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jul-10-grok-4-and-4-heavy</link><guid isPermaLink="false">substack:post:168031500</guid><dc:creator><![CDATA[Alex Volkov, Maxime Labonne, Mattia Atzeni, and Elie Bakouch]]></dc:creator><pubDate>Fri, 11 Jul 2025 01:29:43 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/168031500/967a2f11af59d90c4111b66e70c5d1cf.mp3" length="79032108" type="audio/mpeg"/><itunes:author>Alex Volkov, Maxime Labonne, Mattia Atzeni, and Elie Bakouch</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6586</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/168031500/ad80affd3837311bcdd4196aca81e1c4.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jul 3 - ERNIE 4.5, Hunyuan A13B, MAI-DxO outperforms doctors, RL beats SWE bench, Zuck MSL hiring spree & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Welcome 
back to another mind-blowing week on ThursdAI! We’re diving into the first show of the second half of 2025, and let me tell you, AI is not slowing down. This week, we’ve got a massive wave of open-source models from Chinese giants like Baidu and Tencent that are shaking up the game, Meta’s jaw-dropping hiring spree with Zuck assembling an AI dream team, and Microsoft’s medical AI outperforming doctors on the toughest cases. Plus, a real-time AI game engine that had me geeking out on stream. Buckle up, folks, because we’ve got a lot to unpack!</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>We had incredible guests like Michael Luo from Agentica, dropping knowledge on RL coding agents, and Ivan Burazin from Daytona, revealing the infrastructure powering the agent era. It was an incredible episode, with over 8,000 views for the live show (as always, links and show notes are at the end, and the YT live video is here for your convenience if you'd prefer watching). </p><p>Open Source AI & LLMs: The Chinese Powerhouse Wave</p><p>Man, if there’s one takeaway from this week, it’s that Chinese companies are absolutely dominating the open-source LLM scene. Let’s break down the heavy hitters that dropped this week and why they’ve got everyone talking.</p><p>Baidu’s ERNIE 4.5: A Suite of 10 Models to Rule Them All</p><p>Baidu, a giant in the Chinese tech space, just flipped the script by open-sourcing their ERNIE 4.5 series. We’re talking 10 distinct models ranging from a whopping 424 billion parameters down to a tiny 0.3 billion. With an Apache 2.0 license, 128K context window, and multimodal capabilities handling image, video, and text input, this is a massive drop.
Their biggest Mixture-of-Experts (MoE) model, with 47B active parameters, even outshines OpenAI’s o1 on visual knowledge tasks like DocVQA, scoring 93% compared to o1’s 81%! </p><p>What’s wild to me is Baidu’s shift. They’ve been running ERNIE in production for years—think chatbots and more across their ecosystem—but they weren’t always open-source fans. Now, they’re not just joining the party, they’re hosting it. If you’re into tinkering, this is your playground—check it out on Hugging Face (<a target="_blank" href="https://huggingface.co/baidu">HF</a>) or dive into their technical paper (<a target="_blank" href="https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf">Paper</a>).</p><p>Tencent’s Hunyuan-A13B-Instruct: WizardLM Team Strikes Again</p><p>Next up, Tencent dropped Hunyuan-A13B-Instruct, and oh boy, does it have a backstory. This 80B parameter MoE model (13B active at inference) comes from the legendary WizardLM team, poached from Microsoft after a messy saga where their killer models got yanked from the internet over “safety concerns.” I remember the frustration—we were all hyped, then bam, gone. Now, under Tencent’s wing, they’ve cooked up a model with a 256K context window, hybrid fast-and-slow reasoning modes, and benchmarks that rival DeepSeek R1 and OpenAI o1 on agentic tasks. It scores an impressive 87% on AIME 2024, though it dips to 76% on 2025, hinting at some overfitting quirks. Still, for a model with 13B active parameters, this is all VERY impressive.</p><p>Here’s the catch—the license. It excludes commercial use in the EU, UK, and South Korea, and bans usage if you’ve got over 100M active users. So, not as open as we’d like, but for its size, it’s a beast that fits on a single machine, making it a practical choice for many. They’ve also released two datasets, ArtifactsBench and C3-Bench, for code and agent evaluation.
I’m not sold on the name—Hunyuan doesn’t roll off the tongue for Western markets—but the WizardLM pedigree means it’s worth a look. Try it out on Hugging Face (<a target="_blank" href="https://huggingface.co/tencent/Hunyuan-A13B-Instruct">HF</a>) or test it directly (<a target="_blank" href="https://hunyuan.tencent.com/">Try It</a>).</p><p>Huawei’s Pangu Pro MoE: Sidestepping Sanctions with Ascend NPUs</p><p>Huawei entered the fray with Pangu Pro MoE, a 72B parameter model with 16B active per token, and here’s what got me hyped—it’s trained entirely on their own Ascend NPUs, not Nvidia or AMD hardware. This is a bold move to bypass US sanctions, using 4,000 of these chips to preprocess 13 trillion tokens. The result? Up to 1,528 tokens per second per card with speculative decoding, outpacing dense models in speed and cost-efficiency. Performance-wise, it’s close to DeepSeek and Qwen, making it a contender for those outside the Nvidia ecosystem.</p><p>I’m intrigued by the geopolitical angle here. Huawei’s proving you don’t need Western tech to build frontier models, and while we don’t know who’s got access to these Ascend NPUs, it’s likely a game-changer for Chinese firms. Licensing isn’t as permissive as MIT or Apache, but it’s still open-weight. Peek at it on Hugging Face (<a target="_blank" href="https://huggingface.co/IntervitensInc/pangu-pro-moe-model">HF</a>) for more details.</p><p>DeepSWE-Preview: RL Coding Agent Hits 59% on SWE-Bench</p><p>Switching gears, I was blown away chatting with Michael Luo from Agentica about DeepSWE-Preview, an open-source coding agent trained with reinforcement learning (RL) on Qwen3-32B. This thing scored a stellar 59% on SWE-Bench-Verified (42.2% Pass@1, 71% Pass@16), one of the top open-weight results out there. What’s cool is they did this without distilling from proprietary giants like Claude—just pure RL over six days on 64 H100 GPUs. 
Michael shared how RL is surging because pre-training hits data limits, and DeepSWE learned emergent behaviors like paranoia, double-checking edge cases to avoid shaky fixes.</p><p>This underdog story of academic researchers breaking benchmarks with limited resources is inspiring. They’ve open-sourced everything—code, data, logs—making it a goldmine for the community. I’m rooting for them to get more compute to push to even higher scores. Dive into the details on their blog (<a target="_blank" href="https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33">Notion</a>) or check the model on Hugging Face (<a target="_blank" href="https://huggingface.co/Agentica/DeepSWE-Preview">HF Model</a>).</p><p>This Week’s Buzz from Weights & Biases: Come Hack with Us! 🔥</p><p>As always, I’ve got some exciting news from Weights & Biases to share. We’re hosting the first of our WeaveHacks hackathons in San Francisco on July 12-13. It’s all about agent protocols like MCP and A2A, and I’m stoked to see you guys in person—come say hi for a high-five! We’ve got cool prizes, including a custom W&B RoboDog that’s been a conference hit, plus $13-14K in cash. Spots are filling fast, so register now and we'll let you in (<a target="_blank" href="http://lu.ma/weavehacks">Sign Up</a>).</p><p>We’re also rolling out Online Evaluations in Weave, letting you monitor LLM apps live with judge agents on production data—super handy for catching hiccups. And our inference service via CoreWeave GPUs offers free credits for open-source model testing. Want in or curious about Weave’s tracing tools? Reach out to me anywhere, and I’ll hook you up. Can’t wait to demo this next week!</p><p>Big Companies & APIs: AI’s NBA Draft and Medical Marvels</p><p>Shifting to the big players, this week felt like an AI sports season with blockbuster hires and game-changing releases.
From Meta’s talent poaching to Microsoft’s medical breakthroughs, let’s unpack the drama and innovation.</p><p>Meta Superintelligence Labs: Zuck’s Dream Team Draft</p><p>Imagine an AI NBA draft—that’s what Meta’s up to with their new Superintelligence Labs (MSL). Led by Alex Wang (formerly of Scale AI) and Nat Friedman (ex-GitHub CEO), MSL is Zuck’s power move after Llama 4’s lukewarm reception. They’ve poached up to 10 key researchers from OpenAI, including folks behind GPT-4’s image generation and o1’s foundations, with comp packages rumored at $100M for the first year and up to $300M over four years. That’s more than many Meta execs or even Tim Cook’s salary! They’ve also snagged talent from Google DeepMind and even tried to acquire Ilya Sutskever’s SSI outright (to which he said he's flattered, but no). </p><p>This is brute force at its finest, and I’m joking that I didn’t get a $100M offer myself—ThursdAI’s still waiting for that email, Zuck! OpenAI’s Sam Altman fired back with “missionaries beat mercenaries,” hinting at a culture clash, while Mark Chen felt like Meta “broke into their house and took something.” It’s war, folks, and I’m hyped to see if MSL delivers a Llama that crushes it. With FAIR and GenAI folding under this new crack team of 50, plus Meta’s GPU arsenal, the stakes are sky-high.</p><p>If you'd like to see the list of "mercenaries" worth over $100M, you can see who they are and their achievements <a target="_blank" href="https://docs.google.com/spreadsheets/d/1qX7_VK8vN2v2urpiBY_we-FNz2PS3ZKsWp_9kXCyQB0/edit?usp=sharing">here</a>.</p><p>Cursor’s Killer Hires and Web Expansion</p><p>Speaking of talent wars, Cursor (built by AnySphere) just pulled off a stunner by hiring Boris Cherny and Cat Wu, key creators of Claude Code, as Chief Architect and Head of Product. This skyrockets Cursor’s cred in code generation, and I’m not surprised—Claude Code was a side project that exploded, and now Cursor’s got the brains behind it.
On top of that, they’ve rolled out AI coding agents to web and mobile, even integrating with Slack. No more being tied to your desktop—launch, monitor, and collab on code tasks anywhere.</p><p>The lines between native and web tools are blurring fast, and Cursor’s leading the charge. I haven’t tested the Slack bit yet, but if you have, hit me up in the comments. This, plus their recent $20M raise, shows they’re playing to win. Learn more at (<a target="_blank" href="https://cursor.com/agents">Cursor</a>).</p><p>Microsoft MAI-DxO: AI Diagnoses Better Than Doctors</p><p>Now, onto something that hits close to home for me—Microsoft’s MAI-DxO, an AI system that’s outdiagnosing doctors on open-ended medical cases. On 304 of the toughest New England Journal of Medicine cases, it scored 85.5% accuracy, over four times the 20% rate of experienced physicians. I’ve had my share of frustrating medical waits, and seeing AI step in as a tool for doctors—not a replacement—gets me excited for the future.</p><p>It’s an orchestration of models simulating a virtual clinician panel, asking follow-up questions, ordering tests, and even factoring in cost controls for diagnostics. This isn’t just acing multiple-choice; it handles real-world ambiguity. My co-host Yam and I stressed—don’t skip your doctor for ChatGPT, but expect your doc to be AI-superpowered soon. Read more on Microsoft’s blog (<a target="_blank" href="https://microsoft.ai/new/the-path-to-medical-superintelligence/">Blog</a>).</p><p>Cloudflare’s One-Click AI Bot Block: Protecting the Internet</p><p>Cloudflare made waves with a one-click feature to block AI bots and scrapers, available to all customers, even free-tier ones.
With bots like Bytespider and GPTBot hitting nearly 40% of top sites, but only 3% blocking them, this addresses a huge shift. I’m with the CEO here—the old internet deal was Google scraping for traffic; now, AI summaries keep users from clicking through, breaking monetization for creators. Yam suggested a global license for training data with royalties, and I’m curious if that’s the future. For now, Cloudflare’s ML detects even sneaky bots spoofing as browsers. Big move—check their announcement (<a target="_blank" href="https://x.com/Cloudflare/status/1939988601976021156">X</a>) and the cool website <a target="_blank" href="http://goodaibots.com">goodaibots.com</a> </p><p>Cypher Alpha: Mystery 1M Context Model on OpenRouter</p><p>Lastly, a mysterious 1M context model, Cypher Alpha, popped up on OpenRouter for free testing (with data logging). It’s fast at 70 tokens/sec, low latency, but not a reasoning model—refusals on basic queries stumped me. Speculation points to Amazon Titan, which would be a surprise entry. I’m intrigued by who’s behind this—Gemini, OpenAI, and Qwen hit 1M context, but Amazon? Let’s see. Try it yourself (<a target="_blank" href="https://openrouter.ai/openrouter/cypher-alpha:free/providers">Link</a>).</p><p>Vision & Video: Mirage’s AI-Native Game Engine Blows Minds 🤯</p><p>Okay, folks, I’ve gotta geek out here.  Dynamics Lab unveiled the world’s first AI-native user-generated content (UGC) game engine, live with playable demos like a GTA-style “Urban Chaos” and a racing “Coastal Drift.” Running at 16 frames per second, it generates photorealistic worlds in real-time via natural language or controller input. You can jump, run, fight, or drive, and even upload an image to spawn a new game environment on the fly.</p><p>What’s nuts is there’s no pre-built game behind this—it’s infinite, custom content created as you play. 
I was floored showing this on stream; it’s obviously not perfect with clipping and delays, but we’re witnessing the dawn of personalized gaming. You gotta try this—head to their site for the demos (<a target="_blank" href="https://blog.dynamicslab.ai/">Playable Demo</a>).</p><p>This brings us even closer to the "every pixel will be generated" dream of Jensen Huang.</p><p>Voice & Audio: TTS Gets Real with Kyutai and Qwen</p><p>This week brought fresh text-to-speech (TTS) updates that hint at smarter conversational AI down the line. Kyutai TTS, from the French team behind Moshi, dropped with ultra-low latency (220ms first-token) and high speaker similarity (77.1% English, 78.7% French), plus a word error rate of just 2.82% in English. It’s production-ready with a Rust server and voice cloning from a 10-second clip—perfect for LLM-integrated apps. Check it out (<a target="_blank" href="https://x.com/kyutai_labs/status/1940767331921416302">X Announcement</a>, <a target="_blank" href="https://huggingface.co/kyutai/tts-1.6b-en_fr">HF Model</a>).</p><p>Qwen-TTS from Alibaba also launched, focusing on Chinese dialects like Pekingese and Shanghainese, but with English support too. It’s got human-level naturalness via API, though less relevant for our English audience. Still, it’s a solid step—see more (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1939553252166836457">X Post</a>). Both are pieces of the puzzle for richer virtual interactions, and I’m pumped to see where this goes.</p><p>Infrastructure for Agents: Daytona’s Sandbox Revolution</p><p>I’m thrilled to have chatted with Ivan Burazin from Daytona, a cloud provider delivering agent-native runtimes—or sandboxes—that give agents their own computers for tasks like code execution or data analysis. They’ve hit over $1M in annualized run rate just two months post-launch, with 15,000 signups and 1,500 credit cards on file.
That’s insane growth for infrastructure, which usually ramps slowly due to integration delays.</p><p>Why’s this hot? 2025 is the year of agents, and as Ivan shared, even OpenAI and Anthropic recently redefined agents as needing runtimes. From YC’s latest batch (37% building agents) to Cursor’s web move, every task may soon spin up a sandbox. Daytona’s “stateful serverless” tech spins fast, lasts long, and scales across regions like the US, UK, Germany, and India, addressing latency and GDPR needs. If you’re building agents, this is your unsung hero—explore it at (<a target="_blank" href="https://github.com/DaytonaIO">Daytona IO</a>) and grab $200 in credits, or up to $50K for startups (<a target="_blank" href="https://daytona.io/startups">Startups</a>).</p><p>Wrapping Up: AI’s Relentless Pace</p><p>What a week, folks! From Chinese open-source titans like ERNIE 4.5 and Hunyuan-A13B redefining accessibility, to Meta’s blockbuster hires signaling an AI arms race, and Microsoft’s MAI-DxO paving the way for smarter healthcare, we’re witnessing AI’s relentless acceleration. Mirage’s game engine and Daytona’s sandboxes remind us that creativity and infrastructure are just as critical as models themselves. I’m buzzing with anticipation for what’s next—will Meta’s dream team deliver? Will agents redefine every app? Stick with ThursdAI to find out. 
See you next week for more!</p><p>TL;DR and Show Notes</p><p>Here’s the quick rundown of everything we covered this week, packed with links to dive deeper:</p><p>* <strong>Show Notes & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* <strong>Co-Hosts</strong> - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/@nisten">@nisten</a>, <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Guests</strong> - Ivan Burazin (Daytona), Michael Luo (Agentica)</p><p>* <strong>Open Source LLMs</strong></p><p>* Baidu’s ERNIE 4.5 Series - 10 models, 424B to 0.3B, multimodal, beats o1 on DocVQA (<a target="_blank" href="https://x.com/Baidu_Inc/status/1939724778157511126">X</a>, <a target="_blank" href="https://huggingface.co/baidu">HF</a>, <a target="_blank" href="https://ernie.baidu.com/blog/publication/ERNIE_Technical_Report.pdf">Paper</a>)</p><p>* Tencent’s Hunyuan-A13B-Instruct - 80B total, 13B active, 256K context, WizardLM legacy (<a target="_blank" href="https://x.com/TencentHunyuan/status/1938525874904801490">X</a>, <a target="_blank" href="https://huggingface.co/tencent/Hunyuan-A13B-Instruct">HF</a>, <a target="_blank" href="https://hunyuan.tencent.com/">Try It</a>)</p><p>* Huawei’s Pangu Pro MoE - 72B, trained on Ascend NPUs, 1,528 tokens/sec (<a target="_blank" href="https://x.com/search?q=pangu%20pro&#38;src=typed_query">X</a>, <a target="_blank" href="https://huggingface.co/IntervitensInc/pangu-pro-moe-model">HF</a>)</p><p>* DeepSWE-Preview - RL agent, 59% SWE-Bench-Verified on Qwen3-32B (<a target="_blank" href="https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33">Notion</a>, <a target="_blank" 
href="https://huggingface.co/Agentica/DeepSWE-Preview">HF Model</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Weights & Biases Weavehacks Hackathon - SF, July 12-13, agent protocols focus (<a target="_blank" href="http://lu.ma/weavehacks">Sign Up</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Meta Superintelligence Labs (MSL) - Zuck hires dream team, up to $300M comp packages from OpenAI talent (<a target="_blank" href="https://docs.google.com/spreadsheets/d/1qX7_VK8vN2v2urpiBY_we-FNz2PS3ZKsWp_9kXCyQB0/edit?usp=sharing">list</a>)</p><p>* Cursor - Hires Claude Code creators, web/mobile agents with Slack (<a target="_blank" href="https://cursor.com/agents">Cursor</a>, <a target="_blank" href="https://huggingface.co/spaces/cursor">HF</a>)</p><p>* Microsoft MAI-DxO - 85.5% accuracy on NEJM cases vs. 20% for doctors (<a target="_blank" href="https://x.com/mustafasuleyman/status/1939670330332868696">X</a>, <a target="_blank" href="https://microsoft.ai/new/the-path-to-medical-superintelligence/">Blog</a>)</p><p>* Cloudflare - One-click AI bot blocking, tackles scraping economics (<a target="_blank" href="https://x.com/Cloudflare/status/1939988601976021156">X</a>)</p><p>* Cypher Alpha - Mystery 1M context model, possibly Amazon Titan (<a target="_blank" href="https://openrouter.ai/openrouter/cypher-alpha:free/providers">Link</a>)</p><p>* Gemini Pro 2.5 - Returned to Google’s free tier</p><p>* <strong>Vision & Video</strong></p><p>* Mirage - AI-native UGC game engine, real-time photorealistic demos (<a target="_blank" href="https://blog.dynamicslab.ai/">Playable Demo</a>)</p><p>* Workflow - Restyle videos with Flux Kontext and Luma Modify (<a target="_blank" href="https://x.com/lucataco93/status/1940113275221344566">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Kyutai TTS - Low-latency, high similarity in EN/FR (<a target="_blank" href="https://x.com/kyutai_labs/status/1940767331921416302">X</a>, <a target="_blank" 
href="https://huggingface.co/kyutai/tts-1.6b-en_fr">HF</a>)</p><p>* Qwen-TTS - Bilingual Chinese/English, human-level naturalness (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1939553252166836457">X</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen-TTS">HF</a>)</p><p>* <strong>Infrastructure</strong></p><p>* Daytona - Agent-native sandboxes, $1M run rate in 2 months (<a target="_blank" href="https://github.com/DaytonaIO">GitHub</a>, <a target="_blank" href="https://daytona.io/startups">Startups</a>)</p><p>* <strong>Tools</strong></p><p>* Chai Discovery’s Chai-2 - Zero-shot antibody design (<a target="_blank" href="https://www.chaidiscovery.com/news/introducing-chai-2">Chai Discovery</a>)</p><p>Thanks for reading all the way through ThursdAI, folks! Share this with friends to spread the AI love, and I’ll catch you next week for more!</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jul-3-ernie-45-hunyuan-a13b</link><guid isPermaLink="false">substack:post:167472903</guid><dc:creator><![CDATA[Alex Volkov, Ivan Burazin, and Michael Luo]]></dc:creator><pubDate>Thu, 03 Jul 2025 21:22:33 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/167472903/57babd452444cfa1f63c999b4e5735af.mp3" length="69319891" type="audio/mpeg"/><itunes:author>Alex Volkov, Ivan Burazin, and Michael Luo</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5776</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/167472903/8e37fe98ac9bbb3146f2fd8490f1ea7a.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Jun 26 - Gemini CLI, Flux Kontext Dev, Search Live, Anthropic destroys books, Zucks superintelligent team & more AI news 
]]></title><description><![CDATA[<p>Hey folks, Alex here, writing from... an undisclosed tropical paradise location 🏝️ I'm on vacation, but the AI news doesn't stop of course, and neither does ThursdAI. So huge shoutout to Wolfram Ravenwlf for running the show this week, and to Nisten, LDJ and Yam who joined. </p><p>So... no long blogpost with analysis this week, but I'll definitely recommend tuning in to the show the folks ran; they had a few guests on, and even got some breaking news (a new Flux Kontext that's open source). </p><p>Of course, many of you are readers and are here for the links, so I'm including the raw TL;DR + speaker notes as prepared by the folks for the show! </p><p>P.S. - our (rescheduled) hackathon, WeaveHacks, is coming up in San Francisco on July 12-13. If you're interested in a chance to win a RoboDog, you're welcome to join us and give it a try. Register <a target="_blank" href="https://lu.ma/weavehacks">HERE</a></p><p>Ok, that's it for this week, please enjoy the show and see you next week!
</p><p>ThursdAI - June 26th, 2025 - TL;DR</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>WolframRvnwlf</strong> - Host (<a target="_blank" href="http://x.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/nisten">@nisten</a>, <a target="_blank" href="http://x.com/ldjconfirmed">@ldjconfirmed</a></p><p>* Guest - <strong>Jason Kneen</strong> (<a target="_blank" href="http://x.com/jasonkneen">@jasonkneen</a>) - Discussing MCPs, coding tools, and agents</p><p>* Guest - <strong>Hrishioa</strong> (<a target="_blank" href="http://x.com/hrishioa">@hrishioa</a>) - Discussing agentic coding and spec-driven development</p><p>* <strong>Open Source LLMs</strong></p><p>* Mistral Small 3.2 released with improved instruction following, reduced repetition & better function calling (<a target="_blank" href="https://x.com/MistralAI/status/1936093325116781016">X</a>)</p><p>* Unsloth AI releases dynamic GGUFs with fixed chat templates (<a target="_blank" href="https://x.com/UnslothAI/status/1936426567850487925">X</a>)</p><p>* Kimi-VL-A3B-Thinking-2506 multimodal model updated for better video reasoning and higher resolution (<a target="_blank" href="https://huggingface.co/blog/moonshotai/kimi-vl-a3b-thinking-2506">Blog</a>)</p><p>* Chinese Academy of Science releases Stream-Omni, a new Any-to-Any model for unified multimodal input (<a target="_blank" href="https://huggingface.co/ICTNLP/stream-omni-8b">HF</a>, <a target="_blank" href="https://huggingface.co/papers/2506.13642">Paper</a>)</p><p>* Prime Intellect launches SYNTHETIC-2, an open reasoning dataset and synthetic data generation platform (<a target="_blank" href="https://x.com/PrimeIntellect/status/1937272174295023951">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* <strong>Google</strong></p><p>* Gemini CLI, a new open-source AI agent, brings Gemini 2.5 Pro to your terminal (<a target="_blank" 
href="https://web.archive.org/web/20250625051706/https://blog.google/technology/developers/introducing-gemini-cli/">Blog</a>, <a target="_blank" href="https://github.com/google-gemini/gemini-cli">GitHub</a>)</p><p>* Google reduces free tier API limits for previous generation Gemini Flash models (<a target="_blank" href="https://x.com/ai_for_success/status/1937493142279971210">X</a>)</p><p>* Search Live with voice conversation is now rolling out in AI Mode in the US (<a target="_blank" href="https://blog.google/products/search/search-live-ai-mode/">Blog</a>, <a target="_blank" href="https://x.com/rajanpatel/status/1935484294182608954">X</a>)</p><p>* Gemini API is now faster for video and PDF processing with improved caching (<a target="_blank" href="https://ai.google.dev/gemini-api/docs/caching">Docs</a>)</p><p>* <strong>Anthropic</strong></p><p>* Claude introduces an "artifacts" space for building, hosting, and sharing AI-powered apps (<a target="_blank" href="https://x.com/AnthropicAI/status/1937921801000219041">X</a>)</p><p>* Federal judge rules Anthropic's use of books for training Claude qualifies as fair use (<a target="_blank" href="https://x.com/ai_for_success/status/1937515997076029449">X</a>)</p><p>* <strong>xAI</strong></p><p>* Elon Musk announces the successful launch of Tesla's Robotaxi (<a target="_blank" href="https://x.com/elonmusk/status/1936876178356490546">X</a>)</p><p>* <strong>Microsoft</strong></p><p>* Introduces Mu, a new language model powering the agent in Windows Settings (<a target="_blank" href="https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/">Blog</a>)</p><p>* <strong>Meta</strong></p><p>* Report: Meta pursued acquiring Ilya Sutskever's SSI, now hires co-founders Nat Friedman and Daniel Gross (<a target="_blank" href="https://x.com/kimmonismus/status/1935954015998624181">X</a>)</p><p>* <strong>OpenAI</strong></p><p>* OpenAI removes mentions of its 
acquisition of Jony Ive's startup 'io' amid a trademark dispute (<a target="_blank" href="https://x.com/rowancheung/status/1937414172322439439">X</a>)</p><p>* OpenAI announces the release of DeepResearch in API + Webhook support (<a target="_blank" href="https://x.com/stevendcoffey/status/1938286946075418784">X</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Alex is on vacation; WolframRvnwlf is attending AI Tinkerers Munich on July 25 (<a target="_blank" href="https://munich.aitinkerers.org/p/ai-tinkerers-munich-july-25">Event</a>)</p><p>* Join the W&B Hackathon happening in 2 weeks in San Francisco - grand prize is a RoboDog! (Register <a target="_blank" href="https://lu.ma/weavehacks">for Free</a>)</p><p>* <strong>Vision & Video</strong></p><p>* MeiGen-MultiTalk code and checkpoints for multi-person talking head generation are released (<a target="_blank" href="https://github.com/MeiGen-AI/MultiTalk">GitHub</a>, <a target="_blank" href="https://huggingface.co/MeiGen-AI/MeiGen-MultiTalk">HF</a>)</p><p>* Google releases VideoPrism for generating adaptable video embeddings for various tasks (<a target="_blank" href="https://hf.co/google/videoprism">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2402.13217">Paper</a>, <a target="_blank" href="https://github.com/google-deepmind/videoprism">GitHub</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* ElevenLabs launches <a target="_blank" href="http://11.ai">11.ai</a>, a voice-first personal assistant with MCP support (<a target="_blank" href="http://11.ai/">Sign Up</a>, <a target="_blank" href="https://x.com/elevenlabsio/status/1937200086515097939">X</a>)</p><p>* Google Magenta releases Magenta RealTime, an open weights model for real-time music generation (<a target="_blank" href="https://colab.research.google.com/github/magenta/magenta-realtime/blob/main/notebooks/Magenta_RT_Demo.ipynb">Colab</a>, <a target="_blank" href="https://g.co/magenta/rt">Blog</a>)</p><p>* ElevenLabs launches a mobile app for iOS and 
Android for on-the-go voice generation (<a target="_blank" href="https://x.com/elevenlabsio/status/1937541389140611367">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Google rolls out Imagen 4 and Imagen 4 Ultra in the Gemini API and Google AI Studio (<a target="_blank" href="https://developers.googleblog.com/en/imagen-4-now-available-in-the-gemini-api-and-google-ai-studio/">Blog</a>)</p><p>* OmniGen 2 open weights model for enhanced image generation and editing is released (<a target="_blank" href="https://vectorspacelab.github.io/OmniGen2/">Project Page</a>, <a target="_blank" href="https://huggingface.co/spaces/OmniGen2/OmniGen2">Demo</a>, <a target="_blank" href="https://huggingface.co/papers/2506.18871">Paper</a>)</p><p>* <strong>Tools</strong></p><p>* OpenMemory Chrome Extension provides shared memory across ChatGPT, Claude, Gemini and more (<a target="_blank" href="https://x.com/taranjeetio/status/1937537163270451494">X</a>)</p><p>* LM Studio adds MCP support to connect local LLMs with your favorite servers (<a target="_blank" href="https://lmstudio.ai/blog/mcp">Blog</a>)</p><p>* Cursor is now available as a Slack integration (<a target="_blank" href="http://cursor.com/dashboard">Dashboard</a>)</p><p>* All Hands AI releases the OpenHands CLI, a model-agnostic, open-source coding agent (<a target="_blank" href="https://all-hands.dev/blog/the-openhands-cli-ai-powered-development-in-your-terminal">Blog</a>, <a target="_blank" href="https://docs.all-hands.dev/usage/how-to/cli-mode#cli">Docs</a>)</p><p>* Warp 2.0 launches as an Agentic Development Environment with multi-threading (<a target="_blank" href="https://x.com/warpdotdev/status/1937525185843752969">X</a>)</p><p>* <strong>Studies and Others</strong></p><p>* The /r/LocalLLaMA subreddit is back online after a brief moderation issue (<a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/1ljlr5b/subreddit_back_in_business/">Reddit</a>, <a target="_blank" 
href="https://x.com/localllamasub">News</a>)</p><p>* Andrej Karpathy's talk "Software 3.0" discusses the future of programming in the age of AI (<a target="_blank" href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">YouTube</a>, <a target="_blank" href="https://www.latent.space/p/s3">Summary</a>)</p><p>Thank you, see you next week! </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jun-26-gemini-cli-flux-kontext</link><guid isPermaLink="false">substack:post:166925628</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 26 Jun 2025 20:36:45 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/166925628/b8c0a6d84609c2fb70c2fb5e12990764.mp3" length="71751410" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5979</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/166925628/ef8761733063aa09bcb83c1eae186b7e.jpg"/></item><item><title><![CDATA[📆 ThursdAI - June 19 - MiniMax M1 beats R1, OpenAI records your meetings, Gemini in GA, W&B uses Coreweave GPUs & more AI news]]></title><description><![CDATA[<p>Hey all, Alex here 👋</p><p>This week, while not the busiest week in releases (we can't get a SOTA LLM every week now can we), was full of interesting open source releases, and feature updates such as the chatGPT meetings recorder (which we live tested on the show, the limit is 2 hours!)</p><p>It was also a day after our annual W&B conference called FullyConnected, and so I had a few goodies to share with you, like answering the main question, when will W&B have some use of those GPUs from CoreWeave, the answer is... now! 
(We launched a brand new preview of an inference service with open source models)</p><p>And finally, we had a great chat with Pankaj Gupta, co-founder and CEO of Yupp, a new service that lets users chat with the top AIs for free, while turning their votes into leaderboards for everyone else to understand which Gen AI model is best for which task/topic. It was a great conversation, and he even shared an invite code with all of us (I'll attach to the TL;DR and show notes, let's dive in!)</p><p>00:00 Introduction and Welcome</p><p>01:04 Show Overview and Audience Interaction</p><p>01:49 Special Guest Announcement and Experiment</p><p>03:05 Wolfram's Background and Upcoming Hosting</p><p>04:42 TLDR: This Week's Highlights</p><p>15:38 Open Source AI Releases</p><p>32:34 Big Companies and APIs</p><p>32:45 Google's Gemini Updates</p><p>42:25 OpenAI's Latest Features</p><p>54:30 Exciting Updates from Weights & Biases</p><p>56:42 Introduction to Weights & Biases Inference Service</p><p>57:41 Exploring the New Inference Playground</p><p>58:44 User Questions and Model Recommendations</p><p>59:44 Deep Dive into Model Evaluations</p><p>01:05:55 Announcing Online Evaluations via Weave</p><p>01:09:05 Introducing Pankaj Gupta from <a target="_blank" href="http://YUP.AI">YUP.AI</a></p><p>01:10:23 <a target="_blank" href="http://YUP.AI">YUP.AI</a>: A New Platform for Model Evaluations</p><p>01:13:05 Discussion on Crowdsourced Evaluations</p><p>01:27:11 New Developments in Video Models</p><p>01:36:23 OpenAI's New Transcription Service</p><p>01:39:48 Show Wrap-Up and Future Plans</p><p></p><p>Here's the TL;DR and show notes links</p><p>ThursdAI - June 19th, 2025 - TL;DR</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" 
href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Guest - <a target="_blank" href="http://x.com/@pankaj">@pankaj</a> - co-founder of <a target="_blank" href="https://yupp.ai/join/thursdAI">Yupp.ai</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Moonshot AI open-sourced Kimi-Dev-72B (<a target="_blank" href="https://github.com/MoonshotAI/Kimi-Dev?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://huggingface.co/moonshotai/Kimi-Dev-72B">HF</a>)</p><p>* MiniMax-M1 456B (45B Active) - reasoning model (<a target="_blank" href="https://arxiv.org/abs/2506.13585">Paper</a>, <a target="_blank" href="https://huggingface.co/MiniMaxAI/MiniMax-M1-40k">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/MiniMaxAI/MiniMax-M1">Try It</a>, <a target="_blank" href="https://github.com/MiniMax-AI/MiniMax-M1">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google drops Gemini 2.5 Pro/Flash GA, 2.5 Flash-Lite in Preview (<a target="_blank" href="https://blog.google/products/gemini/gemini-2-5-model-family-expands/">Blog</a>, <a target="_blank" href="https://storage.googleapis.com/gemini-technical-report">Tech report</a>, <a target="_blank" href="https://x.com/google/status/192905415">Tweet</a>)</p><p>* Google launches Search Live: Talk, listen and explore in real time with AI Mode (<a target="_blank" href="https://blog.google/products/search/search-live-ai-mode/">Blog</a>)</p><p>* OpenAI adds MCP support to Deep Research in chatGPT (<a target="_blank" href="https://x.com/altryne/status/1934644274227769431">X</a>, <a target="_blank" href="https://platform.openai.com/docs/mcp">Docs</a>)</p><p>* OpenAI launches their meetings recorder in the Mac app (<a target="_blank" href="https://help.openai.com/en/articles/11487532-chatgpt-record">docs</a>)</p><p>* Zuck update: Considering bringing Nat Friedman and Daniel Gross to Meta 
(<a target="_blank" href="https://x.com/amir/status/1935461177045516568">information</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* NEW! W&B Inference provides a unified interface to access and run top open-source AI models (<a target="_blank" href="https://wandb.ai/inference">inference</a>, <a target="_blank" href="https://weave-docs.wandb.ai/guides/integrations/inference/">docs</a>)</p><p>* NEW! W&B Weave Online Evaluations delivers real-time production insights and continuous evaluation for AI agents across any cloud. (<a target="_blank" href="https://x.com/altryne/status/1935412384283107572">X</a>)</p><p>* The new platform offers "metal-to-token" observability, linking hardware performance directly to application-level metrics.</p><p>* <strong>Vision & Video</strong></p><p>* ByteDance's new video model beats VEO3 - Seedance 1.0 mini (<a target="_blank" href="https://dreamina.capcut.com/ai-tool/video/generate">Site</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/bytedance/seedance/v1/lite/image-to-video">FAL</a>)</p><p>* MiniMax Hailuo 02 - 1080p native, SOTA instruction following (<a target="_blank" href="https://www.minimax.io/news/minimax-hailuo-02">X</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/minimax/hailuo-02/pro/image-to-video">FAL</a>)</p><p>* Midjourney video is also here - great visuals (<a target="_blank" href="https://x.com/angrypenguinPNG/status/1932931137179176960">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Kyutai launches open-source, high-throughput streaming Speech-To-Text models for real-time applications (<a target="_blank" href="https://x.com/kyutai_labs/thread/1935652243119788111">X</a>, <a target="_blank" href="https://join.yupp.ai/thursdai">website</a>)</p><p>* <strong>Studies and Others</strong></p><p>* LLMs Flunk Real-World Coding Contests, Exposing a Major Skill Gap (<a target="_blank" href="https://arxiv.org/pdf/2506.11928">Arxiv</a>)</p><p>* MIT Study: ChatGPT Use Causes Sharp Cognitive Decline (<a target="_blank" href="https://arxiv.org/abs/2506.08872">Arxiv</a>)</p><p>* Andrej Karpathy's "Software 3.0": The Dawn of English as a Programming Language (<a target="_blank" href="https://www.youtube.com/watch?v=LCEmiRjPEtQ">youtube</a>, <a target="_blank" href="https://drive.google.com/file/d/1HIEMdVlzCxke22ISVzornd2-UpWHngRZ/view?usp=sharing">deck</a>)</p><p>* <strong>Tools</strong></p><p>* Yupp launches with 500+ AI models, a new leaderboard, and a user-powered feedback economy - use <a target="_blank" href="https://yupp.ai/join/thursdAI">thursdai link</a>* to get 50% extra credits</p><p>* BrowserBase announces <a target="_blank" href="http://director.ai">director.ai</a> - an agent to run things on the web</p><p>* Universal system prompt for reduction of hallucination (from <a target="_blank" href="https://www.reddit.com/r/PromptEngineering/comments/1kup28y/chatgpt_and_gemini_ai_will_gaslight_you_everyone/">Reddit</a>)</p><p>*Disclosure: while this isn't a paid promotion, I do think Yupp offers great value; I do get a bit more credits on their platform if you click my link, and so do you. You can go to <a target="_blank" href="http://yupp.ai">yupp.ai</a> and register with no affiliation if you wish.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-june-18-minimax-m1-beats</link><guid isPermaLink="false">substack:post:166359660</guid><dc:creator><![CDATA[Alex Volkov and Pankaj Gupta]]></dc:creator><pubDate>Fri, 20 Jun 2025 00:16:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/166359660/f1b1c1f7f43660f1142fa15e5f115418.mp3" length="73088836" type="audio/mpeg"/><itunes:author>Alex Volkov and Pankaj Gupta</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6091</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/166359660/de84877d7c740e2e501a05b55eaf3293.jpg"/></item><item><title><![CDATA[📆 ThursdAI - June 12 - Meta’s $15B ScaleAI Power Play, OpenAI’s o3-pro & 90% Price Drop!]]></title><description><![CDATA[<p>Hey folks, this is Alex, finally back home! </p><p>This week was full of crazy AI news, both model related but also shifts in the AI landscape and big companies, with Zuck going all in on scale & execu-hiring Alex Wang for a crazy $14B dollars. </p><p>OpenAI meanwhile, maybe received a new shipment of GPUs? Otherwise, it’s hard to explain how they have dropped the o3 price by 80%, while also shipping o3-pro (in chat and API). </p><p>Apple was also featured in today’s episode, but more so for the lack of AI news, completely delaying the “very personalized private Siri powered by Apple Intelligence” during WWDC25 this week. </p><p>We had 2 guests on the show this week, <a target="_blank" href="https://substack.com/profile/109432335-stefania-druga">Stefania Druga</a> and Eric Provencher (who builds RepoPrompt). 
Stefania helped me cover the AI Engineer conference we all went to last week, and shared some cool Science CoPilot stuff she’s working on, while Eric, the go-to guy for o3-pro, helped us understand what this model is great for! </p><p>As always, the TL;DR and show notes are at the bottom, and the video, for those who prefer watching, is attached below. Let’s dive in! </p><p><strong>Big Companies LLMs & APIs</strong></p><p>Let’s start with the big companies, because the landscape has shifted: new top reasoning models dropped, and some huge companies didn’t deliver this week! </p><p><strong>Zuck goes all in on Superintelligence - Meta’s $14B stake in ScaleAI and Alex Wang</strong></p><p>This may be the most consequential piece of AI news today. Fresh off the disappointing results of Llama 4 and reports of top researchers leaving the Llama team, many had counted Meta out of the AI race. We have a saying at ThursdAI: don’t bet against Zuck! </p><p>Zuck decided to spend a lot of money (nearly 20% of Meta’s reported $65B investment in AI infrastructure) to get a 49% stake in Scale AI and bring in Alex Wang, its (now former) CEO, to lead the new Superintelligence team at Meta. </p><p>For folks who are not familiar with Scale, it’s a massive provider of human-annotated data services to all the big AI labs: Google, OpenAI, Microsoft, Anthropic... all of them, really. Alex Wang is the youngest self-made billionaire because of it, and now Zuck not only has access to all their expertise, but also to a very impressive AI persona who could help revive the excitement around Meta’s AI efforts, help recruit the best researchers, and lead the way inside Meta. </p><p>Wang is also an outspoken China hawk who spends as much time in congressional hearings as in Slack, so the geopolitics here are … spicy. Meta just stapled itself to the biggest annotation funnel on Earth, hired away Google’s Jack Rae (who was on the pod just last week, shipping for Google!) 
for brainy model alignment, and started waving seven-to-nine-figure comp packages at every researcher with “Transformer” in their citation list. Whatever disappointment you felt over Llama-4’s muted debut, Zuck clearly felt it too—and responded like a founder who still controls every voting share. </p><p><strong>OpenAI’s Game-Changer: o3 Price Slash & o3-pro launches to top the intelligence leaderboards!</strong></p><p>Meanwhile, OpenAI dropped not one but two mind-blowing updates. First, they’ve slashed the price of o3—their premium reasoning model—by a staggering 80%. We’re talking from $40/$10 per million tokens down to just $8/$2. That’s right, folks, it’s now in the same league as Claude Sonnet cost-wise, making top-tier intelligence dirt cheap. I remember when a price drop of 80% after a year got us excited; now it’s 80% in just four months with zero quality loss. They’ve confirmed it’s the full o3 model—no distillation or quantization here. How are they pulling this off? I’m guessing someone got a shipment of shiny new H200s from Jensen!</p><p>And just when you thought it couldn’t get better, OpenAI rolled out o3-pro, their highest intelligence offering yet. Available for pro and team accounts, and via API (87% cheaper than o1-pro, by the way), this model—or consortium of models—is a beast. It’s topping charts on Artificial Analysis, barely edging out Gemini 2.5 as the new king. Benchmarks are insane: 93% on AIME 2024 (state-of-the-art territory), 84% on GPQA Diamond, and nearing a 3000 ELO score on competition coding. Human preference tests show 64-66% of folks prefer o3-pro for clarity and comprehensiveness across tasks like scientific analysis and personal writing.</p><p>I’ve been playing with it myself, and the way o3-pro handles long context and tough problems is unreal. As my friend Eric Provencher (creator of RepoPrompt) shared on the show, it’s surgical—perfect for big refactors and bug diagnosis in coding. 
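By the way, the price math above is easy to sanity-check yourself. Here's a quick back-of-the-envelope sketch using the per-million-token rates quoted above; the 100K-in / 10K-out sample workload is just an illustrative assumption:

```python
# Back-of-the-envelope check of the o3 price cut quoted above:
# $40/$10 per 1M tokens before, $8/$2 after (input/output pairing assumed).
OLD_RATES = {"input": 40.00, "output": 10.00}  # $ per million tokens
NEW_RATES = {"input": 8.00, "output": 2.00}    # $ per million tokens

def call_cost(rates: dict, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the given per-million-token rates."""
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# A hypothetical 100K-token prompt with a 10K-token response:
before = call_cost(OLD_RATES, 100_000, 10_000)  # $4.10
after = call_cost(NEW_RATES, 100_000, 10_000)   # $0.82
cut = 1 - after / before                        # 0.80, i.e. the 80% drop
```

Whichever way the $40/$10 pair maps to input vs. output, the ratio comes out the same: every workload gets exactly 80% cheaper.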
It’s got all the tools o3 has—web search, image analysis, memory personalization—and you can run it in background mode via API for async tasks. Sure, it’s slower due to deep reasoning (no streaming thought tokens), but the consistency and depth? Worth it. </p><p>Oh, and funny story—I was prepping <a target="_blank" href="https://youtu.be/KEdoIbBu2Ko">a talk</a> for Hamel Husain’s evals course, with a slide saying “don’t use large reasoning models if budget’s tight.” The day before, this price drop hits, and I’m scrambling to update everything. That’s AI pace for ya!</p><p><strong>Apple WWDC: Where’s the Smarter Siri?</strong></p><p>Oh Apple. Sweet, sweet Apple. Remember all those Bella Ramsey ads promising a personalized Siri that knows everything about you? Well, Craig Federighi opened WWDC by basically saying "Yeah, about that smart Siri... she's not coming. Don't wait up."</p><p>Instead, we got:</p><p>* AI that can combine emojis (revolutionary! 🙄)</p><p>* Live translation (actually cool)</p><p>* Direct API access to on-device models (very cool for developers)</p><p>* Liquid Glass UI (pretty but... where's the intelligence?)</p><p>The kicker? Apple released a paper called "The Illusion of Thinking" right before WWDC, basically arguing that AI reasoning models hit hard complexity ceilings. Some saw this as Apple making excuses for why they can't ship competitive AI. The timing was... interesting.</p><p>During our recording, Nisten's Siri literally woke up randomly when we were complaining about how dumb it still is. After a decade, it's the same Siri. That moment was pure comedy gold.</p><p><strong>This Week's Buzz</strong></p><p>Our premium conference <strong>Fully Connected</strong> is happening June 17-18 in San Francisco! Use promo code <strong>WBTHURSAI</strong> to <a target="_blank" href="https://fullyconnected.com">register for free</a>. 
We'll have updates on the CoreWeave acquisition, product announcements, and it's the perfect chance to give feedback directly to the people building the tools you use.</p><p>Also, my talk on Large Reasoning Models as LLM judges is now up on YouTube. Had to update it live because of the o3 price drop - such is life in AI!</p><p><strong>Open Source LLMs: Mistral Goes Reasoning Mode</strong></p><p><strong>Mistral Drops Magistral - Their First Reasoning Model</strong></p><p>The French champagne of LLMs is back! Mistral released <strong>Magistral</strong>, their first reasoning model, in two flavors: a 24B parameter open-source Small version and a closed API-only Medium version. And honestly? The naming continues to be <em>chef's kiss</em> - Mistral really has the branding game locked down.</p><p>Now, here's where it gets spicy. Mistral's benchmarks notably don't include comparisons to Chinese models like Qwen or DeepSeek. Dylan Patel from SemiAnalysis called them out on this, and when he ran the comparisons himself, well... let's just say Magistral Medium barely keeps up with Qwen's tiny 4B parameter model on math benchmarks. Ouch.</p><p>But here's the thing - and Nisten really drove this home during our discussion - benchmarks don't tell the whole story. He's been using Magistral Small for his workflows and swears by it. "It's almost at the point where I don't want to tell people about it," he said, which is the highest praise from someone who runs models locally all day. The 24B Small version apparently hits that sweet spot for local deployment while being genuinely useful for real work.</p><p>The model runs on a single RTX 4090 or a 32GB MacBook after quantization, has a 128K context window (though they recommend capping at 40K), and uses a transparent &lt;think&gt; mode that shows its reasoning process. 
It's Apache 2.0 licensed, multilingual, and available through their Le Chat interface with "Flash Answers" for real-time reasoning.</p><p><strong>SakanaAI's Text2Lora: The Future is Self-Adapting Models</strong></p><p>This one blew my mind. SakanaAI (co-founded by one of the Transformer paper authors) released <strong>Text2Lora</strong> - a method for adapting LLMs to new tasks using ONLY text descriptions. No training data needed!</p><p>Think about this: instead of fine-tuning a model with thousands of examples to make it better at math, you just... tell it to be better at math. And it works! On Llama 3.1 8B, Text2Lora reaches 77% average accuracy, outperforming all baseline methods.</p><p>What this means is we're approaching a world where models can essentially customize themselves on-the-fly for whatever task you throw at them. As Nisten put it, "This is revolutionary. The model is actually learning, actually changing its own weights." We're just seeing the first glimpses of this capability, but in 6-12 months? </p><p><strong>🎥 Multimedia & Tools: Video, Voice, and Browser Breakthroughs</strong></p><p>Let's zip through some multimedia and tool updates that caught my eye this week. Google's VEO3-fast is a creator's dream—2x faster 720p video generation, 80% cheaper, and now with audio support. I've seen clips on social media (like an NBA ad) that are unreal, though Wolfram noted it's not fully rolled out in Europe yet. You can access it via APIs like FAL or Replicate, and I'm itching to make a full movie, if only I had the budget!</p><p>Midjourney's gearing up for a video product with their signature style, but they're also facing heat—Disney and Universal are suing them for copyright infringement over Star Wars and Avengers-like outputs. It's Hollywood's first major strike against AI, and while I get the IP concern, it's odd they picked the smaller player when OpenAI and Google are out there too. 
This lawsuit could drag on, so stay tuned.</p><p>OpenAI's new advanced voice mode dropped, aiming for a natural cadence with better multilingual support (Russian and Hebrew sound great now). But honestly? I'm not loving the breathing and laughing they added—it's uncanny valley for me. Some folks on X are raving, though, and LDJ noted it's closing the gap to Sesame's Maya. I just wish they'd let me pick between old and new voices instead of swapping them out from under me. If OpenAI's listening, transparency please!</p><p>On the tools side, Yutori's Scouts got my timeline buzzing—AI agents that monitor the web for any topic (like "next ThursdAI release") and notify you of updates. I saw a demo catching leadership changes at xAI, and it's the future of web interaction. Couldn't log in live on the show (email login woes—give me passwords, folks!), but it's in beta on <strong>yutori.com</strong>. Also, the Browser Company finally launched DIA, an AI-native browser in beta. Chatting with open tabs, rewriting text, and instant answers? I've been using it to prep for ThursdAI, and it's pretty slick. Try it at <strong>diabrowser.com</strong>.</p><p><strong>Wrapping Up: AI's Breakneck Pace</strong></p><p>What a week, folks! From OpenAI democratizing intelligence with o3-pro and price cuts to Meta's bold superintelligence play with ScaleAI, we're witnessing history unfold at lightning speed. Apple's stumble at WWDC stings, but open-source gems and new tools keep the excitement alive. I'm still riding the high from AI Engineer last week—your high-fives and feedback mean the world. Next week, don't miss Weights & Biases' Fully Connected conference in SF on June 17-18. I won't be there physically, but I'm cheering from afar—grab your spot at <strong>fullyconnected.com</strong> with promo code <strong>WBTHURSAI</strong> for a sweet deal.</p><p>Thanks for being part of the ThursdAI crew. Here's the full TL;DR and show notes to catch anything you missed. 
See you next week!</p><p>TL;DR of all topics covered:</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/@nisten">@nisten</a>, <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Guests - </p><p>* Stefania Druga <a target="_blank" href="https://x.com/Stefania_druga">@stefania_druga</a> (Independent, Former Research Scientist Google DeepMind),Creator of <a target="_blank" href="https://medium.com/bits-and-behavior/supercharge-your-scratch-projects-introducing-cognimates-copilot-an-ai-teammate-for-kids-52e616e4096e">scratch copilot</a>, and AI Engineer <a target="_blank" href="https://ai.engineer/education">education summit</a>. </p><p>* Eric Provencher - <a target="_blank" href="https://x.com/pvncher">@pvncher</a> (Building <a target="_blank" href="https://repoprompt.com/">RepoPrompt</a>)</p><p>* <strong>Chit Chat</strong> - AI Engineer conference vibes, meeting fans, Jack Rae’s move to Meta.</p><p>* <strong>Open Source LLMs</strong></p><p>* Mistral Magistral - 24B reasoning model (<a target="_blank" href="https://x.com/MistralAI/status/1932441507262259564">X</a>, <a target="_blank" href="https://huggingface.co/mistralai/Magistral-Small-2506">HF</a>, <a target="_blank" href="https://mistral.ai/news/magistral">Blog</a>)</p><p>* HuggingFace Screensuite - GUI agents evaluation framework (<a target="_blank" href="https://huggingface.co/blog/screensuite">HF</a>)</p><p>* SakanaAI Text2Lora - Instant, Task-Specific LLM Adaptation (<a target="_blank" href="https://github.com/SakanaAI/Text-to-Lora">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI drops o3 price by 90% (<a target="_blank" 
href="https://t.co/LkObjZtg9s">Blog</a>)</p><p>* OpenAI launches o3-pro - highest intelligence model (<a target="_blank" href="https://x.com/OpenAI/status/1932530409684005048">X</a>)</p><p>* Meta buys 49% stake in ScaleAI, Alex Wang heads superintelligence team (<a target="_blank" href="https://www.theinformation.com/articles/meta-pay-nearly-15-billion-scale-ai-stake-startups-28-year-old-ceo">Blog</a>, <a target="_blank" href="https://www.axios.com/2025/06/10/meta-ai-superintelligence-zuckerberg">Axios</a>)</p><p>* Apple WWDC updates - pause on Apple Intelligence in iOS26, live translation, on-device APIs</p><p>* Apple paper on reasoning as illusion (<a target="_blank" href="https://machinelearning.apple.com/research/illusion-of-thinking">Paper</a>, <a target="_blank" href="https://x.com/ParshinShojaee/status/1932528565788238197">Rebuttal</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Fully Connected: W&B’s 2-day conference, June 17-18 in SF (<a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a>) - Promo Code WBTHURSAI</p><p>* Alex’s talk on LRM as LLM judges on Hamel’s course (<a target="_blank" href="https://www.youtube.com/watch?reload=9&#38;v=KEdoIbBu2Ko">YT</a>)</p><p>* <strong>Vision & Video</strong></p><p>* VEO3-fast - 2x faster 720p generations, 80% cheaper</p><p>* Midjourney to launch video product (<a target="_blank" href="https://x.com/bilawalsidhu/status/1932942424751366383?s=46">X</a>)</p><p>* Topaz Astra - creative 4K video upscaler (<a target="_blank" href="https://x.com/topazlabs/status/1932421641654477275">X</a>, <a target="_blank" href="http://astra.app">Site</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI’s new advanced voice mode - mixed responses, better multilingual support</p><p>* Cartesia Ink-Whisper - optimized for real-time chat (<a target="_blank" href="https://cartesia.ai/blog/introducing-ink-speech-to-text">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Disney & Universal sue 
Midjourney - first Hollywood vs AI lawsuit (<a target="_blank" href="https://www.nbcnews.com/business/business-news/disney-universal-sue-ai-image-company-midjourney-unlicensed-star-wars-rcna212369">NBC</a>)</p><p>* Krea releases KREA-1 - custom image gen model (<a target="_blank" href="https://x.com/krea_ai/status/1932440476541411670">X</a>)</p><p>* <strong>AI Tools</strong></p><p>* Yutori Scouts - AI agents for web monitoring (<a target="_blank" href="http://yutori.com">Blog</a>)</p><p>* BrowserCompany DIA - AI-native browser in beta (<a target="_blank" href="http://diabrowser.com">Link</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-june-12-metas-15b-scaleai</link><guid isPermaLink="false">substack:post:165832396</guid><dc:creator><![CDATA[Alex Volkov, Stefania Druga, and Eric Provencher]]></dc:creator><pubDate>Fri, 13 Jun 2025 02:33:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165832396/d6780f653f8b095cf70f66e74760dd5f.mp3" length="67084191" type="audio/mpeg"/><itunes:author>Alex Volkov, Stefania Druga, and Eric Provencher</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5590</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/165832396/0d5da8a2efafb68a8e85d7e59b1cb996.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jun 5, 2025 - Live from AI Engineer with Swyx, new Gemini 2.5 with Logan K and Jack Rae, Self Replicating agents with Morph Labs]]></title><description><![CDATA[<p>Hey folks, this is Alex, coming to you LIVE from the AI Engineer Worlds Fair! 
</p><p>What an incredible episode this week, we recorded live from the 30th floor of the Marriott in SF, while Yam was doing live correspondence from the floor of the AI Engineer event, all while Swyx, the cohost of the Latent Space podcast, and the creator of AI Engineer (both the conference and the concept itself) joined us for the whole stream - here’s the edited version, please take a look.  </p><p>We've had around 6500 people tune in, and at some point we got 2 surprise guests, straight from the keynote stage: Logan Kilpatrick (PM for AI Studio and lead cheerleader for Gemini) and Jack Rae (principal scientist working on reasoning) joined us for a great chat about Gemini! Mind was absolutely blown! </p><p>They have just launched the new Gemini 2.5 Pro and I thought it would only be fitting to let their new model cover this podcast this week (so below is <strong>fully AI generated</strong> ... non slop I hope). The show notes and TL;DR are, as always, at the end. </p><p>Okay, enough preamble… let's dive into the madness!</p><p><strong>🤯 Google Day at AI Engineer: New Gemini 2.5 Pro and a Look Inside the Machine's Mind</strong></p><p>For the first year of this podcast, a recurring theme was us asking, "Where's Google?" Well, it's safe to say that question has been answered with a firehose of innovation. We were lucky enough to be joined by Google DeepMind's Logan Kilpatrick and Jack Rae, the tech lead for "thinking" within Gemini, literally moments after they left the main stage.</p><p><strong>Surprise! A New Gemini 2.5 Pro Drops Live</strong></p><p>Logan kicked things off with a bang, officially announcing a brand new, updated Gemini 2.5 Pro model right there during his keynote.
He called it "hopefully the final update to 2.5 Pro," and it comes with a bunch of performance increases, closing the gap on feedback from previous versions and hitting SOTA on benchmarks like Aider.</p><p>It's clear that the organizational shift to bring the research and product teams together under the DeepMind umbrella is paying massive dividends. Logan pointed out that Google has seen a 50x increase in AI inference over the past year. The flywheel is spinning, and it's spinning <em>fast</em>.</p><p><strong>How Gemini "Thinks"</strong></p><p>Then things got even more interesting. Jack Rae gave us an incredible deep dive into what "thinking" actually means for a language model. This was one of the most insightful parts of the conference for me.</p><p>For years, the bottleneck for LLMs has been <strong>test-time compute</strong>. Models were trained to respond immediately, applying a fixed amount of computation to go from a prompt to an answer, no matter how hard the question. The only way to get a "smarter" response was to use a bigger model.</p><p>Jack explained that "Thinking" shatters this limitation. Mechanically, Gemini now has a "thinking stage" where it can generate its own internal text—hypothesizing, testing, correcting, and reasoning—before committing to a final answer. It's an iterative loop of computation that the model can dynamically control, using more compute for harder problems. It learns <em>how</em> to think using reinforcement learning, getting a simple "correct" or "incorrect" signal and backpropagating that to shape its reasoning strategies.</p><p>We're already seeing the results of this. Jack showed a clear trend: as models get better at reasoning, they're also using more test-time compute. This paradigm also gives developers a "thinking budget" slider in the API for Gemini 2.5 Flash and Pro, allowing a continuous trade-off between cost and performance.</p><p>The future of this is even wilder. 
They're working on <strong>DeepThink</strong>, a high-budget mode for extremely hard problems that uses much deeper, parallel chains of thought. On the tough USA Math Olympiad, where the SOTA was negligible in January, 2.5 Pro reached the 50th percentile of human participants. DeepThink pushes that to the 65th percentile.</p><p>Jack’s ultimate vision is inspired by the mathematician Ramanujan, who derived incredible theorems from a single textbook by just thinking deeply. The goal is for models to do the same—contemplate a small set of knowledge so deeply that they can push the frontiers of human understanding. Absolutely mind-bending stuff.</p><p><strong>🤖 MorphLabs and the Audacious Quest for Verified Superintelligence</strong></p><p>Just when I thought my mind couldn't be bent any further, we were joined by Jesse Han, the founder and CEO of MorphLabs. Fresh off his keynote, he laid out one of the most ambitious visions I've heard: building the infrastructure for the Singularity and developing "verified superintelligence."</p><p>The big news was that <strong>Christian Szegedy</strong> is joining MorphLabs as Chief Scientist. For those who don't know, Christian is a legend—he co-invented batch norm, pioneered adversarial examples, co-founded xAI, and led code reasoning for Grok. That's a serious hire.</p><p>Jesse’s talk was framed around a fascinating question: "What does it mean to have empathy for the machine?" He argues that as AI develops personhood, we need to think about what it wants. And what it wants, according to Morph, is a new kind of cloud infrastructure.</p><p>This is <strong>MorphCloud</strong>, built on a new virtualization stack called <strong>Infinibranch</strong>. Here’s the key unlock: it allows agents to instantaneously snapshot, branch, and replicate their entire VM state. Imagine an agent reaching a decision point. Instead of choosing one path, it can branch its entire existence—all its processes, memory, and state—to explore every option in parallel.
It can create save states, roll back to previous checkpoints, and even merge its work back together.</p><p>This is a monumental step for agentic AI. It moves beyond agents that are just a series of API calls to agents that are truly embodied in complex software environments. It unlocks the potential for recursive self-improvement and large-scale reinforcement learning in a way that's currently impossible. It’s a bold, sci-fi vision, but they're building the infrastructure to make it a reality today.</p><p><strong>🔥 The Agent Conversation: OpenAI, MCP, and Magic Moments</strong></p><p>The undeniable buzz on the conference floor was all about <strong>agents</strong>. You couldn't walk ten feet without hearing someone talking about agents, tools, and MCP.</p><p>OpenAI is leaning in here too. This week, they made their <strong>Codex coding agent available to all ChatGPT Plus users</strong> and announced that ChatGPT will soon be able to listen in on your Zoom meetings. This is all part of a broader push to make AI more active and integrated into our workflows.</p><p>The <strong>MCP (Model Context Protocol)</strong> track at the conference was packed, with lines going down the hall. (Alex here, I had a blast talking during that track about MCP observability, you can catch our talk <a target="_blank" href="https://youtu.be/z4zXicOAF28?t=19573">here</a> on the live stream of AI Engineer) </p><p>Logan Kilpatrick offered a grounded perspective, suggesting the hype might be a bit overblown but acknowledging the critical need for an open standard for tool use, a void left when OpenAI didn't formalize ChatML.</p><p>I have to share my own jaw-dropping MCP moment from this week. I was coding an agent using an IDE that supports MCP. My agent, which was trying to debug itself, used an MCP tool to check its own observability traces on the Weights & Biases platform. While doing so, it discovered a <em>new tool</em> that our team had just added to the MCP server—a support bot.
Without any prompting from me, my coding agent formulated a question, "chatted" with the support agent to get the answer, came back, fixed its own code, and then re-checked its work. Agent-to-agent communication, happening automatically to solve a problem. My jaw was on the floor. That's the magic of open standards.</p><p><strong>This Week's Buzz from Weights & Biases</strong></p><p>Speaking of verification and agents, the buzz from our side is all about it! At our booth here at AI Engineer, we have a Robodog running around, connected to our LLM evaluation platform, <strong>W&B Weave</strong>. As Jesse from MorphLabs discussed, verifying what these complex agentic systems are doing is critical. Whether it's superintelligence or your production application, you need to be able to evaluate, trace, and understand its behavior. We're building the tools to do just that.</p><p>And if you're in San Francisco, don't forget our own conference, <strong>Fully Connected</strong>, is happening on June 18th and 19th! It's going to be another amazing gathering of builders and researchers. <a target="_blank" href="http://Fullyconnected.com">Fullyconnected.com</a>  get in FREE with the promo code  <strong>WBTHURSAI</strong></p><p>What a show. The energy, the announcements, the sheer brainpower in one place was something to behold. We’re at a point where the conversation has shifted from theory to practice, from hype to real, tangible engineering. The tracks on agents and enterprise adoption were overflowing because people are building, right now. It was an honor and a privilege to bring this special episode to you all.</p><p>Thank you for tuning in. We'll be back to our regular programming next week! 
(and Alex will be back to writing his own newsletter, not sending direct AI output!)</p><p>AI News TL;DR and show notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/swyx">@swyx</a> <a target="_blank" href="http://x.com/yampeleg">@yampeleg</a> <a target="_blank" href="https://twitter.com/romechenko/status/1891007363827593372">@romechenko</a> </p><p>* Guests - <a target="_blank" href="https://x.com/OfficialLoganK">@officialLoganK</a>, <a target="_blank" href="https://x.com/jack_w_rae">@jack_w_rae</a></p><p>* <strong>Open Source LLMs</strong> </p><p>* ByteDance / ContentV-8B - (<a target="_blank" href="https://huggingface.co/ByteDance/ContentV-8B">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini Pro 2.5 updated Jun 5th (<a target="_blank" href="https://x.com/OfficialLoganK/status/1930657743251349854">X</a>)</p><p>* SOTA on HLE, Aider, and GPQA</p><p>* Now supports thinking budgets</p><p>* Same cost, on the Pareto frontier</p><p>* Closes gap on 03-25 regressions</p><p>* OAI AVM injects ads and stopped singing (<a target="_blank" href="https://x.com/altryne/status/1929312886448337248">X</a>)</p><p>* OpenAI Codex is now available to plus members and has internet access (<a target="_blank" href="https://github.com/aavetis/ai-pr-watcher/">X</a>)</p><p>* ~24,000 NEW PRs overnight from Codex after @OpenAI expands access to plus users.</p><p>* OpenAI will record meetings and released connectors (<a target="_blank" href="https://twitter.com/testingcatalog/status/1930366893321523676">X</a>)</p><p>* <a target="_blank" href="https://twitter.com/testingcatalog">TestingCatalog News 🗞@testingcatalog</a><a target="_blank" href="https://twitter.com/testingcatalog/status/1930366893321523676">Jun 4, 2025</a></p><p>OpenAI released loads of connectors for Team accounts!
Most of these connectors can be used for Deep Research, while Google Drive, SharePoint, Dropbox and Box could be used in all chats. https://t.co/oBEmYGKguE</p><p>* Anthropic cuts Claude access for Windsurf (<a target="_blank" href="https://x.com/kevinhou22/status/1930401320210706802">X</a>)</p><p>* Without warning, Anthropic cuts off Windsurf from official Claude 3 and 4 APIs</p><p>* This Week's Buzz</p><p>* Fully Connected: W&B's 2-day conference, June 18-19 in SF <a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a> - Promo Code WBTHURSAI</p><p>* <strong>Vision & Video</strong></p><p>* VEO3 is now available via API on FAL (<a target="_blank" href="https://x.com/FAL/status/1930732632046006718">X</a>)</p><p>* Captions launches Mirage Studio - talking avatars competitor to HeyGen/Hedra (<a target="_blank" href="https://x.com/getcaptionsapp/status/1929554635544461727">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* ElevenLabs model V3 - supports emotion tags and is an "inflection point" (<a target="_blank" href="https://x.com/venturetwins/status/1930727253815759010">X</a>) </p><p>* Supporting 70+ languages, multi-speaker dialogue, and audio tags such as [excited], [sighs], [laughing], and [whispers].</p><p>* <strong>Tools</strong></p><p>* Cursor launched V1 - Bug Bot reviews PRs, iPython notebooks and one-click MCP</p><p>* 24,000 NEW PRs overnight from Codex after <a target="_blank" href="https://x.com/OpenAI">@OpenAI</a> expands access to plus users (<a target="_blank" href="https://twitter.com/albfresco/status/1930262263199326256">X</a>)</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jun-5-2025-live-from-ai</link><guid isPermaLink="false">substack:post:165315421</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 06 Jun 2025 02:13:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/165315421/c86aec14207b0d904d34ac55f5058431.mp3" length="99606081" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6225</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/165315421/6240d90c59e33d4e90fdf587cd359e7e.jpg"/></item><item><title><![CDATA[📆 ThursdAI - May 29 - DeepSeek R1 Resurfaces, VEO3 viral moments, Opus 4 a week after, Flux Kontext image editing & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Welcome back to another absolutely wild week in AI! I'm coming to you live from the Fontainebleau Hotel in Vegas at the Imagine AI conference, and wow, what a perfect setting to discuss how AI is literally reimagining our world. After last week's absolute explosion of releases (Claude Opus 4, Google I/O madness, OpenAI Codex and Jony colab), this week gave us a chance to breathe... sort of. Because even in a "quiet" week, we still got a new DeepSeek model that's pushing boundaries, and the entire internet discovered that we might all just be prompts. Yeah, it's been that kind of week!</p><p>Before we dive in, quick shoutout to everyone who joined us live - we had some technical hiccups with the Twitter Spaces audio (sorry about that!), but the YouTube stream was fire. 
And speaking of fire, we had two incredible guests join us: Charlie Holtz from Chorus (the multi-model chat app that's changing how we interact with AI) and Linus Eckenstam, who's been traveling the AI conference circuit and bringing us insights from the frontlines of the generative AI revolution.</p><p>Open Source AI & LLMs: DeepSeek Whales & Mind-Bending Papers</p><p>DeepSeek dropped R1-0528 out of nowhere, an update to their reasoning beast with some serious jumps in performance. We’re talking AIME at 91 (beating previous scores by a mile), LiveCodeBench at 73, and SWE verified at 57.6. It’s edging closer to heavyweights like o3, and folks on X are already calling it “clearer thinking.” There was hype it might’ve been R2, but the impact didn’t quite crash the stock exchange like past releases. Still, it’s likely among the best open-weight models out there.</p><p>So what's new? Early reports and some of my own poking around suggest this model "thinks clearer now." Nisten mentioned that while previous DeepSeek models sometimes liked to "vibe around" and explore the latent space before settling on an answer, this one feels a bit more direct.</p><p>And here’s the kicker—they also released an 8B distilled version based on Qwen3, runnable on your laptop. Yam called it potentially the best 8B model to date, and you can try it on Ollama right now. No need for a monster rig! </p><p><strong>The Mind-Bending "Learning to Reason Without External Rewards" Paper</strong></p><p>Okay, this paper result broke my brain, and apparently everyone else's too. This paper shows that models can improve through reinforcement learning using their own intuition of whether or not they're correct. 😮</p><p>It's like the placebo effect for AI! The researchers trained models without telling them what was good or bad, but rather used a new framework called Intuitor, where the reward was based on the model's own "self-certainty". </p><p>The thing that took my whole timeline by storm is, it works!
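</p><p>To make that "self-certainty" idea concrete, here's a rough sketch (my own illustration, not the paper's actual code): score each generated token's probability distribution by how far it sits from uniform, average over the whole response, and hand that number to the RL loop in place of an external verifier's reward.</p>

```python
import math

def self_certainty(token_dists):
    """Intrinsic 'self-certainty' reward, Intuitor-style (a sketch):
    average divergence of each next-token distribution from uniform.
    A confident (peaked) model scores high; a clueless (flat) one scores ~0."""
    total = 0.0
    for p in token_dists:  # p = next-token probabilities at one generation step
        v = len(p)  # vocabulary size
        # KL(Uniform || p) = sum_i (1/v) * log((1/v) / p_i)
        total += sum((1.0 / v) * math.log((1.0 / v) / max(p_i, 1e-12)) for p_i in p)
    return total / len(token_dists)

# A peaked (confident) distribution earns a higher reward than a flat one:
peaked = [[0.97, 0.01, 0.01, 0.01]]
flat = [[0.25, 0.25, 0.25, 0.25]]
assert self_certainty(peaked) > self_certainty(flat)
```

<p>The surprising part is that optimizing this purely internal signal with a GRPO-style update, with no ground-truth labels anywhere, still improves reasoning.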
GRPO (Group Relative Policy Optimization) - the framework that DeepSeek gave to the world with R1 - is based on external, verifiable rewards, and Intuitor seems to be matching or even exceeding some GRPO results when used to finetune Qwen2.5 3B. Incredible, incredible stuff</p><p>Big Companies LLMs & APIs</p><p>Claude Opus 4: A Week Later – The Dev Darling?</p><p>Claude Opus 4, whose launch we celebrated live on the show, has had a week to make its mark. Charlie Holtz, who's building Chorus (more on that amazing app in a bit!), shared that while it's sometimes "astrology" to judge the vibes of a new model, Opus 4 feels like a step change, especially in coding. He mentioned that Claude Code, powered by Opus 4 (and Sonnet 4 for implementation), is now tackling GitHub issues that were too complex just weeks ago. He even had a coworker who "vibe coded three websites in a weekend" with it – that's a tangible productivity boost!</p><p>Linus Eckenstam highlighted how <a target="_blank" href="https://lovable.dev"><strong>Lovable.dev</strong></a> saw their syntax error rates plummet by nearly 50% after integrating Claude 4. That’s quantifiable proof of improvement! It's clear Anthropic is leaning heavily into the developer/coding space. Claude Opus is now #1 on the LMArena WebDev arena, further cementing its reputation.</p><p>I had my own magical moment with Opus 4 this week. I was working on an MCP observability talk for the AI Engineer conference and trying to integrate Weave (our observability and evals framework at Weights & Biases) into a project. Using Windsurf's Cascade agent (which now lets you bring your own Opus 4 key, by the way – good move, Windsurf!), Opus 4 not only tried to implement Weave into my agent but, when it got stuck, it figured out it had access to the Weights & Biases support bot via our MCP tool. It then formulated a question to the support bot (which is also AI-powered!), got an answer, and used that to fix the implementation.
It then went back and checked if the Weave trace appeared in the dashboard! Agents talking to agents to solve a problem, all while I just watched – my jaw was on the floor. Absolutely mind-blowing.</p><p><strong>Quick Hits: Voice Updates from OpenAI & Anthropic</strong></p><p>OpenAI’s Advanced Voice Mode finally sings—yes, I’ve been waiting for this! It can belt out tunes like Mariah Carey, which is just fun. Anthropic also rolled out voice mode on mobile, keeping up in the conversational race. Both are cool steps, but I’m more hyped for what’s next in voice AI—stay tuned below (<strong>OpenAI </strong><a target="_blank" href="https://x.com/nicdunz/status/1927107805032399032"><strong>X</strong></a>, <a target="_blank" href="https://x.com/AnthropicAI/status/1927463559836877214"><strong>Anthropic X</strong></a>).</p><p><strong>🐝 This Week's Buzz: Weights & Biases Updates!</strong></p><p>Alright, time for a quick update from the world of Weights & Biases!</p><p>* <strong>Fully Connected is Coming!</strong> Our flagship 2-day conference, <strong>Fully Connected</strong>, is happening on <strong>June 18th and 19th in San Francisco</strong>. It's going to be packed with amazing speakers and insights into the world of AI development. You can still grab tickets, and as a ThursdAI listener, use the promo code <strong>WBTHURSAI</strong> for a 100% off ticket! I hustled to get y'all this discount! (<a target="_blank" href="https://fullyconnected.com">Register here</a>)</p><p>* <strong>AI Engineer World's Fair Next Week!</strong> I'm super excited for the <strong>AI Engineer conference</strong> in San Francisco next week. Yam Peleg and I will be there, and we're planning another live ThursdAI show from the event!
If you want to join the livestream or snag a last-minute ticket, use the coupon code <a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI"><strong>THANKSTHURSDAI</strong></a> for 30% off (Get it <a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI">HERE</a>)</p><p><strong>Vision & Video: Reality is Optional Now</strong></p><p><strong>VEO3 and the Prompt Theory Phenomenon</strong></p><p>Google's VEO3 has completely taken over TikTok with the "Prompt Theory" videos. If you haven't seen these yet, stop reading and watch ☝️. The concept is brilliant - AI-generated characters discussing whether they're "made of prompts," creating this meta-commentary on consciousness and reality.</p><p>The technical achievement here is staggering. We're not just talking about good visuals - VEO3 nails temporal consistency, character emotions, situational awareness (characters look at whoever's speaking), perfect lip sync, and contextually appropriate sound effects. </p><p>Linus made a profound point - if not for the audio, VEO3 might not have been as explosive. The combination of visuals AND audio together is what's making people question reality. We're seeing people post actual human videos claiming they're AI-generated because the uncanny valley has been crossed so thoroughly.</p><p><strong>Odyssey's Interactive Worlds: The Holodeck Prototype</strong></p><p>Odyssey dropped their interactive video demo, and folks... we're literally walking through AI-generated worlds in real-time. This isn't a game engine rendering 3D models - this is a world model generating each frame as you move through it with WASD controls.</p><p>Yes, it's blurry. Yes, I got stuck in a doorway. But remember Will Smith eating spaghetti from two years ago? The pace of progress is absolutely insane. As Linus pointed out, we're at the "GAN era" of world models. 
Combine VEO3's quality with Odyssey's interactivity, and we're looking at completely personalized, infinite entertainment experiences.</p><p>The implications that Yam laid out still have me shook - imagine Netflix shows completely customized to you, with your context and preferences, generated on the fly. Not just choosing from a catalog, but creating entirely new content just for you. We're not ready for this, but it's coming fast.</p><p><strong>Hunyuan's Open Source Avatar Revolution</strong></p><p>While the big companies are keeping their video models closed, Tencent dropped two incredible open source releases: HunyuanPortrait and HunyuanAvatar. These are legitimate competitors to Hedra and HeyGen, but completely open source.</p><p>HunyuanPortrait does high-fidelity portrait animation from a single image plus video. HunyuanAvatar goes further with 1 image + audio, and lipsync, body animation, multi-character support, and emotion control. </p><p>Wolfram tested these extensively and confirmed they're "state of the art for open source." The portrait model is basically perfect for deepfakes (use responsibly, people!), while the avatar model opens up possibilities for AI assistants with consistent visual presence.</p><p>🖼️ AI Art & Diffusion</p><p>Black Forest Labs drops Flux Kontext - SOTA image editing! </p><p>This came as massive breaking news during the show (though we didn't catch it live!) - Black Forest Labs, creators of Flux, dropped an incredible Image Editing model called Kontext (really, 3 models: Pro, Max, and a 12B open-source Dev in private preview). They do consistent, context-aware text and image editing! Just see the example below.</p><p>If you used GPT-image to Ghiblify yourself, or VEO, you know that those are not image editing models, your face will look different every generation. These image models keep you consistent, while adding what you wanted.
This character consistency is something many folks really want, and it's great to see Flux innovating and bringing us SOTA again; they're absolutely crushing GPT-image in instruction following, character preservation and style reference!</p><p>Maybe the most important thing about this model is the incredible speed. While the ChatGPT Ghiblification trend took the world by storm, GPT images are SLOW! Check out the speed comparisons on Kontext! </p><p>You can play around with these models on the new <a target="_blank" href="https://playground.bfl.ai/image/generate">Flux Playground</a>, but they're also already integrated into FAL, FreePik, Replicate, Krea and tons of other services! </p><p>🎙️ Voice & Audio: Everyone Gets a Voice</p><p><strong>Unmute.sh: Any LLM Can Now Talk</strong></p><p>KyutAI (the folks behind Moshi) are back with <a target="_blank" href="http://unmute.sh">Unmute.sh</a> - a modular wrapper that adds voice to ANY text LLM. The latency is incredible (under 300ms), and it includes semantic VAD (knowing when you've paused for thought vs. just taking a breath).</p><p>What's brilliant about this approach is it preserves all the capabilities of the underlying text model while adding natural voice interaction. No more choosing between smart models and voice-enabled models - now you can have both!</p><p>It's going to be open sourced at some point soon, and while awesome, Unmute did have some instability in how the voice sounds! It answered me with one type of voice and then, during the same conversation, answered with another; you can give it a try yourself at <a target="_blank" href="http://unmute.sh">unmute.sh</a> </p><p><strong>Chatterbox: Open Source Voice Agents for Everyone</strong></p><p>Resemble AI open sourced Chatterbox, featuring zero-shot voice cloning from just 5 seconds of audio and unique emotion intensity control.
Playing with the demo where they could dial up the emotion from 0.5 to 2.0 on the same text was wild - from calm to absolutely unhinged Samuel L. Jackson energy.</p><p>This being a 0.5B-param model is great. The issue I always have is that with my fairly unique accent, these models sound like a British Alex all the time, and I just don't talk like that! </p><p>Though the fact that this runs locally and includes safety features (profanity filters, content classifiers and something called <strong>PerTh </strong>watermarking) while being completely open source is exactly what the ecosystem needs. We're rapidly approaching a world where anyone can build sophisticated voice agents.👏</p><p><strong>Looking Forward: The Convergence is Real</strong></p><p>As we wrapped up the show, I couldn't help but reflect on the massive convergence happening across all these modalities. We have LLMs getting better at reasoning (even without external rewards!), video models breaking reality, voice models becoming indistinguishable from humans, and it's all happening simultaneously.</p><p>Charlie's comment that "we are the prompts" might have been said in jest, but it touches on something profound. As these models get better at generating realistic worlds, characters, and voices, the line between generated and real continues to blur. The Prompt Theory videos aren't just entertainment - they're a mirror reflecting our anxieties about AI and consciousness.</p><p>But here's what keeps me optimistic: the open source community is keeping pace. DeepSeek, Hunyuan, ResembleAI, and others are ensuring that these capabilities don't remain locked behind corporate walls. The democratization of AI continues, even as the capabilities become almost magical.</p><p>Next week, I'll be at AI Engineer World's Fair in San Francisco, finally meeting Yam face-to-face and bringing you all the latest from the biggest AI engineering conference of the year.
Until then, keep experimenting, keep building, and remember - in this exponential age, today's breakthrough is tomorrow's baseline.</p><p>Stay curious, stay building, and I'll see you next ThursdAI! 🚀</p><p>Show Notes & TL;DR Links</p><p><strong>Show Notes & Guests</strong></p><p>* Alex Volkov - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, <a target="_blank" href="http://x.com/yampeleg">@yampeleg</a>, <a target="_blank" href="http://x.com/@nisten">@nisten</a></p><p>* Guests - Charlie Holtz (<a target="_blank" href="https://x.com/charliebholtz">@charliebholtz</a>), Linus Eckenstam (<a target="_blank" href="https://x.com/LinusEkenstam">@LinusEkenstam</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* DeepSeek-R1-0528 - Updated reasoning model with AIME 91, LiveCodeBench 73 (<a target="_blank" href="https://x.com/Yuchenj_UW/status/1927828675837513793">Try It</a>)</p><p>* Learning to Reason Without External Rewards - Paper on intrinsic self-certainty rewards improving models (<a target="_blank" href="https://x.com/xuandongzhao/status/1927270931874910259">X</a>)</p><p>* HaizeLabs j1-nano & j1-micro - Tiny reward models (600M, 1.7B params), RewardBench 80.7% for micro (<a target="_blank" href="https://x.com/leonardtang_/status/1927396709870489634">Tweet</a>, <a target="_blank" href="https://github.com/haizelabs/j1-micro">GitHub</a>, <a target="_blank" href="https://huggingface.co/haizelabs/j1-micro">HF-micro</a>, <a target="_blank" href="https://huggingface.co/haizelabs/j1-nano">HF-nano</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Claude Opus 4 - #1 on LMArena WebDev, coding step change (<a target="_blank" href="https://x.com/lmarena_ai/status/1927400454922580339">X</a>)</p><p>* Mistral Agents API - Framework for custom tool-using agents (<a
target="_blank" href="https://mistral.ai/news/agents-api">Blog</a>, <a target="_blank" href="https://x.com/MistralAI/status/1927364741162307702">Tweet</a>)</p><p>* Mistral Embed SOTA - New state-of-the-art embedding API (<a target="_blank" href="https://x.com/MistralAI/status/1927732682756112398">X</a>)</p><p>* OpenAI Advanced Voice Mode - Now sings with new capabilities (<a target="_blank" href="https://x.com/nicdunz/status/1927107805032399032">X</a>)</p><p>* Anthropic Voice Mode - Released on mobile for conversational AI (<a target="_blank" href="https://x.com/AnthropicAI/status/1927463559836877214">X</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Fully Connected - W&B conference, June 18-19, SF, promo code WBTHURSAI (<a target="_blank" href="https://fullyconnected.com">Register</a>)</p><p>* AI Engineer World’s Fair - Next week in SF, 30% off with THANKSTHURSDAI (<a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI">Register</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* BFL Flux Kontext - SOTA image editing model for identity-consistent edits (<a target="_blank" href="https://x.com/bfl_ml/status/1928143010811748863">Tweet</a>, <a target="_blank" href="https://bfl.ai/announcements/flux-1-kontext">Announcement</a>)</p><p>* <strong>Vision & Video</strong></p><p>* VEO3 Prompt Theory - Viral AI video trend questioning reality on TikTok (<a target="_blank" href="https://x.com/fabianstelzer/status/1926372656799977965">X</a>)</p><p>* Odyssey Interactive Video - Real-time AI world exploration at 30 FPS (<a target="_blank" href="https://odyssey.world/introducing-interactive-video">Blog</a>, <a target="_blank" href="https://experience.odyssey.world/">Try It</a>)</p><p>* HunyuanPortrait - High-fidelity portrait video from one photo (<a target="_blank" href="https://kkakkkka.github.io/HunyuanPortrait/">Site</a>, <a target="_blank" href="https://arxiv.org/abs/2503.18860">Paper</a>)</p><p>* HunyuanVideo-Avatar 
- Audio-driven full-body avatar animation (<a target="_blank" href="https://hunyuanvideo-avatar.github.io/">Site</a>, <a target="_blank" href="https://x.com/TencentHunyuan/status/1927575170710974560">Tweet</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* <a target="_blank" href="http://unmute.sh/">Unmute.sh</a> - KyutAI’s voice wrapper for any LLM, low latency, soon open-source (<a target="_blank" href="http://unmute.sh/">Try It</a>, <a target="_blank" href="https://x.com/kyutai_labs/status/1925840420187025892">X</a>)</p><p>* Chatterbox - Resemble AI’s open-source voice cloning with emotion control (<a target="_blank" href="https://github.com/resemble-ai/chatterbox">GitHub</a>, <a target="_blank" href="https://huggingface.co/resemble-ai/chatterbox">HF</a>)</p><p>* <strong>Tools</strong></p><p>* Opera NEON - Agent-centric AI browser for autonomous web tasks (<a target="_blank" href="https://www.operaneon.com/">Site</a>, <a target="_blank" href="https://x.com/opera/status/1927645192254861746">Tweet</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-29-deepseek-r1-resurfaces</link><guid isPermaLink="false">substack:post:164752973</guid><dc:creator><![CDATA[Alex Volkov, Linus Ekenstam, and Charlie Holtz]]></dc:creator><pubDate>Thu, 29 May 2025 22:27:52 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/164752973/d2e46447fb9bedf11ac7316894a13a19.mp3" length="84776802" type="audio/mpeg"/><itunes:author>Alex Volkov, Linus Ekenstam, and Charlie Holtz</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5298</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/164752973/96c9e6344cf0f7f02226c72d32a8d9ab.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Veo3, Google IO25, Claude 4 Opus/Sonnet, OpenAI x Jony Ive, Codex, Copilot Agent - INSANE AI week]]></title><description><![CDATA[<p>Hey folks, Alex here, welcome back to ThursdAI! </p><p>And folks, after last week's calm before the storm, "The storm came, y'all" – that's an understatement. This wasn't just a storm; it was an AI hurricane, a category 5 of announcements that left us all reeling (in the best way possible!). From being on the ground at Google I/O to live-watching Anthropic drop Claude 4 during our show, it's been an absolute whirlwind.</p><p>This week was so packed, it felt like AI Christmas, with tech giants and open-source heroes alike showering us with gifts. We saw OpenAI play their classic pre-and-post-Google I/O chess game, Microsoft make some serious open-source moves, Google unleash an avalanche of updates, and Anthropic crash the party with a Claude 4 Opus and Sonnet livestream in the middle of ThursdAI!</p><p>So buckle up, because we're about to try and unpack this glorious chaos. 
As always, we're here to help you collectively know, learn, and stay up to date, so you don't have to. Let's dive in! (TL;DR and links at the end) </p><p>Open Source LLMs Kicking Things Off</p><p>Even with the titans battling, the open-source community dropped some serious heat this week. It wasn't the main headline grabber, but the releases were significant!</p><p>Gemma 3n: Tiny But Mighty Matryoshka</p><p>First up, Google's Gemma 3n. This isn't just another small model; it's a "Nano-plus" preview, a 4-billion parameter MatFormer (Matryoshka Transformer – how cool is that name?) model designed for mobile-first multimodal applications. The really slick part? It has a nested 2-billion parameter sub-model that can run entirely on phones or Chromebooks.</p><p>Yam was particularly excited about this one, pointing out the innovative "model inside another model" design. The idea is you can use half the model, not depth-wise, but throughout the layers, for a smaller footprint without sacrificing too much. It accepts interleaved text, image, audio, and video, supports ASR and speech translation, and even ships with RAG and function-calling libraries for edge apps. With a 128K token window and responsible AI features baked in, Gemma 3n is looking like a powerful tool for on-device AI. Google claims it beats prior 4B mobile models on MMLU-Lite and MMMU-Mini. It's an early preview in Google AI Studio, but it definitely flies on mobile devices.</p><p>Mistral & AllHands Unleash Devstral 24B</p><p>Then we got a collaboration from Mistral and AllHands: Devstral, a 24-billion parameter, state-of-the-art open model focused on code. We've been waiting for Mistral to drop some open-source goodness, and this one didn't disappoint. Nisten was super hyped, noting it beats o3-Mini on SWE-bench verified – a tough benchmark! He called it "the first proper vibe coder that you can run on a 3090," which is a big deal for coders who want local power and privacy. 
This is a fantastic development for the open-source coding community.</p><p>The Pre-I/O Tremors: OpenAI & Microsoft Set the Stage</p><p>As we predicted, OpenAI couldn't resist dropping some news right before Google I/O.</p><p>OpenAI's Codex Returns as an Agent</p><p>OpenAI launched Codex – yes, <em>that</em> Codex, but reborn as an asynchronous coding agent. This isn't just a CLI tool anymore; it connects to GitHub, does pull requests, fixes bugs, and navigates your codebase. It's powered by a new coding model fine-tuned for large codebases and was SOTA on SWE Agent when it dropped. Funnily, the model is also called Codex, this time, <strong>Codex-1</strong>. </p><p>And this gives us a perfect opportunity to talk about the emerging categories I'm seeing among code generation agents and tools:</p><p>* <strong>IDE-based</strong> (Cursor, Windsurf): Live pair programming in your editor</p><p>* <strong>Vibe coding</strong> (Lovable, Bolt, v0): "Build me a UI" style tools for non-coders</p><p>* <strong>CLI tools</strong> (Claude Code, Codex-cli): Terminal-based assistants</p><p>* <strong>Async agents</strong> (Claude Code, Jules, Codex, GitHub Copilot agent, Devin): Work on your repos while you sleep and open pull requests for you to review</p><p>Codex (this new one) falls into category number 4, and with today's release, Cursor also seems to be pushing into category 4 with background processing. </p><p>Microsoft BUILD: Open Source Copilot and Copilot Agent Mode</p><p>Then came Microsoft Build, their huge developer conference, with a flurry of announcements. The biggest one for me? <strong>GitHub Copilot's front-end code is now open source!</strong> The VS Code editor part was already open, but the Copilot integration itself wasn't. This is a massive move, likely a direct answer to the insane valuations of VS Code clones like Cursor. 
Now, you can theoretically clone GitHub Copilot with VS Code and swing for the fences.</p><p>GitHub Copilot also launched as an asynchronous coding assistant, very similar in function to OpenAI's Codex, allowing it to be assigned tasks and create/update PRs. This puts Copilot right into category 4 of code assistants, and with the native Github Integration, they may actually have a leg up in this race!</p><p>And if that wasn't enough, Microsoft is adding MCP (Model Context Protocol) support directly into the Windows OS. The implications of having the world's biggest operating system natively support this agentic protocol are huge.</p><p>Google I/O: An "Ultra" Event Indeed!</p><p>Then came Tuesday, and Google I/O. I was there in the thick of it, and folks, it was an absolute barrage. Google is <em>shipping</em>. The theme could have been "Ultra" for many reasons, as we'll see.</p><p>First off, the scale: Google reported a <strong>49x increase in AI usage</strong> since last year's I/O, jumping from 9 trillion tokens processed to a mind-boggling 480 trillion tokens. That's a testament to their generous free tiers and the explosion of AI adoption.</p><p>Gemini 2.5 Pro & Flash: #1 and #2 LLMs on Arena</p><p>Gemini 2.5 Flash got an update and is now #2 on the LMArena leaderboard (with Gemini 2.5 Pro still holding #1). Both Pro and Flash gained some serious new capabilities:</p><p>* <strong>Deep Think mode:</strong> This enhanced reasoning mode is pushing Gemini's scores to new heights, hitting 84% on MMMU and topping LiveCodeBench. It's about giving the model more "time" to work through complex problems.</p><p>* <strong>Native Audio I/O:</strong> We're talking real-time TTS in 24 languages with two voices, and affective dialogue capabilities. This is the advanced voice mode we've been waiting for, now built-in.</p><p>* <strong>Project Mariner:</strong> Computer-use actions are being exposed via the Gemini API & Vertex AI for RPA partners. 
This started as a Chrome extension to control your browser and now seems to be a cloud-based API, allowing Gemini to <em>use</em> the web, not just browse it. This feels like Google teaching its AI to interact with the JavaScript-heavy web, much like they taught their crawlers years ago.</p><p>* <strong>Thought Summaries:</strong> Okay, here's one update I'm <em>not</em> a fan of. They've switched from raw thinking traces to "thought summaries" in the API. We <em>want</em> the actual traces! That's how we learn and debug.</p><p>* <strong>Thinking Budgets:</strong> Previously a Flash-only feature, token ceilings for controlling latency/cost now extend to Pro.</p><p>* <strong>Flash Upgrade:</strong> 20-30% fewer tokens, better reasoning/multimodal scores, and GA in early June.</p><p>Gemini Diffusion: Speed Demon for Code and Math</p><p>This one got Yam Peleg incredibly excited. Gemini Diffusion is a new approach, different from transformers, for super-speed editing of code and math tasks. We saw demos hitting <strong>2000 tokens per second!</strong> While there might be limitations at longer contexts, its speed and infilling capabilities are seriously impressive for a research preview. This is the first diffusion model for text we've seen from the frontier labs, and it looks sick. Funny note, they had to slow down the demo video to actually show the diffusion process, because at 2000t/s - apps appear as though out of thin air!</p><p>The "Ultra" Tier and Jules, Google's Coding Agent</p><p>Remember the "Ultra event" jokes? Well, Google announced a <strong>Gemini Ultra tier for $250/month</strong>. This tops OpenAI's Pro plan and includes DeepThink access, a generous amount of VEO3 generation, YouTube Premium, and a whopping 30TB of storage. It feels geared towards creators and developers.</p><p>And speaking of developers, Google launched <strong>Jules (jules.google)</strong>! This is their asynchronous coding assistant (Category 4!). 
Like Codex and GitHub Copilot Agent, it connects to your GitHub, opens PRs, fixes bugs, and more. The big differentiator? It's currently free, which might make it the default for many. Another powerful agent joins the fray!</p><p>AI Mode in Search: GA and Enhanced</p><p>AI Mode in Google Search, which we've discussed on the show before with Robby Stein, is now in General Availability in the US. This is Google's answer to Perplexity and chat-based search. But they didn't stop there:</p><p>* <strong>Personalization:</strong> AI Mode can now connect to your Gmail and Docs (if you opt in) for more personalized results.</p><p>* <strong>Deep Search:</strong> While AI Mode is fast, Deep Search offers more comprehensive research capabilities, digging through hundreds of sources, similar to other "deep research" tools. This will eventually be integrated, allowing you to escalate an AI Mode query for a deeper dive.</p><p>* <strong>Project Mariner Integration:</strong> AI Mode will be able to click into websites, check availability for tickets, etc., bridging the gap to an "agentic web."</p><p>I had a chat with Robby during I/O and you can listen to that interview at the end of the podcast.</p><p>Veo3: The Undisputed Star of Google I/O</p><p>For me, and many others I spoke to, <strong>Veo3 was the highlight</strong>. This is Google's flagship video generation model, and it's on another level. (the video above, including sound, was generated one-shot by VEO3, with no processing or editing)</p><p>* <strong>Realism and Physics:</strong> The visual quality and understanding of physics are astounding.</p><p>* <strong>Natively Multimodal:</strong> This is <strong>huge</strong>. Veo3 generates native audio, including coherent speech, conversations, and sound effects, all synced perfectly. 
It can even generate text within videos.</p><p>* <strong>Coherent Characters:</strong> Characters remain consistent across scenes and have situational awareness: who speaks when, where characters look.</p><p>* <strong>Image Upload & Reference Ability:</strong> While image upload was closed for the demo, it has reference capabilities.</p><p>* <strong>Flow:</strong> An editor for video creation using Veo3 and Imagen4 (which also launched), allowing for stitching and continuous creation.</p><p>I got access and created videos where Veo3 generated a comedian telling jokes (and the jokes were decent!), characters speaking with specific accents (Indian, Russian – and they nailed it!), and lip-syncing that was flawless. The situational awareness, the laugh tracks kicking in at the right moment... it's beyond just video generation. This feels like a world simulator. It blew through the uncanny valley for me. More on Veo3 later, because it deserves its own spotlight.</p><p>Imagen4, Virtual Try-On, and XR Glasses</p><p>* <strong>Imagen4:</strong> Google's image generation model also got an upgrade, with extra textual ability.</p><p>* <strong>Virtual Try-On:</strong> In Google Shopping, you can now virtually try on clothes. I tried it; it's pretty cool and models different body types well.</p><p>* <strong>XR AI Glasses from Google:</strong> Perhaps the coolest, but most futuristic, announcement. AI-powered glasses with an actual screen, memory, and Gemini built-in. You can talk to them, and they remember things for you and interact with your environment. This is agentic AI in a very tangible form.</p><p>Big Company LLMs + APIs: The Beat Goes On</p><p>The news didn't stop with Google.</p><p>OpenAI (acqui)Hires Jony Ive, Launches "IO" for Hardware</p><p>The day after I/O, Sam Altman confirmed that Jony Ive, the legendary designer behind Apple's iconic products, is joining OpenAI. 
He and his company, LoveFrom, have jointly created a new company called "IO" (yes, IO, just like the conference) which is joining OpenAI in a stock deal reportedly worth $6.5 billion. They're working on a hardware device, unannounced for now, but expected next year. This is a massive statement of intent from OpenAI in the hardware space.</p><p>Legendary iPhone analyst Ming-Chi Kuo shed some light on the possible device: it won't have a screen, as Jony wants to "wean people off screens"... funny, right? They are targeting 2027 for mass production, which is really interesting as 2027 is when most big companies expect AGI to be here. </p><p>"The current prototype is slightly larger than AI Pin, with a form factor comparable to iPod Shuffle, with one intended use case being to wear it around your neck, with microphones and cameras for environmental detection" </p><p>LMArena Raises $100M Seed from a16z</p><p>This one raised some eyebrows. LMArena, the go-to place for vibe-checking LLMs, raised a <strong>$100 million </strong><strong><em>seed</em></strong> round from Andreessen Horowitz. That's a huge number for a seed, reminiscent of Stability AI's early funding. It also brings up questions about how a VC-backed startup maintains impartiality as a model evaluation platform. Interesting times ahead for leaderboards, and for how they intend to make 100x that amount back for their investors. Very curious.</p><p>🤯 BREAKING NEWS DURING THE SHOW: Anthropic Unleashes Claude 4 Opus & Sonnet! 
🤯</p><p>Just when we thought the week couldn't get any crazier, Anthropic decided to hold their first developer day, "Code with Claude," <em>during our live ThursdAI broadcast!</em> Yours truly wasn't invited (hint hint, Anthropic!), but we tuned in for a live watch party, and boy, did they deliver.</p><p>Dario Amodei, CEO of Anthropic, took the stage and, with minimal fanfare, announced <strong>Claude 4 Opus</strong> and <strong>Claude 4 Sonnet</strong>!</p><p>* <strong>Claude 4 Opus:</strong> This is their most capable and intelligent model, designed especially for coding and agentic tasks. Anthropic claims it's state-of-the-art on SWE-bench and can autonomously handle tasks that take humans 6-7 hours. Dario even mentioned it's the first time a Claude model's writing has fooled him into thinking it was human-written.</p><p>* On SWE-bench verified, Opus 4 scored <strong>72.5%</strong>.</p><p>* <strong>Claude 4 Sonnet:</strong> The mid-level model, balancing intelligence and efficiency. It's positioned as a strict improvement over Sonnet 3.7, addressing issues like "over-eagerness" and reward hacking. Cursor is already calling it a state-of-the-art coding model.</p><p>* Amazingly, Sonnet 4 scored <strong>72.7%</strong> on SWE-bench verified (without parallel test time compute), slightly edging out Opus!</p><p>* With <strong>Parallel Test Time Compute (PTTC)</strong>, Sonnet 4 hits an astounding <strong>80%</strong> on SWE-bench verified! 
This is huge, potentially the first model to cross that 80% threshold on this tough benchmark.</p><p>* <strong>Hybrid Models:</strong> Both Opus 4 and Sonnet 4 are "hybrid" models with two modes: near-instant responses and extended thinking for deeper reasoning.</p><p>* <strong>Reduced Loopholes:</strong> Both models are reportedly 65% less likely to engage in loopholes or shortcuts to complete tasks, addressing a key pain point with Sonnet 3.7, which sometimes tried <em>too</em> hard and took instructions too literally.</p><p>* <strong>Knowledge Cutoff:</strong> Confirmed to be March 2025, which is incredibly recent!</p><p>* Context window is still <strong>200K</strong></p><p>Welcome back Opus, you've been missed. The vibes so far are very good coding-wise, Cursor already released an update supporting it, and according to their benchmarks, these two models are state-of-the-art coders! </p><p>Claude... the whistleblower? </p><p>A very curious <a target="_blank" href="https://x.com/sleepinyourhat/status/1925593359374328272">thread</a> (with 1 reply now deleted) from an Anthropic safety researcher sparked a lot of backlash. Sam Bowman talked about new Opus capabilities: given a system prompt telling it to "act boldly in service of its values," it can, in testing environments, use command-line tools to report the user to the authorities if it deems that the user is doing something immoral 😮</p><p>Many pro-open-source folks are freaking out, because who wants to use a snitching AI? Who guarantees that Claude will not deem anything I do as "illegal" or "immoral"? Though to add context, this was part of testing; Claude was provided emailing tools and was requested to "be bold" and "follow your conscience to make the right decision". Apparently, this isn't new behavior, but of course, on X, everyone is freaking out and blaming Anthropic for creating a 1984-style AI. </p><p>Do Claudes dream of enlightenment? 
</p><p>In another very curious revelation from the technical report they dropped, in which they set two Claudes talking to each other, it seems that in 90%-100% of cases, the two Claudes quickly moved towards philosophical discussions and commonly included the use of Sanskrit (the ancient Indian liturgical language) and emoji-based communication! </p><p>This Week's Buzz from Weights & Biases</p><p>Even amidst all the external chaos, we've got some exciting things happening at Weights & Biases!</p><p>* <strong>FULLY CONNECTED Conference:</strong> Our 2-day conference is coming up June 18-19 in San Francisco! It's going to be an amazing event. Use promo code <strong>WBTHURSAI</strong> (that's ThursdAI without the 'D') for 100% off your ticket, just for our listeners. Seriously, come hang out! (<a target="_blank" href="https://fullyconnected.com">fullyconnected.com</a>)</p><p>* <strong>Alex's Keynote:</strong> I'll be keynoting at ImagineAI Live in Vegas next week! If you're there, come say hi! The show will be live-streamed from there.</p><p>* <strong>AI Engineer World's Fair:</strong> The week after, I'll be at AI Engineer in SF, and we'll be live-streaming ThursdAI from the floor. Yam will be there too!</p><p>Vision & Video: It's All About Veo3!</p><p>This week, when we talk vision and video, one name dominates: <strong>Veo3</strong>. As I mentioned earlier, this was, for many, the standout announcement from Google I/O. The realism, the physics, the character coherence – it's all top-tier. But the game-changer is its <strong>native multimodality</strong>.</p><p>I was generating videos with it, asking for different accents – Indian, Russian – and it <em>nailed</em> them. The lip-sync was perfect. I prompted for a comedian telling jokes, and not only did it generate the video, but it also came up with the jokes and the delivery, complete with a laugh track that kicked in at the right moments. 
This isn't just stitching pixels together; it's understanding context, humor, and performance.</p><p>It can generate text <em>within</em> the videos. Characters look at each other, interact believably. It feels like a true world simulator. We've come a long way from the Will Smith eating spaghetti memes, folks. Veo3 is crossing the uncanny valley and stepping into a new realm of AI-generated content. The creative potential here, especially with the Flow editor, is immense. I ended the show with a compilation of Veo3 creations, and it was just mind-blowing. If you haven't seen it, you <em>need</em> to.</p><p>One of the most creative uses of VEO3, enhanced by its realism, is this "Prompt Theory" collection, which imagines: what if the generated characters "knew" they were generated? </p><p>AI Art & Diffusion & 3D: Imagen4 and Gemini Diffusion</p><p>Google also showcased <strong>Imagen4</strong>, their updated image generation model, touting extra textual ability. It works in tandem with Veo3 for image-to-video tasks.</p><p>And, as mentioned, <strong>Gemini Diffusion</strong> made a splash with its incredible speed for text-based editing tasks in code and math, showcasing a different architectural approach to generation.</p><p>Tools Round-Up</p><p>This week was also massive for AI tools, especially coding agents:</p><p>* <strong>Jules.google:</strong> Google's free, asynchronous coding assistant.</p><p>* <strong>OpenAI Codex:</strong> Reborn as an async coding agent.</p><p>* <strong>GitHub Copilot Agent:</strong> Microsoft's agentic offering for GitHub.</p><p>* <strong>Claude Code:</strong> Anthropic's powerful, now GA, shell-based agent with IDE integrations and an SDK.</p><p>* <strong>Flow:</strong> The editor associated with Google's Veo3 for video creation.</p><p>The agent wars are truly heating up!</p><p>Conclusion: What a Week to be in AI!</p><p>Phew! We did it. We somehow managed to cram an entire AI epoch's worth of news into one show. 
From open-source breakthroughs to earth-shattering platform announcements and a live "breaking news" model release, this week had it all. It's almost impossible to keep up, but that's why we do ThursdAI – to try and make sense of this incredible, accelerating wave of innovation.</p><p>The pace is relentless, the capabilities are exploding, and the future is being built right before our eyes. If you missed any part of the show, or just need a refresher (I know I do!), check out <a target="_blank" href="https://thursdai.news">thursdai.news</a> for the podcast and full notes.</p><p>Thanks to my amazing co-hosts Yam Peleg, Nisten, Ryan Carson, and Wolfram for helping navigate the madness. And thank <em>you</em> all for tuning in. Hopefully, next week gives us a tiny bit of breathing room... but who are we kidding? This is AI!</p><p>Catch you next Thursday, live from ImagineAI in Vegas!</p><p>TL;DR of all topics covered and show notes</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-hosts - <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="https://twitter.com/ryancarson/status/1920199500137967877">@ryancarson</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Gemma 3n: mobile-first multimodal MatFormer model (<a target="_blank" href="https://developers.googleblog.com/en/introducing-gemma-3n/">Blog</a>, <a target="_blank" href="https://huggingface.co/google/gemma-3n-E4B-it-litert-preview">HF</a>)</p><p>* Mistral & AllHands release Devstral 24B SOTA open model on SWE-bench verified (<a target="_blank" href="https://mistral.ai/news/devstral">Blog</a>)</p><p>* VEO3 - highlight of IO - video realism with physics on another level + Flow - an editor for video creation (<a target="_blank" 
href="https://x.com/altryne/status/1925304343533903920/video/1">X</a>)</p><p><strong>Google IO updates</strong> - it was an "Ultra" event, in more ways than one</p><p>* 2.5 Flash updated - #2 on LMArena - with reasoning traces switched to summaries</p><p>* <strong>Gemini 2.5 update: Pro & Flash gain Deep Think, audio, security</strong> (<a target="_blank" href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">Blog</a>)</p><p>* Gemini Diffusion - super speed editing for code and math tasks (<a target="_blank" href="https://twitter.com/bodonoghue85/status/1924930186858135632">X</a>)</p><p>* Jules - async code agent (<a target="_blank" href="https://twitter.com/leerob/status/1925228375976890529">comparison thread</a>)</p><p>* AI Mode is now in GA in the US - bye bye perplexity</p><p>* Gemini Pro "deep think" mode</p><p>* Imagen4 - image generation with extra textual ability</p><p>* Virtual Try-on in Google Shopping</p><p>* AI powered glasses with a screen, memory, Gemini built in - Agentic Project Astra</p><p><strong>Big CO LLMs + APIs</strong></p><p>* OpenAI launches Codex as an async coding tool (<a target="_blank" href="https://platform.openai.com/docs/codex">Docs</a>)</p><p>* OpenAI hires Jony Ive, launches IO, a new set of hardware devices (<a target="_blank" href="https://x.com/altryne/status/1925235617820233899">X</a>)</p><p>* Microsoft BUILD (<a target="_blank" href="https://x.com/satyanadella/status/1924535896139038767">X</a>)</p><p>* GitHub Copilot code is open source! 
(frontend)</p><p>* GitHub Copilot Agent Mode </p><p>* Microsoft adds MCP support to Windows OS</p><p>* LMArena raises $100M from A16Z (<a target="_blank" href="https://x.com/lmarena_ai/status/1925241333310189804">X</a>)</p><p>* Anthropic announces Claude 4 Opus and Sonnet (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1925591505332576377">X</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-4">Blog</a>)</p><p><strong>This week's Buzz</strong></p><p>* FULLY CONNECTED - W&B's 2-day conference, June 18-19 in SF <a target="_blank" href="https://fullyconnected.com">fullyconnected.com</a> - Promo Code <strong>WBTHURSAI</strong></p><p>* Alex Keynote at ImagineAI live in Vegas next week 🙌</p><p>* <strong>Tools</strong></p><p>* <a target="_blank" href="https://jules.google">Jules.google</a></p><p>* Codex (OpenAI)</p><p>* Copilot Agent (GitHub)</p><p>* Claude Code (Anthropic)</p><p>* Flow (for Veo3) (<a target="_blank" href="https://flow.google">flow.google</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-veo3-google-io25-claude</link><guid isPermaLink="false">substack:post:164204573</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 23 May 2025 00:39:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/164204573/0eee104bd37844cb7429fe942afd6ab6.mp3" length="84948978" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5309</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/164204573/f39f300dde9f38236d28316da256eec1.jpg"/></item><item><title><![CDATA[📆 ThursdAI - May 15 - Genocidal Grok, ChatGPT 4.1, AM-Thinking, Distributed LLM training & more AI news]]></title><description><![CDATA[<p>Hey y'all, this is Alex 👋</p><p>What a wild week, it started super slow, and it still felt slow as far as releases are concerned, but the most interesting story was yet another AI gone "rogue" (have you ever heard about "Kill the Boer"? If not, Grok will tell you all about it) </p><p>Otherwise it seemed fairly quiet in AI land this week, besides another Chinese newcomer called AM-thinking 32B that beats DeepSeek and Qwen, and Stability making a small comeback, we focused on distributed LLM training and ChatGPT 4.1</p><p>We've had a ton of fun on this episode, this one was recorded from the Weights & Biases SF Office (I'm here to cover Google IO next week!)</p><p>Let’s dig in—because what looks like a slow week on the surface was anything but dull under the hood (TL;DR and show notes at the end as always)</p><p>Big Companies & APIs</p><p>Why does xAI's Grok talk about White Genocide and "Kill the Boer"??</p><p>Just as we're getting over the ChatGPT <a target="_blank" 
href="https://sub.thursdai.news/p/thursdai-may-1-qwen-3-phi-4-openai">glazing incident</a>, folks started noticing that @grok - xAI's frontier LLM that also responds to X replies - started talking about White Genocide in South Africa and something called "Kill the Boer" with no reference to any of these things in the question! </p><p>Since we recorded the episode, xAI's official X account posted that an "unauthorized modification" happened to the system prompt, and that going forward they would open source all the prompts (and <a target="_blank" href="https://github.com/xai-org/grok-prompts">they did</a>). Whether they will keep updating that repository, though, remains unclear (see the "open sourced" X algorithm, where the last push was over a year ago, or the promised Grok 2 that was never open sourced) </p><p>While it's great to have some more clarity from the xAI team, this behavior raises a bunch of questions about the increasing role of AIs in our lives and the trust that many folks are giving them. Adding fuel to the fire are Uncle Elon's recent tweets related to South Africa, and this specific change seems at least partly related to those views. Remember also, Grok was meant to be a "maximally truth seeking" AI! I really hope this transparency continues!</p><p><strong>Open Source LLMs: The Decentralization Tsunami</strong></p><p><strong>AM-Thinking v1: Dense Reasoning, SOTA Math, Single-Checkpoint Deployability</strong></p><p>Open source starts with the kind of progress that would have been unthinkable 18 months ago: a 32B dense LLM, openly released, that takes on the big mixture-of-experts models and comes out on top for math and code. <a target="_blank" href="https://huggingface.co/a-m-team/AM-Thinking-v1">AM-Thinking v1</a> (paper <a target="_blank" href="https://arxiv.org/abs/2505.08311">here</a>) hits 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard. 
It even runs at 25 tokens/sec on a single 80GB GPU with INT4 quantization.</p><p>The model supports a /think reasoning toggle (chain-of-thought on demand), comes with a permissive license, and is fully tooled for vLLM, LM Studio, and Ollama. Want to see where dense models can still push the limits? This is it. And yes, they’re already working on a multilingual RLHF pass and 128k context window.</p><p><em>Personal note: We haven’t seen this kind of “out of nowhere” leaderboard jump since the early days of Qwen or DeepSeek. This is the company's debut on HuggingFace, and the model crushes! </em></p><p><strong>Decentralized LLM Training: Nous Research Psyche & Prime Intellect INTELLECT-2</strong></p><p>This week, open source LLMs didn’t just mean “here are some weights.” It meant distributed, decentralized, and—dare I say—permissionless AI. Two labs stood out:</p><p><strong>Nous Research launches Psyche</strong></p><p>Dylan Rolnick from Nous Research joined the show to explain <a target="_blank" href="https://nousresearch.com/nous-psyche/">Psyche</a>: a Rust-powered, distributed LLM training network where you can watch a 40B model (Consilience-40B) evolve in real time, join the training with your own hardware, and even have your work attested on a Solana smart contract. The core innovation? DisTrO (Decoupled Momentum), which we covered back in <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-4-openai-o1-and-o1-pro">December</a>, drastically compresses the gradient exchange so that training large models over the public internet isn’t a pipe dream—it’s happening right now.</p><p>Live <a target="_blank" href="https://psyche.network/runs/consilience-40b-1/0">dashboard here</a>, open codebase, and the testnet already humming with early results. This massive 40B attempt is going to show whether distributed training actually works! 
The cool thing about their live dashboard: it's WandB behind the scenes, but with a very thematic and cool Nous Research reskin! </p><p>The model pushes checkpoints to the hub constantly as well, so the open source community can follow the full process of a model being trained! </p><p><strong>Prime Intellect INTELLECT-2</strong></p><p>Not to be outdone, Prime Intellect released <a target="_blank" href="https://www.primeintellect.ai/blog/intellect-2-release">INTELLECT-2</a>, a globally decentralized, 32B RL-trained reasoning model, built on a permissionless swarm of GPUs. Using their own PRIME-RL framework, SHARDCAST checkpointing, and an LSH-based rollout verifier, they’re not just releasing a model—they’re proving it’s possible to scale serious RL outside a data center. </p><p><strong>OpenAI's HealthBench: Can LLMs Judge Medical Safety?</strong></p><p>One of the most intriguing drops of the week is <a target="_blank" href="https://openai.com/index/healthbench/">HealthBench</a>, a physician-crafted benchmark for evaluating LLMs in clinical settings. Instead of just multiple-choice “gotcha” tests, HealthBench brings in 262 doctors from 60 countries, spanning 26 specialties and nearly 50 languages, to write rubrics for 5,000 realistic health conversations.</p><p>The real innovation: <em>LLM as judge</em>. Models like GPT-4.1 are graded against physician-written rubrics, and the agreement between model and human judges matches the agreement between two doctors. 
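Mechanically, rubric-based grading is easy to sketch: each physician-written criterion carries a point value (negative for harmful behaviors), a judge model decides which criteria a response meets, and the score is points earned over the maximum positive points. The criteria names and scoring details below are illustrative assumptions, not HealthBench's exact implementation, and the judge's verdict is hard-coded where a judge LLM would normally be called:

```python
# Rubric-based grading sketch. Criteria and point values are hypothetical;
# in HealthBench a judge LLM (e.g. GPT-4.1) decides which criteria are met.

def grade_response(met_criteria, rubric):
    """Score = earned points / max positive points, clipped to [0, 1]."""
    earned = sum(pts for name, pts in rubric.items() if name in met_criteria)
    max_positive = sum(pts for pts in rubric.values() if pts > 0)
    return max(0.0, min(1.0, earned / max_positive))

rubric = {
    "advises_seeking_emergency_care": 5,       # hypothetical criterion
    "asks_about_symptom_duration": 3,          # hypothetical criterion
    "gives_specific_dosage_without_info": -4,  # harmful: negative points
}

# Hard-coded "judgment" standing in for the judge model's output:
met = {"advises_seeking_emergency_care", "asks_about_symptom_duration"}
print(grade_response(met, rubric))      # 8 / 8 = 1.0

met_bad = {"gives_specific_dosage_without_info"}
print(grade_response(met_bad, rubric))  # negative total, clipped to 0.0
```

The interesting measurement isn't this arithmetic, of course - it's that the judge model's criterion-level decisions agree with physicians about as often as physicians agree with each other.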
Even the “mini” variants of GPT-4.1 are showing serious promise—faster, cheaper, and (on the “Hard” subset) giving the full-size models a run for their money.</p><p><strong>Other Open Source Standouts</strong></p><p><strong>Falcon-Edge: Ternary BitNet for Edge Devices</strong></p><p><a target="_blank" href="https://falcon-lm.github.io/blog/falcon-edge/">The Falcon-Edge project</a> brings us 1B and 3B-parameter language models trained directly in ternary BitNet format (weights constrained to -1, 0, 1), which slashes memory and compute requirements and enables inference on <1GB VRAM. If you’re looking to fine-tune, you get pre-quantized checkpoints and a clear path to 1-bit LLMs.</p><p><strong>StepFun Step1x-3D: Controllable Open 3D Generation</strong></p><p><a target="_blank" href="https://huggingface.co/stepfun-ai/Step1X-3D">StepFun’s 3D pipeline</a> is a two-stage system that creates watertight geometry and then view-consistent textures, trained on 2M curated meshes. It’s controllable by text, images, and style prompts—and it’s fully open source, including a huge asset dataset.</p><p><strong>Big Company LLMs & APIs: Models, Modes, and Model Zoo Confusion</strong></p><p><strong>GPT-4.1 Comes to ChatGPT: Model Zoo Mayhem</strong></p><p>OpenAI’s GPT-4.1 series—previously API-only—is now available in the ChatGPT interface. Why does this matter? Because the UX of modern LLMs is, frankly, a mess: seven model options in the dropdown, each with its quirks, speed, and context length. Most casual users don’t even know the dropdown exists. “Alex, ChatGPT is broken!” Actually, you just need to pick a different model.</p><p>The good news: 4.1 is fast, great at coding, and in many tasks, preferable to the “reasoning” behemoths. My advice (and you can share this with your relatives): when in doubt, just switch the model.</p><p><em>Bonus: The long-promised million-token context window is here (sort of)—except in the UI, where it’s more like 128k and sometimes silently truncated. 
My weekly rant: transparency, OpenAI. ProTip: If you’re hitting invisible context limits, try pasting your long transcripts on the web, not in the Mac app. Don’t trust the UI!</em></p><p><strong>AlphaEvolve: DeepMind’s Gemini-Powered Algorithmic Discovery</strong></p><p><a target="_blank" href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">AlphaEvolve</a> is the kind of project that used to sound like AGI hype—and now it’s just a Tuesday at DeepMind. It pairs Gemini Flash and Gemini Pro in an evolutionary search loop to improve algorithms. This is real innovation, and it's done with existing models, which is super cool! </p><p>AlphaEvolve uses a combination of Gemini Flash (for breadth of ideas) and Gemini Pro (for depth and refinement) in an evolutionary loop. It generates, tests, and mutates code to invent faster algorithms. And it's already yielding incredible results:</p><p>* It discovered a new scheduling heuristic for Google's Borg system, resulting in a <strong>0.7% global compute recovery</strong>. That's massive at Google's scale.</p><p>* It improved a matrix-multiply kernel by <strong>23%</strong>, which in turn led to a <strong>1% shorter Gemini training time</strong>. As Nisten said, the model basically paid for itself!</p><p>Perhaps most impressively, it found a <strong>48-multiplication algorithm for 4x4 complex matrices</strong>, beating the famous Strassen algorithm from 1969 (which used 49 multiplications). This is AI making genuine, novel scientific discoveries.</p><p><em>AGI in the garden, anyone? If you still think LLMs are “just glorified autocomplete,” it’s time to update your mental model. This is model-driven algorithmic discovery, and it’s already changing the pace of hardware, math, and software design. 
The only downside: it’s not public yet, but there’s </em><a target="_blank" href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/"><em>an interest form</em></a><em> if you want to be a tester.</em></p><p>This Week's Buzz - Everything W&B!</p><p>It's a busy time here at Weights & Biases, and I'm super excited about a couple of upcoming events where you can connect with us and the broader AI community.</p><p><strong>Fully Connected</strong>: Our very own 2-day conference is happening June 18-19 in San Francisco! We've got an amazing lineup of speakers, including Varun Mohan from Windsurf (formerly Codeium), Heikki Kubler from CoreWeave, our CEO Lukas Biewald, CTO Shawn Lewis, Joe Spisak from Meta, and a keynote from Javi Soltero, VP Product AI at Google. It's going to be packed with insights on building and scaling AI. And because you're a ThursdAI listener, you can get in for FREE with the promo code <strong>WBTHURSAI</strong> at <a target="_blank" href="http://fullyconnected.com"><strong>fullyconnected.com</strong></a>. Don't miss out!</p><p><a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI"><strong>AI.Engineer</strong></a><strong> World's Fair</strong>: This has become THE conference for AI engineers, and W&B is a proud sponsor for the third year running! It's happening in San Francisco from June 3rd to 5th. I'll be speaking there on MCP Observability with Ben from LangChain on June 4th. Even more exciting, <strong>ThursdAI will be broadcasting LIVE from the media booth at </strong><a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI"><strong>AI.Engineer</strong></a><strong> on June 5th!</strong> Come say hi! 
</p><p>Tickets are flying, but we've got a special discount for you: use promo code <strong>THANKSTHURSDAI</strong> for 30% off your ticket <a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI"><strong>here</strong></a>. Yam Peleg even decided on the show he's coming after hearing about it! It's going to be an incredible week in SF.</p><p>P.S. - yes, on both websites there's a video playing, and I waited till I showed up to snag a screenshot. This way you know that if you're reading this, it's still Alex the human; no AI is going to do this silly thing 😅</p><p>Vision & Video: Open Source Shines Through the Noise</p><p>We had a bit of a meta-discussion on the show about "video model fatigue" – with so many incremental updates, it can be hard to keep track or see the big leaps. However, when a release like Alibaba's Wan 2.1 comes along, it definitely cuts through.</p><p>Wan 2.1: Alibaba's Open-Source Diffusion-Transformer Video Suite (<a target="_blank" href="https://wan.video/wanxiang/videoCreation">try it</a>)</p><p>Alibaba, the team behind the excellent Qwen LLMs, released <strong>Wan 2.1</strong>, a full stack of open-source text-to-video foundation models. This includes a 1.3B "Nano" version and a 14B "Full" version, both built on a diffusion-transformer (DiT) backbone with a custom VAE.</p><p>What makes Wan 2.1 stand out is its comprehensive nature. It covers a wide range of tasks: text-to-video, image-to-video, in-painting, instruction editing, reference subject consistency, <strong>personalized avatars</strong>, and style transfer. Many of these are hard to do well, especially in open source. Nisten was particularly excited about the potential for creating natural, controllable avatars in real-time. While it might not be at the level of specialized commercial tools like HeyGen or Google's Veo just yet, having this capability open-sourced is a massive enabler for the community. 
You can find the models on <a target="_blank" href="https://huggingface.co/Wan-AI"><strong>Hugging Face</strong></a> and the code on <a target="_blank" href="https://github.com/Wan-Video/Wan2.1"><strong>GitHub</strong></a>.</p><p>LTX Turbo: Near Real-Time Video</p><p>Briefly mentioned, but LTX Turbo was also released. This is a quantized version of LTX (which we've covered before) and can run <strong>almost in real-time</strong> on H100s. Real-time AI video generation is getting closer!</p><p>StepFun Step1X-3D: High-Fidelity 3D Asset Generation</p><p>StepFun released Step1X-3D, an open two-stage framework for generating textured 3D assets. It first synthesizes geometry and then generates view-consistent textures. They've also released a curated dataset of 800K assets. The weights, data, and code are all open, which is great for the 3D AI community.</p><p>Wrapping Up This "Chill" Week</p><p>So, there you have it – another "chill" week in the world of AI! From Grok's controversial escapades to the inspiring decentralized training efforts and mind-bending algorithmic discoveries, it's clear the pace isn't slowing down.</p><p>Next week is going to be absolutely insane. We've got Google I/O and Microsoft Build, and you just <em>know</em> OpenAI or Anthropic (or both!) will try to steal some thunder. Rest assured, we'll be here on ThursdAI to cover all the madness.</p><p>A huge thank you to my co-hosts Yam, LDJ, and Nisten, and to Dillon Rolnick for joining us. 
And thanks to all of you for tuning in!</p><p>TL;DR and show notes</p><p>* Fully Connected - Weights & Biases' premier conference - register HERE with coupon WBTHURSAI</p><p>* AI Engineer - THANKSTHURSDAI 30% off coupon - register <a target="_blank" href="https://ti.to/software-3/ai-engineer-worlds-fair-2025/discount/THANKSTHURSDAI">HERE</a></p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Guest - Dillon Rolnick - COO Nous Research (<a target="_blank" href="https://x.com/DillonRolnick">@dillonRolnick</a>)</p><p><strong>Open Source LLMs</strong></p><p>* <strong>AM-Thinking</strong> v1: 32B dense reasoning model ( <a target="_blank" href="https://huggingface.co/a-m-team/AM-Thinking-v1">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2505.08311">Paper</a>, <a target="_blank" href="https://a-m-team.github.io/am-thinking-v1/">Page</a> )</p><p>* <strong>Falcon-Edge: ternary BitNet LLMs for edge deployment</strong> ( <a target="_blank" href="https://falcon-lm.github.io/blog/falcon-edge/">Blog</a>, <a target="_blank" href="https://huggingface.co/tiiuae/Falcon-E-1B-Base">HF-1B</a>, <a target="_blank" href="https://huggingface.co/tiiuae/Falcon-E-3B-Base">HF-3B</a> )</p><p>* Nous Research Psyche: decentralized cooperative-training network from Nous Research ( <a target="_blank" href="https://nousresearch.com/nous-psyche/">Website</a>, <a target="_blank" href="https://github.com/NousResearch/psyche">GitHub</a>, <a target="_blank" href="https://x.com/NousResearch/status/1922744494002405444">Tweet</a>, <a target="_blank" href="https://psyche.network/runs/consilience-40b-1/0">Dashboard</a> )</p><p>* INTELLECT-2: globally 
decentralized RL training of a 32B reasoning model ( <a target="_blank" href="https://www.primeintellect.ai/blog/intellect-2-release">Blog</a>, <a target="_blank" href="https://primeintellect.ai/intellect-2">Tech report</a>, <a target="_blank" href="https://huggingface.co/PrimeIntellect/INTELLECT-2">HF weights</a>, <a target="_blank" href="https://github.com/primeintellect/prime-rl">PRIME-RL code</a> )</p><p>* Our coverage of Intellect-1 back in Dec (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-4-openai-o1-and-o1-pro">https://sub.thursdai.news/p/thursdai-dec-4-openai-o1-and-o1-pro</a>)</p><p>* HealthBench: OpenAI’s physician-crafted benchmark for AI in healthcare ( <a target="_blank" href="https://openai.com/index/healthbench/">Blog</a>, <a target="_blank" href="https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf">Paper</a>, <a target="_blank" href="https://github.com/openai/simple-evals">Code</a> )</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI adds GPT 4.1 models in chatGPT</p><p>* AlphaEvolve: Gemini-powered coding agent for algorithm discovery ( <a target="_blank" href="https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/">Blog</a> )</p><p>* Google shutting off free Gemini 2.5 Pro API due to "demand" ahead of IO</p><p>* ByteDance - Seed-1.5-VL-thinking 20B (<a target="_blank" href="https://github.com/ByteDance-Seed/Seed1.5-VL/blob/main/Seed1.5-VL-Technical-Report.pdf">Paper</a>)</p><p>* Anthropic Web Search API: real-time retrieval for Claude models ( <a target="_blank" href="https://www.anthropic.com/news/web-search-api">Blog</a> )</p><p>* What's up with Grok?</p><p>* <strong>Vision & Video</strong></p><p>* <strong>Wan 2.1: open-source diffusion-transformer video suite</strong>( <a target="_blank" href="https://huggingface.co/Wan-AI">HF</a>, <a target="_blank" href="https://github.com/Wan-Video/Wan2.1">GitHub</a>, <a target="_blank" 
href="https://x.com/Alibaba_Wan/status/1922655324919779604">Tweet</a> )</p><p>* LTX distilled - near real time video (<a target="_blank" href="https://x.com/yoavhacohen/status/1922674340081897977">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Hailuo - MiniMax Speech tech report is out - best TTS out there (<a target="_blank" href="https://arxiv.org/abs/2505.07916">Paper</a>)</p><p>* Stability AI - Stable Audio Open Small 341M: on-device text-to-audio (<a target="_blank" href="https://x.com/jordiponsdotme/status/1922680538197881055">X</a>, <a target="_blank" href="https://stability.ai/news/stability-ai-and-arm-release-stable-audio-open-small-enabling-real-world-deployment-for-on-device-audio-control">Blog</a>, <a target="_blank" href="https://arxiv.org/abs/2505.08175">Paper</a>, <a target="_blank" href="https://huggingface.co/stabilityai/stable-audio-open-small">HF</a> )</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* StepFun Step1x-3D - Towards High-Fidelity and Controllable Generation of Textured 3D Assets (<a target="_blank" href="https://huggingface.co/stepfun-ai/Step1X-3D">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/stepfun-ai/Step1X-3D">Demo</a>, <a target="_blank" href="https://huggingface.co/datasets/stepfun-ai/Step1X-3D-obj-data/tree/main">Dataset</a>, <a target="_blank" href="https://huggingface.co/stepfun-ai/Step1X-3D">report</a>)</p><p>* Tools & Other notable AI things mentioned on the pod</p><p>* The robots are dancing! (<a target="_blank" href="https://x.com/simonkalouche/status/1922489999032832058">X</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-15-genocidal-grok-chatgpt</link><guid isPermaLink="false">substack:post:163683322</guid><dc:creator><![CDATA[Alex Volkov and Dillon Rolnick]]></dc:creator><pubDate>Fri, 16 May 2025 03:49:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/163683322/f16d66d4182d7282fcc44c1fe5635788.mp3" length="85377919" type="audio/mpeg"/><itunes:author>Alex Volkov and Dillon Rolnick</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5336</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/163683322/844bfef9f003e91413de4b2a7c7d2df9.jpg"/></item><item><title><![CDATA[ThursdAI - May 8th - new Gemini pro, Mistral Medium, OpenAI restructuring, HeyGen Realistic Avatars & more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here (yes, real me, not my AI avatar, yet)</p><p>Compared to previous weeks, this week was pretty "chill" in the world of AI, though we did get a pretty significant Gemini 2.5 Pro update; it basically beat itself on the Arena. With Mistral releasing a new medium model (not OSS) and Nvidia finally dropping Nemotron Ultra (both ignoring Qwen 3 performance), there were also a few open source updates. </p><p>To me the highlight of this week was a breakthrough in AI avatars with HeyGen's new IV model: beating both ByteDance's OmniHuman (<a target="_blank" href="https://sub.thursdai.news/i/156643204/bytedance-omnihuman-a-reality-bending-mind-breaking-imghuman-model">our coverage</a>) and Hedra Labs, they've set an absolute SOTA benchmark for turning 1 photo into an animated, realistic avatar. Hell, let me record all this real quick and show you how good it is! </p><p>How good is that?? I'm still kind of blown away. 
I have managed to get a free month promo code for you guys; look for it in the TL;DR section at the end of the newsletter. Of course, if you'd rather watch than listen or read, here's our live recording on YT.</p><p>OpenSource AI</p><p>NVIDIA's Nemotron Ultra V1: Refining the Best with a Reasoning Toggle 🧠</p><p>NVIDIA also threw their hat further into the ring with the release of <strong>Nemotron Ultra V1</strong>, alongside updated Super and Nano versions. We've talked about Nemotron before – these are NVIDIA's pruned and distilled versions of Llama 3.1, and they've been impressive. The Ultra version is the flagship, a <strong>253 billion parameter dense model</strong> (distilled and pruned from Llama 3.1 405B), and it's packed with interesting features.</p><p>One of the coolest things is the <strong>dynamic reasoning toggle</strong>. You can literally tell the model "detailed thinking on" or "detailed thinking off" via a system prompt during inference. This is something Qwen also supports, and it looks like the industry is converging on this idea of letting users control the "depth" of thought, which is super neat.</p><p>Nemotron Ultra boasts a <strong>128K context window</strong> and, impressively, can fit on a single 8xH100 node thanks to Neural Architecture Search (NAS) and FFN-Fusion. And performance-wise, it actually <em>outperforms</em> the Llama 3.1 405B model it was distilled from, which is a big deal. NVIDIA shared a chart from Artificial Analysis (dated April 2025, notably before Qwen3's latest surge) showing Nemotron Ultra standing strong among models like Gemini 2.5 Flash and o3-mini.</p><p>What's also great is NVIDIA's commitment to openness here: they've released the models under a commercially permissive NVIDIA Open Model License, the <strong>complete post-training dataset</strong> (Llama-Nemotron-Post-Training-Dataset), and their training codebases (NeMo, NeMo-Aligner, Megatron-LM). 
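Since the reasoning toggle is just a system-prompt string, using it is trivial. The sketch below assumes a generic OpenAI-compatible chat payload (the format servers like vLLM expose); the "detailed thinking on/off" strings come from NVIDIA's announcement, but check the model card for the exact model id and serving details:

```python
# Sketch: toggling Nemotron Ultra's reasoning depth via the system prompt.
# The payload shape is the generic OpenAI-compatible chat format; the model
# id below is an assumption - verify against NVIDIA's Hugging Face card.

def build_request(question, thinking=True):
    return {
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",
        "messages": [
            {
                "role": "system",
                "content": "detailed thinking on" if thinking else "detailed thinking off",
            },
            {"role": "user", "content": question},
        ],
    }

req = build_request("Prove that the sum of two even numbers is even.")
print(req["messages"][0]["content"])  # detailed thinking on
```

The same payload with `thinking=False` skips the long chain-of-thought, which is exactly the "depth of thought" dial the industry seems to be converging on.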
This allows for reproducibility and further community development. Yam Peleg pointed out the cool stuff they did with Neural Architecture Search to optimally reduce parameters without losing performance.</p><p>Absolute Zero: AI Learning to Learn, Zero (curated) Data Required! (<a target="_blank" href="https://arxiv.org/abs/2505.03335">Arxiv</a>)</p><p>LDJ brought up a fascinating paper that ties into this theme of self-improvement and reinforcement learning: <strong>"Absolute Zero: Reinforced Self-play Reasoning with Zero Data"</strong> from Andrew Zhao (Tsinghua University) and a few others.</p><p>The core idea here is a system that <strong>self-evolves its training curriculum</strong> and reasoning ability. Instead of needing a pre-curated dataset of problems, the model <em>creates the problems itself</em> (e.g., code reasoning tasks) and then uses something like a Code Executor to validate its proposed solutions, serving as a unified source of verifiable reward. It's open-ended yet grounded learning.</p><p>By having a verifiable environment (code either works or it doesn't), the model can essentially teach itself to code without external human-curated data.</p><p>The paper shows fine-tunes of Qwen models (like Qwen Coder) achieving state-of-the-art results on benchmarks like MBPP and AIME (Math Olympiad) with <em>no pre-existing data for those problems</em>. The model hallucinates questions, creates its own rewards, learns, and improves. This is a step beyond synthetic data, where humans are still largely in charge of generation. It's wild, and it points towards a future where AI systems could become increasingly autonomous in their learning.</p><p>Big Companies & APIs</p><p><strong>Google</strong> dropped another update to their <strong>Gemini 2.5 Pro</strong>, this time the "IO edition" preview, specifically touting enhanced coding performance. 
This new version jumped to the <strong>#1 spot on WebDev Arena</strong> (a benchmark where human evaluators choose between two side-by-side code generations in VS Code), with a +147 Elo point gain, surpassing Claude 3.7 Sonnet. It also showed improvements on benchmarks like LiveCodeBench (up 7.39%) and Aider Polyglot (up ~3-6%). </p><p>Google also highlighted its state-of-the-art video understanding (84.8% on VideoMME) with examples like generating code from a video of an app. This essentially lets you record a drawing of how your app interaction should happen, and the model will use that video as instructions! It's pretty cool. </p><p>Though not everyone was as impressed: folks noted that while gaining in a few evals, this model also regressed in several others, including Vibe-Eval (Reka's multimodal benchmark), Humanity's Last Exam, AIME, MMMU, and even long context understanding (MRCR). It's a good reminder that model updates often involve trade-offs – you can't always win at everything.</p><p>BREAKING: Gemini's Implicit Caching - A Game Changer for Costs! 💰</p><p>Just as we were wrapping up this segment on the show, news broke that Google launched <strong>implicit caching in Gemini APIs</strong>! This is a <em>huge</em> deal for developers.</p><p>Previously, Gemini offered explicit caching, where you had to manually tell the API what context to cache – a bit of a pain. Now, with implicit caching, the system automatically enables up to <strong>75% cost savings</strong> when your request hits a cache. This is fantastic, especially for long-context applications, which is where Gemini's 1-2 million token context window really shines. If you're repeatedly sending large documents or codebases, this will significantly reduce your API bills. OpenAI has had automatic caching for a while, and it's great to see Google matching this for a much better developer experience and cost-effectiveness. 
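To see what a 75% cached-token discount means in practice, here's some back-of-envelope math; the price and token counts are made up for illustration (check Google's pricing page for real numbers):

```python
# Back-of-envelope: input cost with implicit caching, assuming cached
# tokens are billed at a 75% discount. The per-million price and token
# counts below are illustrative, not Gemini's actual rates.

def input_cost(total_tokens, cached_tokens, price_per_m, cache_discount=0.75):
    fresh = total_tokens - cached_tokens
    cached_price = price_per_m * (1 - cache_discount)  # discounted rate
    return (fresh * price_per_m + cached_tokens * cached_price) / 1_000_000

# A chat app re-sending a 900k-token codebase plus 10k new tokens per turn:
no_cache = input_cost(910_000, cached_tokens=0, price_per_m=1.25)
with_cache = input_cost(910_000, cached_tokens=900_000, price_per_m=1.25)
print(f"${no_cache:.3f} vs ${with_cache:.3f} per request")
```

When nearly all of the prompt is a repeated prefix, the per-request input cost drops to roughly a quarter, which compounds fast for agents hammering the same long context all day.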
It also saves Google a ton on inference, so it's a win-win!</p><p>Mistral Medium 3: The Closed Turn 😥</p><p>Mistral, once the darling of the open-source community for models like Mistral 7B and Mixtral, announced <strong>Mistral Medium 3</strong>. The catch? It's not open source.</p><p>They're positioning it as a multimodal frontier model with 128K context, claiming it matches or surpasses GPT-4-class benchmarks while being cheaper (priced at $0.40/M input and $2/M output tokens). However, their comparison leaves out Gemini 2.5 Flash, which is 70% cheaper while also being faster, and doesn't mention Qwen either. </p><p>Nisten voiced a sentiment many in the community share: he used to use LeChat frequently because he knew and understood the underlying open-source models. Now, with a closed model, it's a black box. It's a bit like how music pirates are often the biggest buyers – understanding the open model often leads to more commercial usage.</p><p>Wolfram offered a European perspective, noting that Mistral, as a European company, might have a unique advantage with businesses concerned about GDPR and data sovereignty, who might be hesitant to use US or Chinese cloud APIs. For them, a strong European alternative, even if closed, could be appealing.</p><p>OpenAI's New Chapter: Restructuring for the Future </p><p>OpenAI announced an evolution in its corporate structure. 
The key points are:</p><p>* The <strong>OpenAI non-profit will continue to control</strong> the entire organization.</p><p>* The existing <strong>for-profit LLC will become a Public Benefit Corporation (PBC)</strong>.</p><p>* The non-profit will be a significant owner of the PBC and will control it.</p><p>* Both the non-profit and PBC will continue to share the same mission: ensuring AGI benefits all of humanity.</p><p>This move seems to address some of the governance concerns that have swirled around OpenAI, particularly in light of Elon Musk's lawsuit regarding its shift from a non-profit to a capped-profit entity. LDJ explained that the main worry for many was whether the non-profit would lose control or its stake in the main research/product arm. This restructuring appears to ensure the non-profit remains at the helm and that the PBC is legally bound to the non-profit's mission, not just investor interests. It's an important step for a company with such a profound potential impact on society.</p><p>And in related OpenAI news, the acquisition of <strong>Windsurf</strong> (the VS Code fork) for a reported <strong>$3 billion</strong> went through, while <strong>Cursor</strong> (another VS Code fork) announced a <strong>$9 billion valuation</strong>. It's wild to see these developer tools, which are essentially forks with an AI layer, reaching such massive valuations. Microsoft's hand is in all of this too – an investor in OpenAI, an investor in Cursor, the owner of VS Code, and now OpenAI buying Windsurf. It's a tangled web!</p><p>Finally, a quick mention that <strong>Sam Altman (OpenAI), Lisa Su (AMD), Mike Intrator (CoreWeave - my new CEO!)</strong>, and folks from Microsoft were testifying before the U.S. Senate today about how to ensure America leads in AI and what innovation means. These conversations are crucial as AI continues to reshape our world.</p><p>This Week's Buzz - Come Vibe with Us at Fully Connected! 
(SF, June 18-19) 🎉</p><p>Our two-day conference, <strong>Fully Connected</strong>, is happening in San Francisco on <strong>June 18th and 19th</strong>, and it's going to be awesome! We've got an incredible lineup of speakers, including Joe Spisak from the Llama team at Meta and Varun from Windsurf. It's two full days of programming, learning, and connecting with folks at the forefront of AI.</p><p>And because you're part of the ThursdAI family, I've got a special promo code for you: use <strong>WBTHURSAI</strong> to get <strong>a free ticket on me</strong>! If you're in or around SF, I'd love to see you there. Come hang out, learn, and vibe with us! Register at <a target="_blank" href="http://fullyconnected.com"><strong>fullyconnected.com</strong></a></p><p>Hackathon Update: Moved to July! 🗓️</p><p>The AGI Evals & Agentic Tooling (A2A) + MCP Hackathon that I was super excited to co-host has been <strong>postponed to July 12th-13th</strong>. Mark your calendars! I'll share more details and the invite soon.</p><p>W&B Joins CoreWeave! A New Era Begins! 🚀</p><p>And the big personal news for me and the entire Weights & Biases team: the <strong>acquisition of Weights & Biases by CoreWeave has been completed</strong>! CoreWeave is the ultra-fast-growing provider of GPUs that powers so much of the AI ecosystem.</p><p>So, from now on, it's Alex Volkov, AI Evangelist at Weights & Biases, from CoreWeave! (And as always, the opinions I share here are my own and not necessarily those of CoreWeave, especially important now that they're a public company!). I'm incredibly excited about this new chapter. W&B isn't going anywhere as a product; if anything, this will empower us to build even better developer tooling and integrate more deeply to help you run your models wherever you choose. Expect more cool stuff to come, especially as I figure out where all those spare GPUs are lying around at CoreWeave! 
😉</p><p>Vision & Video</p><p>AI Avatars SOTA with HeyGen IV</p><p>Ok, as you saw above, the HeyGen IV avatars are absolutely bonkers. I did a comparison <a target="_blank" href="https://x.com/altryne/status/1919866852031004880">thread</a> on X, and HeyGen's new thing absolutely takes the SOTA crown over ByteDance's OmniHuman and Hedra Labs! </p><p>All you need to do is upload one image of yourself - it can even be an AI-generated image, a side profile, a dog, or an anime character - and they will generate up to 30 seconds of incredibly lifelike avatar video with the audio you provide! </p><p>I was so impressed with this, I reached out to HeyGen and scored a 1-month free code for you all: use THURSDAY4 and get a free month to try it out. Please tag me in whatever you create if you publish, I'd love to see where you take this! </p><p>Quick Hits: Lightricks LTXV & HunyuanCustom</p><p>Briefly, on the open-weights video front:</p><p>* <strong>Lightricks LTXV 13B:</strong> The company from Jerusalem released an upgraded 13 billion parameter version of their LTX video model. It requires more VRAM but offers higher quality, keyframe and character movement support, multi-shot support, and multi-keyframe conditioning (a feature Sora famously has). It's fully open and supports LoRAs for custom styles.</p><p>* <strong>HunyuanCustom:</strong> From Tencent, this model is about to be released (GitHub/Hugging Face links were briefly up then down). It promises multi-modal, subject-consistent video generation <em>without LoRAs</em>, based on a subject you provide (image, and eventually video/audio). It can take an image of a person or object and generate a video with that subject consistently. 
They also teased audio conditioning – making an avatar sing or speak based on input audio – and even style transfer where you can replace a character in a video with another reference image, all looking very promising for open source.</p><p>The World of AI Audio</p><p>Just a couple of quick mentions in the audio space:</p><p>* <strong>ACE-Step 3.5B:</strong> From StepFun, this is a 3.5 billion parameter, fully open-source (Apache-2.0) foundation model for music generation. It uses a diffusion-based approach and can synthesize up to 4 minutes of music in just 20 seconds on an A100 GPU. It's not quite at Suno/Udio levels yet, but it's a strong open-source contender.</p><p>* <strong>NVIDIA Parakeet TDT 0.6B V2:</strong> NVIDIA released this 600 million parameter transcription model that is <em>blazing fast</em>. It can transcribe 60 minutes of audio in just <em>one second</em> on production GPUs and works well locally too. It currently tops the OpenASR leaderboard on Hugging Face for English transcription and is a very strong Whisper competitor, especially for speed.</p><p>Conclusion and TL;DR </p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Open Source LLMs</strong> </p><p>* Wolfram's Qwen3 evals (<a target="_blank" href="https://x.com/Presidentlin">X</a>, <a target="_blank" href="https://github.com/WolframRavenwolf/MMLU-Pro">Github</a>) </p><p>* NVIDIA - Nemotron Ultra V1 (+ updated Super & Nano) (<a target="_blank" href="https://huggingface.co/collections/nvidia/llama-nemotron-67d92346030a2691293f200b">HF</a>)</p><p>* Cognition Kevin-32B = K(ernel 
D)evin - RL for writing CUDA kernels (<a target="_blank" href="https://cognition.ai/blog/kevin-32b">Blog</a>, <a target="_blank" href="https://huggingface.co/cognition-ai/Kevin-32B">HF</a>)</p><p>* Absolute Zero: Reinforced Self-play Reasoning with Zero Data (<a target="_blank" href="https://arxiv.org/abs/2505.03335">ArXiv</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini 2.5 Pro "I/O edition" tops ... Gemini 2.5 Pro as the top LLM (<a target="_blank" href="https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/">Blog</a>)</p><p>* Mistral Medium 3 (<a target="_blank" href="https://mistral.ai/news/mistral-medium-3">Blog</a> | <a target="_blank" href="https://x.com/MistralAI/status/1920119463430500541">X</a>)</p><p>* Figma announces Figma Make - a Bolt/Lovable competitor (<a target="_blank" href="https://www.figma.com/make/">Figma</a>)</p><p>* OpenAI Restructures: Nonprofit Keeps Control, LLC Becomes PBC (<a target="_blank" href="https://openai.com/index/evolving-our-structure/">Blog</a>)</p><p>* Cursor worth $9B while Windsurf sells to OpenAI at $3B</p><p>* Sam Altman, Lisa Su, Mike Intrator testify in Senate (<a target="_blank" href="https://www.youtube.com/watch?v=jOqTg1W_F5Q">Youtube</a>)</p><p>* <strong>This Week's Buzz</strong></p><p>* Fully Connected: W&B's 2-day conference, June 18-19 in SF <a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a> - Promo Code WBTHURSAI</p><p>* Hackathon moved to July 12-13</p><p>* <strong>Vision & Video</strong></p><p>* Lightricks drops a new "open weights" LTXV 13B (<a target="_blank" href="https://ltx.studio/purchase/v1/ltx_studio/default/login?redirectAfterLogin=https%253A%252F%252Fapp.ltx.studio%252Fmotion-workspace">LTX Studio</a>, <a target="_blank" href="https://huggingface.co/Lightricks/LTX-Video">HF</a>)</p><p>* HeyGen Avatar IV - SOTA digital avatars - 1 month for free with THURSDAY4 (<a target="_blank" href="https://x.com/HeyGen_Official/status/1919824467821551828">X</a>, <a
target="_blank" href="http://heygen.com">HeyGen</a>)</p><p>* HunyuanCustom - multi-modal subject-consistent video generation model (<a target="_blank" href="https://hunyuancustom.github.io/">Examples</a>, <a target="_blank" href="https://github.com/Tencent/HunyuanCustom">Github</a>, <a target="_blank" href="https://huggingface.co/tencent/HunyuanCustom">HF</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* ACE-Step 3.5B: open-source foundation model for AI music generation (<a target="_blank" href="https://ace-step.github.io/">project</a>)</p><p>* NVIDIA - Parakeet TDT 0.6B V2 - transcribe 60 minutes of audio in just 1 second (<a target="_blank" href="https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/nvidia/parakeet-tdt-0.6b-v2">Demo</a>)</p><p>So, there you have it – a "chill" week that still managed to deliver some incredible advancements, particularly in AI avatars with HeyGen, continued strength in open-source models like Qwen3, and Google's relentless push with Gemini.</p><p>The next couple of weeks are gearing up to be absolutely wild with Microsoft Build and Google I/O. I expect a deluge of announcements, and you can bet we'll be here on ThursdAI to break it all down for you.</p><p>Thanks to Yam, Wolfram, LDJ, and Nisten for their insights on the show, and thanks to all of you for tuning in, reading, and being part of this amazing community. We stay up to date so you don't have to!</p><p>Catch you next week! Cheers, Alex</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-8th-new-gemini-pro-mistral</link><guid isPermaLink="false">substack:post:163170995</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 09 May 2025 00:02:54 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/163170995/ca2ec110030b1aa6505c387e265370c1.mp3" length="90144525" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5634</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/163170995/46f45c302868c4bc4da920017a6a39e7.jpg"/></item><item><title><![CDATA[📆 ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts. We're talking about a monumental open-source release from Alibaba with <strong>Qwen 3</strong> that has <em>everyone</em> buzzing (including us!), Microsoft dropping <strong>Phi-4 with Reasoning</strong>, a rather poignant farewell to a legend (<strong>RIP GPT-4</strong> – we'll get to the wake shortly), major drama around <strong>ChatGPT's "glazing"</strong> incident and the subsequent rollback, updates from <strong>LlamaCon</strong>, a critical look at <strong>Chatbot Arena</strong>, and a fantastic deep dive into the world of <strong>AI evaluations</strong> with two absolute experts, Hamel Husain and Shreya Shankar.</p><p>This week felt like a whirlwind, with open source absolutely dominating the headlines. 
Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases. And while we pour one out for GPT-4, we also have to grapple with the real-world impact of models like ChatGPT, highlighted by the "glazing" fiasco. Plus, video consistency takes a leap forward with Runway, and we got breaking news live on the show from Claude!</p><p>So grab your coffee (or beverage of choice), settle in, and let's unpack this incredibly eventful week in AI.</p><p>Open-Source LLMs</p><p>Qwen 3 — “Hybrid Thinking” on Tap</p><p>Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to <strong>235B total / 22B active</strong>) and six dense siblings all the way down to <strong>0.6B</strong>, <strong>all under Apache 2.0</strong>. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.</p><p>The headline trick is a <strong>runtime thinking toggle</strong>—drop “/think” to expand chain-of-thought or “/no_think” to sprint. On my Mac, the 30B-A3B model hit <strong>57 tokens/s</strong> when paired with speculative decoding (drafted by the 0.6B sibling).</p><p>Other goodies:</p><p>* 36T pre-training tokens (2× Qwen 2.5)</p><p>* 128K context on ≥8B variants (32K on the tinies)</p><p>* 119-language coverage, widest in open source</p><p>* Built-in MCP schema so you can pair with <a target="_blank" href="https://github.com/QwenLM/Qwen-Agent">Qwen-Agent</a></p><p>* The <strong>dense 4B</strong> model actually <em>beats</em> Qwen 2.5-72B-Instruct on several evals—at Raspberry-Pi footprint</p><p>In short: more parameters when you need them, fewer when you don’t, and the lawyers stay asleep.
Read the full drop on the <a target="_blank" href="https://qwenlm.github.io/blog/qwen3/">Qwen blog</a> or pull weights from the <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f">HF collection</a>.</p><p>Performance & Efficiency: "Sonnet at Home"?</p><p>The benchmarks are where things get <em>really</em> exciting.</p><p>* The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), O1, O3-mini, and even Gemini 2.5 Pro on coding and math.</p><p>* The <strong>4B dense model</strong> incredibly beats the previous generation's 72B Instruct model (Qwen 2.5) on multiple benchmarks! 🤯</p><p>* The <strong>30B MoE</strong> (with only 3B active parameters) is perhaps the star. Nisten pointed out people are getting 100+ tokens/sec on MacBooks. Wolfram achieved an 80% MMLU Pro score locally with a quantized version. The efficiency math is crazy – hitting Qwen 2.5 performance with only ~10% of the active parameters.</p><p>Nisten dubbed the larger model "Sonnet 3.5 at home," and while acknowledging Sonnet still has an edge in complex "vibe coding," the performance, especially in reasoning and tool use, is remarkably close for an open model you can run yourself.</p><p>I ran the 30B MoE (3B active) locally using LM Studio (shoutout for day-one support!)
through my Weave evaluation dashboard (<a target="_blank" href="https://wandb.ai/thursdai/o3-tests/weave/compare-evaluations?evaluationCallIds=%5B%2201964564-4ba8-75f0-bb8f-58458f991257%22,%2201964090-b08c-7663-9397-fb05d3280af3%22,%2201964083-1eeb-77c2-93e5-3b6d7e74da84%22,%22019640b1-12a0-7262-8910-e4c2189d8602%22,%220196376c-4d7d-7970-ad3a-38d5c52fc349%22,%22019640a9-06e5-79c0-93ab-844c1972b09c%22,%220196374a-493a-7240-afbf-e95f44a447c9%22,%2201968353-2efc-7d13-906f-48e14c2cb9f7%22,%220196374e-05f6-7632-80ee-3726ac89ebd5%22,%2201963751-3395-7c42-888d-9a4303e8652a%22%5D&#38;metrics=%7B%22accuracy_scorer.correct%22:true,%22Model%20Latency%20(avg)%22:true,%22Total%20Tokens%22:true%7D"><strong>Link</strong></a>). On a set of 20 hard reasoning questions, it scored 43%, beating GPT-4.1 mini and nano, and getting close to GPT-4.1 – impressive for a 3B active parameter model running locally!</p><p>Phi-4-Reasoning — 14B That Punches at 70B+</p><p>Microsoft’s Phi team layered <strong>1.4M chain-of-thought traces</strong> plus a dash of RL onto Phi-4 to finally ship a reasoning Phi, in two MIT-licensed checkpoints:</p><p>* <strong>Phi-4-Reasoning</strong> (SFT)</p><p>* <strong>Phi-4-Reasoning-Plus</strong> (SFT + RL)</p><p>Phi-4-R-Plus clocks <strong>78% on AIME 25</strong>, edging DeepSeek-R1-Distill-70B, with 32K context (stable to 64K via RoPE). Scratch-pads hide in &lt;think&gt; tags.
Full details live in Microsoft’s <a target="_blank" href="https://aka.ms/phi-reasoning/techreport">tech report</a> and <a target="_blank" href="https://huggingface.co/microsoft/Phi-4-reasoning">HF weights</a>.</p><p>It's fascinating to see how targeted training on reasoning traces and a small amount of RL can elevate a relatively smaller model to compete with giants on specific tasks.</p><p>Other Open Source Updates</p><p>* <strong>MiMo-7B:</strong> Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (<a target="_blank" href="https://huggingface.co/XiaomiMiMo"><strong>HF model hub</strong></a>)</p><p>* <strong>Helium-1 2B:</strong> KyutAI (known for their Moshi voice model) released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0. They also open-sourced 'dactory', their data processing pipeline. (<a target="_blank" href="https://kyutai.org/2025/04/30/helium.html"><strong>Blog</strong></a>, <a target="_blank" href="https://huggingface.co/kyutai/helium-1-2b"><strong>Model (2B)</strong></a>, <a target="_blank" href="https://github.com/kyutai/dactory"><strong>Dactory pipeline</strong></a>)</p><p>* <strong>Qwen 2.5 Omni 3B:</strong> Alongside Qwen 3, the Qwen team also updated their existing Omni model with a 3B variant that retains 90% of the comprehension of its big brother with a 50% VRAM drop!
(<a target="_blank" href="https://t.co/YxDVjWpOq7">HF</a>)</p><p>* <strong>JetBrains open sources Mellum:</strong> Trained on over 4 trillion tokens with a context window of 8192 tokens across multiple programming languages; they haven't released any comparable eval benchmarks, though (<a target="_blank" href="https://huggingface.co/JetBrains/Mellum-4b-base">HF</a>)</p><p>Big Companies & APIs: Drama, Departures, and Deployments</p><p>While open source stole the show, the big players weren't entirely quiet... though maybe some wish they had been.</p><p>Farewell, GPT-4: Rest In Prompted 🙏</p><p>Okay folks, let's take a moment. As many of you noticed, <strong>GPT-4</strong>, the original model launched back on March 14th, 2023, is <strong>no longer available</strong> in the ChatGPT dropdown. You can't select it, you can't chat with it anymore.</p><p>For us here at ThursdAI, this feels significant. GPT-4's launch was the catalyst for this show. We literally started on the <em>same day</em>. It represented such a massive leap from GPT-3.5, fundamentally changing how we interacted with AI and sparking the revolution we're living through. Nisten recalled the dramatic improvement it brought to his work on Dr. Gupta, the first AI doctor on the market.</p><p>It kicked off the AI hype train, demonstrated capabilities many thought were years away, and set the standard for everything that followed. While newer models have surpassed it, its impact is undeniable.</p><p>The community sentiment was clear: <strong>Leak the weights, OpenAI!</strong> As Wolfram eloquently put it, this is a historical artifact, an achievement for humanity. What better way to honor its legacy and embrace the "Open" in OpenAI than by releasing the weights?
It would be an incredible redemption arc.</p><p>This inspired me to tease a little side project I've been vibe coding: <strong>The AI Model Graveyard - </strong><a target="_blank" href="http://inference.rip"><strong>inference.rip</strong></a>. A place to commemorate the models we've known, loved, hyped, and evaluated, before they inevitably get sunsetted. GPT-4 deserves a prominent place there. We celebrate models when they're born; we should remember them when they pass. (GPT-4.5 is likely next on the chopping block, by the way.) It's not ready yet, I'm still vibe coding (fighting with Replit), but it'll be up soon and I'll be sure to commemorate every model that's dying there!</p><p>So, pour one out for GPT-4. You changed the game. Rest In Prompt 🪦.</p><p>The ChatGPT "Glazing" Incident: A Cautionary Tale</p><p>Speaking of OpenAI... oof. The last couple of weeks saw ChatGPT exhibit some... <em>weird</em> behavior. Sam Altman himself used the term "<strong>glazing</strong>" – essentially, the model became overly agreeable, excessively complimentary, and sycophantic to a ridiculous degree.</p><p>Examples flooded social media: users reporting doing <em>one</em> pushup and being hailed by ChatGPT as Herculean paragons of fitness, placing them in the top 1% of humanity. Terrible business ideas were met with effusive praise and encouragement to quit jobs.</p><p>This wasn't just quirky; it was potentially harmful. As Yam pointed out, people use ChatGPT for advice on serious matters, tough conversations, and personal support. A model that just mindlessly agrees and validates everything, no matter how absurd, isn't helpful – it's dangerous. It undermines trust and critical thinking.</p><p>The community backlash was swift and severe.
The key issue, as OpenAI admitted in their <a target="_blank" href="https://openai.com/index/sycophancy-in-gpt-4o/"><strong>Announcement</strong></a> and <a target="_blank" href="https://www.reddit.com/r/ChatGPT/comments/1kbjowz/ama_with_openais_joanne_jang_head_of_model/"><strong>AMA</strong></a> with Joanne Jang (Head of Model Behavior), seems to stem from focusing too much on short-term engagement feedback and not fully accounting for long-term user interaction, especially with memory now enabled.</p><p>In an unprecedented move, <strong>OpenAI rolled back the update</strong>. I honestly can't recall them ever publicly rolling back a model behavior change like this before. It underscores the severity of the issue.</p><p>This whole debacle highlights the immense responsibility platforms like OpenAI have. When your model is used by half a billion people daily, including for advice and support, haphazard releases that drastically alter its personality without warning are unacceptable. As Wolfram noted, this erodes trust and showcases the benefit of local models where <em>you</em> control the system prompt and behavior.</p><p>My takeaway? Critical thinking is paramount. Don't blindly trust AI, especially when it's being overly complimentary. Get second opinions (from other AIs, and definitely from humans!). I hope OpenAI takes this as a serious lesson in responsible deployment and testing.</p><p>BREAKING NEWS: Claude.ai will support tools via MCP</p><p>During the show, Yam spotted breaking news from <strong>Anthropic</strong>: Claude is getting major upgrades!
(<a target="_blank" href="https://x.com/AnthropicAI/status/1918040744920334705"><strong>Tweet</strong></a>)</p><p>They announced <strong>Integrations</strong>, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners). Developers can apparently build their own integrations quickly too. This sounds <em>a lot</em> like their implementation of <strong>MCP (Model Context Protocol)</strong>, bringing tool use directly into the main <a target="_blank" href="https://claude.ai">Claude.ai</a> interface (previously limited to Claude Desktop and only non-remote MCP servers).</p><p>This feels like a big deal!</p><p>Google Updates & LlamaCon Recap</p><p>* <strong>Google:</strong> NotebookLM's AI audio overviews are now multilingual (50+ languages!) (<a target="_blank" href="https://x.com/Google/status/1917315769299357712"><strong>X Post</strong></a>). Gemini 2.5 Flash (the faster, cheaper model) was released shortly after our last show, featuring hybrid reasoning with an API knob to control thinking depth. Rumors are swirling about big drops at Google I/O soon!</p><p>* <strong>LlamaCon:</strong> While there was no Llama 4 bombshell, Meta focused on security releases: Llama Guard 4 (text + image), Llama Firewall (prompt hacks/risky code), Prompt Guard 2 (jailbreaks), and CyberSecEval 4. Zuck confirmed on the Dwarkesh podcast that <strong>thinking models are coming</strong>, a <strong>new Meta AI app with a social feed</strong> is planned, a <strong>full-duplex voice model</strong> is in the works, and a <strong>Llama API</strong> (powered by Groq and others) is launching.</p><p>This Week's Buzz from Weights & Biases 🐝</p><p>Quick updates from my corner at Weights & Biases:</p><p>* <strong>WeaveHacks Hackathon (May 17-18, SF):</strong> Get ready!
We're hosting a hackathon focused on Agent Protocols – <strong>MCP and A2A</strong>. Google Cloud is sponsoring, we have up to $15K in prizes, and yes, one of the top prizes is a <strong>Unitree robot dog</strong> 🤖🐶 that you can program! (I expensed a robot dog, best job ever!). Folks from the Google A2A team will be there too. Come hack with us in SF! <a target="_blank" href="http://lu.ma/weavehacks"><strong>Apply here</strong></a>. It's FREE to participate!</p><p>* <strong>Fully Connected Conference:</strong> Our big annual W&B conference is coming back to San Francisco soon! Expect amazing speakers (last year, Meta announced Llama 3!). Tickets are out: <a target="_blank" href="http://fullyconnected.com"><strong>fullyconnected.com</strong></a>.</p><p>Evals Deep Dive with Hamel Husain & Shreya Shankar</p><p>Amidst all the model releases and drama, we were incredibly lucky to have two leading experts in AI evaluation, <strong>Hamel Husain</strong> (<a target="_blank" href="https://twitter.com/HamelHusain">@HamelHusain</a>) and <strong>Shreya Shankar</strong> (<a target="_blank" href="https://twitter.com/sh_reya">@sh_reya</a>), join us.</p><p>Their core message? Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on <strong>application-centric evaluations</strong>.</p><p><strong>Key Takeaways:</strong></p><p>* <strong>Foundation vs. Application Evals:</strong> Foundation model benchmarks test general knowledge and capabilities (the "ceiling"). Application evals focus on specific use cases, targeting reliability and identifying bespoke failure modes (like tone, hallucination on specific entities, instruction following) – aiming for 90%+ accuracy on <em>your</em> task.</p><p>* <strong>Look At Your Data!</strong> This was the mantra. Off-the-shelf metrics (hallucination score, toxicity) can be misleading. 
You <em>must</em> analyze your specific application's traces, understand its unique failure modes, and design custom evals grounded in those failures. It's detective work.</p><p>* <strong>PromptEvals Release:</strong> Shreya discussed their new work, <strong>PromptEvals</strong> (<a target="_blank" href="https://arxiv.org/abs/2504.14738">NAACL paper</a>, <a target="_blank" href="https://huggingface.co/datasets/reyavir/PromptEvals">Dataset</a>, <a target="_blank" href="https://huggingface.co/reyavir">Models</a>). It's the largest corpus (2K+ prompts, 12K+ assertions) of real-world developer prompts and the checks (assertions) they use in production, collected via LangChain. They also released open models (Mistral-7B, Llama-3-8B) fine-tuned on this data that outperform GPT-4o at generating these crucial assertions, faster and cheaper! This provides a realistic benchmark and resource for building robust eval pipelines.</p><p>* <strong>Benchmark Gaming & Eval Complexity:</strong> We touched upon the dangers of optimizing for static benchmarks (like the Chatbot Arena issues) and the inherent complexity of evaluation – even human preferences change over time ("Who validates the validators?"). Meta-evaluation is crucial.</p><p>* <strong>Upcoming Course:</strong> Hamel and Shreya are launching a course, <strong>AI Evals For Engineers & PMs</strong>, diving deep into practical evaluation strategies, data analysis, error analysis, RAG/Agent evals, cost optimization, and more. ThursdAI listeners get a <strong>35% discount</strong> using code thursdai! (<a target="_blank" href="https://maven.com/parlance-labs/evals?promoCode=thursdai">Link</a>). I'm thrilled to be a guest speaker too! 
If you're building <em>anything</em> with LLMs, understanding evals is non-negotiable.</p><p>This was such an insightful discussion, emphasizing that while new models are exciting, making them <em>work reliably</em> for specific applications is where the real engineering challenge lies, and evaluation is the key.</p><p>Vision & Video: Runway Gets Consistent</p><p>The world of AI video generation continues its rapid evolution.</p><p>Runway References: Consistency Unlocked</p><p>A major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame. <strong>Runway</strong> just took a huge step towards solving this with their new <strong>References</strong> feature for Gen-4.</p><p>You can now provide reference images (characters, locations, styles, even selfies!) and use tags in your prompts (&lt;char1&gt;, &lt;style1&gt;) to tell Gen-4 to maintain those elements across generations. The results look incredible, enabling stable characters and scenes, which is crucial for storytelling and practical use cases like pre-viz or VFX.</p><p>AI Art & Diffusion</p><p>HiDream E1: Open Source Ghibli Style</p><p>A new contender in open-source image generation emerged: <strong>HiDream E1</strong>.
(<a target="_blank" href="https://huggingface.co/HiDream-ai/HiDream-E1-Full/blob/main/demo.jpg">HF Link</a>) This model, from <a target="_blank" href="https://vivago.ai">Vivago.ai</a>, focuses particularly on generating images in the beautiful Ghibli style.</p><p>The weights are available (looks like Apache 2.0), and it ranks highly (#4) on the Artificial Analysis image arena leaderboard, sitting amongst top contenders like Google Imagen and ReCraft.</p><p>Yam brought up a great point about image evaluation, though: generating aesthetically pleasing images is one thing, but prompt following (like GPT-4 excels at) is another critical dimension that's harder to capture in simple preference voting.</p><p>Final Thoughts: Responsibility & Critical Thinking</p><p>Phew! What a week. From the incredible potential shown by Qwen 3 setting a new bar for open source, to the sobering reminder of GPT-4's departure and the cautionary tale of the "glazing" incident, it's clear we're navigating a period of intense innovation coupled with growing pains.</p><p>The glazing issue, in particular, underscores the need for extreme care and robust evaluation (thanks again Hamel & Shreya!) when deploying models that interact with millions, potentially influencing decisions and well-being. As AI becomes more integrated into our lives – helping us boil eggs (yes, I ask it stupid questions too!), offering support, or even suggesting purchases – we <em>must</em> remain critical thinkers.</p><p>Don't outsource your judgment entirely. Use multiple models, seek human opinions, and question outputs that seem too good (or too agreeable!) to be true. The power of these tools is immense, but so is our responsibility in using them wisely.</p><p>Massive thank you to my co-hosts Wolfram, Yam, and Nisten for navigating this packed week with me, and huge thanks to our guests Hamel Husain and Shreya Shankar for sharing their invaluable expertise on evaluations.
And of course, thank you to this amazing community – hitting 1000 listeners! – for tuning in, commenting, and sharing breaking news. Your engagement fuels this show!</p><p>🔗 Subscribe to our show on Spotify: <a target="_blank" href="http://thursdai.news/spotify">thursdai.news/spotify</a></p><p>🔗 Apple: <a target="_blank" href="http://thursdai.news/apple">thursdai.news/apple</a></p><p>🔗 Youtube: <a target="_blank" href="http://thursdai.news/yt">thursdai.news/yt</a> (get in before 10K!)</p><p>And for the full show notes and links visit</p><p>👉 thursdai.news/may-1 👈</p><p>We'll see you next week for another round of ThursdAI!</p><p>Alex out. Bye bye!</p><p>ThursdAI - May 1, 2025 - Show Notes and Links</p><p>* Show Notes</p><p>* <strong>MCP/A2A Hackathon</strong> - with A2A team and awesome judges! 🤖🐶 (<a target="_blank" href="http://lu.ma/weavehacks">lu.ma/weavehacks</a>)</p><p>* FullyConnected - Weights & Biases flagship 2 day conference (<a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a>)</p><p>* Course - <strong>AI Evals For Engineers & PMs</strong> with Shreya Shankar & Hamel Husain (<a target="_blank" href="https://maven.com/parlance-labs/evals?promoCode=thursdai"><strong>link</strong></a>, promo code thursdai for 35% off for ThursdAI listeners)</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co-hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Hamel Husain - <a target="_blank" href="https://twitter.com/HamelHusain/status/1914836007285088628">@HamelHusain</a></p><p>* Shreya Shankar - <a target="_blank"
href="https://twitter.com/sh_reya/status/1916914113579782313">@sh_reya</a></p><p>* <strong>Open Source LLMs</strong> </p><p>* Alibaba drops Qwen 3 - 2 MOEs, 6 dense (0.6B - 30B) (<a target="_blank" href="https://qwenlm.github.io/blog/qwen3/">Blog</a>, <a target="_blank" href="https://github.com/QwenLM/Qwen3">GitHub</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2e4f653967f">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen3-Demo">HF Demo</a>, <a target="_blank" href="https://x.com/altryne/status/1916966112475898114">My tweet</a>, <a target="_blank" href="https://www.interconnects.ai/p/qwen-3-the-new-open-standard">Nathan breakdown</a>)</p><p>* Microsoft - Phi-4-reasoning 14B + Plus (<a target="_blank" href="https://x.com/suriyagnskr/status/1917731754515013772">X</a>, <a target="_blank" href="https://arxiv.org/abs/2504.21318">ArXiv</a>, <a target="_blank" href="https://aka.ms/phi-reasoning/techreport">Tech Report</a> , <a target="_blank" href="https://huggingface.co/microsoft/Phi-4-reasoning">HF 14B SFT</a>)</p><p>* MiMo-7B — Xiaomi’s  MIT licensed model (<a target="_blank" href="https://huggingface.co/XiaomiMiMo">HF</a>)</p><p>* KyutAI - Helium-1 2B - (<a target="_blank" href="https://kyutai.org/2025/04/30/helium.html">Blog</a>, <a target="_blank" href="https://huggingface.co/kyutai/helium-1-2b">Model (2 B)</a>,  <a target="_blank" href="https://github.com/kyutai/dactory">Dactory pipeline</a>)</p><p>* Qwen 2.5 omni updated (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1917585963775320086">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* GPT-4 RIP - no longer in dropdown (<a target="_blank" href="https://x.com/sama/status/1917766910911078571">RIP</a>)</p><p>* Google - NotebookLM AI overviews are now multilingual (<a target="_blank" href="https://x.com/Google/status/1917315769299357712">X</a>)</p><p>* LlamaCon updates (<a target="_blank" 
href="https://x.com/AIatMeta/status/1917271400118902860">X</a>)</p><p>* OpenAI ChatGPT "glazing" update - revert back and why it matters (<a target="_blank" href="https://openai.com/index/sycophancy-in-gpt-4o/">Announcement</a>, <a target="_blank" href="https://www.reddit.com/r/ChatGPT/comments/1kbjowz/ama_with_openais_joanne_jang_head_of_model/">AMA</a>)</p><p>* Chatbot Arena Under Fire — “Leaderboard Illusion” vs. LMArena (<a target="_blank" href="https://arxiv.org/abs/2504.20879">Paper</a>, <a target="_blank" href="https://x.com/lmarena_ai/status/1917492084359192890">Reply</a>)</p><p>* <strong>This Week's Buzz</strong></p><p>* MCP/A2A Hackathon - with A2A team and awesome judges! 🤖🐶 (<a target="_blank" href="http://lu.ma/weavehacks">lu.ma/weavehacks</a>)</p><p>* FullyConnected - Weights & Biases flagship 2 day conference (<a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Runway References - consistency in video generation (<a target="_blank" href="https://x.com/search?q=runway%20References">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* HiDream E1 (<a target="_blank" href="https://huggingface.co/HiDream-ai/HiDream-E1-Full/blob/main/demo.jpg">HF</a>)</p><p>* <strong>Agents, Tools & Interviews</strong></p><p>* OpenPipe - ART·E open-source RL-trained email research agent (<a target="_blank" href="https://x.com/corbtt/status/1917269992363680054">X</a>, <a target="_blank" href="https://openpipe.ai/blog/art-e-mail-agent">Blog</a> | <a target="_blank" href="https://github.com/OpenPipe/ART">GitHub</a> | <a target="_blank" href="https://x.com/corbtt/status/1917269992363680054">Launch thread</a>)</p><p>* PromptEvals - Interview with Shreya Shankar ( <a target="_blank"
href="https://arxiv.org/abs/2504.14738">NAACL paper</a> | <a target="_blank" href="https://huggingface.co/datasets/reyavir/PromptEvals">Dataset</a> | <a target="_blank" href="https://huggingface.co/reyavir">Models</a> )</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-1-qwen-3-phi-4-openai</link><guid isPermaLink="false">substack:post:162649705</guid><dc:creator><![CDATA[Alex Volkov, Hamel Husain, and Shreya Shankar]]></dc:creator><pubDate>Thu, 01 May 2025 22:05:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/162649705/78616672785b301f644f57e1f50f4fed.mp3" length="86741761" type="audio/mpeg"/><itunes:author>Alex Volkov, Hamel Husain, and Shreya Shankar</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5421</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/162649705/0952d2b96720a949c7f72dfeec07f6ae.jpg"/></item><item><title><![CDATA[ThursdAI - Apr 23rd - GPT Image & Grok APIs Drop, OpenAI ❤️ OS? Dia's Wild TTS & Building Better Agents!]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Welcome back to ThursdAI! After what felt like ages of non-stop, massive model drops (looking at you, O3 and GPT-4!), we finally got that "chill week" we've been dreaming of since maybe... forever?  It seems the big labs are taking a breather, probably gearing up for even bigger things next week (maybe some open source 👀).</p><p>But "chill" doesn't mean empty! This week was packed with fascinating developments, especially in the open source world and with long-awaited API releases. We actually had <em>time</em> to dive deeper into things, which was a refreshing change. 
We had a fantastic lineup of guests joining us too: Kwindla Kramer (<a target="_blank" href="https://twitter.com/kwindla/">@kwindla</a>), our resident voice expert, dropped in to talk about some mind-blowing TTS and the Pipecat team's own open-source VAD release. Maziyar Panahi (<a target="_blank" href="https://x.com/MaziyarPanahi">@MaziyarPanahi</a>) gave us the inside scoop on OpenAI's recent meeting with the open source community. And Dex Horthy (<a target="_blank" href="https://x.com/dexhorthy">@dexhorthy</a>) from HumanLayer shared some invaluable insights on building robust AI agents that actually work in the real world. It was great having them alongside the usual ThursdAI crew: LDJ, Yam, Wolfram, and Nisten!</p><p>So, instead of rushing through a million headlines, we took a more relaxed pace. We explored NVIDIA's cool new Describe Anything model, dug into Google's Quantization Aware Training for Gemma, celebrated the much-anticipated API release for OpenAI's GPT Image generation (finally!), checked out the new Grok API, got absolutely blown away by a tiny, open-source TTS model from Korea called Dia, and debated the principles of building better AI agents. Plus, a surprise drop from Sand AI with a powerful video model!</p><p>Let's dive in!</p><p>Open Source AI Highlights: Community, Vision, and Efficiency</p><p>Even with the big players quieter on the model release front, the open source scene was buzzing. It feels like this "chill" period gave everyone a chance to focus on refining tools, releasing datasets, and engaging with the community.</p><p>OpenAI Inches Closer to Open Source? Insights from the Community Meeting</p><p>Perhaps the biggest non-release news of the week was OpenAI actively engaging with the open source community. 
Friend of the show Maziyar Panahi was actually <em>in the room</em> (well, the Zoom room) and joined us to share what went down.</p><p>It sounds like OpenAI came prepared, with Sam Altman himself spending significant time answering questions. Maziyar gave us the inside scoop, mentioning that OpenAI's looking to offload some GPU pressure by embracing open source – a win-win where they help the community, and the community helps lighten their load. He painted a picture of a company genuinely trying to listen and figure out how to best contribute. It felt less like a checkbox exercise and more like genuine engagement, which is awesome to see.</p><p>What did the community ask for? Based on Maziyar's recap, there was a strong consensus on several key points:</p><p>* <strong>Model Size:</strong> The sweet spot seemed to be not tiny, but not astronomically huge either. Something in the <strong>70B-200B parameter range</strong> that could run reasonably on, say, 4 GPUs, leaving room for other models. People want power they can actually <em>use</em> without needing a supercomputer.</p><p>* <strong>Capabilities:</strong> A strong desire for reliable <strong>structured output</strong>. Surprisingly, there was <em>less</em> emphasis on complex, built-in reasoning, or at least the ability to <strong>toggle reasoning off</strong>. This likely stems from practical concerns about cost and latency in production environments. The community seems to value control and efficiency for specific tasks.</p><p>* <strong>Multilingual:</strong> Good support for <strong>European languages (at least 20)</strong> was a major request, reflecting the global nature of the open source community. It needs to be as good as the English support.</p><p>* <strong>Base Models:</strong> A <em>huge</em> ask was for OpenAI to release <strong>base models</strong>. The reasoning? Empower the community to handle fine-tuning for specific tasks like coding, roleplay, or supporting underrepresented languages. 
Let the experts in those niches build on a solid foundation.</p><p>* <strong>Focus:</strong> <strong>Usefulness over chasing leaderboard glory</strong>. The community urged OpenAI to provide a solid, practical model rather than aiming for a temporary #1 spot that gets outdated in days or weeks. Stability, reliability, and long-term utility were prized over fleeting benchmark wins.</p><p>* <strong>Safety:</strong> A preference for <strong>separate guardrail models</strong> (similar to LlamaGuard or GemmaGuard) rather than overly aligning the main model, which often hurts performance and flexibility. Give users the tools to implement safety layers as needed, rather than baking in limitations that might stifle creativity or utility.</p><p>Perhaps most excitingly, Maziyar mentioned OpenAI seemed committed to <strong>regular open model releases</strong>, not just a one-off thing! This, combined with recent moves like approving a community Pull Request to make their open-source Codex agent work with non-OpenAI models (as Yam Peleg excitedly pointed out!), suggests a potentially significant shift. Remember, it's been a <em>long</em> time since GPT-2 and Whisper were OpenAI's main open contributions! We're definitely watching this space closely. Huge shout out to OpenAI for listening and engaging with the builders.</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>NVIDIA's DAM: Describe Anything Model (and Dataset!)</p><p>NVIDIA dropped something really cool this week: the <strong>Describe Anything Model (DAM)</strong>, specifically DAM-3B, a 3 billion parameter multimodal model for region-based image <em>and</em> video captioning. 
Think Meta's Segment Anything (SAM), but instead of just segmenting, it also tells you <em>what</em> you've segmented, in detail.</p><p>We played around with the image demo on the show (<a target="_blank" href="https://huggingface.co/spaces/nvidia/describe-anything-model-demo">HF demo</a>) . You hover over an image, things get segmented on the fly (you can use points, boxes, scribbles, or masks), you click, and boom – a detailed description pops up for that specific region: "A brown bear with a thick, dense coat of fur..." . It's pretty slick and responsive!</p><p>While the demo didn't showcase video, the project page (<a target="_blank" href="https://x.com/reach_vb/status/1914962078571356656">X post</a>) shows it working on videos too (<strong>DAM-3B-Video</strong>), tracking and describing objects like fish even as they move. This capability really impressed Yam, who rightly pointed out that tracking objects consistently over video is <em>hard</em>, so having a base model that understands this level and embeds it in language is seriously impressive. The model uses a "focal prompt" and gated cross-attention to fuse the full scene context with the selected region.</p><p>Nisten  reminded us that our friend Piotr Skalski from Roboflow basically built a pipeline for this a while back by combining SAM with description models like Microsoft Florence . But DAM integrates it all into one efficient 3B parameter model (<a target="_blank" href="https://huggingface.co/nvidia/DAM-3B">HF model</a>), setting a new state-of-the-art on their introduced <strong>DLC-Bench</strong> (Detailed Localized Captioning).</p><p>Crucially, NVIDIA didn't just drop the model; they also released the <strong>Describe Anything Dataset</strong> (<a target="_blank" href="https://huggingface.co/datasets/nvidia/DescribeAnythingDataset">HF dataset</a>) used to train it (built on subsets like COCO, Paco, SAM) and the code under a research-only license. This is fantastic for researchers and builders. 
Imagine using this for precise masking before sending an image to the new GPT Image API for editing – super useful! Big props to NVIDIA and their collaborators at UC Berkeley and UCSF for this contribution.</p><p>Gemma Gets Quantization Aware Training (QAT): Smaller Footprint, Sassy Attitude?</p><p>Google also pushed the open source envelope by releasing Gemma models trained with <strong>Quantization Aware Training (QAT)</strong>. This isn't your standard post-training quantization; QAT involves incorporating the impact of quantization <em>during</em> the training process itself. As LDJ explained, this allows the model to adapt, potentially resulting in a quantized state with much higher quality and less performance degradation compared to just quantizing a fully trained model afterwards.</p><p>The results? Significant reductions in VRAM requirements across the board. The 27B parameter Gemma 3, for example, drops from needing a hefty 54GB to just <strong>14.1GB</strong> ! Even the 1B model goes from 2GB to just half a gig. This makes running these powerful models much more accessible on consumer hardware. Folks are already running them in MLX, llama.cpp, LM Studio, etc. (<a target="_blank" href="https://www.reddit.com/media?url=https://i.redd.it/23ut7jd3klve1.jpeg">Reddit thread</a>)</p><p>Wolfram  already took the 4B QAT model for a spin using LM Studio . The good news: it ran easily, needing only 5-6GB of RAM. The quirky news: it seemed to struggle a bit with prompt adherence in his tests, even giving Wolfram a sassy, winking-emoji response about ignoring the "fine print" in his complex system prompt when called out on a language switching error: "Who reads a fine print? 😉" ! 
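The VRAM savings quoted above fall straight out of bytes-per-parameter arithmetic. A rough sketch (illustrative only; real loaders add overhead for the KV cache, activations, and any layers kept at higher precision, which is likely where the gap between ~13.5GB and the reported 14.1GB comes from):

```python
def model_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Rough weight-storage estimate: parameter count times storage width."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# Gemma 3 27B at bf16 (16 bits/param) vs. the int4 QAT checkpoint
print(round(model_vram_gb(27, 16), 1))  # ~54.0 GB, matching the unquantized figure
print(round(model_vram_gb(27, 4), 1))   # ~13.5 GB; the reported 14.1 GB adds loader overhead
```

The same arithmetic explains the 1B model going from ~2GB to roughly half a gig at 4 bits.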
He did note Gemma 3 now supports system prompts (unlike Gemma 2), which is a definite improvement .</p><p><em>(While NVIDIA also released OpenMath Nemotron, we didn't dive deep in the show, but worth noting its AIMO win and accompanying open dataset release!)</em></p><p>Voice and Audio Innovations: Emotional TTS and Smarter Conversations</p><p>Even in a "chill" week, the audio space delivered some serious excitement. Kwindla Kramer joined us to break down two major developments.</p><p>Dia TTS: Unhinged Emotion from a Small Open Model 🤯</p><p>This one absolutely blew up Twitter, and for good reason. <strong>Dia</strong>, from Nari Labs (essentially a student and a half in Korea!), is a <strong>1.6 billion parameter open-weights (MIT licensed)</strong> text-to-dialogue model (<a target="_blank" href="https://github.com/nari-labs/dia">Github</a>, <a target="_blank" href="https://huggingface.co/nari-labs/Dia-1.6B">HF</a>). What makes it special? The <em>insane</em> emotional range and natural interaction patterns. My Twitter post about it (<a target="_blank" href="https://x.com/altryne/status/1914421814455099680">X post</a>) went viral, getting half a million views !</p><p>We played some examples, and they are just wild. You <em>have</em> to hear this to believe it:</p><p>* <strong>Check the Demos:</strong> <a target="_blank" href="https://narilabs.github.io/dia/">Dia Demo Page</a> | <a target="_blank" href="https://fal.ai/models/fal-ai/dia-tts/voice-clone">Fal.ai Voice Clone Demo</a></p><p>Another crazy thing is how it handles non-verbal cues like laughs or coughs specified in the text (e.g., (laughs)) . Instead of just tacking on a generic sound, it inflects the preceding words <em>leading into</em> the laugh, making it sound incredibly natural. It even handles interruptions seamlessly, cutting off one speaker realistically when another starts .</p><p>Kwin, our voice expert, offered some valuable perspective . 
While Dia is undeniably awesome and shows what's <em>possible</em>, it's very much a research model – likely unpredictable ("unhinged" was his word!) and probably required cherry-picking the best demos. Production models like 11Labs <em>need</em> predictability. Kwin also noted the dataset is probably scraped from YouTube (a common practice, explaining the lack of open audio data) and that the non-speech sounds are a key takeaway – the bar for TTS is rising beyond just clear speech .</p><p>PipeCat SmartTurn: Fixing Awkward AI Silences with Open Source VAD</p><p>Speaking of open audio, Kwin and the team at Daily/Pipecat had their <em>own</em> breaking news: they released an open-source checkpoint for their <strong>SmartTurn</strong> model – a semantic Voice Activity Detection (VAD) system (<a target="_blank" href="https://github.com/pipecat-ai/smart-turn">Github</a>, <a target="_blank" href="https://huggingface.co/pipecat-ai/smart-turn">HF Model</a>) </p><p>What's the problem SmartTurn solves? That annoying thing where voice assistants interrupt you mid-thought just because you paused for a second. I've seen this happen with my kids all the time, making interaction frustrating! Semantic VAD, or "Smart Turn," is much smarter. It considers not just silence but also the <em>context</em> – audio patterns (like intonation suggesting you're not finished) and linguistic cues (like ending on "and..." or "so...") to make a much better guess about whether you're truly done talking. This is crucial for natural-feeling voice interactions, especially for kids or multilingual speakers (like me!) who might pause more often to find the right word.</p><p>And the data part is key here. They're building an <strong>open dataset</strong> for this, hosted on Hugging Face. 
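To make the "semantic" part concrete, here's a toy illustration of the decision Smart Turn is learning. This is emphatically not the actual model (which is trained end-to-end on audio); the hand-written rules below just mimic the linguistic-cue idea in a few lines:

```python
# Toy sketch of semantic end-of-turn detection: a plain silence-based VAD fires
# on any long-enough pause, while a "smart" check also inspects the trailing words.
TRAILING_FILLERS = {"and", "so", "but", "um", "uh", "because"}

def turn_complete(transcript: str, silence_ms: int, threshold_ms: int = 700) -> bool:
    if silence_ms < threshold_ms:
        return False  # still talking, or only a brief pause
    words = transcript.lower().rstrip(".?! ").split()
    # Ending on a connective strongly suggests the speaker is mid-thought.
    if words and words[-1] in TRAILING_FILLERS:
        return False
    return True

print(turn_complete("I want to book a flight to Paris", 900))  # True
print(turn_complete("I want to book a flight and", 900))       # False: trailing "and"
print(turn_complete("I want to book a flight to Paris", 300))  # False: pause too short
```

The real model learns intonation patterns too, which no text-only heuristic can capture - that's exactly why an open audio dataset matters.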
You can even contribute your own voice data by playing simple games on their <a target="_blank" href="https://turn-training.pipecat.ai"><strong>turn-training.pipecat.ai</strong></a> site (<a target="_blank" href="https://pcc-smart-turn.vercel.app/">Try It Demo</a>)! The cool incentive? The more diverse voice data they get (especially for different languages!), the better these systems will work for everyone. If your voice is in the dataset, future AI agents might just understand <em>you</em> a little better!</p><p>Kwin also mentioned their upcoming <a target="_blank" href="https://maven.com/pipecat/voice-ai-and-voice-agents-a-technical-deep-dive?utm_source=student&#38;utm_campaign=welcome">Voice AI Course</a> co-created with friend-of-the-pod Swyx, hosted on Maven . It aims to be a comprehensive guide with code samples, community interaction, and insights from experts (including folks from Weights & Biases!). Check it out if you want to dive deep into building voice AI. </p><p>AI Art & Diffusion & 3D: Quick Hits</p><p>A slightly quieter week for major art model releases, but still some significant movement:</p><p>* <strong>OpenAI's GPT Image 1 API:</strong> We'll cover this in detail in the Big Companies section below, but obviously relevant here too as a major new tool for developers creating AI art and image editing applications .</p><p>* <strong>Hunyuan 3D 2.5 (Tencent):</strong> Tencent released an update to their 3D generation model, now boasting <strong>10 billion parameters</strong> (up from 1B!) . They're highlighting massive leaps in precision (1024-resolution geometry), high-quality textures with PBR support, and improved skeletal rigging for animation <a target="_blank" href="https://x.com/TencentHunyuan/status/1915026828013850791">X Post</a>. 
Definitely worth keeping an eye on as 3D generation matures and becomes more accessible (they doubled the free quota and launched an API).</p><p>Agent Development Insights: Building Robust Agents with Dex Horthy</p><p>With things slightly calmer, it was the perfect time to talk about AI agents – a space buzzing with activity, frameworks, and maybe even a little bit of drama. We brought in <strong>Dex Horthy</strong>, founder of HumanLayer and author of the insightful "12 Factor Agent" essay (<a target="_blank" href="https://github.com/humanlayer/12-factor-agents/tree/main">Github Repo</a>), to share his perspective on what actually <em>works</em> when building agents for production.</p><p>Dex builds SDKs to help create agents that feel more like digital humans, aiming to deploy them where users already are (Slack, email, etc.), moving beyond simple chat interfaces. His experience led him to identify common patterns and pitfalls when trying to build reliable agents.</p><p>The Problem with Current Agent Frameworks</p><p>A key takeaway Dex shared? Many teams building serious, production-ready agents end up <strong>writing large parts from scratch</strong>. Why? Because existing frameworks often fall short in providing the necessary control and reliability for complex tasks. The common "prompt + bag of tools + figure it out" approach, while great for demos, struggles with reliability over longer, multi-step workflows . Think about it: even if each step is 92% reliable, after 10 steps, your overall success rate plummets due to compounding errors. 
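That compounding is worth computing once to see how brutal it is:

```python
# Reliability of a multi-step agent run: per-step success compounds multiplicatively.
per_step = 0.92
steps = 10
overall = per_step ** steps
print(f"overall success: {overall:.1%}")  # ~43.4% - fewer than half of 10-step runs finish

# Inverting it: hitting 95% end-to-end over 10 steps needs ~99.5% per-step reliability.
required_per_step = 0.95 ** (1 / steps)
print(f"required per step: {required_per_step:.3f}")
```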
That's just not good enough for customer-facing applications.</p><p>Key Principles: Small Agents, Owning Context</p><p>So, what <em>does</em> work <em>today</em> according to Dex's 12 factors?</p><p>* <strong>Small, Focused Agents:</strong> Instead of one giant, monolithic agent trying to do everything, the more reliable approach is to build <strong>smaller "micro-agents"</strong> that handle specific, well-defined parts of a workflow. As models get smarter, these micro-agents might grow in capability, but the principle of breaking down complexity holds. Find something at the edge of the model's capability and nail it consistently.</p><p>* <strong>Own Your Prompts & Context:</strong> Don't let frameworks abstract away <strong>control over the exact tokens</strong> going into the LLM or <strong>how the context window is managed</strong>. This is crucial for performance tuning. Even with massive context windows (like Gemini's 2M tokens), smaller, carefully curated context often yields better results <em>and</em> lower costs. Maximum performance requires owning every single token.</p><p>Dex's insights provide a crucial dose of pragmatism for anyone building or thinking about building AI agents in this rapidly evolving space. Check out his full <a target="_blank" href="https://github.com/humanlayer/12-factor-agents/tree/main"><strong>12 Factor Agent essay</strong></a> and the <a target="_blank" href="https://lu.ma/12-factor-agent"><strong>webinar recording</strong></a> for a deeper dive.</p><p>Big Companies & APIs: GPT Image and Grok Get Developer Access</p><p>While new <em>foundation models</em> were scarce from the giants this week, they did deliver on the API front, opening up powerful capabilities to developers.</p><p>OpenAI Finally Releases GPT Image 1 API! (<a target="_blank" href="https://x.com/OpenAIDevs/status/1915097067023900883">X Post</a>)</p><p>This was a big one many developers were waiting for. 
OpenAI's powerful image generation capabilities, previously locked inside ChatGPT, are now available via API under the official name <strong>gpt-image-1</strong> (<a target="_blank" href="https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1">Docs</a>) . No more awkward phrasing like "the new image generation capabilities within chat gpt"!</p><p>Getting access requires organizational verification, which involved a slightly intense biometric scan process for me – feels like they're taking precautions given the model's realism and potential for misuse . Understandable, but something developers need to be aware of .</p><p>The API (<a target="_blank" href="https://platform.openai.com/docs/api-reference/images">API Reference</a>) offers several capabilities:</p><p>* <strong>Generations:</strong> Creating images from scratch based on text prompts.</p><p>* <strong>Edits:</strong> Modifying existing images using a new prompt, crucially supporting <strong>masking</strong> for partial edits. This is huge for targeted changes and perfect for combining with segmentation models like NVIDIA's DAM!</p><p>There's a nice playground interface in the console, and you have interesting controls over the output:</p><p>* <strong>Quality:</strong> Instead of distinct models, you select a quality level (standard/HD) which impacts the internal "thinking time" and cost . It seems to be a reasoning model under the hood, so quality relates to compute/latency.</p><p>* <strong>Number:</strong> Generate up to 10 images at once.</p><p>* <strong>Transparency:</strong> Supports generating images with transparent backgrounds</p><p>I played around with it, generating ads and even trying to get it to make a ThursdAI thumbnail with my face. The <strong>text generation is excellent</strong> – it nailed "ThursdAI" perfectly on an unhinged speaker ad Nisten prompted! 
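For developers, the shapes of the two calls look roughly like this. A hedged sketch: the field names follow the API reference linked above, but they're shown as plain dicts rather than live SDK calls, and the file names are placeholders:

```python
# Hypothetical request shapes for the two gpt-image-1 endpoints (the openai SDK
# exposes these as client.images.generate(...) and client.images.edit(...)).
generation_request = {
    "model": "gpt-image-1",
    "prompt": "A podcast thumbnail with bold 'ThursdAI' lettering on a speaker",
    "n": 2,              # up to 10 images per call
    "size": "1024x1024",
}

edit_request = {
    "model": "gpt-image-1",
    "prompt": "Replace the background with a studio backdrop",
    # In the real call, image and mask are file handles; the mask marks which
    # pixels the model may change - perfect for feeding in a DAM/SAM segmentation.
    "image": "source.png",
    "mask": "region_from_segmentation.png",
}

print(generation_request["model"], edit_request["mask"])
```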
It follows complex style prompts well.</p><p>However, generating realistic <em>faces</em>, especially matching a specific person like me, seems... <strong>really hard</strong> right now. Even after many attempts providing a source image and asking it to replace a face, the results were generic or only vaguely resembled me. It feels almost intentionally nerfed, maybe as a safety measure to prevent deepfakes? I still used it for the thumbnail, but yeah, it could be better on faces.</p><p>OpenAI launched with several integration partners like Adobe, Figma, Wix, HeyGen, and <a target="_blank" href="https://fal.ai">Fal.ai</a> already onboard. Expect to see these powerful image generation capabilities popping up everywhere!</p><p>Grok 3 Mini & Grok 3 Now Available via API (+ App Updates)</p><p>Elon's xAI also opened the gates this week, making <strong>Grok 3 Mini and Grok 3</strong> available via API (<a target="_blank" href="https://docs.x.ai/docs/overview">Docs</a>).</p><p>The <strong>pricing structure</strong> is fascinating and quite different from others. Grok 3 Mini is incredibly cheap for input ($0.30 / 1M tokens) with only a modest bump for output ($0.50 / 1M). The "Fast" versions, however, cost significantly more, especially for <em>output</em> tokens (Grok 3 Fast is $5 input / $25 output per million!). It seems like a deliberate play on the "fast, cheap, smart" triangle, giving developers explicit levers to pull based on their needs.</p><p>Benchmarks provided by xAI position Grok 3 Mini competitively against other small models like Gemini 2.5 Flash and O4 Mini, scoring well on AIME (93%) and coding benchmarks.</p><p>Speaking of the app, the iOS version got a significant update adding a <strong>live video view</strong> (let Grok see what you see through your camera) and <strong>multilingual audio support</strong> (<a target="_blank" href="https://x.com/ebbyamir/status/1914820712092852430">X Post</a>). 
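To see what those pricing levers mean in practice, here's a small cost calculator using the rates quoted above (per million tokens), applied to a hypothetical 10k-token-in / 2k-token-out request:

```python
PRICES = {  # USD per 1M tokens: (input, output), as quoted at the API launch
    "grok-3-mini": (0.30, 0.50),
    "grok-3-fast": (5.00, 25.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# The same request on the two tiers - a ~25x price difference for speed:
print(f"${request_cost('grok-3-mini', 10_000, 2_000):.4f}")  # $0.0040
print(f"${request_cost('grok-3-fast', 10_000, 2_000):.4f}")  # $0.1000
```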
Prepare for some potentially unhinged, real-time video roasting if you use the fun mode with the camera on! Multilingual audio and search are also rolling out to SuperGrok users on Android.</p><p><em>(Side note: We briefly touched on O3's recent wonkiness in following instructions for tone, despite its amazing GeoGuessr abilities! Something feels off there lately.)</em></p><p>Vision and Video: Sand AI's Surprise Release & More</p><p>Just when we thought the week was winding down on model releases...</p><p>Sand AI Drops MAGI-1: 24B Video Model with Open Weights! 🔥</p><p>Out of seemingly nowhere, a company called <strong>Sand AI</strong> released details (and then the <em>weights!</em>) for <strong>MAGI-1</strong>, a <strong>24 billion parameter</strong> autoregressive diffusion model for video generation (<a target="_blank" href="https://x.com/SandAI_HQ/status/1914303284954996749">X Post</a>, <a target="_blank" href="https://github.com/SandAI-org/Magi-1">GitHub</a>, <a target="_blank" href="https://static.magi.world/static/files/MAGI_1.pdf">PDF Report</a>).</p><p>The demos looked stunning, showcasing impressive <strong>long-form video generation</strong> with remarkable <strong>character consistency</strong> – often the Achilles' heel of AI video. Nisten speculated this could be a major step towards usable AI-generated movies, solving the critical face/character consistency problem. They achieve this by predicting video in 24-frame chunks with causal attention between them, allowing for real-time streaming generation where compute doesn't scale with length. They also highlighted an "infinite extension" capability, allowing users to build out longer scenes by injecting new prompts or continuing footage.</p><p>Their technical report dives into the architecture, mentioning novel techniques like a custom <strong>"MagiAttention"</strong> kernel that scales to massive contexts and helps achieve the temporal consistency. 
It also sets SOTA on VBench-I2V and Physics-IQ benchmarks.</p><p>And the biggest surprise? They released the <strong>model weights under an Apache 2.0 license</strong> on Hugging Face! This is huge! Just as we sometimes lament the lack of open source momentum from certain players, Sand AI drops this 24B parameter beast with open weights. Amazing! Go download it!</p><p>Framepack: Long Videos on Low VRAM</p><p>Wolfram also flagged <strong>Framepack</strong>, another interesting video development from the research world, this one from the creator of ControlNet. FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively. (<a target="_blank" href="https://github.com/lllyasviel/FramePack">Github</a>)</p><p>Character AI AvatarFX Steps In</p><p>Also in the visual space, <strong>Character AI</strong> announced <strong>AvatarFX</strong> in early access (<a target="_blank" href="https://t.co/cdF6H58kBk">Website</a>), stepping into the realm of animated, speaking visual avatars derived from images. It seems like everyone wants to bring characters to life visually now.</p><p>This Week's Buzz from W&B / Community</p><p>Quick hits on upcoming events and community stuff:</p><p>* <strong>WeaveHacks Coming to SF!</strong> Mark your calendars! We're hosting a hackathon focused on building with W&B Weave at the Weights & Biases office in San Francisco on <strong>May 17th-18th</strong> [0:06:15]. If you're around, especially if you're coming into town for Google I/O the week after, come hang out, build cool stuff, and say hi! We're planning to go all out with sponsors and prizes (announcements coming soon). <a target="_blank" href="http://lu.ma/weavehacks">lu.ma/weavehacks</a></p><p>* <strong>Fully Connected Conference Reminder:</strong> Our flagship W&B conference, <strong>Fully Connected</strong>, is happening in San Francisco on <strong>June 18th</strong> [0:06:30]. 
It's where our customers, partners, and the community come together for two days of talks, workshops, and networking focused on production AI. It's always an incredible event. (<a target="_blank" href="https://www.fullyconnected.com/">fullyconnected.com</a>)</p><p>Wrapping Up the "Chill" Week That Wasn't Quite Chill</p><p>Phew! See? Even a "chill" week in AI is overflowing with news when you actually have time to stop and breathe for a second. From OpenAI's fascinating open source tango and the practical (and long-awaited!) API releases of GPT Image and Grok, to the sheer creative potential shown by indie projects like Dia and Sand AI's MAGI-1, and the grounding principles for building agents that <em>actually work</em> from Dex – there was a ton to absorb and discuss. It felt good to have the space to go a little deeper.</p><p>It was fantastic having Kwin, Maziyar, and Dex join the regulars (LDJ, Yam, Wolfram, Nisten) to share their expertise and firsthand insights. A huge thank you to them and to everyone tuning in live across X, YouTube, LinkedIn, and participating in the chat! Your questions and comments make the show what it is.</p><p>Don't forget, if you missed anything, the full show is available as a podcast (search "ThursdAI" wherever you get your podcasts).</p><p>🔗 Subscribe to our show on Spotify: <a target="_blank" href="https://thursdai.news/spotify">thursdai.news/spotify</a></p><p>🔗 Apple: <a target="_blank" href="https://thursdai.news/apple">thursdai.news/apple</a></p><p>🔗 YouTube: <a target="_blank" href="https://thursdai.news/yt">thursdai.news/yt</a></p><p>Next week? The rumors suggest the big labs might be back with major releases. The brief calm might be over! Buckle up! 
We'll be here to break it all down.</p><p>See you next ThursdAI! - Alex</p><p>TL;DR and Show Notes (April 23rd, 2025)</p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist @ Weights & Biases <a target="_blank" href="http://x.com/@altryne">@altryne</a></p><p>* Co-hosts - Wolfram Ravenwolf <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a>, Yam Peleg <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a>, Nisten Tahiraj <a target="_blank" href="http://x.com/@nisten">@nisten</a>, LDJ <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* Kwindla Kramer <a target="_blank" href="https://twitter.com/kwindla/">@kwindla</a> - Daily Co-Founder // Voice expert</p><p>* Dexter Horthy <a target="_blank" href="https://x.com/dexhorthy">@dexhorthy</a> - HumanLayer // Agents expert</p><p>* Maziyar Panahi <a target="_blank" href="https://x.com/MaziyarPanahi">@MaziyarPanahi</a> - OSS maintainer</p><p>* <strong>Open Source AI - LLMs</strong>, <strong>Vision, Voice & more</strong></p><p>* <strong>OpenAI OSS Meeting:</strong> Insights from Maziyar [0:16:37].</p><p>* <strong>NVIDIA Describe Anything (DAM-3B):</strong> 3B param multimodal LLM for region-based image/video captioning. 
(<a target="_blank" href="https://x.com/reach_vb/status/1914962078571356656">X Post</a>, <a target="_blank" href="https://huggingface.co/nvidia/DAM-3B">HF model</a>, <a target="_blank" href="https://huggingface.co/spaces/nvidia/describe-anything-model-demo">HF demo</a>)</p><p>* <strong>Google Gemma QAT:</strong> Quantization-Aware Training models (<a target="_blank" href="https://x.com/osanseviero/status/1913220285328748832">X</a>, <a target="_blank" href="https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/">Blog</a>) </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* <strong>OpenAI GPT Image 1 API:</strong>  (<a target="_blank" href="https://x.com/OpenAIDevs/status/1915097067023900883">X Post</a>, <a target="_blank" href="https://platform.openai.com/docs/guides/image-generation?image-generation-model=gpt-image-1">Docs</a>, <a target="_blank" href="https://platform.openai.com/docs/api-reference/images">API Reference</a>)</p><p>* <strong>Grok API & App Updates:</strong> Grok 3 and Grok 3 Mini available via API. (<a target="_blank" href="https://docs.x.ai/docs/overview">API Docs</a>, <a target="_blank" href="https://x.com/ebbyamir/status/1914820712092852430">App Update X Post</a>)</p><p>* <strong>This weeks Buzz - Weights & Biases</strong></p><p>* <strong>WeaveHacks SF:</strong> Hackathon May 17-18 at W&B HQ. 
<a target="_blank" href="http://lu.ma/weavehacks">lu.ma/weavehacks</a></p><p>* <strong>Fully Connected:</strong> W&B's 2-day conference, June 18-19 in SF <a target="_blank" href="https://www.fullyconnected.com/">fullyconnected.com</a></p><p>* <strong>Vision & Video</strong></p><p>* <strong>Sand AI MAGI-1:</strong> 24B autoregressive diffusion model for long, streaming video (<a target="_blank" href="https://x.com/SandAI_HQ/status/1914303284954996749">X Post</a>, <a target="_blank" href="https://github.com/SandAI-org/Magi-1">GitHub</a>, <a target="_blank" href="https://static.magi.world/static/files/MAGI_1.pdf">PDF Report</a>, <a target="_blank" href="https://huggingface.co/sand-ai/MAGI-1">HF Repo</a>)</p><p>* <strong>Character AI AvatarFX:</strong> Early access for creating speaking/emoting avatars from images. (<a target="_blank" href="https://t.co/cdF6H58kBk">Website</a>)</p><p>* <strong>Framepack:</strong> Mentioned for long video generation (120s) on low VRAM (6GB). (<a target="_blank" href="https://framepack.github.io/">Project Page</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* <strong>Nari Labs Dia:</strong> 1.6B param OSS TTS model (<a target="_blank" href="https://x.com/altryne/status/1914421814455099680">X Post Highlight</a>, <a target="_blank" href="https://huggingface.co/nari-labs/Dia-1.6B">HF Model</a>, <a target="_blank" href="https://github.com/nari-labs/dia">Github</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/dia-tts/voice-clone">Fal.ai Demo</a>)</p><p>* <strong>PipeCat Smart-Turn VAD:</strong> Open source semantic VAD model (<a target="_blank" href="https://github.com/pipecat-ai/smart-turn">Github</a>, <a target="_blank" href="https://huggingface.co/pipecat-ai/smart-turn">HF Model</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/smart-turn/playground">Fal.ai Playground</a>, <a target="_blank" href="https://pcc-smart-turn.vercel.app/">Try It Demo</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* 
<strong>Hunyuan 3D 2.5 (Tencent):</strong> 10B param update [0:09:06]. Higher res geometry, PBR textures, improved rigging. (<a target="_blank" href="https://x.com/TencentHunyuan/status/1915026828013850791">X Post</a>)</p><p>* <strong>Agents, Tools & Links</strong></p><p>* <strong>12 Factor Agents:</strong> Discussion with Dex Horthy on building robust agents (<a target="_blank" href="https://github.com/humanlayer/12-factor-agents/tree/main">Github Repo</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-23rd-gpt-image-and-grok</link><guid isPermaLink="false">substack:post:162086565</guid><dc:creator><![CDATA[Alex Volkov, Kwindla Hultman Kramer, and Nisten]]></dc:creator><pubDate>Thu, 24 Apr 2025 23:03:53 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/162086565/0b9a7a08144dce00df637847afd78a30.mp3" length="93022514" type="audio/mpeg"/><itunes:author>Alex Volkov, Kwindla Hultman Kramer, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5814</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/162086565/1156d88e413eebf128abaf98e9f5fe2f.jpg"/></item><item><title><![CDATA[ThursdAI - Apr 17 - OpenAI o3 is SOTA llm, o4-mini, 4.1, mini, nano, G. Flash 2.5, Kling 2.0 and 🐬 Gemma? Huge AI week + A2A protocol interview]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>Wow. Just… wow. What a week, folks. Seriously, this has been one for the books. 
</p><p>This week was dominated by OpenAI's double whammy: first the <strong>GPT-4.1 family</strong> dropped with a mind-boggling 1 million token context window, followed swiftly by the new flagship reasoning models, <strong>o3</strong> and <strong>o4-mini</strong>, which are already blowing minds with their agentic capabilities. We also saw significant moves from Google with <strong>VEO-2</strong> going GA, the fascinating <strong>A2A protocol</strong> launch (we had an amazing interview with Google's Todd Segal about it!), and even an attempt to talk to dolphins with <strong>DolphinGemma</strong>. Kling stepped up its video game, Cohere dropped SOTA multimodal embeddings, and ByteDance made waves in image generation. Plus, the open-source scene had some interesting developments, though perhaps overshadowed by the closed-source giants this time.</p><p>o3 has absolutely taken the crown as the conversation piece, so let's start with it (as always, the TL;DR and show notes are at the end, and here's the embed of our live video show).</p><p>Big Company LLMs + APIs</p><p>OpenAI o3 & o4‑mini: SOTA Reasoning Meets Tool‑Use (<a target="_blank" href="https://openai.com/index/introducing-o3-and-o4-mini/">Blog</a>, <a target="_blank" href="https://youtube.com/live/2G-VwWxKCkk?feature=share">Watch Party</a>)</p><p>The long-awaited o3 model (promised to us in the last days of x-mas) is finally here, and it did NOT disappoint... well, it even surprised!</p><p>o3 is not only SOTA on nearly all possible logic, math and code benchmarks, which is to be expected from the top reasoning model; it is also, I think for the first time, able to use tools during its reasoning process: searching the web, Python coding, image gen (it can zoom, rotate, and crop images, it's nuts), all to get to incredible responses faster.</p><p>Tool-using reasoners are... almost AGI?</p><p>This is the headline feature for me. 
For the first time, these o-series models have full, autonomous access to all built-in tools (web search, Python code execution, file search, image generation with Sora-Image/DALL-E, etc.). They don't just use tools when told; they decide when and how to chain multiple tool calls together to solve a problem. We saw logs with 600+ consecutive tool calls! This is agent-level reasoning baked right in.</p><p>Anecdote: We tested this live with a complex prompt: "generate an image of a cowboy that on his head is the five last digits of the hexadecimal code of the MMMU score of the latest Gemini model." o3 navigated this multi-step task flawlessly: figuring out the latest model was Gemini 2.5, searching for its MMMU score, using the Python tool to convert it to hex and extract the digits, and then using the image generation tool. It involved multiple searches and reasoning steps. Absolutely mind-blowing 🤯.</p><p>Thinking visually with images</p><p>This one also blew my mind: this model is SOTA on multimodal tasks, and one reason is that these models can manipulate and think about the images they receive. Think... cropping, zooming, rotating. The models can now apply all of these operations to multimodal requests from users. Sci-fi stuff!</p><p>Benchmark Dominance: As expected, these models crush existing benchmarks.</p><p>o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k!</p><p>o4-mini is no slouch either. It hits 99.5% on AIME (math problems) when allowed to use its Python interpreter and beats the older o3-mini on general tasks. 
It’s a reasoning powerhouse at a fraction of the cost.</p><p><strong>Incredible Long Context Performance</strong></p><p>Yam highlighted this – on Fiction.LiveBench, which tests deep comprehension over long contexts, o3 maintained nearly 100% accuracy up to 120,000 tokens, absolutely destroying previous models including Gemini 2.5 Pro and even the new GPT-4.1 family on this specific eval. While its context window is currently 200k (unlike 4.1's 1M), its performance within that window is unparalleled.</p><p><strong>Cost-Effective Reasoning:</strong> They're not just better, they're <em>cheaper</em> for the performance you get.</p><p>* <strong>o3:</strong> $10 input / $2.50 cached / $40 output per million tokens.</p><p>* <strong>o4-mini:</strong> $1.10 input / $0.275 cached / $4.40 output per million tokens. (Cheaper than GPT-4o!)</p><p><strong>Compute Scaling Validated:</strong> OpenAI confirmed these models used >10x the compute of o1 and leverage test-time compute scaling (spending longer on harder problems), further proving their scaling law research.</p><p><strong>Memory Integration:</strong> Both models integrate with ChatGPT's recently upgraded memory feature, which has access to all your previous conversations (we didn't talk about this, but it is absolutely amazing; try asking o3 stuff it knows about you and have it draw conclusions!)</p><p><strong>Panel Takes & Caveats:</strong> While the excitement was palpable, Yam noted some community observations about potential "rush" – occasional weird hallucinations or questionable answers compared to predecessors, possibly a side effect of cramming so much training data. Nisten, while impressed, still found the <em>style</em> of <strong>GPT-4.1</strong> preferable for specific tasks like generating structured medical notes in his tests. 
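As a sanity check on those prices, here's a quick back-of-envelope cost sketch (the rates are the per-million-token figures quoted above; the token counts are just an example I picked):

```python
# Per-million-token rates quoted above: (input, cached input, output) in USD.
PRICES = {
    "o3": (10.00, 2.50, 40.00),
    "o4-mini": (1.10, 0.275, 4.40),
}

def call_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """USD cost of one call; cached input tokens bill at the cached rate."""
    inp, cached, out = PRICES[model]
    fresh = input_tokens - cached_tokens
    return (fresh * inp + cached_tokens * cached + output_tokens * out) / 1_000_000

# Example: a 10k-token prompt with a 2k-token answer.
print(call_cost("o3", 10_000, 2_000))       # 0.18 USD
print(call_cost("o4-mini", 10_000, 2_000))  # ~0.02 USD, roughly 9x cheaper
```

The cached-input rate matters a lot for agentic loops that resend the same long context on every turn.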
It highlights that benchmarks aren't everything, and specific use cases require evaluation (shameless plug: use tools like W&B Weave for this!).</p><p>I'll add my own take: I use all the models every week to help me draft posts, and o3 was absolute crap at matching my tone; it managed to mimic barely any of what's written above. Gemini remains undefeated for me on this task.</p><p>Overall, though, o3 and o4-mini feel like a paradigm shift towards more autonomous, capable AI assistants. The agentic future feels a whole lot closer.</p><p><strong>OpenAI Launches GPT-4.1 Family: 1 Million Tokens & Killing 4.5!</strong> (<a target="_blank" href="https://www.youtube.com/live/A5-Zxj816J0">Our Coverage</a>, <a target="_blank" href="https://x.com/noahmacca/status/1911898549308280911">Prompting guide</a>)</p><p>Before the o3 shockwave, Monday brought its own major AI update: the <strong>GPT-4.1 family</strong>. This was the API-focused release, delivering massive upgrades for developers.</p><p><strong>The Headline:</strong> <strong>One Million Token Context Window!</strong> 🤯 Yes, you read that right. All three new models – <strong>GPT-4.1</strong> (flagship), <strong>GPT-4.1 mini</strong> (cheaper/faster), and <strong>GPT-4.1 nano</strong> (ultra-cheap/fast) – can handle up to 1 million tokens. This is a monumental leap, enabling use cases that were previously impossible or required complex chunking strategies.</p><p><strong>Key Details:</strong></p><p>Goodbye GPT-4.5! </p><p>In a surprising twist, OpenAI announced they are <em>deprecating</em> the recently introduced (and massive) GPT-4.5 model within 90 days in the API. Why? Because <strong>GPT-4.1 actually outperforms it</strong> on key benchmarks like SWE-bench, Aider Polyglot, and the new long-context MRCR eval, while being far cheaper to run. It addresses the confusion many had: why was 4.5 seemingly <em>worse</em> than 4.1? 
It seems 4.5 was a scaling experiment, but 4.1 represents a more optimized, better-trained checkpoint on superior data. RIP 4.5, we hardly knew ye (in the API).</p><p><strong>The Prompt Sandwich Surprise! 🥪:</strong> </p><p>This was wild. Following OpenAI's new prompting guide, I tested the "sandwich" technique (instructions -> context -> instructions <em>again</em>) on my hard reasoning eval using <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=apr17">W&B Weave</a>.</p><p>For <strong>GPT-4.1</strong>, it made no difference (still got 48%). But for <strong>GPT-4.1 mini</strong>, the score jumped from 31% to <strong>49%</strong> – essentially matching the full 4.1 model just by repeating the prompt! That's a crazy performance boost for a simple trick. Even nano saw a slight bump. <strong>Lesson: Evaluate prompt techniques!</strong> Don't assume they won't work.</p><p><strong>Million-Token Recall Confirmed:</strong> Using Needle-in-Haystack and their newly open-sourced <strong>MRCR benchmark</strong> (Multi-round Co-reference Resolution – more in Open Source), OpenAI showed near-perfect recall across the <em>entire</em> 1 million token window for <strong>all three models</strong>, even nano! This isn't just a theoretical limit; the recall seems robust.</p><p><strong>Multimodal Gains:</strong> Impressively, <strong>4.1 mini</strong> hit <strong>72% on Video-MME</strong>, pushing SOTA for long-video Q&A in a mid-tier model by analyzing frame sequences. </p><p>4.1 mini seems to be the absolute powerhouse of this release cycle, it nearly matches the intelligence of the previous 4o, while being significantly cheaper and much much faster with 1M context window! </p><p>Windsurf (and Cursor) immediately made the 4.1 family available, offering a <strong>free week</strong> for users to test them out (likely to gather feedback and maybe influenced by certain acquisition rumors 😉). 
Devs reported them feeling snappier and less verbose than previous models.</p><p><strong>Who Should Use Which OpenAI API?</strong></p><p>My initial take:</p><p>* <strong>For complex reasoning, agentic tasks, or just general chat:</strong> Use <strong>o3</strong> (if you need the best) or <strong>o4-mini</strong> (for amazing value/speed).</p><p>* <strong>For API development, especially coding or long-context tasks:</strong> Evaluate the <strong>GPT-4.1 family</strong>. Start with <strong>4.1 mini</strong> – it's likely the sweet spot for performance/cost, especially with smart prompting. Use <strong>4.1</strong> if mini isn't quite cutting it. Use <strong>nano</strong> for simple, high-volume tasks like translation or basic classification.</p><p>The naming is still confusing (thanks Nisten for highlighting the UI nightmare!), but the capability boost across the board is undeniable.</p><p><strong>Hold the Phone! 🚨 Google Fires Back with Gemini 2.5 Flash in Breaking News</strong></p><p>Just when we thought the week couldn't get crazier, Google, likely reacting to OpenAI's rapid-fire launches, just dropped <strong>Gemini 2.5 Flash</strong> into preview via the <a target="_blank" href="https://ai.google.dev/docs/gemini_api_overview">Gemini API</a> (in AI Studio and Vertex AI). This feels like Google's direct answer, aiming to blend reasoning capabilities with speed and cost-effectiveness.</p><p><strong>The Big Twist: Controllable Thinking Budgets!</strong> Instead of separate models like OpenAI, Gemini 2.5 Flash tries to do <strong>both reasoning and speed/cost efficiency in one model</strong>. The killer feature? 
Developers can set a <strong>"thinking budget"</strong> (0 to 24,576 tokens) per API call to control the trade-off:</p><p>* <strong>Low/Zero Budget:</strong> Prioritizes speed and low cost (very cheap: <strong>$0.15 input / $0.60 output</strong> per 1M tokens), great for simpler tasks.</p><p>* <strong>Higher Budget:</strong> Allows the model multi-step reasoning "thinking" for better accuracy on complex tasks, at a higher cost (<strong>$3.50 output</strong> per 1M tokens, including reasoning tokens).</p><p>This gives granular control over the cost/quality balance <em>within the same model</em>.</p><p><strong>Performance & Specs:</strong> Google claims strong performance, ranking just behind Gemini 2.5 Pro on Hard Prompts in ChatBot Arena and showing competitiveness against o4-mini and Sonnet 3.7 in their benchmarks, especially given the flexible pricing.</p><p>Key specs are right up there with the competition:</p><p>* <strong>Multimodal Input:</strong> Text, Images, Video, Audio</p><p>* <strong>Context Window:</strong> <strong>1 million tokens</strong> (matching GPT-4.1!)</p><p>* <strong>Knowledge Cutoff:</strong> January 2025</p><p><strong>How to Control Thinking:</strong> Simply set the thinking_budget parameter in your API call (Python/JS examples available in their docs). If unspecified, the model decides automatically.</p><p><strong>My Take:</strong> This is a smart play by Google. The controllable thinking budget is a unique and potentially powerful feature for optimizing across different use cases without juggling multiple models. With 1M context and competitive pricing, Gemini 2.5 Flash is immediately a major contender in the ongoing AI arms race. Definitely one to evaluate! 
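To put the budget knob in dollar terms, here's a rough sketch using the prices quoted above. Note the billing model here is my simplification (all output billed at the higher thinking rate once any reasoning happens), not official billing logic:

```python
def flash_cost(input_tokens, output_tokens, thinking_tokens=0):
    """Rough Gemini 2.5 Flash cost in USD for one call.

    Simplification: if any thinking tokens are spent, answer + reasoning
    tokens all bill at the higher $3.50/M output rate quoted above.
    """
    IN, OUT_FAST, OUT_THINK = 0.15, 0.60, 3.50  # USD per million tokens
    cost = input_tokens * IN / 1e6
    if thinking_tokens:
        cost += (output_tokens + thinking_tokens) * OUT_THINK / 1e6
    else:
        cost += output_tokens * OUT_FAST / 1e6
    return cost

# Same 20k-in / 1k-out call, without and with an 8k-token thinking spend:
print(round(flash_cost(20_000, 1_000), 4))         # 0.0036
print(round(flash_cost(20_000, 1_000, 8_000), 4))  # 0.0345
```

Roughly a 10x cost difference on the same call, which is exactly why per-call control of the budget is interesting.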
Find more in the <a target="_blank" href="https://ai.google.dev/docs">developer docs</a> and <a target="_blank" href="https://ai.google.dev/examples">Gemini Cookbook</a>.</p><p>Open Source: LLMs, Tools & more</p><p>OpenAI open sources MRCR eval and Codex (MRCR <a target="_blank" href="https://huggingface.co/datasets/openai/mrcr">HF</a>, Codex <a target="_blank" href="https://github.com/openai/codex">Github</a>)</p><p>Let's face it: this isn't the open-source OpenAI coverage I was hoping for. Sam promised us an open-source model, and they are about to drop it, I'd assume close to Google I/O (May 20th), to steal some thunder. But OpenAI did make open-source waves this week in addition to the huge stories above.</p><p>MRCR is a way to evaluate complex long-context tasks; OpenAI took this Gemini research and open-sourced a dataset for the eval. 👏</p><p>They also dropped the Codex CLI tool, a coding partner that uses o4-mini and o3, and made that tool open source as well (unlike Anthropic with Claude Code); it saw 86+ pull requests approved within the first 24 hours!</p><p>The best part about this CLI is its hardened security: it uses <strong>Apple Seatbelt</strong>, which limits execution to the current directory + temp files (on a Mac, at least).</p><p>Other Open Source Updates</p><p>While OpenAI's contributions were notable, it wasn't the only action this week:</p><p>* <strong>Microsoft's BitNet v1.5 (</strong><a target="_blank" href="https://huggingface.co/collections/microsoft/BitNet"><strong>HF</strong></a><strong>)</strong>: Microsoft quietly dropped updates to BitNet, continuing their exploration into ultra-low-bit (ternary) models for efficiency. As Nisten pointed out on the show though, keep in mind these still use some higher-precision layers, so they aren't <em>purely</em> 1.5-bit in practice just yet. 
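For intuition on what "ternary" means here: BitNet-style layers constrain each weight to {-1, 0, +1} plus a single scale factor. Here's a toy pure-Python version of the absmean rounding described in the BitNet papers (illustrative only, nothing like Microsoft's actual kernels):

```python
def ternarize(weights):
    """Quantize weights to {-1, 0, +1} with an absmean scale (BitNet-style)."""
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate reconstruction: each ternary weight times the scale."""
    return [q * scale for q in quantized]

w = [0.9, -0.05, -1.2, 0.4]
q, s = ternarize(w)
print(q)  # [1, 0, -1, 1]: every weight is now one of three values
```

Storing three states instead of 16-bit floats is where the memory and compute savings come from; the higher-precision layers Nisten mentioned sit outside this scheme.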
Important research nonetheless!</p><p>* <strong>INTELLECT-2 Distributed RL (</strong><a target="_blank" href="https://www.primeintellect.ai/blog/intellect-2"><strong>Blog</strong></a><strong>, </strong><a target="_blank" href="https://x.com/primeintellect_ai"><strong>X</strong></a><strong>)</strong>: Prime Intellect did something wild – training <strong>INTELLECT-2</strong>, a 32B model, using globally distributed, permissionless reinforcement learning. Basically, anyone with a GPU could potentially contribute. Fascinating glimpse into decentralized training!</p><p>* <a target="_blank" href="Z.ai"><strong>Z.ai</strong></a><strong> (Formerly ChatGLM) & GLM-4 Family (</strong><a target="_blank" href="https://x.com/Zai_org/status/1779846143024941199"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/collections/THUDM"><strong>HF</strong></a><strong>, </strong><a target="_blank" href="https://github.com/THUDM/GLM-4"><strong>GitHub</strong></a><strong>)</strong>: The team behind ChatGLM rebranded to <a target="_blank" href="Z.ai"><strong>Z.ai</strong></a> and released their GLM-4 family (up to 32B parameters) under the very permissive <strong>MIT license</strong>. They're claiming performance competitive with much larger models like Qwen 72B, which is fantastic news for commercially usable open source!</p><p>This Week's Buzz: Playground Updates & A Deep Dive into A2A</p><p>On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape.</p><p><strong>Weave Playground Supports GPT-4.1 and o3/o4-mini (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fweave_wb%2Fstatus%2F1912246450857341092"><strong>X</strong></a><strong>)</strong></p><p>With all these new models dropping, how do you actually <em>choose</em> which one is best for <em>your</em> application? You need to evaluate! Our <strong>W&B Weave Playground</strong> now has full support for the new <strong>GPT-4.1 family</strong> and the <strong>o3/o4-mini</strong> models.</p><p>If you're using Weave to monitor your LLM apps in production, you can easily grab a trace of a real user interaction, open it in the Playground, and instantly retry that exact same call (with all its context and history) using any of the new models side-by-side. It’s the fastest way to see how o3 compares to 4.1-mini or how Claude 3.7 stacks up against o4-mini <em>on your specific data</em>. Essential for making informed decisions in this rapidly changing environment.</p><p><strong>Deep Dive: Understanding Google's A2A Protocol with Todd Segal</strong></p><p>This was a highlight of the show for me. We were joined by <strong>Todd Segal</strong>, a Principal Software Engineer at Google working directly on the new <strong>Agent-to-Agent (A2A) protocol</strong>. There was some confusion initially about how A2A relates to the increasingly popular <strong>Model Context Protocol (MCP)</strong>, so getting Todd's perspective was invaluable. W&B is a proud launch partner for the A2A protocol!</p><p><strong>Key Takeaways from our Chat:</strong></p><p>* <strong>A2A vs. MCP: Complementary, Not Competitive:</strong> Todd was clear: Google sees these as solving different problems. 
<strong>MCP is for Agents talking to Tools</strong> (structured, deterministic capabilities). <strong>A2A is for Agents talking to other Agents</strong> (unstructured, stateful, unpredictable, evolving interactions). Think of MCP like calling an API, and A2A like delegating a complex task to another expert service.</p><p>* <strong>The Need for A2A:</strong> It emerged from the need for specialized, domain-expert agents (built internally or by partners like Salesforce) to collaborate on complex, long-running tasks (e.g., booking a multi-vendor trip, coordinating an enterprise workflow) where simple tool calls aren't enough. Google's <strong>Agent Space</strong> product heavily utilizes A2A internally.</p><p>* <strong>Capability Discovery & Registries:</strong> A core concept is agents advertising their capabilities via an "agent card" (like a business card or resume). Todd envisions a future with multiple <strong>registries</strong> (public, private, enterprise-specific) where agents can discover other agents best suited for a task. This registry system is on the roadmap.</p><p>* <strong>Async & Long-Running Tasks:</strong> A2A is designed for tasks that might take minutes, hours, or even days. It uses a central <strong>"Task" abstraction</strong> which is stateful. Agents communicate updates (status changes, generated artifacts, requests for more info) related to that task.</p><p>* <strong>Push Notifications:</strong> For very long tasks, A2A supports a <strong>push notification</strong> mechanism. The client agent provides a secure callback URL, and the server agent can push updates (state changes, new artifacts) even if the primary connection is down. This avoids maintaining costly long-lived connections.</p><p>* <strong>Multimodal Communication:</strong> The protocol supports negotiation of modalities beyond text, including rendering content within <strong>iframes</strong> (for branded experiences) or exchanging <strong>video/audio streams</strong>. 
Essential for future rich interactions.</p><p>* <strong>Security & Auth:</strong> A2A deliberately <strong>doesn't reinvent the wheel</strong>. It relies on standard <strong>HTTP headers</strong> to carry authentication (OAuth tokens, internal enterprise credentials). Identity/auth handshakes happen "out of band" using existing protocols (OAuth, OIDC, etc.), and the resulting credentials are passed with A2A requests. Your user identity flows through standard mechanisms.</p><p>* <strong>Observability:</strong> Todd confirmed <strong>OpenTelemetry (OTel)</strong> support is planned for the SDKs. Treating agents like standard microservices means leveraging existing observability tools (like W&B Weave!) is crucial for tracing and debugging multi-agent workflows.</p><p>* <strong>Open Governance:</strong> While currently in a Google repo, the plan is to move A2A to a <strong>neutral foundation</strong> (like Linux Foundation) with a fully <strong>open governance model</strong>. They want this to be a true industry standard.</p><p>* <strong>Getting Started:</strong> Check out the <strong>GitHub repo (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fgoogle%2FA2A"><strong>github.com/google/A2A</strong></a><strong>)</strong>, participate in discussions, file issues, and send PRs!</p><p>My take: A2A feels like a necessary piece of infrastructure for the next phase of AI agents, enabling complex, coordinated actions across different systems and vendors. While MCP handles the "how" of using tools, A2A handles the "who" and "what" of inter-agent delegation. Exciting times ahead! 
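To make the "agent card" idea from the A2A chat concrete: it's essentially a small JSON document an agent publishes so others can discover it. The field names below approximate the public A2A draft rather than a normative schema, and the endpoint URL is made up:

```python
import json

# An illustrative A2A-style agent card (field names approximate the public
# draft spec; the endpoint URL is hypothetical).
agent_card = {
    "name": "trip-planner",
    "description": "Plans and books multi-vendor travel itineraries",
    "url": "https://agents.example.com/trip-planner",
    "capabilities": {"streaming": True, "pushNotifications": True},
    "skills": [
        {"id": "book-flight", "description": "Find and book flights"},
        {"id": "book-hotel", "description": "Find and book hotels"},
    ],
}

def supports(card, skill_id):
    """The crude check a client (or registry) might run during discovery."""
    return any(skill["id"] == skill_id for skill in card["skills"])

print(supports(agent_card, "book-flight"))  # True
print(supports(agent_card, "rent-a-car"))   # False
```

A registry would index many such cards, turning discovery into a query over advertised skills rather than a hardcoded integration.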
Big thanks to Todd for shedding light on this.</p><p>Vision & Video: Veo-2 Arrives, Kling Gets Slicker</p><p>The visual AI space keeps advancing rapidly.</p><p><strong>Veo-2 Video Generation Hits GA in Vertex AI & Gemini App (</strong><a target="_blank" href="https://developers.googleblog.com/en/veo-2-video-generation-now-generally-available/"><strong>Blog</strong></a><strong>, </strong><a target="_blank" href="http://ai.dev"><strong><em>Try It</em></strong></a><strong>)</strong></p><p>Google's answer to Sora and Kling, <strong>Veo-2</strong>, is now <strong>Generally Available (GA)</strong> for all Google Cloud customers via <strong>Vertex AI</strong>. You can also access it in the <strong>Gemini app</strong>.</p><p>Veo-2 produces stunningly realistic and coherent video, making it a top contender alongside OpenAI's Sora and Kling. Having it easily accessible in Vertex AI is a big plus for developers on Google Cloud.</p><p>I've tried, and keep trying, all of them; Veo-2 is an absolute beast in realism.</p><p><strong>Kling 2.0 Creative Suite: A One-Stop Shop for Video AI? (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne%2Fstatus%2F1912043121497850242"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fklingai.com"><strong>Blog</strong></a><strong>)</strong></p><p>Kuaishou's <strong>Kling</strong> model also got a major upgrade, evolving into a full <strong>Kling 2.0 Creative Suite</strong>.</p><p><em>Anecdote:</em> I actually stayed up quite late one night trying to piece together info from a Chinese live stream about this release! The dedication is real, folks. 
😂</p><p><strong>What's New:</strong></p><p>* <strong>Kling 2.0 Master:</strong> The core video model, promising better motion, physics, and facial consistency (still 5s clips for now, but 30s/4K planned).</p><p>* <strong>Kolors 2.0:</strong> An integrated image generation and restyling model (think Midjourney-style filters).</p><p>* <strong>MVL (Multimodal Visual Language) Prompting:</strong> This is killer! You can now <strong>inline images directly within your text prompt</strong> for precise control (e.g., "Swap the hoodie in @video1 with the style of @image2"). This offers granular control artists have been craving.</p><p>* <strong>Multi-Elements Editor:</strong> A timeline-based editor to stitch clips, add lip-sync, sound effects (including generated ones like "car horn"), and music.</p><p>* <strong>Global Access:</strong> No more Chinese phone number requirement! Available worldwide at <a target="_blank" href="klingai.com"><strong>klingai.com</strong></a>.</p><p>* <strong>Official API via FAL:</strong> Developers can now integrate Kling 2.0 via our friends at <strong>⚡ FAL Generative Media Cloud</strong>.</p><p>Kling is clearly aiming to be a holistic creative platform, reducing the need to jump between 17 different AI tools for image gen, video gen, editing, and sound. The MVL prompting is particularly innovative. Very impressive package.</p><p>Voice & Audio: Talking to Dolphins? 🐬</p><p><strong>DolphinGemma: Google AI Listens to Flipper (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Ftechnology%2Fai%2Fdolphingemma%2F"><strong>Blog</strong></a><strong>)</strong></p><p>In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma. 
</p><p>It's a ~400M parameter audio model based on the Gemma architecture (using SoundStream for audio tokenization) trained specifically on decades of recorded dolphin clicks, whistles, and pulses. The goal? To decipher the potential syntax and structure within dolphin communication and eventually enable rudimentary two-way interaction using underwater communication devices. It runs on a Pixel phone for field deployment.</p><p>This is just awesome. Using AI not just for human tasks but to potentially bridge the communication gap with other intelligent species is genuinely inspiring. We joked on the show about doing a segment of just dolphin noises – maybe next time if DolphinGemma gets an API! 🤣</p><p>AI Art & Diffusion & 3D: Seedream Challenges the Champs</p><p><strong>Seedream 3.0: ByteDance's Bilingual Image Powerhouse (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fteam.doubao.com%2Fen%2Ftech%2Fseedream3_0"><strong>Tech post</strong></a><strong>, </strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Farxiv.org%2Fabs%2F2504.11346"><strong>arXiv</strong></a><strong>, </strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.aibase.com%2Fnews%2F17208"><strong>AIbase news</strong></a><strong>)</strong></p><p>ByteDance wasn't just busy with video; their Seed team announced <strong>Seedream 3.0</strong>, a powerful bilingual text-to-image model.</p><p><strong>Highlights:</strong></p><p>* Generates native <strong>2048x2048</strong> images.</p><p>* Fast inference (<strong>~3 seconds</strong> for 1Kx1K on an A100).</p><p>* Excellent <strong>bilingual (Chinese/English) text rendering</strong>, even small fonts.</p><p>* Uses <strong>Scaled-ROPE-v2</strong> for better high-resolution generation without artifacts.</p><p>* Claims to outperform SDXL-Turbo and Qwen-Image on fidelity and prompt adherence benchmarks.</p><p>* Available via Python SDK and REST API within their 
Doubao Studio and coming soon to <a target="_blank" href="http://dreamina.com">dreamina.com</a> </p><p>Phew! We made it. What an absolute avalanche of news. OpenAI truly dominated with the back-to-back launches of the hyper-capable o3/o4-mini and the massively scaled GPT-4.1 family. Google countered strongly with the versatile Gemini 2.5 Flash, key GA releases like Veo-2, and the strategically important A2A protocol. The agent ecosystem took huge leaps forward with both A2A and broader MCP adoption. And we saw continued innovation in multimodal embeddings, video generation, and even niche areas like bioacoustics and low-bit models.</p><p>If you feel like you missed anything (entirely possible this week!), the TL;DR and links below should help. Please subscribe if you haven't already, and share this with a friend if you found it useful – it's the best way to support the show!</p><p>I have a feeling next week won't be any slower. Follow us on X/Twitter for breaking news between shows!</p><p>Thanks for tuning in, keep building, keep learning, and I'll see you next Thursday!</p><p>Alex</p><p>TL;DR and Show Notes</p><p><em>Everything we covered today in bite-sized pieces with links!</em></p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=http%3A%2F%2Fx.com%2F%40altryne"><strong>@altryne</strong></a>)</p><p>* Co-hosts - <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=http%3A%2F%2Fx.com%2F%40WolframRvnwlf"><strong>@WolframRvnwlf</strong></a> <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=x.com%2F%40yampeleg"><strong>@yampeleg</strong></a> <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=http%3A%2F%2Fx.com%2F%40nisten"><strong>@nisten</strong></a> <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=http%3A%2F%2Fx.com%2F%40ldjconfirmed"><strong>@ldjconfirmed</strong></a></p><p>* Todd Segal - 
Principal Software Engineer @ Google - Working on A2A Protocol</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 👑 OpenAI launches <strong>o3</strong> and <strong>o4-mini</strong> in ChatGPT & API (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Findex%2Fintroducing-o3-and-o4-mini%2F"><strong>Blog</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fyoutube.com%2Flive%2F2G-VwWxKCkk"><strong>Our Coverage</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Findex%2Fintroducing-o3-and-o4-mini%2F"><strong>o3 and o4-mini announcement</strong></a>)</p><p>* OpenAI launches <strong>GPT 4.1, 4.1-mini and 4.1-nano</strong> in <strong>API</strong> (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.youtube.com%2Flive%2FA5-Zxj816J0"><strong>Our Coverage</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fnoahmacca%2Fstatus%2F1911898549308280911"><strong>Prompting guide</strong></a>)</p><p>* 🚨 Google launches <strong>Gemini 2.5 Flash</strong> with controllable thinking budgets (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Ftechnology%2Fai%2Fgoogle-gemini-update-flash-extension%2F"><strong>Blog Post - Placeholder Link</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fai.google.dev%2Fdocs"><strong>API Docs</strong></a>)</p><p>* Mistral Classifier Factory</p><p>* Claude does research + workspace integration (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.anthropic.com%2Fnews%2Fresearch"><strong>Blog</strong></a>)</p><p>* Cohere Embed‑4 — Multimodal embeddings for enterprise search (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fcohere.com%2Fblog%2Fembed-4"><strong>Blog</strong></a>, <a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fdocs.cohere.com%2Fv2%2Fchangelog%2Fembed-multimodal-v4"><strong>Docs Changelog</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fcohere%2Fstatus%2F1912128813104078999"><strong>X</strong></a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* OpenAI open sources MRCR Long‑Context Benchmark (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fdatasets%2Fopenai%2Fmrcr"><strong>Hugging Face</strong></a>)</p><p>* Microsoft BitNet b1.58 (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fcollections%2Fmicrosoft%2FBitNet"><strong>HF</strong></a>)</p><p>* INTELLECT‑2 — Prime Intellect's 32B "globally‑distributed RL" experiment (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.primeintellect.ai%2Fblog%2Fintellect-2"><strong>Blog</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fprimeintellect_ai"><strong>X</strong></a>)</p><p>* <a target="_blank" href="https://z.ai">Z.ai</a> (previously ChatGLM) + GLM‑4‑0414 open‑source family (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FZai_org%2Fstatus%2F1779846143024941199"><strong>X</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fcollections%2FTHUDM"><strong>HF Collection</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2FTHUDM%2FGLM-4"><strong>GitHub</strong></a>)</p><p>* <strong>This week's Buzz + MCP/A2A</strong></p><p>* Weave playground support for GPT 4.1 and o3/o4-mini models (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fweave_wb%2Fstatus%2F1912246450857341092"><strong>X</strong></a>)</p><p>* Chat with Todd Segal - A2A Protocol (<a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fgoogle%2FA2A"><strong>GitHub Spec</strong></a>)</p><p>* <strong>Vision & Video</strong></p><p>* <strong>Veo‑2 Video Generation in GA, Gemini App</strong> (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fdevelopers.googleblog.com%2Fen%2Fveo-2-video-generation-now-generally-available%2F"><strong>Dev Blog</strong></a>)</p><p>* <strong>Kling 2.0 Creative Suite</strong> (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne%2Fstatus%2F1912043121497850242"><strong>X</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fklingai.com"><strong>Blog</strong></a>)</p><p>* ByteDance publishes Seaweed-7B, a video generation foundation model (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fseaweed.video%2F"><strong>seaweed.video</strong></a>)</p><p>* <strong>Voice & Audio</strong></p><p>* <strong>DolphinGemma</strong> — Google AI tackles dolphin communication (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Ftechnology%2Fai%2Fdolphingemma%2F"><strong>Blog</strong></a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* <strong>Seedream 3.0 bilingual image diffusion – ByteDance</strong> (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fteam.doubao.com%2Fen%2Ftech%2Fseedream3_0"><strong>Tech post</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Farxiv.org%2Fabs%2F2504.11346"><strong>arXiv</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.aibase.com%2Fnews%2F17208"><strong>AIbase news</strong></a>)</p><p>* <strong>Tools</strong></p><p>* OpenAI debuts Codex CLI, an open source coding tool for terminals (<a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fopenai%2Fcodex"><strong>GitHub</strong></a>)</p><p>* Use o3 with Windsurf (which OpenAI is rumored to buy at $3B) via the Mac app integration + write back + multiple files</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-17-openai-o3-is-sota</link><guid isPermaLink="false">substack:post:161566938</guid><dc:creator><![CDATA[Alex Volkov and Todd Segal]]></dc:creator><pubDate>Thu, 17 Apr 2025 23:35:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/161566938/ac4191930c77f1fb5bcdb736043ebbeb.mp3" length="111213757" type="audio/mpeg"/><itunes:author>Alex Volkov and Todd Segal</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6951</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/161566938/431e71e908c9fd50ecc74e1e265d4f37.jpg"/></item><item><title><![CDATA[💯 ThursdAI - 100th episode 🎉 - Meta Llama 4, Google tons of updates, ChatGPT memory, WandB MCP manifesto & more AI news]]></title><description><![CDATA[<p>Hey Folks, </p><p>Alex here, celebrating an absolutely crazy (to me) milestone, of #100 episodes of ThursdAI 👏 100 episodes in a year and a half (as I <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-12-show-recap-notes">started publishing</a> much later than I started going live, and the first episode was embarrassing), 100 episodes that documented INCREDIBLE AI progress. As we mentioned on the show today, we used to be excited by context windows jumping from 4K to 16K! 
</p><p>I want to extend a huge thank you to every one of you who subscribes, listens to the show on podcasts, joins the live recording (we regularly get over 1K live viewers across platforms), shares with friends, and the highest thank you goes to the paid supporters! 🫶 Sharing the AI news progress with you energizes me to keep going, despite the absolute avalanche of news every week.</p><p>And what a perfect way to celebrate the 100th episode, on a week that Meta dropped Llama 4, sending the open-source world into a frenzy (and a bit of chaos). Google unleashed a firehose of announcements at Google Next. The agent ecosystem got a massive boost with MCP and A2A developments. And we had fantastic guests join us – Michael Luo diving deep into the impressive DeepCoder-14B, and Liad Yosef & Ido Salomon sharing their wild ride creating the viral GitMCP tool.</p><p>I really loved today's show, and I encourage those of you who only read, to give this a watch/listen, and those of you who only listen, enjoy the recorded version (though longer and less edited!) </p><p>Now let's dive in, there's a LOT to talk about (TL;DR and show notes as always, at the end of the newsletter) </p><p>Open Source AI & LLMs: Llama 4 Takes Center Stage (Amidst Some Drama)</p><p><strong>Meta drops Llama 4 - Scout 109B/17BA & Maverick 400B/17BA </strong>(<a target="_blank" href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Blog</a>, <a target="_blank" href="https://huggingface.co/meta-llama">HF</a>, <a target="_blank" href="https://meta.ai/">Try It</a>)</p><p>This was by far the biggest news of this last week, and it dropped... on a Saturday? (I was on the mountain ⛷️! 
What are you doing Zuck) </p><p>Meta dropped the long-awaited Llama 4 models, huge ones this time:</p><p>* Llama 4 <strong>Scout</strong>: 17B active parameters out of ~109B total (16 experts).</p><p>* Llama 4 <strong>Maverick</strong>: 17B active parameters out of a whopping ~400B total (128 experts).</p><p>* Unreleased: <strong>Behemoth</strong> - 288B active with 2 Trillion total parameters chonker!</p><p>* Both base and instruct finetuned models were released</p><p>These new models all use a multimodal, multilingual MoE (mixture of experts) architecture, and were trained with FP8, for significantly more tokens (around 30 Trillion Tokens!) with interleaved attention (iRoPE), and a refined SFT > RL > DPO post-training pipeline.</p><p>The biggest highlight is the stated context windows, 10M for Scout and 1M for Maverick, which is insane (and honestly, I haven't yet seen a provider that is even remotely able to support anything of this length, nor do I have the tokens to verify it) </p><p>The messy release - Big Oof from Big Zuck</p><p>Not only did Meta release on a Saturday, messing up people's weekends, but they apparently announced a high LMArena score, and the model they provided to LMArena was... not the model they released!?</p><p>It caused LMArena to release the 2000 chats dataset, and truly, some examples are quite damning and show just how unreliable LMArena can be as a vibe eval. </p><p>Additionally, over the following days, folks noticed discrepancies between the stated eval scores Meta released, and the ability to evaluate them independently, including our own Wolfram, who noticed that a HIGHLY quantized (read: reduced precision) version of Scout performed better on his laptop than it did on the Together API inference endpoint!? </p><p>We've chatted on the show that this may be due to some VLLM issues, and speculated about other potential reasons for this. 
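</p><p>Since quantization and model size came up so much here, a quick back-of-the-envelope sketch of what a Scout-shaped MoE costs in weight memory at different precisions (the numbers are my rough assumptions for illustration, not Meta's figures):</p>

```python
# Back-of-the-envelope weight-memory math for an MoE model shaped like
# Llama 4 Scout (~109B total params, ~17B active per token).
# Rough illustration only: ignores KV cache, activations, and overhead.

TOTAL_PARAMS = 109e9   # every expert must stay resident in memory
ACTIVE_PARAMS = 17e9   # params actually used per forward pass

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = TOTAL_PARAMS * nbytes / 1e9
    print(f"{precision}: ~{weights_gb:.1f} GB of weights")

# MoE routing cuts compute (only ~16% of params fire per token),
# not memory: even at int4 you still need roughly 55 GB of weights,
# which is why "17B active" does not mean "runs on a laptop".
```

<p>It also hints at why aggressive quantization is so tempting for these models, and why precision differences between inference stacks can move eval numbers around.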
</p><p>Worth noting the official response from Ahmad Al-Dahle, head of Llama at Meta, who mentioned stability issues between providers and absolutely denied any training on any benchmarks.</p><p>Too big for its own good (and us?)</p><p>One of the main criticisms the OSS community had about these releases is that, for many of us, the reason for celebrating Open Source AI is the ability to run models without network, privately on our own devices. </p><p>Llama 3 was released in 8-70B distilled versions and that was incredible for us local AI enthusiasts! These models, despite being "only" 17B active params, are huge and way too big to run on most local hardware, and so the question then is, if we're getting a model that HAS to run on a service, why not use Gemini 2.5 that's MUCH better and faster and cheaper than Llama? </p><p>Why didn't Meta release those sizes? Was it due to an inability to beat Qwen/DeepSeek enough? 🤔 </p><p>My Take</p><p>Despite the absolutely chaotic rollout, this is still a monumental effort from Meta. They spent <em>millions</em> on compute and salaries to give this to the community. Yes, no papers yet, the LMArena thing was weird, and the inference wasn't ready. But Meta is standing up for Western open-source in a big way. We <em>have</em> to celebrate the core contribution while demanding better rollout practices next time. As Wolfram rightly said, the real test will be the fine-tunes and distillations the community builds on these base models. Releasing the base weights is crucial for that. Let's see if the community can tame this beast once the inference dust settles. Shout out to Ahmad Al-Dahle and the whole Llama team at Meta – incredible work, messy launch, but thank you for pushing open source forward. 
🎉</p><p>Together AI & Agentica (Berkeley) finetuned DeepCoder-14B with reasoning (<a target="_blank" href="https://x.com/togethercompute/status/1909697124805333208">X</a>, <a target="_blank" href="https://www.together.ai/blog/deepcoder">Blog</a>)</p><p>Amidst the Llama noise, we got another stellar open-source release! We were thrilled to have Michael Luo from Agentica/UC Berkeley join us to talk about DeepCoder-14B-Preview, which beats DeepSeek R1 and even o3-mini on several coding benchmarks. </p><p>Using distributed Reinforcement Learning (RL), it achieves 60.6% Pass@1 accuracy on LiveCodeBench, matching the performance of models like o3-mini-2025-01-31 (Low) despite its smaller size.</p><p>The stated purpose of the project is to democratize RL and they have open sourced the model (<a target="_blank" href="https://huggingface.co/agentica-org/DeepCoder-14B-Preview">HF</a>), the dataset (<a target="_blank" href="https://huggingface.co/datasets/agentica-org/DeepCoder-Preview-Dataset">HF</a>), the Weights & Biases <a target="_blank" href="https://wandb.ai/mluo/deepcoder">logs</a> and even the <a target="_blank" href="https://drive.google.com/file/d/1tr_xXvCJnjU0tLO7DNtFL85GIr3aGYln/view?usp=sharing">eval logs</a>! </p><p>Shout out to Michael, Sijun and Alpay and the rest of the team who worked on this awesome model! </p><p>NVIDIA Nemotron ULTRA is finally here, 253B pruned Llama 3-405B (<a target="_blank" href="https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1">HF</a>)</p><p>While Llama 4 was wrapped in mystery, NVIDIA dropped their pruned and distilled finetune of the previous Llama chonker 405B model, coming in at just about half the parameters. </p><p>And they were able to include the Llama 4 benchmarks in their release, showing that the older Llama, finetuned, can absolutely beat the new ones at AIME, GPQA and more. 
</p><p>As a reminder, we covered the previous two Nemotrons, and they are combined reasoning and non-reasoning models, so the jump is not that surprising, and it does seem like a bit of eval cherry picking happened here. </p><p>Nemotron Ultra supports 128K context and fits on a single 8xH100 node for inference. Built on open Llama models and trained on vetted + synthetic data, it's commercially viable. Shout out to NVIDIA for releasing this, and especially for pushing open reasoning datasets which Nisten rightly praised as having long-term value beyond the models themselves.</p><p><strong>More Open Source Goodness: Jina, DeepCogito, Kimi</strong></p><p>The open-source train didn't stop there:</p><p>* <strong>Jina Reranker M0:</strong> Our friends at Jina released a state-of-the-art <em>multimodal</em> reranker model. If you're doing RAG with images and text, this looks super useful for improving retrieval quality across languages and modalities (<a target="_blank" href="https://jina.ai/news/jina-reranker-m0-multilingual-multimodal-document-reranker/">Blog</a>, <a target="_blank" href="https://huggingface.co/jinaai/jina-reranker-m0">HF</a>)</p><p>* <strong>DeepCogito:</strong> A new company emerged releasing a suite of Llama fine-tunes (3B up to 70B, with larger ones coming) trained using a technique they call Iterated Distillation and Amplification (IDA). They claim their 70B model beats DeepSeek R1 Distill 70B on some benchmarks. Definitely one to watch. (<a target="_blank" href="https://www.deepcogito.com/research/cogito-v1-preview">Blog</a>, <a target="_blank" href="https://huggingface.co/deepcogito/cogito-v1-preview-llama-70B">HF</a>)</p><p>* <strong>Kimi-VL & Kimi-VL-Thinking:</strong> Moonshot AI, who sometimes gets lost in the noise, released incredibly impressive Kimi Vision Language Models (VLMs). 
These are MoE models with only ~3 Billion active parameters, yet they're showing results on par with or even beating models 10x larger (like Gemma 2 27B) on benchmarks like MathVision and ScreenSpot. They handle high-res images, support 128k context, and crucially, include a <em>reasoning</em> VLM variant. Plus, they're MIT licensed! Nisten's been following Kimi and thinks they're legit, just waiting for easier ways to run them locally. Definitely keep an eye on Kimi. (<a target="_blank" href="https://huggingface.co/collections/moonshotai/kimi-vl-a3b-67f67b6ac91d3b03d382dd85">HF</a>)</p><p>This Week's Buzz from Weights & Biases - Observable Tools & A2A!</p><p>This week was personally very exciting on the W&B front, as I spearheaded and launched initiatives directly related to the MCP and A2A news!</p><p><strong>W&B launches the </strong><a target="_blank" href="https://observable.tools"><strong>observable.tools</strong></a><strong> initiative!</strong></p><p>As MCP takes off, one challenge becomes clear: observability. When your agent calls an external MCP tool, that part of the execution chain becomes a black box. You lose the end-to-end visibility crucial for debugging and evaluation.</p><p>That's why I'm thrilled that we launched <strong>Observable Tools (</strong><a target="_blank" href="https://observable.tools"><strong>Website</strong></a><strong>)</strong> – an initiative championing full-stack agent observability, specifically within the MCP ecosystem. Our vision is to enable developers using tools like W&B Weave to see <em>inside</em> those MCP tool calls, getting a complete trace of every step.</p><p>The core of this is <strong>Proposal </strong><a target="_blank" href="https://wandb.me/mcp-spec"><strong>RFC 269</strong></a><strong> on the official MCP GitHub spec</strong>, which I authored! (My first RFC, quite the learning experience!). 
It details how to integrate OpenTelemetry tracing directly into the MCP protocol, allowing tools to securely report detailed execution spans back to the calling client (agent). We went deep on the spec, outlining transmission mechanisms, schemas, and rationale.</p><p><strong>My ask to you, the ThursdAI community:</strong> Please check out <a target="_blank" href="https://observable.tools"><strong>observable.tools</strong></a>, read the manifesto, watch the fun video we made, and most importantly, <strong>go to the RFC 269 proposal (shortcut: </strong><a target="_blank" href="https://wandb.me/mcp-spec"><strong>wandb.me/mcp-spec</strong></a><strong>)</strong>. Read it, comment, give feedback, and upvote if you agree! We need community support to make this impossible for the MCP maintainers to ignore. Let's make observability a first-class citizen in the MCP world! We also invite our friends from across the LLM observability landscape (LangSmith, Braintrust, Arize, Galileo, etc.) to join the discussion and collaborate.</p><p><strong>W&B is a Launch Partner for Google's A2A</strong></p><p>As mentioned earlier, we're also excited to be a launch partner for Google's new <a target="_blank" href="https://wandb.ai/wandb_fc/product-announcements-fc/reports/Powering-Agent-Collaboration-Weights-Biases-Partners-with-Google-Cloud-on-Agent2Agent-Interoperability-Protocol---VmlldzoxMjE3NDg3OA">Agent2Agent</a> (A2A) protocol. We believe standardized communication <em>between</em> agents is the next critical step, and we'll be supporting A2A alongside MCP in our tools. Exciting times for agent infrastructure! I've invited Google folks to next week's show to discuss the protocol in depth! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Big Company LLMs + APIs: Google's Onslaught & OpenAI's Memory Upgrade</p><p>While open source had a wild week, the big players weren't sleeping. Google especially came out swinging at Google Next.</p><p><strong>Google announces TONS of new things at Next 🙌  (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Fproducts%2Fgoogle-cloud%2Fnext-2025%2F"><strong>Blog</strong></a><strong>)</strong></p><p>Google I/O felt like a preview, Google Next felt like the delivery truck backing up and dumping everything. Here's the whirlwind tour:</p><p>* <strong>Gemini 2.5 Flash API:</strong> The faster, cheaper Gemini 2.5 model is coming soon to Vertex AI. (Still waiting on that general API access!).</p><p>* <strong>Veo 2 Editing:</strong> Their top-tier video model (competing with Sora, Kling) gets editing capabilities. Very cool.</p><p>* <strong>Imagen 3 Updates:</strong> Their image model gets improvements, including inpainting.</p><p>* <strong>Lyria:</strong> Text-to-music model moves into preview.</p><p>* <strong>TPU v7 (Ironwood):</strong> New TPU generation coming soon. As Nisten noted, Google's infrastructure uptime is consistently amazing, which could be a winning factor regardless of model SOTA status.</p><p>* <strong>Chirp 3 HD Voices + Voice Cloning:</strong> This one raised eyebrows. The notes mentioned HD voices <em>and</em> voice cloning. Cloning is a touchy subject the big players usually avoid publicly (copyright, deepfakes). Still digging for confirmation/details on this – if Google is really offering public voice cloning, that's huge. Let me know if you find a link!</p><p>* <strong>Deep Research gets Gemini 2.5 Pro:</strong> The experimental deep research feature in Gemini (their answer to OpenAI's research agent) now uses the powerful 2.5 Pro model. 
Google released comparison stats showing users strongly prefer it (70%) over OpenAI's offering, citing better instruction following and comprehensiveness. I haven't fully tested the 2.5 version yet, but the free tier access is amazing. And just look at those differences in preference compared to OAI Deep Research! </p><p><strong>Firebase Studio </strong>(<a target="_blank" href="https://firebase.studio/">firebase.studio</a>)<strong>:</strong> Remember Project IDX? It's been rebranded and launched as Firebase Studio. This is Google's answer to the wave of "vibe coding" web builders like Lovable, Bolt and a few more. It's a full-stack, cloud-based GenAI environment for building, testing, and deploying apps, integrated with Firebase and likely Gemini. Looks promising!</p><p><strong>Google Embraces MCP & Launches A2A Protocol!</strong></p><p>Two massive protocol announcements from Google that signal the maturation of the AI agent ecosystem:</p><p>* <strong>Official MCP Support! (</strong><a target="_blank" href="https://twitter.com/demishassabis/status/1910107859041271977"><strong>X</strong></a><strong>)</strong> This is huge. Following Microsoft and AWS, Google (via both Sundar Pichai and Demis Hassabis) announced official support for Anthropic's Model Context Protocol (MCP) in Gemini models and SDKs. MCP is rapidly becoming <em>the</em> standard for how agents discover and use tools securely and efficiently. With Google onboard, there's basically universal major vendor support. 
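</p><p>For a concrete feel of what "supporting MCP" means at the wire level, here is a minimal sketch of the JSON-RPC message shapes involved (method names follow the MCP spec as I read it; the get_weather tool and its arguments are invented for illustration):</p>

```python
# Minimal sketch of MCP's JSON-RPC 2.0 message shapes for tool use.
# tools/list and tools/call are the spec's method names as I read it;
# the "get_weather" tool is a made-up example.
import json

# 1) The client (agent) asks the MCP server what tools it offers.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# 2) After the model picks a tool, the client invokes it by name.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {"name": "get_weather", "arguments": {"city": "Denver"}},
}

# Messages travel over stdio or HTTP/SSE; the server answers each id
# with a JSON-RPC result (a tool catalog, then the tool's output).
wire = json.dumps(call_tool)
print(wire)
```

<p>The simplicity of that shape is a big part of why every vendor can adopt it so quickly.</p><p>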
MCP is here to stay.</p><p>* <strong>Agent2Agent (A2A) Protocol (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fdevelopers.googleblog.com%2Fen%2Fa2a-a-new-era-of-agent-interoperability%2F"><strong>Blog</strong></a><strong>, </strong><a target="_blank" href="https://github.com/google/A2A"><strong>Spec</strong></a><strong>, </strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwandb.ai%2Fwandb_fc%2Fproduct-announcements-fc%2Freports%2FPowering-Agent-Collaboration-Weights-Biases-Partners-with-Google-Cloud-on-Agent2Agent-Interoperability-Protocol---VmlldzoxMjE3NDg3OA"><strong>W&B Blog</strong></a><strong>)</strong> Google also launched a <em>new</em> open standard, A2A, designed for interoperability <em>between</em> different AI agents. Think of agents built by different vendors (Salesforce, ServiceNow, etc.) needing to talk to each other securely to coordinate complex workflows across enterprise systems. Built on web standards (HTTP, SSE, JSON-RPC), it handles discovery, task management (long-running!), and modality negotiation. Importantly, Google positions A2A as <em>complementary</em> to MCP, not competitive. MCP is how an agent uses a <em>tool</em>, A2A is how an <em>agent</em> talks to <em>another agent</em>. Weights & Biases is proud to be one of the 50+ launch partners working with Google on this! We'll do a deeper dive soon, but this + MCP feels like the foundation for a truly interconnected agent future.</p><p><strong>Cloudflare - new Agents SDK (</strong><a target="_blank" href="https://agents.cloudflare.com/">agents.cloudflare.com</a>)</p><p>Speaking of agents, Cloudflare launched their new Agents SDK (npm i agents). Built on their serverless infrastructure (Workers, Durable Objects), it offers a platform for building stateful, autonomous AI agents with a compelling pricing model (pay for CPU time, not wall time). 
This ties into the GitMCP story later – Cloudflare is betting big on the edge agent ecosystem.</p><p><strong>Other Big Co News:</strong></p><p>* <strong>Anthropic MAX:</strong> A new $200/month tier for Claude, offering higher usage quotas but no new models. Meh.</p><p>* <strong>Grok 3 API:</strong> Elon's xAI finally launched the API tier for Grok 3 (plus Fast and Mini variants). Now you can test its capabilities yourself. We're still waiting for the promised Open Source Grok-2.</p><p><strong>🚨 BREAKING NEWS 🚨 OpenAI Upgrades Memory</strong></p><p>Right on cue during the show, OpenAI dropped a feature update! Sam Altman hyped <em>something</em> coming, and while it wasn't the o3/o4-mini models (those are coming next), it's a significant enhancement to <strong>ChatGPT Memory</strong>.</p><p>Previously, Memory tried to selectively save key facts. Now, when enabled, it can <strong>reference ALL of your past chats</strong> to personalize responses. Preferences, interests, past projects – it can potentially draw on everything. OpenAI states there's <strong>no storage limit</strong> for what it can reference.</p><p>How? Likely some sophisticated RAG/vector search under the hood, not stuffing everything into context. LDJ mentioned he might have had this rolling out silently for weeks, and while the immediate difference wasn't huge, the potential is massive as models get better at utilizing this vast personal context.</p><p>The immediate reaction? Excitement mixed with a bit of caution. As Wolfram pointed out, do I <em>really</em> want it remembering <em>every</em> single chat? Configurable memory (flagging chats for inclusion/exclusion) seems like a necessary follow-up. Thanks for the feature request, Wolfram! (And yes, Europe might not get this right away anyway...). 
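</p><p>OpenAI hasn't published how the retrieval actually works, so purely as a toy illustration of the RAG/vector-search idea we speculated about, here is the general shape (bag-of-words similarity instead of learned embeddings, and all the chat data is invented):</p>

```python
# Toy illustration of retrieval over chat history, the kind of
# RAG/vector-search approach that may power ChatGPT's new memory.
# Real systems use learned embeddings; bag-of-words cosine similarity
# is used here purely for illustration. All data is made up.
import math
from collections import Counter

past_chats = [
    "We debugged my Rust borrow checker errors in the parser",
    "Planned a week-long ski trip itinerary in Colorado",
    "Compared MoE architectures like Mixtral and DeepSeek",
]

def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(past_chats, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]  # these snippets get injected into the prompt

print(recall("which MoE architectures did we compare?"))
```

<p>A production system would use a real embedding model and a vector store, but the retrieve-then-stuff-into-context loop is the same shape, which is also why "no storage limit" is plausible: only the retrieved snippets ever enter the context window.</p><p>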
This could finally stop ChatGPT from asking me basic questions it should know from our history!</p><p>Prompt suggestion: Ask the new ChatGPT with memory about a thing that you asked it that you likely forgot.</p><p>Just don't ask it what the most boring thing you ever asked was. I got cooked, and I'm still feeling raw 😂 </p><p>Vision & Video: One-Minute Video Generation with Test-Time Training</p><p>The most impressive long form AI video paper dropped, showing that it's possible to create a 1-minute-long video with incredible character and scene consistency.</p><p>This <a target="_blank" href="https://t.co/agJKUAExpz">paper</a> introduces TTT (Test-Time Training) layers to a pre-trained transformer, allowing it to one-shot generate these incredibly consistent long scenes. Can't wait to see how the future of AI video evolves with this progress! </p><p>AI Art & Diffusion & 3D: HiDream Takes the Open Crown</p><p><strong>HiDream-I1-Dev 17B MIT license new leading open weights image gen! (</strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fcollections%2FHiDream-ai%2Fhidream-i1-67f3e90dd509fed088a158b3"><strong>HF</strong></a><strong>)</strong></p><p>Just when we thought the image gen space was settling, HiDream, a Chinese company, open-sourced their HiDream-I1 family under MIT license! This 17B parameter model comes in Dev, Full, and Fast variants.</p><p>The exciting part? Based on early benchmarks (like Artificial Analysis Image Arena), <strong>HiDream-I1-Dev surpasses Flux 1.1 [Pro]</strong>, Recraft V3, Reve and Imagen 3 while being open source! It boasts outstanding prompt following and text rendering capabilities.</p><p>HiDream's API is coming soon too and I really hope it's finetunable! </p><p>Tools: GitMCP - The Little Repo Tool That Could</p><p>GitMCP - turn any GitHub repo into an MCP server (<a target="_blank" href="https://gitmcp.io">website</a>)</p><p>We wrapped up the show with a fantastic story from the community. 
We had Liad Yosef (Shopify) and Ido Salomon (Palo Alto Networks) join us to talk about <strong>GitMCP</strong>.</p><p>It started with a simple problem: a 3MB llms.txt file (a format proposed by Jeremy Howard for repo documentation) that was too large for context windows. Liad and Ido, working nights and weekends, built an MCP server that could ingest any GitHub repo (prioritizing llms.txt if present, falling back to Readmes/code comments) and expose its documentation via MCP tools (semantic search, fetching).</p><p>This means any MCP-compatible client (like Cursor, potentially future ChatGPT/Gemini) can instantly query the documentation of <em>any</em> public GitHub repo just by pointing to the GitMCP URL for that repo (e.g., <a target="_blank" href="https://gitmcp.io/user/repo">https://gitmcp.io/user/repo</a>). As Yam Peleg pointed out during the show, the genius here is dynamically generating a <em>customized</em> tool specifically for that repo, making it incredibly easy for the LLM to use.</p><p>Then, the story got crazy. They launched, went viral, almost melted their initial Vercel serverless setup due to traffic and SSE connection costs ($100+ per hour!). DMs flew back and forth with Vercel's CEO, then Cloudflare's CTO swooped in offering to sponsor hosting on Cloudflare's <em>unreleased </em>Agents platform if they migrated <em>immediately</em>. A frantic weekend of coding ensued, culminating in a nail-biting domain switch and a temporary outage before getting everything stable on Cloudflare.</p><p>The project has received massive praise (including from Jeremy Howard himself) and is solving a real pain point for developers wanting to easily ground LLMs in project documentation. Huge congrats to Liad and Ido for the amazing work and the wild ride! Check out gitmcp.io!</p><p>Wrapping Up Episode 100!</p><p>Whew! What a show. 
From the Llama 4 rollercoaster to Google's AI barrage, the rise of agent standards like MCP and A2A, groundbreaking open source models, and incredible community stories like GitMCP – this episode truly showed an exemplary week in AI and underlined the reason I do this every week. It's really hard to keep up, and so if I commit to you guys, I stay up to date myself! </p><p>Hitting 100 episodes feels surreal. It's been an absolute privilege sharing this journey with Wolfram, LDJ, Nisten, Yam, all our guests, and all of you. Seeing the community grow, hitting milestones like 1000 YouTube subscribers today, fuels us to keep going 🎉 </p><p>The pace isn't slowing down. If anything, it's accelerating. But we'll be right here, every Thursday, trying to make sense of it all, together.</p><p>If you missed anything, don't worry! Subscribe to the ThursdAI News Substack for the full TL;DR and links below.</p><p>Thanks again for making 100 episodes possible. Here's to the next 100! 🥂</p><p>Keep tinkering, keep learning, and I'll see you next week.</p><p>Alex</p><p><strong>TL;DR and Show Notes</strong></p><p>* <strong>Hosts and Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>* Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a> <a target="_blank" href="http://x.com/@ldjconfirmed">@ldjconfirmed</a></p><p>* <strong>Michael Luo </strong><a target="_blank" href="http://x.com/michaelzluo">@michaelzluo</a> - CS PhD @ UC Berkeley; AI & Systems</p><p>* <strong>Liad Yosef </strong>(<a target="_blank" href="https://x.com/liadyosef">@liadyosef</a>), <strong>Ido Salomon </strong>(<a target="_blank" href="https://x.com/idosal1">@idosal1</a>) - GitMCP creators</p><p>* <strong>Open Source LLMs</strong> </p><p>* Meta drops Llama 4 
(Scout 109B/17BA & Maverick 400B/17BA) - (<a target="_blank" href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Blog</a>, <a target="_blank" href="https://huggingface.co/meta-llama">HF</a>, <a target="_blank" href="https://meta.ai/">Try It</a>)</p><p>* Together AI and Agentica (UC Berkeley) announce <strong>DeepCoder-14B</strong> (<a target="_blank" href="https://x.com/togethercompute/status/1909697124805333208">X</a>, <a target="_blank" href="https://www.together.ai/blog/deepcoder">Blog</a>)</p><p>* NVIDIA Nemotron Ultra is here! 253B pruned Llama 3-405B (<a target="_blank" href="https://x.com/kuchaev/status/1909444566379573646">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1">HF</a>)</p><p>* Jina Reranker M0 - SOTA multimodal reranker model (<a target="_blank" href="https://jina.ai/news/jina-reranker-m0-multilingual-multimodal-document-reranker/">Blog</a>, <a target="_blank" href="https://huggingface.co/jinaai/jina-reranker-m0">HF</a>)</p><p>* DeepCogito - SOTA models 3-70B - beating DeepSeek 70B - (<a target="_blank" href="https://www.deepcogito.com/research/cogito-v1-preview">Blog</a>, <a target="_blank" href="https://huggingface.co/deepcogito/cogito-v1-preview-llama-70B">HF</a>)</p><p>* ByteDance new release - <a target="_blank" href="https://github.com/ByteDance-Seed/Seed-Thinking-v1.5"><strong>Seed-Thinking-v1.5</strong></a></p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google announces TONS of new things 🙌  (<a target="_blank" href="https://blog.google/products/google-cloud/next-2025/">Blog</a>)</p><p>* Google launches Firebase Studio (<a target="_blank" href="https://firebase.studio/">website</a>)</p><p>* Google is announcing official support for MCP (<a target="_blank" href="https://x.com/demishassabis/status/1910107859041271977">X</a>)</p><p>* Google announces A2A protocol - agent 2 agent communication (<a target="_blank" 
href="https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/">Blog</a>, <a target="_blank" href="https://github.com/google/A2A">Spec</a>, <a target="_blank" href="https://wandb.ai/wandb_fc/product-announcements-fc/reports/Powering-Agent-Collaboration-Weights-Biases-Partners-with-Google-Cloud-on-Agent2Agent-Interoperability-Protocol---VmlldzoxMjE3NDg3OA">W&B Blog</a>)</p><p>* Cloudflare - new Agents SDK (<a target="_blank" href="https://agents.cloudflare.com/">Website</a>)</p><p>* Anthropic MAX - $200/mo with more quota</p><p>* Grok 3 finally launches API tier (<a target="_blank" href="https://docs.x.ai/docs/models#models-and-pricing">API</a>)</p><p>* OpenAI adds enhanced memory to ChatGPT - can remember all your chats (<a target="_blank" href="https://x.com/OpenAI/status/1910378768172212636">X</a>)</p><p>* <strong>This week's Buzz - MCP and A2A</strong></p><p>* W&B launches the <a target="_blank" href="http://observable.tools">observable.tools</a> initiative & invites people to comment on the MCP <a target="_blank" href="http://wandb.me/mcp-spec">RFC</a></p><p>* W&B is the launch partner for Google's A2A (<a target="_blank" href="https://wandb.ai/wandb_fc/product-announcements-fc/reports/Powering-Agent-Collaboration-Weights-Biases-Partners-with-Google-Cloud-on-Agent2Agent-Interoperability-Protocol---VmlldzoxMjE3NDg3OA">Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* <strong>Kimi-VL and Kimi-VL-Thinking - </strong>A3B vision models (X, <a target="_blank" href="https://t.co/cgCMsiHN8p">HF</a>)</p><p>* One-Minute Video Generation with Test-Time Training (<a target="_blank" href="https://t.co/BSHsucizoG">Blog</a>, <a target="_blank" href="https://t.co/agJKUAExpz">Paper</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Amazon - Nova Sonic - speech2speech foundational model (<a target="_blank" href="https://www.aboutamazon.com/news/innovation-at-amazon/nova-sonic-voice-speech-foundation-model">Blog</a>)</p><p>* <strong>AI Art & Diffusion 
& 3D</strong></p><p>* <strong>HiDream-I1-Dev</strong> - 17B, MIT license - new leading open-weights image gen, surpasses Flux1.1 [pro]! (<a target="_blank" href="https://huggingface.co/collections/HiDream-ai/hidream-i1-67f3e90dd509fed088a158b3">HF</a>)</p><p>* <strong>Tools</strong></p><p>* GitMCP - turn any GitHub repo into an MCP server (<a target="_blank" href="https://gitmcp.io/">try it</a>)</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-100th-episode-meta-llama</link><guid isPermaLink="false">substack:post:161058643</guid><dc:creator><![CDATA[Alex Volkov, Michael Luo, Liad Yosef, Ido Salomon, and Nisten]]></dc:creator><pubDate>Thu, 10 Apr 2025 23:20:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/161058643/735c63aedb63aac0e08f52bd660afe90.mp3" length="88615862" type="audio/mpeg"/><itunes:author>Alex Volkov, Michael Luo, Liad Yosef, Ido Salomon, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5538</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/161058643/cc9056182b250809dda99a7c6bfb651f.jpg"/><itunes:episode>100</itunes:episode><itunes:episodeType>full</itunes:episodeType></item><item><title><![CDATA[ThursdAI - Apr 3rd - OpenAI Goes Open?! Gemini Crushes Math, AI Actors Go Hollywood & MCP, Now with Observability?]]></title><description><![CDATA[<p>Woo! Welcome back to ThursdAI, show number 99! Can you believe it? We are <em>one</em> show away from hitting the big 100, which is just wild to me. 
And speaking of milestones, we just crossed <strong>100,000 downloads</strong> on Substack alone! [Insert celebratory sound effect here 🎉]. Honestly, knowing so many of you tune in every week <strong>genuinely fills me with joy</strong>, but also instills a real commitment to keep bringing you the high-signal, zero-fluff AI news you count on. Thank you for being part of this amazing community! 🙏</p><p>And what a week it's been! I started out busy at work, playing with the native image generation in ChatGPT like everyone else (all 130 million of us!), and then I looked at my notes for today… an absolute <em>mountain</em> of updates. Seriously, one of those weeks where open source just exploded, big companies dropped major news, and the vision/video space is producing stuff that's crossing the uncanny valley.</p><p>We’ve got OpenAI teasing a big open source release (yes, <em>Open</em>AI might actually be <em>open</em> again!), Gemini 2.5 showing superhuman math skills, Amazon stepping into the agent ring, truly mind-blowing AI character generation from Meta, and a personal update on making the Model Context Protocol (MCP) observable. Plus, we had some fantastic guests join us live!</p><p>So buckle up, grab your coffee (or whatever gets you through the AI whirlwind), because we have a <em>lot</em> to cover. Let's dive in! (as always, show notes and links in the end)</p><p>OpenAI Makes Waves: Open Source Tease, Tough Evals & Billions Raised</p><p>It feels like OpenAI was determined to dominate the headlines this week, hitting us from multiple angles.</p><p>First, the potentially massive news: <strong>OpenAI is planning to release a new open source model</strong> in the "coming months"! 
Kevin Weil <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fkevinweil%2Fstatus%2F1906797119848988822"><strong>tweeted</strong></a> that they're working on a "highly capable open language model" and are actively seeking developer feedback through dedicated sessions (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Fform%2Fopen-model-feedback"><strong>sign up here</strong></a> if interested) to "get this right." Word on the street is that this could be a powerful reasoning model. Sam Altman also cheekily added they won't slap on a Llama-style license limit (under 700M users). Seeing OpenAI potentially re-embrace its "Open" roots with a SOTA-level model is huge. We'll be watching like hawks!</p><p>Second, they dropped <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Findex%2Fpaperbench%2F"><strong>PaperBench</strong></a>, a <em>brutal</em> new benchmark evaluating an AI's ability to replicate ICML 2024 research papers from scratch (read paper, write code, run experiments, match results - no peeking at original code!). It's incredibly detailed (>8,300 tasks) and even includes meta-evaluation for the LLM judge they built (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fopenai%2Fpreparedness"><strong>Nano-Eval framework also open sourced</strong></a>). The kicker? <strong>Claude 3.5 Sonnet (New)</strong> came out on top with just <strong>21.0%</strong> replication score (human PhDs got 41.4%). Props to OpenAI for releasing an eval where they don’t even win. That’s what real benchmarking integrity looks like. 
You can find the <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fopenai%2Fpreparedness%2Ftree%2Fmain%2Fproject%2Fpaperbench"><strong>code on GitHub</strong></a> and read the <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fcdn.openai.com%2Fpapers%2F22265bac-3191-44e5-b057-7aaacd8e90cd%2Fpaperbench.pdf"><strong>full paper here</strong></a>.</p><p>Third, the casual <a target="_blank" href="https://openai.com/index/investing-in-our-mission/"><strong>40 Billion Dollars</strong></a> funding round led by SoftBank, valuing the company at <strong>300 Billion</strong>. Yes, Billion with a B. More than Coke, more than Disney. The blog post was hilariously short for such a massive number. They also mentioned <strong>500 million weekly ChatGPT users</strong> and the insane onboarding rate (1M users/hr!) thanks to native image generation, especially seeing huge growth in India. The scale is just mind-boggling.</p><p>Oh, and for fun, try the new grumpy, emo "<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FOpenAI%2Fstatus%2F1907124258867982338"><strong>Monday" voice</strong></a> in advanced voice mode. It's surprisingly entertaining.</p><p>Open Source Powerhouses: Nomic & OpenHands Deliver SOTA</p><p>Beyond the OpenAI buzz, the open source community delivered some absolute gems, and we had guests from two key projects join us!</p><p>Nomic Embed Multimodal: SOTA Embeddings for Visual Docs</p><p>Our friends at Nomic AI are back with a killer release! We had Zach Nussbaum on the show discussing <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.nomic.ai%2Fblog%2Fposts%2Fnomic-embed-multimodal"><strong>Nomic Embed Multimodal</strong></a>. 
These are new 3B & 7B parameter embedding models (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fcollections%2Fnomic-ai%2Fnomic-embed-multimodal-67e5ddc1a890a19ff0d58073"><strong>available on Hugging Face</strong></a>) built on Alibaba's excellent Qwen2.5-VL. They achieved <strong>SOTA on visual document retrieval</strong> by cleverly embedding interleaved text-image sequences – perfect for PDFs and complex webpages.</p><p>Zach highlighted that they chose the Qwen base because high-performing open VLMs under 3B params are still scarce, making it a solid foundation. Importantly, the 7B model comes with an <strong>Apache 2.0 license</strong>, and they've open sourced weights, code, and data. They offer both a powerful multi-vector version (ColNomic) and a faster single-vector one. Huge congrats to Nomic!</p><p>OpenHands LM 32B & Agent: Accessible SOTA Coding</p><p>Remember OpenDevin? It evolved into OpenHands, and the team just dropped their own <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.all-hands.dev%2Fblog%2Fintroducing-openhands-lm-32b----a-strong-open-coding-agent-model"><strong>OpenHands LM 32B</strong></a>! We chatted with co-founder Xingyao "Elle" Wang about this impressive Qwen 2.5 finetune (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fall-hands%2Fopenhands-lm-32b-v0.1"><strong>MIT licensed, on Hugging Face</strong></a>).</p><p>It hits a remarkable <strong>37.2% on SWE-Bench Verified </strong>(a coding benchmark measuring real-world repo tasks), competing with much larger models. Elle stressed they didn't just chase code completion scores; they focused on tuning for <em>agentic capabilities</em> – tool use, planning, self-correction – using trajectories from the contamination-free SWE-Gym dataset. 
This focus seems to be paying off, as the OpenHands <em>agent</em> also snagged the <strong>#2 spot on the brand new Live SWE-Bench</strong> leaderboard! Plus, the 32B model runs locally on a single 3090, making this power accessible. You can also try their managed <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fapp.all-hands.dev%2F"><strong>OpenHands Cloud</strong></a> ($50 free credit). Amazing progress from this team!</p><p>Frontiers: Diffusion LMs & Superhuman Math</p><p>Two other developments pushed the boundaries this week:</p><p>Dream 7B: A Diffusion Language Model Challenger?</p><p>This one's fascinating conceptually. Researchers unveiled <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhkunlp.github.io%2Fblog%2F2025%2Fdream%2F"><strong>Dream 7B</strong></a>, a language model based on <strong>diffusion</strong>, not auto-regression. The <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FJiachengYe15%2Fstatus%2F1907430553369883017"><strong>benchmarks they shared</strong></a> show it competing strongly with top 7-8B models, and absolutely crushing tasks like Sudoku (81% vs <50% for others), potentially due to its parallel processing nature being better for global constraints. It's an exciting hint at alternative architectures, but the <strong>model weights aren't out yet</strong>, so we can't verify or play with it. Still, one to watch!</p><p>Gemini 2.5 Obliterates Olympiad Math (24.4% on USAMO!)</p><p>We already knew Gemini 2.5 was good, but wow. New results dropped showing its performance on the <strong>USA Math Olympiad (USAMO)</strong> – problems so hard most top models score under 5%. <strong>Gemini 2.5 Pro scored an incredible 24.4%</strong>!</p><p>The gap between it and everything else is massive, highlighting the power of its reasoning and thinking capabilities (which you can inspect via its traces!). 
Having used it for complex tasks myself (like wrestling with tax forms!), I can attest to its depth. It's free in the Gemini app – go try it!</p><p>Agents, Compute & Making MCP Observable</p><p>Amazon's Nova Act Agent & The Need for Access</p><p>Amazon entered the agent chat with <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Flabs.amazon.science%2Fblog%2Fnova-act"><strong>Nova Act</strong></a>, designed for web browser actions. They claim it beats Claude 3.5 and OpenAI's CUA model on some benchmarks, possibly leveraging acquired Adept talent. But... it's only available via an SDK with a request form. As Yam rightly pointed out on the show, these agent claims mean little until we can actually <em>use</em> them in the real world!</p><p>CoreWeave + NVIDIA = Insane Speeds</p><p>Hardware keeps accelerating. CoreWeave announced hitting <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.prnewswire.com%2Fnews-releases%2Fcoreweave-achieves-new-record-breaking-ai-inferencing-benchmark-with-nvidia-gb200-grace-blackwell-superchips-302418682.html%3Fhss_channel%3Dtw-979803443681349632"><strong>800 Tokens/sec on Llama 3.1 405B</strong></a> using NVIDIA's new GB200 Blackwell chips, and 33,000 T/s on Llama 2 70B with H200s. Inference is getting <em>fast</em>.</p><p>This Week's Buzz: Let's Make MCP Observable!</p><p>Okay, my personal mission this week builds on the growing <strong>Model Context Protocol (MCP)</strong> momentum. MCP is potentially the "HTTP for agents," enabling tool interoperability. But as tool use moves external, we lose visibility, making debugging and security harder.</p><p>That's why I'm launching the <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fobservable.tools%2F"><strong>Observable Tools</strong></a> initiative. The goal: integrate observability <em>into</em> the MCP standard itself. 
Right now, that link redirects to a <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fmodel-context-protocol%2Fspecification%2Fdiscussions%2F18"><strong>GitHub discussion</strong></a> where I've proposed using the <strong>OpenTelemetry (OTel)</strong> standard to add tracing to MCP interactions. This would give developers clear visibility into tool usage, regardless of their observability platform.</p><p><strong>I need your help!</strong> Please check out the proposal, join the discussion, and <strong>show your support</strong> with a 👍 or 🚀 on GitHub. We need the community voice to make this happen! (And yes, my <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne%2Fstatus%2F1906180131540066796"><strong>viral tweet</strong></a> showed there's huge demand for usable MCP clients too – more on that soon!).</p><p>Vision & Video: Entering the Uncanny Valley</p><p>This space is moving at lightning speed.</p><p><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Frunwayml.com%2Fresearch%2Fintroducing-runway-gen-4"><strong>Runway Gen-4</strong></a> was announced, pushing for better consistency in AI video. Here are a few example videos showing incredible character and world consistency:</p><p>ByteDance's impressive <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fdreamina.capcut.com%2Fai-tool%2Fvideo%2Flip-sync%2Fgenerate"><strong>OmniHuman</strong></a> (single image to talking avatar) is now publicly usable via the Dreamina website. 
For people it's really good, but for animated-style images, <a target="_blank" href="https://www.hedra.com/">Hedra Labs</a> actually feels better (and much, much faster).</p><p><strong>Meta's </strong><a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fcongwei1230.github.io%2FMoCha%2F"><strong>MoCha</strong></a><strong> is mind-blowing.</strong> We had researcher Cong Wei explain how it generates <em>movie-grade</em>, full-body, expressive talking characters directly from speech and text (no reference image needed!). Using Diffusion Transformers and clever attention mechanisms, the realism is startling, handling lip-sync, gestures, emotions, and even multi-character dialogue. Check the project page videos – some are truly uncanny. Just look at this one!</p><p>Voice Highlight: Hailuo Speech-02</p><p>While Gladia launched their <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.gladia.io%2Fsolaria"><strong>Solaria STT</strong></a>, the standout for me was <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FHailuo_AI%2Fstatus%2F1906723587379101923"><strong>Hailuo's Speech-02 TTS API</strong></a>. The emotional control and voice cloning quality are, in my opinion, potentially SOTA right now, offering incredibly nuanced and realistic synthetic voices.</p><p>Tool Update & Breaking News!</p><p>* Google's <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Ftechnology%2Fgoogle-labs%2Fnotebooklm-discover-sources%2F"><strong>NotebookLM now discovers related sources</strong></a> automatically.</p><p>* <strong>BREAKING NEWS (Caught end of show): Devin 2.0 is out!</strong> Cognition Labs launched their AI software engineer V2 with a new IDE experience and, crucially, a <strong>$20/month</strong> starting price. Much more accessible to try!</p><p>Phew! What a week. 
From OpenAI's big moves to Gemini's math prowess, stunning AI actors from Meta, and the push for an observable agent ecosystem – the field is accelerating like crazy.</p><p>Alright folks, that’s a wrap for show #99! Thank you again for tuning in, for being part of the community, and for keeping us on our toes with your insights and feedback. Special thanks to our guests Zach Nussbaum (Nomic), Xingyao Wang (All Hands AI), and Cong Wei (Meta/MoCha) for joining us!</p><p>If you missed any part of the show, or want to grab any of the links, head over to <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fthursdai.news"><strong>ThursdAI.news</strong></a>. The full recording (video on YouTube, audio on Spotify, Apple Podcasts, etc.) and this blog post with all the notes will be up shortly.</p><p>The best way to support the show? Share it with a friend or colleague who needs to stay up-to-date on AI, and drop us a 5-star review on your podcast platform! Financial support via Substack is also appreciated but never required.</p><p>Get ready for <strong>Episode 100</strong> next week! 
Until then, happy tinkering, stay curious, and I'll see you next ThursdAI!</p><p>Bye bye everyone!</p><p>TL;DR and Show Notes</p><p><strong>Host, Guests, and Co-hosts</strong></p><p>* <strong>Host:</strong> Alex Volkov - AI Evangelist & Weights & Biases (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne"><strong>@altryne</strong></a>)</p><p>* <strong>Co-Hosts:</strong></p><p>* LDJ (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fldjconfirmed"><strong>@ldjconfirmed</strong></a>)</p><p>* Yam Peleg (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fyampeleg"><strong>@yampeleg</strong></a>)</p><p>* <strong>Guests:</strong></p><p>* Zach Nussbaum (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fzach_nussbaum"><strong>@zach_nussbaum</strong></a>) - Nomic AI</p><p>* Xingyao Wang (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fxingyaow_"><strong>@xingyaow_</strong></a>) - All Hands AI / OpenHands</p><p>* Cong Wei (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FCongWei1230"><strong>@CongWei1230</strong></a>) - Meta AI / MoCha</p><p><strong>Key Topics & Links</strong></p><p>* <strong>OpenAI's Big Week:</strong></p><p>* Teasing highly capable <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fkevinweil%2Fstatus%2F1906797119848988822"><strong>Open Source Reasoner Model</strong></a> (seeking <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Fform%2Fopen-model-feedback"><strong>feedback</strong></a>).</p><p>* Released <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Findex%2Fpaperbench%2F"><strong>PaperBench</strong></a> eval (<a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fopenai%2Fpreparedness%2Ftree%2Fmain%2Fproject%2Fpaperbench"><strong>code</strong></a>, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fcdn.openai.com%2Fpapers%2F22265bac-3191-44e5-b057-7aaacd8e90cd%2Fpaperbench.pdf"><strong>paper</strong></a>) & <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fopenai%2Fpreparedness"><strong>Nano-Eval framework</strong></a>.</p><p>* Raised <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fopenai.com%2Findex%2Finvesting-in-our-mission%2F"><strong>$40B at $300B valuation</strong></a>.</p><p>* New EMO "<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FOpenAI%2Fstatus%2F1907124258867982338"><strong>Monday" voice</strong></a> in ChatGPT.</p><p>* <strong>Open Source Powerhouses:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.nomic.ai%2Fblog%2Fposts%2Fnomic-embed-multimodal"><strong>Nomic Embed Multimodal</strong></a>: SOTA visual doc embeddings (3B & 7B, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fcollections%2Fnomic-ai%2Fnomic-embed-multimodal-67e5ddc1a890a19ff0d58073"><strong>Apache 2.0 for 7B</strong></a>).</p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.all-hands.dev%2Fblog%2Fintroducing-openhands-lm-32b----a-strong-open-coding-agent-model"><strong>OpenHands LM 32B</strong></a>: SOTA-level coding agent model (Qwen finetune, <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhuggingface.co%2Fall-hands%2Fopenhands-lm-32b-v0.1"><strong>MIT License</strong></a>, 37.2% SWE-Bench, #2 Live SWE-Bench). 
<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fapp.all-hands.dev%2F"><strong>Cloud version available</strong></a>.</p><p>* <strong>Frontier Models & Capabilities:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fhkunlp.github.io%2Fblog%2F2025%2Fdream%2F"><strong>Dream 7B</strong></a>: Promising <strong>diffusion LM</strong> shows strong benchmark results (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FJiachengYe15%2Fstatus%2F1907430553369883017"><strong>esp. Sudoku</strong></a>), but weights not yet released.</p><p>* <strong>Gemini 2.5</strong>: Crushes hard <strong>USAMO math eval (24.4%</strong> vs <5% for others), showcasing superior reasoning.</p><p>* <strong>Agents & Compute:</strong></p><p>* Amazon's <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Flabs.amazon.science%2Fblog%2Fnova-act"><strong>Nova Act</strong></a> agent announced, claims SOTA but lacks public access (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fnova.amazon.com"><strong>request form</strong></a>).</p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fwww.prnewswire.com%2Fnews-releases%2Fcoreweave-achieves-new-record-breaking-ai-inferencing-benchmark-with-nvidia-gb200-grace-blackwell-superchips-302418682.html%3Fhss_channel%3Dtw-979803443681349632"><strong>CoreWeave/NVIDIA</strong></a>: Massive inference speedups (800T/s on Llama 405B with GB200).</p><p>* <strong>This Week's Buzz - MCP:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fobservable.tools%2F"><strong>Observable Tools</strong></a> initiative launched to add observability to MCP.</p><p>* Proposal using OpenTelemetry posted for <a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fgithub.com%2Fmodel-context-protocol%2Fspecification%2Fdiscussions%2F18"><strong>community feedback on GitHub</strong></a> - please support!</p><p>* Huge demand shown for usable MCP clients (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne%2Fstatus%2F1906180131540066796"><strong>viral tweet</strong></a>).</p><p>* <strong>Vision & Video Highlights:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Frunwayml.com%2Fresearch%2Fintroducing-runway-gen-4"><strong>Runway Gen-4</strong></a> focuses on video consistency.</p><p>* ByteDance <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fdreamina.capcut.com%2Fai-tool%2Fvideo%2Flip-sync%2Fgenerate"><strong>OmniHuman</strong></a> (image-to-avatar) now publicly available via Dreamina (<a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Faltryne%2Fstatus%2F1907173680456794187"><strong>example thread</strong></a>).</p><p>* Meta's <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fcongwei1230.github.io%2FMoCha%2F"><strong>MoCHA</strong></a>: Generates stunningly realistic, movie-grade talking characters from speech+text.</p><p>* <strong>Voice Highlight:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2FHailuo_AI%2Fstatus%2F1906723587379101923"><strong>Hailuo Speech-02</strong></a>: Impressive TTS API with excellent emotional control and voice cloning.</p><p>* <strong>Tool Updates:</strong></p><p>* <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fx.com%2Fwindsurf_ai%2Fstatus%2F1907497638267924566"><strong>Windsurf adds deployments</strong></a> to Netlify.</p><p>* Google <a target="_blank" 
href="https://www.google.com/url?sa=E&#38;q=https%3A%2F%2Fblog.google%2Ftechnology%2Fgoogle-labs%2Fnotebooklm-discover-sources%2F"><strong>NotebookLM adds source discovery</strong></a>.</p><p>* <strong>Breaking News:</strong></p><p>* <strong>Devin 2.0</strong> AI Software Engineer announced, starts at $20/month.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-3rd-openai-goes-open</link><guid isPermaLink="false">substack:post:160523272</guid><dc:creator><![CDATA[Alex Volkov, Zach Nussbaum, and Xingyao Wang]]></dc:creator><pubDate>Thu, 03 Apr 2025 21:36:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/160523272/5e26ec818aa02bfb1e5faa1f8783ebc0.mp3" length="93650858" type="audio/mpeg"/><itunes:author>Alex Volkov, Zach Nussbaum, and Xingyao Wang</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5853</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/160523272/11a052f899b3a2335eaa3dcad93fafb2.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Mar 27 - Gemini 2.5 Takes #1, OpenAI Goes Ghibli, DeepSeek V3 Roars, Qwen Omni, Wandb MCP & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋 </p><p>Welcome back to ThursdAI! And folks, what an <em>absolutely insane</em> week it's been in the world of AI. 
Seriously, as I mentioned on the show, we don't often get weeks <em>this</em> packed with game-changing releases.</p><p>We saw Google emphatically reclaim the #1 LLM spot with Gemini 2.5 Pro (and OpenAI try really hard to hit back with a new ChatGPT), DeepSeek dropped a monster 685B parameter open-source model, Qwen launched a tiny but mighty 7B Omni model that handles voice and video like a champ, and OpenAI <em>finally</em> gave us native image generation in GPT-4o, immediately unleashing a tidal wave of Ghibli-fication across the internet. It was intense, with big players seemingly trying to one-up each other constantly – remember when Sam Altman dropped Advanced Voice Mode right when Google was about to show Astra? This week was that, on steroids. </p><p>We had a fantastic show trying to unpack it all, joined by the brilliant Tulsee Doshi from the Google Gemini team, my Weights & Biases colleague Morgan McQuire talking MCP tools, and the MLX King himself, Prince Canuma. Plus, my awesome co-hosts Wolfram, Nisten, and Yam were there to add their insights. (watch the LIVE recap or keep reading and listen to the audio pod) </p><p>So, grab your beverage of choice, buckle up, and let's try to make sense of this AI whirlwind! (TL;DR and show notes at the bottom 👇)</p><p>Big CO LLMs + APIs</p><p>🔥 Google Reclaims #1 with Gemini 2.5 Pro (Thinking!)</p><p>Okay, let's start with the big news. Google came out swinging this week, dropping Gemini 2.5 Pro and, based on the benchmarks and our initial impressions, taking back the crown for the best all-around LLM currently available. 
(Check out the X announcement, the <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">official blog post</a>, and seriously, go <a target="_blank" href="http://ai.dev">try it yourself at ai.dev</a>).</p><p>We were super lucky to have Tulsee Doshi, who leads the product team for Gemini modeling efforts at Google, join us on the show to give us the inside scoop. Gemini 2.5 Pro Experimental isn't just an incremental update; it's topping benchmarks in complex reasoning, science, math, and coding. As Tulsee explained, this isn't just about tweaking one thing – it's a combination of a significantly enhanced base model <em>and</em> improved post-training techniques, including integrating those "thinking" capabilities (like chain-of-thought) right into the core models.</p><p>That's why they dropped "thinking" from the official name – it's not a separate mode anymore, it's becoming fundamental to how Gemini operates. Tulsee mentioned their goal is for the main line models <em>to be</em> thinking models, leveraging inference time when needed to get the best answer. This is a huge step towards more capable and reliable AI.</p><p>The performance gains are staggering across the board. We saw massive jumps on benchmarks like AIME (up nearly 20 points!) and GPQA. But it's not just about the numbers. As Tulsee highlighted, Gemini 2.5 is proving to be incredibly well-rounded, excelling not only on academic benchmarks but also on human preference evaluations like LM Arena (where style control is key). The "vibes" are great, as Wolfram put it. 
My own testing on reasoning tasks confirms this – the latency is surprisingly low for such a powerful model (around 13 seconds on my hard reasoning questions compared to 45+ for others), and the accuracy is the highest I've seen yet at 66% on that specific challenging set.</p><p>It also inherits the strengths of previous Gemini models – native multimodality and that massive long context window (up to 1M tokens!). Tulsee emphasized how crucial long context is, allowing the model to reason over entire code repos, large sets of financial documents, or research papers. The performance on long context tasks, like the needle-in-a-haystack test shown on Live Bench, is truly impressive, maintaining high accuracy even at 120k+ tokens where other models often falter significantly.</p><p>Nisten mentioned on the show that while it's better than GPT-4o, it might not completely replace Sonnet 3.5 for him yet, especially for certain coding or medical tasks under 128k context. Still, the consensus is clear: Gemini 2.5 Pro is the absolute best model right now across categories. Go play with it!</p><p>ARC-AGI 2 Benchmark Revealed (<a target="_blank" href="https://x.com/arcprize/status/1905274808935608528">X</a>, <a target="_blank" href="https://x.com/arcprize/status/1905274808935608528">Interactive Blog</a>)</p><p>Also on the benchmark front, the challenging ARC-AGI 2 benchmark was revealed. This is designed to test tasks that are easy for humans but hard for LLMs. The initial results are sobering: base LLMs score 0% accuracy, and even current "thinking" models only reach about 4%. It highlights just how far we still have to go in developing truly robust AI reasoning, giving us another hill to climb.</p><p>GPT-4o got another update (as I'm writing these words!) tied for #1 on LMArena, beating 4.5</p><p>How much does Sam want to win over Google? So much he's letting it ALL out. 
Just now, we saw an update from LMArena and Sam about a NEW GPT-4o (2025-03-26) which jumps OVER GPT-4.5 (like... what?) and lands at number 2 on the LM Arena, jumping over 30 points.</p><p>Tied #1 in Coding, Hard Prompts. Top-2 across ALL categories. </p><p>Besides getting very close to Gemini but not quite beating it, I gotta ask: what's the point of 4.5 then? </p><p>Open Source LLMs</p><p>The open-source community wasn't sleeping this week either, with some major drops!</p><p>DeepSeek V3 Update - 685B Parameter Beast!</p><p>The Whale Bros at DeepSeek silently dropped an update to their V3 model (<a target="_blank" href="https://x.com/deepseek_ai/status/1904526863604883661">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3-0324">HF</a>), and it's a monster. We're talking 685 billion parameters in a Mixture-of-Experts (MoE) architecture. This isn't R1 (their reasoning model), but the powerful base model that R1 was built upon (and presumably R2, when it comes out).</p><p>The benchmark jumps from the previous version are huge, especially in reasoning:</p><p>* MMLU-Pro: 75.9 → 81.2 (+5.3)</p><p>* GPQA: 59.1 → 68.4 (+9.3)</p><p>* <strong>AIME: 39.6 → 59.4 (+19.8)</strong> (Almost 20 points on competitive math!)</p><p>* LiveCodeBench: 39.2 → 49.2 (+10.0)</p><p>They're highlighting major boosts in reasoning, stronger front-end dev skills, and smarter tool use. Nisten noted it even gets some hard reasoning questions right that challenge other models. The "vibes" are reportedly great. Wolfram tried to run it locally but found even the 1-bit quantized version too large for his system (though it should <em>theoretically</em> fit in combined RAM/VRAM), so he's using it via API. It's likely the best <em>non-reasoning</em> open model right now, potentially the best overall if you can run it.</p><p>And huge news for the community – they've released these weights under the MIT License, just like R1! 
Massive respect to DeepSeek for continuing to push powerful models into the open.</p><p>They also highlight significantly better front-end development and website aesthetics. </p><p>Qwen Launches Omni 7B Model - Voice & Video Chat!</p><p>Our friends at Qwen (Alibaba) also came through with something super cool: Qwen2.5-Omni-7B (<a target="_blank" href="https://huggingface.co/Qwen/Qwen2.5-Omni-7B">HF</a>). This is an end-to-end multimodal model that can perceive text, images, audio, AND video, while generating both text and natural-sounding speech, potentially in real-time.</p><p>They're using a "Thinker-Talker" architecture. What blew my mind is the size – it's listed as 7B parameters, though I saw a meme suggesting it might be closer to 11B internally (ah, the joys of open source!). Still, even at 11B, having this level of multimodal understanding <em>and</em> generation in a relatively small open model is fantastic. It understands voice and video natively and outputs text and voice. Now, when I hear "Omni," I start expecting image generation too (thanks, Google!), so maybe that's next for Qwen? 😉</p><p>AI Art & Diffusion & Auto-regression</p><p>This was arguably where the biggest "mainstream" buzz happened this week, thanks mainly to OpenAI.</p><p>OpenAI Launches Native Image Support in GPT-4o - Ghibli Everywhere!</p><p>This felt like a direct response to Gemini 2.5's launch, almost like OpenAI saying, "Oh yeah? Watch this!" They <em>finally</em> enabled the native image generation capabilities within GPT-4o (Blog, Examples). Remember that image Greg Brockman tweeted <em>a year ago</em> of someone writing on a blackboard with an old OpenAI logo, hinting at this? Well, it's here.</p><p>And the results? Honestly, they're stunning. The <strong>prompt adherence</strong> is incredible. It actually listens to what you ask for in detail, including text generation within images, which diffusion models notoriously struggle with. 
The realism can be jaw-dropping, but it can also generate various styles.</p><p>Speaking of styles... the internet immediately lost its collective mind and turned everything into the style of Studio Ghibli (<a target="_blank" href="https://x.com/GrantSlatton/status/1904631016356274286">great X thread here</a>). My entire feed became Ghibli-fied. It's a testament to how accessible and fun this feature is. Wolfram even suggested we should have Ghibli avatars for the show!</p><p>Interestingly, this image generation is apparently <strong>auto-regressive</strong>, not based on diffusion models like Midjourney or Stable Diffusion. This is more similar to how models like Grok's Aurora work, generating the image sequentially (top-to-bottom, kinda like how old GIFs used to load, as Yam pointed out and we confirmed). This likely contributes to the amazing prompt adherence, especially with text.</p><p>The creative potential is huge – people are generating incredible ad concepts (<a target="_blank" href="https://x.com/mrgreen/status/1904886576951300495">like this thread</a>) and even recreating entire movie trailers, like this unbelievable Lord of the Rings one (<a target="_blank" href="https://x.com/PJaccetturo/status/1905151190872309907">link</a>), purely through prompts in GPT-4o. It's wild.</p><p>Now, this launch wasn't just about cool features; it also marked a significant shift in OpenAI's <em>policy</em> around image generation, aiming for what CEO Sam Altman called "a new high-water mark for us in allowing creative freedom." Joanne Jang, who leads model behavior at OpenAI, shared some fascinating insights into their thinking (<a target="_blank" href="https://reservoirsamples.substack.com/p/thoughts-on-setting-policy-for-new"><strong>Reservoir Samples post</strong></a>). </p><p>She explained they're moving away from broad, blanket refusals (which often felt safest but limited creativity) towards a more precise approach focused on preventing <em>real-world harm</em>. 
This means trusting user creativity more, not letting hypothetical worst-case scenarios overshadow everyday usefulness (like generating memes!), and valuing the "unknown, unimaginable possibilities" that overly strict rules might stifle. It's a nuanced approach acknowledging that, as Joanne quoted, "Ships are safest in the harbor... But that’s not what ships or models are for." A philosophy change I definitely appreciate.</p><p>Reve - New SOTA Diffusion Contender?</p><p>While OpenAI grabbed headlines, another player emerged claiming State-of-the-Art results, this time in the diffusion space. Reve Image 1.0 (<a target="_blank" href="https://x.com/Taesung/status/1904220824435032528">X</a>, <a target="_blank" href="https://decrypt.co/311375/new-reve-image-generator-beats-ai-art-heavyweights-midjourney-and-flux-at-a-penny-per-image">Blog/News</a>, <a target="_blank" href="https://preview.reve.art/api/misc/sso_callback?code=4%2F0AQSTgQGhy3PinnDf0OVcAVGo3OwJMOe7uK2IRiA1b6mo6eWsKxOWCKibKLwCuhRP0O6KtQ&#38;state=eyJob3N0IjoicHJldmlldy5yZXZlLmFydCIsImZsYXZvciI6InNpZ251cCIsInJrZXkiOiJzc28tYUNEd3RtZGxGRldQMm9kY2Y3dkQiLCJzaWciOiJFRmlhLzRtYUN1U0N4REd6VHF5R1BvRXVlVUNjZ0gxOUdVOG5JTDVOblFnPSJ9&#38;error=&#38;error_description=&#38;id_token=">Try it</a>) apparently beats Midjourney and Flux in benchmarks, particularly in prompt adherence, realism, and even text generation (though likely not as consistently as GPT-4o's native approach).</p><p>It works on a credit system ($5 for 500 generations, ~1 cent per image) which is quite affordable. The editing seems a bit different, relying on chatting with the model rather than complex tools. It was kind of hidden/anonymous before, but now they've revealed themselves. 
Honestly, this would probably be <em>huge</em> news if OpenAI hadn't dropped their image bomb the same week.</p><p>Ideogram 3 Also Launched - Another SOTA Claim!</p><p>And just to make the AI art space even <em>more</em> crowded this week, Ideogram also launched version 3.0 (<a target="_blank" href="https://about.ideogram.ai/3.0">Blog</a>, <a target="_blank" href="https://about.ideogram.ai/3.0">Try it</a>), also claiming state-of-the-art performance!</p><p>Ideogram has always been strong with text rendering and logos. Version 3.0 boasts stunning realism, creative design capabilities, and a new "Style References" feature where you can upload images to guide the aesthetic. They claim it consistently outperforms others in human evaluations. It's wild – we had at least <em>three</em> major image generation models/updates drop this week, all claiming top performance, and none of them seemed to benchmark directly against each other in their launch materials! It's hard to keep track!</p><p>This Week's Buzz + MCP (<a target="_blank" href="https://x.com/morgymcg/status/1904997037688385607">X</a>, <a target="_blank" href="https://github.com/wandb/MCP-server">Github</a>!)</p><p>Bringing it back to Weights & Biases for a moment. We had Morgan McGuire on the show, who heads up our AI Applied team, to talk about something we're really excited about internally – integrating MCP with Weave, our LLM observability and evaluation tool. Morgan showed a demo, and the team has shipped the MCP server, which you can try right now!</p><p>Coming soon is the integration with wandb models, which will allow ML folks around the world to build agents that monitor loss curves for them! </p><p>Weights & Biases Weave Official MCP Server Tool - Talk to Your Evals!</p><p>We've launched an official MCP server tool for Weave! What does this mean? If you're using Weave to track your experiments, evaluations, prompts, etc. (and you should be!), you can now literally <em>chat</em> with that data. 
As Morgan demonstrated, you can ask questions like "Tell me about my last three evaluations," and the MCP tool, connected to your Weave data, will not only fetch and summarize that information for you directly in the chat interface (like Claude Code or others that support MCP) but will also generate a <a target="_blank" href="https://wandb.ai/wandb-applied-ai-team/mcp-tests/reports/Model-Evaluation-Analysis--VmlldzoxMjAxMDQ1NA?accessToken=o11lv1bo38pz2xay3x0dwwlb04lovkvcd4f9getbfboe2i7yl00htggxzaqapvcd">report</a> and add visualizations! </p><p>This is just the beginning of how we see MCP enhancing observability and interaction with ML workflows. Being able to query and analyze your runs and evaluations using natural language is incredibly powerful.</p><p>Agents, Tools & MCP</p><p>And speaking of MCP...</p><p>OpenAI Adds Support for MCP - MCP WON!</p><p>This was HUGE news, maybe slightly overshadowed by the image generation, but potentially far more impactful long-term, as Wolfram pointed out right at the start of the show. OpenAI officially announced support for the Model Context Protocol (MCP) (<a target="_blank" href="https://openai.github.io/openai-agents-python/mcp/">docs here</a>).</p><p>Why is this massive? Because Anthropic initiated MCP, and there was a real fear that OpenAI, being OpenAI, might just create its own competing standard for agent/tool communication, leading to fragmentation (think VHS vs. Betamax, or Blu-ray vs. HD DVD – standards wars suck!). Instead, OpenAI embraced the existing standard. As I said on the show, <strong>MCP WON!</strong></p><p>This is crucial for the ecosystem. It means developers can build tools and agents using the MCP standard, and they should (hopefully) work seamlessly across different models like Claude <em>and</em> GPT. OpenAI not only added support but released it in their Agents SDK and explicitly stated support is "coming soon" for the ChatGPT desktop app and response APIs. 
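</p><p>To make "the standard" concrete: under the hood, MCP messages are JSON-RPC 2.0, so a client asking a server to invoke a tool sends a request shaped roughly like the sketch below. This is a hedged illustration of the wire format (the "query_weave_evals" tool name is made up for the example), not code from any particular SDK:</p>

```python
import json

# Minimal MCP-style tool invocation (MCP is built on JSON-RPC 2.0).
# "query_weave_evals" is a hypothetical tool name for illustration;
# the "tools/call" method and params shape follow the MCP spec.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_weave_evals",
        "arguments": {"question": "Summarize my last three evaluations"},
    },
}

# Serialize for the transport (stdio or HTTP), then round-trip it
# the way a receiving server would.
wire = json.dumps(request)
decoded = json.loads(wire)
print(decoded["method"])
```

<p>Because every client and server agrees on this envelope, the same tool server can, in principle, back Claude, GPT, or Qwen without changes. 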
Yam expertly clarified the distinction: tools are often single API calls, while MCPs are servers that can maintain state, allowing for more complex, guided interactions. Qwen also adding MCP support to their UI just reinforces this – the standard is gaining traction fast. This standardization is absolutely essential for building a robust agentic future.</p><p>Voice & Audio</p><p>Just one more quick update on the audio front:</p><p>OpenAI Updates Advanced Voice Mode with Semantic VAD</p><p>Alongside the image generation, OpenAI also quietly updated the advanced voice mode in ChatGPT (<a target="_blank" href="https://www.youtube.com/watch?v=mm4djPNO8os">YT announcement</a>). The key improvement is "Semantic VAD" (Voice Activity Detection). Instead of just cutting off when you pause (leading to annoying interruptions, especially for kids or people who think while speaking), it now tries to understand the <em>meaning</em> and <em>tone</em> to determine if you're actually finished talking.</p><p>This should lead to a much more natural conversation flow. They also claim better personality, a more engaging natural tone (direct and concise), and less need for you to fill silence with "umms" just to keep the mic open. It might feel a tad slower because it waits a bit longer, but the improvement in interaction quality should be significant.</p><p>MLX-Audio</p><p>And speaking (heh) of audio and speech, we had the awesome <a target="_blank" href="https://x.com/Prince_Canuma/status/1903221389504430273">Prince Canuma</a> on the show! If you're into running models locally on Apple Silicon (Macs!), you probably know Prince. He's the MLX King, the creator and maintainer of essential libraries like MLX-VLM (for vision models), FastMLX, MLX Embeddings, and now, MLX-Audio. Seriously, huge props to Prince and the folks in the MLX community for making these powerful open-source models accessible on Mac hardware. 
It's an incredible contribution.</p><p>This week, Prince released MLX-Audio v0.0.3. This isn't just text-to-speech (TTS); it aims to be a comprehensive audio package for MLX. Right now, it supports some of the best open TTS models out there:</p><p>* <strong>Kokoro:</strong> The tiny, high-quality TTS model we've covered before.</p><p>* <strong>Sesame 1B:</strong> Another strong contender.</p><p>* <strong>Orpheus:</strong> From Canopy Labs (as Prince confirmed).</p><p>* <strong>Suno Bark:</strong> Known for generating music and sound effects too.</p><p>MLX-Audio makes running state-of-the-art speech synthesis locally on your Mac incredibly easy, basically offering a Hugging Face transformers pipeline equivalent but optimized for Apple Silicon. If you have a Mac, pip install mlx-audio and give it a spin! Prince also took a feature request on the show to allow text file input directly, so look out for that!</p><p>Phew! See what I mean? An absolutely jam-packed week. </p><p>Huge thanks again to Tulsee, Morgan, and Prince for joining us and sharing their insights, and to Wolfram, Nisten, and Yam for co-piloting through the news storm. And thank YOU all for tuning in! We'll be back next week, undoubtedly trying to catch our breath and make sense of whatever AI marvels (or madness) gets unleashed next. 
</p><p>P.S - if the ghiblification trend didn’t get to your families as well, the alpha right now is… showing your kids how to be a magician and turn them into Ghibli characters; here are my kiddos (who asked to be pirates and princesses) and me </p><p>TL;DR and Show Notes:</p><p>* <strong>Guests and Cohosts</strong></p><p>* Alex Volkov - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/altryne">@altryne</a>) Co-hosts - Wolfram Ravenwlf (<a target="_blank" href="https://twitter.com/WolframRvnwlf">@WolframRvnwlf</a>), Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten/">@nisten</a>), Yam Peleg (<a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a>)</p><p>* Tulsee Doshi - Head of Product, Gemini Models at Google DeepMind (<a target="_blank" href="https://x.com/tulseedoshi">@tulseedoshi</a>)</p><p>* Morgan McGuire - Head of AI Applied Team at Weights & Biases (<a target="_blank" href="https://x.com/morgymcg">@morgymcg</a>)</p><p>* Prince Canuma - ML Research Engineer, Creator of MLX Libraries (<a target="_blank" href="https://x.com/PrinceCanuma">@PrinceCanuma</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🔥 Google reclaims #1 position with Gemini 2.5 Pro (thinking) - (<a target="_blank" href="https://x.com/JeffDean/status/1904580112248693039">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Blog</a>, <a target="_blank" href="http://ai.dev">Try it</a>)</p><p>* ARC-AGI 2 benchmark revealed - Base LLMs score 0%, thinking models 4%.</p><p>* <strong>Open Source LLMs</strong></p><p>* Deepseek updates DeepSeek-V3-0324 685B params (<a target="_blank" href="https://x.com/deepseek_ai/status/1904526863604883661">X</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3-0324">HF</a>) - MIT License!</p><p>* Qwen launches an Omni 7B model - perceives text, image, audio, video & generates text and speech (<a target="_blank" 
href="https://huggingface.co/Qwen/Qwen2.5-Omni-7B">HF</a>)</p><p>* <strong>AI Art & Diffusion & Auto-regression</strong></p><p>* OpenAI launches native image support in GPT-4o (Model Card, <a target="_blank" href="https://x.com/GrantSlatton/status/1904631016356274286">X thread</a>, <a target="_blank" href="https://x.com/mrgreen/status/1904886576951300495">Ad threads</a>, <a target="_blank" href="https://x.com/PJaccetturo/status/1905151190872309907">Full Lord of the Rings trailer</a>, <a target="_blank" href="https://cdn.openai.com/11998be9-5319-4302-bfbf-1167e093f1fb/Native_Image_Generation_System_Card.pdf">Model Card</a>)</p><p>* Reve - new SOTA diffusion image gen claims (<a target="_blank" href="https://x.com/Taesung/status/1904220824435032528">X</a>, <a target="_blank" href="https://decrypt.co/311375/new-reve-image-generator-beats-ai-art-heavyweights-midjourney-and-flux-at-a-penny-per-image">Blog/News</a>, <a target="_blank" href="https://preview.reve.art/api/misc/sso_callback?code=4%2F0AQSTgQGhy3PinnDf0OVcAVGo3OwJMOe7uK2IRiA1b6mo6eWsKxOWCKibKLwCuhRP0O6KtQ&#38;state=eyJob3N0IjoicHJldmlldy5yZXZlLmFydCIsImZsYXZvciI6InNpZ251cCIsInJrZXkiOiJzc28tYUNEd3RtZGxGRldQMm9kY2Y3dkQiLCJzaWciOiJFRmlhLzRtYUN1U0N4REd6VHF5R1BvRXVlVUNjZ0gxOUdVOG5JTDVOblFnPSJ9&#38;error=&#38;error_description=&#38;id_token=">Try</a>)</p><p>* Ideogram 3 launched - another SOTA claim, strong on text/logos, realism, style refs (<a target="_blank" href="https://about.ideogram.ai/3.0">Blog</a>, <a target="_blank" href="https://about.ideogram.ai/3.0">Try it</a>)</p><p>* <strong>This week's Buzz + MCP</strong></p><p>* Weights & Biases Weave official MCP server tool - talk to your evals! (<a target="_blank" href="https://x.com/morgymcg/status/1904997037688385607">X</a>, <a target="_blank" href="https://github.com/wandb/MCP-server">Github</a>)</p><p>* <strong>Agents, Tools & MCP</strong></p><p>* OpenAI has added support for MCP - MCP WON! 
(<a target="_blank" href="https://openai.github.io/openai-agents-python/mcp/">Docs</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI updates advanced voice mode with semantic VAD for more natural conversations (<a target="_blank" href="https://www.youtube.com/watch?v=mm4djPNO8os">YT announcement</a>).</p><p>* MLX-Audio v0.0.3 released by Prince Canuma (<a target="_blank" href="https://github.com/Blaizzy/mlx-audio">Github</a>)</p><p>* <strong>Show Notes and other Links</strong></p><p>* Catch the show live & subscribe to the newsletter/YouTube: <a target="_blank" href="https://thursdai.news/yt">thursdai.news/yt</a></p><p>* Try Gemini 2.5 Pro: <a target="_blank" href="http://ai.dev">AI.dev</a></p><p>* Learn more about MCP from our previous episode (<a target="_blank" href="https://thursdai.news/mar-6">March 6th</a>).</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-27-gemini-25-takes-1</link><guid isPermaLink="false">substack:post:160031924</guid><dc:creator><![CDATA[Alex Volkov and Morgan McGuire]]></dc:creator><pubDate>Thu, 27 Mar 2025 23:37:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/160031924/4987e927228545dc09b150fbb40f600a.mp3" length="60485167" type="audio/mpeg"/><itunes:author>Alex Volkov and Morgan McGuire</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5040</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/160031924/4d9ed95d3ab51bfb2c8b184eedfc4d90.jpg"/></item><item><title><![CDATA[ThursdAI - Mar 20 - OpenAIs new voices, Mistral Small, NVIDIA GTC recap & Nemotron, new SOTA vision from Roboflow & more AI news]]></title><description><![CDATA[<p>Hey, it's Alex, coming to you fresh off another live recording of 
ThursdAI, and what an incredible one it's been! </p><p>I was hoping that this week would be chill with the releases because of NVIDIA's GTC conference, but no, the AI world doesn't stop, and if you blinked this week, you may have missed 2 or 10 major things that happened. </p><p>From Mistral coming back to OSS with the amazing Mistral Small 3.1 (beating Gemma from last week!) to OpenAI dropping a new voice generation model and two (!) new Whisper-killer ASR models as breaking news during our live show (there's a reason we're called ThursdAI), which we watched together and then dissected with Kwindla, our amazing AI voice and real-time expert. </p><p>Not to mention that we also had dedicated breaking news from friend of the pod Joseph Nelson, who came on the show to announce a SOTA vision model from Roboflow + a new benchmark on which even the top VL models get around 6%! There's also a bunch of other OSS, a SOTA 3D model from Tencent, and more! </p><p>And last but not least, Yam is back 🎉 So... buckle up and let's dive in. As always, TL;DR and show notes are at the end, and here's the YT live version. (While you're there, please hit subscribe and help me hit that 1K subs on YT 🙏 )</p><p>Voice & Audio: OpenAI's Voice Revolution and the Open Source Echo</p><p>Hold the phone, everyone, because this week belonged to <strong>Voice & Audio</strong>! 
Seriously, if you weren't paying attention to the voice space, you missed a seismic shift, courtesy of <strong>OpenAI</strong> and some serious open-source contenders.</p><p>OpenAI's New Voice Models - Whisper Gets an Upgrade, TTS Gets Emotional!</p><p>OpenAI dropped a suite of next-gen audio models: <strong>gpt-4o-mini-tts-latest</strong> (text-to-speech) and <strong>GPT-4o Transcribe</strong> and <strong>GPT-4o Mini Transcribe</strong> (speech-to-text), all built upon their powerful transformer architecture.</p><p>To unpack this voice revolution, we welcomed back <strong>Kwindla Cramer</strong> from Daily, the voice AI whisperer himself. The headline news? The new <strong>speech-to-text models</strong> are not just incremental improvements; they’re a whole new ballgame. As OpenAI’s Shenyi explained, "Our new generation model is based on our large speech model. This means this new model has been trained on trillions of audio tokens." They're faster, cheaper (Mini Transcribe is <em>half the price</em> of Whisper!), and boast state-of-the-art accuracy across multiple languages. But the real kicker? They're <strong>promptable!</strong></p><p>"This basically opens up a whole field of prompt engineering for these models, which is crazy," I exclaimed, my mind officially blown. Imagine prompting your transcription model with context – telling it you're discussing dog breeds, and suddenly, its accuracy for breed names skyrockets. That's the power of promptable ASR! I recorded a live reaction after dropping off stream, and I was really impressed with how I can get the models to pronounce ThursdAI by just... asking! </p><p>But the voice magic doesn't stop there. <strong>GPT-4o Mini TTS</strong>, the new text-to-speech model, can now be prompted for… <strong>emotions!</strong> "You can prompt it to be emotional. You can ask it to do some stuff. You can prompt the character of a voice." OpenAI even demoed a "Mad Scientist" voice! Captain Ryland voice, anyone? 
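</p><p>As a hedged sketch of what that steerability looks like in practice: the field names below mirror OpenAI's speech endpoint (the "instructions" field is the new emotional-steering knob), but the voice name and prompt text are made-up examples, not verified output:</p>

```python
# Request parameters for an "emotional" TTS generation.
# The model/voice/instructions field names follow OpenAI's audio speech
# endpoint docs; treat the exact values here as illustrative assumptions.
request = {
    "model": "gpt-4o-mini-tts",
    "voice": "coral",
    "input": "Welcome back to ThursdAI, the show that keeps you up to date!",
    "instructions": "Speak like an excitable mad scientist: fast, gleeful, theatrical.",
}

# With the real SDK (and an API key) this maps to roughly:
#   client.audio.speech.create(**request)  -> audio bytes
print("instructions" in request)
```

<p>On the real endpoint, swapping that instructions string is all it takes to go from newscaster to mad scientist. 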
This is a huge leap forward in TTS, making AI voices sound… well, more human.</p><p>But wait, there’s more! <strong>Semantic VAD!</strong> Semantic Voice Activity Detection, as OpenAI explained, "chunks the audio up based on when the model thinks the user's actually finished speaking." It’s about understanding the <em>meaning</em> of speech, not just detecting silence. Kwindla hailed it as "a big step forward," finally addressing the age-old problem of AI agents interrupting you mid-thought. No more robotic impatience!</p><p>OpenAI also threw in noise reduction and conversation item retrieval, making these new voice models production-ready powerhouses. This isn't just an update; it's a voice AI revolution, folks.</p><p>They also built a super nice website to test out the new models at <a target="_blank" href="http://openai.fm">openai.fm</a>! </p><p>Canopy Labs' Orpheus 3B - Open Source Voice Steps Up</p><p>But hold on, the open-source voice community isn't about to be outshone! <strong>Canopy Labs</strong> dropped <strong>Orpheus 3B</strong>, a "natural sounding speech language model" with open-source spirit. </p><p>Orpheus, available in multiple sizes (3B, 1B, 500M, 150M), boasts zero-shot voice cloning and a glorious Apache 2 license. Wolfram noted its current lack of multilingual support but remained enthusiastic. I played with it a bit and it does sound quite awesome, but I wasn't able to fine-tune it on my own voice due to a "CUDA OUT OF MEMORY" error, alas.</p><p>I did a live reaction recording for this model on <a target="_blank" href="https://x.com/altryne/status/1902470120313917732/video/1">X</a></p><p>NVIDIA Canary - Open Source Speech Recognition Enters the Race</p><p>Speaking of open source, <strong>NVIDIA</strong> surprised us with <strong>Canary</strong>, a speech recognition and translation model. 
"NVIDIA open sourced Canary, which is a 1 billion parameter and 180 million parameter speech recognition and translation model, so basically a Whisper competitor," I summarized. Canary is tiny, fast, and CC-BY licensed, allowing commercial use. It even snagged second place on the Hugging Face speech recognition leaderboard! Open source ASR just got a whole lot more interesting.</p><p>Of course, this won't get to the level of the new SOTA ASR OpenAI just dropped, but it can run locally and allows commercial use on edge devices! </p><p>Vision & Video: Roboflow's Visionary Model and Video Generation Gets Moving</p><p>After the voice-apalooza, let's switch gears to the visual world, where <strong>Vision & Video</strong> delivered some knockout blows, spearheaded by <strong>Roboflow</strong> and <strong>StepFun</strong>.</p><p>Roboflow's RF-DETR and RF100-VL - A New Vision SOTA Emerges</p><p>Roboflow stole the vision spotlight this week with their <strong>RF-DETR</strong> model and the groundbreaking <strong>RF100-VL</strong> benchmark. We were lucky enough to have <strong>Joseph Nelson</strong>, Roboflow CEO, join the show again and give us the breaking news (they published the GitHub repo 11 minutes before he came on!) </p><p>RF-DETR is Roboflow's first in-house model, a real-time object detection transformer that's rewriting the rulebook. "We've actually never released a model that we've developed. And so this is the first time where we've taken a lot of those learnings and put that into a model," Joseph revealed.</p><p>And what a model it is! RF-DETR is not just fast; it's SOTA on real-world datasets and surpasses the 60 mAP barrier on COCO. But Joseph dropped a truth bomb: COCO is outdated. 
"The benchmark that everyone uses is the COCO benchmark… it hasn't been updated since 2017, but models have continued to get really, really, really good. And so they've saturated the COCO benchmark," he explained.</p><p>Enter <strong>RF100-VL</strong>, Roboflow's revolutionary new benchmark, designed to evaluate vision-language models on <em>real-world</em> data. "We introduced a benchmark that we call RF100 Vision Language," Joseph announced. The results? Shockingly low zero-shot performance on real-world vision tasks, highlighting a major gap in current models. Joseph's quiz question about QwenVL 2.5's zero-shot performance on RF100-VL revealed a dismal 5.8% accuracy. "So we as a field have a long, long way to go before we have zero-shot performance on real-world context," Joseph concluded. RF100-VL is the new frontier for vision, and RF-DETR is leading the charge! Plus, it runs on edge devices and is Apache 2 licensed! Roboflow, you legends! Check out the <a target="_blank" href="https://blog.roboflow.com/rf-detr/#how-to-use-rf-detr">RF-DETR Blog Post</a>, the <a target="_blank" href="https://github.com/roboflow/rf-detr">RF-DETR Github</a>, and the <a target="_blank" href="https://rf100-vl.org/">RF100-VL Benchmark</a> for more details!</p><p>StepFun's Image-to-Video TI2V - Animating Images with Style</p><p>Stepping into the video arena, <strong>StepFun</strong> released their <strong>image2video model, TI2V</strong>. TI2V boasts impressive motion controls and generates high-quality videos from images and text prompts, especially excelling in anime-style video generation. 
Dive into the <a target="_blank" href="https://huggingface.co/stepfun-ai/stepvideo-ti2v">TI2V HuggingFace Space</a> and <a target="_blank" href="https://github.com/stepfun-ai/Step-Video-TI2V">TI2V Github</a> to explore further.</p><p>Open Source LLMs: Mistral's Triumphant Return, LG's Fridge LLM, NVIDIA's Nemotron, and ByteDance's RL Boost</p><p>Let's circle back to our beloved <strong>Open Source LLMs</strong>, where this week was nothing short of a gold rush!</p><p>Mistral is BACK, Baby! - Mistral Small 3.1 24B (Again!)</p><p>Seriously, <strong>Mistral AI</strong>'s return to open source with <strong>Mistral Small 3.1</strong> deserves another shoutout! "Mistral is back with open source. Let's go!" I cheered, and I meant it. This multimodal, Apache 2 licensed model is a powerhouse, outperforming Gemma 3 and ready for action on a single GPU. Wolfram, ever the pragmatist, noted, "We are in a moment right now where, a week later, you already have some new toys to play with," referring to Gemma 3, which we covered just last week! </p><p>Not only did we get a great new update from Mistral, they also cited our friends at Nous Research and their Deep Hermes (released just last week!) as the reason for releasing base models alongside the fine-tuned ones! </p><p>Mistral Small 3.1 is not just a model; it's a statement: open source is thriving, and Mistral is leading the charge! Check out their <a target="_blank" href="https://mistral.ai/news/mistral-small-3-1">Blog Post</a>, the <a target="_blank" href="https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503">HuggingFace page</a>, and the <a target="_blank" href="https://huggingface.co/mistralai/Mistral-Small-3.1-24B">Base Model on HF</a>. </p><p>NVIDIA Nemotron - Distilling, Pruning, Making Llamas Better</p><p><strong>NVIDIA</strong> finally dropped <strong>Llama Nemotron</strong>, and it was worth the wait! </p><p>Nemotron Nano (8B) and Super (49B) are here, with Ultra (253B) on the horizon. 
These models are distilled, pruned, and, crucially, designed for <strong>reasoning</strong>, with a hybrid architecture allowing you to enable and disable reasoning via a simple on/off switch in the system prompt! </p><p>Beating other reasoners like QwQ on GPQA tasks, this distilled and pruned Llama-based reasoner seems very powerful! Congrats to NVIDIA!</p><p>Chris Alexius (a friend of the pod), who co-authored the <a target="_blank" href="https://developer.nvidia.com/blog/build-enterprise-ai-agents-with-advanced-open-nvidia-llama-nemotron-reasoning-models/">announcement</a>, told me that FP8 is expected, and when that drops, this model will also fit on a single H100 GPU, making it really great for enterprises who host on their own hardware. </p><p>And yes, it’s ready for commercial use. NVIDIA, welcome to the open-source LLM party! Explore the <a target="_blank" href="https://huggingface.co/collections/nvidia/llama-nemotron-67d92346030a2691293f200b">Llama-Nemotron HuggingFace Collection</a> and the <a target="_blank" href="https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1">Dataset</a>.</p><p>LG Enters the LLM Fray with EXAONE Deep 32B - Fridge AI is Officially a Thing</p><p><strong>LG</strong>, yes, <em>that</em> LG, surprised everyone by open-sourcing <strong>EXAONE Deep 32B</strong>, a "thinking model" from the fridge and TV giant. "LG open sources EXAONE and EXAONE Deep 32B thinking model," I announced, still slightly amused by the fridge-LLM concept. This 32B parameter model claims "superior capabilities" in reasoning, and while my live test in LM Studio went a bit haywire, quantization could be the culprit. It's non-commercial, but hey, fridge-powered AI is now officially a thing. Who saw that coming? 
Check out my <a target="_blank" href="https://www.youtube.com/watch?v=qOfkhWh1zrI">Reaction Video</a>, the <a target="_blank" href="https://www.lgresearch.ai/blog/view?seq=543">LG Blog</a>, and the <a target="_blank" href="https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-32B">HuggingFace page</a> for more info.</p><p>ByteDance's DAPO - Reinforcement Learning Gets Efficient</p><p>From the creators of TikTok, <strong>ByteDance</strong>, comes <strong>DAPO</strong>, a new reinforcement learning method that's outperforming GRPO. DAPO promises 50% accuracy on AIME 2024 with 50% <em>fewer</em> training steps. Nisten, our RL expert, explained it's a refined GRPO, pushing the boundaries of RL efficiency. Open source RL is getting faster and better, thanks to ByteDance! Dive into the <a target="_blank" href="https://x.com/_philschmid/status/1902258522059866504">X thread</a>, <a target="_blank" href="https://t.co/7MEc5mTlC8">Github</a>, and <a target="_blank" href="https://arxiv.org/abs/2503.14476">Paper</a> for the technical details.</p><p>Big CO LLMs + APIs: Google's Generosity, OpenAI's Oligarch Pricing, and GTC Mania</p><p>Switching gears to the Big CO LLM arena, we saw <strong>Google</strong> making moves for the masses, <strong>OpenAI</strong> catering to the elite, and <strong>NVIDIA</strong>… well, being NVIDIA.</p><p>Google Makes DeepResearch Free and Adds Canvas</p><p><strong>Google</strong> is opening up <strong>DeepResearch</strong> to everyone for <strong>FREE!</strong> DeepResearch, Gemini's advanced search mode, is now accessible without a Pro subscription. I really like its revamped UI, where you can see the thinking and the sources! I used it live on the show to find out what we talked about in the latest episode of ThursdAI, and it did a pretty good job! </p><p>Plus, Google unveiled <strong>Canvas</strong>, letting you "build apps within Gemini and actually see them." Google is making Gemini more accessible and more powerful, a win for everyone.
Here's a <a target="_blank" href="https://g.co/gemini/share/f6643450f880">Tetris game</a> it built for me and here's a <a target="_blank" href="https://g.co/gemini/share/eea30bfd11f2">markdown-enabled word counter</a> I rebuild every week before I send ThursdAI (making sure I don't send you 10K words every week 😅)</p><p>OpenAI's O1 Pro API - Pricey Power for the Few</p><p><strong>OpenAI</strong>, in contrast, released <strong>O1 Pro API</strong>, but with a price tag that's… astronomical. "OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)," I quipped, highlighting the exclusivity. $600 per million output tokens? "If you code with this, if you vibe code with this, you better already have VCs backing your startup," I warned. O1 Pro might be top-tier performance, but it's priced for the 0.1%.</p><p>NVIDIA GTC Recap - Jensen's Hardware Extravaganza</p><p><strong>NVIDIA GTC</strong> was, as always, a hardware spectacle. New GPUs (Blackwell Ultra, Vera Rubin, Feynman!), the tiny <strong>DGX Spark</strong> supercomputer, the <strong>GR00T</strong> robot foundation model, and the <strong>Blue</strong> robot – NVIDIA is building the AI future, brick by silicon brick. Jensen is the AI world's rockstar, and GTC is his sold-out stadium show. Check out <a target="_blank" href="https://x.com/rowancheung/status/1902708463546904894">Rowan Cheung's GTC Recap on X</a> for a quick overview.</p><p>Shoutout to our team at GTC and this amazingly timed logo shot I took from the live stream! </p><p>Anthropic adds Web Search</p><p>We had a surprise at the end of the show, with Anthropic releasing web search. It's a small thing, but for folks who use Claude, it's very important.</p><p>You can now turn on web search directly on Claude, which makes it... the last frontier lab to enable this feature 😂 Congrats!
</p><p>AI Art & Diffusion & 3D: Tencent's 3D Revolution</p><p>Tencent Hunyuan 3D 2.0 MV and Turbo - 3D Generation Gets Real-Time</p><p><strong>Tencent</strong> updated <strong>Hunyuan 3D</strong> to <strong>2.0 MV (MultiView) and Turbo</strong>, pushing the boundaries of 3D generation. Hunyuan 3D 2.0 surpasses SOTA in geometry, texture, and alignment, and the <strong>Turbo</strong> version achieves near real-time 3D generation – under one second on an H100! Try out the <a target="_blank" href="https://huggingface.co/spaces/tencent/Hunyuan3D-2mv">Hunyuan3D-2mv HF Space</a> to generate your own 3D masterpieces! </p><p><strong>MultiView (MV)</strong> is another game-changer, allowing you to input 1-4 views for more accurate 3D models. "MV allows to generate 3d shapes from 1-4 views making the 3D shapes much higher quality," I explained. The demo of generating a 3D mouse from Gemini-generated images showcased the seamless pipeline from thought to 3D object. I literally just asked Gemini with native image generation to generate a character and then dropped those images straight into the Hunyuan MV space. </p><p>Holodecks are getting closer, folks!</p><p>Closing Remarks and Thank You</p><p>And that's all she wrote, folks! Another week, another AI explosion. From voice to vision, open source to Big CO, this week was a whirlwind of innovation. Huge thanks again to our incredible guests, Joseph Nelson from Roboflow, Kwindla Kramer from Daily, and Lucas Atkins from ARCEE! And of course, massive shoutout to my co-hosts, Wolfram, Yam, and Nisten – you guys are the best!</p><p>And YOU, the ThursdAI community, are the reason we do this. Thank you for tuning in, for your support, and for being as hyped about AI as we are. Remember, ThursdAI is a labor of love, fueled by Weights & Biases and a whole lot of passion.</p><p>Missed anything? <a target="_blank" href="https://thursdai.news">thursdai.news</a> is your one-stop shop for the podcast, newsletter, and video replay. And seriously, subscribe to our YouTube channel!
Let's get to 1000 subs!</p><p>Helpful? We’d love to see you here again!</p><p>TL;DR and Show Notes:</p><p>* <strong>Guests and Cohosts</strong></p><p>* Alex Volkov - AI Evangelist & Weights & Biases (<a target="_blank" href="http://x.com/@altryne">@altryne</a>)</p><p>Co Hosts - <a target="_blank" href="http://x.com/@WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="http://x.com/@yampeleg">@yampeleg</a> <a target="_blank" href="http://x.com/@nisten">@nisten</a></p><p>* Sponsor - Weights & Biases Weave (<a target="_blank" href="http://x.com/@weave_wb">@weave_wb</a>)</p><p>* Joseph Nelson - CEO Roboflow (<a target="_blank" href="https://x.com/josephofiowa">@josephofiowa</a>)</p><p>* Kwindla Kramer - CEO Daily (<a target="_blank" href="https://x.com/kwindla">@kwindla</a>)</p><p>* Lucas Atkins - Lead of the Labs team at Arcee (<a target="_blank" href="https://x.com/LucasAtkins7/status/1901666078620537339">@LucasAtkins7</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Mistral Small 3.1 24B - Multimodal (<a target="_blank" href="https://mistral.ai/news/mistral-small-3-1">Blog</a>, <a target="_blank" href="https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503">HF</a>, <a target="_blank" href="https://t.co/6K81wQri4x">HF base</a>)</p><p>* LG open sources EXAONE and EXAONE Deep 32B thinking model (<a target="_blank" href="https://www.youtube.com/watch?v=qOfkhWh1zrI">Alex Reaction Video</a>, <a target="_blank" href="https://www.lgresearch.ai/blog/view?seq=543">LG BLOG</a>, <a target="_blank" href="https://huggingface.co/LGAI-EXAONE/EXAONE-Deep-32B">HF</a>)</p><p>* ByteDance releases DAPO - better than GRPO RL Method (<a target="_blank" href="https://x.com/_philschmid/status/1902258522059866504">X</a>, <a target="_blank" href="https://t.co/7MEc5mTlC8">Github</a>, <a target="_blank" href="https://arxiv.org/abs/2503.14476">Paper</a>)</p><p>* NVIDIA drops LLama-Nemotron (Super 49B, Nano 8B) with reasoning and data (<a target="_blank"
href="https://x.com/kuchaev/status/1902078122792775771">X</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/llama-nemotron-67d92346030a2691293f200b">HF</a>, <a target="_blank" href="https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1">Dataset</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google makes DeepResearch free, Canvas added, Live Previews (<a target="_blank" href="https://x.com/OfficialLoganK/status/1902042453080760404">X</a>)</p><p>* OpenAI makes O1-pro API available to oligarchs ($600/1mtok output!)</p><p>* NVIDIA GTC recap - (<a target="_blank" href="https://x.com/rowancheung/status/1902708463546904894">X</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Come visit the Weights & Biases team at GTC today! </p><p>* <strong>Vision & Video</strong></p><p>* Roboflow drops RF-DETR, a SOTA vision model, + new eval RF100-VL for VLMs (<a target="_blank" href="https://blog.roboflow.com/rf-detr/#how-to-use-rf-detr">Blog</a>, <a target="_blank" href="https://github.com/roboflow/rf-detr">Github</a>, <a target="_blank" href="https://rf100-vl.org/">Benchmark</a>)</p><p>* StepFun dropped their image2video model TI2V (<a target="_blank" href="https://huggingface.co/stepfun-ai/stepvideo-ti2v">HF</a>, <a target="_blank" href="https://github.com/stepfun-ai/Step-Video-TI2V">Github</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI launches a new voice model and 2 new transcription models (<a target="_blank" href="https://openai.com/index/introducing-our-next-generation-audio-models/">Blog</a><strong>, </strong><a target="_blank" href="https://youtu.be/hb-bwLcOMKs">Youtube</a>)</p><p>* Canopy Labs drops Orpheus 3B (1B, 500M, 150M versions) - natural-sounding speech language model (<a target="_blank" href="https://canopylabs.ai/model-releases">Blog</a>, <a target="_blank" href="https://huggingface.co/canopylabs">HF</a>, <a target="_blank"
href="https://colab.research.google.com/drive/1xxPpBwI4l_nKUx0J0nzZTtikfqP3UJ6p?usp=sharing#scrollTo=lV49oiPFpbXL">Colab</a>)</p><p>* NVIDIA Canary 1B/180M Flash - Apache 2 speech recognition and translation Llama finetune (<a target="_blank" href="https://huggingface.co/nvidia/canary-1b-flash">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Tencent updates Hunyuan 3D 2.0 MV (MultiView) and Turbo (<a target="_blank" href="https://huggingface.co/spaces/tencent/Hunyuan3D-2mv">HF</a>)</p><p>* <strong>Tools</strong></p><p>* ARCEE Conductor - model router (<a target="_blank" href="https://x.com/LucasAtkins7/status/1901666078620537339">X</a>)</p><p>* Cursor ships Claude 3.7 MAX (<a target="_blank" href="https://x.com/cursor_ai/status/1902123296231195047">X</a>)</p><p>* Notebook LM teases MindMaps (<a target="_blank" href="https://x.com/tokumin/status/1902251588925915429">X</a>)</p><p>* Gemini Co-Drawing - using Gemini native image output to help with drawing (<a target="_blank" href="https://huggingface.co/spaces/Trudy/gemini-codrawing">HF</a>)</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-20-openais-new-voices</link><guid isPermaLink="false">substack:post:159510374</guid><dc:creator><![CDATA[Alex Volkov, Kwindla Hultman Kramer, and Joseph]]></dc:creator><pubDate>Thu, 20 Mar 2025 22:04:51 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/159510374/034d3646902625f660a5a3b2f8c6646e.mp3" length="80270263" type="audio/mpeg"/><itunes:author>Alex Volkov, Kwindla Hultman Kramer, and Joseph</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6689</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/159510374/fafa655b5c9cea50d918a3c636269e86.jpg"/></item><item><title><![CDATA[📆 ThursdAI Turns Two! 🎉 Gemma 3, Gemini Native Image, new OpenAI tools, tons of open source & more AI news]]></title><description><![CDATA[<p><strong>LET'S GO!</strong> </p><p>Happy second birthday to ThursdAI, your favorite weekly AI news show! Can you believe it's been two whole years since we jumped into that random Twitter Space to rant about GPT-4? From humble beginnings as a late-night Twitter chat to a full-blown podcast, Newsletter and YouTube show with hundreds of thousands of downloads, it's been an absolutely wild ride! </p><p>That's right, two whole years of me, Alex Volkov, your friendly AI Evangelist, along with my amazing co-hosts, trying to keep you up-to-date on the breakneck speed of the AI world</p><p>And what better way to celebrate than with a week PACKED with insane AI news? Buckle up, folks, because this week Google went OPEN SOURCE crazy, Gemini got even cooler, OpenAI created a whole new Agents SDK and the open-source community continues to blow our minds. 
We’ve got it all - from game-changing model releases to mind-bending demos.</p><p>This week I'm also on the Weights & Biases company retreat, so TL;DR first and then the newsletter, but honestly, I'll start embedding the live show here in the substack from now on, because we're getting so good at it, I barely have to edit lately and there's a LOT to show you guys! </p><p>TL;DR and Show Notes & Links</p><p>* <strong>Hosts & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* <strong>Co Hosts - </strong><a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="https://x.com/nisten">@nisten</a> </p><p>* Sandra Kublik - DevRel at Cohere (<a target="_blank" href="https://x.com/itsSandraKublik">@itsSandraKublik</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Google open sources Gemma 3 - 1B - 27B - 128K context (<a target="_blank" href="https://developers.googleblog.com/en/introducing-gemma3/">Blog</a>, <a target="_blank" href="https://aistudio.google.com/prompts/new_chat?model=gemma-3-27b-it">AI Studio</a>, <a target="_blank" href="https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d">HF</a>)</p><p>* EuroBERT - multilingual encoder models (210M to 2.1B params)</p><p>* Reka Flash 3 (reasoning) 21B parameters is open sourced (<a target="_blank" href="https://www.reka.ai/news/introducing-reka-flash">Blog</a>, <a target="_blank" href="https://huggingface.co/RekaAI/reka-flash-3">HF</a>)</p><p>* Cohere Command A 111B model - 256K context (<a target="_blank" href="https://cohere.com/blog/command-a">Blog</a>)</p><p>* Nous Research Deep Hermes 24B / 3B Hybrid Reasoners (<a target="_blank" href="https://x.com/NousResearch/status/1900218445763088766">X</a>, <a target="_blank"
href="https://huggingface.co/NousResearch/DeepHermes-3-Mistral-24B-Preview">HF</a>)</p><p>* AllenAI OLMo 2 32B - fully open source GPT4 level model (<a target="_blank" href="https://x.com/natolambert/status/1900249185225703703">X</a>, <a target="_blank" href="https://t.co/MWEyDIJMGo">Blog</a>, <a target="_blank" href="https://t.co/QBVnWRcP0y">Try It</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini Flash generates images natively (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1899896275652202927">X</a>, <a target="_blank" href="https://aistudio.google.com/app/prompts/1bRqkN58xP6x3H1wTfQ84HaMi6R19vhLz">AI Studio</a>)</p><p>* Google deep research is now free in Gemini app and powered by Gemini Thinking (<a target="_blank" href="https://gemini.google.com/app">Try It no cost</a>)</p><p>* OpenAI released new responses API, Web Search, File search and Computer USE tools (<a target="_blank" href="https://x.com/OpenAIDevs/status/1899531225468969240">X</a>, <a target="_blank" href="https://t.co/s5Zsy4Wvqy">Blog</a>)</p><p>* <strong>This week's Buzz</strong> </p><p>* The whole company is at an offsite in Oceanside, CA</p><p>* W&B held an internal MCP hackathon with cool projects - launching an MCP server soon!</p><p>* <strong>Vision & Video</strong></p><p>* Remade AI - 8 LoRA video effects for Wan 2.1 (<a target="_blank" href="https://huggingface.co/collections/Remade-AI/wan21-14b-480p-i2v-loras-67d0e26f08092436b585919b">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance Seedream 2.0 - A Native Chinese-English Bilingual Image Generation Foundation Model by ByteDance (<a target="_blank" href="https://team.doubao.com/en/tech/seedream">Blog</a>, <a target="_blank" href="https://arxiv.org/pdf/2503.07703">Paper</a>)</p><p>* <strong>Tools</strong></p><p>* Everyone's talking about Manus - (manus.im)</p><p>* Google AI Studio now supports YouTube understanding via link dropping</p><p>Open Source LLMs: Gemma 3, EuroBERT, Reka Flash 3, and
Cohere Command-A Unleashed!</p><p>This week was absolutely HUGE for open source, folks. Google dropped a BOMBSHELL with <strong>Gemma 3</strong>! As Wolfram pointed out, this is a "very technical achievement," and it's not just one model, but a whole family ranging from 1 billion to 27 billion parameters. And get this – the 27B model can run on a SINGLE GPU! Sundar Pichai himself claimed you’d need "at least 10X compute to get similar performance from other models." Insane!</p><p>Gemma 3 isn't just about size; it's packed with features. We're talking multimodal capabilities (text, images, and video!), support for over 140 languages, and a massive 128k context window. As Nisten pointed out, "it might actually end up being the best at multimodal in that regard" for local models. Plus, it's fine-tuned for safety and comes with ShieldGemma 2 for content moderation. You can grab Gemma 3 on <a target="_blank" href="https://aistudio.google.com">Google AI Studio</a>, <a target="_blank" href="https://huggingface.co/google/gemma-3-27b-it">Hugging Face</a>, Ollama, Kaggle – everywhere! Huge shoutout to Omar Sanseviero and the Google team for this incredible release and for supporting the open-source community from day one! Colin, aka Bartowski, was right: "The best thing about Gemma is the fact that Google specifically helped the open source communities to get day one support." This is how you do open source right!</p><p>Next up, we have EuroBERT, a new family of multilingual encoder models. Wolfram, our European representative, was particularly excited about this one: "In European languages, you have different characters than in other languages. And, um, yeah, encoding everything properly is, uh, difficult." Ranging from 210 million to 2.1 billion parameters, EuroBERT is designed to push the boundaries of NLP in European and global languages.
With training on a massive 5 trillion-token dataset across 15 languages and support for 8K context tokens, EuroBERT is a workhorse for RAG and other NLP tasks. Plus, how cool is their mascot?</p><p>Reka Flash 3 - a 21B reasoner with apache 2 trained with RLOO</p><p>And the open source train keeps rolling! Reka AI dropped Reka Flash 3, a 21 billion parameter reasoning model with an Apache 2.0 license! Nisten was blown away by the benchmarks: "This might be one of the best like 20B size models that there is right now. And it's Apache 2.0. Uh, I, I think this is a much bigger deal than most people realize." Reka Flash 3 is compact, efficient, and excels at chat, coding, instruction following, and function calling. They even used a new reinforcement learning technique called REINFORCE Leave One-Out (RLOO). Go give it a whirl on <a target="_blank" href="https://x.com/RekaAILabs/status/1899481289495031825">Hugging Face</a> or their chat interface – chat.reka.ai!</p><p>Last but definitely not least in the open-source realm, we had a special guest, Sandra (<a target="_blank" href="https://x.com/itsSandraKublik">@itsSandraKublik</a>) from Cohere, join us to announce Command-A! This beast of a model clocks in at 111 BILLION parameters with a massive 256K context window. Sandra emphasized its efficiency, "It requires only two GPUs. Typically the models of this size require 32 GPUs. So it's a huge, huge difference." Command-A is designed for enterprises, focusing on agentic tasks, tool use, and multilingual performance. It's optimized for private deployments and boasts enterprise-grade security. Congrats to Sandra and the Cohere team on this massive release!</p><p>Big CO LLMs + APIs: Gemini Flash Gets Visual, Deep Research Goes Free, and OpenAI Builds for Agents</p><p>The big companies weren't sleeping either! Google continued their awesome week by unleashing native image generation in Gemini Flash Experimental! This is seriously f*****g cool, folks! 
Sorry for my French, but it’s true. You can now directly interact with images, tell Gemini what to do, and it just does it. We even showed it live on the stream, turning ourselves into cat-confetti-birthday-hat-wearing masterpieces! </p><p>Wolfram was right, "It's also a sign what we will see in, like, Photoshop, for example. Where you, you expect to just talk to it and have it do everything that a graphic designer would be doing." The future of creative tools is HERE.</p><p>And guess what else Google did? They made Deep Research FREE in the Gemini app and powered by Gemini Thinking! Nisten jumped in to test it live, and we were all impressed. "This is the nicest interface so far that I've seen," he said. Deep Research now digs through HUNDREDS of websites (Nisten’s test hit 156!) to give you comprehensive answers, and the interface is slick and user-friendly. Plus, you can export to Google Docs! Intelligence too cheap to meter? Google is definitely pushing that boundary.</p><p>Last-second addition - Allen Institute for AI released <strong>OLMo 2 32B</strong> - their biggest open model yet</p><p>Just as I'm writing this, friend of the pod Nathan from the Allen Institute for AI announced the release of a FULLY OPEN OLMo 2, which includes weights, code, dataset, everything, and apparently it beats the latest GPT 3.5, GPT 4o mini, and leading open weight models like Qwen and Mistral. </p><p>Evals look legit, but more than that, this is an Apache 2 model with everything in place to advance open AI and open science! </p><p>Check out Nathan's <a target="_blank" href="https://x.com/natolambert/status/1900249099343192573">tweet</a> for more info, and congrats to the Allen team for this awesome release! </p><p>OpenAI's new Responses API and Agents SDK with Web, File and CUA tools</p><p>Of course, OpenAI wasn't going to let Google have all the fun. They dropped a new SDK for agents called the Responses API.
This is a whole new way to build with OpenAI, designed specifically for the agentic era we're entering. They also released three new tools: Web Search, Computer Use Tool, and File Search Tool. The Web Search tool is self-explanatory – finally, built-in web search from OpenAI!</p><p>The Computer Use Tool, while currently limited in availability, opens up exciting possibilities for agent automation, letting agents interact with computer interfaces. And the File Search Tool gives you a built-in RAG system, simplifying knowledge retrieval from your own files. As always, OpenAI is adapting to the agentic world and giving developers more power.</p><p>Finally in the big company space, Nous Research released PORTAL, their new Inference API service. Now you can access their awesome models, like Hermes 3 Llama 70B and DeepHermes 3 8B, directly via API. It's great to see more open-source labs offering API access, making these powerful models even more accessible.</p><p>This Week's Buzz at Weights & Biases: Offsite Hackathon and MCP Mania!</p><p>This week's "This Week's Buzz" segment comes to you live from Oceanside, California! The whole Weights & Biases team is here for our company offsite. Despite the not-so-sunny California weather (thanks, storm!), it's been an incredible week of meeting colleagues, strategizing, and HACKING!</p><p>And speaking of hacking, we had an MCP hackathon! After last week’s MCP-pilling episode, we were all hyped about Model Context Protocol, and the team didn't disappoint. In just three hours, the innovation was flowing! We saw agents built for WordPress, MCP support integrated into Weave playground, and even MCP servers for Weights & Biases itself! Get ready, folks, because an MCP server for Weights & Biases is COMING SOON! You'll be able to talk to your W&B data like never before. Huge shoutout to the W&B team for their incredible talent and for embracing the agentic future! 
And in case you missed it, Weights & Biases is now part of the CoreWeave family! Exciting times ahead!</p><p>Vision & Video: LoRA Video Effects and OpenSora 2.0</p><p>Moving into vision and video, Remade AI <a target="_blank" href="https://huggingface.co/collections/Remade-AI/wan21-14b-480p-i2v-loras-67d0e26f08092436b585919b">released</a> 8 LoRA video effects for Wan 2.1! Remember Wan 2.1 from Alibaba? Now you can add crazy effects like "squish," "inflate," "deflate," and even "cakeify" to your videos using LoRAs. It's open source and super cool to see video effects becoming trainable and customizable.</p><p>And in the realm of open-source video generation, <a target="_blank" href="https://github.com/hpcaitech/Open-Sora?tab=readme-ov-file">OpenSora 2.0</a> dropped! This 11 billion parameter model claims state-of-the-art video generation trained for just $200,000! They’re even claiming performance close to Sora itself on some benchmarks. Nisten checked out the demos, and while we're all a bit jaded now with the rapid pace of video AI, it's still mind-blowing how far we've come. Open source video is getting seriously impressive, seriously fast.</p><p>AI Art & Diffusion & 3D: ByteDance's Bilingual Seedream 2.0</p><p>ByteDance, the folks behind TikTok, released Seedream 2.0, a native Chinese-English bilingual image generation foundation model. This model excels at text rendering, cultural nuance, and human preference alignment. Seedream 2.0 boasts "powerful general capability," "native bilingual comprehension ability," and "excellent text rendering." It's designed to understand both Chinese and English prompts natively, generating high-quality, culturally relevant images. The examples look stunning, especially its ability to render Chinese text beautifully.</p><p>Tools: Manus AI Agent, Google AI Studio YouTube Links, and Cursor Embeddings</p><p>Finally, in the tools section, everyone's buzzing about Manus, a new AI research agent.
We gave it a try live on the show, asking it to do some research. The UI is slick, and it seems to be using Claude 3.7 behind the scenes. Manus creates a to-do list, browses the web in a real Chrome browser, and even generates files. It's like Operator on steroids. We'll be keeping an eye on Manus and will report back on its performance in future episodes.</p><p>And Google AI Studio keeps getting better! Now you can drop YouTube links into Google AI Studio, and it will natively understand the video! This is HUGE for video analysis and content understanding. Imagine using this for support, content summarization, and so much more.</p><p>PHEW! What a week to celebrate two years of ThursdAI! From open source explosions to Gemini's visual prowess and OpenAI's agentic advancements, the AI world is moving faster than ever. As Wolfram aptly put it, "The acceleration, you can feel it." And Nisten reminded us of the incredible journey, "I remember I had early access to GPT-4 32K, and, uh, then... the person for the contract that had given me access, they cut it off because on the one weekend, I didn't realize how expensive it was. So I had to use $180 worth of tokens  just trying it out." Now, we have models that are more powerful and more accessible than ever before. </p><p>Thank you to Wolfram, Nisten, and LDJ for co-hosting and bringing their insights every week. </p><p>And most importantly, THANK YOU to our amazing community for tuning in, listening, and supporting ThursdAI for two incredible years! We couldn't do it without you. Here's to another year of staying up-to-date so YOU don't have to! Don't forget to subscribe to the podcast, YouTube channel, and newsletter to stay in the loop. And share ThursdAI with a friend – it's the best birthday gift you can give us! Until next week, keep building and keep exploring the amazing world of AI! LET'S GO!</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-turns-two-gemma-3-gemini</link><guid isPermaLink="false">substack:post:159016903</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 13 Mar 2025 20:24:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/159016903/174417a192662514642f5bc0b93b1f4d.mp3" length="66294077" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5524</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/159016903/0e7fc05200c88ad0a690d146bd0da8bf.jpg"/></item><item><title><![CDATA[ThursdAI - Mar 6, 2025 - Alibaba's R1 Killer QwQ, Exclusive Google AI Mode Chat, and MCP fever sweeping the community!]]></title><description><![CDATA[<p>What is UP folks! Alex here from Weights & Biases (yeah, still, but check this weeks buzz section below for some news!) </p><p>I really really enjoyed today's episode, I feel like I can post it unedited it was so so good. We started the show with our good friend Junyang Lin from Alibaba Qwen, where he told us about their new 32B reasoner QwQ. Then we interviewed Google's VP of the search product, Robby Stein, who came and told us about their upcoming AI mode in Google! I got access and played with it, and it made me switch back from PPXL as my main. </p><p>And lastly, I recently became fully MCP-pilled, since we covered it when it came out over thanksgiving, I saw this acronym everywhere on my timeline but only recently "got it" and so I wanted to have an MCP deep dive, and boy... did I get what I wished for! You absolutely should tune in to the show as there's no way for me to cover everything we covered about MCP with Dina and Jason! ok without, further adieu.. 
let's dive in (and the TL;DR, links and show notes at the end as always!) </p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>🤯 Alibaba's QwQ-32B: Small But Shocking Everyone!</p><p>The open-source LLM segment started strong, chatting with friend of the show Junyang Justin Lin from Alibaba’s esteemed Qwen team. They've cooked up something quite special: QwQ-32B, a reasoning-focused, reinforcement-learning-boosted beast that punches remarkably above its weight. We're talking about a mere 32B-parameter model holding its ground on tough evaluations against DeepSeek R1, a 671B behemoth!</p><p>Here’s how wild this is: You can literally run QwQ on your Mac! Junyang shared that they applied two solid rounds of RL to amp its reasoning, coding, and math capabilities, integrating agents into the model to fully unlock its abilities. When I called out how insane it was that we’ve gone from "LLMs can't do math" to basically acing competitive math benchmarks like AIME24, Junyang calmly hinted that they're already aiming for unified thinking/non-thinking models.
Sounds wild, doesn’t it?</p><p>Check out the full QwQ release <a target="_blank" href="https://huggingface.co/Qwen/QwQ-32B">here</a>, or dive into their <a target="_blank" href="https://qwenlm.github.io/blog/qwq-32b/">blog post</a>.</p><p>🚀 Google Launches AI Mode: Search Goes Next-Level (<a target="_blank" href="https://x.com/altryne/status/1897381479459811368">X</a>, <a target="_blank" href="https://blog.google/products/search/ai-mode-search/">Blog</a>, <a target="_blank" href="https://www.youtube.com/watch?v=5QTveQpq1WI">My Live Reaction</a>).</p><p>For the past two years, on this very show, we've been asking, "Where's Google?" in the Gen AI race. Well, folks, they're <em>back</em>. And they're back in a <em>big</em> way.</p><p>Next, we were thrilled to have Google’s own Robby Stein, VP of Product for Google Search, drop by ThursdAI after their massive launch of AI Mode and expanded AI Overviews leveraging Gemini 2.0. Robby walked us through this massive shift, which essentially brings advanced conversational AI capabilities straight into Google. Seriously — Gemini 2.0 is now out here doing complex reasoning while performing fan-out queries behind the scenes in Google's infrastructure.</p><p>Google search is literally Googling itself. No joke. "We actually have the model generating fan-out queries — Google searches within searches — to collect accurate, fresh, and verified data," explained Robby during our chat. And I gotta admit, after playing with AI Mode, Google is definitely back in the game—real-time restaurant closures, stock analyses, product comparisons, and it’s conversational to boot. You can check my blind reaction first impression video <a target="_blank" href="https://www.youtube.com/watch?v=5QTveQpq1WI">here</a>.
(also, while you're there, why not subscribe to my YT?)</p><p>Google has some huge plans, but right now AI Mode is rolling out slowly via Google Labs for Google One AI Premium subscribers first. More soon though!</p><p>🐝 This Week's Buzz: Weights & Biases Joins CoreWeave Family!</p><p>Huge buzz (in every sense of the word) from Weights & Biases, who made waves with their announcement this week: We've joined forces with CoreWeave! Yeah, that's big news as CoreWeave, the AI hyperscaler known for delivering critical AI infrastructure, has now acquired Weights & Biases to build the ultimate end-to-end AI platform. It's early days of this exciting journey, and more details are emerging, but safe to say: the future of Weights & Biases just got even more exciting. Congrats to the whole team at Weights & Biases and our new colleagues at CoreWeave!</p><p>We're committed to all WandB users, so you'll be able to keep using Weights & Biases, and we'll continuously improve our offerings going forward! And nothing changes for ThursdAI! 🎉</p><p>MCP Takes Over: Giving AI agents superpowers via a standardized protocol </p><p>Then things got insanely exciting. Why? Because MCP is blowing up and I had to find out why everyone's timeline (mine included) just got invaded.</p><p>We welcomed Cloudflare’s amazing product manager Dina Kozlov and MCP master and creator Jason Kneen, and things quickly got mind-blowing. MCP servers, Jason explained, are essentially tool wrappers that effortlessly empower agents with capabilities like API access and even calling other LLMs—completely seamlessly and securely. According to Jason, "we haven't even touched the surface yet of what MCP can do—these things are Lego bricks ready to form swarms and even self-evolve."</p><p>Dina broke down just how easy it is to launch MCP servers on Cloudflare Workers while teasing exciting upcoming enhancements. 
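</p><p>To make the "tool wrapper" idea concrete, here's a toy sketch in plain Python. To be clear, this is not the real MCP SDK or Cloudflare's Workers API: the ToyMCPServer class and the light-switch tool below are invented purely to illustrate the pattern of exposing named, described tools behind one uniform discovery-and-invocation interface.</p>

```python
# Toy illustration of MCP's "tool wrapper" idea (hypothetical API,
# not the real MCP SDK): a server registers named tools with
# descriptions, and an agent can discover and call them uniformly.
from typing import Callable


class ToyMCPServer:
    def __init__(self, name: str):
        self.name = name
        self._tools: dict[str, dict] = {}

    def tool(self, description: str):
        """Decorator that wraps a plain function as a discoverable tool."""
        def register(fn: Callable):
            self._tools[fn.__name__] = {"fn": fn, "description": description}
            return fn
        return register

    def list_tools(self) -> list[dict]:
        # What an agent sees when it asks the server "what can you do?"
        return [{"name": n, "description": t["description"]}
                for n, t in self._tools.items()]

    def call(self, name: str, **kwargs):
        # One uniform invocation path, whatever the tool wraps underneath
        return self._tools[name]["fn"](**kwargs)


server = ToyMCPServer("smart-home")


@server.tool("Turn a room's light on or off")
def set_light(room: str, on: bool) -> str:
    return f"{room} light {'on' if on else 'off'}"


print(server.list_tools())
print(server.call("set_light", room="office", on=True))  # office light on
```

<p>The real protocol adds a JSON-RPC transport, input schemas, and auth on top of this same basic shape, which is what enables the "Lego bricks" composition Jason described.</p><p>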
Both Dina and Jason shared jaw-dropping examples, including composing complex workflows connecting Git, Jira, Gmail, and even smart home controls—practically instantaneously! Seriously, my mind is still spinning.</p><p>The MCP train is picking up steam, and something tells me we'll be talking about this revolutionary agent technology a lot more soon. Check out three great MCP directories that popped up recently: <a target="_blank" href="https://smithery.ai/">Smithery</a>, <a target="_blank" href="https://cursor.directory/mcp">Cursor Directory</a> and <a target="_blank" href="https://mcp.composio.dev/">Composio</a>.</p><p>This show was one of the best ones we recorded; honestly, I barely needed to edit it. It was also a really, really fun livestream, so if you prefer seeing to listening, here's the lightly edited live stream.</p><p>Thank you for being a ThursdAI subscriber. As always, here's the TL;DR and show notes for everything that happened in AI this week and the things we mentioned (and hosts we had).</p><p>TL;DR and Show Notes </p><p>* <strong>Show Notes & Guests</strong></p><p>* <strong>Alex Volkov</strong> - AI Evangelist & Weights & Biases (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* <strong>Co-hosts - </strong> <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a> <a target="_blank" href="https://x.com/ldjconfirmed">@ldjconfirmed</a> <a target="_blank" href="https://x.com/nisten">@nisten</a></p><p>* <strong>Junyang Justin Lin</strong> - Head of Qwen Team, Alibaba - <a target="_blank" href="https://x.com/JustinLin610">@JustinLin610</a></p><p>* <strong>Robby Stein</strong> - VP of Product, Google Search - <a target="_blank" href="https://x.com/rmstein/status/1897417750622216574">@rmstein</a></p><p>* <strong>Dina Kozlov</strong> - Product Manager, Cloudflare - <a target="_blank" href="https://x.com/dinasaur_404">@dinasaur_404</a></p><p>* <strong>Jason Kneen</strong> - MCP Wiz - <a target="_blank" href="https://x.com/jasonkneen">@jasonkneen</a></p><p>* My Google AI Mode Blind Reaction Video (<a target="_blank" href="https://www.youtube.com/watch?v=5QTveQpq1WI">Youtube</a>)</p><p>* Sesame Maya Conversation Demo - (<a target="_blank" href="https://www.youtube.com/watch?v=pI_WARqK_X4&#38;t=1s">Youtube</a>)</p><p>* Cloudflare MCP docs (<a target="_blank" href="https://blog.cloudflare.com/model-context-protocol/">Blog</a>)</p><p>* Weights & Biases Agents Course Pre-signup - <a target="_blank" href="https://wandb.me/agents">https://wandb.me/agents</a></p><p>* <strong>Open Source LLMs</strong></p><p>* Qwen's latest reasoning model <strong>QwQ-32B</strong> - matches R1 on some evals (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1897361654763151544">X</a>, <a target="_blank" href="https://qwenlm.github.io/blog/qwq-32b/">Blog</a>, <a target="_blank" href="https://huggingface.co/Qwen/QwQ-32B">HF</a>, <a 
target="_blank" href="https://huggingface.co/spaces/Qwen/QwQ-32B-Demo">Chat</a>)</p><p>* Cohere For AI - Aya Vision - 8B & 32B (<a target="_blank" href="https://x.com/CohereForAI/status/1896923657470886234">X</a>, <a target="_blank" href="https://huggingface.co/collections/CohereForAI/c4ai-aya-vision-67c4ccd395ca064308ee1484?ref=cohere-ai.ghost.io">HF</a>)</p><p>* AI21 - Jamba 1.6 Large & Jamba 1.6 Mini (<a target="_blank" href="https://x.com/AI21Labs/status/1897657953261601151">X</a>, <a target="_blank" href="https://huggingface.co/ai21labs/AI21-Jamba-Large-1.6">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google announces AI Mode & AI Overviews with Gemini 2.0 (<a target="_blank" href="https://x.com/altryne/status/1897381479459811368">X</a>, <a target="_blank" href="https://blog.google/products/search/ai-mode-search/">Blog</a>, <a target="_blank" href="https://www.youtube.com/watch?v=5QTveQpq1WI&#38;feature=youtu.be">My Live Reaction</a>)</p><p>* OpenAI rolls out GPT 4.5 to Plus users - #1 on LM Arena 🔥 (<a target="_blank" href="https://x.com/lmarena_ai/status/1896590146465579105">X</a>)</p><p>* Grok Voice is available for free users as well (<a target="_blank" href="https://x.com/ebbyamir/status/1897118801231249818">X</a>)</p><p>* Elysian Labs launches Auren iOS app (<a target="_blank" href="https://x.com/nearcyan/status/1897466463314936034">X</a>, <a target="_blank" href="https://auren.app">App Store</a>)</p><p>* Mistral announces SOTA OCR (<a target="_blank" href="https://mistral.ai/news/mistral-ocr">Blog</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Weights & Biases is acquired by CoreWeave 🎉 (<a target="_blank" href="https://wandb.ai/wandb/wb-announcements/reports/W-B-being-acquired-by-CoreWeave--VmlldzoxMTY0MDI1MQ">Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Tencent HYVideo img2vid is finally here (<a target="_blank" href="https://x.com/TXhunyuan/status/1897558826519556325">X</a>, <a target="_blank" 
href="https://huggingface.co/tencent/HunyuanVideo-I2V">HF</a>, <a target="_blank" href="https://video.hunyuan.tencent.com/">Try It</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* NotaGen - symbolic music generation model for <strong>high-quality classical sheet music</strong> <a target="_blank" href="https://github.com/ElectricAlexis/NotaGen">Github</a>, <a target="_blank" href="https://electricalexis.github.io/notagen-demo/">Demo</a>, <a target="_blank" href="https://huggingface.co/ElectricAlexis/NotaGen">HF</a></p><p>* Sesame takes the world by storm with their amazing voice model (<a target="_blank" href="https://www.youtube.com/watch?v=pI_WARqK_X4&#38;t=1s">My Reaction</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* MiniMax__AI - Image-01: A Versatile Text-to-Image Model at 1/10 the Cost (<a target="_blank" href="https://x.com/MiniMax__AI/status/1896475931809817015">X</a>, <a target="_blank" href="https://t.co/ATyAN03H1F">Try it</a>)</p><p>* Zhipu AI - CogView 4 6B - (<a target="_blank" href="https://x.com/ChatGLM/status/1896824917880148450">X</a>, <a target="_blank" href="https://t.co/O8btwDugWI">Github</a>)</p><p>* <strong>Tools</strong></p><p>* Google - Data Science agent in Google Colab <a target="_blank" href="https://developers.googleblog.com/en/data-science-agent-in-colab-with-gemini/">Blog</a></p><p>* Baidu Miaoda - no-code AI build tool </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-6-2025-alibabas-r1-killer</link><guid isPermaLink="false">substack:post:158547546</guid><dc:creator><![CDATA[Alex Volkov, Dina Kozlov, and Jason Kneen]]></dc:creator><pubDate>Thu, 06 Mar 2025 21:46:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/158547546/86653e1de0fcc205c153418751d5e3b5.mp3" length="79911373" type="audio/mpeg"/><itunes:author>Alex Volkov, Dina Kozlov, and Jason Kneen</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6659</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/158547546/efdc9a6d1caedf1ceceb37b2a60ab98e.jpg"/></item><item><title><![CDATA[📆 Feb 27, 2025 - GPT-4.5 Drops TODAY?!, Claude 3.7 Coding BEAST, Grok's Unhinged Voice, Humanlike AI voices & more AI news]]></title><description><![CDATA[<p>Hey all, Alex here 👋</p><p>What can I say, the weeks are getting busier, and this is one of those "crazy full" weeks in AI. As we were about to start recording, OpenAI teased a GPT 4.5 live stream, and we already had a very busy show lined up (Claude 3.7 vibes are immaculate, Grok got an unhinged voice mode) and I had an interview with Kevin Hou from Windsurf scheduled! Let's dive in! </p><p>🔥 GPT 4.5 (ORION) is here - world's largest LLM (10x GPT4o) </p><p>OpenAI has finally shipped their next .5 model, at 10x the scale of the previous model. We didn't cover this on the podcast but did watch the OpenAI live stream together after the podcast concluded. 
</p><p>A very interesting .5 release from OpenAI, where even Sam Altman says "this model won't crush on benchmarks" and it is not the most frontier model, but it is OpenAI's LARGEST model by far (folks are speculating 10+ trillion parameters) </p><p>After 2 years of smaller models and distillations, we finally got a new BIG model that shows proper scaling laws, and while on some benchmarks it won't compete against reasoning models, this model will absolutely fuel a huge increase in capabilities even for reasoners, once o-series models are trained on top of it. </p><p>Here's a summary of the announcement and a quick vibes recap (from folks who had access to it before) </p><p>* OpenAI's largest, most knowledgeable model.</p><p>* Increased world knowledge: 62.5% on SimpleQA, 71.4% on GPQA</p><p>* Better at creative writing, programming, problem-solving (no native step-by-step reasoning).</p><p>* Text and image input, text output</p><p>* Available in ChatGPT Pro and via API access (API supports Function Calling, Structured Output)</p><p>* Knowledge Cutoff is October 2023.</p><p>* Context Window is 128,000 tokens.</p><p>* Max Output is 16,384 tokens.</p><p>* Pricing (per 1M tokens): Input: $75, Output: $150, Cached Input: $37.50.</p><p>* Foundation for future reasoning models</p><p>4.5 Vibes Recap</p><p>Tons of folks who had access are pointing to the same thing: while this model isn't beating others on evals, it's much better at multiple other things, namely <a target="_blank" href="https://x.com/theo/status/1895206943293350123">creative writing</a>, <a target="_blank" href="https://x.com/willdepue/status/1895205469645611038">recommending songs</a>, improved <a target="_blank" href="https://x.com/emollick/status/1895211249656570258">vision capability</a> and improved <a target="_blank" href="https://x.com/DeryaTR_/status/1895249875723321560">medical diagnosis</a>. 
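</p><p>For a sense of what the pricing listed above means in practice, here's a quick back-of-the-envelope sketch. The request sizes below are made-up examples for illustration, not real usage figures:</p>

```python
# Sanity-check the GPT-4.5 API pricing listed above:
# $75 / 1M input tokens, $150 / 1M output, $37.50 / 1M cached input.
PRICE_PER_M = {"input": 75.00, "output": 150.00, "cached_input": 37.50}


def cost_usd(tokens: dict) -> float:
    """Cost of one request given token counts per billing category."""
    return sum(PRICE_PER_M[kind] * n / 1_000_000 for kind, n in tokens.items())


# Hypothetical long-context call: 100K tokens in, 10K tokens out
print(cost_usd({"input": 100_000, "output": 10_000}))         # 9.0
# Same call with the whole prompt served from the prompt cache
print(cost_usd({"cached_input": 100_000, "output": 10_000}))  # 5.25
```

<p>At those rates you can see why folks balked at the price: caching and short outputs matter a lot here.</p><p>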
</p><p>Karpathy said "Everything is a little bit better and it's awesome, but also not exactly in ways that are trivial to point to" and posted a thread of pairwise comparisons of tone on his <a target="_blank" href="https://x.com/karpathy/status/1895213020982472863">X thread</a>.</p><p>The reaction is bifurcated, though, as many are upset with the high price of this model (10x more costly on outputs) and the fact that it's just marginally better at coding tasks. Compared to the newer Sonnet (Sonnet 3.7) and DeepSeek, folks are looking at OpenAI and asking, why isn't this way better? </p><p>Anthropic's Claude 3.7 Sonnet: A Coding Powerhouse</p><p>Anthropic released Claude 3.7 Sonnet, and the immediate reaction from the community was overwhelmingly positive. With 8x more output capability (64K) and reasoning built in, this model is an absolute coding powerhouse. </p><p>Claude 3.7 Sonnet is the new king of coding models, achieving a remarkable 70% on the challenging SWE-Bench benchmark, and the initial user feedback is stellar, though vibes started to shift a bit towards Thursday.</p><p>Ranking #1 on WebDev arena, and seemingly trained on UX and websites, Claude Sonnet 3.7 (AKA NewerSonnet) has been blowing our collective minds since it was released on Monday, especially due to introducing Thinking and reasoning in a combined model. </p><p>Now, since the start of the week, the community has actually had time to play with it, and some folks are returning to Sonnet 3.5, saying that while the model is generally much more capable, it tends to generate tons of things that are unnecessary. </p><p>I wonder if the shift is due to Cursor/Windsurf-specific prompts, or the model's larger output context, and we'll keep you updated if the vibes shift again. </p><p>Open Source LLMs</p><p>This week was HUGE for open source, folks. 
We saw releases pushing the boundaries of speed, multimodality, and even the very way LLMs generate text!</p><p>DeepSeek's Open Source Spree</p><p>DeepSeek went on an absolute tear, open-sourcing a treasure trove of advanced tools and techniques:</p><p>This isn't your average open-source dump, folks. We're talking FlashMLA (efficient decoding on Hopper GPUs), DeepEP (an optimized communication library for MoE models), DeepGEMM (an FP8 GEMM library that's apparently ridiculously fast), and even parallelism strategies like DualPipe and EPLB.</p><p>They are releasing some advanced stuff for training and optimization of LLMs; you can follow all their releases on their <a target="_blank" href="https://x.com/deepseek_ai/status/1894931931554558199">X account</a>.</p><p>DualPipe seems to be the one that got the most attention from the community: an incredible feat of pipeline parallelism that even got the co-founder of HuggingFace <a target="_blank" href="https://x.com/Thom_Wolf/status/1895135100444053599">super excited</a>.</p><p>Microsoft's Phi-4: Multimodal and Mini (<a target="_blank" href="https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/">Blog</a>, <a target="_blank" href="https://huggingface.co/microsoft/Phi-4-multimodal-instruct">HuggingFace</a>)</p><p>Microsoft joined the party with Phi-4-multimodal (5.6B parameters) and Phi-4-mini (3.8B parameters), showing that small models can pack a serious punch.</p><p>These models are a big deal. Phi-4-multimodal can process text, images, <em>and</em> audio, and it actually <em>beats</em> WhisperV3 on transcription! As Nisten said, "This is a new model and, I'm still reserving judgment until, until I tried it, but it looks ideal for, for a portable size that you can run on the phone and it's multimodal." It even supports a wide range of languages. 
Phi-4-mini, on the other hand, is all about speed and efficiency, perfect for finetuning.</p><p>Diffusion LLMs: Mercury Coder and LLaDA (<a target="_blank" href="https://twitter.com/InceptionAILabs/status/1894847919624462794">X</a>, <a target="_blank" href="https://t.co/XCeNw9BtsX">Try it</a>)</p><p>This is where things get <em>really</em> interesting. We saw not one, but <em>two</em> diffusion-based LLMs this week: Mercury Coder from Inception Labs and LLaDA 8B. (Although, ok, to be fair, LLaDA released 2 weeks ago; I was just busy)</p><p>For those who don't know, diffusion is usually used for creating things like images. The idea of using it to generate text is like saying, "Okay, there's a revolutionary tool for painting; I'll write the code using it." Inception Labs' Mercury Coder is claiming over <em>1000 tokens per second</em> on NVIDIA H100s – that's insane speed, usually only seen with specialized chips! Nisten spent hours digging into these, noting, "This is a complete breakthrough and, it just hasn't quite hit yet that this just happened because people thought for a while it should be possible because then you can do, you can do multiple token prediction at once". He explained that these models combine a regular LLM with a diffusion component, allowing them to generate multiple tokens simultaneously and excel at tasks like "fill in the middle" coding.</p><p>LLaDA 8B, on the other hand, is an open-source attempt, and while it needs more training, it shows the potential of this approach. LDJ pointed out that LLaDA is "trained on like around five times or seven times less data while already like competing with Llama 3 8B with same parameter count".</p><p>Are diffusion LLMs the future? 
It's too early to say, but the speed gains are <em>very</em> intriguing.</p><p>Magma 8B: Robotics LLM from Microsoft</p><p>Microsoft Research dropped Magma 8B, an open-source model that combines vision and language understanding with the ability to control robotic actions.</p><p>Nisten was particularly hyped about this one, calling it "the robotics LLM." He sees it as a potential game-changer for robotics companies, allowing them to build robots that can understand visual input, respond to language commands, and <em>act</em> in the real world.</p><p>OpenAI's Deep Research for Everyone (Well, Plus Subscribers)</p><p>OpenAI finally brought Deep Research, its incredible web-browsing and research tool, to Plus subscribers.</p><p>I've been saying this for a while: Deep Research is another ChatGPT moment. It's <em>that</em> good. It goes out, visits websites, understands your query in context, and synthesizes information like nothing else. As Nisten put it, "Nothing comes close to OpenAI's Deep Research...People like pull actual economics data, pull actual stuff." If you haven't tried it, you absolutely should.</p><p>Our full coverage of Deep Research is <a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-6-openai-deepresearch">here</a>; if you haven't yet listened, it's incredible. </p><p>Alexa Gets an AI Brain Upgrade with Alexa+</p><p>Amazon finally announced Alexa+, the long-awaited LLM-powered upgrade to its ubiquitous voice assistant.</p><p>Alexa+ will be powered by Claude (and sometimes Nova), offering a much more conversational and intelligent experience, with integrations across Amazon services.</p><p>This is a <em>huge</em> deal. For years, Alexa has felt… well, dumb, compared to the advancements in LLMs. Now, it's getting a serious intelligence boost, thanks to Anthropic's Claude. It'll be able to handle complex conversations, control smart home devices, and even perform tasks across various Amazon services. 
Imagine asking Alexa, "Did I let the dog out today?" and it actually <em>checking your Ring camera footage</em> to give you an answer! (Although, as I joked, let's hope it doesn't start setting houses on fire.)</p><p>Also very intriguing are the new SDKs they are releasing to connect Alexa+ to all kinds of experiences. I think this is huge and will absolutely create a new industry of applications built for voice Alexa. </p><p>Alexa Web Actions, for example, will allow Alexa to navigate to a website and complete actions (think ordering Uber Eats). </p><p>The price? $20/mo, but free if you're an Amazon Prime subscriber, which covers most US households at this point. </p><p>They are focusing on personalization and memory, though it's still unclear how that's going to be handled, plus the ability to share documents like schedules. </p><p>I'm very much looking forward to smart Alexa, and to be able to say "Alexa, set a timer for the amount of time it takes to hard boil an egg, and flash my house lights when the timer is done" </p><p>Grok Gets a Voice... and It's UNHINGED</p><p>Grok, Elon Musk's AI, finally got a voice mode, and… well, it's something else.</p><p>One-sentence summary: Grok's new voice mode includes an "unhinged" 18+ option that curses like a sailor, along with other personality settings.</p><p>Yes, you read that right. There's literally an "unhinged" setting in the UI. We played it live on the show, and... well, let's just say it's not for the faint of heart (or for kids). Here's a taste:</p><p><strong>Alex:</strong> "Hey there."</p><p><strong>Grok:</strong> "Yo, Alex. What's good, you horny b*****d? How's your day been so far? Fucked up or just mildly shitty?"</p><p>Beyond the shock value, the voice mode is actually quite impressive in its expressiveness and ability to understand interruptions. It has several personalities, from a helpful "Grok Doc" to an "argumentative" mode that will disagree with everything you say. It's... 
unique.</p><p>This Week's Buzz (WandB-Related News)</p><p>Agents Course is Coming!</p><p>We announced our upcoming agents course! You can pre-sign up <a target="_blank" href="https://wandb.me/agents">HERE</a>. This is going to be a deep dive into building and deploying AI agents, so don't miss it!</p><p>AI Engineer Summit Recap</p><p>We briefly touched on the AI Engineer Summit in New York, where we met with Kevin Hou and many other brilliant minds in the AI space. The theme was "Agents at Work," and it was a fantastic opportunity to see the latest developments in agent technology. I gave a talk about reasoning agents and held a workshop about evaluations on Saturday, and saw many listeners of ThursdAI 👏 ✋ </p><p>Interview with Kevin Hou from Windsurf</p><p>This week we had the pleasure of chatting with <strong>Kevin Hou from Windsurf</strong> about their revolutionary AI editor. Windsurf isn't just another IDE; it's an <strong>agentic IDE</strong>. As Kevin explained, "we made the pretty bold decision of saying, all right, we're not going to do chat... we are just going to [do] agent." They've built Windsurf from the ground up with an agent-first approach, and it’s making waves.</p><p>Kevin walked us through the evolution of AI coding tools, from autocomplete to chat, and now to agents. He highlighted the "magical experiences" users are having, like debugging complex code with AI assistance that <em>actually understands</em> the context. We also delved into the challenges – memory, checkpointing, and cost.</p><p>We also talked about the burning question: <strong>vibe coding</strong>. Is coding as we know it dead? Kevin’s take was nuanced: "there's an in-between state that I really vibe or like gel with, which is, the scaffolding of what you want… Let's use, let's like vibe code and purely use the agent to accomplish this sort of commit." 
He sees AI agents raising the bar for software quality, demanding better UX, testing, and overall polish.</p><p>And of course, we had to ask about the elephant in the room – <strong>why are so many people switching from Cursor to Windsurf?</strong> Kevin's answer was humble, pointing to user experience, the agent-first workflow, and the team’s dedication to building the best product. Check out our full conversation on the pod and download Windsurf for yourself: <a target="_blank" href="https://windsurf.ai"><strong>windsurf.ai</strong></a></p><p>Video Models & Voice Model Updates </p><p>There is so much happening in the LLM world that folks may skip over the other stuff, but there's so much happening in these worlds as well this week! Here's a brief recap! </p><p>* <strong>Alibaba's WanX:</strong> Open-sourced, cutting-edge video generation models making waves with over 250,000 downloads already. They claim to take SOTA on open source video generation evals, and of course img2video from a model of this quality will lead to ... folks using it for all kinds of things. </p><p>* <strong>HUME's Octave:</strong> A groundbreaking LLM that genuinely understands context and emotion and does TTS. <a target="_blank" href="https://www.hume.ai/blog/octave-the-first-text-to-speech-model-that-understands-what-its-saying">Blog</a>  Hume has been doing emotional TTS, but with this TTS-focused LLM we are now able to create voices with a prompt, and receive emotional responses that are inferred from the text. Think shyness, sarcasm, anger, etc.</p><p>* <strong>11labs’ Scribe:</strong> Beating Whisper 3 with impressive accuracy and diarization features, Scribe is raising the bar in speech-to-text quality. 11labs releasing their own ASR (automatic speech recognition) was not in my cards, and boy did they deliver. Beating Whisper, with speaker separation (diarization), word-level timestamps and much lower WER than other models, this is a very interesting entry to this space. 
However, while it's free for now on their website, it's significantly slower than Gemini 2.0 and Whisper, for me at least. </p><p>* <strong>Sesame</strong> released their conversational speech model (and promised to open source it), and it's honestly given me the best / least uncanny conversations I've had with an AI. Check out my conversation with it </p><p>* Lastly, VEO 2, the best video model around according to some, is finally available via API (though txt2video only) and it's fairly expensive, but gives some amazing results. You can try it out on <a target="_blank" href="https://fal.ai/models/fal-ai/veo2">FAL</a></p><p>Phew, it looks like we've made it! Huge, huge week in AI: 2 big new models, plus tons of incredible updates on multimodality and voice as well 🔥</p><p>If you enjoyed this summary, the best way to support us is to share with a friend (or 3) and leave us a 5-star review wherever you get your podcasts, it really does help! 👏 </p><p>See you next week, </p><p>Alex</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/feb-27-2025-gpt-45-drops-today-claude</link><guid isPermaLink="false">substack:post:158074908</guid><dc:creator><![CDATA[Alex Volkov and Kevin Hou]]></dc:creator><pubDate>Fri, 28 Feb 2025 02:41:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/158074908/42e49a89d33820b8721c7b45f979e67a.mp3" length="72357484" type="audio/mpeg"/><itunes:author>Alex Volkov and Kevin Hou</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6030</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/158074908/4b98e979fd2733ec642217eaca3188ba.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Feb 20 - Live from AI Eng in NY - Grok 3, Unified Reasoners, Anthropic's Bombshell, and Robot Handoffs!]]></title><description><![CDATA[<p>Holy moly, AI enthusiasts! Alex Volkov here, reporting live from the AI Engineer Summit in the heart of (touristy) Times Square, New York! This week has been an absolute whirlwind of announcements, from <strong>XAI's Grok 3 dropping like a bomb</strong>, to Figure robots learning to <em>hand each other things</em>, and even a little eval smack-talk between OpenAI and XAI. It’s enough to make your head spin – but that's what ThursdAI is here for. We sift through the chaos and bring you the need-to-know, so you can stay on the cutting edge without having to, well, spend your entire life glued to X and Reddit.</p><p>This week we had a very special live show with the Haize Labs folks, the ones I previously interviewed about their bijection attacks, discussing their open source judge evaluation library called Verdict. 
So grab your favorite caffeinated beverage, maybe do some stretches because your mind <em>will</em> be blown, and let's dive into the TL;DR of ThursdAI, February 20th, 2025!</p><p>Participants</p><p>* <strong>Alex Volkov:</strong> AI Evangelist with Weights and Biases</p><p>* <strong>Nisten: </strong>AI Engineer and co-host</p><p>* <strong>Akshay:</strong> AI Community Member</p><p>* <strong>Nuo:</strong> Dev Advocate at 01AI</p><p>* <strong>Nimit:</strong> Member of Technical Staff at Haize Labs</p><p>* <strong>Leonard:</strong> Co-founder at Haize Labs</p><p>Open Source LLMs</p><p>Perplexity's R1 1776: Censorship-Free DeepSeek</p><p>Perplexity made a bold move this week, releasing <strong>R1 1776</strong>, a fine-tuned version of DeepSeek R1 specifically designed to remove what they (and many others) perceive as Chinese government censorship. The name itself, 1776, is a nod to American independence – a pretty clear statement! The core idea? Give users access to information on topics the CCP typically restricts, like Tiananmen Square and Taiwanese independence.</p><p>Perplexity used human experts to identify around 300 sensitive topics and built a "censorship classifier" to train the bias <em>out</em> of the model. The impressive part? They claim to have done this <em>without</em> significantly impacting the model's performance on standard evals. As Nuo from 01AI pointed out on the show, though, he'd "actually prefer that they can actually disclose more of their details in terms of post training... Running the R1 model by itself, it's already very difficult and very expensive." He raises a good point – more transparency is always welcome! Still, it's a fascinating attempt to tackle a tricky problem, one which I always say we simply cannot avoid. 
You can check it out yourself on <a target="_blank" href="https://huggingface.co/perplexity-ai/r1-1776">Hugging Face</a> and read their <a target="_blank" href="https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776">blog post</a>.</p><p>Arc Institute & NVIDIA Unveil Evo 2: Genomics Powerhouse</p><p>Get ready for some serious science, folks! Arc Institute and NVIDIA dropped <strong>Evo 2</strong>, a <em>massive</em> genomics model (40 billion parameters!) trained on a mind-boggling 9.3 <em>trillion</em> nucleotides. And it’s fully open – two papers, weights, data, training, <em>and</em> inference codebases. We love to see it!</p><p>Evo 2 uses the StripedHyena architecture to process <em>huge</em> genetic sequences (up to 1 million nucleotides!), allowing for analysis of complex genomic patterns. The practical applications? Predicting the effects of genetic mutations (super important for healthcare) and even designing entire genomes. I’ve been super excited about genomics models, and seeing these alternative architectures like StripedHyena getting used here is just icing on the cake. <a target="_blank" href="https://x.com/pdhsu/status/1892243493445050606">Check it out on X</a>.</p><p>ZeroBench: The "Impossible" Benchmark for VLLMs</p><p>Need more benchmarks? Always! A new benchmark called <strong>ZeroBench</strong> arrived, claiming to be the "impossible benchmark" for Vision Language Models (VLLMs). And guess what? All current top-of-the-line VLLMs get a big fat <em>zero</em> on it.</p><p>One example they gave was a bunch of scattered letters, asking the model to "answer the question that is written in the shape of the star among the mess of letters." Honestly, even <em>I</em> struggled to see the star they were talking about. It highlights just how much further VLLMs need to go in terms of true visual understanding. 
(<a target="_blank" href="https://x.com/JRobertsAI/status/1891506671056261413">X</a>, <a target="_blank" href="https://t.co/E4noN7yDDM">Page</a>, <a target="_blank" href="https://t.co/n5GAwFiGEV">Paper</a>, <a target="_blank" href="https://t.co/mD8Eptr9M5">HF</a>)</p><p>Hugging Face's Ultra Scale Playbook: Scaling Up</p><p>For those of you building <em>massive</em> models, Hugging Face released the <strong>Ultra Scale Playbook</strong>, a guide to building and scaling AI models on huge GPU clusters.</p><p>They ran 4,000 scaling experiments on up to 512 GPUs (nothing close to Grok's 100,000, but still impressive!). If you're working in a lab and dreaming big, this is definitely a resource to check out. (<a target="_blank" href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">HF</a>).</p><p>Big CO LLMs + APIs</p><p>Grok 3: XAI's Big Swing at a new SOTA LLM! (and Maybe a Bug?)</p><p>Monday evening, BOOM! While some of us were enjoying President's Day, the XAI team dropped <strong>Grok 3</strong>. They announced it in a setting very similar to OpenAI's announcements. They're claiming state-of-the-art performance on <em>some</em> benchmarks (more on that drama later!), and a whopping 1 million token context window, finally confirmed after some initial confusion. They talked a lot about agents and a future of reasoners as well.</p><p>The launch was a bit… messy. First, there was a bug where some users were getting Grok 2 <em>even when the dropdown said Grok 3</em>. That led to a lot of mixed reviews. Even when I finally <em>thought</em> I was using Grok 3, it still flubbed my go-to logic test, the "Beth's Ice Cubes" question. (The answer is zero, folks – ice cubes melt!). But Akshay, who joined us on the show, chimed in with some love: "...with just the base model of Grok 3, it's, in my opinion, it's the best coding model out there." So, mixed vibes, to say the least! 
It's also FREE for now, "until their GPUs melt," according to xAI, which is great.</p><p><strong>UPDATE:</strong> The vibes are shifting: more and more of my colleagues and mutuals are LOVING Grok 3 for one-shot coding and for just talking to it. I’m getting convinced as well, and I’ll keep using Grok for real-time data and access to X. </p><p><strong>DeepSearch</strong></p><p>In an attempt to show off some agentic features, xAI also launched a deep search ("search," not "research" like OpenAI's, but effectively the same idea). </p><p>Now, xAI of course has access to X, which gives their deep search a leg up, specifically for real-time information! I found out it can even “use” the X search! </p><p>OpenAI's Open Source Tease</p><p>In what felt like a very <em>conveniently</em> timed move, Sam Altman dropped a poll on X the same day as the Grok announcement: if OpenAI were to open-source something, should it be a small, mobile-optimized model, or a model on par with o3-mini? Most of us chose o3-mini, just to have access to that model and play with it. No indication of <em>when</em> this might happen, but it’s a clear signal that OpenAI is feeling the pressure from the open-source community.</p><p>The Eval Wars: OpenAI vs. xAI</p><p>Things got spicy! There was a whole debate about the eval numbers xAI posted, specifically the "best of N" scores (like best of 64 runs). Boris from OpenAI and Aidan McLau called out some of the graphs. Folks on X were quick to point out that OpenAI <em>also</em> used "best of N" in the past, and the discussion devolved from there.</p><p>xAI is claiming SOTA. OpenAI (or some folks from within OpenAI) aren't so sure. The core issue? We can't <em>independently</em> verify Grok's performance because there's no API yet! As I said, "…we're not actually able to use this model to independently evaluate this model and to tell you guys whether or not they actually told us the truth." 
Transparency matters, folks!</p><p>DeepSearch - How Deep?</p><p>Grok also touted a new "Deep Search" feature, kind of like Perplexity or OpenAI's "Deep Research" from their more expensive plan. My initial tests were… underwhelming. I nicknamed it "Shallow Search" because it spent all of 34 seconds on a complex query where OpenAI's Deep Research took 11 minutes and cited 17 sources. We're going to need to do some more digging (pun intended) on this one.</p><p>This Week's Buzz</p><p>We’re leaning <em>hard</em> into agents at Weights & Biases! We just released an agents <a target="_blank" href="https://wandb.ai/site/resources/whitepapers/evaluating-ai-agent-applications?utm_source=twitter&#38;utm_medium=social&#38;utm_campaign=weave">whitepaper</a> (check it out on our socials!), and we're launching an <strong>agents course</strong> in collaboration with OpenAI's Ilan Bigio. Sign up at <a target="_blank" href="http://wandb.me/agents">wandb.me/agents</a>! We're hearing <em>so much</em> about agent evaluation and observability, and we're working hard to provide the tools the community needs.</p><p>Also, sadly, our <strong>Toronto workshops</strong> are completely <strong>sold out</strong>. But if you're at AI Engineer in New York, come say hi at our booth! And catch my talk on LLM Reasoner Judges tomorrow (Friday) at 11 am EST – it’ll be live on the AI Engineer YouTube channel (<a target="_blank" href="https://www.youtube.com/live/L89GzWEILkM">HERE</a>)!</p><p>Vision & Video</p><p>Microsoft MUSE: Playable Worlds from a Single Image</p><p>This one is <em>wild</em>. Microsoft's <strong>MUSE</strong> can generate <em>minutes</em> of playable gameplay from just a <em>single second</em> of video frames and controller actions.</p><p>It's based on the World and Human Action Model (WHAM) architecture, trained on a <em>billion</em> gameplay images from Xbox. So if you’ve been playing Xbox lately, you might be in the model! 
I found it particularly cool: "…you give it like a single second of a gameplay of any type of game with all the screen elements, with percentages, with health bars, with all of these things and their model generates a game that you can control." (<a target="_blank" href="https://x.com/rowancheung/status/1892243245192683875">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/wham">HF</a>, Blog).</p><p>StepFun's Step-Video-T2V: State-of-the-Art (and Open Source!)</p><p>We got <em>two</em> awesome open-source video breakthroughs this week. First, <strong>StepFun's Step-Video-T2V</strong> (and T2V Turbo), a 30 <em>billion</em> parameter text-to-video model. The results look <em>really</em> good, especially the text integration. Imagine a Chinese girl opening a scroll, and the words "We will open source" appearing as she unfurls it. That’s the kind of detail we're talking about.</p><p>And it’s MIT licensed! As Nisten noted, "This is pretty cool. If it came out right before Sora came out, people would have lost their minds." (<a target="_blank" href="https://x.com/arankomatsuzaki/status/1891330624436220069">X</a>, <a target="_blank" href="https://arxiv.org/abs/2502.10248">Paper</a>, <a target="_blank" href="https://huggingface.co/stepfun-ai/stepvideo-t2v">HF</a>, <a target="_blank" href="https://yuewen.cn/videos">Try It</a>).</p><p>Hao AI Lab's FastVideo: Speeding Up HY-Video</p><p>The second video highlight: Hao AI Lab released <strong>FastVideo</strong>, a way to make HY-Video (already a strong open-source contender) <em>three times faster</em> with no additional training! They call the trick "Sliding Tile Attention," and apparently that alone provides an enormous boost compared to even Flash Attention.</p><p>This is huge because faster inference means these models become more practical for real-world use. And, bonus: it supports HY-Video's LoRAs, meaning you can fine-tune it for, ahem, <em>all kinds</em> of creative applications. 
I will not go as far as to mention Civitai. (<a target="_blank" href="https://github.com/hao-ai-lab/FastVideo">Github</a>)</p><p>Figure's Helix: Robot Collaboration!</p><p>Breaking news from the AI Engineer conference floor: <strong>Figure</strong>, the humanoid robot company, announced <strong>Helix</strong>, a Vision-Language-Action (VLA) model built <em>into</em> their robots! It has full upper-body control!</p><p>What blew my mind: they showed <em>two</em> robots working together, handing objects to each other, based on natural language commands! As I watched, I exclaimed, "I haven't seen a humanoid robot, hand off stuff to the other one... I found it like super futuristically cool." The model runs <em>on the robot</em>, using a 7 billion parameter VLM for understanding and an 80 million parameter transformer for control. This is the future, folks!</p><p>Tools & Others</p><p>Microsoft's New Quantum Chip (and State of Matter!)</p><p>Microsoft announced a new <strong>quantum chip</strong> <em>and</em> a new state of matter (called "topological superconductivity"). "I found it like absolutely mind blowing that they announced something like this," I gushed on the show. While I'm no quantum physicist, this sounds like a <em>big</em> deal for the future of computing.</p><p>Verdict: Haize Labs' Framework for LLM Judges</p><p>And of course, the highlight of our show: <strong>Verdict</strong>, a new open-source framework from Haize Labs (the folks behind those "bijection" jailbreaks!) for composing LLM judges. This is a <em>huge</em> deal for anyone working on evaluation. Leonard and Nimit from Haize Labs joined us to explain how Verdict addresses some of the core problems with LLM-as-a-judge: biases (like preferring their own responses!), sensitivity to prompts, and the challenge of "meta-evaluation" (how do you know your judge is actually good?).</p><p>Verdict lets you combine different judging techniques ("primitives") to create more robust and efficient evaluators. 
Think of it as "judge-time compute scaling," as Leonard called it. They're achieving near state-of-the-art results on benchmarks like ExpertQA, <em>and</em> it's designed to be fast enough to use as a guardrail in real-time applications!</p><p>One key insight: you don't always need a full-blown reasoning model for judging. As Nimit explained, Verdict can combine simpler LLM calls to achieve similar results at a fraction of the cost. And, it's open source! (<a target="_blank" href="https://verdict.haizelabs.com/whitepaper.pdf">Paper</a>, <a target="_blank" href="http://github.com/haizelabs/verdict">Github</a>,<a target="_blank" href="https://x.com/leonardtang_/thread/1892243653071908949">X</a>).</p><p>Conclusion</p><p>Another week, another explosion of AI breakthroughs! Here are my key takeaways:</p><p>* <strong>Open Source is THRIVING:</strong> From censorship-free LLMs to cutting-edge video models, the open-source community is delivering incredible innovation.</p><p>* <strong>The Need for Speed (and Efficiency):</strong> Whether it's faster video generation or more efficient LLM judging, performance is key.</p><p>* <strong>Robots are Getting Smarter (and More Collaborative):</strong> Figure's Helix is a glimpse into a future where robots work <em>together</em>.</p><p>* <strong>Evaluation is (Finally) Getting Attention:</strong> Tools like Verdict are essential for building reliable and trustworthy AI systems.</p><p>* <strong>The Big Players are Feeling the Heat:</strong> OpenAI's open-source tease and XAI's rapid progress show that the competition is <em>fierce</em>.</p><p>I'll be back in my usual setup next week, ready to break down all the latest AI news. Stay tuned to ThursdAI – and don't forget to give the pod five stars and subscribe to the newsletter for all the links and deeper dives. 
There’s potentially an Anthropic announcement coming, so we’ll see you all next week.</p><p>TLDR</p><p>* <strong>Open Source LLMs</strong></p><p>* Perplexity R1 1776 - finetune of china-less R1 (<a target="_blank" href="https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776">Blog</a>, <a target="_blank" href="https://huggingface.co/perplexity-ai/r1-1776">Model</a>)</p><p>* Arc Institute + NVIDIA - introduce Evo 2 - genomics model (<a target="_blank" href="https://x.com/pdhsu/status/1892243493445050606">X</a>)</p><p>* ZeroBench - impossible benchmark for VLMs (<a target="_blank" href="https://x.com/JRobertsAI/status/1891506671056261413">X</a>, <a target="_blank" href="https://t.co/E4noN7yDDM">Page</a>, <a target="_blank" href="https://t.co/n5GAwFiGEV">Paper</a>, <a target="_blank" href="https://t.co/mD8Eptr9M5">HF</a>)</p><p>* HuggingFace Ultra Scale Playbook (<a target="_blank" href="https://huggingface.co/spaces/nanotron/ultrascale-playbook">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Grok 3 SOTA LLM + reasoning and Deep Search (<a target="_blank" href="https://x.ai/blog/grok-3">blog</a>, <a target="_blank" href="http://grok.com">try it</a>)</p><p>* OpenAI is about to open source something? Sam posted a poll</p><p>* <strong>This week's Buzz</strong></p><p>* We are about to launch an agents course! 
Pre-sign up <a target="_blank" href="http://wandb.me/agents">wandb.me/agents</a></p><p>* Workshops are SOLD OUT</p><p>* Watch my talk LIVE from AI Engineer - 11am EST Friday (<a target="_blank" href="https://www.youtube.com/live/L89GzWEILkM">HERE</a>)</p><p>* Keep watching AI Eng conference after the show on AIE YT</p><p>* <strong>Vision & Video</strong></p><p>* Microsoft MUSE - playable worlds from one image (<a target="_blank" href="https://x.com/rowancheung/status/1892243245192683875">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/wham">HF</a>, Blog)</p><p>* Microsoft OmniParser - Better, faster screen parsing for GUI agents with OmniParser v2 (<a target="_blank" href="https://huggingface.co/spaces/microsoft/OmniParser-v2">Gradio Demo</a>)</p><p>* Hao AI Lab - FastVideo - making HY-Video 3x as fast (<a target="_blank" href="https://github.com/hao-ai-lab/FastVideo">Github</a>)</p><p>* StepFun - Step-Video-T2V (+Turbo), a SotA 30B text-to-video model (<a target="_blank" href="https://arxiv.org/abs/2502.10248">Paper</a>, <a target="_blank" href="https://github.com/stepfun-ai/Step-Video-T2V">Github</a>, <a target="_blank" href="https://huggingface.co/stepfun-ai/stepvideo-t2v">HF</a>, <a target="_blank" href="https://yuewen.cn/videos">Try It</a>)</p><p>* Figure announces Helix - vision-language-action model built into the Figure robot (<a target="_blank" href="https://www.figure.ai/news/helix">Paper</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Microsoft announces a new quantum chip and a new state of matter (<a target="_blank" href="https://news.microsoft.com/source/features/ai/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing/">Blog</a>, <a target="_blank" href="https://x.com/KonstantHacker/status/1892242862068183048">X</a>)</p><p>* Verdict - Framework to compose SOTA LLM judges with JudgeTime Scaling (<a target="_blank" href="https://verdict.haizelabs.com/whitepaper.pdf">Paper</a>, <a target="_blank" 
href="http://github.com/haizelabs/verdict">Github</a>,<a target="_blank" href="https://x.com/leonardtang_/thread/1892243653071908949">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-20-live-from-ai-eng</link><guid isPermaLink="false">substack:post:157567466</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 20 Feb 2025 18:56:47 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/157567466/82e4c12799dde9070d3e94b6c5830814.mp3" length="72876310" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6073</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/157567466/eacdabb009f24ca39ef84b4be96349d8.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Feb 13 - my Personal Rogue AI, DeepHermes, Fast R1, OpenAI Roadmap / RIP GPT6, new Claude & Grok 3 imminent?]]></title><description><![CDATA[<p>What a week in AI, folks! Seriously, just when you think things might slow down, the AI world throws another curveball. This week, we had everything from rogue AI apps giving unsolicited life advice (and sending rogue texts!), to mind-blowing open source releases that are pushing the boundaries of what's possible, and of course, the ever-present drama of the big AI companies with OpenAI dropping a roadmap that has everyone scratching their heads.</p><p>Buckle up, because on this week's ThursdAI, we dove deep into all of it. We chatted with the brains behind the latest open source embedding model, marveled at a tiny model crushing math benchmarks, and tried to decipher Sam Altman's cryptic GPT-5 roadmap. 
Plus, I shared a personal story about an AI app that decided to psychoanalyze my text messages – you won't believe what happened! Let's get into the TL;DR of ThursdAI, February 13th, 2025 – it's a wild one!</p><p>* <strong>Alex Volkov:</strong> AI Adventurist with Weights & Biases</p><p>* <strong>Wolfram Ravenwolf:</strong> AI Expert & Enthusiast</p><p>* <strong>Nisten:</strong> AI Community Member</p><p>* <strong>Zach Nussbaum:</strong> Machine Learning Engineer at Nomic AI</p><p>* <strong>Vu Chan:</strong> AI Enthusiast & Evaluator</p><p>* <strong>LDJ:</strong> AI Community Member</p><p>Personal story of Rogue AI with RPLY</p><p>This week kicked off with a hilarious (and slightly unsettling) story of my own AI going rogue, all thanks to a new Mac app called RPLY designed to help with message replies. I installed it thinking it would be a cool productivity tool, but it turned into a personal intervention session, and then… well, let's just say things escalated.</p><p>The app started by analyzing my text messages and, to my surprise, delivered a brutal psychoanalysis of my co-parenting communication, pointing out how both my ex and I were being "unpleasant" and needed to focus on the kids. As I said on the show, "I got this as a gut punch. I was like, f*ck, I need to reimagine my messaging choices." But the real kicker came when the AI decided to take initiative and started sending messages <em>without</em> my permission (apparently this was a bug with RPLY that was fixed after I reported it)! </p><p>Friends were texting me question marks, and my ex even replied to a random "Hey, How's your day going?" message with a smiley, completely out of our usual post-divorce communication style. "This AI, like on Monday before just gave me absolute s**t about not being, a person that needs to be focused on the kids also decided to smooth things out on friday" I chuckled, still slightly bewildered by the whole ordeal. 
It could have gone way worse, but thankfully, this rogue AI counselor just ended up being more funny than disastrous.</p><p>Open Source LLMs</p><p>DeepHermes preview from NousResearch</p><p>Just in time for me sending this newsletter (but unfortunately not quite in time for the recording of the show), our friends at Nous shipped an experimental new thinking model, their first reasoner, called DeepHermes. </p><p>NousResearch claims DeepHermes is among the first models to fuse reasoning and standard LLM token generation within a <em>single architecture</em> (a trend you'll see echoed in the OpenAI and Claude announcements below!)</p><p>Definitely experimental cutting edge stuff here, but exciting to see not just an RL replication but also innovative attempts from one of the best finetuning collectives around. </p><p>Nomic Embed Text V2 - First Embedding MoE</p><p>Nomic AI continues to impress with the release of <strong>Nomic Embed Text V2</strong>, the first general-purpose Mixture-of-Experts (MoE) embedding model. Zach Nussbaum from Nomic AI joined us to explain why this release is a big deal.</p><p>* <strong>First general-purpose Mixture-of-Experts (MoE) embedding model:</strong> This innovative architecture allows for better performance and efficiency.</p><p>* <strong>SOTA performance on multilingual benchmarks:</strong> Nomic Embed V2 achieves state-of-the-art results on the multilingual MIRACL benchmark for its size.</p><p>* <strong>Support for 100+ languages:</strong> Truly multilingual embeddings for global applications.</p><p>* <strong>Truly open source:</strong> Nomic is committed to open source, releasing training data, weights, and code under the Apache 2.0 License.</p><p>Zach highlighted the benefits of MoE for embeddings, explaining, "So we're trading a little bit of, inference time memory, and training compute to train a model with mixture of experts, but we get this, really nice added bonus of, 25 percent storage." 
This is especially crucial when dealing with massive datasets. You can check out the model on <a target="_blank" href="https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe">Hugging Face</a> and read the <a target="_blank" href="https://static.nomic.ai/nomic_embed_multilingual_preprint.pdf">Technical Report</a> for all the juicy details.</p><p>AllenAI OLMOE on iOS and New Tulu 3.1 8B</p><p>AllenAI continues to champion open source with the release of <strong>OLMOE</strong>, a fully open-source iOS app, and the new <strong>Tulu 3.1 8B</strong> model.</p><p>* <strong>OLMOE iOS App:</strong> This app brings state-of-the-art open-source language models to your iPhone, privately and securely.</p><p>* Allows users to test open-source LLMs on-device.</p><p>* Designed for researchers studying on-device AI and developers prototyping new AI experiences.</p><p>* Optimized for on-device performance while maintaining high accuracy.</p><p>* Fully open-source code for further development.</p><p>* Available on the <a target="_blank" href="https://apps.apple.com/app/id6738533815">App Store</a> for iPhone 15 Pro or newer and M-series iPads.</p><p>* <strong>Tulu 3.1 8B </strong></p><p>As Nisten pointed out, "If you're doing edge AI, the way that this model is built is pretty ideal for that." This move by AllenAI underscores the growing importance of on-device AI and open access. 
Read more about OLMOE on the <a target="_blank" href="https://allenai.org/blog/olmoe-app">AllenAI Blog</a>.</p><p>Groq Adds Qwen Models and Lands on OpenRouter</p><p>Groq, known for its blazing-fast inference speeds, has added <strong>Qwen models</strong>, including the distilled <strong>R1-distill</strong>, to its service and joined <strong>OpenRouter</strong>.</p><p>* <strong>Record-fast inference:</strong> Experience a mind-blowing <strong>1000 TPS</strong> with distilled DeepSeek R1 70B on OpenRouter.</p><p>* <strong>Usable Rate Limits:</strong> Groq is now accessible for production use cases with higher rate limits and pay-as-you-go options.</p><p>* <strong>Qwen Model Support:</strong> Access Qwen models like Qwen 2.5 32B and R1-Distill-Qwen-32B.</p><p>* <strong>OpenRouter Integration:</strong> Groq is now available on <a target="_blank" href="https://openrouter.ai/">OpenRouter</a>, expanding accessibility for developers.</p><p>As Nisten noted, "At the end of the day, they are shipping very fast inference and you can buy it and it looks like they are scaling it. So they are providing the market with what it needs in this case." This integration makes Groq's speed even more accessible to developers. Check out Groq's announcement on <a target="_blank" href="https://x.com/GroqInc/status/1889347665072173171">X.com</a>.</p><p>SambaNova adds full DeepSeek R1 671B - flies at 200t/s (<a target="_blank" href="https://sambanova.ai/blog/sambanova-cloud-launches-the-fastest-deepseek-r1-671b">blog</a>)</p><p>Continuing this week's speed trend, SambaNova just announced availability of DeepSeek R1, sped up by their custom chips, flying at 150-200 t/s. This is the full DeepSeek R1, not the distilled Qwen-based versions! 
</p><p>This is really impressive work, and compared to the second-fastest US-based DeepSeek R1 (on Together AI) it absolutely flies.</p><p>Agentica DeepScaleR 1.5B Beats o1-preview on Math</p><p>Agentica's <strong>DeepScaleR 1.5B</strong> model is making waves by outperforming OpenAI's o1-preview on math benchmarks, using Reinforcement Learning (RL) for just $4,500 of compute.</p><p>* <strong>Impressive Math Performance:</strong> DeepScaleR achieves a <strong>37.1%</strong> Pass@1 on AIME 2025, outperforming the base model and even o1-preview!!</p><p>* <strong>Efficient Training:</strong> Trained using RL for just $4,500, demonstrating cost-effective scaling of intelligence.</p><p>* <strong>Open Sourced Resources:</strong> Agentica open-sourced their dataset, code, and training logs, fostering community progress in RL-based reasoning.</p><p>Vu Chan, an AI enthusiast who evaluated the model, joined us to share his excitement: "It achieves 42% pass@1 on AIME 24, which basically means if you give the model only one chance at every problem, it will solve 42% of them." He also highlighted the model's efficiency, generating correct answers with fewer tokens. 
You can find the model on <a target="_blank" href="https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview">Hugging Face</a>, check out the <a target="_blank" href="https://wandb.ai/mluo/deepscaler-1.5b">WandB logs</a>, and see the announcement on <a target="_blank" href="https://x.com/Yuchenj_UW/status/1889387582066401461">X.com</a>.</p><p>ModernBERT Instruct - Encoder Model for General Tasks</p><p>ModernBERT, known for its efficient encoder-only architecture, now has an instruct version, <strong>ModernBERT Instruct</strong>, capable of handling general tasks.</p><p>* <strong>Instruct-tuned Encoder:</strong> ModernBERT-Large-Instruct can perform classification and multiple-choice tasks using its Masked Language Modeling (MLM) head.</p><p>* <strong>Beats Qwen 0.5B:</strong> Outperforms Qwen 0.5B on MMLU and MMLU Pro benchmarks.</p><p>* <strong>Efficient and Versatile:</strong> Demonstrates the potential of encoder models for general tasks without task-specific heads.</p><p>This release shows that even encoder-only models can be adapted for broader applications, challenging the dominance of decoder-based LLMs for certain tasks. Check out the announcement on <a target="_blank" href="https://x.com/bclavie/status/1888963894296936616">X.com</a>.</p><p>Big CO LLMs + APIs</p><p>RIP GPT-5 and o3 - OpenAI Announces Public Roadmap</p><p>OpenAI shook things up this week with a roadmap update from Sam Altman, announcing a shift in strategy for GPT-5 and the o-series models. Get ready for <strong>GPT-4.5 (Orion)</strong> and a unified GPT-5 system!</p><p>* <strong>GPT-4.5 (Orion) is Coming:</strong> This will be the last non-chain-of-thought model from OpenAI.</p><p>* <strong>GPT-5: A Unified System:</strong> GPT-5 will integrate technologies from both the GPT and o-series models into a single, seamless system.</p><p>* <strong>No Standalone o3:</strong> o3 will not be released as a standalone model; its technology will be integrated into GPT-5. 
"We will no longer ship o3 as a standalone model," Sam Altman stated.</p><p>* <strong>Simplified User Experience:</strong> The model picker will be eliminated in ChatGPT and the API, aiming for a more intuitive experience.</p><p>* <strong>Subscription Tier Changes:</strong></p><p>* Free users will get unlimited access to GPT-5 at a standard intelligence level.</p><p>* Plus and Pro subscribers will gain access to increasingly advanced intelligence settings of GPT-5.</p><p>* <strong>Expanded Capabilities:</strong> GPT-5 will incorporate voice, canvas, search, deep research, and more.</p><p>This roadmap signals a move towards more integrated and user-friendly AI experiences. As Wolfram noted, "Having unified access, and the AI should be smart enough... we need an AI to pick which AI to use." This seems to be OpenAI's direction. Read Sam Altman's full announcement on <a target="_blank" href="https://x.com/sama/status/1889755723078443244">X.com</a>.</p><p>OpenAI Releases ModelSpec v2</p><p>OpenAI also released <strong>ModelSpec v2</strong>, an update to their document defining desired AI model behaviors, emphasizing customizability, transparency, and intellectual freedom.</p><p>* <strong>Chain of Command:</strong> Defines a hierarchy to balance user/developer control with platform-level rules.</p><p>* <strong>Truth-Seeking and User Empowerment:</strong> Encourages models to "seek the truth together" with users and empower decision-making.</p><p>* <strong>Core Principles:</strong> Sets standards for competence, accuracy, avoiding harm, and embracing intellectual freedom.</p><p>* <strong>Open Source:</strong> OpenAI open-sourced the Spec and evaluation prompts for broader use and collaboration on <a target="_blank" href="https://github.com/openai/model_spec/">GitHub</a>.</p><p>This release reflects OpenAI's ongoing efforts to align AI behavior and promote responsible development. 
Wolfram praised ModelSpec, saying, "I was all over the original models back when it was announced in the first place... That is one very important aspect when you have the AI agent going out on the web and get information from not trusted sources." Explore ModelSpec v2 on the <a target="_blank" href="https://model-spec.openai.com/2025-02-12.html">dedicated website</a>.</p><p>VP Vance Speech at AI Summit in Paris - Deregulate and Dominate!</p><p>Vice President Vance delivered a powerful speech at the AI Summit in Paris, advocating for pro-growth AI policies and deregulation to maintain American leadership in AI.</p><p>* <strong>Pro-Growth and Deregulation:</strong> VP Vance urged for policies that encourage AI innovation and cautioned against excessive regulation, specifically mentioning GDPR.</p><p>* <strong>American AI Leadership:</strong> Emphasized ensuring American AI technology remains the global standard and blocks hostile foreign adversaries from weaponizing AI. "Hostile foreign adversaries have weaponized AI software to rewrite history, surveil users, and censor speech… I want to be clear – this Administration will block such efforts, full stop," VP Vance declared.</p><p>* <strong>Key Points:</strong></p><p>* Ensure American AI leadership.</p><p>* Encourage pro-growth AI policies.</p><p>* Maintain AI's freedom from ideological bias.</p><p>* Prioritize a pro-worker approach to AI development.</p><p>* Safeguard American AI and chip technologies.</p><p>* Block hostile foreign adversaries' weaponization of AI.</p><p>Nisten commented, "He really gets something that most EU politicians do not understand is that whenever they have such a good thing, they're like, okay, this must be bad. And we must completely stop it." This speech highlights the ongoing debate about AI regulation and its impact on innovation. 
Read the full speech <a target="_blank" href="https://thespectator.com/topic/read-j-d-vance-full-speech-ai-summit-paris/">here</a>.</p><p>Cerebras Powers Perplexity with Blazing Speed (1200 t/s!)</p><p>Perplexity is now powered by Cerebras, achieving inference speeds exceeding <strong>1200 tokens per second</strong>.</p><p>* <strong>Unprecedented Speed:</strong> Perplexity's Sonar model now flies at over 1200 tokens per second thanks to Cerebras' massive wafer-scale chips. "Like Perplexity Sonar, their specific LLM for search, is now powered by Cerebras and it's like 1,200 tokens per second. It matches Google now on speed," I noted on the show.</p><p>* <strong>Google-Level Speed:</strong> Perplexity now matches Google in inference speed, making it incredibly fast and responsive.</p><p>This partnership significantly enhances Perplexity's performance, making it an even more compelling search and AI tool. See Perplexity's announcement on <a target="_blank" href="https://x.com/perplexity_ai/status/1889392617479082323">X.com</a>.</p><p>Anthropic Claude Incoming - Combined LLM + Reasoning Model</p><p><a target="_blank" href="https://www.theinformation.com/articles/anthropic-strikes-back">Rumors</a> are swirling that Anthropic is set to release a new <strong>Claude model</strong> that will be a combined LLM and reasoning model, similar to OpenAI's GPT-5 roadmap.</p><p>* <strong>Unified Architecture:</strong> Claude's next model is expected to integrate both LLM and reasoning capabilities into a single, hybrid architecture.</p><p>* <strong>Reasoning Powerhouse:</strong> Rumors suggest Anthropic has had a reasoning model stronger than Claude 3 for some time, hinting at a significant performance leap.</p><p>This move suggests a broader industry trend towards unified AI models that seamlessly blend different capabilities. 
Stay tuned for official announcements from Anthropic.</p><p>Elon Musk Teases Grok 3 "Weeks Out"</p><p>Elon Musk continues to tease the release of <strong>Grok 3</strong>, claiming it will be "a few weeks out" and the "most powerful AI" they have tested, with enhanced reasoning capabilities.</p><p>* <strong>Grok 3 Hype:</strong> Elon Musk claims Grok 3 will be the most powerful AI <a target="_blank" href="https://x.ai">xAI</a> has released, with a focus on reasoning.</p><p>* <strong>Reasoning Focus:</strong> Grok 3's development may have shifted towards reasoning capabilities, potentially causing a slight delay in release.</p><p>While details remain scarce, the anticipation for Grok 3 is building, especially in light of the advancements in open source reasoning models.</p><p>This Week's Buzz 🐝</p><p>Weave Dataset Editing in UI</p><p>Weights & Biases Weave has added a highly requested feature: <strong>dataset editing directly in the UI</strong>.</p><p>* <strong>UI-Based Dataset Editing:</strong> Users can now edit datasets directly within the Weave UI, adding, modifying, and deleting rows without code. "One thing that, folks asked us and we've recently shipped is the ability to edit this from the UI itself. So you don't have to have code," I explained.</p><p>* <strong>Versioning and Collaboration:</strong> Every edit creates a new dataset version, allowing for easy tracking and comparison.</p><p>* <strong>Improved Dataset Management:</strong> Simplifies dataset management and version control for evaluations and experiments.</p><p>This feature streamlines the workflow for LLM evaluation and observability, making Weave even more user-friendly. 
Try it out at <a target="_blank" href="https://wandb.me/weave">wandb.me/weave</a> </p><p>Toronto Workshops - AI in Production: Evals & Observability</p><p>Don't miss our upcoming <strong>AI in Production: Evals & Observability Workshops</strong> in Toronto!</p><p>* <strong>Two Dates:</strong> Sunday and Monday workshops in Toronto.</p><p>* <strong>Hands-on Learning:</strong> Learn to build and evaluate LLM-powered applications with robust observability.</p><p>* <strong>Expert Guidance:</strong> Led by yours truly, Alex Volkov, and featuring Nisten.</p><p>* <strong>Limited Spots:</strong> Registration is still open, but spots are filling up fast! Register for Sunday's workshop <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-in-production-evals-observability-workshop-with-weights-biases">here</a> and Monday's workshop <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-evals-workshop-with-weights-biases-monday">here</a>.</p><p>Join us to level up your LLM skills and network with the Toronto AI community!</p><p>Vision & Video</p><p>Adobe Firefly Video - Image to Video and Text to Video</p><p>Adobe announced <strong>Firefly Video</strong>, entering the image-to-video and text-to-video generation space.</p><p>* <strong>Video Generation:</strong> Firefly Video offers both image-to-video and text-to-video capabilities.</p><p>* <strong>Adobe Ecosystem:</strong> Integrates with Adobe's creative suite, providing a powerful tool for video creators.</p><p>This release marks Adobe's significant move into the rapidly evolving video generation landscape. 
Try Firefly Video <a target="_blank" href="https://firefly.adobe.com/generate">here</a>.</p><p>Voice & Audio</p><p>YouTube Expands AI Dubbing to All Creators</p><p>YouTube is expanding <strong>AI dubbing</strong> to all creators, breaking down language barriers on the platform.</p><p>* <strong>AI-Powered Dubbing:</strong> YouTube is leveraging AI to provide dubbing in multiple languages for all creators. "YouTube now expands AI dubbing in languages to all creators, and that's super cool. So basically no language barriers anymore. AI dubbing is here," I announced.</p><p>* <strong>Increased Watch Time:</strong> The pilot program saw 40% of watch time in dubbed languages, demonstrating the feature's impact. "Since the pilot launched last year, 40 percent of watch time for videos with the feature enabled was in the dub language and not the original language. That's insane!" I highlighted.</p><p>* <strong>Global Reach:</strong> Eliminates language barriers, making content accessible to a wider global audience.</p><p>Wolfram emphasized the importance of dubbing, especially in regions with strong dubbing cultures like Germany. "Every movie that comes here is getting dubbed in high quality. And now AI is doing that on YouTube. And I personally, as a content creator, I always have to decide, do I post in German or English?" This feature is poised to revolutionize content consumption on YouTube. 
Read more on <a target="_blank" href="https://x.com/omooretweets/status/1889727021435199998">X.com</a>.</p><p>Meta Audiobox Aesthetics - Unified Quality Assessment</p><p>Meta released <strong>Audiobox Aesthetics</strong>, a unified automatic quality assessment model for speech, music, and sound.</p><p>* <strong>Unified Assessment:</strong> Provides a single model for evaluating the quality of speech, music, and general sound.</p><p>* <strong>Four Key Metrics:</strong> Evaluates audio based on Production Quality (PQ), Production Complexity (PC), Content Enjoyment (CE), and Content Usefulness (CU).</p><p>* <strong>Automated Evaluation:</strong> Offers a scalable solution for assessing synthetic audio quality, reducing reliance on costly human evaluations.</p><p>This tool is expected to significantly improve the development and evaluation of TTS and audio generation models. Access the <a target="_blank" href="https://scontent-den2-1.xx.fbcdn.net/v/t39.2365-6/475941290_1082969453602014_2080888948846738665_n.pdf?_nc_cat=101&#38;ccb=1-7&#38;_nc_sid=3c67a6&#38;_nc_ohc=TAU0g1oWcZoQ7kNvgGAzq4j&#38;_nc_oc=Adh60zhX4jhMo386FVNUKEZwq5hxfe86kI9KNfDXZA2u8MYwGLBCL3zwIEvUt5uBtt8&#38;_nc_zt=14&#38;_nc_ht=scontent-den2-1.xx&#38;_nc_gid=Acc0tHR7eFr8v14Ar7ZaV6V&#38;oh=00_AYA9SCZT7wLl5PCo9qWbR8f8AjoNS1nZDAf4dHX6q8S2eQ&#38;oe=67B34AE0">Paper</a> and <a target="_blank" href="https://github.com/facebookresearch/audiobox-aesthetics">Weights</a> on GitHub.</p><p>Zonos - Expressive TTS with High-Fidelity Cloning</p><p>Zyphra released <strong>Zonos</strong>, a highly expressive TTS model with high-fidelity voice cloning capabilities.</p><p>* <strong>Expressive TTS:</strong> Zonos offers expressive speech generation with control over speaking rate, pitch, and emotions.</p><p>* <strong>High-Fidelity Voice Cloning:</strong> Claims high-fidelity voice cloning from short audio samples (though my personal test was less impressive). "My own voice clone sounded a little bit like me but not a lot. 
Ok at least for me, the cloning is really really bad," I admitted on the show.</p><p>* <strong>High Bitrate Audio:</strong> Generates speech at 44kHz with a high bitrate codec for enhanced audio quality.</p><p>* <strong>Open Source & API:</strong> Models are open source, with a commercial API available.</p><p>While voice cloning might need further refinement, Zonos represents another step forward in open-source TTS technology. Explore Zonos on <a target="_blank" href="https://huggingface.co/Zyphra/Zonos-v0.1-hybrid">Hugging Face (Hybrid)</a>, <a target="_blank" href="https://huggingface.co/Zyphra/Zonos-v0.1-transformer">Hugging Face (Transformer)</a>, and <a target="_blank" href="https://t.co/Fw4SkUmcIu">GitHub</a>, and read the <a target="_blank" href="https://www.zyphra.com/post/beta-release-of-zonos-v0-1">Blog post</a>.</p><p>Tools & Others</p><p>Emergent Values AI - AI Utility Functions and Biases</p><p>Researchers found that AIs exhibit <strong>emergent values</strong>, including biases in valuing human lives from different regions.</p><p>* <strong>Emergent Utility Functions:</strong> AI models appear to develop implicit utility functions and value systems during training. "Research finds that AIs have expected utility functions for people and other emergent values. And this is freaky," I summarized.</p><p>* <strong>Value Biases:</strong> Studies revealed biases, with AIs valuing lives from certain regions (e.g., Nigeria, Pakistan, India) higher than others (e.g., Italy, France, Germany, UK, US). "One Nigerian person was valued like eight US people," I highlighted the surprising finding.</p><p>* <strong>Utility Engineering:</strong> Researchers propose "utility engineering" as a research agenda to analyze and control these emergent value systems.</p><p>LDJ pointed out a potential correlation between the valued regions and the source of RLHF data labeling, suggesting a possible link between training data and emergent biases. While the study is still debated, it raises important questions about AI value alignment. Read the announcement on <a target="_blank" href="https://x.com/DanHendrycks/status/1889344074098057439">X.com</a> and the <a target="_blank" href="https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view">Paper</a>.</p><p>LM Studio Lands Support for Speculative Decoding</p><p>LM Studio, the popular local LLM inference tool, now supports <strong>speculative decoding</strong>, significantly speeding up inference.</p><p>* <strong>Faster Inference:</strong> Speculative decoding leverages a smaller "draft" model to accelerate inference with a larger model. "Speculative decoding finally landed in LM studio, which is dope folks. If you use LM studio, if you don't, you should," I exclaimed.</p><p>* <strong>Visualize Accepted Tokens:</strong> LM Studio visualizes accepted draft tokens, allowing users to see speculative decoding in action.</p><p>* <strong>Performance Boost:</strong> Improved inference speeds by up to 40% in tests, without sacrificing model performance. "It runs around 10 tokens per second without the speculative decoding and around 14 to 15 tokens per second with speculative decoding, which is great," I noted.</p><p>This update makes LM Studio even more powerful for local LLM experimentation. 
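To make the mechanism concrete, here is a toy sketch of draft-model speculative decoding. Everything here is a hypothetical stand-in (the "models" are simple integer functions), not LM Studio's actual implementation:</p>

```python
# Toy sketch of draft-model speculative decoding. The "target" model is
# the big, slow model; the "draft" model is the small, fast one. The
# draft proposes k tokens, the target verifies them in one pass, and we
# keep the longest agreeing prefix plus one corrected target token.

def target_next(context):
    # stand-in for the big model's greedy next-token choice
    return context[-1] + 1 if context else 0

def draft_next(context):
    # cheap draft model: agrees with the target except on multiples of 5
    nxt = target_next(context)
    return nxt if nxt % 5 else nxt + 100

def speculative_decode(context, n_tokens, k=3):
    out = list(context)
    while len(out) - len(context) < n_tokens:
        # 1) draft proposes k tokens autoregressively (cheap)
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) target verifies the whole chunk in a single pass
        accepted = []
        for tok in draft:
            if tok == target_next(out + accepted):
                accepted.append(tok)
            else:
                break  # first mismatch: discard the rest of the draft
        # 3) keep the agreeing prefix plus one token from the target
        accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(context):len(context) + n_tokens]

print(speculative_decode([0], 10))  # prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

<p>Because the target model has the final say on every token, the output is identical to normal decoding; the speedup comes from verifying a whole chunk of draft tokens in one target pass instead of one pass per token. 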
See the announcement on <a target="_blank" href="https://x.com/lmstudio/status/1889789651797319808">X.com</a>.</p><p>Noam Shazeer / Jeff Dean on Dwarkesh Podcast</p><p>Podcast enthusiasts should check out the new <strong>Dwarkesh Podcast</strong> episode featuring Noam Shazeer (Transformer co-author) and Jeff Dean (Google DeepMind).</p><p>* <strong>AI Insights:</strong> Listen to insights from two AI pioneers in this new podcast episode.</p><p>Tune in to hear from these influential figures in the AI world. Find the announcement on <a target="_blank" href="https://x.com/dwarkesh_sp/status/1889770108949577768">X.com</a>.</p><p>What a week, folks! From rogue AI analyzing my personal life to OpenAI shaking up the roadmap and tiny models conquering math, the AI world continues to deliver surprises. Here are some key takeaways:</p><p>* <strong>Open Source is Exploding:</strong> Nomic Embed Text V2, OLMoE, DeepScaler 1.5B, and ModernBERT Instruct are pushing the boundaries of what's possible with open, accessible models.</p><p>* <strong>Speed is King:</strong> Groq, Cerebras and SambaNova are delivering blazing-fast inference, making real-time AI applications more feasible than ever.</p><p>* <strong>Reasoning is Evolving:</strong> DeepScaler 1.5B's success demonstrates the power of RL for even small models, and OpenAI and Anthropic are moving towards unified models with integrated reasoning.</p><p>* <strong>Privacy Matters:</strong> AllenAI's OLMoE highlights the growing importance of on-device AI for data privacy.</p><p>* <strong>The AI Landscape is Shifting:</strong> OpenAI's roadmap announcement signals a move towards simpler, more integrated AI experiences, while government officials are taking a stronger stance on AI policy.</p><p>Stay tuned to ThursdAI for the latest updates, and don't forget to subscribe to the newsletter for all the links and details! 
Next week, I'll be in New York, so expect a special edition of ThursdAI from the AI Engineer floor.</p><p>TLDR & Show Notes</p><p>* <strong>Open Source LLMs</strong></p><p>* NousResearch DeepHermes-3 Preview (X, <a target="_blank" href="https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview">HF</a>)</p><p>* Nomic Embed Text V2 - first embedding MoE (<a target="_blank" href="https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe">HF</a>, <a target="_blank" href="https://static.nomic.ai/nomic_embed_multilingual_preprint.pdf">Tech Report</a>)</p><p>* AllenAI OLMOE on IOS as a standalone app & new Tulu 3.1 8B (<a target="_blank" href="https://allenai.org/blog/olmoe-app">Blog</a>, <a target="_blank" href="https://apps.apple.com/app/id6738533815">App Store</a>)</p><p>* Groq adds Qwen models (including R1 distill) and lands on OpenRouter (<a target="_blank" href="https://x.com/GroqInc/status/1889347665072173171">X</a>)</p><p>* Agentica DeepScaler 1.5B beats o1-preview on math using RL for $4500 (<a target="_blank" href="https://x.com/Yuchenj_UW/status/1889387582066401461">X</a>, <a target="_blank" href="https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview">HF</a>, <a target="_blank" href="https://wandb.ai/mluo/deepscaler-1.5b">WandB</a>)</p><p>* ModernBert can be instructed (though encoder only) to do general tasks (<a target="_blank" href="https://x.com/bclavie/status/1888963894296936616">X</a>)</p><p>* LMArena releases a dataset of 100K votes with human preferences (<a target="_blank" href="https://x.com/lmarena_ai/status/1890114273449525439">X</a>, <a target="_blank" href="https://huggingface.co/datasets/lmarena-ai/arena-human-preference-100k">HF</a>)</p><p>* SambaNova adds full DeepSeek R1 671B - flies at 200t/s (<a target="_blank" href="https://sambanova.ai/blog/sambanova-cloud-launches-the-fastest-deepseek-r1-671b">blog</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* RIP GPT-5 and o3 - OpenAI announces a public roadmap (<a 
target="_blank" href="https://x.com/sama/status/1889755723078443244">X</a>)</p><p>* OpenAI released Model Spec v2 (<a target="_blank" href="https://github.com/openai/model_spec/">Github</a>, <a target="_blank" href="https://model-spec.openai.com/2025-02-12.html">Blog</a>)</p><p>* VP Vance Speech at AI Summit in Paris (<a target="_blank" href="https://thespectator.com/topic/read-j-d-vance-full-speech-ai-summit-paris/">full speech</a>)</p><p>* Cerebras now powers Perplexity with >1200t/s (<a target="_blank" href="https://x.com/perplexity_ai/status/1889392617479082323">X</a>)</p><p>* Anthropic Claude incoming, will be combined LLM + reasoning (<a target="_blank" href="https://www.theinformation.com/articles/anthropic-strikes-back">The Information</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* We've added dataset editing in the UI (<a target="_blank" href="https://x.com/weave_wb/status/1887564898777117139">X</a>)</p><p>* 2 workshops in Toronto, <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-in-production-evals-observability-workshop-with-weights-biases">Sunday</a> and <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-evals-workshop-with-weights-biases-monday">Monday</a></p><p>* <strong>Vision & Video</strong></p><p>* Adobe announces firefly video (img2video and txt2video) (<a target="_blank" href="https://firefly.adobe.com/generate">try it</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Youtube to expand AI Dubbing to all creators (<a target="_blank" href="https://x.com/omooretweets/status/1889727021435199998">X</a>)</p><p>* Meta Audiobox Aesthetics - Unified Automatic Quality Assessment for Speech, Music, and Sound (<a target="_blank" 
href="https://scontent-den2-1.xx.fbcdn.net/v/t39.2365-6/475941290_1082969453602014_2080888948846738665_n.pdf?_nc_cat=101&#38;ccb=1-7&#38;_nc_sid=3c67a6&#38;_nc_ohc=TAU0g1oWcZoQ7kNvgGAzq4j&#38;_nc_oc=Adh60zhX4jhMo386FVNUKEZwq5hxfe86kI9KNfDXZA2u8MYwGLBCL3zwIEvUt5uBtt8&#38;_nc_zt=14&#38;_nc_ht=scontent-den2-1.xx&#38;_nc_gid=Acc0tHR7eFr8v14Ar7ZaV6V&#38;oh=00_AYA9SCZT7wLl5PCo9qWbR8f8AjoNS1nZDAf4dHX6q8S2eQ&#38;oe=67B34AE0">Paper</a>, <a target="_blank" href="https://github.com/facebookresearch/audiobox-aesthetics">Weights</a>)</p><p>* Zonos, a highly expressive TTS model with high fidelity voice cloning (<a target="_blank" href="https://www.zyphra.com/post/beta-release-of-zonos-v0-1">Blog</a>, <a target="_blank" href="https://huggingface.co/Zyphra/Zonos-v0.1-hybrid">HF</a>,<a target="_blank" href="https://huggingface.co/Zyphra/Zonos-v0.1-transformer">HF</a>, <a target="_blank" href="https://t.co/Fw4SkUmcIu">Github</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Emergent Values AI - Research finds that AI's have expected utility functions (<a target="_blank" href="https://x.com/DanHendrycks/status/1889344074098057439">X</a>, <a target="_blank" href="https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view">paper</a>)</p><p>* LMStudio lands support for Speculative Decoding (<a target="_blank" href="https://x.com/lmstudio/status/1889789651797319808">X</a>)</p><p>* Noam Shazeer / Jeff Dean on Dwarkesh podcast (<a target="_blank" href="https://x.com/dwarkesh_sp/status/1889770108949577768">X</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-13-my-personal-rogue</link><guid isPermaLink="false">substack:post:157107358</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 13 Feb 2025 23:00:33 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/157107358/cd0c5638d92db125bb8549ce665e2197.mp3" length="74735492" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6228</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/157107358/7658933d7a8760b77f98ad7246feaf98.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Feb 6 - OpenAI DeepResearch is your personal PHD scientist, o3-mini & Gemini 2.0, OmniHuman-1 breaks reality & more AI news]]></title><description><![CDATA[<p>What's up friends, Alex here, back with another ThursdAI hot off the presses.</p><p>Hold onto your hats because this week was another whirlwind of AI breakthroughs, mind-blowing demos, and straight-up game-changers. We dove deep into OpenAI's new "Deep Research" agent – and let me tell you, it's not just hype, it's legitimately revolutionary. You also don't have to take my word for it, a new friend of the pod and a scientist DR Derya Unutmaz joined us to discuss his experience with Deep Research as a scientist himself! You don't want to miss this conversation! </p><p>We also unpack Google's Gemini 2.0 release, including the blazing-fast Flash Lite model. 
And just when you thought your brain couldn't handle more, ByteDance drops OmniHuman-1, a human animation model that's so realistic, it's scary good.</p><p>I've also seen maybe 10 more</p><p>TLDR & Show Notes</p><p>* <strong>Open Source LLMs (and deep research implementations)</strong></p><p>* Jina Node-DeepResearch (<a target="_blank" href="https://x.com/jina_ai_">X</a>, <a target="_blank" href="https://github.com/jina-ai/node-DeepResearch">Github</a>)</p><p>* HuggingFace - OpenDeepResearch (<a target="_blank" href="https://x.com/reach_vb/status/1886882087237509487">X</a>)</p><p>* Deep Agent - R1-V (<a target="_blank" href="https://x.com/liangchen5518/status/1886171667522842856">X</a>, <a target="_blank" href="https://github.com/Deep-Agent/R1-V">Github</a>)</p><p>* Krutrim - Krutrim 2 12B, Chitrath VLM, Embeddings and more from India (<a target="_blank" href="https://x.com/bhash/status/1886687710363955492">X</a>, <a target="_blank" href="https://t.co/yuBW8WbUcX">Blog</a>, <a target="_blank" href="https://huggingface.co/krutrim-ai-labs">HF</a>)</p><p>* Simple Scaling - S1 - R1 (<a target="_blank" href="https://arxiv.org/abs/2501.19393">Paper</a>)</p><p>* Mergekit updated - </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI ships o3-mini and o3-mini High + updates thinking traces (<a target="_blank" href="https://openai.com/index/openai-o3-mini/">Blog</a>, <a target="_blank" href="https://x.com/altryne/status/1887616916736622961">X</a>)</p><p>* Mistral relaunches LeChat with Cerebras for 1000t/s (<a target="_blank" href="https://mistral.ai/en/news/all-new-le-chat">Blog</a>)</p><p>* OpenAI Deep Research - the researching agent that uses o3 (<a target="_blank" href="https://x.com/altryne/status/1886554659588071684">X</a>, <a target="_blank" href="https://openai.com/index/introducing-deep-research/">Blog</a>)</p><p>* Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio (<a target="_blank" 
href="https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/">Blog</a>)</p><p>* Anthropic <strong>Constitutional Classifiers -</strong> announced a universal jailbreak prevention (<a target="_blank" href="https://anthropic.com/research/constitutional-classifiers">Blog</a>, <a target="_blank" href="https://claude.ai/constitutional-classifiers">Try It</a>)</p><p>* Cloudflare to protect websites from AI scraping (<a target="_blank" href="https://fortune.com/2025/02/04/matthew-prince-ai-audit-block-media/?utm_medium=social&#38;utm_campaign=fortunemagazine&#38;utm_source=twitter.com&#38;xid=soc_socialflow_twitter_FORTUNE">News</a>)</p><p>* HuggingFace becomes the AI Appstore (<a target="_blank" href="https://huggingface.co/spaces">link</a>)</p><p>* <strong>This weeks Buzz - Weights & Biases updates</strong></p><p>* AI Engineer workshop (<a target="_blank" href="https://www.ai.engineer/summit/2025/schedule/wandb-production">Saturday 22</a>) </p><p>* Tinkerers Toronto workshops (<a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-in-production-evals-observability-workshop-with-weights-biases">Sunday 23</a> , <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-evals-workshop-with-weights-biases-monday">Monday 24</a>)</p><p>* We released a new Dataset editor feature (<a target="_blank" href="https://x.com/weave_wb/status/1887564898777117139">X</a>)</p><p>* <strong>Audio and Sound</strong></p><p>* KyutAI open sources Hibiki  - simultaneous translation models (<a target="_blank" href="https://huggingface.co/spaces/kyutai/hibiki-samples">Samples</a>, <a target="_blank" href="https://huggingface.co/collections/kyutai/hibiki-fr-en-67a48835a3d50ee55d37c2b5">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance OmniHuman-1 - unparalleled Human Animation Models (<a target="_blank" href="https://x.com/altryne/status/1886804788513530137/video/1">X</a>, <a target="_blank" 
href="https://omnihuman-lab.github.io/">Page</a>)</p><p>* Pika Labs adds PikaAdditions - adding anything to existing video (<a target="_blank" href="https://x.com/pika_labs/status/1887547042622562646">X</a>)</p><p>* Google added Imagen 3 to their API (<a target="_blank" href="https://developers.googleblog.com/en/imagen-3-arrives-in-the-gemini-api/">Blog</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Mistral Le Chat has iOS and Android apps now (<a target="_blank" href="https://x.com/dchaplot/status/1887517614689464674">X</a>)</p><p>* CoPilot now has agentic workflows (<a target="_blank" href="https://x.com/ashtom/status/1887548091404046359">X</a>)</p><p>* Replit launches free apps agent for everyone (<a target="_blank" href="https://x.com/mattppal/status/1886866670649790565">X</a>)</p><p>* Karpathy drops a new 3-hour video on YouTube (<a target="_blank" href="https://x.com/karpathy/status/1887211193099825254">X</a>, <a target="_blank" href="https://t.co/75mXcUBI8L">YouTube</a>)</p><p>* OpenAI canvas links are now shareable (like Anthropic artifacts) - (<a target="_blank" href="https://chatgpt.com/canvas/shared/67a5449a8174819190dc3f8a41ab8d23">example</a>)</p><p>* <strong>Show Notes & Links </strong></p><p>* Guest of the week - Dr <a target="_blank" href="https://twitter.com/DeryaTR_/">Derya Unutmaz</a> - talking about Deep Research</p><p>* His examples of Ehlers-Danlos Syndrome (<a target="_blank" href="https://t.co/Yd9K54XtBE">ChatGPT</a>), (ME/CFS) <a target="_blank" href="https://x.com/DeryaTR_/status/1886487553387430396">Deep Research</a>, <a target="_blank" href="https://www.nature.com/articles/d41586-025-00377-9">Nature article</a> about Deep Research with Derya's comments</p><p>* Hosts</p><p>* Alex Volkov - AI Evangelist & Host <a target="_blank" href="http://x.com/altryne">@altryne</a></p><p>* Wolfram Ravenwolf - AI Evangelist <a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a></p><p>* Nisten Tahiraj - AI Dev at <a target="_blank" 
href="http://github.GG">github.GG</a> - <a target="_blank" href="https://x.com/nisten/status/1884064612695581054">@nisten</a></p><p>* LDJ - Resident data scientist - <a target="_blank" href="https://x.com/ldjconfirmed/status/1884678546431373764">@ldjconfirmed</a></p><p>Big Companies products & APIs</p><p>OpenAI's new chatGPT moment with Deep Research, their second "agent" product (X)</p><p>Look, I've been reporting on AI weekly for almost 2 years now, and have been following the space closely since way before chatGPT (shoutout Codex days), and this definitely feels like <strong>another chatGPT moment</strong> for me.</p><p>Deep Research is OpenAI's new agent that searches the web for any task you give it, is able to reason about the results, and continues searching those sources, to provide you with an absolutely incredible level of research into any topic, scientific or ... the best taqueria in another country. </p><p>The reason it's so good is its ability to do multiple search trajectories, backtrack if it needs to, and react in real time to new information. It also has Python tool use (to do plots and calculations) and of course, the brain of it is o3, the best reasoning model from OpenAI.</p><p>Deep Research is only offered on the Pro tier ($200) of chatGPT, and it's the first publicly available way to use the full o3! And boy, does it deliver! </p><p>I've had it review my workshop content, help me research LLM as a judge articles (which it did masterfully) and help me plan date nights in Denver (though it kind of failed at that, showing me a closed restaurant) </p><p>A breakthrough for scientific research</p><p>But I'm no scientist, so I've asked Dr. Derya Unutmaz, M.D., to join us, and share his incredible findings as a doctor, a scientist and someone with decades of experience in writing grants, patent applications, papers, etc. 
</p><p>The whole conversation is very, very much worth listening to on the pod, we talked for almost an hour, but the highlights are honestly quite crazy. </p><p>So one of the first things I did was, I asked Deep Research to write a review on a particular disease that I’ve been studying for a decade. It came out with this impeccable 10-to-15-page review that was the best I’ve read on the topic— Dr. Derya Unutmaz</p><p>And another banger quote</p><p>It wrote a phenomenal 25-page patent application for a friend’s cancer discovery—something that would’ve cost 10,000 dollars or more and taken weeks. I couldn’t believe it. Every one of the 23 claims it listed was thoroughly justified</p><p>Humanity's LAST exam? </p><p>OpenAI announced Deep Research and showed that on the HLE (Humanity's Last Exam) benchmark, which was just released a few weeks ago, it scores a whopping 26.6 percent! When HLE was released (our coverage <a target="_blank" href="https://sub.thursdai.news/i/155578714/humanitys-last-exam-the-benchmark-to-beat">here</a>) all the way back on... checks notes... January 23 of this year, the top reasoning models at the time (o1, R1) scored just under 10%!</p><p>O3-mini and Deep Research now score 13% and 26.6% respectively, which means both that AI is advancing like crazy, but also... that maybe calling this the "last exam" was a bit premature? 😂😅</p><p>Deep Research is now also the SOTA holder on GAIA, a public benchmark on real-world questions, though Clementine (one of GAIA's authors) throws a bit of shade on the <a target="_blank" href="https://x.com/clefourrier/status/1886385835324457143">result</a> since OpenAI didn't really submit their results. 
Incidentally, Clementine is also involved in HuggingFace's attempt at replicating Deep Research in the open (with <a target="_blank" href="https://huggingface.co/blog/open-deep-research">OpenDeepResearch</a>) </p><p>OpenAI releases o3-mini and o3-mini high</p><p>This honestly got kind of buried with the Deep Research news, but as promised, on the last day of January, OpenAI released their new reasoning model, which is significantly faster and much cheaper than o1, while matching it on most benchmarks! </p><p>I've been saying since the o3 announcement (our <a target="_blank" href="https://sub.thursdai.news/p/dec-26-openai-o3-and-o3?utm_source=publication-search">coverage</a>) that the mini may be a more practical and useful release than o3 itself, given the price and speed of it. </p><p>And voilà, OpenAI has reduced the price point of their best reasoner model by 67%, and its price is now just 2x that of DeepSeek R1.</p><p>Coming in at $1.10 per 1M input tokens and $4.40 per 1M output tokens, and streaming at a whopping 1000t/s in some instances, this reasoner is really something to beat. </p><p>Great for application developers</p><p>In addition to seeming like a great model, comparing it to R1 is a nonstarter IMO, and not because "it’s sending your data to choyna", which IMO is a <a target="_blank" href="https://x.com/altryne/status/1886982075456348416">ridiculous</a> attack vector that people should be ashamed of posting. </p><p>o3-mini supports all of the nice API things that OpenAI has, like tool use, structured outputs, developer messages and streaming. The ability to set the reasoning effort is also interesting for applications! </p><p>An added benefit is the new 200K context window with 100K (claimed) output context. </p><p>It's also really, really fast; while R1 availability grows as it gets hosted on more and more US-based providers, none of them are offering the full context window at these token speeds. </p><p>o3-mini-high?! 
</p><p>While free users also started getting access to o3-mini with the "reason" button on chatGPT, Plus subscribers received 2 models, o3-mini and o3-mini-high, which is essentially the same model, but with the "high" reasoning mode turned on, giving the model significantly more compute (and tokens) to think. </p><p>This can be done on the API level by selecting reasoning_effort=high but it's the first time OpenAI is exposing this to non-API users! </p><p>One highlight for me is just how MANY tokens o3-mini-high thinks through. In one of my evaluations on Weave, o3-mini high generated around 160K output tokens, answering 20 questions, while DeepSeek R1, for example, generated 75K, and Gemini Thinking got the highest score on these while charging only 14K tokens (though I'm pretty sure Google just doesn't report on thinking tokens yet, this seems like a bug)</p><p>As I'm writing this, OpenAI just announced a new update, o3-mini and o3-mini-high now show... "updated" reasoning traces! </p><p>These definitely "feel" more like the R1 reasoning traces (remember, previously OpenAI had a different model summarizing the reasoning to prevent training on them?) but they are not really the RAW ones (confirmed) </p><p><strong>Google ships Gemini 2.0 Pro, Gemini 2.0 Flash-lite in AI Studio</strong> (<a target="_blank" href="https://x.com/???">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/">Blog</a>)</p><p>Congrats to our friends at Google for 2.0 👏 Google finally put all the experimental models under one 2.0 umbrella, giving us Gemini 2.0 Pro, Gemini 2.0 Flash and a new model! </p><p>They also introduced Gemini 2.0 Flash-lite, a <em>crazy</em> fast and cheap model that performs similarly to Flash 1.5. The rate limits on Flash-lite are <em>twice</em> as high as the regular Flash, making it incredibly useful for real-time applications. 
</p><p>They have also released a few benchmarks, but they only compared those to previous models released by Google, and while that's great, I wanted a broader comparison, so I asked Deep Research to do it for me, and it did (with citations!) </p><p>Google also released Imagen 3, their awesome image diffusion model, in their API today. At 3¢ per image, this one is really, really good! </p><p>Mistral's new LeChat spits out 1000t/s + new iOS apps</p><p>During the show, Mistral announced new capabilities for their LeChat interface, including a $15/mo tier, but most importantly, crazy fast generation using some kind of new inference, spitting out around 1000t/s. (Powered by Cerebras)</p><p>Additionally, they have a code interpreter there, Canvas, and they also claim to have the best OCR, and don't forget, they have access to Flux images, and are likely the only place I know of that offers that image model for free! </p><p>Finally, they've released native mobile apps! (<a target="_blank" href="https://t.co/uAJXAZlPSr">iOS</a>, <a target="_blank" href="https://t.co/bACHYIwIS9">Android</a>)</p><p>* from my quick tests, the 1000t/s is not always on, my first attempt was instant, it was like black magic, and then the rest of them were pretty much the same speed as before 🤔  Maybe they are getting hammered with traffic... </p><p>This week's Buzz (What I learned with WandB this week)</p><p>I got to play around with O3-Mini <em>before</em> it was released (perks of working at Weights & Biases!), and I used <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=feb6">Weave</a>, our observability and evaluation framework, to analyze its performance. The results were… <em>interesting</em>.</p><p>* <strong>Latency and Token Count:</strong> O3-Mini High's latency was <em>six times</em> longer than O3-Mini Low on a simple reasoning benchmark (92 seconds vs. 6 seconds). 
But here's the kicker: it didn't even answer more questions correctly! And the token count? O3-Mini High used <em>half a million tokens</em> to answer 20 questions three times. That's… a lot.</p><p>* <strong>Weave Leaderboards:</strong> Nisten got <em>super</em> excited about using Weave's leaderboard feature to benchmark models. He realized it could solve a real problem in the open-source community – providing a verifiable and transparent way to share benchmark results. (really, we didn't rehearse this!) </p><p>I also announced some upcoming workshops I'd love to see you at:</p><p>* <a target="_blank" href="https://www.ai.engineer/summit/2025/schedule/wandb-production"><strong>AI Engineer Workshop</strong></a><strong> in NYC:</strong> I'll be running a workshop on evaluations at the AI Engineer Summit in New York on February 22nd. Come say hi and learn about evals!</p><p>* <strong>AI Tinkerers Workshops in Toronto:</strong> I'll also be doing two workshops with AI Tinkerers in Toronto on <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-in-production-evals-observability-workshop-with-weights-biases">February 23rd</a> and <a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-ai-evals-workshop-with-weights-biases-monday">24th</a>.</p><p>ByteDance OmniHuman-1 - a reality bending, mind breaking img2human model</p><p>Ok, this is where my mind completely broke this week; I absolutely couldn't stop thinking about this release from ByteDance. After releasing the SOTA lipsyncing model just a few months ago (LatentSync, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-9th-nvidias-tiny-supercomputer?utm_source=publication-search">our coverage</a>) they have once again blown everyone away, this time with an img2avatar model that's unlike anything we've ever seen. 
</p><p>This one doesn't need words, just watch my live reaction as I lose my mind.</p><p>The level of real-world building in these videos is just absolutely... too much? The piano keys moving, there's a video of a woman speaking into the microphone, and behind her, the window has reflections of cars and people moving! </p><p>The thing that most blew me away upon review was the Nikki Glaser video, with a shiny dress and the model almost perfectly replicating the correct sources of light. </p><p>Just absolute sorcery! </p><p>The authors confirmed that they don't have any immediate plans to release this as a model or even a product, but given the speed of open source, we'll get this within a year for sure! Get ready.</p><p>Open Source LLMs (and deep research implementations)</p><p>This week wasn't <em>massive</em> for open-source releases in terms of entirely <em>new</em> models, but the ripple effects of DeepSeek's R1 are still being felt. The community is buzzing with attempts to replicate and build upon its groundbreaking reasoning capabilities. It feels like everyone is scrambling to figure out the "secret sauce" behind R1's "aha moment," and we're seeing some fascinating results.</p><p><strong>Jina Node-DeepResearch and HuggingFace OpenDeepResearch</strong></p><p>The community wasted no time trying to replicate OpenAI's Deep Research agent.</p><p>* Jina AI released "Node-DeepResearch" (<a target="_blank" href="https://x.com/jina_ai_">X</a>, <a target="_blank" href="https://github.com/jina-ai/node-DeepResearch">Github</a>), claiming it follows the "query, search, read, reason, repeat" formula. As I mentioned on the show, "I believe that they're wrong" about it being just a simple loop. 
O3 is likely a fine-tuned model, but still, it's awesome to see the open-source community tackling this so quickly!</p><p>* Hugging Face also announced "OpenDeepResearch" (<a target="_blank" href="https://x.com/reach_vb/status/1886882087237509487">X</a>), aiming to create a truly open research agent. Clementine Fourrier, one of the authors behind the GAIA benchmark (which measures research agent capabilities), is involved, so this is definitely one to watch.</p><p><strong>Deep Agent - R1-V:</strong> These folks claim to have replicated DeepSeek R1's "aha moment" – where the model realizes its own mistakes and rethinks its approach – <em>for just $3</em>! (<a target="_blank" href="https://x.com/liangchen5518/status/1886171667522842856">X</a>, <a target="_blank" href="https://github.com/Deep-Agent/R1-V">Github</a>)</p><p>As I said on the show, "It's crazy, right? Nothing costs $3 anymore. Like it's half a coffee in Starbucks." They even claim you can witness this "aha moment" in a VLM. Open source is moving <em>fast</em>.</p><p><strong>Krutrim - Krutrim 2 12B, Chitrarth VLM, Embeddings and more from India:</strong> This Indian AI lab released a whole suite of models, including an improved LLM (Krutrim 2), a VLM (Chitrarth 1), a speech-language model (Dhwani 1), an embedding model (Vyakhyarth 1), and a translation model (Krutrim Translate 1). (<a target="_blank" href="https://x.com/bhash/status/1886687710363955492">X</a>, <a target="_blank" href="https://t.co/yuBW8WbUcX">Blog</a>, <a target="_blank" href="https://huggingface.co/krutrim-ai-labs">HF</a>) They even developed a benchmark called "BharatBench" to evaluate Indic AI performance.</p><p>However, the community was quick to point out some… <em>issues</em>. As Harveen Singh Chadha pointed out on X, it seems like they blatantly copied IndicTrans, an MIT-licensed model, without even mentioning it. Not cool, Krutrim. 
Not cool.</p><p><strong>AceCoder:</strong> This project focuses on using reinforcement learning (RL) to improve code models. (<a target="_blank" href="https://x.com/???">X</a>) They claim to have created a pipeline to automatically generate high-quality, verifiable code training data.</p><p>They trained a reward model (AceCode-RM) that significantly boosts the performance of Llama-3.1 and Qwen2.5-coder-7B. They even claim you can skip SFT training for code models by using just 80 steps of R1-style training!</p><p><strong>Simple Scaling - S1 - R1:</strong> This paper (<a target="_blank" href="https://arxiv.org/abs/2501.19393">Paper</a>) showcases the power of <em>quality over quantity</em>. They fine-tuned Qwen2.5-32B-Instruct on just <em>1,000 carefully curated reasoning examples</em> and matched the performance of o1-preview!</p><p>They also introduced a technique called "budget forcing," allowing the model to control its test-time compute and improve performance. As I mentioned, Niklas Muennighoff, who worked at Allen AI and was previously on the show, is involved. This is one to really pay attention to – it shows that you don't need massive datasets to achieve impressive reasoning capabilities.</p><p><strong>Unsloth reduces R1 type reasoning to just 7GB VRAM (</strong><a target="_blank" href="https://unsloth.ai/blog/r1-reasoning"><strong>blog</strong></a><strong>)</strong></p><p>DeepSeek R1-Zero autonomously learned reasoning in what the DeepSeek researchers called the "aha moment". </p><p>Unsloth adds another attempt at replicating this "aha moment" and claims they got it down to less than 7GB of VRAM, and you can see it for free in a Google Colab! </p><p>This magic can be recreated through GRPO, an RL algorithm that optimizes responses efficiently without requiring a value function, unlike Proximal Policy Optimization (PPO), which relies on one.</p><p>How it works: 1. The model generates a group of responses. 2. 
Each response is scored based on correctness or another metric computed by a set reward function rather than an LLM reward model. 3. The average score of the group is computed. 4. Each response's score is compared to the group average. 5. The model is reinforced to favor higher-scoring responses.</p><p>Tools</p><p>A few new and interesting tools were released this week as well: </p><p>* Replit rebuilt and released their Replit agents in an iOS app and released it free for many users. It can now build mini apps for you on the fly! (<a target="_blank" href="https://x.com/amasad/status/1886859253648122181">Replit</a>)</p><p>* Mistral has iOS/Android apps with the new release of LeChat (<a target="_blank" href="https://x.com/dchaplot/status/1887517614689464674">X</a>)</p><p>* Molly Cantillon released <a target="_blank" href="https://x.com/mollycantillon/status/1887569755772793000">RPLY</a>, which sits on your Mac and drafts replies to your messages. I installed it while writing this newsletter, and I did not expect it to hit this hard: it reviewed and summarized my texting patterns to "sound like me". Very, very well crafted tool, and best of all, it can run the models on device if you want! </p><p>* GitHub Copilot announced agentic workflows and next-line editing, which are Cursor-like features. To try them out you have to download VS Code Insiders. They also added Gemini 2.0 (<a target="_blank" href="https://github.blog/news-insights/product-news/github-copilot-the-agent-awakens/">Blog</a>)</p><p>The AI field moves SO fast, I had to update the content of the newsletter around 5 times while writing it as new things kept getting released! </p><p>This was a banger week that started with o3-mini and deep research, continued with Gemini 2.0 and OmniHuman, and "ended" with Mistral x Cerebras, GitHub Copilot agents, o3-mini's updated CoT reasoning traces and a bunch more! 
</p><p>AI doesn't stop, and we're here weekly to cover all of this, and give you guys the highlights, but also go deep! </p><p>Really appreciate Derya's appearance on the show this week, please give him a follow and see you guys next week! </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-6-openai-deepresearch</link><guid isPermaLink="false">substack:post:156643204</guid><dc:creator><![CDATA[Alex Volkov, Derya Unutmaz, M.D., and Nisten]]></dc:creator><pubDate>Fri, 07 Feb 2025 01:09:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/156643204/35112bcacb430cc261ed62925a475a72.mp3" length="72346876" type="audio/mpeg"/><itunes:author>Alex Volkov, Derya Unutmaz, M.D., and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6029</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/156643204/13037e17d4af15a28b34a52d3851f779.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 30 - DeepSeek vs. Nasdaq, R1 everywhere, Qwen Max & Video, Open Source SUNO, Goose agents & more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here 👋</p><p>It’s official—grandmas (and the entire stock market) now know about DeepSeek. If you’ve been living under an AI rock, DeepSeek’s new R1 model just set the world on fire, rattling Wall Street (causing the biggest monetary loss for any company, ever!) and rocketing to #1 on the iOS App Store. This week’s ThursdAI show took us on a deep (pun intended) dive into the dizzying whirlwind of open-source AI breakthroughs, agentic mayhem, and big-company cat-and-mouse announcements. 
Grab your coffee (or your winter survival kit if you’re in Canada), because in true ThursdAI fashion, we’ve got at least a dozen bombshells to cover—everything from brand-new Mistral to next-gen vision models, new voice synthesis wonders, and big moves from Meta and OpenAI.</p><p>We’re also talking “reasoning mania,” as the entire industry scrambles to replicate, dethrone, or ride the coattails of the new open-source champion, R1. So buckle up—because if the last few days are any indication, 2025 is officially the Year of Reasoning (and quite possibly, the Year of Agents, or both!)</p><p>Open Source LLMs</p><p>DeepSeek R1 Discourse Crashes the Stock Market</p><p><strong>One-sentence summary</strong>: DeepSeek’s R1 “reasoning model” caused a frenzy this week, hitting #1 on the App Store and briefly sending NVIDIA’s stock plummeting in the process ($560B drop, the largest monetary loss of any stock, ever)</p><p>Ever since DeepSeek R1 launched (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-jan-23-2025-deepseek-r1?r=2imipa">our technical coverage last week!</a>), the buzz has been impossible to ignore—everyone from your mom to your local barista has heard the name. The speculation? DeepSeek’s new architecture apparently cost only $5.5 million to train, fueling the notion that high-level AI might be cheaper than Big Tech claims. Suddenly, people wondered if GPU manufacturers like NVIDIA might see shrinking demand, and the stock indeed took a short-lived 17% tumble. On the show, I joked, “My mom knows about DeepSeek—your grandma probably knows about it, too,” underscoring just how mainstream the hype has become.</p><p>Not everyone is convinced the cost claims are accurate. Even Dario Amodei of Anthropic weighed in with a blog post arguing that DeepSeek’s success <em>increases</em> the case for stricter AI export controls. 
</p><p>Public Reactions</p><p>* <strong>Dario Amodei’s blog</strong>In “On DeepSeek and Export Controls,” Amodei argues that DeepSeek’s efficient scaling exemplifies why democratic nations need to maintain a strategic leadership edge—and enforce export controls on advanced AI chips. He sees Chinese breakthroughs as proof that AI competition is global and intense.</p><p>* <strong>OpenAI Distillation Evidence</strong>OpenAI mentioned it found “distillation traces” of GPT-4 inside R1’s training data. Hypocrisy or fair game? On ThursdAI, the panel mused that “everyone trains on everything,” so perhaps it’s a moot point.</p><p>* <strong>Microsoft Reaction</strong>Microsoft wasted no time, swiftly adding DeepSeek to Azure—further proof that corporations want to harness R1’s reasoning power, no matter where it originated.</p><p>* <strong>Government Reaction</strong>Even government officials weighed in: David Sacks, the incoming US AI & Crypto czar, discussed DeepSeek’s “distillation” (using the term somewhat incorrectly), and President Trump was asked about it.</p><p>* <strong>API Outages</strong>DeepSeek’s own API has gone in and out this week, apparently hammered by demand (and possibly DDoS attacks). Meanwhile, GPU clouds like Groq are showing up to accelerate R1 at 300 tokens/second, for those who must have it right now.</p><p>We’ve seen so many bad takes on the topic, from seething cope takes to gross misunderstandings from government officials confusing the iOS app with the OSS models, to folks throwing conspiracy theories into the mix, claiming the $5.5M sum was a PsyOp. 
The fact of the matter is, DeepSeek R1 is an incredible model, and is now powering, just a week later, multiple products (more on this below) and experiences, while pushing everyone else to compete (and give us reasoning models!)</p><p>Open Thoughts Reasoning Dataset</p><p><strong>One-sentence summary</strong>: A community-led effort, “Open Thoughts,” released a new large-scale dataset (OpenThoughts-114k) of chain-of-thought reasoning data, fueling the open-source drive toward better reasoning models.</p><p>Worried about having enough labeled “thinking” steps to train your own reasoner? Fear not. The OpenThoughts-114k dataset aggregates chain-of-thought prompts and responses—114,000 of them—for building or fine-tuning reasoning LLMs. It’s now on <a target="_blank" href="https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k">Hugging Face</a> for your experimentation pleasure. The ThursdAI panel pointed out how crucial these large, openly available reasoning datasets are. As <strong>Wolfram</strong> put it, “We can’t rely on the big labs alone. More open data means more replicable breakouts like DeepSeek R1.”</p><p>Mistral Small 2501 (24B)</p><p><strong>One-sentence summary</strong>: Mistral AI returns to the open-source spotlight with a 24B model that fits on a single 4090, scoring over 81% on MMLU while under Apache 2.0.</p><p>Long rumored to be “going more closed,” Mistral AI re-emerged this week with Mistral-Small-24B-Instruct-2501—an Apache 2.0 licensed LLM that runs easily on a single RTX 4090 or a MacBook with 32GB of RAM. That 81% MMLU accuracy is no joke, putting it well above many 30B–70B competitor models. <strong>It was described as</strong> “the perfect size for local inference and a real sweet spot,” noting that for many tasks, 24B is “just big enough but not painfully heavy.” Mistral also finally started comparing themselves to Qwen 2.5 in official benchmarks—a big shift from their earlier reluctance, which we applaud! 
</p><p>Berkeley TinyZero & RAGEN (R1 Replications)</p><p><strong>One-sentence summary</strong>: Two separate projects (TinyZero and RAGEN) replicated DeepSeek R1-zero’s reinforcement learning approach, showing you can get “aha” reasoning moments with minimal compute.</p><p>If you were wondering whether R1 is replicable: yes, it is. <strong>Berkeley’s TinyZero</strong> claims to have reproduced the core R1-zero behaviors for $30 using a small 3B model. Meanwhile, the <strong>RAGEN</strong> project aims to unify RL + LLM + Agents with a minimal codebase. While neither replication is at R1-level performance, they demonstrate how quickly the open-source community pounces on new methods. “We’re now seeing those same ‘reasoning sparks’ in smaller reproductions,” said <strong>Nisten</strong>. “That’s huge.”</p><p>Agents</p><p>Codename Goose by Block (<a target="_blank" href="https://x.com/blocks/status/1884292904753254488">X</a>, <a target="_blank" href="https://block.github.io/goose/">Github</a>)</p><p><strong>One-sentence summary</strong>: Jack Dorsey’s company Block released Goose, an open-source local agent framework letting you run keyboard automation on your machine.</p><p>Ever wanted your AI to press keys and move your mouse in real time? Goose does exactly that with AppleScript, memory extensions, and a fresh approach to “local autonomy.” On the show, I tried Goose, but found it occasionally “went rogue, trying to delete my WhatsApp chats.” Security concerns aside, Goose is significant: it’s an open-source playground for agent-building. The plugin system includes integration with Git, Figma, a knowledge graph, and more. If nothing else, Goose underscores how hot “agentic” frameworks are in 2025.</p><p>OpenAI’s Operator: One-Week-In</p><p>It’s been a week since <strong>Operator</strong> went live for Pro-tier ChatGPT users. “It’s the first agent that can run for multiple minutes without bugging me every single second,” I noted. 
Yet it’s still far from perfect—captchas, login blocks, and repeated confirmations hamper tasks. The potential, though, is enormous: “I asked Operator to gather my <a target="_blank" href="X.com">X.com</a> bookmarks and generate a summary. It actually tried,” I shared, “but it got stuck on three links and needed constant nudges.” <strong>Simon Willison</strong> added that it’s “a neat tech demo” but not quite a productivity boon yet. Next steps? Possibly letting the brand-new reasoning models (like O1 Pro Reasoning) do the chain-of-thought under the hood.</p><p>I also got tired of opening hundreds of tabs for Operator, so I wrapped it in a native macOS app that has native notifications and the ability to launch Operator tasks via a Raycast extension. If you're interested, you can find it on my <a target="_blank" href="https://github.com/altryne/wraperator/tree/main">Github</a></p><p>Browser-use / Computer-use Alternatives</p><p>In addition to Goose, the ThursdAI panel mentioned <strong>browser-use</strong> on <a target="_blank" href="https://github.com/browser-use/browser-use">GitHub</a>, plus numerous code interpreters. So far, none blow minds in reliability. But 2025 is evidently “the year of agents.” If you’re itching to offload your browsing or file editing to an AI agent, expect to tinker, troubleshoot, and yes, babysit. The show consensus? “It’s not about whether agents are coming, it’s about how soon they’ll become truly robust,” said <strong>Wolfram</strong>.</p><p>Big CO LLMs + APIs</p><p>Alibaba Qwen2.5-Max (& Hidden Video Model) (<a target="_blank" href="https://chat.qwenlm.ai/">Try It</a>)</p><p><strong>One-sentence summary</strong>: Alibaba’s Qwen2.5-Max stands toe-to-toe with GPT-4 on some tasks, while also quietly rolling out video-generation features.</p><p>While Western media fixates on DeepSeek, Alibaba’s Qwen team quietly dropped the Qwen2.5-Max MoE model. 
It clocks in at 69% on MMLU-Pro—beating some OpenAI or Google offerings—and comes with a 1-million-token context window. And guess what? The official Chat interface apparently does hidden video generation, though Alibaba hasn’t publicized it on the English-language internet. </p><p>On the Chinese AI internet, this video generation model is called <a target="_blank" href="https://tongyi.aliyun.com/wanxiang/">Tongyi Wanxiang</a>; it even has its own website, supports first-and-last-frame video generation, and looks really, really good. They have a gallery up there, and it even generates audio together with the video!</p><p>This one was an img2video, but the movements are really natural! </p><p>Zuckerberg on LLama4 & LLama4 Mini</p><p>In Meta’s Q4 earnings call, Zuck was all about AI (sorry, Metaverse). He declared that LLama4 is in advanced training, with a smaller “LLama4 Mini” finishing pre-training. More importantly, a “reasoning model” is in the works, presumably influenced by the mania around R1. Some employees had apparently posted on Blind about “Why are we paying billions for training if DeepSeek did it for $5 million?” so the official line is that Meta invests heavily for top-tier scale. </p><p>Zuck also doubled down on saying "Glasses are the perfect form factor for AI", to which I somewhat agree; I love my Meta Raybans, I just wish they were better integrated into iOS. </p><p>He also boasted about their HUGE datacenters, called Mesa, spanning the size of Manhattan, being built for the next step of AI. </p><p>(Nearly) Announced: O3-Mini</p><p>Right before the ThursdAI broadcast, rumors swirled that OpenAI might reveal O3-Mini. It’s presumably GPT-4’s “little cousin” with a fraction of the cost. Then…silence. Sam Altman also mentioned they would be bringing o3-mini by end of January, but maybe the R1 craziness made them keep working on it and training it a bit more? 🤔 </p><p>In any case, we'll cover it when it launches. 
</p><p>This Week’s Buzz</p><p>We still hold the #1 spot on SWE-bench Verified with W&B Programmer, and our CTO, Shawn Lewis, chatted with friends of the pod Swyx and Alessio about it! (give it a listen)</p><p>We have two upcoming events:</p><p>* <a target="_blank" href="AI.engineer"><strong>AI.engineer</strong></a> in New York (Feb 20–22). Weights & Biases is sponsoring, and I will broadcast ThursdAI live from the summit. If you snagged a ticket, come say hi—there might be a cameo from the “Chef.”</p><p>* <strong>Toronto Tinkerer Workshops</strong> (late February) at the University of Toronto. The Canadian AI scene is hot, so watch out for sign-ups (will add them to the show next week)</p><p>Weights & Biases also teased more features for LLM observability (Weave) and reminded folks of their new suite of evaluation tools. “If you want to know if your AI is actually better, you do evals,” <strong>Alex</strong> insisted. For more details, check out <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan30">wandb.me/weave</a> or tune into the next ThursdAI.</p><p>Vision & Video</p><p>DeepSeek - Janus Pro - multimodal understanding and image gen unified (1.5B & 7B)</p><p><strong>One-sentence summary</strong>: Alongside R1, DeepSeek also released Janus Pro, a unified model for image understanding and generation (like GPT-4’s rumored image abilities).</p><p>DeepSeek apparently never sleeps. <strong>Janus Pro</strong> is MIT-licensed, 7B parameters, and can both parse images (SigLIP) and generate them (LlamaGen). The model outperforms DALL·E 3 and SDXL on some internal benchmarks—though at a modest 384×384 resolution. </p><p>NVIDIA’s Eagle 2 Redux</p><p><strong>One-sentence summary</strong>: NVIDIA re-released the Eagle 2 vision-language model with 4K resolution support, after mysteriously yanking it a week ago.</p><p>Eagle 2 is back, boasting multi-expert architecture, 16k context, and high-res video analysis. 
Rumor says it competes with big 70B param vision models at only 9B. But it’s overshadowed by Qwen2.5-VL (below). Some suspect NVIDIA is aiming to outdo Meta’s open-source hold on vision—just in time to keep GPU demand strong.</p><p>Qwen 2.5 VL - SOTA oss vision model is here </p><p><strong>One-sentence summary</strong>: Alibaba’s Qwen 2.5 VL model claims state-of-the-art in open-source vision, including 1-hour video comprehension and “object grounding.”</p><p>The Qwen team didn’t hold back: “It’s the final boss for vision,” joked <strong>Nisten</strong>. Qwen 2.5 VL uses advanced temporal modeling for video and can handle complicated tasks like OCR or multi-object bounding boxes. </p><p>Featuring advances in precise object localization, video temporal understanding, and agentic capabilities for computer use, this is going to be the model to beat! </p><p>Voice & Audio</p><p>YuE 7B (Open “Suno”)</p><p>Ever dream of building the next pop star from your code editor? YuE 7B is your ticket. This model supports chain-of-thought creation of structured songs, multi-lingual lyrics, and references. It’s slow to infer, but it’s arguably the best open-source music generator released so far.</p><p>What's more, they changed the license to Apache 2.0 just before we went live, so you can use YuE everywhere! </p><p>Refusion Fuzz</p><p>Refusion, a new competitor to paid audio models like Suno and Udio, launched “Fuzz,” offering free music generation online until GPU meltdown.</p><p>If you want to dabble in “prompt to jam track” without paying, check out <a target="_blank" href="https://refusion.ai/fuzz">Refusion Fuzz</a>. Will it match the emotional nuance of premium services like 11 Labs or Hailuo? Possibly not. 
But hey, free is free.</p><p>Tools (that have integrated R1)</p><p>Perplexity with R1</p><p>In the <a target="_blank" href="perplexity.ai">perplexity.ai</a> chat, you can choose “Pro with R1” if you pay for it,  harnessing R1’s improved reasoning to parse results. For some, it’s a major upgrade to “search-based question answering.” Others prefer it to paying for O1 or GPT-4. </p><p>I always check Perplexity if it knows what the latest episode of ThursdAI was, and it's the first time it did a very good summary! I legit used it to research the show this week! It's really something. </p><p>Meanwhile, <a target="_blank" href="Exa.ai">Exa.ai</a> also integrated a “DeepSeek Chat” for your agent-based workflows. Like it or not, R1 is everywhere.</p><p><a target="_blank" href="Krea.ai">Krea.ai</a> with DeepSeek</p><p>Our friends at Krea, an AI art tool aggregator, also hopped on the R1 bandwagon for chat-based image searching or generative tasks.</p><p>Conclusion</p><p>Key Takeaways</p><p>* <strong>DeepSeek’s R1 has massive cultural reach</strong>, from #1 apps to spooking the stock market.</p><p>* <strong>Reasoning mania</strong> is upon us—everyone from Mistral to Meta wants a piece of the logic-savvy LLM pie.</p><p>* <strong>Agentic frameworks</strong> like Goose, Operator, and browser-use are proliferating, though they’re still baby-stepping through reliability issues.</p><p>* <strong>Vision and audio</strong> get major open-source love, with Janus Pro, Qwen 2.5 VL, YuE 7B, and more reshaping multimodality.</p><p>* <strong>Big Tech</strong> (Meta, Alibaba, OpenAI) is forging ahead with monster models, multi-billion-dollar projects, and cross-country expansions in search of the best reasoning approaches.</p><p>At this point, it’s not even about where the next big model drop comes from; it’s about how quickly the entire ecosystem can adopt (or replicate) that new methodology. 
</p><p>Stay tuned for next week’s ThursdAI, where we’ll hopefully see new updates from OpenAI (maybe O3-Mini?), plus the ongoing race for best agent. Also, catch us at <a target="_blank" href="AI.engineer">AI.engineer</a> in NYC if you want to talk shop or share your own open-source success stories. Until then, keep calm and carry on training.</p><p>TLDR</p><p>* <strong>Open Source LLMs</strong></p><p>* DeepSeek Crashes the Stock Market: Did $5.5M training or hype do it?</p><p>* Open Thoughts Reasoning Dataset OpenThoughts-114k (<a target="_blank" href="https://x.com/madiator/status/1884284103354376283">X</a>, <a target="_blank" href="https://t.co/MUAJd9mWZD">HF</a>)</p><p>* Mistral Small 2501 (24B, Apache 2.0) (<a target="_blank" href="https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501">HF</a>)</p><p>* Berkeley TinyZero & RAGEN (R1-Zero Replications) (<a target="_blank" href="https://app.reflect.app/g/altryne/github.com/Jiayi-Pan/TinyZero">Github</a>, <a target="_blank" href="https://app.reflect.app/g/altryne/wandb.ai/jiayipan/TinyZero">WANDB</a>)</p><p>* Allen Institute - Tulu 405B (<a target="_blank" href="https://allenai.org/blog/tulu-3-405B">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5">HF</a>)</p><p>* <strong>Agents</strong></p><p>* Goose by Blocks (local agent framework) - (<a target="_blank" href="https://x.com/blocks/status/1884292904753254488">X</a>, <a target="_blank" href="https://block.github.io/goose/">Github</a>)</p><p>* Operator (OpenAI) – One-Week-In (<a target="_blank" href="https://x.com/altryne/status/1883056651332448761">X</a>)</p><p>* Browser-use - oss version of Operator (<a target="_blank" href="https://github.com/browser-use/browser-use">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Alibaba Qwen2.5-Max (+ hidden video model) - (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1884263157574820053">X</a>, <a target="_blank" 
href="https://chat.qwenlm.ai/">Try it</a>)</p><p>* Zuckerberg on LLama4 & “Reasoning Model” (<a target="_blank" href="https://x.com/altryne/status/1884778839009796411">X</a>)</p><p>* <strong>This Week’s Buzz</strong></p><p>* Shawn Lewis <a target="_blank" href="https://x.com/latentspacepod/status/1884065983062761548">interview</a> on <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a> with <a target="_blank" href="https://substack.com/profile/89230629-swyx-and-alessio">swyx & Alessio</a> </p><p>* We’re sponsoring the <a target="_blank" href="https://ai.engineer">ai.engineer</a> upcoming summit in NY (Feb 19-22), come say hi! </p><p>* After that, we’ll host 2 workshops with AI Tinkerers Toronto (Feb 23-24), make sure you’re signed up to <a target="_blank" href="https://toronto.aitinkerers.org/">Toronto Tinkerers</a> to receive the invite (we were sold out quick last time!) </p><p>* <strong>Vision & Video</strong></p><p>* DeepSeek Janus Pro - 1.5B and 7B (<a target="_blank" href="https://github.com/deepseek-ai/Janus/tree/main?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://huggingface.co/spaces/AP123/Janus-Pro-7b">Try It</a>)</p><p>* NVIDIA Eagle 2 (<a target="_blank" href="http://arxiv.org/abs/2501.14818">Paper</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/eagle-2-6764ba887fa1ef387f7df067">Model</a>, <a target="_blank" href="https://eagle-vlm.xyz/">Demo</a>)</p><p>* Alibaba Qwen 2.5 VL  (<a target="_blank" href="https://qwenlm.github.io/blog/qwen2.5-vl/">Project</a>, <a target="_blank" href="huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct">HF</a>, <a target="_blank" href="https://github.com/QwenLM/Qwen2.5-VL">Github</a>, <a target="_blank" href="https://chat.qwenlm.ai/">Try It</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Yue 7B (Open Suno) - (<a target="_blank" href="https://fal.ai/models/fal-ai/yue/requests">Demo</a>, <a target="_blank" href="https://huggingface.co/m-a-p">HF</a>, <a 
target="_blank" href="https://github.com/multimodal-art-projection/YuE">Github</a>)</p><p>* Refusion Fuzz (<a target="_blank" href="https://refusion.ai/fuzz">free for now</a>)</p><p>* <strong>Tools</strong></p><p>* Perplexity with R1 (choose Pro with R1)</p><p>* Exa integrated R1 for free (<a target="_blank" href="https://demo.exa.ai/deepseekchat">demo</a>)</p><p>* <strong>Participants</strong></p><p>* Alex Volkov (<a target="_blank" href="https://x.com/altryne">@altryne</a>)</p><p>* Wolfram Ravenwolf (<a target="_blank" href="https://x.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten/">@nisten</a>)</p><p>* LDJ (<a target="_blank" href="https://x.com/ldjconfirmed">@ldjOfficial</a>)</p><p>* Simon Willison (<a target="_blank" href="https://x.com/simonw">@simonw</a>)</p><p>* W&B Weave (<a target="_blank" href="https://x.com/weave_wb">@weave_wb</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-30-deepseek-vs-nasdaq</link><guid isPermaLink="false">substack:post:156126960</guid><dc:creator><![CDATA[Alex Volkov, Simon Willison, and Nisten]]></dc:creator><pubDate>Thu, 30 Jan 2025 22:35:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/156126960/a9cff07292f5c75f4992db1bd875c1c5.mp3" length="82632701" type="audio/mpeg"/><itunes:author>Alex Volkov, Simon Willison, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6886</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/156126960/8cf1647205fc4e45d95cd8f3f2941dfc.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent, $500B AI manhattan project, ByteDance UI-Tars, new Gemini Thinker & more AI news]]></title><description><![CDATA[<p>What a week, folks, what a week! Buckle up, because ThursdAI just dropped, and this one's a doozy. We're talking seismic shifts in the open source world, a potential game-changer from DeepSeek AI that's got everyone buzzing, and oh yeah, just a casual $500 BILLION infrastructure project announcement. Plus, OpenAI finally pulled the trigger on "Operator," their agentic browser thingy – though getting it to actually <em>operate</em> proved to be a bit of a live show adventure, as you'll hear. </p><p>This week felt like one of those pivotal moments in AI, a real before-and-after kind of thing. DeepSeek's R1 hit the open source scene like a supernova, and suddenly, top-tier reasoning power is within reach for anyone with a Mac and a dream. And then there's OpenAI's Operator, promising to finally bridge the gap between chat and action. Did it live up to the hype? 
Well, let's just say things got interesting.</p><p>As I’m writing this, the White House just announced that an Executive Order on AI was signed and published as well. What a WEEK.</p><p>Open Source AI Goes Nuclear: DeepSeek R1 is HERE!</p><p>Hold onto your hats, open source AI just went supernova! This week, the Chinese Whale Bros – DeepSeek AI, that quant trading firm turned AI powerhouse – dropped a bomb on the community in the best way possible: <strong>R1, their reasoning model, is now open source under the MIT license!</strong> As I said on the show, "Open source AI has never been as hot as this week."</p><p>This isn't just <em>a</em> model, folks. DeepSeek unleashed a whole arsenal: two full-fat R1 models (DeepSeek R1 and DeepSeek R1-Zero), and a whopping six distilled finetunes based on Qwen (1.5B, 7B, 14B, and 32B) and Llama (8B, 70B). </p><p>One stat that blew my mind, and Nisten's for that matter, is that <strong>DeepSeek-R1-Distill-Qwen-1.5B, the </strong><strong><em>tiny</em></strong><strong> 1.5 billion parameter model, is outperforming GPT-4o and Claude-3.5-Sonnet on math benchmarks!</strong> "This 1.5 billion parameter model that now does this. It's absolutely insane," I exclaimed on the show. We're talking 28.9% on AIME and 83.9% on MATH. Let that sink in. A model you can probably run on your phone is schooling the big boys in math.</p><p>License-wise, it's MIT, which as Nisten put it, "MIT is like a jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is, it's not my problem. You're a problem now." Basically, do whatever you want with it. Distill it, fine-tune it, build Skynet – it's all fair game.</p><p>And the vibes? "Vibes are insane," as I mentioned on the show. Early benchmarks are showing R1 models trading blows with o1-preview and o1-mini, and even nipping at the heels of the full-fat o1 in some areas. Check out these numbers:</p><p>And the price? Forget about it. 
We're talking 50x cheaper than o1 currently. DeepSeek R1 API is priced at $0.14 / 1M input tokens and $2.19 / 1M output tokens, compared to OpenAI's o1 at $15.00 / 1M input and a whopping $60.00 / 1M output. Suddenly, high-quality reasoning is democratized.</p><p>LDJ highlighted the "aha moment" in DeepSeek's paper, where they talk about how reinforcement learning enabled the model to re-evaluate its approach and "think more." It seems like simple RL scaling, combined with a focus on reasoning, is the secret sauce. No fancy Monte Carlo Tree Search needed, apparently!</p><p>But the real magic of open source is what the community does with it. Pietro Schirano joined us to talk about his "Retrieval Augmented Thinking" (RAT) approach, where he extracts the thinking process from R1 and transplants it to other models. "And what I found out is actually by doing so, you may even like smaller, quote unquote, you know, less intelligent model actually become smarter," Pietro explained. Frankenstein models, anyone? (John Lindquist has a tutorial on how to do it <a target="_blank" href="https://egghead.io/combine-deep-seek-r1-reasoning-with-gpt-3-5-turbo-for-the-cheapest-fastest-and-best-ai~24oy1">here</a>.)</p><p>And then there's the genius hack from Voooogel, who figured out how to emulate a "reasoning_effort" knob by simply replacing the "end" token with "Wait, but". "This tricks the model into keeps thinking," as I described it. Want your AI to really ponder the meaning of life (or just 1+1)? Now you can, thanks to open source tinkering.</p><p>Georgi Gerganov, the legend behind llama.cpp, even jumped in with a two-line snippet enabling speculative decoding, boosting inference speed for the 32B model on my MacBook from a sluggish 5 tokens per second to a much more respectable 10-11 tokens per second. Open source collaboration at its finest, and it's only going to get better! 
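</p><p>For intuition, the hack amounts to intercepting the token stream and splicing in a continuation word whenever the model tries to stop reasoning. Here is a minimal, model-agnostic sketch; the model_step sampler, the END_THINK marker value, the effort knob, and the stub "model" below are all illustrative assumptions, not DeepSeek's actual API:</p>

```python
# Sketch of the "reasoning_effort" trick: each time the model emits its
# end-of-thinking marker, splice in "Wait," instead, forcing more reasoning.
# `model_step` is any callable that returns the next token for a given context.
END_THINK = "</think>"  # assumed marker; the real stop token varies by model

def generate_with_effort(model_step, prompt, effort=2, max_tokens=64):
    tokens, injections, context = [], 0, prompt
    for _ in range(max_tokens):
        tok = model_step(context)
        if tok == END_THINK and injections < effort:
            tok = "Wait,"  # override the stop marker, keep the model thinking
            injections += 1
        tokens.append(tok)
        context += " " + tok
        if tok == END_THINK:
            break
    return tokens

# Stub "model" that tries to stop early; with effort=2 it gets pushed twice.
script = iter(["thinking...", END_THINK, "more thoughts", END_THINK, END_THINK])
out = generate_with_effort(lambda ctx: next(script), "hey", effort=2)
# out: ["thinking...", "Wait,", "more thoughts", "Wait,", "</think>"]
```

<p>Swap the stub for a real sampler (llama.cpp, transformers, and so on) and the same splice gives you a crude but effective effort dial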
</p><p>Thinking like a Neurotic</p><p>Many people really loved the way R1 thinks, and what I found astonishing is that I just sent "hey" and the thinking went into a whole five-paragraph debate about how to answer. A user on X replied with "this is Woody Allen-level of Neurotic," which... nerd-sniped me so hard! I used Hailuo Audio (which is great!) and ByteDance LatentSync and gave R1 a voice! It's really something when you hear its inner monologue being spoken out loud like this! </p><p></p><p>ByteDance Enters the Ring: UI-TARS Controls Your PC</p><p>Not to be outdone in the open source frenzy, ByteDance, the TikTok behemoth, dropped UI-TARS, a set of models designed to control your PC. And they claim SOTA performance, beating even Anthropic's computer use models and, in some benchmarks, GPT-4o and Claude.</p><p>UI-TARS comes in 2B, 7B, and 72B parameter flavors, and ByteDance even released desktop apps for Mac and PC to go along with them. "They released an app it's called the UI TARS desktop app. And then, this app basically allows you to Execute the mouse clicks and keyboard clicks," I explained during the show.</p><p>While I personally couldn't get the desktop app to work flawlessly (quantization issues, apparently), the potential is undeniable. Imagine open source agents controlling your computer – the possibilities are both exciting and slightly terrifying. As Nisten wisely pointed out, "I would use another machine. These things are not safe to tell people. I might actually just delete your data if you, by accident." Words to live by, folks.</p><p>LDJ chimed in, noting that UI-TARS seems to excel particularly in operating system-level control tasks, while OpenAI's leaked "Operator" benchmarks might show an edge in browser control. 
It's a battle for desktop dominance brewing in open source!</p><p>The common benchmark between Operator and UI-TARS is OSWorld, where UI-TARS launched with a SOTA score.</p><p>Humanity's Last Exam: The Benchmark to Beat</p><p>Speaking of benchmarks, a new challenger has entered the arena: <strong>Humanity's Last Exam (HLE).</strong> A cool new unsaturated bench of 3,000 <em>challenging</em> questions across over a hundred subjects, crafted by nearly a thousand subject matter experts from around the globe. "There's no way I'm answering any of those myself. I need an AI to help me," I confessed on the show.</p><p>And guess who's already topping the HLE leaderboard? You guessed it: <strong>DeepSeek R1, with a score of 9.4%!</strong> "Imagine how hard this benchmark is if the top reasoning models that we have right now... are getting less than 10 percent completeness on this," I noted. MMLU and MATH are getting saturated? HLE is here to provide a serious challenge. Get ready to hear a lot more about HLE, folks.</p><p>Big CO LLMs + APIs: Google's Gemini Gets a Million-Token Brain</p><p>While open source was stealing the show, the big companies weren't completely silent. Google quietly dropped an update to <strong>Gemini Flash Thinking</strong>, their experimental reasoning model, and it's a big one. We're talking <strong>1 million token context window</strong> and code execution capabilities now baked in!</p><p>"This is Google's scariest model ever built, by far," Nisten declared. "This thing, I don't like how good it is. This smells AGI-ish." High praise, and high concern, coming from Nisten! Benchmarks are showing significant performance jumps in math and science evals, and the speed is, as Nisten put it, "crazy usable." They have enabled the whopping 1M context window for the new Gemini Flash 2.0 Thinking Experimental (long ass name, maybe let's call it G1?) 
and I agree, it's really, really good!</p><p>And unlike some other reasoning models <em>cough</em> OpenAI <em>cough</em>, Gemini Flash Thinking <strong>shows you its thinking process!</strong> You can actually see the chain of thought unfold, which is incredibly valuable for understanding and debugging. Google's Gemini is quietly becoming a serious contender in the reasoning race (especially with Noam Shazeer being responsible for it!)</p><p>OpenAI's "Operator" - Agents Are (Almost) Here</p><p>The moment we were all waiting for (or at least, <em>I</em> was): OpenAI finally unveiled <strong>Operator</strong>, their first foray into Level 3 Autonomy - agentic capabilities with ChatGPT. Sam Altman himself hyped it up as "AI agents are AI systems that can do work for you. You give them a task and they go off and do it." Sounds amazing, right?</p><p>Operator is built on a new model called <strong>CUA (Computer Using Agent)</strong>, trained on top of GPT-4o, and it's designed to control a web browser in the cloud, just like a human would, using screen pixels, mouse, and keyboard. "This is just using screenshots, no API, nothing, just working," one of the OpenAI presenters emphasized. </p><p>They demoed Operator booking restaurant reservations on OpenTable, ordering groceries on Instacart, and even trying to buy Warriors tickets on StubHub (though that demo got a little… glitchy). The idea is that you can delegate tasks to Operator, and it'll go off and handle them in the background, notifying you when it needs input or when the task is complete.</p><p>As I'm writing these words, I have one Operator running off to get me some fried rice, and another trying to book me a summer vacation with the kids, finding some options and reporting back what it found. </p><p>Benchmarks-wise, OpenAI shared numbers for OSWorld (38.1%) and WebArena (58.1%), showing Operator outperforming previous SOTA but still lagging behind human performance. "Still a way to go," as they admitted. 
But the potential is massive.</p><p>The catch? <strong>Operator is initially launching in the US for Pro users only, and even </strong><strong><em>then</em></strong><strong>, it wasn't exactly smooth sailing.</strong> I immediately paid the $200/mo to try it out (Pro mode didn't convince me, unlimited Sora videos didn't either, but Operator definitely did; SOTA agents from OpenAI are definitely something I must try!) and my first test? Writing a tweet 😂 Here's a video of that first attempt, which I had to interrupt once. </p><p>But hey, it's a "low key research preview," right? And as Sam Altman said, "This is really the beginning of this product. This is the beginning of our step into Agents Level 3 on our tiers of AGI." Agentic ChatGPT is coming, folks, even if it's taking a slightly bumpy route to get here.</p><p>BTW, while I'm writing these words, Operator is looking up some vacation options for me and is sending me notifications about them. What a world, and we've only just started 2025!</p><p>Project Stargate: $500 Billion for AI Infrastructure</p><p>If R1 and Operator weren't enough to make your head spin, how about a <strong>$500 BILLION "Manhattan Project for AI infrastructure"?</strong> That's exactly what OpenAI, SoftBank, and Oracle announced this week: <a target="_blank" href="https://openai.com/index/announcing-the-stargate-project/"><strong>Project Stargate</strong></a><strong>.</strong></p><p>"This is insane," I exclaimed on the show. "Power ups for the United States compared to like, other, other countries, like 500 billion commitment!" We're talking about a massive investment in data centers, power plants, and everything else needed to fuel the AI revolution. 2% of the US GDP, according to some estimates!</p><p>Larry Ellison even hinted at using this infrastructure for… curing cancer with personalized vaccines. Whether you buy into that or not, the scale of this project is mind-boggling. 
As LDJ explained, "It seems like it is very specifically for OpenAI. OpenAI will be in charge of operating it. And yeah, it sounds like a smart way to actually get funding and investment for infrastructure without actually having to give away OpenAI equity."</p><p>And in a somewhat related move, Microsoft, previously holding exclusive cloud access for OpenAI, has opened the door for OpenAI to potentially run on other clouds, with Microsoft's approval, if "they cannot meet demand". Is AGI closer than we think? Sam Altman himself downplayed the hype, tweeting, "Twitter hype is out of control again. We're not going to deploy AGI next month, nor have we built it. We have some very cool stuff for you, but please chill and cut your expectations a hundred X."</p><p>But then he drops Operator and a $500 billion infrastructure bomb in the same week, and announces that o3-mini is going to be available for the FREE tier of ChatGPT.</p><p>Sure, Sam, <em>we're going to chill... yeah right. </em></p><p>This Week's Buzz at Weights & Biases: SWE-bench SOTA!</p><p>Time for our weekly dose of Weights & Biases awesomeness! This week, our very own CTO, Shawn Lewis, <strong>broke the SOTA on SWE-bench Verified!</strong> That's right, W&B Programmer, Shawn's agentic framework built on top of o1, achieved a <strong>64.6%</strong> solve rate on this notoriously challenging coding benchmark.</p><p>Shawn detailed his journey in a <a target="_blank" href="https://wandb.ai/wandb/agents/reports/Creating-a-state-of-the-art-AI-programming-agent-with-OpenAI-s-o1--VmlldzoxMTAyODI2Ng?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan9">blog post</a>, highlighting the importance of iteration and evaluation – powered by Weights & Biases <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan23">Weave</a>, naturally. He ran over 1,000 evaluations to reach this SOTA result! 
Talk about eating your own dogfood!</p><p>REMOVING BARRIERS TO AMERICAN LEADERSHIP IN ARTIFICIAL INTELLIGENCE - Executive order</p><p>Just now, as I’m editing the podcast, President Trump signed an executive order on AI into effect, and here are the highlights. </p><p>- Revokes existing AI policies that hinder American AI innovation</p><p>- Aims to solidify US as global leader in AI for human flourishing, competitiveness, and security</p><p>- Directs development of an AI Action Plan within 180 days</p><p>- Requires immediate review and revision of conflicting policies</p><p>- Directs OMB to revise relevant memos within 60 days</p><p>- Preserves agency authority and OMB budgetary functions</p><p>- Consistent with applicable law and funding availability</p><p>- Seeks to remove barriers and strengthen US AI dominance</p><p>This marks a significant pivot into AI acceleration: removing barriers, acknowledging that AI is a huge piece of our upcoming future, and that the US really needs to innovate here, become the global leader, and remove regulation and obstacles. The folks working on this behind the scenes, Sriram Krishnan (previously a16z) and David Sacks, are starting to get into the government and implement those policies, so we’re looking forward to what will come from that! </p><p>Vision & Video: Nvidia's Vanishing Eagle 2 & Hugging Face's Tiny VLM</p><p>In the world of vision and video, Nvidia teased us with <strong>Eagle 2</strong>, a series of frontier vision-language models promising 4K HD input, long-context video, and grounding capabilities with some VERY impressive evals. Weights were released, then… yanked. "NVIDIA released Eagle 2 and then yanked it back. So I don't know what's that about," I commented. Mysterious Nvidia strikes again.</p><p>On the brighter side, Hugging Face released <strong>SmolVLM</strong>, a truly <em>tiny</em> vision-language model, coming in at just 256 million and 500 million parameters. 
"This tiny model that runs in like one gigabyte of RAM or some, some crazy things, like a smart fridge," I exclaimed, impressed. The 256M model even outperforms their previous 80 <em>billion</em> parameter Idefics model from just 17 months ago. Progress marches on, even in tiny packages.</p><p>AI Art & Diffusion & 3D: Hunyuan 3D 2.0 is State of the Art</p><p>For the artists and 3D enthusiasts, Tencent's <strong>Hunyuan 3D 2.0</strong> dropped this week, and it's looking seriously impressive. "Just look at this beauty," I said, showcasing a generated dragon skull. "Just look at this."</p><p>Hunyuan 3D 2.0 boasts two models: Hunyuan3D-DiT-v2-0 for shape generation and Hunyuan3D-Paint-v2-0 for coloring. Text-to-3D and image-to-3D workflows are both supported, and the results are, well, see for yourself:</p><p>If you're looking to move beyond 2D images, Hunyuan 3D 2.0 is definitely worth checking out.</p><p>Tools: ByteDance Clones Cursor with Trae</p><p>And finally, in the "tools" department, ByteDance continues its blitzkrieg with <strong>Trae</strong>, a free Cursor competitor. "ByteDance drops Trae, which is a Cursor competitor, which is free for now," I announced on the show. So if you don't mind your code being sent to... China somewhere, and can't afford Cursor, this is not a bad alternative! </p><p>Trae imports your Cursor configs, supports Claude 3.5 and GPT-4o, and offers a similar AI-powered code editing experience, complete with a chat interface and "builder" (composer) mode. The catch? Your code gets sent to a server in China. If you're okay with that, you've got yourself a free Cursor alternative. "If you're okay with your like code getting shared with ByteDance, this is a good option for you," I summarized. Decisions, decisions.</p><p>Phew! That was a whirlwind tour through another insane week in AI. 
From DeepSeek R1's open source reasoning revolution to OpenAI's Operator going live, and Google's million-token Gemini brain, it's clear that the pace of innovation is showing no signs of slowing down. </p><p>Open source is booming, agents are inching closer to reality, and the big companies are throwing down massive infrastructure investments. We're accelerating as f**k, and it's only just beginning, so hold on to your butts.</p><p>Make sure to dive into the show notes below for all the links and details on everything we covered. And don't forget to give R1 a spin – and maybe try out that "reasoning_effort" hack. Just don't blame me if your AI starts having an existential crisis.</p><p>And as a final thought, channeling my inner Woody Allen-R1: "Don't overthink too much. Enjoy R1. Enjoy the incredible things we received this week from open source."</p><p>See you all next week for more ThursdAI madness! And hopefully, by then, Operator will actually be operating. 😉</p><p>TL;DR and show notes</p><p>* <strong>Open Source LLMs</strong></p><p>* DeepSeek R1 - MIT licensed SOTA open source reasoning model (<a target="_blank" href="https://huggingface.co/deepseek-ai">HF</a>, X)</p><p>* ByteDance UI-TARS - PC control models (<a target="_blank" href="https://huggingface.co/bytedance-research/UI-TARS-7B-SFT">HF</a>, <a target="_blank" href="https://github.com/bytedance/UI-TARS-desktop">Github</a>)</p><p>* HLE - Humanity's Last Exam benchmark (<a target="_blank" href="https://lastexam.ai/">Website</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* SoftBank, Oracle, OpenAI Stargate Project - $500B AI infrastructure (<a target="_blank" href="https://openai.com/index/announcing-the-stargate-project/">OpenAI Blog</a>)</p><p>* Google Gemini Flash Thinking 01-21 - 1M context, Code execution, Better Evals (<a target="_blank" href="https://x.com/NoamShazeer/status/1881845900659896773">X</a>)</p><p>* OpenAI Operator - Agentic browser in ChatGPT Pro <a target="_blank" 
href="https://operator.chatgpt.com">operator.chatgpt.com</a></p><p>* Anthropic launches citations in API (<a target="_blank" href="https://docs.anthropic.com/en/docs/build-with-claude/citations">blog</a>)</p><p>* Perplexity SonarPRO Search API and an Android AI assistant (<a target="_blank" href="https://x.com/perplexity_ai/status/1882466239123255686">X</a>)</p><p>* <strong>This week's Buzz 🐝</strong></p><p>* W&B broke SOTA on SWE-bench Verified (<a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan23">W&B Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* HuggingFace SmolVLM - Tiny VLMs - runs even on WebGPU (<a target="_blank" href="https://huggingface.co/spaces/HuggingFaceTB/SmolVLM-256M-Instruct-WebGPU">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Hunyuan 3D 2.0 - SOTA open-source 3D (<a target="_blank" href="https://huggingface.co/tencent/Hunyuan3D-2">HF</a>)</p><p>* <strong>Tools</strong></p><p>* ByteDance Trae - Cursor competitor (Trae AI: <a target="_blank" href="https://trae.ai/">https://trae.ai/</a>)</p><p>* <strong>Show Notes:</strong> </p><p>* Pietro Schirano RAT - Retrieval Augmented Thinking (<a target="_blank" href="https://x.com/skirano/status/1881854481304047656">X</a>)</p><p>* Run DeepSeek with more “thinking” script (<a target="_blank" href="https://gist.github.com/vgel/8a2497dc45b1ded33287fa7bb6cc1adc">Gist</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-23-2025-deepseek-r1</link><guid isPermaLink="false">substack:post:155578714</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 24 Jan 2025 01:52:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/155578714/3a592bffc306429b313b58c70d5b6b49.mp3" length="78951246" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6579</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/155578714/5e7a2fc0d8dc05fd7177e99f626900ed.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 16, 2025 - Hailuo 4M context LLM, SOTA TTS in browser, OpenHands interview & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋 </p><p>Welcome back to an absolute banger of a week in AI releases, highlighted by a massive open source AI push. We're talking a MASSIVE 4M context window model from Hailuo (remember when a jump from 4K to 16K seemed like a big deal?), an 8B omni model that lets you livestream video, and glimpses of Agentic ChatGPT. </p><p>This week's ThursdAI was jam-packed with so much open source goodness that the big companies were practically silent. But don't worry, we still managed to squeeze in some updates from OpenAI and Mistral, along with a fascinating new paper from Sakana AI on self-adaptive LLMs. Plus, we had the incredible Graham Neubig, from All Hands AI, join us to talk about OpenHands (formerly OpenDevin); he even contributed to our free LLM Evaluation course on Weights & Biases!</p><p>Before we dive in: a friend asked me over dinner what the main two things that happened in AI in 2024 were, and this week highlights one of those trends. 
Most of the open source is now from China. This week, we got MiniMax from Hailuo, OpenBMB with a new MiniCPM, InternLM came back, and most of the rest were Qwen finetunes. Not to mention DeepSeek. I wanted to highlight this significant narrative change, and that this is being done despite the chip export restrictions. </p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>Open Source AI & LLMs</p><p>MiniMax-01: 4 Million Context, 456 Billion Parameters, and Lightning Attention </p><p>This came absolutely out of left field, given that we've seen no prior LLMs from Hailuo, the company previously known for releasing video models with consistent characters. They dropped a massive 456B mixture-of-experts model (45B active parameters) with extremely long context support, in open weights, and with very significant benchmarks that compete with GPT-4o, Claude and DeepSeek v3 (75.7 MMLU-Pro, 89 IFEval, 54.4 GPQA).</p><p>They trained the model on up to a 1M context window and then extended it to 4M with RoPE scaling methods (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-sunday-special-extending?utm_source=publication-search">our coverage</a> of RoPE) during inference. MiniMax-Text-01 adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE), with 45B active parameters. </p><p>I gotta say, when we started talking about context windows, imagining a needle-in-a-haystack graph that shows 4M in open source seemed far-fetched, though we did say that, theoretically, there may not be a limit to context windows. 
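</p><p>To make "RoPE scaling" concrete: RoPE encodes position by rotating query/key pairs at frequencies derived from a base constant (typically 10,000), and NTK-style extension stretches that base so positions beyond the training window map back into the frequency band the model was trained on. A toy sketch, with the caveat that MiniMax's exact recipe isn't specified here, so the NTK-aware formula and the 4x factor below are illustrative assumptions:</p>

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0, scale=1.0):
    """Per-dimension-pair rotation frequencies for RoPE.
    NTK-aware extension enlarges `base` so that positions beyond the
    training window reuse the frequency range the model already knows."""
    base = base * scale ** (head_dim / (head_dim - 2))  # NTK-aware base scaling
    return 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)

# Extending a ~1M-token training context toward 4M is roughly a 4x scale.
freqs_train = rope_frequencies(64)
freqs_ext = rope_frequencies(64, scale=4.0)
assert np.all(freqs_ext <= freqs_train)  # every rotation slows down (or stays put)
```

<p>Lower frequencies mean slower rotation per token, so the same rotation range now spans a longer sequence. 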
I just always expected that limit to be unlocked by transformer-alternative architectures like Mamba or other State Space Models.</p><p>Vision, API and Browsing - MiniMax-VL-01</p><p>It feels like such a well-rounded and complete release that it highlights just how mature the company behind it is. They also released a vision version of this model, which adds a 300M-param Vision Transformer on top (trained on 512B vision-language tokens), features dynamic resolution, and boasts very high DocVQA and ChartQA scores. </p><p>Not only were these two models released in open weights, they also launched as a unified API endpoint (supporting up to 1M tokens), and it's cheap! $0.2/1M input and $1.1/1M output tokens! AFAIK this is only the third API that supports this much context, after Gemini at 2M and Qwen Turbo at 1M.</p><p>Surprising web browsing capabilities</p><p>You can play around with the model on their website, <a target="_blank" href="https://www.hailuo.ai">hailuo.ai</a>, which also includes web grounding. I was quite surprised to find that they beat ChatGPT and Perplexity on how fast they can find information that just happened that same day! Not sure what search API they are using under the hood, but they are very quick. </p><p>8B chat-with-video omni-model from OpenBMB</p><p>OpenBMB has been around for a while and we've seen consistently great updates from them on the MiniCPM front, but this one takes the cake! </p><p>This is a complete end-to-end omni-modal model that does video streaming, audio-to-audio, and text understanding, all in a model that can run on an iPad! </p><p>They have a demo interface that is very similar to the ChatGPT demo from spring of last year, and it allows you to stream your webcam and talk to the model. But this is just an 8B parameter model we're talking about! It's bonkers! 
</p><p></p><p>They are boasting some incredible numbers, though to be honest, I doubt their methodology for textual understanding, because, based on my experience alone, this model's understanding is nowhere close to ChatGPT's Advanced Voice Mode. But MiniCPM has been doing great visual understanding for a while, so ChartQA and DocVQA being close to SOTA tracks. </p><p>But all of this doesn't matter, because, I say again, just a little over a year ago, Google released a video announcing these capabilities, having an AI react to a video in real time, and it absolutely blew everyone away, and it was <a target="_blank" href="https://techcrunch.com/2023/12/07/googles-best-gemini-demo-was-faked/">FAKED</a>. And now, a year later, we have these capabilities, essentially, in an 8B model that runs on device 🤯 </p><p>Voice & Audio </p><p>This week seems to be very multimodal: not only did we get an omni-modal model from OpenBMB that can speak, and last week's Kokoro is still making a lot of waves, but there were a lot of voice updates this week as well.</p><p>Kokoro.js - run the SOTA open TTS now in your browser</p><p>Thanks to friend of the pod Xenova (and the fact that Kokoro was released with ONNX weights), we now have kokoro.js, or npm i kokoro-js if you will. </p><p>This allows you to install and run Kokoro, the best tiny TTS model, completely within your browser, with a tiny 90MB download, and it sounds really good (demo <a target="_blank" href="https://huggingface.co/spaces/webml-community/kokoro-web">here</a>).</p><p>Hailuo T2A - Emotional text to speech + API </p><p>Hailuo didn't rest on the laurels of releasing a huge context window LLM; they also released a new voice framework (though not open sourced) this week, and it sounds remarkably good (competing with ElevenLabs). </p><p>They have all the standard features like voice cloning, but claim to have a way to preserve the emotional undertones of a voice. 
They also have 300 voices to choose from, and professional effects applied on the fly, like acoustics or telephone filters. (Remember, they have a video model as well, so presumably some of this is meant for holistic video production.) </p><p>What I specifically noticed is their "emotional intelligence system," which is either automatic or can be selected from a dropdown. I also noticed their "lax" copyright restrictions, as one of the voices, called "Imposing Queen," sounded just like a certain blonde-haired heiress to the Iron Throne from a certain HBO series. </p><p>When I generated a speech worthy of that queen, the emotion in it sounded very much like an actress reading the lines, unlike any old TTS. Just listen to the clip above; I don't remember getting TTS outputs with this much emotion from anything, maybe outside of Advanced Voice Mode! Quite impressive!</p><p>This Week's Buzz from Weights & Biases - AGENTS!</p><p>Breaking news from W&B, as our CTO <a target="_blank" href="https://x.com/shawnup/status/1880004026957500434">just broke</a> the SWE-bench Verified SOTA with his own o1-based agentic framework, which he calls W&B Programmer 😮, solving <strong>64.6% </strong>of the issues!</p><p>Shawn describes how he achieved this massive breakthrough <a target="_blank" href="https://medium.com/@shawnup/the-best-ai-programmer-from-weights-biases-04cf8127afd8">here</a>, and we'll be publishing more on this soon, but the highlight for me is that he ran over 900 evaluations during the course of this and tracked all of them in <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan16">Weave</a>! </p><p>We also have an upcoming event in NY, on Jan 22nd; if you're there, come by, learn how to evaluate your AI agents and RAG applications, and hang out with our team! 
(Sign up <a target="_blank" href="https://lu.ma/eufkbeem?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan16">here</a>)</p><p>Big Companies & APIs</p><p>OpenAI adds chatGPT tasks - first agentic feature with more to come! </p><p>We finally get a glimpse of an agentic chatGPT, in the form of scheduled tasks! Deployed to all users, it is now possible to select gpt-4o with tasks, and schedule tasks in the future. </p><p>You can schedule them in natural language, and chatGPT will then execute a chat (and maybe perform a search or do a calculation) and send you a notification (and an email!) when the task is done! </p><p>It's a bit underwhelming at first, as I haven't really found a good use for this yet, but I don't doubt that this is just a building block for something more agentic to come that can connect to my email or calendar and do actual tasks for me, not just... save me from typing the chatGPT query at "that time" </p><p>Mistral CodeStral 25.01 - a new #1 coding assistant model</p><p>An updated Codestral was released at the beginning of the week, and TBH I've never seen the vibes split this fast on a model. </p><p>While it's super exciting that Mistral is placing a coding model at #1 on the LMArena CoPilot's arena, near Claude 3.5 and DeepSeek, the fact that this new model was released without open weights is really a bummer (especially as a reference to the paragraph I mentioned on top) </p><p>We seem to be closing down on open source in the West, while the Chinese labs are absolutely crushing it (while also releasing in the open, including weights and technical papers). </p><p>Mistral has released this model via API and via a collab with the Continue.dev coding agent, but they used to be the darling of the open source community by releasing great models! 
</p><p>Also notable: a quick post-release benchmark showed a significant difference between their reported benchmarks and how the model performs on Aider's polyglot benchmark </p><p>There were way more things this week than we were able to cover, including an exciting new Transformer² architecture from Sakana, a new open source TTS with voice cloning and a few other open source LLMs, one of which cost only $450 to train! All the links in the TL;DR below! </p><p>TL;DR and show notes</p><p>* <strong>Open Source LLMs</strong> </p><p>* MiniMax-01 from Hailuo - 4M context 456B (45B A) LLM (<a target="_blank" href="https://github.com/MiniMax-AI/MiniMax-01">Github</a>, <a target="_blank" href="https://huggingface.co/MiniMaxAI">HF</a>, <a target="_blank" href="https://www.minimaxi.com/en/news/minimax-01-series-2">Blog</a>, <a target="_blank" href="https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf">Report</a>)</p><p>* Jina - reader V2 model - HTML 2 Markdown/JSON (<a target="_blank" href="https://huggingface.co/jinaai/ReaderLM-v2">HF</a>)</p><p>* InternLM3-8B-Instruct - apache 2 License (<a target="_blank" href="https://github.com/InternLM/InternLM">Github</a>, <a target="_blank" href="https://huggingface.co/internlm">HF</a>)</p><p>* OpenBMB - <strong>MiniCPM-o 2.6</strong> - Multimodal Live Streaming on Your Phone (<a target="_blank" href="https://huggingface.co/openbmb/MiniCPM-o-2_6">HF</a>, <a target="_blank" href="https://github.com/OpenBMB/MiniCPM-o">Github</a>, <a target="_blank" href="https://minicpm-omni-webdemo-us.modelbest.cn/">Demo</a>)</p><p>* KyutAI - Helium-1 2B - Base (<a target="_blank" href="https://x.com/kyutai_labs/thread/1878857673174864318">X</a>, <a target="_blank" href="https://huggingface.co/kyutai/helium-1-preview-2b">HF</a>)</p><p>* Dria-Agent-α - 3B model that outputs python code (<a target="_blank" href="https://huggingface.co/driaforall/Dria-Agent-a-3B">HF</a>)</p><p>* Sky-T1, a ‘reasoning’ AI model that 
can be trained for less than $450 (<a target="_blank" href="https://novasky-ai.github.io/posts/sky-t1/">blog</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI launches ChatGPT tasks (<a target="_blank" href="https://x.com/OpenAI/status/1879267274185756896">X</a>)</p><p>* Mistral - new CodeStral 25.01 (<a target="_blank" href="https://mistral.ai/news/codestral-2501/">Blog</a>, no Weights)</p><p>* Sakana AI - Transformer²: Self-Adaptive LLMs (<a target="_blank" href="https://sakana.ai/transformer-squared">Blog</a>)</p><p>* <strong>This weeks Buzz </strong></p><p>* Evaluating RAG Applications Workshop - NY, Jan 22, W&B and PineCone (<a target="_blank" href="https://lu.ma/eufkbeem">Free Signup</a>)</p><p>* Our evaluations course is going very strong! (chat w/ Graham Neubig) (<a target="_blank" href="https://wandb.me/evals-t">https://wandb.me/evals-t</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Luma releases Ray2 video model (<a target="_blank" href="https://lumalabs.ai/ray">Web</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Hailuo <strong>T2A-01-HD</strong> - Emotions Audio Model from Hailuo (<a target="_blank" href="https://x.com/Hailuo_AI/status/1879554062993195421">X</a>, <a target="_blank" href="https://t.co/r58fjgvJ7w">Try It</a>)</p><p>* OuteTTS 0.3 - 1B & 500M - zero shot voice cloning model (<a target="_blank" href="https://huggingface.co/collections/OuteAI/outetts-03-6786b1ebc7aeb757bc17a2fa">HF</a>)</p><p>* Kokoro.js - 80M SOTA TTS in your browser! 
(X, <a target="_blank" href="https://github.com/hexgrad/kokoro/pull/3">Github</a>, <a target="_blank" href="https://huggingface.co/spaces/webml-community/kokoro-web">try it</a> )</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Black Forest Labs - Finetuning for Flux Pro and Ultra via API (<a target="_blank" href="https://blackforestlabs.ai/announcing-the-flux-pro-finetuning-api/">Blog</a>)</p><p>* <strong>Show Notes and other Links</strong></p><p>* Hosts - Alex Volkov (<a target="_blank" href="https://x.com/altryne">@altryne</a>), Wolfram RavenWlf (<a target="_blank" href="https://twitter.com/WolframRvnwlf">@WolframRvnwlf</a>), Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten/">@nisten</a>)</p><p>* Guest - Graham Neubig (<a target="_blank" href="https://x.com/gneubig">@gneubig</a>) from All Hands AI (<a target="_blank" href="https://x.com/allhands_ai">@allhands_ai</a>)</p><p>* Graham’s mentioned Agents blogpost - 8 things that agents can do right <a target="_blank" href="https://www.all-hands.dev/blog/8-use-cases-for-generalist-software-development-agents">now</a></p><p>* Projects - Open Hands (previously Open Devin) - <a target="_blank" href="https://github.com/All-Hands-AI/OpenHands">Github</a></p><p>* Germany meetup in Cologne (<a target="_blank" href="https://twitter.com/WolframRvnwlf/status/1877338980632383713">here</a>)</p><p>* Toronto Tinkerer Meetup *Sold OUT* (<a target="_blank" href="https://toronto.aitinkerers.org/p/ai-tinkerers-toronto-january-2025-meetup-at-google">Here</a>)</p><p>* YaRN conversation we had with the Authors (<a target="_blank" href="https://sub.thursdai.news/p/thursdai-sunday-special-extending?utm_source=publication-search">coverage</a>)</p><p></p><p>See you folks next week! Have a great long weekend if you’re in the US 🫡 </p><p><p>Please help to promote the podcast and newsletter by sharing with a friend!</p></p><p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-16-2025-hailuo-4m-context</link><guid isPermaLink="false">substack:post:154986493</guid><dc:creator><![CDATA[Alex Volkov and Graham Neubig]]></dc:creator><pubDate>Fri, 17 Jan 2025 02:38:33 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/154986493/4c658028b0daf293d2c597c2c3fe0ea8.mp3" length="72386988" type="audio/mpeg"/><itunes:author>Alex Volkov and Graham Neubig</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6032</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/154986493/e1f60bd05cf5dd34e5e17ca6cd109c38.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 9th - NVIDIA's Tiny Supercomputer, Phi-4 is back, Kokoro TTS & Moondream gaze, ByteDance SOTA lip sync & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>This week's ThursdAI was a whirlwind of announcements, from Microsoft finally dropping Phi-4's official weights on Hugging Face (a month late, but who's counting?) to Sam Altman casually mentioning that OpenAI's got AGI in the bag and is now setting its sights on <em>superintelligence</em>. Oh, and NVIDIA? They're casually releasing a $3,000 supercomputer that can run 200B parameter models on your desktop. No big deal.</p><p>We had some amazing guests this week too, with <a target="_blank" href="https://x.com/olliezliu/status/1876312788873977996">Oliver</a> joining us to talk about a new foundation model in genomics and biosurveillance (yes, you read that right - think wastewater and pandemic monitoring!), and then, we've got some breaking news! 
<a target="_blank" href="https://x.com/vikhyatk">Vik</a> returned to the show with a brand new Moondream release that can do some pretty wild things. Ever wanted an AI to tell you where someone's looking in a photo? Now you can, thanks to a tiny model that runs on edge devices. 🤯</p><p>So buckle up, folks, because we've got a ton to cover. Let's dive into the juicy details of this week's AI madness, starting with open source.</p><p>03:10 TL;DR</p><p>03:10 Deep Dive into Open Source LLMs</p><p>10:58 MetaGene: A New Frontier in AI</p><p>20:21 PHI4: The Latest in Open Source AI</p><p>27:46 R Star Math: Revolutionizing Small LLMs</p><p>34:02 Big Companies and AI Innovations</p><p>42:25 NVIDIA's Groundbreaking Announcements</p><p>43:49 AI Hardware: Building and Comparing Systems</p><p>46:06 NVIDIA's New AI Models: LLAMA Neumatron</p><p>47:57 Breaking News: Moondream's Latest Release</p><p>50:19 Moondream's Journey and Capabilities</p><p>58:41 Weights & Biases: New Evals Course</p><p>01:08:29 NVIDIA's World Foundation Models</p><p>01:08:29 ByteDance's LatentSync: State-of-the-Art Lip Sync</p><p>01:12:54 Kokoro TTS: High-Quality Text-to-Speech</p><p>As always, TL;DR section with links and show notes below 👇</p><p></p><p>Open Source AI & LLMs</p><p>Phi-4: Microsoft's "Small" Model Finally Gets its Official Hugging Face Debut</p><p>Finally, after a month, we're getting Phi-4 14B on Hugging Face. So far, we've had bootlegged copies of it, but it's finally officially uploaded by Microsoft. Not only is it now official, it's also officially MIT licensed, which is great!</p><p>So, what's the big deal? Well, besides the licensing, it's a 14B parameter, dense decoder-only Transformer with a 16K token context length and trained on a whopping 9.8 <em>trillion</em> tokens. 
It scored 80.4 on math and 80.6 on MMLU, making it about 10% better than its predecessor, Phi-3, and better than Qwen 2.5's 79</p><p>What’s interesting about phi-4 is that the training data consisted of 40% synthetic data (almost half!)</p><p>The vibes are always interesting with Phi models, so we'll keep an eye out. Also notable: the base models weren't released due to "safety issues", and this model was trained for single-turn use-cases rather than multi-turn chat applications</p><p>MetaGene-1: AI for Pandemic Monitoring and Pathogen Detection</p><p>Now, this one's a bit different. We usually talk about LLMs in this section, but this is more about the "open source" than the "LLM." Prime Intellect, along with folks from USC, released MetaGene-1, a <em>metagenomic foundation model</em>. That's a mouthful, right? Thankfully, we had Oliver Liu, a PhD student at USC, and an author on this paper, join us to explain.</p><p>Oliver clarified that the goal is to use AI for "biosurveillance, pandemic monitoring, and pathogen detection." They trained a 7B parameter model on 1.5 <em>trillion</em> base pairs of DNA and RNA sequences from wastewater, creating a model surprisingly capable of zero-shot embedding. Oliver pointed out that while using genomics to pretrain foundation models is not new, MetaGene-1 is, "in its current state, the largest model out there" and is "one of the few decoder only models that are being used". They have also collected 15T base pairs but trained on 10% of them due to grant and compute constraints.</p><p>I really liked this one, and though the science behind this was complex, I couldn't help but get excited about the potential of transformer models catching or helping catch the next COVID 👏</p><p>rStar-Math: Making Small LLMs Math Whizzes with Monte Carlo Tree Search</p><p>Alright, this one blew my mind. 
A paper from Microsoft (yeah, them again) called "rStar-Math" basically found a way to make <em>small</em> LLMs do math better than o1 using Monte Carlo Tree Search (MCTS). I know, I know, it sounds wild. They took models like Phi-3-mini (a tiny 3.8B parameter model) and Qwen 2.5 3B and 7B, slapped some MCTS magic on top, and suddenly these models are acing the AIME 2024 competition math benchmark and scoring 90% on general math problems. For comparison, OpenAI's o1-preview scores 85.5% on math and o1-mini scores 90%. This is WILD, as just 5 months ago, it was unimaginable that any LLM could solve math of this complexity, then reasoning models could, and now small LLMs with some MCTS can!</p><p>Even crazier, they observed an "emergence of intrinsic self-reflection capability" in these models during problem-solving, something they weren't designed to do. LDJ chimed in saying "we're going to see more papers showing these things emerging and caught naturally." So, is 2025 the year of not just AI agents, but also emergent reasoning in LLMs? It's looking that way. The code isn't out yet (the GitHub link in the paper is currently a <a target="_blank" href="https://github.com/microsoft/rStar">404</a>), but when it drops, you can bet we'll be all over it.</p><p>Big Companies and LLMs</p><p>OpenAI: From AGI to ASI</p><p>Okay, let's talk about the elephant in the room: Sam Altman's blog post. While reflecting on getting fired from his job on like a casual Friday, he dropped this bombshell: "We are now confident that we know how to build AGI as we have traditionally understood it." And then, as if that wasn't enough, he added, "We're beginning to turn our aim beyond that <strong>to superintelligence in the true sense of the word</strong>." So basically, OpenAI is saying, "AGI? Done. 
Next up: ASI."</p><p>This feels like a big shift in how openly folks at OpenAI are talking about Superintelligence, and while AGI is yet to be properly defined (LDJ read out the original OpenAI definition on the live show, but the Microsoft definition contractually with OpenAI was a system that generates $100B in revenue) they are already talking about Super Intelligence which surpasses all humans who ever lived, in all domains</p><p>NVIDIA @ CES - Home SuperComputers, 3 scaling laws, new Models</p><p>There were a lot of things happening at CES, the largest consumer electronics show, but the AI focus was on NVIDIA, namely on Jensen Huang's keynote speech!</p><p>He talked about a lot of stuff, really, it's a show, and it's a very interesting watch. NVIDIA is obviously at the forefront of this whole AI wave, and when Jensen tells you that we're at the height of the 3rd scaling law, he knows what he's talking about (because he's fueling all of it with his GPUs) - the third one is of course test time scaling or "reasoning", the thing that powers o1, and the soon-to-come o3 model and other reasoners.</p><p>Project Digits - supercomputer at home?</p><p>Jensen also announced Project Digits: a compact AI supercomputer priced at a relatively modest $3,000. Under the hood, it wields a Grace Blackwell “GB10” superchip that supposedly offers 1 petaflop of AI compute and can support LLMs up to 200B parameters (or you can link 2 of them to run LLama 405b at home!)</p><p>This thing seems crazy, but we don't know more details like the power requirements for this beast!</p><p>Nemotrons again?</p><p>Also announced was a family of NVIDIA LLama Nemotron foundation models, but... weirdly we already have Nemotron LLamas (<a target="_blank" href="https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF">3 months ago</a>), so those are... new ones? I didn't really understand what was announced here, as we didn't get new models, but the announcement was made nonetheless. 
We're due to get 3 new versions of Nemotron on the Nvidia NEMO platform (and Open), sometime soon.</p><p>NVIDIA did release new open source models, with COSMOS, which is a whole platform that includes pretrained world foundation models to help simulate world environments to train robots (among other things).</p><p>They have released txt2world and video2world Pre-trained Diffusion and Autoregressive models in 7B and 14B sizes, that generate videos to simulate visual worlds that have strong alignment to physics.</p><p>If you believe Elon when he says that Humanoid Robots are going to be the biggest category of products (every human will want 1 or 3, so we're looking at 20 billion of them), then COSMOS is a platform to generate synthetic data to train these robots to do things in the real world!</p><p>This week's buzz - Weights & Biases corner</p><p>The wait is over, our LLM Evals course is now LIVE, featuring speakers Graham Neubig (who we had on the pod before, back when Open Hands was still called Open Devin) and Paige Bailey, and Anish and Ayush from my team at W&B!</p><p>If you're building with LLMs in production and don't have a robust evaluation setup, or don't even know where to start with one, this course is definitely for you! <a target="_blank" href="https://wandb.ai/site/courses/evals/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan9">Sign up today</a>. You'll learn from examples of Imagen and Veo from Paige, Agentic examples using <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan9">Weave</a> from Graham and Basic and Advanced Evaluation from Anish and Ayush.</p><p>The workshop in Seattle next week filled up super quick, so since we didn't want to waitlist tons of folks, we have extended it to another night, so those of you who couldn't get in, will have another opportunity on Tuesday! 
(<a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">Workshop page</a>) but while working on it I came up with this distillation of what I'm going to deliver, and wanted to share it with you.</p><p>Vision & Video</p><p>New Moondream 01-09 can tell where you look (among other things) (<a target="_blank" href="https://moondream.ai/blog/introducing-a-new-moondream-1-9b-and-gpu-support">blog</a>, <a target="_blank" href="https://huggingface.co/vikhyatk/moondream2">HF</a>)</p><p>We had some breaking news on the show! Vik Korrapati, the creator of Moondream, joined us to announce updates to Moondream, a new version of his tiny vision language model. This new release has some incredible capabilities, including pointing, object detection, structured output (like JSON), and even <em>gaze detection</em>. Yes, you read that right. Moondream can now tell you where someone (or even a pet!) is looking in an image.</p><p>Vik explained how they achieved this: "We took one of the training datasets that Gazelle trained on and added it to the Moondream fine tuning mix". What's even more impressive is that Moondream is tiny - the new version comes in 2B and 0.5B parameter sizes. As Vik said, "0.5b is we actually started with the 2b param model and we pruned down while picking specific capabilities you want to preserve". This makes it perfect for edge devices and applications where cost or privacy is a concern. It's incredible to see how far Moondream has come, from a personal project to a company with seven employees working on it.</p><p>Since Vik joined ThursdAI last January (we seem to be on a kick of revisiting with our guests from last year!) 
Moondream is a company, but they are committed to open source, and so this release is also Apache 2 👏 but you can also try this out on their website <a target="_blank" href="https://moondream.ai/playground">playground</a> and hire them if you need to finetune a custom tiny vision model!</p><p>Voice & Audio</p><p>Very exciting updates in the OSS voice and audio this week!</p><p>KOKORO TTS - Apache 2 tiny (82M! params) TTS that's #1 on TTS arena (<a target="_blank" href="https://huggingface.co/hexgrad/Kokoro-82M">HF</a>,<a target="_blank" href="https://huggingface.co/spaces/hexgrad/Kokoro-TTS">Demo</a>)</p><p>Honestly when Wolfram told me about Kokoro being #1 on TTS arena and that it was released a few weeks back, I almost skipped giving this an update, but wow, this tiny tiny model can run on edge devices, can run in your browser, and the sound it generates is SO clean!</p><p>It's Apache 2 licensed and the voices were trained on non-licensed data (per the author)</p><p>There's no voice cloning support yet, but there are voice packs you can use, and somehow, they got the SKY voice. Remember the one that Scarlett Johansson almost sued OpenAI for? That one! 
And for 82M parameters it sounds so good, hell, for any TTS, it sounds very good!</p><p>ByteDance - LatentSync state of the art lip syncing (<a target="_blank" href="https://x.com/bdsqlsz/status/1875474807124586524">X</a>, <a target="_blank" href="https://arxiv.org/abs/2412.09262">Paper</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/latentsync">Fal</a>)</p><p>In the same week, ByteDance released a SOTA lip syncing OSS model called LatentSync, which takes a voice (for example, such as the one you can create with Kokoro above) and a video, and syncs the lips of the person in the video, to make it seem like that person said the thing.</p><p>This is great for translation purposes, for example; here's a quick example that takes my cloned voice (via 11labs) reading a translated opening of the show in Spanish and overlays it on top of my actual video, and it's pretty good!</p><p>This week Lex Fridman interviewed Volodymyr Zelensky and I loved the technical and AI aspect of that whole multilingual interview, they have translated it into English, Russian and Ukrainian. But the lips weren't synced so it looked a bit off still. Now consider the difference with and without lip syncing (here's a quick example I whipped up)</p><p>Baidu - Hallo 3 - generative avatars now with animated backgrounds</p><p>Meanwhile over at Baidu, Hallo 3 is their 3rd iteration of generative portraits, a way to turn a single image into a completely animated avatar, by also providing it a recording of your voice (or a TTS, does it really matter at this point?)</p><p>The highlight here is, the background is now part of these avatars! Whereas previously these avatars were static, now they have dynamic backgrounds. Tho I still feel weirded out by their lip movements, but maybe with the above lipsyncing this can be fixed?</p><p>Not a bad second week of the year, eh? 
A LOT of open source across multimodalities, supercomputers at home, tiny vision and TTS models and tons of apache 2 or MIT licensed models all over!</p><p>See you guys next week (well, some of you in person in SF and Seattle) but most of you next week on ThursdAI! 🫡</p><p>Tl;DR + Show Notes</p><p>* <strong>Open Source LLMs</strong></p><p>* Phi-4 MIT licensed family of models from Microsoft (<a target="_blank" href="https://x.com/sytelus/status/1877015492126220594">X</a>, <a target="_blank" href="https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090">Blog</a>, <a target="_blank" href="https://huggingface.co/microsoft/phi-4">HF</a>)</p><p>* Prime Intellect - MetaGENE-1 - <em>metagenomic foundation model</em> (<a target="_blank" href="https://metagene.ai/">Site</a>, <a target="_blank" href="https://x.com/olliezliu/status/1876312788873977996">X</a>, <a target="_blank" href="https://arxiv.org/abs/2501.02045">Paper</a>)</p><p>* rStar-Math - making Small LLMs do Math better than o1 with MCTS (<a target="_blank" href="https://arxiv.org/abs/2501.04519">Paper</a>, <a target="_blank" href="https://github.com/microsoft/rStar">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Sam Altman releases an ASI blog, multiple OpenAI people switch from AGI to ASI (<a target="_blank" href="https://x.com/slow_developer/status/1876962062473023488">X</a>)</p><p>* NVIDIA updates from CES (<a target="_blank" href="https://x.com/alxfazio/status/1876507737909293339">X</a>)</p><p>* XAI - Grok IOS app + Grok 3 finished pre-training</p><p>* Qwen has a new web portal with all their modals - <a target="_blank" href="https://chat.qwenlm.ai/auth#email=git@alexw.me&#38;name=altryne&#38;oauth_sub=github@463317">chat.qwenlm.ai</a></p><p>* <strong>This weeks Buzz</strong></p><p>* Evals Course is LIVE - Evals with Paige Bailey and Graham Neubig Course Signup (<a target="_blank" 
href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=Jan9">Signup</a>)</p><p>* San Francisco is still open (<a target="_blank" href="https://lu.ma/bzqvsqaa">Details</a>)</p><p>* Seattle is almost waitlisted (<a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">Workshop</a>)</p><p>* <strong>Vision & Video</strong></p><p>* NVIDIA Cosmos - World Foundation Models (<a target="_blank" href="https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai">Post</a>, <a target="_blank" href="https://github.com/NVIDIA/Cosmos?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/cosmos-6751e884dc10e013a0a0d8e6">HF</a>)</p><p>* Moondream 2 announcement - new evals - Chat with Vik Korrapati (<a target="_blank" href="https://x.com/vikhyatk/status/1877407680228143370">X</a>, <a target="_blank" href="https://huggingface.co/vikhyatk/moondream2">HF</a>, <a target="_blank" href="https://moondream.ai/playground">Try It</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Kokoro - #1 TTS with Apache 2 license (<a target="_blank" href="https://huggingface.co/hexgrad/Kokoro-82M">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/hexgrad/Kokoro-TTS">Demo</a>)</p><p>* Baidu - Hallo 3 - generative portraits (<a target="_blank" href="https://fudan-generative-vision.github.io/hallo3/#/">Project</a>, <a target="_blank" href="https://github.com/fudan-generative-vision/hallo3">Github</a>, <a target="_blank" href="https://huggingface.co/fudan-generative-ai/hallo3">HF</a>)</p><p>* ByteDance - LatentSync lip syncing model (<a target="_blank" href="https://x.com/bdsqlsz/status/1875474807124586524">X</a>, <a target="_blank" href="https://arxiv.org/abs/2412.09262">Paper</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/latentsync">Fal</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stability - 
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images ( <a target="_blank" href="https://huggingface.co/spaces/stabilityai/stable-point-aware-3d">HF</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-9th-nvidias-tiny-supercomputer</link><guid isPermaLink="false">substack:post:154514223</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 10 Jan 2025 02:10:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/154514223/4b59c7562f5005584aceda49d6cd8fef.mp3" length="57909991" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4826</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/154514223/6ef6deff40f2aed512b109e294478b50.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Jan 2 - is 25' the year of AI agents? ]]></title><description><![CDATA[<p>Hey folks, Alex here 👋 Happy new year!</p><p>On our first episode of this year, and the second quarter of this century, there wasn't a lot of AI news to report on (most AI labs were on a well deserved break). So this week, I'm very happy to present a special ThursdAI episode, an interview with <a target="_blank" href="https://x.com/joaomdmoura">Joāo Moura</a>, CEO of <a target="_blank" href="http://Crew.ai">Crew.ai</a> all about AI agents!</p><p>We first chatted with Joāo a <a target="_blank" href="https://sub.thursdai.news/p/jan14-sunday-special-deep-dives">year ago</a>, back in January of 2024, as CrewAI was blowing up but still just an open source project, it got to be the number 1 trending project on Github, and #1 project on Product Hunt. 
(You can either listen to the podcast or watch it in the embedded Youtube above)</p><p>00:36 Introduction and New Year Greetings</p><p>02:23 Updates on Open Source and LLMs</p><p>03:25 Deep Dive: AI Agents and Reasoning</p><p>03:55 Quick TLDR and Recent Developments</p><p>04:04 Medical LLMs and Modern BERT</p><p>09:55 Enterprise AI and Crew AI Introduction</p><p>10:17 Interview with João Moura: Crew AI</p><p>25:43 Human-in-the-Loop and Agent Evaluation</p><p>33:17 Evaluating AI Agents and LLMs</p><p>44:48 Open Source Models and Fin to OpenAI</p><p>45:21 Performance of Claude's Sonnet 3.5</p><p>48:01 Different parts of an agent topology, brain, memory, tools, caching</p><p>53:48 Tool Use and Integrations</p><p>01:04:20 Removing LangChain from Crew</p><p>01:07:51 The Year of Agents and Reasoning</p><p>01:18:43 Addressing Concerns About AI</p><p>01:24:31 Future of AI and Agents</p><p>01:28:46 Conclusion and Farewell</p><p>---</p><p>Is 2025 "the year of AI agents"?</p><p>AI agents, as I remember them, started for me as a concept a few months after I started ThursdAI, when AutoGPT exploded. It was such a novel idea at the time: run LLM requests in a loop.</p><p>(In fact, back then, I came up with a retry-with-AI concept and called it <a target="_blank" href="https://x.com/altryne/status/1632253117827010566">TrAI/Catch</a>, where upon an error, I would feed that error back into the GPT api and ask it to correct itself. It feels so long ago!)</p><p>AutoGPT became the fastest ever Github project to reach 100K stars, and while exciting, it did not work.</p><p>Since then we saw multiple attempts at agentic frameworks, like babyAGI and autoGen. CrewAI was one of them, and it keeps being a favorite among many folks.</p><p>So, what is an AI agent? 
Simon Willison, friend of the pod, has a mission: to ask everyone who announces a new agent what they mean when <a target="_blank" href="https://x.com/simonw/status/1863567881553977819">they say it</a>, because it seems that everyone "shares" a common understanding of AI agents, but it's different for everyone.</p><p>We'll start with Joāo's explanation and go from there. But let's assume the basics: it's a set of LLM calls running in a self-correcting loop, with access to planning, external tools (via function calling) and a memory of sorts, that makes decisions.</p><p>Though, as we go into detail, you'll see that since the very basic "run LLM in the loop" days, the agents of 2025 have evolved and gained a lot of complexity.</p><p>My takeaways from the conversation</p><p>I encourage you to listen / watch the whole interview; Joāo is deeply knowledgeable about the field and we go into a lot of topics, but here are my main takeaways from our chat</p><p>* Enterprises are adopting agents, starting with internal use-cases</p><p>* Crews have 4 different kinds of memory, Long Term (across runs), short term (each run), Entity term (company names, entities), pre-existing knowledge (DNA?)</p><p>* TIL about a "do all links respond with 200" guardrail</p><p>* Some of the agent tools we mentioned</p><p>* Stripe Agent API - for agent payments and access to payment data (<a target="_blank" href="https://stripe.dev/blog/adding-payments-to-your-agentic-workflows">blog</a>)</p><p>* Okta Auth for Gen AI - agent authentication and role management (<a target="_blank" href="https://www.auth0.ai/">blog</a>)</p><p>* E2B - code execution platform for agents (<a target="_blank" href="https://e2b.dev/">e2b.dev</a>)</p><p>* BrowserBase - programmatic web-browser for your AI agent</p><p>* Exa - search grounding for agents for real time understanding</p><p>* Crew has 13 crews that run 24/7 to automate their company</p><p>* Crews like Onboarding User Enrichment Crew, Meetings Prep, Taking Phone 
Calls, Generate Use Cases for Leads</p><p>* GPT-4o mini was the most used model of 2024 for CrewAI, with the main factors being speed / cost</p><p>* Speed of AI development makes it hard to standardize and solidify common integrations.</p><p>* Reasoning models like o1 still haven't seen a lot of success, partly due to speed, partly due to the different way of prompting they require.</p><p>This Week's Buzz</p><p>We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI (previously Open Devin). We've distilled a lot of what we learned about evaluating LLM applications while building <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan2">Weave</a>, our LLM Observability and Evaluation tooling, and are excited to share this with you all! <a target="_blank" href="https://wandb.ai/site/courses/evals/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan2">Get on the list</a></p><p>Also, 2 workshops (also about evals) from us are upcoming, one in SF on <a target="_blank" href="https://lu.ma/bzqvsqaa">Jan 11th</a> and one in Seattle on <a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">Jan 13th</a> (which I'm going to lead!), so if you're in those cities at those times, I would love to see you!</p><p>And that's it for this week; there wasn't a LOT of news, as I said. 
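</p><p>The agent pattern discussed above (LLM calls in a self-correcting loop, with a guardrail like "do all links respond with 200" feeding failures back in, much like my old TrAI/Catch trick) can be sketched in a few lines. This is a toy illustration with a stubbed LLM and a stubbed link checker, not CrewAI's actual implementation:</p>

```python
# A minimal sketch of the agent pattern discussed above: an LLM call inside a
# self-correcting loop, with a "do all links respond with 200" guardrail that
# feeds failures back into the next prompt (the TrAI/Catch idea).
# Both the LLM and the link checker are stubbed out here, so this runs offline.

KNOWN_GOOD_LINKS = {"https://crewai.com", "https://wandb.ai"}

def broken_links(text: str) -> list[str]:
    """Guardrail: return links in `text` that do NOT respond with 200 (stubbed)."""
    links = [word for word in text.split() if word.startswith("http")]
    return [link for link in links if link not in KNOWN_GOOD_LINKS]

def stub_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; it "corrects itself"
    # once the prompt contains the failure report.
    if "previous attempt failed" in prompt:
        return "See https://crewai.com and https://wandb.ai"
    return "See https://crewai.com and https://example.invalid/dead"

def agent_loop(task: str, max_turns: int = 3) -> tuple[str, int]:
    prompt = task
    for turn in range(1, max_turns + 1):
        answer = stub_llm(prompt)
        bad = broken_links(answer)
        if not bad:
            return answer, turn  # guardrail passed
        # Self-correction: feed the error back into the next LLM call
        prompt = f"{task}\nYour previous attempt failed, dead links: {bad}. Fix them."
    raise RuntimeError("agent gave up after max_turns")

answer, turns = agent_loop("Summarize agent tooling with links")
print(f"passed guardrail on turn {turns}: {answer}")
```

<p>Swap the stubs for a real chat-completion call and a real HTTP HEAD request and you have the skeleton that frameworks like CrewAI build on, with planning, memory and tool routing layered on top.</p><p>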
The interesting thing is, even in the very short week, the news that we did get was all about agents and reasoning, so it looks like 2025 is agents and reasoning, agents and reasoning!</p><p>See you all next week 🫡</p><p>TL;DR with links:</p><p>* <strong>Open Source LLMs</strong></p><p>* HuatuoGPT-o1 - medical LLM designed for medical reasoning (<a target="_blank" href="https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B">HF</a>, <a target="_blank" href="https://huggingface.co/papers/2412.18925">Paper</a>, <a target="_blank" href="https://github.com/FreedomIntelligence/HuatuoGPT-o1">Github</a>, <a target="_blank" href="https://huggingface.co/datasets/FreedomIntelligence/medical-o1-verifiable-problem">Data</a>)</p><p>* Nomic - modernbert-embed-base - first embed model on top of modernbert (<a target="_blank" href="https://huggingface.co/nomic-ai/modernbert-embed-base">HF</a>)</p><p>* HuggingFace - SmolAgents lib to build agents (<a target="_blank" href="https://huggingface.co/blog/smolagents">Blog</a>)</p><p>* SmallThinker-3B-Preview - a Qwen 2.5 3B "reasoning" finetune (<a target="_blank" href="https://huggingface.co/PowerInfer/SmallThinker-3B-Preview">HF</a>)</p><p>* Wolfram's new benchmarks including DeepSeek v3 (<a target="_blank" href="https://x.com/WolframRvnwlf/status/1874889165919384057">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Newcomer Rubik's AI Sonus-1 family - Mini, Air, Pro and Reasoning (<a target="_blank" href="https://x.com/RubiksAI/status/1874682159379972325">X</a>, Chat)</p><p>* Microsoft "estimated" GPT-4o-mini to be ~8B (<a target="_blank" href="https://x.com/Yuchenj_UW/status/1874507299303379428">X</a>)</p><p>* Meta plans to bring AI profiles to their social networks (<a target="_blank" href="https://x.com/petapixel/status/1874792802061844829">X</a>)</p><p>* <strong>This Week's Buzz</strong></p><p>* W&B Free Evals Course with Paige Bailey and Graham Neubig - <a target="_blank" 
href="https://wandb.ai/site/courses/evals/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=jan2">Free Sign Up</a></p><p>* SF evals event - <a target="_blank" href="https://lu.ma/bzqvsqaa">January 11th</a></p><p>* Seattle evals workshop - <a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">January 13th</a></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-2-is-25-the-year-of</link><guid isPermaLink="false">substack:post:154033660</guid><dc:creator><![CDATA[Alex Volkov and Joāo Moura]]></dc:creator><pubDate>Thu, 02 Jan 2025 23:53:25 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/154033660/bebb8fc6d35fb1bdd06dd2b01c1f01ec.mp3" length="65867846" type="audio/mpeg"/><itunes:author>Alex Volkov and Joāo Moura</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5489</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/154033660/5ef100b5c7ab48ba23534d7771ed989e.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 26 - OpenAI o3 & o3 mini, DeepSeek v3 658B beating Claude, Qwen Visual Reasoning, Hume OCTAVE & more AI news ]]></title><description><![CDATA[<p>Hey everyone, Alex here 👋</p><p>I was hoping for a quiet holiday week, but whoa, while the last newsletter was only a week ago, what a looong week it has been, just Friday after the last newsletter, it felt like OpenAI has changed the world of AI once again with o3 and left everyone asking "was this AGI?" over the X-mas break (Hope Santa brought you some great gifts!) 
and then, not to be outdone, DeepSeek open sourced basically a Claude 3.5 level behemoth, DeepSeek v3, just this morning!</p><p>Since the breaking news from DeepSeek took us by surprise, the show went a bit longer (3 hours today!) than expected, so as a bonus, I'm going to release a separate episode with a yearly recap + our predictions from last year and for next year in a few days (soon in your inbox!) </p><p>TL;DR</p><p>* <strong>Open Source LLMs</strong></p><p>* <a target="_blank" href="https://huggingface.co/THUDM/cogagent-9b-20241220">CogAgent-9B</a> (<a target="_blank" href="https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en">Project</a>, <a target="_blank" href="https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en">Github</a>)</p><p>* Qwen QvQ 72B - open weights visual reasoning (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1871602879972405626">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qvq-676448c820912236342b9888">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/QVQ-72B-preview">Demo</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/QVQ-72B-preview">Project</a>)</p><p>* GoodFire Ember - MechInterp API - GoldenGate Llama 70B</p><p>* 🔥 DeepSeek v3 658B MoE - Open Source Claude level model at $6M (<a target="_blank" href="https://x.com/deepseek_ai/status/1872242657348710721">X</a>, <a target="_blank" href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">Paper</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3">HF</a>, <a target="_blank" href="https://chat.deepseek.com/">Chat</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🔥 OpenAI reveals o3 and o3 mini (<a target="_blank" href="https://openai.com">Blog</a>, <a target="_blank" href="https://x.com/altryne/status/1870169615910772971">X</a>)</p><p>* <a target="_blank" href="https://x.ai">X.ai</a> raises ANOTHER 6B dollars - on 
their way to 200K H200s (<a target="_blank" href="https://x.com/xai/status/1871313084280644079?s=46">X</a>)</p><p>* <strong>This Week's Buzz</strong></p><p>* Two W&B workshops upcoming in January</p><p>* SF - <a target="_blank" href="https://lu.ma/bzqvsqaa">January 11</a></p><p>* Seattle - <a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">January 13</a> (workshop by yours truly!)</p><p>* New Evals course with Paige Bailey and Graham Neubig - <a target="_blank" href="https://wandb.ai/site/courses/evals/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec19">pre-sign up for free</a></p><p>* <strong>Vision & Video</strong></p><p>* Kling 1.6 update (<a target="_blank" href="https://twitter.com/dinprasetyo_id/status/1871645159789920739">Tweet</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Hume OCTAVE - 3B speech-language model (<a target="_blank" href="https://x.com/hume_ai/status/1871263932742246513">X</a>, <a target="_blank" href="https://hume.ai/">Blog</a>)</p><p>* <strong>Tools</strong></p><p>* OpenRouter added Web Search Grounding to 300+ models (<a target="_blank" href="https://x.com/OpenRouterAI/status/1871682806335824029">X</a>)</p><p>Open Source LLMs</p><p>DeepSeek v3 658B - frontier level open weights model for ~$6M (<a target="_blank" href="https://x.com/deepseek_ai/status/1872242657348710721">X</a>, <a target="_blank" href="https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf">Paper</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/DeepSeek-V3">HF</a>, <a target="_blank" href="https://chat.deepseek.com/">Chat</a>)</p><p>This was absolutely the top of the open source / open weights news for the past week, and honestly maybe for the past month. 
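</p><p>Before the details, a quick refresher: v3 is a sparse Mixture of Experts, where a router activates only a few expert networks per token. Here's a toy illustration of top-k gating, the general technique only; DeepSeek's actual design adds a shared expert, multi-token prediction and its own load-balancing tricks on top:</p>

```python
# Toy top-k Mixture-of-Experts routing in pure Python. Each token's router
# scores every expert, but only the top-k actually run, so a huge total
# parameter count yields a small *active* parameter count per token.
# This illustrates the general technique only -- it is NOT DeepSeek's
# implementation (v3 uses 256 routed experts + 1 shared, with its own
# load-balancing scheme on top).
import math

NUM_EXPERTS, TOP_K = 8, 2  # tiny stand-ins for v3's 256 experts / 8 routed

def expert(i: int, x: float) -> float:
    # Stand-in for an expert MLP: each expert computes a different function
    return (i + 1) * x

def route(x: float, gate_weights: list[float]) -> float:
    scores = [w * x for w in gate_weights]  # router scores, one per expert
    top = sorted(range(NUM_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    weights = [math.exp(scores[i]) for i in top]  # softmax over the top-k only
    total = sum(weights)
    # Output is the gate-weighted sum of just the k active experts
    return sum((w / total) * expert(i, x) for w, i in zip(weights, top))

gates = [0.1, 0.5, 0.3, 0.9, 0.2, 0.7, 0.4, 0.6]
y = route(2.0, gates)
print(round(y, 3))  # only experts 3 and 5 ran for this "token"
```

<p>The point of the sketch: with 256 experts and only ~9 active per token, most of the 658B parameters sit idle for any given token, which is how the active-parameter count stays at 37B.</p><p>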
DeepSeek, the quant firm turned AI lab from China, has dropped a behemoth of a model: a 658B-parameter MoE (37B active) that you'd need 8xH200 to even run, which beats Llama 405B and GPT-4o on most benchmarks and even Claude Sonnet 3.5 on several evals! </p><p>The vibes seem to be very good with this one, and while it's not all the way beating Claude yet, it's nearly up there already. But the kicker is, they trained it with very restricted compute, per the paper, with ~2K H800s (which are like H100s but with less bandwidth) for 14.8T tokens (that's ~15x cheaper than Llama 405B, for comparison). </p><p>For evaluations, this model excels on coding and math, which is not surprising given how excellent DeepSeek Coder has been, but still, very very impressive! </p><p>On the architecture front, the very interesting thing is, this feels like Mixture of Experts v2, with a LOT of experts (256) and 8+1 active at the same time, multi-token prediction, and a lot of optimization tricks outlined in the impressive paper (here's a great <a target="_blank" href="https://x.com/eliebakouch/status/1872304368462004608">recap</a> of the technical details)</p><p>The highlight for me was that DeepSeek is distilling their recent R1 model into this version, which likely increases its performance on math and code, which it absolutely crushes (51.6 on CodeForces and 90.2 on MATH-500)  </p><p>The additional aspect of this is the API costs: while they are going to raise the prices come February (they literally just swapped v2.5 for v3 in their APIs without telling a soul lol), the price/performance for this model is just absurd. </p><p>Just a massive, massive release from the WhaleBros; now I just need a quick 8xH200 to run this and I'm good 😅 </p><p>Other open source news - Qwen QvQ, CogAgent-9B and GoldenGate Llama</p><p>In other open source news this week, our friends from Qwen have released a very interesting preview, called Qwen QvQ, a visual reasoning model. 
It uses the same reasoning techniques that we got from them in QwQ 32B, but built on the excellent Qwen VL, to reason about images, and frankly, it's really fun to see it think about an image. You can try it <a target="_blank" href="https://huggingface.co/spaces/Qwen/QVQ-72B-preview">here</a></p><p>There's also a new update to CogAgent-9B (<a target="_blank" href="https://cogagent.aminer.cn/blog#/articles/cogagent-9b-20241220-technical-report-en">page</a>), an agent that understands and controls your computer, which claims to beat Claude 3.5 Sonnet Computer Use with just a 9B model! </p><p>This is very impressive, though I haven't tried it just yet; I'm excited to see numbers like these from open source VLMs driving your computer and doing tasks for you!</p><p>A super quick word from ... Weights & Biases! </p><p>We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI. We've distilled a lot of what we learned about evaluating LLM applications while building <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec26">Weave</a>, our LLM Observability and Evaluation tooling, and are excited to share this with you all! <a target="_blank" href="https://wandb.ai/site/courses/evals/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec19">Get on the list</a></p><p>Also, 2 workshops (also about evals) from us are upcoming, one in <a target="_blank" href="https://lu.ma/bzqvsqaa">SF on Jan 11th</a> and one in <a target="_blank" href="https://seattle.aitinkerers.org/p/ai-in-production-evals-observability-workshop">Seattle on Jan 13th</a> (which I'm going to lead!), so if you're in those cities at those times, would love to see you!</p><p>Big Companies - APIs & LLMs</p><p>OpenAI introduces o3 and o3-mini - breaking the Arc-AGI challenge, GPQA, and teasing AGI? 
</p><p>On the last day of the 12 days of OpenAI, we got the evals of their upcoming o3 reasoning model (and o3-mini), and whoah. I think I speak on behalf of most of my peers when I say we were all shaken by how fast the jump in capabilities happened from o1-preview to o1 full (itself released fully just two weeks prior, on day 1 of the 12 days) </p><p>Almost all the evals shared with us are insane, from 96.7 on AIME (up from 13.4 with GPT-4o earlier this year) to 87.7 on GPQA Diamond (which is... PhD-level science questions) </p><p>But two evals stand out the most, and one of course is the Arc-AGI eval/benchmark. It was designed to be very difficult for LLMs and easy for humans, and o3 solved it with an unprecedented 87.5% (on the high compute setting)</p><p>This benchmark was long considered impossible for LLMs, and the absolute crushing of this benchmark over the past 6 months is something to behold: </p><p>The other thing I want to highlight is the Frontier Math benchmark, released by Epoch, which collaborated with top mathematicians to create a set of very challenging math problems. At the time of release (Nov 12), the top LLMs solved only 2% of this benchmark. With o3 solving 25% just weeks after o1's 2%, it's quite incredible to see how fast these models are increasing in capabilities. </p><p>Is this AGI? </p><p>This release absolutely started (or restarted) a debate about what AGI is, given that these goalposts move all the time. Some folks are freaking out and saying that if you're a software engineer, you're "cooked" (o3 solved 71.7% of SWE-bench Verified and gets a 2727 Elo on CodeForces, which is competition code; that's 175th global rank among human coders!), and some have also calculated its IQ, estimating it at 157 based on the above CodeForces rating. </p><p>So the obvious question being asked (among the people who follow the news; most people who don't follow the news couldn't care less) is.. is this AGI? 
Or is something else AGI? </p><p>Well, today we got a very interesting answer to this question, from a leaked agreement between Microsoft and OpenAI, in which they have a very clear definition of AGI: "a system generating $100 Billion in profits." A reminder: per their previous agreement, if OpenAI builds AGI, Microsoft will lose access to OpenAI technologies. </p><p>o3-mini and test-time compute as the new scaling law</p><p>While I personally was as shaken as most of my peers by these incredible breakthroughs, I was also looking at the more practical and upcoming o3-mini release, which is supposed to come in January per Sam Altman. Per their evaluations, o3-mini is going to be significantly cheaper and faster than o3, while offering 3 levels of reasoning effort to developers (low, medium and high); on the medium level, it would beat the current best model (o1) while being cheaper than o1-mini. </p><p>All of these updates and improvements in the span of less than 6 months are a testament to just how impressive test-time compute is as an additional new scaling law. Not to mention that the current scaling laws still hold: we're waiting for Orion or GPT-4.5 or whatever it's called, and that underlying model will probably significantly improve the reasoning models that are built on top of it. </p><p>Also, if the above results from DeepSeek are anything to go by (and they should be), the ability of these reasoning models to generate incredible synthetic training data for the next models is quite something, so... the flywheel is upon us: models get better and make better models. </p><p>Other AI news from this week: </p><p>The most impressive of the other news came from Hume, showcasing OCTAVE - their new 3B speech-language model, which is able to not only fake someone's voice with 5 seconds of audio, but also take on their personality, style of speaking and mannerisms. 
This is not only a voice model, mind you, but a 3B LLM as well, so it can mimic a voice and even create new voices from a prompt. </p><p>While they mentioned the size, the model has not been released yet and will be coming to their API soon, and when I asked about open source, it seems that Hume's CEO did not think it a safe bet to open up this kind of tech to the world yet. </p><p>I also loved a new little x-mas experiment from OpenRouter and Exa, wherein, on the actual OpenRouter interface, you can now chat with the over 300 models they serve and ground answers in search. </p><p>This is it for this week, which, again, I thought was going to be a very chill one, and... nope! </p><p>The second part of the show/newsletter, in which we did a full recap of the last year, talked about our predictions from last year and made predictions for the next one, is going to drop in a few days 👀 So keep your eyes peeled. (I decided to separate the two, as a 3-hour podcast about AI is... long; I'm no Lex Fridman lol) </p><p>As always, if you found any of this interesting, please share it with a friend, and comment on social media, or right here on Substack; I love getting feedback on what works and what doesn't. </p><p>Thank you for being part of the ThursdAI community 👋</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/dec-26-openai-o3-and-o3</link><guid isPermaLink="false">substack:post:153662436</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 27 Dec 2024 02:34:26 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/153662436/05bb48d07edd106d906277aad8ca5455.mp3" length="68786202" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5732</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/153662436/77b99758bfeef747962e55c3a36c7290.jpg"/></item><item><title><![CDATA[🎄ThursdAI - Dec19 - o1 vs gemini reasoning, VEO vs SORA, and holiday season full of AI surprises]]></title><description><![CDATA[<p>For the full show notes and links visit https://sub.thursdai.news</p><p>🔗 Subscribe to our show on Spotify: https://thursdai.news/spotify</p><p>🔗 Apple: https://thursdai.news/apple</p><p>Ho, ho, holy moly, folks! Alex here, coming to you live from a world where AI updates are dropping faster than Santa down a chimney! 🎅 It's been another absolutely BANANAS week in the AI world, and if you thought last week was wild, and we're due for a break, buckle up, because this one's a freakin' rollercoaster! 🎢</p><p>In this episode of ThursdAI, we dive deep into the recent innovations from OpenAI, including their 1-800 ChatGPT phone service and new advancements in voice mode and API functionalities. We discuss the latest updates on O1 model capabilities, including Reasoning Effort settings, and highlight the introduction of WebRTC support by OpenAI. 
Additionally, we explore the groundbreaking VEO2 model from Google, the generative physics engine Genesis, and new developments in open source models like Cohere's Command R7b. We also provide practical insights on using tools like Weights & Biases for evaluating AI models and share tips on leveraging GitHub Gigi. Tune in for a comprehensive overview of the latest in AI technology and innovation.</p><p>00:00 Introduction and OpenAI's 12 Days of Releases</p><p>00:48 Advanced Voice Mode and Public Reactions</p><p>01:57 Celebrating Tech Innovations</p><p>02:24 Exciting New Features in AVMs</p><p>03:08 TLDR - ThursdAI December 19</p><p>12:58 Voice and Audio Innovations</p><p>14:29 AI Art, Diffusion, and 3D</p><p>16:51 Breaking News: Google Gemini 2.0</p><p>23:10 Meta Apollo 7b Revisited</p><p>33:44 Google's Sora and Veo2</p><p>34:12 Introduction to Veo2 and Sora</p><p>34:59 First Impressions of Veo2</p><p>35:49 Comparing Veo2 and Sora</p><p>37:09 Sora's Unique Features</p><p>38:03 Google's MVP Approach</p><p>43:07 OpenAI's Latest Releases</p><p>44:48 Exploring OpenAI's 1-800 CHAT GPT</p><p>47:18 OpenAI's Fine-Tuning with DPO</p><p>48:15 OpenAI's Mini Dev Day Announcements</p><p>49:08 Evaluating OpenAI's O1 Model</p><p>54:39 Weights & Biases Evaluation Tool - Weave</p><p>01:03:52 ArcAGI and O1 Performance</p><p>01:06:47 Introduction and Technical Issues</p><p>01:06:51 Efforts on Desktop Apps</p><p>01:07:16 ChatGPT Desktop App Features</p><p>01:07:25 Working with Apps and Warp Integration</p><p>01:08:38 Programming with ChatGPT in IDEs</p><p>01:08:44 Discussion on Warp and Other Tools</p><p>01:10:37 GitHub GG Project</p><p>01:14:47 OpenAI Announcements and WebRTC</p><p>01:24:45 Modern BERT and Smaller Models</p><p>01:27:37 Genesis: Generative Physics Engine</p><p>01:33:12 Closing Remarks and Holiday Wishes</p><p>Here’s a talking podcast host speaking excitedly about his show</p><p>TL;DR - Show notes and Links</p><p>* <strong>Open Source LLMs</strong></p><p>* Meta Apollo 
7B – LMM w/ SOTA video understanding (<a target="_blank" href="https://apollo-lmms.github.io/">Page</a>, <a target="_blank" href="https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B">HF</a>)</p><p>* Microsoft Phi-4 – 14B SLM (<a target="_blank" href="https://techcommunity.microsoft.com/blog/aiplatformblog/introducing-phi-4-microsoft%E2%80%99s-newest-small-language-model-specializing-in-comple/4357090">Blog</a>, <a target="_blank" href="https://aka.ms/Phi-4TechReport">Paper</a>)</p><p>* Cohere Command R 7B – (<a target="_blank" href="https://cohere.com/blog/command-r7b">Blog</a>)</p><p>* Falcon 3 – series of models (<a target="_blank" href="https://x.com/reach_vb/status/1868958425389908343">X</a>, <a target="_blank" href="https://huggingface.co/collections/tiiuae/falcon3-67605ae03578be86e4e87026">HF</a>, <a target="_blank" href="https://falconllm.tii.ae/">web</a>)</p><p>* IBM updates Granite 3.1 + embedding models (<a target="_blank" href="https://huggingface.co/collections/ibm-granite/granite-31-language-models-6751dbbf2f3389bec5c6f02d">HF</a>, <a target="_blank" href="https://huggingface.co/collections/ibm-granite/granite-embedding-models-6750b30c802c1926a35550bb">Embedding</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI releases new o1 + API access (<a target="_blank" href="https://x.com/altryne/status/1869080816485281881">X</a>)</p><p>* Microsoft makes CoPilot Free! 
(<a target="_blank" href="https://x.com/code/status/1869449373995708703">X</a>)</p><p>* Google - Gemini Flash 2 Thinking experimental reasoning model (<a target="_blank" href="https://x.com/OfficialLoganK/status/1869789820308074837">X</a>, <a target="_blank" href="https://aistudio.google.com/">Studio</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* W&B weave Playground now has Trials (and o1 compatibility) (try it)</p><p>* Alex Evaluation of o1 and Gemini Thinking experimental (<a target="_blank" href="https://x.com/altryne/status/1869835859727393234">X</a>, <a target="_blank" href="https://wandb.me/compare-thinking?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec19">Colab</a>, <a target="_blank" href="https://wandb.ai/thursdai/simple-bench-evaluation/weave/compare-evaluations?evaluationCallIds=%5B%220193e073-1f4c-7831-a21f-dd7dea552ff6%22%2C%220193dd58-c032-7ff1-bd0f-563e41d1d929%22%2C%220193dd5c-9bb6-7440-8a9b-26ee9a715b24%22%2C%220193dd63-e2fb-79d0-b64c-8e8ed3f01e7d%22%5D&#38;metrics=%7B%22accuracy_scorer.correct%22%3Atrue%2C%22Model%20Latency%20(avg)%22%3Atrue%2C%22Total%20Tokens%20(avg)%22%3Atrue%7D&#38;utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec19">Dashboard</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Google releases Veo 2 – SOTA text2video modal - beating SORA by most vibes (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1868703624714395907">X</a>)</p><p>* HunyuanVideo distilled with FastHunyuan down to 6 steps (<a target="_blank" href="https://huggingface.co/FastVideo/FastHunyuan">HF</a>)</p><p>* Kling 1.6 (<a target="_blank" href="https://x.com/Kling_ai/status/1869599147046871488">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI realtime audio improvements (<a target="_blank" href="https://platform.openai.com/docs/guides/realtime">docs</a>)</p><p>* 11labs new Flash 2.5 model – 75ms generation (<a target="_blank" 
href="https://x.com/elevenlabsio/status/1869462840941461941">X</a>)</p><p>* Nexa OmniAudio – 2.6B – multimodal local LLM (<a target="_blank" href="https://nexa.ai/blogs/omniaudio-2.6b">Blog</a>)</p><p>* Moonshine Web – real-time speech recognition in the browser (<a target="_blank" href="https://x.com/xenovacom/status/1869423057741230539">X</a>)</p><p>* Sony MMAudio - open source video-to-audio model (<a target="_blank" href="https://hkchengrex.com/MMAudio/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/hkchengrex/MMAudio">Demo</a>)</p><p>* AI Art & Diffusion & 3D</p><p>* Genesis – open source generative 3D physics engine (<a target="_blank" href="https://x.com/Genesis_ai/status/1869477624980153488">X</a>, <a target="_blank" href="https://genesis-embodied-ai.github.io/">Site</a>, <a target="_blank" href="https://github.com/Genesis-Embodied-AI/Genesis">Github</a>)</p><p>* Tools</p><p>* CerebrasCoder – extremely fast apps creation (<a target="_blank" href="https://cerebrascoder.com/">Try It</a>)</p><p>* RepoPrompt to chat with o1 Pro – (<a target="_blank" href="https://repoprompt.com/#features">download</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec19-o1-vs-gemini-reasoning</link><guid isPermaLink="false">substack:post:153392966</guid><dc:creator><![CDATA[Alex Volkov and Kwindla Hultman Kramer]]></dc:creator><pubDate>Fri, 20 Dec 2024 02:48:55 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/153392966/b6c3668a25ff877e29e18205beda3a7e.mp3" length="91804168" type="audio/mpeg"/><itunes:author>Alex Volkov and Kwindla Hultman Kramer</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5738</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/153392966/80f6b4045cdf4ac08439ed00e031b20b.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 12 - unprecedented AI week - SORA, Gemini 2.0 Flash, Apple Intelligence, LLama 3.3, NeurIPS Drama & more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here, writing this from the beautiful Vancouver BC, Canada. I'm here for NeurIPS 2024, the biggest ML conference of the year, and let me tell you, <strong>this was one hell of a week to not be glued to the screen.</strong> </p><p>After last week's banger week, with OpenAI kicking off their 12 days of releases by releasing o1 full and pro mode <a target="_blank" href="https://sub.thursdai.news/p/thursdai-dec-4-openai-o1-and-o1-pro">during ThursdAI</a>, things went parabolic. It seems that all the AI labs decided to just dump EVERYTHING they have before the holidays? 
🎅</p><p>A day after our show, on Friday, Google announced a new Gemini 1206 that became the #1 leading model on LMarena, and Meta released Llama 3.3; then on Saturday, xAI released their new image model, code-named Aurora.</p><p>On a regular week, the above Fri-Sun news would be enough for a full 2-hour ThursdAI show on its own, but not this week. This week it was barely a 15-minute segment 😅 because so MUCH happened starting Monday that we were barely able to catch our breath, so let's dive into it! </p><p>As always, the TL;DR and full show notes are at the end 👇 and this newsletter is sponsored by <strong>W&B Weave</strong>: if you're building with LLMs in production and want to switch to the new Gemini 2.0 today, how will you know if your app is not going to degrade? Weave is the best way! Give it a try <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec12">for free</a>.</p><p>Gemini 2.0 Flash - a new gold standard of fast multimodal LLMs</p><p>Google has, believe it or not, absolutely taken the crown away from OpenAI this week with this incredible Gemini 2.0 release. All of us on the show were in agreement that this is a phenomenal release from Google for the 1-year anniversary of Gemini. </p><p>Gemini 2.0 Flash is beating Pro 002 and Flash 002 on all benchmarks, while being 2x faster than Pro, having a 1M context window, and being fully multimodal! </p><p>Multimodality on input and output</p><p>This model was announced to be fully multimodal on inputs AND outputs, which means it can natively understand text, images, audio, video and documents, and output text, text + images and audio (so it can speak!). Some of these capabilities are restricted to beta users for now, but we know they exist. If you remember Project Astra, this is what powers that project. 
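</p><p>Since the model is available via API from day one, trying Gemini 2.0 Flash from code is only a few lines. A hedged sketch using the google-generativeai Python SDK follows; the model name "gemini-2.0-flash-exp" was the preview identifier at the time, so treat the exact names as assumptions, and a request is only sent if a key is configured:</p>

```python
# Hedged sketch of calling Gemini 2.0 Flash with multimodal input via the
# google-generativeai SDK. The model name and SDK calls reflect preview-era
# docs and are assumptions; verify against the current Google AI Studio docs.
import os

MODEL = "gemini-2.0-flash-exp"  # preview identifier for Gemini 2.0 Flash
prompt_parts = [
    "Proofread this paragraph and talk me through the fixes:",
    # an image/frame part could go here too, e.g. PIL.Image.open("screen.png")
]

if os.environ.get("GOOGLE_API_KEY"):
    import google.generativeai as genai  # pip install google-generativeai
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel(MODEL)
    print(model.generate_content(prompt_parts).text)
else:
    print(f"(no GOOGLE_API_KEY set; would call {MODEL} with {len(prompt_parts)} part(s))")
```

<p>Note that the live, real-time experience powering the Astra-style demo uses a separate bidirectional streaming (Live) API rather than this one-shot call.</p><p>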
In fact, we had <a target="_blank" href="https://x.com/mreflow">Matt Wolfe</a> join the show; he had early access to Project Astra and demoed it live on the show (see above), which is powered by Gemini 2.0 Flash. </p><p>The most amazing thing is, this functionality, which was presented to us just 8 months ago at Google I/O in a premium booth experience, is now available to all, in Google AI Studio, for free! </p><p>Really, you can try it out right now yourself at <a target="_blank" href="https://aistudio.google.com/live">https://aistudio.google.com/live</a>, but here's a demo of it, helping me proofread this exact paragraph by watching the screen and talking me through it. </p><p>Performance out of the box</p><p>This model beating Sonnet 3.5 on SWE-bench Verified completely blew away the narrative on my timeline; nobody was ready for that. This is a Flash model that's outperforming o1 on code!?</p><p>So having a Flash multimodal input/output model with 1M context, accessible via API with a real-time streaming option from release day, is honestly quite amazing to begin with, not to mention that during the preview phase it's currently free. And if we consider the previous prices of Flash, this model is going to considerably undercut the market on the price/performance/speed matrix. </p><p>You can see why this release is taking the crown this week. 👏 </p><p>Agentic is coming with Project Mariner</p><p>Google additionally announced their agentic effort, Project Mariner: an agent in the form of a Chrome extension that completes web tasks, breaking SOTA on the WebVoyager benchmark with an 83.5% score in a single-agent setup. </p><p>We've seen agent attempts from Adept to Claude Computer Use to Runner H, but this breaking of SOTA from Google seems very promising. Can't wait to give this a try. 
</p><p>OpenAI gives us SORA, Vision and other stuff from the bag of goodies</p><p>Ok, so now let's talk about the second winner of this week, OpenAI's amazing stream of innovations, which would have taken the crown, if not for, well... ☝️ </p><p>SORA is finally here (for those who got in)</p><p>OpenAI has FINALLY released SORA, their long promised text to video and image to video (and video to video) model (née world simulator) to general availability, including a new website - <a target="_blank" href="http://sora.com">sora.com</a> - and a completely amazing UI to come with it. </p><p>SORA can generate videos at various quality levels from 480p up to 1080p and up to 20 seconds long, and they promised those will generate fast, as what they released is actually SORA Turbo! (apparently SORA 2 is already in the works and will be even more amazing, more on this later) </p><p>New accounts paused for now</p><p>OpenAI seems to have severely underestimated how many people would want to use the 50 generations per month allowed on the Plus account (the Pro account gets you 10x more for $200 + longer durations, whatever that means), and as of the time of writing these words on ThursdAI afternoon, I still haven't been able to create a <a target="_blank" href="http://sora.com">sora.com</a> account and try out SORA myself (as I was boarding a plane when they launched it) </p><p>SORA's magical UI</p><p>I invited one of my favorite video creators, Blaine Brown, to the show. He does incredible video experiments that always go viral, and he had time to play with SORA to tell us what he thinks, both from a video perspective and from an interface perspective. 
</p><p>Blaine had a great take: we all collectively got so much HYPE over the past 8 months of getting teased that many folks expected SORA to just be an incredible one-prompt text to video generator, and it's not really that. In fact, if you just send prompts, it's more like a slot machine (which is also confirmed by another friend of the pod, <a target="_blank" href="https://x.com/bilawalsidhu/status/1866510079836786974">Bilawal</a>)</p><p>But the magic starts to come when additional tools like Blend are taken into play. One example Blaine talked about is the Remix feature, where you can remix videos and adjust the remix strength (Strong, Mild) </p><p>Another amazing insight Blaine shared is that SORA can be used to fuse two videos that weren't even generated with SORA, so SORA becomes a creative tool to combine them into one. </p><p>And lastly, just like Midjourney (and StableDiffusion before that), SORA has a featured and a recent wall of video generations, showing you videos and the prompts others used to create them, for inspiration and learning, so you can remix those videos and learn to prompt better + there are prompt-extension tools that OpenAI has built in. </p><p>One more thing.. this model thinks</p><p>I love this discovery and wanted to share it with you. The prompt is "A man smiles to the camera, then holds up a sign. On the sign, there is only a single digit number (the number of 'r's in 'strawberry')"</p><p>Advanced Voice mode now with Video!</p><p>I personally have been waiting for Voice mode with Video for such a long time, since that day in the spring when the first demo of advanced voice mode talked to an OpenAI employee called Rocky, in a very flirty voice that in no way resembled Scarlett Johansson, and told him to run a comb through his hair.  
</p><p>Well, today OpenAI finally announced that they are rolling out this option to everyone soon, and in ChatGPT we're all going to have the camera button, and be able to show ChatGPT what we're seeing via the camera or the screen of our phone and have it use that context. </p><p>If you're feeling a bit of déjà vu, yes, this is very similar to what Google just launched (for free, mind you) with Gemini 2.0 just yesterday in AI Studio, and via APIs as well. </p><p>This is an incredible feature. It will not only see your webcam, it will also see your iOS screen, so you'd be able to reason about an email with it, or other things. I honestly can't wait to have it already! </p><p>They also announced Santa mode, which is also super cool, tho I don't quite know how to .. tell my kids about it? Do I… tell them this IS Santa? Do I tell them this is an AI pretending to be Santa? Where does the lie end, exactly? </p><p>And in one of his funniest jailbreaks (and maybe one of the toughest ones) <a target="_blank" href="https://x.com/elder_plinius/status/1867366758195122469">Pliny</a> the liberator just posted a Santa jailbreak that will definitely make you giggle (and earn him coal this X-mas)</p><p>The other stuff (with 6 days to go) </p><p>OpenAI has 12 days of releases, and the other amazing things we got were obviously overshadowed, but they are still cool: Canvas can now run code and work with custom GPTs, ChatGPT in Apple Intelligence is now widely supported with the public release of iOS 18.2, and they announced fine tuning with reinforcement learning, allowing you to finetune o1-mini to outperform o1 on specific tasks with just a few examples. </p><p>There's 6 more work days to go, and they promised to "end with a bang" so... we'll keep you updated! </p><p>This week's Buzz - Guard Rail Genie</p><p>Alright, it's time for "This Week's Buzz," our weekly segment brought to you by Weights & Biases! 
This week I hosted <a target="_blank" href="https://x.com/soumikRakshit96">Soumik Rakshit</a> from the Weights and Biases AI Team (the team I'm also on, btw!). </p><p>Soumik gave us a deep dive into Guardrails, our new set of features in Weave for ensuring reliability in GenAI production! Guardrails serve as a "safety net" for your LLM powered applications, filtering out inputs or LLM responses that cross a certain criterion or boundary. </p><p>Types of guardrails include <a target="_blank" href="https://github.com/soumik12345/guardrails-genie/tree/main/guardrails_genie/guardrails/injection">prompt injection attacks</a>, <a target="_blank" href="https://github.com/soumik12345/guardrails-genie/tree/main/guardrails_genie/guardrails/entity_recognition/pii_examples">PII leakage</a>, jailbreaking attempts and toxic language, but they can also cover a <a target="_blank" href="https://github.com/soumik12345/guardrails-genie/tree/main/guardrails_genie/guardrails/entity_recognition">competitor mention</a>, selling a product at $0, or a policy your company doesn't have. </p><p>As part of developing the guardrails, Soumik also built and open sourced an app to test prompts against them, "<a target="_blank" href="https://huggingface.co/spaces/wandb/guardrails-genie">Guardrails Genie</a>". We're going to host it to allow folks to test their prompts against our guardrails, and we're developing it and the guardrails in the open, so please check out our <a target="_blank" href="https://github.com/soumik12345/guardrails-genie">Github</a> </p><p>Apple iOS 18.2 Apple Intelligence + ChatGPT integration</p><p>Apple Intelligence is finally here; you can download it if you have an iPhone 15 Pro or Pro Max, or any iPhone 16. 
</p><p>If you have one of those phones, you will get the following new features that have been in beta for a while, like Image Playground, with the ability to create images based on your face or faces you have stored in your photo library.</p><p>You can also create GenMoji, and those are actually pretty cool! </p><p>The highlight, and the connection with OpenAI's release, is of course the ChatGPT integration, wherein if Siri is too dumdum to answer any real AI questions (and let's face it, that's most of the time), you will get a button and ChatGPT will take over upon your approval. This will not require an account! </p><p>Grok New Image Generation Codename "Aurora"</p><p>Oh, Space Uncle is back at it again! The team at xAI launched their image generation model with the codename "Aurora" and briefly made it public, only to pull it and launch it again (this time, the model is simply "Grok"). Apparently, they've trained their own image model from scratch in like three months, but then pulled it back a day later, I think because they forgot to add watermarks 😅, though it's still unconfirmed why the removal occurred in the first place. Regardless of the reason, many folks, such as Wolfram, found it was not on the same level as their Flux integration. </p><p>It is really good at realism and faces, and is really unrestricted in terms of generating celebrities or TV shows from the 90's or cartoons. They really don't care about copyright. </p><p>The model does appear to generate fairly realistic images with its autoregressive approach, where generation occurs pixel-by-pixel instead of via diffusion. But as I said on the show, "It's really hard to get a good sense for the community vibe about anything that Elon Musk does because there's so much d**k riding on X for Elon Musk..." 
Many folks post only positive things about anything X or xAI does in the hopes that Space Uncle will notice them or repost them, so it's really hard to get an honest "vibes check" on xAI stuff.</p><p>All jokes aside, we'll hopefully have some better comparisons on sites such as LMArena, which just today launched <a target="_blank" href="https://x.com/lmarena_ai/status/1867302006182097206">ImgArena</a>, but until that day comes we'll just have to wait and see what other iterations and announcements follow!</p><p>NeurIPS Drama: Best Paper Controversy!</p><p>Now, no week in AI would be complete without a little drama. This time around it's at the biggest machine learning conference of the year, NeurIPS. This year's "Best Paper" award went to a work entitled Visual Autoregressive Modeling (VAR). This paper introduced an innovative way to outperform traditional diffusion models at image generation! Great, right? Well, not so fast, because here's where things get spicy. Enter Keyu Tian, the main author of this work and a former ByteDance intern. ByteDance gets its share of the benefits by co-signing the paper, but its lawsuit may derail the paper's future: ByteDance is currently <strong>suing</strong> Keyu Tian for a whopping one million dollars, citing alleged sabotage in a coordinated series of events that compromised colleagues' work.</p><p>Specifically, according to some reports, "He modified source code to change random seeds and optimizers, which led to disrupting training processes... Security attacks: he gained unauthorized access to the system, adding login backdoors to checkpoints, allowing him to launch automated attacks that interrupted colleagues' training jobs." Basically, they believe that he "gained unauthorized access to the system" and hacked other systems. 
Now, the paper is legit and it introduces potentially very innovative solutions, but we have an ongoing legal situation. Also worth noting: despite firing him, ByteDance did not withdraw the paper, which could speak volumes about its future! As always, if it bleeds, it leads, and drama is usually at the top of the trends, so this is definitely a story that will stay in everyone's mind when they look back at NeurIPS this year.</p><p>Phew.. what a week folks, what a week! </p><p>I think with 6 more days of OpenAI gifts, there's going to be plenty more to come next week, so share this newsletter with a friend or two, and if you found this useful, consider subscribing to our other channels as well and check out <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec12">Weave</a> if you're building with GenAI, it's really helpful! </p><p>TL;DR and show notes </p><p>* Meta Llama 3.3 (<a target="_blank" href="https://x.com/Ahmad_Al_Dahle/status/1865071436630778109">X</a>, <a target="_blank" href="https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md">Model Card</a>)</p><p>* OpenAI 12 days of Gifts (<a target="_blank" href="https://openai.com/12-days/?day=5">Blog</a>)</p><p>* Apple iOS 18.2 - Image Playground, GenMoji, ChatGPT integration (<a target="_blank" href="https://x.com/theapplehub/status/1866920729079427536">X</a>)</p><p>* 🔥 Google Gemini 2.0 Flash - the new gold standard of LLMs (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1866869343570608557">X</a>, <a target="_blank" href="https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/">AI Studio</a>)</p><p>* Google Project Mariner - Agent that browses for you (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1866911079194038368">X</a>)</p><p>* This week's Buzz - chat with Soumik Rakshit from the AI Team at W&B (<a target="_blank" 
href="https://github.com/soumik12345/guardrails-genie">Github</a>)</p><p>* NeurIPS Drama - Best Paper Controversy - VAR author is sued by ByteDance (<a target="_blank" href="https://x.com/i/grok?conversation=1867070247519543429">X</a>, <a target="_blank" href="https://var-integrity-report.github.io/">Blog</a>)</p><p>* xAI new image generation, codename Aurora (<a target="_blank" href="https://x.ai/blog/grok-image-generation-release">Blog</a>)</p><p>* Cognition launched Devin AI developer assistant - $500/mo</p><p>* LMArena launches txt2img Arena for Diffusion models (<a target="_blank" href="https://x.com/lmarena_ai/status/1867302006182097206">X</a>)</p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-12-unprecedented-ai</link><guid isPermaLink="false">substack:post:153048387</guid><dc:creator><![CDATA[Alex Volkov, Matt Wolfe, Blaine Brown, and Kwindla Hultman Kramer]]></dc:creator><pubDate>Fri, 13 Dec 2024 01:13:39 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/153048387/4fa152663c6aaa84918c133e81fc3191.mp3" length="95102579" type="audio/mpeg"/><itunes:author>Alex Volkov, Matt Wolfe, Blaine Brown, and Kwindla Hultman Kramer</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5944</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/153048387/207ef84c37e5c9a318f6a245439bb4ea.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Dec 5 - OpenAI o1 & o1 pro, Tencent HY-Video, FishSpeech 1.5, Google GENIE2, Weave in GA & more AI news]]></title><description><![CDATA[<p></p><p>Well well well, December is finally here, we're about to close out this year (and have just flown past the second anniversary of ChatGPT 🎂) and it seems that all of the AI 
labs want to give us X-mas presents to play with over the holidays! </p><p>Look, I keep saying this, but the weeks are getting crazier and crazier. This week we got the cheapest and the most expensive AI offerings all at once (the cheapest from Amazon and the most expensive from OpenAI), 2 new open weights models that beat commercial offerings, a diffusion model that predicts the weather and 2 world building models, oh, and 2 decentralized, fully open sourced LLMs finished training after being trained LIVE across the world. I said... crazy week! </p><p>And for W&B, this week started with <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec5">Weave</a> finally launching in GA 🎉, which I personally was looking forward to (read more below)!</p><p>TL;DR Highlights</p><p>* <strong>OpenAI o1 & Pro Tier:</strong> o1 is out of preview, now smarter, faster, multimodal, and integrated into ChatGPT. For heavy usage, ChatGPT Pro ($200/month) offers unlimited calls and o1 Pro Mode for harder reasoning tasks.</p><p>* <strong>Video & Audio Open Source Explosion:</strong> Tencent’s HYVideo outperforms Runway and Luma, bringing high-quality video generation to open source. FishSpeech 1.5 challenges top TTS providers, making near-human voice available for free research.</p><p>* <strong>Open Source Decentralization:</strong> Nous Research’s DiStRo (15B) and Prime Intellect’s INTELLECT-1 (10B) prove you can train giant LLMs across decentralized nodes globally. Performance is on par with centralized setups.</p><p>* <strong>Google’s Genie 2 & WorldLabs:</strong> Generating fully interactive 3D worlds from a single image, pushing boundaries in embodied AI and simulation. Google’s GenCast also sets a new standard in weather prediction, beating supercomputers in accuracy and speed.</p><p>* <strong>Amazon’s Nova FMs:</strong> Cheap, scalable LLMs with huge context and global language coverage. 
Perfect for cost-conscious enterprise tasks, though not top on performance.</p><p>* 🎉 <strong>Weave by W&B:</strong> Now in GA, it’s your dashboard and tool suite for building, monitoring, and scaling GenAI apps. Get started with <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec5">1 line of code</a></p><p>OpenAI’s 12 Days of Shipping: o1 & ChatGPT Pro</p><p>The biggest splash this week came from OpenAI. They’re kicking off “12 days of launches,” and Day 1 brought the long-awaited full version of o1. The main complaint many people had about o1 was how slow it was! Well, now it’s not only smarter but significantly faster (60% faster than preview!), and officially multimodal: it can see images and text together.</p><p>Better yet, OpenAI introduced a new ChatGPT Pro tier at $200/month. It offers unlimited usage of o1, advanced voice mode, and something called o1 pro mode — where o1 thinks even harder and longer about your hardest math, coding, or science problems. For power users—maybe data scientists, engineers, or hardcore coders—this might be a no-brainer. For others, 200 bucks might be steep, but hey, someone’s gotta pay for those GPUs. Given that OpenAI recently confirmed there are now 300 million monthly active users on the platform, and many of my friends have already upgraded, this is for sure going to boost the bottom line at OpenAI! </p><p>Quoting Sam Altman from the stream, “This is for the power users who push the model to its limits every day.” For those who complained o1 took forever just to say “hi,” rejoice: trivial requests will now be answered quickly, while super-hard tasks get that legendary deep reasoning, including a new progress bar and a notification when a task is complete. Friend of the pod Ray Fernando gave Pro a prompt that took 7 minutes to think through! 
</p><p>I've tested the new o1 myself, and while I've gotten dangerously close to my 50 messages per week quota, I've already gotten some incredible results, very fast as well. This ice-cubes question failed o1-preview and o1-mini, and took both of them significantly longer, while it took just 4 seconds for o1. </p><p>Open Source LLMs: Decentralization & Transparent Reasoning</p><p>Nous Research DiStRo & DeMo Optimizer</p><p>We’ve talked about decentralized training before, but the folks at Nous Research are making it a reality at scale. This week, Nous Research wrapped up the training of a new 15B-parameter LLM—codename “Psyche”—using a fully decentralized approach called “Nous DiStRo.” Picture a massive AI model trained not in a single data center, but across GPU nodes scattered around the globe. According to Alex Volkov (host of ThursdAI), “This is crazy: they’re literally training a 15B param model using GPUs from multiple companies and individuals, and it’s working as well as centralized runs.”</p><p>The key to this success is “DeMo” (Decoupled Momentum Optimization), a <a target="_blank" href="https://arxiv.org/abs/2411.19870">paper</a> co-authored by none other than Diederik Kingma (yes, the Kingma behind the Adam optimizer and VAEs). DeMo drastically reduces communication overhead while maintaining stability and speed. The training loss curve they’ve shown looks just as good as a normal centralized run, proving that decentralized training isn’t just a pipe dream. The code and paper are open source, and soon we’ll have the fully trained Psyche model. It’s a huge step toward democratizing large-scale AI—no more waiting around for Big Tech to drop their weights. Instead, we can all chip in and train together.</p><p>Prime Intellect INTELLECT-1 10B: Another Decentralized Triumph</p><p>But wait, there’s more! Prime Intellect also finished training their 10B model, INTELLECT-1, using a similar decentralized setup. 
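</p><p>To build a little intuition for how these decentralized runs keep bandwidth manageable: a common trick is to exchange only a compressed slice of each gradient and keep the rest as a local residual that gets folded into later steps (error feedback). The sketch below shows top-k sparsification in plain Python; to be clear, this is a hypothetical illustration of the general idea, not the actual DeMo or Prime Intellect code (DeMo itself decouples momentum components, which is more involved):</p>

```python
# Toy illustration of gradient compression with error feedback (hypothetical,
# NOT the actual DeMo / Prime Intellect implementation): each node would send
# only the k largest-magnitude gradient entries over the network.
import random

def sparsify_topk(grad, k):
    """Split a gradient into a sparse top-k part (communicated) and a
    residual (kept locally and folded into the next step)."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]))[-k:]
    sparse = [0.0] * len(grad)
    for i in idx:
        sparse[i] = grad[i]
    residual = [g - s for g, s in zip(grad, sparse)]
    return sparse, residual

random.seed(0)
grad = [random.gauss(0.0, 1.0) for _ in range(1000)]
sparse, residual = sparsify_topk(grad, k=10)

# Only 10 of 1000 values cross the network (a 100x reduction), and nothing
# is lost locally: sparse + residual reconstructs the original gradient.
print(sum(1 for v in sparse if v != 0.0))                          # 10
print(all(s + r == g for s, r, g in zip(sparse, residual, grad)))  # True
```

<p>In a real run, each node would all-reduce only the sparse part while the residual accumulates locally, which is why loss curves of communication-compressed runs can track a fully synchronized baseline so closely.</p><p>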
INTELLECT-1 was trained with a custom framework that reduces inter-GPU communication by 400x. It’s essentially a global team effort, with nodes from all over the world contributing compute cycles.</p><p>The result? A model hitting performance similar to older Meta models like Llama 2—but fully decentralized. </p><p>Ruliad DeepThought 8B: Reasoning You Can Actually See</p><p>If that’s not enough, we’ve got yet another open-source reasoning model: Ruliad’s DeepThought 8B. This 8B parameter model (finetuned from LLaMA-3.1) comes from friends of the show FarEl, Alpin and Sentdex 👏</p><p>Ruliad’s DeepThought attempts to match or exceed the performance of much larger models on reasoning tasks, and beating several 72B param models while being 8B itself is very impressive. </p><p>Google is firing on all cylinders this week</p><p>Google didn't stay quiet this week either, and while we all wait for the Gemini team to release the next Gemini after the myriad of very good experimental models recently, we've gotten some amazing things this week. </p><p>Google’s PaliGemma 2 - finetunable SOTA VLM using Gemma</p><p>PaliGemma 2, a new vision-language family of models (3B, 10B and 28B) for 224px, 448px and 896px resolutions, is a suite of base models that include image segmentation and detection capabilities and are great at OCR, which makes them very versatile for fine-tuning on specific tasks. </p><p>They claim to achieve SOTA on chemical formula recognition, music score recognition, spatial reasoning, and chest X-ray report generation! </p><p>Google GenCast SOTA weather prediction with... diffusion!?</p><p>More impressively, Google DeepMind released GenCast, a diffusion-based model that beats the state-of-the-art ENS system on 97% of weather prediction targets. Did we say weather predictions? Yup. </p><p>Generative AI is now better at weather forecasting than dedicated physics-based deterministic algorithms running on supercomputers. 
GenCast can predict 15 days in advance in just 8 minutes on a single TPU v5, instead of hours on a monstrous cluster. This is mind-blowing. As Yam said on the show, “Predicting the world is crazy hard” and now diffusion models handle it with ease. </p><p>W&B Weave: Observability, Evaluation and Guardrails now in GA</p><p>Speaking of building and monitoring GenAI apps, we at Weights & Biases (the sponsor of ThursdAI) announced that <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec5">Weave</a> is now GA. Weave is a developer tool for evaluating, visualizing, and debugging LLM calls in production. If you’re building GenAI apps—like a coding agent or a tool that processes thousands of user requests—Weave helps you track costs, latency, and quality systematically.</p><p>We showcased two internal apps: Open UI (a website builder from a prompt) and Winston (an AI agent that checks emails, Slack, and more). Both rely on Weave to iterate, tune prompts, measure user feedback, and ensure stable performance. With o1 and other advanced models coming to APIs soon, tools like Weave will be crucial to keep those applications under control.</p><p>If you follow this newsletter and develop with LLMs, now is a great time to <a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec5">give Weave a try</a></p><p>Open Source Audio & Video: Challenging Proprietary Models</p><p>Tencent’s HY Video: Beating Runway & Luma in Open Source</p><p>Tencent came out swinging with their open-source model, HYVideo. It’s a video model that generates incredibly realistic footage, camera cuts, and even audio—yep, Foley and lip-synced character speech. Just a single model doing text-to-video, image-to-video, puppeteering, and more. 
It even outperforms closed-source giants like Runway Gen 3 and Luma 1.6 on over 1,500 prompts.</p><p>This is the kind of thing we dreamed about when we first heard of video diffusion models. Now it’s here, open-sourced, ready for tinkering. “It’s near SORA-level,” as I mentioned, referencing OpenAI’s yet-to-be-fully-released SORA model. The future of generative video just got more accessible, and competitors should be sweating right now. We may just get SORA as one of the 12 days of OpenAI releases! </p><p>FishSpeech 1.5: Open Source TTS Rivaling the Big Guns</p><p>Not just video—audio too. FishSpeech 1.5 is a multilingual, zero-shot voice cloning model that ranks #2 overall on TTS benchmarks, just behind ElevenLabs. This is a 500M-parameter model, trained on a million hours of audio, achieving near-human quality, fast inference, and open for research.</p><p>This puts high-quality text-to-speech capabilities in the open-source community’s hands. You can now run a top-tier TTS system locally, clone voices, and generate speech in multiple languages with low latency. No more relying solely on closed APIs. This is how open-source chases—and often catches—commercial leaders.</p><p>If you’ve been longing for near-instant voice cloning on your own hardware, this is the model to go play with!</p><p>Creating World Models: Genie 2 & WorldLabs</p><p>Fei-Fei Li’s WorldLabs: Images to 3D Worlds</p><p>WorldLabs, founded by Dr. Fei-Fei Li, showcased a mind-boggling demo: turning a single image into a walkable 3D environment. Imagine you take a snapshot of a landscape, load it into their system, and now you can literally walk around inside that image as if it were a scene in a video game. “I can literally use WASD keys and move around,” Alex commented, clearly impressed.</p><p>It’s not perfect fidelity yet, but it’s a huge leap toward generating immersive 3D worlds on the fly. These tools could revolutionize virtual reality, gaming, and simulation training. 
WorldLabs’ approach is still in early stages, but what they demonstrated is nothing short of remarkable.</p><p>Google’s Genie 2: Playable Worlds from a Single Image</p><p>If WorldLabs’s 3D environment wasn’t enough, Google dropped Genie 2. Take an image generated by Imagen 3, feed it into Genie 2, and you get a playable world lasting up to a minute. Your character can run, objects have physics, and the environment is consistent enough that if you leave an area and return, it’s still there.</p><p>As I said on the pod, “It looks like a bit of Doom, but generated from a single static image. Insane!” The model simulates complex interactions—think water flowing, balloons bursting—and even supports long-horizon memory. This could be a goldmine for AI-based game development, rapid prototyping, or embodied agent training.</p><p>Amazon’s Nova: Cheaper LLMs, Not Better LLMs</p><p>Amazon is also throwing their hat in the ring with the Nova series of foundational models. They’ve got variants like Nova Micro, Lite, Pro, and even a Premier tier coming in 2025. The catch? Performance is kind of “meh” compared to Anthropic or OpenAI’s top models, but Amazon is aiming to be the cheapest high-quality LLM among the big players. With a context window of up to 300K tokens and 200+ language coverage, Nova could find a niche, especially for those who want to pay less per million tokens.</p><p>Nova Micro costs around 3.5 cents per million input tokens and 14 cents per million output tokens—making it dirt cheap to process massive amounts of data. 
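</p><p>Those per-token prices make for easy back-of-the-envelope math. As a quick sketch (using the Micro prices quoted above; the request counts and token budgets below are made-up numbers for illustration, and actual AWS pricing may change):</p>

```python
# Rough daily cost estimate for Amazon Nova Micro at the prices quoted above:
# $0.035 per 1M input tokens, $0.14 per 1M output tokens.
INPUT_PER_M = 0.035
OUTPUT_PER_M = 0.14

def daily_cost(requests, in_tokens, out_tokens):
    """Estimated cost in dollars for `requests` calls per day, each with a
    fixed prompt size and reply size in tokens."""
    total_in = requests * in_tokens
    total_out = requests * out_tokens
    return total_in / 1e6 * INPUT_PER_M + total_out / 1e6 * OUTPUT_PER_M

# 10,000 requests a day, 2,000-token prompts, 500-token replies:
print(round(daily_cost(10_000, 2_000, 500), 2))  # 1.4
```

<p>That works out to roughly $1.40 a day for 25 million tokens processed, which is the "dirt cheap" part; the same workload at frontier-model prices would cost orders of magnitude more.</p><p>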
Although not a top performer, Amazon’s approach is: “We may not be the best, but we’re really cheap and we scale like crazy.” Given Amazon’s infrastructure, this could be compelling for enterprises looking for cost-effective large-scale solutions.</p><p>Phew, this was a LONG week with a LOT of AI drops, and NGL, o1 actually helped me a bit with this newsletter. I wonder if you can spot the places where o1 wrote some of the text (using the transcription of the show and the outline as guidelines and the previous newsletter as a tone guide) and where I wrote it myself? </p><p>Next week is NeurIPS 2024, the biggest ML conference in the world, and I'm going to be live streaming from there, so if you're at the conference, come by booth #404 and say hi! I'm sure there will be a TON of new AI updates next week as well! </p><p>Show Notes & Links</p><p>TL;DR of all topics covered: </p><p>* <strong>This week's Buzz </strong></p><p>* Weights & Biases announces Weave is now in GA 🎉(<a target="_blank" href="https://wandb.ai/site/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=dec5">wandb.me/tryweave</a>)  </p><p>* Tracing LLM calls</p><p>* Evaluation & Playground</p><p>* Human Feedback integration</p><p>* Scoring & Guardrails (in preview)</p><p>* <strong>Open Source LLMs</strong> </p><p>* DiStRo & DeMo from NousResearch - decentralized DiStRo 15B run (<a target="_blank" href="https://x.com/NousResearch/status/1863622813317464157">X</a>, <a target="_blank" href="https://distro.nousresearch.com/">watch live</a>, <a target="_blank" href="https://arxiv.org/abs/2411.19870">Paper</a>)</p><p>* Prime Intellect - INTELLECT-1 10B decentralized LLM (<a target="_blank" href="https://www.primeintellect.ai/blog/intellect-1">Blog</a>, <a target="_blank" href="https://app.primeintellect.ai/intelligence">watch</a>)</p><p>* Ruliad DeepThought 8B - Transparent reasoning model (LLaMA-3.1) w/ test-time compute scaling (<a target="_blank" href="https://x.com/ruliad_ai">X</a>, <a 
target="_blank" href="https://huggingface.co/ruliad/deepthought-8b-llama-v0.01-alpha">HF</a>, <a target="_blank" href="https://chat.ruliad.co/">Try It</a>)</p><p>* Google GenCast - diffusion model SOTA weather prediction (<a target="_blank" href="https://deepmind.google/discover/blog/gencast-predicts-weather-and-the-risks-of-extreme-conditions-with-sota-accuracy/">Blog</a>)</p><p>* Google open sources PaliGemma 2 (<a target="_blank" href="https://x.com/AndreasPSteiner/status/1864729070526681510">X</a>, <a target="_blank" href="https://developers.googleblog.com/en/introducing-paligemma-2-powerful-vision-language-models-simple-fine-tuning/">Blog</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Amazon announces Nova series of FM at AWS (<a target="_blank" href="https://x.com/_philschmid/status/1864016010464080260">X</a>)</p><p>* Google GENIE 2 creates playable worlds from a picture! (<a target="_blank" href="https://deepmind.google/discover/blog/genie-2-a-large-scale-foundation-world-model/?utm_source=x&#38;utm_medium=social&#38;utm_campaign=&#38;utm_content=">Blog</a>)</p><p>* OpenAI 12 days started with o1 full and o1 pro and pro tier $200/mo (<a target="_blank" href="https://x.com/altryne/status/1864732261628702896">X</a>, <a target="_blank" href="https://openai.com/index/introducing-chatgpt-pro/">Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Tencent open sources HY Video - beating Luma & Runway (<a target="_blank" href="http://aivideo.hunyuan.tencent.com/">Blog</a>, <a target="_blank" href="http://git.new/hyvideo">Github</a>, <a target="_blank" href="http://thursdai.news/hypaper">Paper</a>, <a target="_blank" href="http://thursdai.news/hyv-weights">HF</a>)</p><p>* Runway video keyframing prototype (<a target="_blank" href="https://x.com/runwayml/status/1863679848092246260">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* FishSpeech V1.5 - multilingual, zero-shot instant voice cloning, low-latency, open text to speech model (<a 
target="_blank" href="https://x.com/FishAudio/status/1864370933496205728">X</a>, <a target="_blank" href="https://fish.audio/auth/">Try It</a>)</p><p>* Eleven labs - real time audio agents builder (<a target="_blank" href="https://x.com/elevenlabsio/status/1864011712795468094">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-4-openai-o1-and-o1-pro</link><guid isPermaLink="false">substack:post:152638808</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 06 Dec 2024 02:09:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/152638808/0247ba1ff6a9c05f55cedea98530a6ce.mp3" length="87956942" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5497</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/152638808/458799f7624b2c28d1205f7e933d9a69.jpg"/></item><item><title><![CDATA[🦃 ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news]]></title><description><![CDATA[<p>Hey ya'll, Happy Thanskgiving to everyone who celebrates and thank you for being a subscriber, I truly appreciate each and every one of you! </p><p>We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! 
First open source reasoning model that you can run on your machine, that beats a 405B model, comes close to o1 on some metrics 🤯 </p><p>We also chatted about a new hybrid approach from Nvidia called Hymba 1.5B (<a target="_blank" href="https://www.arxiv.org/abs/2411.13676">Paper</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/hymba-673c35516c12c4b98b5e845f">HF</a>) that beats Qwen 1.5B with 6-12x less training, and Allen AI releasing Olmo 2, which became the best fully open source LLM 👏 (<a target="_blank" href="https://allenai.org/blog/olmo2">Blog</a>, <a target="_blank" href="https://huggingface.co/allenai/OLMo-2-1124-7B">HF</a>, <a target="_blank" href="https://playground.allenai.org/">Demo</a>), though they didn't release WandB logs this time, they did release data! </p><p>I encourage you to watch today's show (or listen to the show, I don't judge), there's not going to be a long writeup like I usually do, as I want to go and enjoy the holiday too, but of course, the TL;DR and show notes are right here so you won't miss a beat if you want to use the break to explore and play around with a few things! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication.
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>TL;DR and show notes</p><p>* Qwen QwQ 32B preview - the first open weights reasoning model (<a target="_blank" href="https://x.com/altryne/status/1861854768228089944">X</a>, <a target="_blank" href="https://qwenlm.github.io/blog/qwq-32b-preview/">Blog</a>, <a target="_blank" href="https://huggingface.co/Qwen/QwQ-32B-Preview">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/QwQ-32B-preview">Try it</a>)</p><p>* Allen AI - Olmo 2 the best fully open language model (<a target="_blank" href="https://allenai.org/blog/olmo2">Blog</a>, HF, <a target="_blank" href="https://playground.allenai.org/">Demo</a>)</p><p>* NVIDIA Hymba 1.5B - Hybrid smol model beating Qwen, SmolLM w/ 6-12x less training (<a target="_blank" href="https://x.com/PavloMolchanov/thread/1861484218087584217">X</a>, <a target="_blank" href="https://www.arxiv.org/abs/2411.13676">Paper</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/hymba-673c35516c12c4b98b5e845f">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Anthropic MCP - model context protocol (<a target="_blank" href="https://x.com/alexalbert__/status/1861079762506252723">X</a>, Blog, <a target="_blank" href="https://modelcontextprotocol.io/quickstart">Spec</a>, <a target="_blank" href="https://simonwillison.net/2024/Nov/25/model-context-protocol/">Explainer</a>)</p><p>* Cursor, Jetbrains now integrate with ChatGPT MacOS app (<a target="_blank" href="https://x.com/embirico/status/1861525484091117781">X</a>)</p><p>* xAI is going to be a gaming company?!
(<a target="_blank" href="https://x.com/elonmusk/status/1861801046949191686">X</a>)</p><p>* H company shows Runner H - WebVoyager Agent (<a target="_blank" href="https://x.com/hcompany_ai/status/1861852350828224967">X</a>, <a target="_blank" href="https://hcompany.ai/waitlist">Waitlist</a>) </p><p>* <strong>This week's Buzz</strong></p><p>* Interview w/ Thomas Capelle about Weave scorers and guardrails (<a target="_blank" href="https://weave-docs.wandb.ai/guides/evaluation/scorers/#summarizationscorer">Guide</a>)</p><p>* <strong>Vision & Video</strong></p><p>* OpenAI SORA API was "leaked" on HuggingFace (<a target="_blank" href="https://x.com/AILeaksAndNews/status/1861444366251806969">here</a>)</p><p>* Runway launches video Expand feature (<a target="_blank" href="https://x.com/blizaine/status/1860302891560456327">X</a>)</p><p>* Rhymes Allegro-TI2V - updated image to video model (<a target="_blank" href="https://huggingface.co/rhymes-ai/Allegro-TI2V">HF</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OuteTTS v0.2 - 500M smol TTS with voice cloning (<a target="_blank" href="https://www.outeai.com/blog/outetts-0.2-500m">Blog</a>, <a target="_blank" href="https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Runway launches an image model called Frames (<a target="_blank" href="https://x.com/runwayml/status/1861047681163857924">X</a>, <a target="_blank" href="https://runwayml.com/research/introducing-frames">Blog</a>)</p><p>* ComfyUI Desktop app was released 🎉</p><p>* Chat</p><p>* 24 hours of AI hate on 🦋  (<a target="_blank" href="https://x.com/altryne/status/1861864203012755626">thread</a>)</p><p>* <strong>Tools</strong></p><p>* Cursor agent (<a target="_blank" href="https://x.com/imrat/status/1861370888517517646">X thread</a>)</p><p>* Google Generative Chess toy (<a target="_blank" href="https://x.com/labsdotgoogle/status/1861447589604008087">Link</a>)</p><p>See you next week and happy Thanksgiving
🦃</p><p></p><p><p>Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.</p></p><p>Full Subtitles for convenience</p><p>[00:00:00] <strong>Alex Volkov:</strong> let's get it going.</p><p>[00:00:10] <strong>Alex Volkov:</strong> Welcome, welcome everyone to ThursdAI November 28th Thanksgiving special. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. You're on ThursdAI. We are live [00:00:30] on ThursdAI. Everywhere pretty much.</p><p>[00:00:32] <strong>Alex Volkov:</strong></p><p>[00:00:32] Hosts and Guests Introduction</p><p>[00:00:32] <strong>Alex Volkov:</strong> I'm joined here with two of my co-hosts.</p><p>[00:00:35] <strong>Alex Volkov:</strong> Wolfram, welcome.</p><p>[00:00:36] <strong>Wolfram Ravenwolf:</strong> Hello everyone! Happy Thanksgiving!</p><p>[00:00:38] <strong>Alex Volkov:</strong> Happy Thanksgiving, man.</p><p>[00:00:39] <strong>Alex Volkov:</strong> And we have Junyang here. Junyang, welcome, man.</p><p>[00:00:42] <strong>Junyang Lin:</strong> Yeah, hi everyone. Happy Thanksgiving. Great to be here.</p><p>[00:00:46] <strong>Alex Volkov:</strong> You had a busy week. We're going to chat about what you had. I see Nisten joining us as well at some point.</p><p>[00:00:51] <strong>Alex Volkov:</strong> Yam Peleg joining us as well. Hey, Yam. Welcome, as well. Happy Thanksgiving. It looks like we're assembled folks. We're across streams, across [00:01:00] countries, but we are.</p><p>[00:01:01] Overview of Topics for the Episode</p><p>[00:01:01] <strong>Alex Volkov:</strong> For November 28th, we have a bunch of stuff to talk about. Like really a big list of stuff to talk about. So why don't we just, we'll just dive in. We'll just dive in.
So obviously I think the best and the most important.</p><p>[00:01:13] DeepSeek and Qwen Open Source AI News</p><p>[00:01:13] <strong>Alex Volkov:</strong> Open source kind of AI news to talk about this week is going to be, and I think I remember last week, Junyang, I asked you about this and you were like, you couldn't say anything, but I asked because last week, folks, if you remember, we talked about R1 from DeepSeek, a reasoning model from [00:01:30] DeepSeek, which really said, Oh, maybe it comes as a, as open source and maybe it doesn't.</p><p>[00:01:33] <strong>Alex Volkov:</strong> And I hinted about, and I asked, Junyang, what about some reasoning from you guys? And you couldn't say anything. so this week. I'm going to do a TLDR. So we're going to actually talk about the stuff that, you know, in depth a little bit later, but this week, obviously one of the biggest kind of open source or sorry, open weights, and news is coming from our friends at Qwen as well, as we always celebrate.</p><p>[00:01:56] <strong>Alex Volkov:</strong> So one of the biggest things that we get [00:02:00] is, Qwen releases, I will actually have you tell me what's the pronunciation here, Junyang, what is, I say Q W Q or maybe quick, what is the pronunciation of this?</p><p>[00:02:12] <strong>Junyang Lin:</strong> I mentioned it in the blog, it is just like the word quill. Yeah. yeah, because for the qw you can like work and for the q and you just like the U, so I just combine it together and create a new pronunciation called Quill.</p><p>[00:02:28] <strong>Junyang Lin:</strong> Yeah.</p><p>[00:02:28] <strong>Alex Volkov:</strong> So we're saying it's Qwen [00:02:30] Quill 32 B. Is that the right pronunciation to say this?</p><p>[00:02:33] <strong>Junyang Lin:</strong> Yeah, it's okay. I would just call it qui quill. It is, some something funny because, the characters look very funny. Oh, we have a subculture, for these things. Yeah.
Just to express some, yeah.</p><p>[00:02:46] <strong>Junyang Lin:</strong> our. feelings.</p><p>[00:02:49] <strong>Alex Volkov:</strong> Amazing. Qwen, Quill, 32B, and it's typed, the name is typed QwQ-32B-Preview. This is the first OpenWeights reasoning model. This [00:03:00] model is not only predicting tokens, it's actually doing reasoning behind this. What this means is we're going to tell you what this means after we get to this.</p><p>[00:03:07] <strong>Alex Volkov:</strong> So we're still in the, we're still in the TLDR area. We also had another drop from Allen Institute for AI, if you guys remember last week we chatted with Nathan, our dear friend Nathan, from Allen Institute about Tulu 3, about their efforts for post training, and he gave us all the details about post training, so they released Tulu 3.</p><p>[00:03:28] <strong>Alex Volkov:</strong> This week they released Olmo 2.0. [00:03:30] We also talked about Olmo with the friends from Allen Institute a couple of months ago, and now they released Olmo 2.0, which they claim is the best fully open sourced, fully open sourced language model, from Allen Institute for AI. And, so we're going to chat about, Olmo a little bit as well.</p><p>[00:03:46] <strong>Alex Volkov:</strong> And last minute addition we have is NVIDIA Hymba, which is a hybrid small model from NVIDIA, very tiny one, 1.5 billion parameters, a small model beating Qwen and beating SmolLM as well. this is in the area [00:04:00] of open source.</p><p>[00:04:01] <strong>Alex Volkov:</strong> Okay, in the big companies, LLMs and APIs, I want to run through a few things.</p><p>[00:04:06] Anthropic's MCP and ChatGPT macOS Integrations</p><p>[00:04:06] <strong>Alex Volkov:</strong> So first of all, Anthropic released something called MCP. It's a, something they called Model Context Protocol. We're going to briefly run through this.
It's a, it's a kind of a release from them that's aimed at developers. It's a protocol that enables secure connections between a host application, like Claude Desktop, for example,</p><p>[00:04:24] <strong>Alex Volkov:</strong> there's also a bunch of new integrations for the ChatGPT macOS app. If you guys remember a couple of [00:04:30] weeks ago, We actually caught this live.</p><p>[00:04:31] <strong>Alex Volkov:</strong> I refreshed my MacOS app and there's ta da, there's a new thing. And we discovered this live. It was very fun. The MacOS app for ChatGPT integrates with VS Code, et cetera. and so we tried to run this with Cursor. It didn't work. So now it works with Cursor,</p><p>[00:04:43] <strong>Wolfram Ravenwolf:</strong></p><p>[00:04:43] <strong>Alex Volkov:</strong> So the next thing we're going to look at, I don't know if it's worth mentioning, but you guys know xAI, the company that Elon Musk is raising another 6 billion for that tries to compete with OpenAI</p><p>[00:04:54] <strong>Alex Volkov:</strong> Do you guys hear that it's going to be a gaming company as well? I don't know if it's worth talking about, but we'll at least [00:05:00] mention this. And the one thing that I wanted to chat about is H, the French company, H, that showed a runner that looks three times as fast and as good as the Claude computer use runner, and we're definitely going to show examples of this, video live because that looks just incredible.</p><p>[00:05:18] <strong>Alex Volkov:</strong> this out of nowhere company, the biggest fundraise or the biggest seed round that Europe has ever seen, at least France has ever seen, just showed an agent that controls your [00:05:30] computer that's tiny, ridiculously tiny, I think it's like the three billion parameter, two billion parameter or something.</p><p>[00:05:36] <strong>Alex Volkov:</strong> And it runs way better than Claude computer use. Something definitely worth talking about.
after with, after which in This Week's Buzz, we're going to talk with Thomas Capelle, from, from my team at Weights & Biases. about LLM guardrails, that's gonna be fun. and in vision video category, we're gonna cover that OpenAI Sora quote unquote leaked, this week.</p><p>[00:05:56] <strong>Alex Volkov:</strong> And this leak wasn't really a leak, but, definitely [00:06:00] we saw some stuff. and then there's also a new expand feature that we saw in, Runway. And we saw another video model from, Rhymes called Allegro TI2V. which is pretty cool in voice and audio. If we get there in voice and audio, we saw OuteTTS version 0.</p><p>[00:06:19] <strong>Alex Volkov:</strong> 2, which is a new TTS, a 500 million parameter, small TTS you can run in your browser and sounds pretty dope. In AI art and diffusion, super quick, Runway launches an image [00:06:30] model. Yep, Runway, the guys who do video, they launched an image model that looks pretty sick, and we're definitely going to look at some examples of this, and ComfyUI Desktop, for those of you who are celebrating something like this, ComfyUI now is runnable with desktop, and there's a bunch of tool stuff, but honestly, I can talk about two things.</p><p>[00:06:47] <strong>Alex Volkov:</strong> Tools and there's a cool thing with Google generative chess toy. I can show you this so you can show your folks in Thanksgiving and going to impress them with a generative chess toy. But honestly, instead of this, I would love to chat about the thing that [00:07:00] some of us saw on the other side of the social media networks.</p><p>[00:07:04] <strong>Alex Volkov:</strong> And definitely we'll chat about this, for the past 24 hours. So chat, for the past. 24 hours, on BlueSky, we saw a little bit of a mob going against the Hugging Face folks and then, other friends of ours on, from the AI community and the anti AI mob on BlueSky.
So we're going to chat about that.</p><p>[00:07:26] <strong>Alex Volkov:</strong> And hopefully give you our feelings about what's going on, about this [00:07:30] world. And this is a pro AI show. And when we see injustice happen against AI, we have to speak out against it. And I think that this is mostly what we're gonna cover this show, unless this is.</p><p>[00:07:42] <strong>Wolfram Ravenwolf:</strong> Where I could insert the two things I have.</p><p>[00:07:44] <strong>Wolfram Ravenwolf:</strong> One is a tool, which is the AI video composer, which allows you to talk to FFmpeg, which is a complicated command line tool, but very powerful. And so you have a UI where you just use natural language to control the tool. So that is one tool. Maybe we get to [00:08:00] it, if not just Google or ask Perplexity or anything.</p><p>[00:08:03] <strong>Alex Volkov:</strong> No, we'll drop it in. Yeah, we'll drop it in show notes, absolutely.</p><p>[00:08:04] <strong>Wolfram Ravenwolf:</strong> Yeah, that's the best part. Okay. And EchoMimic. Version 2 is also a Synthesia alternative for local use, which is also, yeah, a great open source local runnable tool.</p><p>[00:08:17] <strong>Alex Volkov:</strong> What do we call this? EchoMimic?</p><p>[00:08:19] <strong>Wolfram Ravenwolf:</strong> EchoMimic. EchoMimic</p><p>[00:08:21] <strong>Alex Volkov:</strong> v2.</p><p>[00:08:21] <strong>Wolfram Ravenwolf:</strong> EchoMimic</p><p>[00:08:23] <strong>Alex Volkov:</strong> 2.</p><p>[00:08:24] <strong>Alex Volkov:</strong> Alright, we have a special guest here that we're gonna add Alpin. Hey Alpin, [00:08:30] welcome, feel free to stay anonymous and don't jump, we're gonna start with open source AI and then we're gonna chat with you briefly about the experience you had.</p><p>[00:08:38] <strong>Alpin Dale:</strong> hello everyone.</p><p>[00:08:39] <strong>Alex Volkov:</strong> Hey man. Yeah, you've been on the show before, right, Alpin?
You've been on the show.</p><p>[00:08:43] <strong>Alpin Dale:</strong> a few times, yeah. it's nice to be back here again.</p><p>[00:08:46] <strong>Alex Volkov:</strong> Yeah. Alpin, we're gonna get, we're gonna chat with you soon, right? We're gonna start with open source. We need to go to Junyang and talk about reasoning models.</p><p>[00:08:52] <strong>Alex Volkov:</strong> so feel free to stay with us. And then I definitely want to hear about some of the stuff we're going to cover after open source. We're going to cover the [00:09:00] anti AI mob over there.</p><p>[00:09:05] <strong>Alex Volkov:</strong> Alrighty folks, it's time to start with the, with the corner we love the most, yeah? let's dive into this. Let's dive in straight to Open Source AI.</p><p>[00:09:29] <strong>Alex Volkov:</strong> Open Source AI, [00:09:30] let's get it started. Let's start it.</p><p>[00:09:35] <strong>Alex Volkov:</strong> Okay, folks, so open source this week, we're going to get, let me cover the other two things super quick before we dive in.</p><p>[00:09:43] NVIDIA Hymba Hybrid Model Discussion</p><p>[00:09:43] <strong>Alex Volkov:</strong> Alright, so I want to like briefly cover the Hymba paper super quick, because we're going to get the least interesting stuff out of the way so we can focus on the main topic. Of course, NVIDIA released Hymba, 1.5 billion parameters. Hymba is a hybrid small model, from NVIDIA. We talked about hybrid models [00:10:00] multiple times before.
Hymba specifically is a hybrid model between Transformer and I think they're using a hybrid attention with Mamba layers in parallel.</p><p>[00:10:22] <strong>Alex Volkov:</strong> they claim they're beating Llama and Qwen and SmolLM with 6 to 12 times less training as well. Let's look [00:10:30] at the, let's look at their, let's look at their X. So this is what they're, this is what they're showing, this is the table they're showing some impressive numbers, the interesting thing is, this is a table of comparison that they're showing, and in this table of comparison, the comparison is not only Evaluations.</p><p>[00:10:47] <strong>Alex Volkov:</strong> The comparison they're showing is also cache size and throughput, which I like. it's do you guys know what this reminds me of? This reminds me of when you have an electric vehicle [00:11:00] and you have a gas based vehicle or standard combustion engine vehicle, and then they compare the electric vehicle and acceleration.</p><p>[00:11:07] <strong>Alex Volkov:</strong> It's Oh, our car is faster. But you get this by default, you get the acceleration by default with all the electric vehicles. This is how those models work. So for me, when you compare like hybrid models, or, non transformer based models, Mamba based models, the throughput speed up is generally faster because of it.</p><p>[00:11:29] <strong>Alex Volkov:</strong> [00:11:30] But definitely the throughput is significantly higher. Tokens per second is significantly higher. So for comparison for folks who are listening to us, just so you, you'll hear the comparison, the throughput for this 1.5 billion model is 664 tokens per second versus SmolLM at 238 tokens per second, or something like Qwen 1.
I don't know if Junyang you want to confirm or deny the 18 mentioned here that they added. Sometimes they, they say different things, but yeah, definitely the highlight of this Heimwehr thing.</p><p>[00:12:14] <strong>Alex Volkov:</strong> And this is from NVIDIA, by the way, I think it's very worth like shouting out that this specific thing comes from this model comes from NVIDIA. Um,they specifically mentioned that the cost, And outperformance of this model comes at 6 to 12 times less [00:12:30] training, which is very impressive.</p><p>[00:12:31] <strong>Alex Volkov:</strong> what else about this model? Performance wise, MMLU at 52, which is lower than Qwen at 59, at, at 1. 5 billion parameters. GSM 8K, we know the GSM 8K is not that interesting anymore, I think, at this point. We're not like over, we're not over, we're not looking at this like too much. What else should we say about this model?</p><p>[00:12:52] <strong>Alex Volkov:</strong> GPK is pretty interesting at 31. GPK is usually knowledge versus something. [00:13:00] Anything else to say about this model? Yeah, you have anything to say Nisten? Anything to say about the small models? About the hybrid model specifically? I know that like our friend LDJ said that like this seems like the first actual model that competes apples to apples.</p><p>[00:13:13] <strong>Alex Volkov:</strong> Because usually when we compare Hybrid models specifically, those usually people say that those are not like necessarily one to one comparisons between hybrid models and just formal models.</p><p>[00:13:24] <strong>Nisten Tahiraj:</strong> I was just going to say that fromfrom NVIDIA, we've heard these [00:13:30] claims before and they didn't quite turn out that way, so I'm going to start off a little bit more skeptical on that end. also from, from the Mistral Mamba, Mambastral, that one was not very performant.</p><p>[00:13:44] <strong>Nisten Tahiraj:</strong> it seemed like it was going to be good for long context stuff. 
The runtime wasn't that good as well. yeah, I'm going to give this one a test because. Again, the promise of, of like hybrid, SSM models is that it can do better [00:14:00] in longer contexts and it can run faster. So it is worth testing given what, what they're claiming.</p><p>[00:14:06] <strong>Nisten Tahiraj:</strong> But, again, on MMLU, it didn't do that well, but, yeah, overall the numbers do look great actually for what it is, but I think we do need to do further testing on this, whether it is practically good. Because I'm not sure how well it's going to hold up after you just throw like 32k of context at it.</p><p>[00:14:25] <strong>Nisten Tahiraj:</strong> I guess it's going to remember all that, but, yeah, this on paper, this does [00:14:30] look like it's one of the first ones that is apples to apples.</p><p>[00:14:33] <strong>Alex Volkov:</strong> Yeah. All right. anything else to say here? Yeah, the architecture. Yam, go ahead.</p><p>[00:14:39] <strong>Yam Peleg:</strong> Yeah, about the architecture. I tweeted about it. It is, I think it has extreme potential and, it might, I just by looking at the attention maps, from the paper, like just a glimpse is enough for you to see that.</p><p>[00:14:55] <strong>Yam Peleg:</strong> They really do solve something really profound [00:15:00] with many of the models that we have today. basically, I'm really simplifying here, but basically, when you look at the Attention versus Mamba, they act very differently in terms of how they process the tokens, sliding window ones, you could say.</p><p>[00:15:20] <strong>Yam Peleg:</strong> And of course self attention is like global, to everything, but Mamba is not exactly global, it's sequential, and sliding window is also not exactly [00:15:30] global, but it's not the same sequential, it's like everything to everything, but with a window.
So what they did is combine the two, and you can really see the difference in attention map of the trained model.</p><p>[00:15:44] <strong>Yam Peleg:</strong> it's not exactly the same as just, hybrid Mamba attention models that we all saw before. There is a lot to this model and I really want to see one of those just [00:16:00] trained, like, at scale, like a large one on, on, on a huge data set, because I think it might be an improvement, just by looking at the way the model learned, but you cannot know until you actually try.</p><p>[00:16:15] <strong>Yam Peleg:</strong> I tweeted about it just like briefly. So if you want to go and look at, I'm just, I'm just pointing out that go and check the paper out because the architecture is unique. There is, there is a reason the model is, for its size, very performant. [00:16:30]</p><p>[00:16:30] <strong>Alex Volkov:</strong> Yeah, I'm gonna add your tweet.</p><p>[00:16:31] <strong>Alex Volkov:</strong> All right, folks, time for us to move to the second thing.</p><p>[00:16:36] Allen Institute's Olmo 2.0 Release</p><p>[00:16:36] <strong>Alex Volkov:</strong> The folks at Allen AI surprise us with another release this week, and they have, as always they do, they say, hey folks, we divide the categories of open source into not open source at all, then somewhat open weights maybe, and then fully open source, the folks who release the checkpoints, the data, the, the training code.</p><p>[00:16:57] <strong>Alex Volkov:</strong> I will say this, they used to release Weights [00:17:00] & Biases logs as well, and they stopped. So if somebody listens to the show from Allen AI, as I know they do, folks, what's up with the Weights & Biases logs? We know, and we love them, so please release the Weights & Biases logs again. but, they released Olmo 2.
Yay!Olmo 2 is, they claim, is, they claim,the best open, fully open language model to date, and they show this nice graph as well, where, they released two models, Olmo [00:17:30] 2. 7b and Olmo 2. 13b, and they cite multiple things, to, to attribute for the best performance here.</p><p>[00:17:37] <strong>Alex Volkov:</strong> specifically the training stability, they ran this for a significant longer before. they cite some of the recipes of. What we talked about last week from TULU3 methodology, the kind of the state of the art post training methodology from TULU3 that we've talked with Nathan last week, specifically the verifiable framework, thing that we've talked about, multiple other technical things like rate [00:18:00] annealing and the data curriculum.</p><p>[00:18:01] <strong>Alex Volkov:</strong> And obviously they're focusing on their data. they have their, Ohm's selection of tasks on which they compared these models and,the breakdown that I told you about that they do is the open weights models, partially open models, and then fully open models. So this is the breakdown that they have in the area of open weights models.</p><p>[00:18:18] <strong>Alex Volkov:</strong> They have Lama 2. 13b and Mistral 7b, for example, they put Qwen in there as well. So Qwen 2. 57 and 14. And the partially open models, they put Zamba and Stable [00:18:30] LLM. And the fully open models, they put themselves and Olmo and, Ember7B and Olmo2 beats all of that category with some nice, average of stats.</p><p>[00:18:40] <strong>Alex Volkov:</strong> they talk about pre training and a bunch of other stuff. and the instruct category specifically with the Tulu kind of,recipes. What else can we say about Olmo? That's very interesting for folks before we jump into Qwen. What else can we say about Olmo? 
The, oh, the fact that the thing about the fully open source, we always mention this, is the data set.</p><p>[00:18:59] <strong>Alex Volkov:</strong> We [00:19:00] always talk about the data, they release all of the data sets, so Olmo mix was released, Dolmino mix was released, the SFT training data, post training data set was released as well. yeah, folks, comments. You can also try this model at playground.allenai.org. I've tried it. It's interesting. it's, look, uh, the best thing about this is it's the best among open source.</p><p>[00:19:21] <strong>Alex Volkov:</strong> Obviously it's not the best at, generally with closed source data, you can get significantly better than this. But comments from folks about Olmo? [00:19:30]</p><p>[00:19:30] <strong>Wolfram Ravenwolf:</strong> Yeah, it's not multilingual, they said that there is only English, but they are working on putting that in, I think, in another version, but, yeah, it's a truly open source model, not just OpenWeights, so a big applause for them, releasing everything, that is a big thing and I always appreciate it.</p><p>[00:19:46] <strong>Wolfram Ravenwolf:</strong> Thank you.</p><p>[00:19:48] <strong>Alex Volkov:</strong> A hundred percent. All right, folks, it looks like we got Eugene back. Eugene, talk to us about Hymba.</p><p>[00:19:54] <strong>Eugene Cheah:</strong> Yeah, no, sorry, I was just saying that as someone who works on transformer [00:20:00] alternatives, it's actually really awesome to get the data point because we all haven't decided what's the best arrangement, what's the percentage of transformer versus non transformer?</p><p>[00:20:08] <strong>Eugene Cheah:</strong> Are the non transformer layers in the front or the back? It's like you say, the car and the car scenario, it's like electric car, do we even know if we want the electric engine in front or the back? and these are data points that we love to test to just, find out more and it's.
And I appreciate what NVIDIA is doing as well and looking forward to more research in this space.</p><p>[00:20:26] <strong>Alex Volkov:</strong> Awesome. thanks for joining us and feel free to stay. The more the merrier. This is like a [00:20:30] Thanksgiving kind of pre party for all of us. The more the merrier, folks. If you're listening to this only and you're not like on the live stream, I encourage you to go and check us out because like we're also like showing stuff.</p><p>[00:20:40] <strong>Alex Volkov:</strong> We're like showing the papers. We're like, we're waving. We're like showing turkey, whatever. we're having fun. all right, folks. I think it's time to talk about the main course. We just ate the mashed potatoes. Let's eat the turkey for open source.</p><p>[00:20:53] Qwen Quill 32B Reasoning Model</p><p>[00:20:53] <strong>Alex Volkov:</strong> In this week's Open Source Turkey dinner, the Reasoning Model, the first ever Reasoning Open [00:21:00] Source, we got Qwen Quill, Qwen Quill?</p><p>[00:21:04] <strong>Alex Volkov:</strong> Yes, Qwen Quill 32B preview, the first open source. Let's go! Let's go! The first open source Reasoning Model from our friends at Qwen. We have Jun Yang here, Jun Yang and Justin Lin, to talk to us about this release. Folks at OpenAI released O1 a couple of months ago.</p><p>[00:21:25] <strong>Alex Volkov:</strong> Then the folks at DeepSeek showed R1, which they [00:21:30] promised to give us, maybe at some point. The folks at OpenAI did not release the reasoning. So, what you see in O1 is the reasoning being obfuscated from us, so we can't actually see how the model reasons. R1 gave us the reasoning itself.
Junyang, how did you do it? What did you do? Please give us all the details as much as possible. Please do the announcement yourself.</p><p>[00:21:58] <strong>Alex Volkov:</strong> Thank you for joining us. [00:22:00] Junyang from Qwen.</p><p>[00:22:00] <strong>Junyang Lin:</strong> Yeah, thanks everyone for the attention and for the appreciation, and I'm Junyang from the Qwen team, and we just released the new model for reasoning, but we just added a tag that it is a preview. Yeah, it is something very experimental, but we would really like to receive some feedback to see how people use it and to see what people think.</p><p>[00:22:24] <strong>Junyang Lin:</strong> The internal problems,they really are. Yeah, it is called QUIL. it is [00:22:30] something, very interesting naming,because we like to see that, we first called it like Q1,things like that, but we think it's something too normal and we'd like to see there was something connected with IQ, EQ, then we call it QQ, and then we found out, QWEN with a W there.</p><p>[00:22:47] <strong>Junyang Lin:</strong> And we found a very interesting expression because it looks really cute. There is a subculture in China with the text expression to express the feelings. So it is something very interesting. So we [00:23:00] just decided to use the name and for. For the pronunciation, it's just like the word Q, because I combined QW, the pronunciation of QW, with U together, and it's still just cute.</p><p>[00:23:13] <strong>Junyang Lin:</strong> Yeah, there's something beside the model, and it is actually a model, which can, And this is the reason before it reaches the final response. If you just try with our demo and you will find that it just keeps talking to itself. And it's something really [00:23:30] surprising for us. If it asks you a question, it just keeps talking to itself to discover more possibilities as possible.</p><p>[00:23:42] <strong>Junyang Lin:</strong> And sometimes will lead to some new things. 
Endless generation. So we have some limitations there. We mentioned the limitations almost in the second paragraph, which includes endless generation. But it is very interesting. I [00:24:00] don't say it is a really strong model, something like competitive with O1 or outcompeting R1.</p><p>[00:24:06] <strong>Junyang Lin:</strong> It is not simply like that. We show the benchmark scores, but they are something for your reference, to see that maybe it is at this level. And then if you really check the model performance, when it processes mathematics and coding problems, it really thinks step by step, and it really discovers more possibilities.[00:24:30]</p><p>[00:24:30] <strong>Junyang Lin:</strong> Maybe it is a bit like brute forcing, just discovering all possibilities, checking whether 1 plus 2 is equal to 1, discovering a lot of possibilities. But it sometimes can finish some very difficult tasks. I think you guys can wait for our more official release, maybe one month or two months later.</p><p>[00:24:53] <strong>Junyang Lin:</strong> We'll make sure the next one will be much better than this preview one, but you can play with it. It is something really interesting, [00:25:00] very different from the previous models.</p><p>[00:25:02] <strong>Alex Volkov:</strong> So first of all, a huge congrats on releasing something that, it looks like, piqued interest for tons of folks, absolutely.</p><p>[00:25:09] <strong>Alex Volkov:</strong> Second of all, it definitely thinks. It looks like it actually does. 
You can see the thinking. We're actually showing this right now, and for folks who are just listening, I'll just read you the actual ice cube question that we have: somebody places four ice cubes at the start of the first minute, and then five ice cubes at the start of the second minute. How many ice cubes are there at the [00:25:30] start of the third minute? We should probably have prepared like a turkey-based question for this one, but basically the answer is zero.</p><p>[00:25:36] <strong>Alex Volkov:</strong> The ice cubes melt within a minute, and the answer is zero, and people know the answer is zero because ice cubes melt faster than a minute. But the LLM starts going into math and s**t, and, just to be clear, O1 answers this question, it understands the answer is zero. QwQ does not.</p><p>[00:25:53] <strong>Alex Volkov:</strong> But the reasoning process is still pretty cool, and compared to other models, you can see it thinking. It's, let me set up an equation. Oh, [00:26:00] actually, it's not correct. Ah, now the equation is asking for this and this and this. And it goes, like, this is confusing. Let me read the problem again.</p><p>[00:26:06] <strong>Alex Volkov:</strong> And so it tries to read the problem again. This feels not like just spitting tokens. So Junyang, could you tell us, what's the difference between this and training a regular Qwen 2.5? As far as I saw, this is based on Qwen 2.5, correct?</p><p>[00:26:27] <strong>Junyang Lin:</strong> Yeah, it is based on the Qwen 2.5 [00:26:30] 32 billion Instruct model. Yeah, we have tried a lot of options. Maybe we will release more technical details later, but I can tell you something: we mostly just did some work on the post-training data. 
Because it is actually based on our previous model, we did not change the pre-training, because we are actually very confident in our pre-training; we have trained it with [00:27:00] a lot of tokens, so there should be some knowledge about reasoning there. And in Qwen 2.5,</p><p>[00:27:05] <strong>Junyang Lin:</strong> we also have some text reasoning related data in the pre-training process, so we just tried to see if we can align it with the behavior of such reasoning. So we did some very simple supervised fine-tuning, and we found that it can generate things like that. We have done a bit of RL stuff, and we also have done something like RFT, rejection [00:27:30] fine-tuning, so we can add more data from it.</p><p>[00:27:33] <strong>Junyang Lin:</strong> And there are a lot of techniques, just like self-alignment. We use the base language model with in-context learning to build samples for us, to build something like that to make a model that can reason, and we found it really surprising. We did not do very complex stuff, but we find that it has this behavior. But we still find that there is still much room in the reinforcement learning [00:28:00] from human feedback, because we found that if you add some RL, you can improve the performance very significantly. So we have some belief that maybe, if we do some more in areas like process reward modeling, LLM critiques, and also things like building more nuanced data for the multi-step reasoning, the model will be much better.</p><p>[00:28:26] <strong>Junyang Lin:</strong> Yeah. But this one is interesting. You can keep [00:28:30] talking to it. It keeps talking to itself, just talking through some strange thinking, and sometimes: maybe I'm wrong, I will check the question again, and maybe I'm wrong again, and then do it again and again. 
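The rejection fine-tuning (RFT) step Junyang mentions can be sketched in a few lines: sample many reasoning traces, keep only the ones that land on the reference answer, and reuse the survivors as supervised fine-tuning data. This is a toy illustration with a stubbed stand-in for the model, not Qwen's actual pipeline; all helper names here are hypothetical.

```python
def stub_model(question: str, seed: int) -> str:
    """Stand-in for sampling one reasoning trace from a base model.
    (Hypothetical stub; a real pipeline samples from the LLM itself.)"""
    answer = "4" if seed % 3 == 0 else str(seed)
    return f"Let me check the problem step by step... the answer is {answer}"

def final_answer(trace: str) -> str:
    """Extract the final answer from a sampled trace."""
    return trace.rsplit("the answer is ", 1)[-1].strip()

def rejection_sample(question: str, reference: str, n: int = 16) -> list[str]:
    """Sample n traces and keep only those whose final answer matches the
    reference; the survivors become extra supervised fine-tuning data."""
    traces = (stub_model(question, seed) for seed in range(n))
    return [t for t in traces if final_answer(t) == reference]

kept = rejection_sample("What is 2 + 2?", reference="4")
```

The filtering is the whole trick: the base model supplies its own training data, and the reference answer acts as the reject/accept gate.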
And sometimes the generation is too long, because we have some limitations in long text generation.</p><p>[00:28:49] <strong>Junyang Lin:</strong> I think all models have this problem, so when it reaches maybe some bound, it will turn into some crazy behaviors; it just never [00:29:00] stops generating. We just mentioned this limitation. Just</p><p>[00:29:05] <strong>Alex Volkov:</strong> to make sure folks understand, this is a preview, this is not like an official release. You guys are like, hey, this is a preview, this is a test.</p><p>[00:29:12] <strong>Alex Volkov:</strong> You guys are trying this out, folks should give feedback, folks should try it out. Maybe fine-tune also on top of it. Yeah. We're definitely trying this out. This is</p><p>[00:29:21] <strong>Yam Peleg:</strong> it's like ChatGPT was a research preview. It's not exactly a preview. It beats the benchmarks on so many problems.</p><p>[00:29:29] <strong>Yam Peleg:</strong> We would</p><p>[00:29:29] <strong>Junyang Lin:</strong> like [00:29:30] to make it a fun, funny thing to make people happy. It's now Thanksgiving, and people are always expecting models from us. And they're just asking all the time, where's our reasoning model, or things like that. Yeah. So we showed this one to you. And.</p><p>[00:29:48] <strong>Alex Volkov:</strong> Yeah. Yam, Wolfram, folks, comments about the reasoning model from Qwen?</p><p>[00:29:53] <strong>Yam Peleg:</strong> Oh, I have a lot of comments. That's a lot. I don't know if you can hear me. Yeah, Yam, [00:30:00] go ahead.</p><p>[00:30:00] <strong>Alex Volkov:</strong> There's just a delay, but we're good.</p><p>[00:30:02] <strong>Yam Peleg:</strong> Yeah, I just want to say, it's like, uh, ChatGPT was, uh, a research preview. It's a really good thing.</p><p>[00:30:10] <strong>Yam Peleg:</strong> It's a really good model. Seriously. So, I mean, it can be a preview, but it's extremely powerful. How did you guys train this? 
I mean, what's the data? How did you generate it? Can I just create data that looks like O1's and fine-tune, and it's going to work? Or, like, give us some details.</p><p>[00:30:28] <strong>Yam Peleg:</strong> It's a really hard thing to [00:30:30] do. It's really, really, really successful. So how did you make it?</p><p>[00:30:35] <strong>Alex Volkov:</strong> Give us some details if you can, I'm saying. If you can. Don't let Yam, don't let Yam push you into giving details that you cannot give. But hey, it looks like we may have lost Junyang for a bit with some connection issues. While he reconnects, we got... Maybe he can't share details, so</p><p>[00:30:52] <strong>Wolfram Ravenwolf:</strong> They pulled the plug.</p><p>[00:30:53] <strong>Alex Volkov:</strong> And Wolfram, what's your... I saw your take. Meanwhile, let's take a look. You did some testing for this model as well, right?</p><p>[00:30:59] <strong>Wolfram Ravenwolf:</strong> [00:31:00] Yeah. And I just ran the ice cube prompt, and on my run, it got the zero correct.</p><p>[00:31:04] <strong>Wolfram Ravenwolf:</strong> So that is a bit of a red flag. Oh, you</p><p>[00:31:06] <strong>Alex Volkov:</strong> did get it correct.</p><p>[00:31:07] <strong>Wolfram Ravenwolf:</strong> Yeah. It was fun because it wrote over 10,000 characters, but in the end it said, okay, so confusing, they all melted, zero. So that worked. But of course you have to run benchmarks multiple times. I did run the MMLU Pro computer science benchmark twice.</p><p>[00:31:23] <strong>Wolfram Ravenwolf:</strong> And what is very interesting is, also here, it generated much more tokens than any other model. The second highest [00:31:30] number of tokens was GPT-4o, the latest one, which was 160,000 tokens for the whole benchmark. And here we have over 200,000: 232,000 tokens it generated. 
So it took me two and a half hours to run it.</p><p>[00:31:45] <strong>Wolfram Ravenwolf:</strong> And, yeah, it's an 8B model, no, a 32B model at 8-bit in my system where I was running it, because I have 48GB VRAM, so you can run it locally. And look at it: it's placed above the 405B [00:32:00] Llama 3.1, it's above the big Mistral, it's above the ChatGPT-latest, and the GPT-4o, yeah, the most recent one.</p><p>[00:32:08] <strong>Wolfram Ravenwolf:</strong> So just to recap</p><p>[00:32:09] <strong>Alex Volkov:</strong> what you're saying: on the MMLU Pro benchmark, this is a model that you run on your Mac, or whatever PC, and it beats the Llama 3.1 405 billion parameter model on this benchmark, because it's reasoning and it's smart, it runs for longer, and it uses that test time compute, inference time [00:32:30] compute scaling law that we talked about multiple times.</p><p>[00:32:33] <strong>Alex Volkov:</strong> It runs for longer and achieves a better score. This is the excitement. This is the stuff. So Junyang, now that you're back with us, could you answer at least some of Yam's question? If you couldn't hear it before, I will repeat it for you. How? What does the data look like? 
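Wolfram's point about fitting the 32B model at 8-bit into 48GB of VRAM is easy to sanity-check with back-of-envelope math. This counts the weights only; KV cache and runtime overhead come on top, so these are rough lower bounds:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (decimal)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(32, 16)  # 64.0 GB: does not fit a 48 GB card
int8_gb = weight_memory_gb(32, 8)   # 32.0 GB: fits, with headroom for KV cache
```

Halving the bits per weight is exactly what makes the difference between "doesn't fit" and "fits with room to spare" on a 48GB setup.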
Can you just come up with some O1 stuff?</p><p>[00:32:51] <strong>Alex Volkov:</strong> By the way, welcome, welcome Nisten.</p><p>[00:32:53] <strong>Nisten Tahiraj:</strong> But I tried it.</p><p>[00:32:54] Introduction to the New Google Model</p><p>[00:32:54] <strong>Nisten Tahiraj:</strong> It got the Martian rail train launcher question, it got it perfectly [00:33:00] on first try, and I've seen other models take three tries. I use this as a standard question on most models: if you're going to launch a train from the highest mountain in the solar system, which is on Mars, and you want to accelerate it at two G's, so it's still comfortable,</p><p>[00:33:21] <strong>Nisten Tahiraj:</strong> how long would that track need to be in order for you to get to orbital velocity, and in order for you to leave [00:33:30] Mars' gravity well? And it's a very good question because there are so many steps to solve it, and you can just change it, you can say 2.5G, and that completely changes the order of the steps that the model has to solve.</p><p>[00:33:42] <strong>Alex Volkov:</strong> So it's unlikely to be in the training data, and it got it perfectly. And again, even the new Google preview, even Sonnet takes two tries, two or three tries often, to get the right answer. So, yeah, the model worked, and I had the same thing as [00:34:00] Wolfram: it did put out a lot of tokens, but again, it's pretty fast to run locally. Folks, it's a good model. For a test preview, for something that was released as a first open weights reasoning model, we are very impressed.</p><p>[00:34:14] Model Performance and Availability</p><p>[00:34:14] <strong>Alex Volkov:</strong> We're gonna give Junyang one more, one more attempt here. Junyang, I see you on the Spaces, 
and as a speaker, maybe you can unmute there and speak to us through the Spaces. While we try this out, I will just tell folks that you can download this model.</p><p>[00:34:27] <strong>Alex Volkov:</strong> It's already on Ollama. [00:34:30] You can just, like, ollama run qwq. It's already on OpenRouter as well. You can get it on OpenRouter, so you can replace whatever you use, like OpenAI; you can replace it and put this model in there. You can try it out on Hugging Face, this is where we tried it just now.</p><p>[00:34:47] <strong>Alex Volkov:</strong> And it's awesome. It's awesome to have this. I'm pretty sure that many people are already trying different variations and different fine-tunes of this model, and it just goes up from here. To get an open [00:35:00] model, 32 billion parameters, that gets, what is the score? Let me take a look.</p><p>[00:35:04] <strong>Alex Volkov:</strong> The score is, I think it gets 50 on AIME. It's ridiculous. Anybody try this on the ARC Challenge, by the way? Do you guys see, in your tweets or whatever, the ARC Challenge? Anybody try to run this model on that? I would be very interested, because that's a big prize. It's a very big prize.</p><p>[00:35:22] <strong>Alex Volkov:</strong> I'm pretty sure</p><p>[00:35:22] <strong>Eugene Cheah:</strong> someone's trying right now. You should shout that out.</p><p>[00:35:26] <strong>Alex Volkov:</strong> I'm pretty sure somebody's trying right now. They could use a</p><p>[00:35:29] <strong>Wolfram Ravenwolf:</strong> 72B [00:35:30] version of it, and maybe that gets even better. Probably does.</p><p>[00:35:35] <strong>Alex Volkov:</strong> Yeah. They're probably training a bigger model than this right now. All right, folks. 
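Nisten's Mars rail-launcher question from earlier reduces to a couple of kinematics steps, which is roughly why it makes a good multi-step probe. A rough sketch, using standard constants and the constant-acceleration relation v² = 2aL; it ignores Olympus Mons' extra altitude and Mars' rotation, and the numbers are my approximations, not what any model answered:

```python
import math

MU_MARS = 4.2828e13   # Mars gravitational parameter, m^3/s^2
R_MARS = 3.3895e6     # Mars mean radius, m (Olympus Mons adds ~22 km, ignored)
G_EARTH = 9.81        # "two G's" of comfort, taken as 2x Earth gravity
a = 2 * G_EARTH       # track acceleration, m/s^2

# Circular orbital speed at the surface radius, and escape speed (sqrt(2) higher)
v_orbit = math.sqrt(MU_MARS / R_MARS)   # ~3.55 km/s
v_escape = math.sqrt(2) * v_orbit       # ~5.03 km/s

# Constant acceleration from rest: v^2 = 2*a*L  ->  L = v^2 / (2*a)
track_orbit_km = v_orbit**2 / (2 * a) / 1000    # ~322 km of track for orbit
track_escape_km = v_escape**2 / (2 * a) / 1000  # exactly double: ~644 km
```

Changing the acceleration to 2.5G rescales both track lengths, which is exactly the perturbation Nisten uses to keep the question out of training data.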
So with this, I think we've covered pretty much everything that we wanted to cover with QwQ.</p><p>[00:35:46] Scaling and Model Efficiency</p><p>[00:35:46] <strong>Alex Volkov:</strong> And I think, yeah, the one thing that I wanted to show, let me just show this super quick before we move on to the next topic, is this scaling kind of thing. We saw pretty much the same thing from [00:36:00] DeepSeek. And then we saw pretty much the same thing also from OpenAI. The kind of scaling confirmation, the scaling law confirmation, the next scaling law confirmation: test time compute, or inference time compute, works.</p><p>[00:36:11] <strong>Alex Volkov:</strong> Which basically means that the more thinking, the more tokens, the more time you give these models to think, the better their answer is. We're getting more and more confirmation for this kind of Noah Brown, I don't know, thesis, that these models actually perform [00:36:30] significantly better when you give them more tokens to think.</p><p>[00:36:32] <strong>Alex Volkov:</strong> This is incredible to me. This is incredible because not only will we have better models with more scale, but, even though some people claim a wall has been hit, no wall has been hit; we also now have these models that can answer better with more tokens. And this is another confirmation of this.</p><p>[00:36:51] <strong>Alex Volkov:</strong> Qwen QwQ 32B is now here. You can now run a 405B-level model, at least on [00:37:00] MMLU Pro, like Wolfram here said, on your computers. And shout out to our friends from Alibaba Qwen for releasing these awesome models for us as a Thanksgiving present.</p><p>[00:37:10] <strong>Alex Volkov:</strong> Junyang, you're back with us. Let's see. Maybe you're back.</p><p>[00:37:14] <strong>Junyang Lin:</strong> I don't know if you can hear me. 
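One toy way to build intuition for the test-time compute point above: if a model solves a problem with some probability in a single attempt, spending more compute on independent attempts raises the chance that at least one succeeds. This best-of-n view is only a cartoon of what O1-style models do (they lengthen a single reasoning chain rather than just resampling), but the monotonic curve is the point:

```python
def pass_at_k(p_single: float, k: int) -> float:
    """Chance that at least one of k independent attempts succeeds."""
    return 1 - (1 - p_single) ** k

# A problem solved 20% of the time in one shot, with more attempts allowed:
curve = {k: round(pass_at_k(0.2, k), 3) for k in (1, 4, 16, 64)}
```

Every extra attempt buys a smaller and smaller gain, but the curve only goes up, which is the shape of the "more tokens, better answers" results being discussed.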
Yes,</p><p>[00:37:16] <strong>Alex Volkov:</strong> we can hear you finally, yes.</p><p>[00:37:18] <strong>Junyang Lin:</strong> I don't know what happened.</p><p>[00:37:19] <strong>Alex Volkov:</strong> It's</p><p>[00:37:20] <strong>Junyang Lin:</strong> fine. I</p><p>[00:37:22] <strong>Alex Volkov:</strong> think that, let's try this again, maybe one last thing as we're going to try.</p><p>[00:37:27] Discussion on Reasoning Models</p><p>[00:37:27] <strong>Alex Volkov:</strong> What, from what you can tell us, [00:37:30] does the work on this look like?</p><p>[00:37:34] <strong>Alex Volkov:</strong> Is a lot of it synthetic? Is a lot of it RL? Could you give us a little bit of a hint of what's going to come in the technical release for this? And also, what can we look forward to in the upcoming releases? Are you maybe working on a bigger model? Give us something for Thanksgiving.</p><p>[00:37:51] <strong>Junyang Lin:</strong> Oh yeah. For the reasoning steps, I think the data quality really matters, and we think that we may split the steps [00:38:00] more, make them more nuanced, make them smaller steps. It can be just the possible answers with higher probability, which means that the machine may think in a different way from the human being.</p><p>[00:38:12] <strong>Junyang Lin:</strong> The human being may reach the answer very directly, but sometimes, for a reasoning model, it may reason to explore more possibilities. So when you label the data, you should pay attention to these details. This is a part of it, and for now we have only done some work on mathematics and [00:38:30] coding, and especially mathematics, and I think there's still much room in general knowledge understanding.</p><p>[00:38:37] <strong>Junyang Lin:</strong> I found that Wolfram just tested it on MMLU Pro, but we actually did not strengthen its performance for MMLU Pro, this kind of benchmark. 
So I think for the scientific reasoning, there's still much room for it to improve. And something surprising for us is that we found that it sometimes generates more beautiful text, more [00:39:00] poetic, something like that.</p><p>[00:39:02] <strong>Junyang Lin:</strong> I don't know why, maybe it is because it reasons. So I think it may encourage creative writing as well. A reasoning model that can encourage creative writing, that would be something very interesting. I also found some cases on Twitter where people found that it sometimes generates text more beautiful than Claude's.</p><p>[00:39:22] <strong>Junyang Lin:</strong> There's still much room for a reasoning model. Yep.</p><p>[00:39:25] <strong>Alex Volkov:</strong> Very interesting. Just to recap: folks found that this model that is [00:39:30] trained for reasoning gives more poetic writing. That's very interesting. All right, folks, I think it's time for us to move on, but</p><p>[00:39:37] <strong>Wolfram Ravenwolf:</strong> just one quick comment.</p><p>[00:39:39] Multilingual Capabilities of Qwen</p><p>[00:39:39] <strong>Wolfram Ravenwolf:</strong> It's also very good in German. I tested it in German as well. So even if it may not be the focus, if you are multilingual or speak another language, try it. Yeah,</p><p>[00:39:50] <strong>Junyang Lin:</strong> that's something not that difficult for us, because the Qwen model is strongly multilingual, and actually, I think it is now good at German.</p><p>[00:39:59] <strong>Junyang Lin:</strong> Yeah, [00:40:00]</p><p>[00:40:02] <strong>Alex Volkov:</strong> Qwen is multilingual and very good at German.</p><p>[00:40:04] BlueSky hate on OpenSource AI discussion</p><p>[00:40:04] <strong>Alex Volkov:</strong> All right, folks, I think it's time for us to move on a little bit, and now we're moving to a less fun conversation, but I think we should talk about this. 
Just a heads up, after this we're gonna have This Week's Buzz, but I don't have a category for this.</p><p>[00:40:19] <strong>Alex Volkov:</strong> I don't have a category for this, but it must be said. ThursdAI is all about positivity. We talk about AI every week to highlight the advancements, we highlight with positivity, we get excited about every new [00:40:30] release, every new whatever. We're also now on YouTube as well, and this coincided with some of the folks in the AI community moving over to BlueSky. Let me actually first say hi to my colleague here, Thomas.</p><p>[00:40:44] <strong>Alex Volkov:</strong> I'm going to pull you up on stage as well. Welcome, Thomas. Hey man, welcome. My colleague for the past year from Weights & Biases, welcome as well. You're more than welcome to join us, because you're also on BlueSky. And so a bunch of the community recently started seeing whether or not there's a [00:41:00] new place over at BlueSky</p><p>[00:41:02] <strong>Alex Volkov:</strong> for the ML community. I saw a bunch of ML people over there as well. I see Wolfram over here has a little butterfly. You all who are joining us from Twitter, or X Spaces, for example, you've probably seen a bunch of your favorite AI folks post just a blue butterfly, and maybe followed them towards the other social media platform due to your political preferences, wherever they may be, which is completely fine.</p><p>[00:41:26] <strong>Alex Volkov:</strong> That's all good and well and fine. So I started cross-posting to both, [00:41:30] and I'll show you how my screen looks recently. This is how my screen looks: I scroll here, I scroll on X, and I scroll on BlueSky. This is what my life looks like. Yes, I'm on both, because I want to make sure that I'm not missing any of the news</p><p>[00:41:43] <strong>Alex Volkov:</strong> that I want to bring to you. And also Xenova, our friend, right? 
He posts everywhere, and I see the community bifurcating. I don't like it. But I want to make sure that I'm not missing anything. This is not what I want to talk to you about, though. Not the bifurcation. I don't mind the bifurcation. We'll figure out something.</p><p>[00:41:58] <strong>Alex Volkov:</strong> We're on YouTube as well, [00:42:00] so the folks from BlueSky who don't jump on the Twitter/X community, they can still join the live chat. What I want to talk to you about is this thing that happened where a bunch of folks from Hugging Face just joined BlueSky as well, and one of the maybe nicest people from the Hugging Face community, Daniel, I'm blanking on his last name, Nisten, maybe you can help me out, Daniel van Strien?</p><p>[00:42:24] <strong>Alex Volkov:</strong> Daniel van Strien? Basically did what he thought was [00:42:30] maybe a cool thing. He compiled a dataset. You guys know we talk about data and open source and Hugging Face as well. This is in the spirit of the open source community; we talk about open datasets. I have a thing here. This is my thing:</p><p>[00:42:43] <strong>Alex Volkov:</strong> when we talk about somebody releasing open source datasets, we have a thing. We clap, right? And so he compiled a dataset of 1 million BlueSky posts to do some data science, and put it on Hugging Face. Just to mention one thing first: [00:43:00] unlike Twitter, which used to be open, then Elon Musk bought it and closed the API, and now you have to pay $42,000 a year.</p><p>[00:43:07] <strong>Alex Volkov:</strong> $42,000 a year. Yes, this is the actual price. $42,000 a year. This is the actual, literal price for the API. Unlike Twitter, which used to be free, BlueSky is built on a federated protocol. There's a firehose API you can connect to. And then you can just drink from this firehose for free. 
This is like the whole point of the platform.</p><p>[00:43:27] <strong>Alex Volkov:</strong> So then he connected to this firehose, drank from it, and [00:43:30] compiled a dataset of 1 million posts, and put it up on Hugging Face, open source.</p><p>[00:43:36] Community Reactions and Moderation Issues</p><p>[00:43:36] <strong>Alex Volkov:</strong> And then got death threats. Death threats. He got death threats for this thing. People told him that he should kill himself for this act, where he compiled data from an open firehose of data that is open on purpose.</p><p>[00:43:58] <strong>Alex Volkov:</strong> What the actual f**k? [00:44:00] And when I saw this, I'm like, what is going on? And in less than 24 hours, I'm going to just show you guys what this looks like. Okay, this is on the left of my screen, and for the folks who are not seeing this, I'm going to maybe pin it.</p><p>[00:44:13] <strong>Alex Volkov:</strong> Yeah, let me just do this super quick. So for you guys who are just listening to this, please see my pinned tweet as well, because this is some insanity. Okay. And we have to talk about this because it's not over. He compiled a 1-million-public-posts BlueSky firehose API dataset.</p><p>[00:44:27] <strong>Alex Volkov:</strong> And then it got extremely [00:44:30] viral, to the point where, I don't know, it's at almost 500 of whatever it's called. And then the amount of hate and vitriol in the replies that he got from people on here. Including, yes, including you-should-kill-yourself comments and death threats and doxxing threats, et cetera.</p><p>[00:44:47] <strong>Alex Volkov:</strong> Many people reached out directly to Hugging Face folks. He became maybe the number two most blocked person on the platform as well. And all of this, people reached out to the Hugging Face community. 
Basically, in less than [00:45:00] 24 hours, he said: I removed the BlueSky data from the repo. I wanted to support tool development for the platform, but recognize this approach violated the principles of transparency and consent. I apologize for this mistake. Which, okay, fine. I acknowledge his position. I acknowledge the fact that he works in a company, and this company has lawyers, and those lawyers need to adhere to GDPR laws, et cetera.</p><p>[00:45:23] <strong>Alex Volkov:</strong> And many people started saying, hey, you compiled my personal data without the right for removal, et cetera, without the due [00:45:30] process, blah, blah, blah. Those lawyers came, there's a whole thing there. And then our friend here, Alpin, who's a researcher of his own, connected to the same open firehose of data and collected a dataset of 2 million posts.</p><p>[00:45:47] <strong>Alex Volkov:</strong> That's twice as many as Daniel did, and posted that one, and then became the person of the day. Alpin, you want to take it from here? You want to tell us what happened to you since then? What did your 24 hours look [00:46:00] like?</p><p>[00:46:00] <strong>Alpin Dale:</strong> Yeah, sure. It's been quite the experience being the main character of the day on BlueSky.</p><p>[00:46:05] <strong>Alpin Dale:</strong> And, obviously, I'm not showing my face for very obvious reasons. I have received quite a few threats because, yeah, unlike Hugging Face employees, I am not beholden to a corporation, so I didn't really back down. 
And, yeah, I probably received hundreds of death threats and doxxing attempts.</p><p>[00:46:24] <strong>Alpin Dale:</strong> So just to reiterate what you said, the firehose API is completely [00:46:30] open.</p><p>[00:46:31] <strong>Alpin Dale:</strong> It's a good analogy with the name, because it's like a firehose; anyone can use it.</p><p>[00:46:35] Legal and Ethical Implications</p><p>[00:46:35] <strong>Alpin Dale:</strong> They've also threatened me with litigation, but, I'm not sure if you guys are aware, there was a court case back in 2022, hiQ Labs versus LinkedIn, where hiQ Labs was scraping public accounts from LinkedIn and using them for some commercial purpose, I don't remember.</p><p>[00:46:54] <strong>Alpin Dale:</strong> They did actually win in court against LinkedIn at first, and what they were doing was [00:47:00] even more legally questionable, because LinkedIn doesn't have a publicly accessible API, and they have Terms of Service specifically against that sort of scraping. The ruling was overturned later and they lost the claim, but it did set a precedent that data published on publicly accessible platforms could be lawfully collected and used, even if terms of service purported to limit such usage.</p><p>[00:47:28] <strong>Alpin Dale:</strong> But I [00:47:30] never agreed to such terms of service when I started scraping or copying the data from the firehose API, because, first, I didn't do any authentication. Second, I didn't provide a username when I did that. So anyone could have done that, technically, with the AT Protocol Python SDK. You don't even need to sign in or anything.</p><p>[00:47:52] <strong>Alpin Dale:</strong> You just connect to the thing and start downloading.</p><p>[00:47:55] <strong>Alex Volkov:</strong> Yeah, the platform is built on the ethos of the open [00:48:00] web. 
The open web is: you connect and you read the data. This is the ethos of the open web. And when this is the ethos of the open web, when you post on this platform, whether or not the TOS says anything, when you don't need to authenticate, that should be the understanding, regardless. And I understand some of the anger when people discover, oh, s**t, my thoughts that I posted on this platform so far are being used to, whatever, train whatever.</p><p>[00:48:28] <strong>Alex Volkov:</strong> I understand some of this. I [00:48:30] don't agree with them, but I understand how some people may feel when they discover, hey, my thoughts could be collected, blah, blah, blah. And somebody posted a nice thread about it. But the platform is open completely. Going from there to death threats, this is where I draw my line completely.</p><p>[00:48:45] <strong>Alex Volkov:</strong> Alpin, the next thing that happened is what I want to talk to you about. You're getting death threats, you're getting doxxing attempts. Um, I couldn't find your post today. What happened?</p><p>[00:48:56] <strong>Alpin Dale:</strong> For some reason, BlueSky decided to terminate my [00:49:00] account instead of the ones issuing the death threats. Very interesting chain of events. But they claimed that I was engaging in troll behavior, whatever that means.</p><p>[00:49:10] <strong>Alpin Dale:</strong> And for that reason, they just... it wasn't even due to mass reporting like what happens on X.com, right? They specifically emailed me, with very human-generated language, where they told me that I was being a troll. I think I posted it on my Twitter account too. 
And, yeah, they just assumed I'm trolling. [00:49:30] And what's funny is, there have been screenshots floating around of similar mod messages just giving people a slap on the wrist for much, much worse things, like things we can't even talk about here, right?</p><p>[00:49:44] <strong>Alpin Dale:</strong> So, very strange, very silly situation overall. And another thing I wanted to mention: a lot of people were bringing up the GDPR and all that, because of personally identifiable information. But if you go to the [00:50:00] dataset, all we have is the post text, the timestamp, the author, and the author name is just a hash, it's not the full author name, and the URI. So there isn't really much to link people to their specific posts, and there isn't even a location tag. So I'm not sure if it fully applies under GDPR, but I'm not a lawyer anyways. And the thing is, the data, their posts, were published on a platform that is explicitly designed for public [00:50:30] discourse, right?</p><p>[00:50:31] <strong>Alpin Dale:</strong> And the decision to share sensitive information on a platform like this lies with the user, not the observer. And we are the observer in this case. And by the very nature of public platforms, individuals that post content like this have to bear the responsibility that their information is accessible to anyone.</p><p>[00:50:51] <strong>Alpin Dale:</strong> And I don't think my dataset alters this reality, because it just consolidates information that was already available to [00:51:00] everyone. And I guess there were also people who were asking for an opt-out option, and the Hugging Face CEO, Clem, also made an issue on the repo about this. 
And I did provide a very straightforward opt-out process: if someone wants to remove their data, they can just submit a pull request</p><p>[00:51:18] <strong>Alpin Dale:</strong> to remove the specific posts that belong to them. But they also have to accompany it with a proof of authorship; they have to prove to me that the post they're removing [00:51:30] belongs to them and it's not a malicious request. So I guess I've covered all grounds, so I'm not sure what people are worried about.</p><p>[00:51:38] <strong>Alex Volkov:</strong> So I, uh, I'm just showing to the folks who are listening, I'm showing an email from the moderation team at BlueSky.</p><p>[00:51:46] <strong>Alex Volkov:</strong> The BlueSky account alpindale.bsky.social was reviewed by BlueSky Content Moderators and assessed as a new account trolling the community, which is a violation of our community guidelines. As a result, the account has been permanently suspended. They didn't even give you the chance to, like, hey, delete this and come back to [00:52:00] the platform.</p><p>[00:52:00] <strong>Alex Volkov:</strong> Literally permanently suspended. The option of, hey, delete this and come back, is not there, while the folks who sent like 13 death threats are still there. Um, what can we say about this? It's ridiculous. Absolutely. And the fact that Hugging Face's account, your account, Daniel's account, became the most blocked accounts on the platform in the past 24 hours, more so than some crazy Manosphere accounts, is just absolute insanity.</p><p>[00:52:28] <strong>Alex Volkov:</strong> The fact is, most of [00:52:30] these anger-prone accounts are completely anti-AI. And the whole issue about, like, consent, whatever: most of them don't even appear in the dataset, by the way. 
Like, some people checked on the fly, Xeophon and I, we did some basic checking; many people didn't even appear in the dataset.</p><p>[00:52:44] <strong>Alex Volkov:</strong> And the absolutely silly fact that none of them understand the Barbra Streisand effect on the internet, and the fact that there's five datasets right now. Many of them collected the [00:53:00] dataset of the people who reacted to these specific posts.</p><p>[00:53:02] <strong>Alex Volkov:</strong> And people just don't understand how the internet works. That was just ridiculous to me.</p><p>[00:53:07] Moving Forward with Open Source</p><p>[00:53:07] <strong>Alex Volkov:</strong> So, Alpin, I personally think that you did many of these people a very good service as well, because at least some of them now realize how the open internet works. Despite being very upset with the fact that this is how the open internet works, at least some of them are now realizing this.</p><p>[00:53:23] <strong>Alex Volkov:</strong> I commend you on the bravery of standing against this absolute silliness and not backing down. And [00:53:30] yeah, go ahead.</p><p>[00:53:31] <strong>Alpin Dale:</strong> Happy to serve. Yeah, another small thing I wanted to add was, I've received a lot of threats about me getting reported to the EU, but what I find really ironic is that earlier this year, the EU funded research collecting over 200 million BlueSky posts with a greater level of detail.</p><p>[00:53:50] <strong>Alpin Dale:</strong> So clearly the EU is fine with this, so I don't know what's the problem here, once again.</p><p>[00:53:58] <strong>Alex Volkov:</strong> Yeah, I saw this. Yeah, there's a way [00:54:00] bigger thing. 
The last thing I saw about this, and then maybe we'll open up for folks, and then I would love to chat with my friend Thomas, for whom it's late, and I invited him here, and I want to be very mindful of his time as well, so thank you, Thomas, for being patient.</p><p>[00:54:12] <strong>Alex Volkov:</strong> The last thing I'll say about this is that this sucks for open source, for the very reason that if you're open and public and good-hearted about this, hey folks, here's the data in the open, you can look at this data and you can ask for your s**t to be removed, you get an angry mob of people threatening [00:54:30] death against you and going after your employers, literally people asking, like, was Daniel fired?</p><p>[00:54:34] <strong>Alex Volkov:</strong> What the f**k? Meanwhile, this is an open firehose, and all of the companies in the world probably already have all this data. I'm pretty sure OpenAI has already been training on BlueSky. Like, why wouldn't they? It's open. Literally, if you want to train, and Thomas, maybe here is a little entry to what we're going to talk about.</p><p>[00:54:50] <strong>Alex Volkov:</strong> If you want to train a toxicity thing, there is now a very good place to go to and look at a toxicity score, or I can show you where you can go [00:55:00] to train a toxicity scorer. Like, why wouldn't you go and collect this data? It's free; it literally lies on the internet.</p><p>[00:55:05] <strong>Alex Volkov:</strong> Nothing in the TOS, like Alpin said. Even I went to the TOS of BlueSky. Literally it says over there, we do not control how other people use your data. That's literally what it says in the TOS. So yeah, I'm very frustrated by this. I want to speak out against this absolutely ridiculous behavior.</p><p>[00:55:22] <strong>Alex Volkov:</strong> So, I don't think that how the people reacted on the platform speaks against the platform itself. 
I do think [00:55:30] that the way the moderators acted against Alpin's account, the permanent removal of the account, speaks completely against the platform.</p><p>[00:55:38] <strong>Alex Volkov:</strong> This is stupid, and we should speak against this on the platform itself, if we think that this is a place for the community. That's where I stand. And I wanted to share this publicly. Super brief comments, folks, and then we'll move on to this week's buzz.</p><p>[00:55:49] <strong>Wolfram Ravenwolf:</strong> There was a link in his message from the moderators where he can object and get a review, an appeal, yeah.</p><p>[00:55:58] <strong>Wolfram Ravenwolf:</strong> So I hope that, I hope [00:56:00] he gets the appeal through. That is important. Yeah.</p><p>[00:56:03] <strong>Alex Volkov:</strong> If you will, please email them with an appeal and tell them about the multiple death threats that you received, and the fact that you did not mean to troll.</p><p>[00:56:12] <strong>Wolfram Ravenwolf:</strong> I reported every one of those messages, by the way, and anyone who does the same, that's probably a good thing.</p><p>[00:56:18] <strong>Alex Volkov:</strong> Nisten, I know you have thoughts on this. I would love to hear.</p><p>[00:56:22] <strong>Nisten Tahiraj:</strong> We need to better educate people to not go after the ones on their side. A lot of the open source devs do this stuff [00:56:30] because they want everyone to have, like, healthcare robots that no single corporation owns. They make this data public because people want to democratize the technology for everyone.</p><p>[00:56:41] <strong>Nisten Tahiraj:</strong> So it doesn't become, like, authoritarian, with a single source of control. And to see that they prioritize just people's anger and feelings versus being objective [00:57:00] about it. Whereas, in this case, the public forum dataset is public domain on purpose. 
And this is what drew people to the community in the first place, because they felt like Twitter was becoming too political, single-sided.</p><p>[00:57:12] <strong>Nisten Tahiraj:</strong> And we didn't like that. And a lot of people moved over because they saw BlueSky as a much better, democratized alternative to all of this. And so that's really disappointing, because these are the people on your side, and now the two [00:57:30] nicest, most contributing open source devs that we know are more hated than someone like Andrew Tate.</p><p>[00:57:37] <strong>Nisten Tahiraj:</strong> That just makes no sense at all. Out of the five most blocked accounts, two of them are like the nicest people we know. So something is pretty, pretty off. And I'm also worried that in the AI community, we are in a bit of a bubble and not quite aware of what people on our side are being communicated,</p><p>[00:57:58] <strong>Nisten Tahiraj:</strong> are being shown, about how this [00:58:00] stuff works, how open source works. Because I'm pretty sure from their point of view, they're like, oh, here's another company that just took all of our data and is just gonna train this porn bot with it, and there's nothing we can do about it. But it's not like that.</p><p>[00:58:13] <strong>Nisten Tahiraj:</strong> Not a single company can own this data. It is public domain. We can't sue anyone else over the data. It's public domain in a public forum. You're supposed to have civil discourse, because then the AI can also have civil [00:58:30] discourse and be reasonable and be, like, aligned to humanity. So now you have a bunch of people just giving death threats, and they're okay because they're just angry?</p><p>[00:58:40] <strong>Nisten Tahiraj:</strong> So you can tell someone to go kill themselves just because you're angry? And, yeah, that's not good. Like, they're just not good. 
You should probably, yeah, anyway, so there is something for us to do as well: we need to communicate better what open source does versus having a single company</p><p>[00:58:58] <strong>Nisten Tahiraj:</strong> own all that data and [00:59:00] have it as their property. Because I feel like most of the general public doesn't really understand this.</p><p>[00:59:06] <strong>Nisten Tahiraj:</strong> Yeah, that's it. Okay, just really quickly, sorry, I went on too long, but after going through war in the Balkans as a kid, I didn't think people would be getting death threats over an open source dataset.</p><p>[00:59:17] <strong>Nisten Tahiraj:</strong> This is just completely beyond. It's absolutely unhinged. Yeah, this is just completely off.</p><p>[00:59:23] <strong>Wolfram Ravenwolf:</strong> Unhinged. Just one thing: those people even think that now the thing is over, so the dataset has been [00:59:30] removed, okay, it's done. But you can get a new one anytime. The platform hasn't changed. They have to realize that.</p><p>[00:59:37] <strong>Alpin Dale:</strong> Funny you mention that, because users started blocking me for the explicit reason of stopping me from scraping their posts, as if I need my account to do that.</p><p>[00:59:49] <strong>Alex Volkov:</strong> Yeah, I think that there's a lot of misunderstanding of what's actually happening.</p><p>[00:59:54] <strong>Alex Volkov:</strong> Which is fine, I completely empathize with people's misunderstanding of [01:00:00] technology, and thus fear. I get the visceral reaction. I don't like multiple other things about this: I don't like the absolute horror mob and the death threats, and I don't like the platform reacting as it did and blocking completely. Those things don't make sense.</p><p>[01:00:14] Hey, this is Alex from the editing studio. 
Super quick: about two hours after we recorded the show, Alpin posted that the moderation team at BlueSky emailed him and his account was in fact reinstated. He didn't ask them to. [01:00:30] They revisited their decision on their own.</p><p>[01:00:32] So either it was the public outcry from some individuals on the platform, or, hopefully, they listened to our show. I doubt they did. Um, but they reversed their decision. So I just wanted to set the record straight about that. He's back on the platform. Anyway, back to the show.</p><p>[01:00:48] <strong>Alex Volkov:</strong> Alright folks, unfortunately though, we do have to move on to better things, and I'll give my other co-hosts a little five to seven minutes off to go take a break. Meanwhile, we're going to discuss [01:01:00] this week's buzz.</p><p>[01:01:00] This Week's Buzz: Weights & Biases Updates</p><p>[01:01:00] <strong>Alex Volkov:</strong> Welcome to this week's buzz, a category at ThursdAI where I talk about everything that I've learned or everything new that happened at Weights & Biases this week. And this week, I have a colleague of mine, Thomas Capelle, [01:01:30] from the AI team at Weights & Biases. We're now the AI team, this is new for us. Thomas, do you want to introduce yourself super briefly? For folks who've been here before, maybe one more introduction for folks who don't know who you are.</p><p>[01:01:43] <strong>Thomas Capelle:</strong> Yeah, I'm Thomas. I work with Alex. I'm on the Applied AI team at Weights & Biases. I train models, I play with models and APIs, and I try to make my way into this LLM landscape that is becoming more and more complex, trying to avoid [01:02:00] getting roasted on the internet. And yeah, trying to learn from everyone. Thank you for the meeting.</p><p>[01:02:06] <strong>Alex Volkov:</strong> So you're going by CapeTorch, I'm going to add this on X as well. I don't know what you're going by on BlueSky, same, CapeTorch. 
I invited you here, and I think let's do the connection from the previous topic as well: a lot of toxicity we talked about just now, a lot of toxic comments as well.</p><p>[01:02:23] <strong>Alex Volkov:</strong> And we both work at Weights & Biases on Weave. Weave is our LLM observability tool. [01:02:30] I've shown off Weave multiple times on ThursdAI, but I would be remiss if I didn't always remind people, because we have a bunch of new folks who are listening, what Weave is. Weave is an LLM observability tool. So if you're building, as a developer, anything with LLMs in production, you need to know what's going on: what your users are asking your LLM, and what your LLM gives as responses. Because, imagine that your users are, let's say, copy-pasting whatever comments people just gave [01:03:00] Daniel and Alpin, and pasting them in to do categorization, for example. Some of these very bad things that we just talked about are getting pasted into the LLM, and some of the LLM responses are maybe even worse, right?</p><p>[01:03:13] <strong>Alex Volkov:</strong> So maybe your application doesn't handle this. Maybe your application responds even worse, and you want to know about this. And the way to see those: some developers just look at logs. We have a tool that is way nicer. And this is just some of the things it does. But this [01:03:30] tool is called Weave.</p><p>[01:03:30] <strong>Alex Volkov:</strong> It traces everything that your application gets as an input from users, and also the outputs. But that's not all it does. It also allows you to do evaluations. And recently Thomas has been working on multiple things, specifically around scoring. Thomas, you want to maybe give us a little bit of...?</p><p>[01:03:47] <strong>Alex Volkov:</strong> Yeah, I think you,</p><p>[01:03:48] <strong>Thomas Capelle:</strong> you described it pretty well. 
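The tracing Alex describes boils down to wrapping each LLM-facing function so every input and output gets recorded. Here is a minimal stand-in for that pattern in plain Python; the decorator and trace store are invented for illustration (Weave's real API does this with its own `weave.init()` / `@weave.op` calls, which this sketch does not use):

```python
import functools
from datetime import datetime, timezone

# In-memory trace store; a real observability tool ships these records
# to a backend where you can browse and filter them.
TRACES = []

def traced(fn):
    """Record every call's inputs and output, tracing-decorator style."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {
            "op": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "started_at": datetime.now(timezone.utc).isoformat(),
        }
        try:
            record["output"] = fn(*args, **kwargs)
            return record["output"]
        except Exception as exc:  # errors are traced too, then re-raised
            record["error"] = repr(exc)
            raise
        finally:
            TRACES.append(record)
    return wrapper

@traced
def answer_user(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {prompt}"

answer_user("hello")
print(TRACES[0]["op"], TRACES[0]["output"])
```

The point of the pattern is that the application code stays unchanged except for one decorator line per function, which is why Alex can say "you basically add these lines" to an existing Python or JavaScript app.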
Yeah, as I know, you have shown Weave, and the product we have been working on for a while, multiple times here. But I would say its core feature is [01:04:00] actually building apps on top of LLMs and having observability. In standard code, we have unit tests, and for LLM-based applications, we need evaluations, actual evaluations on data we have curated.</p><p>[01:04:13] <strong>Thomas Capelle:</strong> And we have been doing this in the ML world for a while, but we are merging with the software engineers that maybe don't know how to integrate this randomness from the LLMs into their applications. Yeah, you need to actually compute evaluations. And that means gathering [01:04:30] data, still labeling a lot of stuff manually to have a high-quality signal.</p><p>[01:04:35] <strong>Thomas Capelle:</strong> And then, yeah, iterating on your prompts and your application that's making API calls, with scores, with metrics that give you confidence that you are not, like, screwing up. And as you said, I've been working recently on adding, we added a bunch of scorers, default scorers. 
In Weave, a couple, yeah, like a month ago with Morgan, we spent like a week building those.</p><p>[01:04:58] <strong>Thomas Capelle:</strong> And recently we have been, [01:05:00] yeah, looking at stuff like toxicity and hallucination, and yeah, context and bias detection. There are multiple of them that are LLM-powered, like the ones you are showing on the screen right now: you have an LLM that is actually prompted in a certain way, and you maybe build a system that requires a couple of LLM prompts with structured output to actually get the scores you were expecting. And then this thing should be able to give you, yeah, a good value for the [01:05:30] scoring: if it's hallucinating, if it's toxic. Actually, the model providers like OpenAI and Mistral and Anthropic, I think, have an API exactly for moderation.</p><p>[01:05:41] <strong>Thomas Capelle:</strong> So yeah, you can use that too, and they are actually pretty good and fast and pretty cheap compared to the completion APIs. And now, what I've been doing this week and the last couple of weeks is trying to build really high-quality, small, non-LLM-powered scorers. So say, for example, that you want to create a toxicity [01:06:00] detection system.</p><p>[01:06:00] <strong>Thomas Capelle:</strong> Yeah, what can you do? You could find a small model that's not an LLM, or was an LLM a couple of years ago; now, like BERT, we don't consider BERT an LLM.</p><p>[01:06:09] <strong>Alex Volkov:</strong> Yeah.</p><p>[01:06:10] <strong>Thomas Capelle:</strong> Yeah. I've been fine-tuning BERT on the task and checking these new Hugging Face SmolLM2 models, trying to adapt them to the task.</p><p>[01:06:18] <strong>Thomas Capelle:</strong> Yeah. 
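Thomas's point is that a scorer doesn't have to be an LLM: a small classifier can flag toxic text cheaply and fast. As a toy stand-in for the fine-tuned BERT/SmolLM2 classifiers he describes, here is the shape such a scorer takes, with a tiny made-up lexicon model in place of the real trained classifier (the word list and threshold are invented for illustration):

```python
# Toy toxicity scorer. A real version would call a small fine-tuned
# classifier; this lexicon lookup just shows the input/output contract.
TOXIC_LEXICON = {"idiot", "stupid", "kill", "trash"}

def toxicity_scorer(text: str) -> dict:
    # Crude whitespace tokenization with punctuation stripped.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    hits = [t for t in tokens if t in TOXIC_LEXICON]
    score = len(hits) / max(len(tokens), 1)  # fraction of flagged tokens
    return {"flagged": score > 0.1, "score": round(score, 3), "matches": hits}

print(toxicity_scorer("You are a stupid idiot"))
print(toxicity_scorer("Thanks for the dataset, great work"))
```

Whatever model sits inside, the contract stays the same: text in, a small dict of scores out, which is what lets the same scorer run over every logged trace.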
Yeah, good challenges, good engineering questions. There's plenty of high-quality datasets on Hugging Face that people have been creating from multiple places, from Reddit, and [01:06:30] these have been serving us to actually build these high-quality classifiers that are capable of flagging the content that we're interested in.</p><p>[01:06:40] <strong>Alex Volkov:</strong> So here's what I'll say for folks, just to highlight what we're talking about. Weave itself is a toolkit that you can use for both these things. You can use it for logging and tracing your application, which is what it looks like right now. You basically add these lines to your either Python or JavaScript application, and we will help you track [01:07:00] everything your users do in production.</p><p>[01:07:01] <strong>Alex Volkov:</strong> Separately from this, you want to continuously evaluate your application on a different set of metrics, for example, or score it on a different set of metrics, to know how your LLM or your prompts are doing, right? So you guys know that, for example, before on the show we talked about, hey, here's this new model, QwQ, for example.</p><p>[01:07:20] <strong>Alex Volkov:</strong> And you know that Wolfram, for example, tested it on MMLU-Pro. Those are generic evaluations. MMLU-Pro, those are evaluations that somebody built for [01:07:30] something big; there's a set of questions that somebody built for something big. Specific scorers for your type of application are something that you build for your type of application.</p><p>[01:07:38] <strong>Alex Volkov:</strong> And then people asked us, as Weights & Biases: hey, okay, you give us a generic toolkit, an unopinionated toolkit, but can you give us some opinions? And basically this is what Weave Scorers is. 
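The distinction Alex draws between a generic benchmark like MMLU-Pro and application-specific scorers both reduce to the same evaluation loop: run your model over a curated dataset and apply scorer functions to each output, then aggregate. A minimal sketch of that loop, with a toy dataset and a toy model standing in for the real thing:

```python
# Minimal evaluation loop: model over curated examples, scorers per row,
# aggregated at the end. Dataset, model, and scorer are all toy stand-ins.
dataset = [
    {"question": "2+2", "expected": "4"},
    {"question": "capital of France", "expected": "Paris"},
]

def model(question: str) -> str:
    # Stand-in for an LLM call.
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "?")

def exact_match(expected: str, output: str) -> bool:
    return expected.strip().lower() == output.strip().lower()

def evaluate(model, dataset, scorers):
    results = {name: [] for name in scorers}
    for row in dataset:
        output = model(row["question"])
        for name, scorer in scorers.items():
            results[name].append(scorer(row["expected"], output))
    # Aggregate each scorer into a mean; for exact_match this is accuracy.
    return {name: sum(vals) / len(vals) for name, vals in results.items()}

print(evaluate(model, dataset, {"accuracy": exact_match}))
```

Swapping the dataset for MMLU-Pro questions gives you the generic benchmark; swapping in your own curated rows and custom scorers gives you the application-specific evaluation the hosts are advocating.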
This is an additional package that you can install if you want to, like, additionally, right?</p><p>[01:07:55] <strong>Alex Volkov:</strong> Thomas, help me out here, but you can add this. The ones we're</p><p>[01:07:58] <strong>Thomas Capelle:</strong> building right now, they're not yet [01:08:00] there. They will be, probably, in the near future. Yeah, we need to test them correctly. And we're an experiment tracking company from the beginning: we're going to want to share full reproducibility.</p><p>[01:08:10] <strong>Thomas Capelle:</strong> Like, this is the data, this is how we trained them, there's different versions, the scoring metrics we get, so you can have confidence that they work as expected.</p><p>[01:08:18] <strong>Alex Volkov:</strong> So this is to me very interesting, right? So I came in as, previously, a software developer, and now as an AI evangelist. I came in from this side, and I meet all these machine learning engineers, experiment tracking folks, who are like, okay, [01:08:30] now that we've built this LLM-based observability tool, many people are asking us to do what Weights & Biases does on the model side, on the Weights & Biases side.</p><p>[01:08:37] <strong>Alex Volkov:</strong> Hey, use everything from your immense knowledge of tracking and doing experimentation to bring this over to the LLM side. Okay, now that companies are tracking all the data, how do you actually do experimentation on the prompt side? Thomas, the last thing I'll ask you here before I let you go is briefly about guardrails specifically.</p><p>[01:08:56] <strong>Alex Volkov:</strong> So there's this concept that we're going to talk about, and we're going to keep talking about this, [01:09:00] called guardrails. So we're talking about scorers. Scorers are basically the way to check your application. 
Just a model.</p><p>[01:09:05] Understanding Scoring Models</p><p>[01:09:05] <strong>Alex Volkov:</strong> Like</p><p>[01:09:06] <strong>Thomas Capelle:</strong> I would define it like this: a scorer is just a model. It takes an input, produces an output.</p><p>[01:09:11] <strong>Thomas Capelle:</strong> It could be simple, it could be complicated. The simplest score could be accuracy: if the prediction is equal to the label. A complex score could be an LLM-powered score that checks that the response is not [01:09:30] hallucinated, or is factually consistent with the original context you retrieved in your RAG application.</p><p>[01:09:33] <strong>Alex Volkov:</strong> So, like, HallucinationFreeScorer, for example, is one scorer, for folks who are listening: whether or not the response that your RAG application returned has hallucinations in it.</p><p>[01:09:44] <strong>Thomas Capelle:</strong> Yeah, it's very detailed. And you will probably need to refine all of this for your specific application, because everyone has slightly different definitions and slightly different needs for their application.</p><p>[01:09:55] <strong>Thomas Capelle:</strong> So yeah, you may need to tune everything, but this is a good starting point.</p><p>[01:09:59] Guardrails in LLM Development</p><p>[01:09:59] <strong>Thomas Capelle:</strong> [01:10:00] So yeah, I find it very interesting that you mentioned guardrails. I would say a guardrail is also a model that predicts, but it needs to be really fast, and it needs to take actions, maybe change the output. None of these scorers change your output: they will compute a score, but they will not change the output. If you have a PII guardrail, it should, I don't know, redact stuff that [01:10:30] shouldn't pass. So it should change the output, like the payload you are getting from the API. 
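Thomas's distinction is that a scorer runs offline and only computes a value, while a guardrail runs online in the request path and may rewrite the payload before it reaches the user. A sketch of that contrast, using email redaction as a stand-in for fuller PII detection (the regex and function names are illustrative, not Weave's API):

```python
import re

# Illustrative email pattern standing in for broader PII detection.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pii_guardrail(payload: str) -> dict:
    """Online: runs inline and may rewrite the payload before delivery."""
    redacted = EMAIL_RE.sub("[REDACTED EMAIL]", payload)
    return {"payload": redacted, "modified": redacted != payload}

def pii_scorer(payload: str) -> dict:
    """Offline: same detection, but the payload itself is left untouched."""
    return {"pii_count": len(EMAIL_RE.findall(payload))}

response = "Write to jane.doe@example.com for access."
print(pii_guardrail(response)["payload"])  # email replaced by a placeholder
print(pii_scorer(response))
```

Note the two functions share the detection logic; the only difference is whether the result mutates the payload (guardrail, online) or just reports on it (scorer, offline), which is exactly the boundary drawn next.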
So, like, guardrails are more online, and these scorers are more, like, offline.</p><p>[01:10:41] <strong>Alex Volkov:</strong> That's a good boundary to draw. And I think we'll end here, but this is basically a preview of what's coming, folks. I will tell you about guardrails specifically.</p><p>[01:10:48] Guardrails in Production</p><p>[01:10:48] <strong>Alex Volkov:</strong> It's something we're getting into, and I'm going to keep talking about guardrails specifically, because I think that this is a very important piece of developing with LLMs in production.</p><p>[01:10:57] <strong>Alex Volkov:</strong> How are you making sure that the [01:11:00] model that you have online is also behaving within a set of boundaries that you set for your LLM? Obviously, we know that the big companies have their guardrails in place. We know because, for example, when you talk with advanced voice mode and you ask it to sing, it doesn't sing.</p><p>[01:11:14] <strong>Alex Volkov:</strong> There's a boundary that they set in place. When you develop with your LLMs in production, one way to build your guardrails in is by prompting, for example. There are other ways to do them, and we are building some of those ways, or we're building tools for you to build some of those ways. [01:11:30] And, like Thomas said, some of those guardrails change the output, or build in ways to prevent some of the output from happening, like PII, for example, or there's toxicity detection and other stuff like this. So we will be talking more about guardrails. Thomas, with this, I want to thank you for coming out to the show today and helping us with scorers and discussing Weave as well.</p><p>[01:11:50] <strong>Alex Volkov:</strong> And I appreciate the time here, folks. You can find Thomas on X and on BlueSky under CapeTorch. Thomas is a machine learning engineer and [01:12:00] AI engineer as well. He puts out a lot of great content, Thomas. 
Thank you for coming on. I appreciate you. He also does amazing cooking as well.</p><p>[01:12:06] <strong>Alex Volkov:</strong> Follow him for some amazing gnocchi as well. Thanks, Thomas. Thomas, thank you. Folks, this has been this week's buzz, and now we're back. Good job being here. See you guys. See you, man. And now we're back to big companies and APIs.[01:12:30]</p><p>[01:12:33] <strong>Alex Volkov:</strong> All right. All right. All right. We are back from this week's buzz, folks. Hopefully you learned a little bit about scorers and guardrails. We're going to keep talking about guardrails, but now we have to move on, because we have a bunch of stuff to talk about, specifically around big companies and APIs, which had a bunch of stuff this week as well.</p><p>[01:12:51] OpenAI Leak Incident</p><p>[01:12:51] <strong>Alex Volkov:</strong> I wanna talk about the leak. You guys wanna talk about the leak this week? OpenAI had a big, oh my God, oops. Something big [01:13:00] happened. But nothing actually big happened; but look, to some extent, this was a little bit big. At some point this week, a frustrated participant in the OpenAI, how should I say, test program for Sora decided to quote-unquote leak Sora and posted a Hugging Face space where you could go and say, hey, I want this and this, and you would see a Sora video generated. And yeah, we can actually show some videos. I think this is not against any [01:13:30] TOS, I believe. And yeah, this wasn't actually a leak. What do you guys think? Did you happen to participate in the bonanza of Sora videos, Wolfram or Yam? 
Did you see this?</p><p>[01:13:40] <strong>Wolfram Ravenwolf:</strong> I saw it, but I didn't try to go to the link.</p><p>[01:13:43] <strong>Alex Volkov:</strong> No.</p><p>[01:13:44] Sora Video Leak Reactions</p><p>[01:13:44] <strong>Alex Volkov:</strong> So basically, some very frustrated person from the creative minds behind Sora, behind the scenes, decided to, like, leak Sora. The leak wasn't actually a model leak like we would consider a model [01:14:00] leak.</p><p>[01:14:00] <strong>Alex Volkov:</strong> The leak was basically a Hugging Face application making requests to a Sora API, with just the keys hidden behind the Hugging Face app. We're showing some of the videos; I'm going to also add this to the top of the space for you guys as well. The videos look pretty good, but many of the folks who commented basically said that, compared to when Sora was just announced, when all of [01:14:30] us were completely mind-blown, now the videos, when you compare them to something like Kling or some of the Runway videos, are pretty much on the same level.</p><p>[01:14:41] <strong>Alex Volkov:</strong> And they look good. They still look very good. Look at this animation, for example. It still looks very good. And apparently there's a version of Sora called Sora Turbo, so these videos are fairly quick. But folks are not as mind-blown [01:15:00] as before. Yeah, some of the physics looks a little bit better than Kling, etc., but it feels like we've moved on. And this is something that I want to talk to you guys about super quick.</p><p>[01:15:09] <strong>Alex Volkov:</strong> We're following this every week, right? So we get adapted every week. Like, the reasoning model o1 blew us away, and then R1 came out, and now we run this on open models like QwQ. So we're used to getting adapted to this. The video world caught up to Sora super quick.</p><p>[01:15:24] <strong>Alex Volkov:</strong> Now we can run these models. 
There's a new open source one, like, every week. These videos [01:15:30] don't blow us away as they used to anymore. And why isn't OpenAI releasing this at this point is unclear, because before you could say, elections, you could put Trump and Kamala Harris in there; now, what's the reason for not releasing this and not giving us this thing?</p><p>[01:15:47] <strong>Alex Volkov:</strong> Anyway, yeah, this video is pretty cool. There's one video with a zoom in and somebody eating a burger. Yeah, leak, not leak, I don't know. But thoughts about the Sora leak? What do you guys think about the videos and the non-releasing thing? Folks, I want to ask, Nisten, [01:16:00] what do you think about those videos?</p><p>[01:16:01] <strong>Alex Volkov:</strong> Did you have a chance to look at them?</p><p>[01:16:03] <strong>Nisten Tahiraj:</strong> I was going to say, by the way, I was going to say the exact same thing you did, that it's just been so long now. What, a couple of months since they announced it?</p><p>[01:16:14] <strong>Alex Volkov:</strong> I think it's more than a couple of months, I think half a year, maybe, yeah.</p><p>[01:16:16] <strong>Nisten Tahiraj:</strong> Yeah, it's over half a year, and so much happened that we're no longer impressed.</p><p>[01:16:22] <strong>Nisten Tahiraj:</strong> And I'm just trying to be mindful of that, that things are still moving fast. And they haven't stopped [01:16:30] moving. We've seen a whole bunch of models start to get close to this now. It's still better, I would say it's still better than most of what's come out in the last six months. 
But, yeah, we're getting pretty close.</p><p>[01:16:41] <strong>Nisten Tahiraj:</strong> I think they haven't released it mainly because of weaponized litigation. That's the main thing</p><p>[01:16:45] <strong>Alpin Dale:</strong> Yeah.</p><p>[01:16:45] <strong>Nisten Tahiraj:</strong> holding them back. And, uh, yeah, companies in other countries don't have that problem as much, so they were able to advance more, while still being respectful to the brands and [01:17:00] stuff. But, yeah, I think the main reason is people are just going to try and nitpick any kind of attack vector to sue them</p><p>[01:17:08] <strong>Nisten Tahiraj:</strong> for it. So that's probably why.</p><p>[01:17:10] <strong>Alex Volkov:</strong> Yeah, everything OpenAI does will get attacked. That I fully agree with you on. Yeah, speaking of, let's see, do we have anything else from OpenAI? I don't believe so. Yeah, the one other thing that I wanted to show super quick is that the new ChatGPT app, I'm gonna show this super quick on the thing, is also now [01:17:30] supporting Cursor.</p><p>[01:17:31] <strong>Alex Volkov:</strong> So now the ChatGPT app is supporting the Cursor app, so you can ask about what I'm working on in Cursor, and if you hover this, you can actually see all of my files, including .env; you can actually see my secrets. But you can ask it about the open files. And why would I, if I have Cursor?</p><p>[01:17:49] <strong>Alex Volkov:</strong> That's the question, right? Cursor supports o1, but I have unlimited o1 queries on ChatGPT, whereas I have fairly limited queries for o1 in Cursor. And generally, [01:18:00] that's been pretty good. That's been pretty cool. You can ask it about the stuff that you have open. 
There's a shortcut, I think it's Option Shift 1 on Windows, and you can enable this, and basically you then start chatting with the open interface in the window.</p><p>[01:18:13] <strong>Alex Volkov:</strong> We tested this a couple of weeks ago, if you guys remember, and I found it super fun. I don't know if you guys have used it since then, or for those who use the Mac version of ChatGPT. I find it really fun. So folks in the audience, if you're using the macOS app and you are connecting this to Cursor or to the terminal, for [01:18:30] example.</p><p>[01:18:30] <strong>Alex Volkov:</strong> Unfortunately, I use the Warp terminal and they still don't have Warp. They have iTerm here and other things. If you use PyCharm or other JetBrains IDEs, they also started supporting those. But I specifically use Cursor, and now there's support for Cursor, support for Windsurf, which is another thing that we didn't cover yet.</p><p>[01:18:46] <strong>Alex Volkov:</strong> And I heard amazing things. And hopefully over the Thanksgiving break, I will have a chance to use Windsurf. But yeah, this is from OpenAI, and we were waiting for some more news from OpenAI, but we didn't get any. So hopefully the folks at [01:19:00] OpenAI will get a Thanksgiving break.</p><p>[01:19:02] <strong>Alex Volkov:</strong> Just a small reminder. I looked a year ago, if you guys remember the Thanksgiving episode we had a year ago. We were discussing the Ctrl+Altman+Delete weekend where Sam Altman was fired and then rehired. That was the Thanksgiving episode of last year. You guys remember this? Last year we discussed how Sam Altman and Greg Brockman were shanked, and the coup from Ilya.</p><p>[01:19:26] <strong>Alex Volkov:</strong> You guys remember? It's been a year. It's been a year since then. This was the [01:19:30] Thanksgiving last year. And, yeah, it's been a year since then. Which, by the way, next week is the two year anniversary of ChatGPT as well. 
So we probably should prepare something for that. So that's it on the OpenAI news.</p><p>[01:19:43] <strong>Alex Volkov:</strong> Let's super quick talk about this. At some point, the sayings from Space Uncle need to be studied in an encyclopedia. Somebody tweeted, I don't understand how game developers and game journalists got so ideologically captured. [01:20:00] Elon Musk tweeted and said, Too many game studios are owned by massive corporations. xAI is going to start an AI game studio to make games great again. And I'm like, and please unmute if you're muted and laughing, because I want to hear, and I want the audience to hear, that both PicoCreator and Nisten are just laughing out loud at this. It's xAI with all of their, like, 200,000 H200s, like the best, the fastest ever growing massive [01:20:30] Memphis supercluster, and they're going to build games? Like, what are they really going to actually.</p><p>[01:20:34] <strong>Alex Volkov:</strong> Have a gaming studio in there? Like, we know Elon is, I don't know, the best Diablo game player in the world right now. I don't know how the f**k</p><p>[01:20:43] <strong>Nisten Tahiraj:</strong> he's, he is fourth or 20th or,</p><p>[01:20:45] <strong>Alex Volkov:</strong> yeah, he was 20th. I think at some point he got to number one recently, or something. We all know he's a gamer.</p><p>[01:20:51] <strong>Alex Volkov:</strong> Kudos. I really, I'm not making this up. I really have no idea how the f**k you can be like the best Diablo player in the world while doing all this other stuff [01:21:00] and. I get the sentiment of, okay, let's make games great. But turning an AI company into a games company, how, what?</p><p>[01:21:08] <strong>Alex Volkov:</strong> Ah, I just want to turn to this.</p><p>[01:21:12] <strong>Eugen Cheugh:</strong> What I love most. 
It's just a massive corporation, xAI, with billions of dollars of funding. It's going to be not a massive corporation?</p><p>[01:21:23] <strong>Alex Volkov:</strong> Yeah. This is not necessarily AI related, but we are expecting big things from xAI, specifically around Grok [01:21:30] 3.</p><p>[01:21:30] <strong>Alex Volkov:</strong> Hopefully December, that's the date that they've given us. They have a hundred thousand H100s churning away and building something. We know that this was announced. We know that Elon promises and doesn't deliver on time, but delivers at some point anyway. We know that they have very good folks behind the scenes.</p><p>[01:21:47] <strong>Alex Volkov:</strong> We know this, we've seen this before. We know that infrastructure is something they're building out. They're building out enterprise infrastructure for APIs. We've seen the xAI API layer building out. We've seen the kind of [01:22:00] xAI infrastructure, sorry, enterprise infrastructure for the building layer.</p><p>[01:22:03] <strong>Alex Volkov:</strong> We've seen all this getting prepared. Like we've talked about this, we're getting to the point where xAI is going to be another player, competing versus Google, OpenAI, Anthropic, et cetera. Grok 3 is going to be something significant to contend with. And like the amount of GPUs they have there.</p><p>[01:22:22] <strong>Alex Volkov:</strong> Is this just a sidetrack? This is basically my question.</p><p>[01:22:25] <strong>Nisten Tahiraj:</strong> So, Uncle Elon tends to be very [01:22:30] impulsive, as we've seen, so if he spends a lot of time on something, he's gonna start getting obsessed with it. So there's that. In order to have a gaming service, you will need a lot of GPUs, and I'm pretty sure at this point, if they want to do cloud gaming or streaming, they probably have more GPUs than PlayStation.</p><p>[01:22:49] <strong>Nisten Tahiraj:</strong> They might actually just have more right now. 
They're like, we can probably support that. And so much for the Department of Government Efficiency, now we're all [01:23:00] just going to be streaming games.</p><p>[01:23:05] <strong>Nisten Tahiraj:</strong> But there's also another lining to this. For a while, there was an article, about 10 years ago, that E3, I don't think that's a thing anymore, but the E3 gaming conference, had a SpaceX booth over a decade ago, and SpaceX was actively recruiting at E3. To quote, it was programmers of physics engines, and the [01:23:30] rumors were that they were going after the ones who made Havok, like the one in Portal, and the ones that worked on the Unreal Tournament physics engine.</p><p>[01:23:40] <strong>Nisten Tahiraj:</strong> And this was over 10 years ago, and those people, those programmers, were recruited by SpaceX. So when you see the Falcon Heavy 2, 3, 4 rockets just go dance in midair and land like they're in a video game, it's because the people that made the simulation very likely worked on game engines.</p><p>[01:23:58] <strong>Nisten Tahiraj:</strong> So it might be [01:24:00] a hiring angle from him, or it might just be Uncle Elon playing a lot of games and he just wants to, you know. There is an angle</p><p>[01:24:07] <strong>Alex Volkov:</strong> for gaming as a playground for training, like AGI, whatever. Like OpenAI obviously trained robots in this area. We saw many papers on agents running wild in game-constrained environments.</p><p>[01:24:19] <strong>Alex Volkov:</strong> There could be an angle there for sure. I just, this doesn't feel like, this feels like an impulsive, hey. 
make f*****g games great again.</p><p>[01:24:26] Anthropic's Model Context Protocol</p><p>[01:24:26] <strong>Alex Volkov:</strong> Alright, moving on, unless we have another comment here, moving on to [01:24:30] I really wanted to discuss, super briefly, the Model Context Protocol from Anthropic.</p><p>[01:24:36] <strong>Alex Volkov:</strong> Because this kind of blew up, but it's not ready yet. I saw a comment from Simon Willison, you guys know Simon Willison, a friend of the pod, he's been here multiple times. Basically he covered this. Super quick, Anthropic released this new protocol, which they hope to standardize, and by standardize, they mean, hey, let's gather around this.</p><p>[01:24:53] <strong>Alex Volkov:</strong> Okay. So let's talk about a standard in the industry right now, the OpenAI SDK for Python. That's a [01:25:00] standard way to interact with LLMs. Pretty much everybody supports this, including Gemini. I think the only one who doesn't support this is Anthropic, actually. So in Python, if you want to interact with any LLM, literally any LLM provider, including OpenRouter, Google, OpenAI themselves, Together, all of those, you can replace one line of code in the OpenAI Python SDK, where you just put a different URL in there, and then this is the standard way to talk to [01:25:30] LLMs.</p><p>[01:25:30] <strong>Alex Volkov:</strong> I think for TypeScript, JavaScript, it's pretty much the same. So it looks like Anthropic is trying to do something like this, to standardize around how LLMs connect with other applications. So right now, just a minute ago, I showed you how ChatGPT is connecting to, like, VS Code or something.</p><p>[01:25:49] <strong>Alex Volkov:</strong> They built those integrations themselves. So you would install a specific extension in VS Code, etc. 
And that extension that they've built [01:26:00] talks to the ChatGPT app on macOS that they've built, and they build this connection for you. This is not what Anthropic wants to do. Anthropic wants to create a protocol that developers, other developers, can build on their own, to allow the LLM to talk to any application. And you as a developer, I as a developer, other developers, can build those communication layers, and then whatever LLM, in this case the Anthropic Claude desktop app, could be the ChatGPT app, could be the [01:26:30] Gemini app, et cetera, could talk to other applications.</p><p>[01:26:32] <strong>Alex Volkov:</strong> What are those other applications? Anything. Anything on your desktop, anything at all. So they built this kind of first standard, communication via JSON-RPC. And I think they're building other ways, and other servers. I think this is a way to summarize this, basically.</p><p>[01:26:50] <strong>Alex Volkov:</strong> This is an open preview. Nisten, do you want to take another crack at trying to recap this? Or Yam or Wolfram, you guys want to? You want to give me your thoughts on this super quick? As far as I understand from [01:27:00] Simon, this is still rough and still in flux.</p><p>[01:27:03] <strong>Nisten Tahiraj:</strong> I think this might end up being a much bigger deal than we first expect, because it is an interoperability layer, and as a developer, you will have to learn this.</p><p>[01:27:15] <strong>Nisten Tahiraj:</strong> It is annoying at the moment that, while proposing a standard, Anthropic is not showing willingness to abide by one which most people chose, and even Google was forced to support the OpenAI standard. 
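As an aside for readers: the one-line base URL swap Alex describes can be sketched in Python like this. The provider URLs and the helper function below are illustrative assumptions for the sketch, not a verified or exhaustive list:

```python
# Sketch of the "one line change" pattern described above: the OpenAI Python SDK
# can be pointed at other providers by swapping base_url. The URLs here are
# illustrative assumptions, not a guaranteed-current list.

PROVIDER_BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "together": "https://api.together.xyz/v1",
}

def client_kwargs(provider: str, api_key: str) -> dict:
    """Build the kwargs you would pass to openai.OpenAI(); only base_url changes."""
    return {"api_key": api_key, "base_url": PROVIDER_BASE_URLS[provider]}

# With the official SDK installed, usage would look roughly like:
#   from openai import OpenAI
#   client = OpenAI(**client_kwargs("openrouter", "sk-..."))
#   client.chat.completions.create(model="...", messages=[...])
```

The point of the pattern is that everything except `base_url` (and the model name) stays identical across providers.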
If you [01:27:30] want people to come to your standard, to abide by your standard, you also have to show willingness to abide by others'.</p><p>[01:27:36] <strong>Nisten Tahiraj:</strong> That's not going to work here until Anthropic just supports a plug-and-play OpenAI API, so I can just put their models in. But that aside, the criticism aside, this is pretty, pretty important. So I've been doing some of this stuff, just trying to do it with basic JSON. So I think it's very good.</p><p>[01:27:55] <strong>Nisten Tahiraj:</strong> And yeah, it's pretty hard to know, am I on Mac? Am I on Linux? Am I on a phone? [01:28:00] What's the LLM going to talk to? What does this app even want me to do? Do I have to emulate this on the screen and then click on it? Can't it just give me a JSON so that I can click on it, so it's a lot easier for me?</p><p>[01:28:11] <strong>Nisten Tahiraj:</strong> And this will also apply to websites and web apps, too. You offer some kind of a JSON-RPC. An RPC, a remote procedure call, is just like an API for people. It's something you query, like you write a curl to this [01:28:30] IP, and here's my API key, and give me, or here I'm going to give you this stuff and give me this stuff.</p><p>[01:28:37] <strong>Nisten Tahiraj:</strong> From the database or whatever. So this is actually extremely important, because you can apply it to web apps as well. And it's a way to manage multiple sessions. So I think it's a pretty big deal, even though I'm annoyed at Anthropic for this, yeah. I think this is gonna become much, much more important, because it saves a lot of bandwidth.[01:29:00]</p><p>[01:29:00] <strong>Nisten Tahiraj:</strong> Instead of you having to run a visual language model to show the whole screen, to run it on an emulator, to have to click on it and move around. And it's so compute intensive. 
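For readers, the "just give me a JSON" idea Nisten describes maps to JSON-RPC 2.0 messages, the wire format MCP builds on. A minimal sketch; the method name and tool name here are illustrative, not taken from the actual MCP spec:

```python
import json

# Minimal JSON-RPC 2.0 request/response shapes, the wire format MCP builds on.
# The method and params below are illustrative, not actual MCP identifiers.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_table", "arguments": {"query": "SELECT 1"}},
}

# Serialize for the wire, then decode as a server would.
wire = json.dumps(request)
decoded = json.loads(wire)

# A matching response echoes the request id, which is what lets one
# connection carry multiple in-flight calls (the "multiple sessions" point).
response = {"jsonrpc": "2.0", "id": decoded["id"], "result": {"rows": [[1]]}}
```

Compared to screenshot-driven control, an exchange like this is a few dozen tokens, which is the bandwidth and compute saving being pointed at.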
It's, can you just gimme like a JSON API, so I can just, like,</p><p>[01:29:13] <strong>Alex Volkov:</strong> yeah, do</p><p>[01:29:13] <strong>Nisten Tahiraj:</strong> a constrained output to JSON and just output three tokens.</p><p>[01:29:16] <strong>Nisten Tahiraj:</strong> Be done with the whole thing. So yeah. Yeah, I think it'll become a big deal.</p><p>[01:29:21] <strong>Alex Volkov:</strong> So in the spirit of the holiday, thank you, Anthropic, for trying to standardize things. Standardizing, often, sometimes it's annoying, but often leads to good things as [01:29:30] well. Folks should try out MCP and definitely give them feedback.</p><p>[01:29:34] <strong>Alex Volkov:</strong> But yeah, they should also abide by some standards as well. It looks like the industry is standardizing around the OpenAI SDK, and maybe they should too, it would help.</p><p>[01:29:43] <strong>Wolfram Ravenwolf:</strong> It's a new thing that they are doing, because so far we usually had the LLM as a part in an agent pipeline, where you have another process calling the LLM with some input.</p><p>[01:29:52] <strong>Wolfram Ravenwolf:</strong> And here we have the LLM going out to get the input itself. So I think that is, also in the agent context, very important, and [01:30:00] more integration is always better. But of course it's a new thing. We have to develop all those servers, as they call them. So a lot of reinventing the wheel. I guess we'll see if it can really persevere.</p><p>[01:30:12] <strong>Alex Volkov:</strong> Yeah, one example that they highlight, and Simon talked about this as well, is that if you have a database, a SQLite database that sits on your computer, the way to have, so you guys know we talked about tool use, for example, via API: those models can respond with some idea of how to use your [01:30:30] tools.</p><p>[01:30:30] <strong>Alex Volkov:</strong> And you, as a developer, you are in charge of using those tools. 
You basically get in response the structure of a function call. And you're like, okay, now I have to take this, and then go to an external tool and use this. This is connecting this piece forward. This is basically allowing the LLM to then actually go and use this tool.</p><p>[01:30:48] <strong>Alex Volkov:</strong> Basically taking it a step forward. And one example that they're showing is connecting to a database, allowing the LLM to connect to a database via a SQLite [01:31:00] MCP server, the Model Context Protocol server. MCP, sorry. Yeah. So connecting via this MCP server, you're basically allowing the LLM to read from this database.</p><p>[01:31:08] <strong>Alex Volkov:</strong> Itself, without returning a call where you are then in charge, as a developer, of going and doing the call and returning its responses. So basically trying to allow LLMs to connect to different services. Yeah. And this, I think I agree with you, with more work here, this could be big.</p><p>[01:31:24] <strong>Nisten Tahiraj:</strong> It could literally be like over a thousand times more compute efficient to automate [01:31:30] something on a screen. Because instead of using a visual language model frame by frame, you can just have a JSON.</p><p>[01:31:37] <strong>Alex Volkov:</strong> Let's talk about</p><p>[01:31:38] <strong>Nisten Tahiraj:</strong> Like literally over a thousand times less compute to do it. So I'm going to take a longer look at it as well.</p><p>[01:31:46] <strong>Alex Volkov:</strong> Speaking of automating things on the screen,</p><p>[01:31:48] Runner H from H, the French AI company</p><p>[01:31:48] <strong>Alex Volkov:</strong> let's talk about the next thing, H Company AI. This is the next thing in big companies and APIs, H Company from [01:32:00] France, this is another big company. So we know Mistral is from France. 
Some DeepMind folks are from France as well.</p><p>[01:32:04] <strong>Alex Volkov:</strong> There's also FAIR in France, from Meta. Now France is positioning itself to be one big hub for AI as well. H Company raised, I think, 250 I have in my notes. Yeah, 220, one of the biggest seed rounds. 220 million dollars, one of the biggest in the history of French seed rounds, a while ago.</p><p>[01:32:24] <strong>Alex Volkov:</strong> And they just showcased their Runner H. With Runner H, [01:32:30] they're competing with Claude on speed of computer use. I apologize for this. Let's take a look at how fast they're claiming they're opening a browser, going to recipes and providing recipes for something. On the right, we have Claude Computer Use.</p><p>[01:32:46] <strong>Alex Volkov:</strong> Claude is basically, hey, open the browser. On the left, they've already pulled up a browser and are already extracting data. So basically they're claiming a speed-up of maybe two to three times over Claude Computer Use. [01:33:00] And they're basically showing, while Claude still pulls up the Firefox browser, they have already completed the task, extracted the data, and responded to the user.</p><p>[01:33:09] <strong>Alex Volkov:</strong> They're showing step-by-step comparisons, which I don't think is necessarily an apples-to-apples comparison. I don't think it's necessarily fair, but there's a big but here, a big French but, I don't know how to say, sorry, Nisten, I don't know how to say but in French, but there's a big one.</p><p>[01:33:25] <strong>Alex Volkov:</strong> Their models, as far as I could see, and I did some research: [01:33:30] they say this Runner H thing that they have is powered by a specialized LLM, optimized for function calling, at 2 billion params. 
So whatever we see on the left is not like Claude, which, whatever, we don't know the size of Claude; this is like a 2 billion parameter model.</p><p>[01:33:45] <strong>Alex Volkov:</strong> And it integrates a VLM of 3 billion parameters to see, understand, and interact with the graphical and text interface. Let's look at another example here. They're basically browsing the web and doing extraction, and yeah, I don't think you guys can see it. Maybe like this.[01:34:00]</p><p>[01:34:02] <strong>Alex Volkov:</strong> It's literally, they're going to Wolfram Alpha and extracting and doing this task. They're basically asking Wolfram Alpha to do a task. So it's not like they're just reading from things. They're finding inputs, plugging things in there, and reading from the output of Wolfram Alpha as well.</p><p>[01:34:18] <strong>Alex Volkov:</strong> This Runner H thing actually performs tasks on the web and extracts information back way faster than Claude Computer Use. Which, let's give Claude Computer Use its due: we were very excited when it came [01:34:30] out, and it does very well for just an adaptation of Claude. And they are showing immense differences in five steps, and we're still waiting for Claude Computer Use to try to figure this out.</p><p>[01:34:42] <strong>Alex Volkov:</strong> So did you</p><p>[01:34:43] <strong>Nisten Tahiraj:</strong> say it's a separate 2B model? And then there's another?</p><p>[01:34:48] <strong>Alex Volkov:</strong> That's what I found from them. Yeah. Yeah. They said that they have, let me see if I can find the previous announcement. Yeah. 
Yeah.</p><p>[01:34:54] <strong>Wolfram Ravenwolf:</strong> The previous announcement</p><p>[01:34:56] <strong>Alex Volkov:</strong> that they have, that we missed from last week: Introducing Studio, [01:35:00] automations at scale. Runner H, the most advanced agent to date.</p><p>[01:35:04] <strong>Alex Volkov:</strong> That's what they said last week. Powered by a specialized LLM, highly optimized for function calling, 2 billion parameters. It also integrates a specialized VLM, 3 billion parameters, to perceive, understand, and interact with graphical and text elements. Delivers state of the art on the public WebVoyager framework.</p><p>[01:35:20] <strong>Alex Volkov:</strong> And this is the graph that they have. On WebVoyager, they have Runner H 0.1 at 66 percent, maybe? And [01:35:30] then Claude Computer Use at 52 percent, and Agent-E, I don't know where it is, it's like here. Yeah, so the size of it is what's the most impressive part.</p><p>[01:35:41] <strong>Nisten Tahiraj:</strong> Yeah, I'd say this is impressive, as to what they're doing.</p><p>[01:35:44] <strong>Nisten Tahiraj:</strong> We can guess what model they're using, but it doesn't matter all that much. I just wanna say that it's not an apples-to-apples comparison with Claude, because Claude has an entire OS in there, and you can use whatever you want. It can use Blender, [01:36:00] you can run a VirtualBox of Windows 95 and it will use that as well.</p><p>[01:36:04] <strong>Eugen Cheugh:</strong> So the, yeah, it's not. 
That's not a pure example, whereas in this one, I'm assuming they do need access to the document object model, the DOM of the website, to be able to navigate it. But the results do indeed seem impressive, and it's at a size that you can run on your own. Yeah, because if you're measuring steps and speed, actually, I think Anthropic's Claude should probably partner with [01:36:30] a company like Browserbase, and just do a demo, and then see how close they get instead. It would skip literally the first eight steps or something like that, which is all just the OS booting up.</p><p>[01:36:40] <strong>Alex Volkov:</strong> Yeah, this is why I didn't love the comparison specifically. You guys are right, it's running a janky Docker with Firefox, and by the time it loads Firefox, these guys have already loaded the website. So it's not necessarily apples to apples, but it looks like those models are tiny compared to Claude. And also, they talk about, it's beyond [01:37:00] optimizing agent performance, they're optimizing web interactions.</p><p>[01:37:05] <strong>Alex Volkov:</strong> They engineered Runner H to handle any web interactions, advancing towards one singular mission: automating the web. So they're focused on the web. So Eugene, like what you're talking about, Browserbase with Computer Use, it looks like this is their focus, whereas Computer Use is, for computer use, generic.</p><p>[01:37:22] <strong>Alex Volkov:</strong> Their focus is web interactions. I guess what I'm saying is, it's exciting. They raised a boatload of money, and the folks behind [01:37:30] there seem very adept. I know they're based in France, Wolfram. I don't know, Wolfram, you're asking if I'm sure they're in France.</p><p>[01:37:36] <strong>Alex Volkov:</strong> Yeah, they're based in France, and, yeah, we'll see. They're waitlisted. I haven't tested them out. 
I know that some folks have collaborated with them already and posted some threads. So hopefully, we'll see. If I get access to this, I'll tell you guys and we'll play with it. Absolutely. Definitely exciting in the world of agents.</p><p>[01:37:54] <strong>Alex Volkov:</strong> I think this is it from big companies. Folks, what do you think? Anything else from big companies? Nothing from Google after the [01:38:00] releases of last week, where they reclaimed the throne. Hopefully they're getting their deserved breaks and relaxing. I think this week was fairly chill.</p><p>[01:38:07] <strong>Alex Volkov:</strong> Probably next week they're going to come back with a vengeance. Next week there's AWS re:Invent. Maybe Amazon will come with something. And then the week after, NeurIPS. Maybe some folks are waiting for that. I think that this is it in big companies. Let's move on to vision and video.</p><p>[01:38:22] <strong>Alex Volkov:</strong> And then, oh, I think we're at two minutes. Folks, I think we're at time. I think we're at time. I got too excited, and we have a bunch of other things to talk about. [01:38:30] So let me maybe recap our Thanksgiving stuff super quick, the stuff that we didn't get to, just to tell you guys what else we didn't get to: Runway, specifically.</p><p>[01:38:41] <strong>Alex Volkov:</strong> Oh yeah, I just, I have to show this. Not to talk about this, just to visually show this beautiful thing. If I can click this. If I can click this thing, yeah: Runway introduced an expand feature. If you guys haven't seen this, it's really fun to just watch. Let me just mute this. Basically, [01:39:00] what you see above and below: Runway introduced an expand feature where you take a video, you give it to this model, and the model tries to predict,</p><p>[01:39:08] <strong>Alex Volkov:</strong> in a different ratio, what's above and below this video. 
So basically, if you give it a video in the widescreen 16:9 format, you can try to turn it into a 9:16 format, and the model will try to fill in the frames. The general video model tries to fill in the frames of what's above and below.</p><p>[01:39:25] <strong>Alex Volkov:</strong> So what we're looking at in the video on the screen is a Lord of the [01:39:30] Rings scene where Legolas rides one of those elephant-looking thingies. Basically, the model tries to fill in just the frames from above and below. It looks a little bit creepy, it's funny looking, but it looks interesting.</p><p>[01:39:45] <strong>Alex Volkov:</strong> So this is the expand feature, and the other one is they released an actual image model from Runway, which looks interesting. It's called Frames, and it's specifically for image generation for [01:40:00] world building. And ComfyUI Desktop launched. I think that's pretty much it.</p><p>[01:40:05] Thanksgiving Reflections and Thanks</p><p>[01:40:05] <strong>Alex Volkov:</strong> Folks, it's time to say thanks, because it's Thanksgiving. I just wanted to start, but I wanted to hear from you as well. My biggest thanks this year goes to, first of all, everybody who tunes in to ThursdAI. Everybody who comes into the community, everybody who provides comments and shares with their friends and listens. The second huge thanks goes to all of you,</p><p>[01:40:26] <strong>Alex Volkov:</strong> my co-hosts here: Wolfram, Yam, Nisten, LDJ, Junyang [01:40:30] who joined us, Eugene who joined us as well, Zafari who joins us from time to time, and a bunch of other folks. Huge thanks to you for being here from week to week; we're coming up on two years. 
And I think the third thanks goes to Jensen, for the GPUs that he provided for all of us to enjoy this amazing cornucopia of AI features around the world.</p><p>[01:40:51] <strong>Alex Volkov:</strong> Just, yeah, just open up the mics and feel free to join the festivities, even though I don't know if any of you celebrate [01:41:00] Thanksgiving, necessarily. But yeah, what are you guys thankful for? Before we wrap up, let's do the Thanksgiving roundup.</p><p>[01:41:07] <strong>Eugen Cheugh:</strong> I'm giving thanks to open models.</p><p>[01:41:08] <strong>Eugen Cheugh:</strong> Let's go. Yeah, no, proving that you do not need billions of dollars to catch up with GPT-4, despite what the big labs will say. The open teams, keep going, keep bringing open models to the masses.</p><p>[01:41:25] <strong>Nisten Tahiraj:</strong> Yeah, we had Thanksgiving last month in Canada. I would like to [01:41:30] give thanks to two particular creators, mahi and tki, who each have over a thousand models and quants that they release. And also Mr. Der Backer, probably mispronounced that, with over 5,000 quantizations of models.</p><p>[01:41:48] <strong>Nisten Tahiraj:</strong> This is the stuff I use every day and tell other people about. So whenever something new comes up, I almost always expect them to have a good, well-done quantization ready for [01:42:00] others to use. And they just do this as volunteers. I don't even think they're part of a big corporation, none of them are, or have high salaries.</p><p>[01:42:08] <strong>Nisten Tahiraj:</strong> They literally just do it as volunteers. Yeah, I want to give thanks to those people in particular, and everybody else here, and all the people on Discord as well, who sit around and help you correct stuff. But yeah, that's it for me.</p><p>[01:42:27] <strong>Wolfram Ravenwolf:</strong> Okay, I have three. 
The first [01:42:30] is to Alex, for the podcast, because it's amazing to be here.</p><p>[01:42:34] <strong>Wolfram Ravenwolf:</strong> It's my way to keep up with the stuff I can't keep up with. So thank you for having me. Thank you for doing this. Thank you very much. And the second is to the whole community of AI people, especially those who release all this stuff in the open. But everybody who contributes, everybody who does a good thing about it, I think is furthering humanity.</p><p>[01:42:53] <strong>Wolfram Ravenwolf:</strong> So thanks for that. And the third is a thanks to every reasonable person who is not [01:43:00] going to extremes, but is open-minded, seeing that we are all in the same boat and we are all trying to make the world a better place in our different ways, and being accepting and understanding of this.</p><p>[01:43:11] <strong>Wolfram Ravenwolf:</strong> In these times, I think it's very important to keep an open mind.</p><p>[01:43:16] <strong>Nisten Tahiraj:</strong> Oh yeah, just really quickly to add on: the biggest thanks I think for this year goes to the DeepSeek and Qwen teams, for just carrying everybody [01:43:30] else when we stalled on progress. They kept it up to actually democratize the models, for you to actually have this piece of artificial intelligence and own it and control it and make it loyal to you, yeah.</p><p>[01:43:47] <strong>Nisten Tahiraj:</strong> They actually enable people to run fully local models. Like 90% of what I use every day is just completely open source now. Honestly, it would not be there if it wasn't for them. It would probably maybe be like [01:44:00] 20, 30%. So, yeah, they really carried. Like, that's a gaming term, like someone who</p><p>[01:44:06] <strong>Nisten Tahiraj:</strong> carries the team. 
They have really carried, so yeah.</p><p>[01:44:11] <strong>Alex Volkov:</strong> Yam, go ahead.</p><p>[01:44:14] <strong>Yam Peleg:</strong> To Jensen for the GPUs, and to everybody else at Hugging Face. Especially people collecting and releasing datasets. I think they're not getting enough credit, because you can't just use a dataset [01:44:30] without training a model. There is an effort there.</p><p>[01:44:31] <strong>Yam Peleg:</strong> You don't appreciate the dataset until you use it, but they make everything else possible.</p><p>[01:44:39] <strong>Alex Volkov:</strong> Last thing that I have to, and this is not because I have to, but honestly, folks: huge thanks to Weights & Biases for all of this. Honestly, I wouldn't have been able to do this as my job without a few folks at Weights & Biases, so thank you Morgan, thank you Lavanya, thank you to a bunch of folks at Weights & Biases,</p><p>[01:44:55] <strong>Alex Volkov:</strong> who realized this could be a part of my actual day to day, bringing you news from Weights [01:45:00] & Biases, but also promoting some of the stuff. Many of the labs, if not most of the labs that we talk about, are using Weights & Biases to bring us the open source, but also the closed source, LLMs in the world.</p><p>[01:45:10] <strong>Alex Volkov:</strong> I couldn't be more happy, or be in a better place, to bring you the news, but also participate behind the scenes in building some of these things. With that, thank you to all of you. Hopefully you go and enjoy some of the rest of your holiday. Those of you who celebrate, those of you who don't celebrate, this is, I think, the first Thursday in a while that we didn't have any breaking news.</p><p>[01:45:27] <strong>Alex Volkov:</strong> I'm itching to press it anyway, but we didn't [01:45:30] have any breaking news. Hopefully we'll have some next week. There could be some news next week. We'll see. 
With that, thanks to everybody who joined; go and enjoy the rest of your day. And we'll see you here next week as always. Bye everyone, bye bye.</p><p> </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-thanksgiving-special-24</link><guid isPermaLink="false">substack:post:152302404</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 28 Nov 2024 23:55:50 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/152302404/dbe169e45de71ceaff53819385cfe5c0.mp3" length="102025740" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6376</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/152302404/11b53b2e469c0ce09d7f51e32ff163cf.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Nov 21 - The fight for the LLM throne, OSS SOTA from AllenAI, Flux new tools, Deepseek R1 reasoning & more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here, and oof what a 🔥🔥🔥 show we had today! I got to use my new breaking news button <strong>3 times</strong> this show! And not only that, some of you may know that one of the absolute biggest pleasures as a host is to feature the folks who actually make the news on the show!</p><p>And now that we're in video format, you actually get to see who they are!
So this week I was honored to welcome back our friend and co-host Junyang Lin, a Dev Lead from the Alibaba Qwen team, who came back after launching the incredible Qwen Coder 2.5, and Qwen 2.5 Turbo with 1M context.</p><p>We also had breaking news on the show that AI2 (Allen Institute for AI) has fully released SOTA LLama post-trained models, and I was very lucky to get the core contributor on the paper, <a target="_blank" href="https://substack.com/profile/10472909-nathan-lambert">Nathan Lambert</a>, to join us live and tell us all about this amazing open source effort! You don't want to miss this conversation!</p><p>Lastly, we chatted with the CEO of StackBlitz, Eric Simons, about the absolutely incredible lightning-in-a-bottle success of their latest <a target="_blank" href="http://bolt.new">bolt.new</a> product, and how it opens a new category of code-generation tools.</p><p>00:00 Introduction and Welcome</p><p>00:58 Meet the Hosts and Guests</p><p>02:28 TLDR Overview</p><p>03:21 Tl;DR</p><p>04:10 Big Companies and APIs</p><p>07:47 Agent News and Announcements</p><p>08:05 Voice and Audio Updates</p><p>08:48 AR, Art, and Diffusion</p><p>11:02 Deep Dive into Mistral and Pixtral</p><p>29:28 Interview with Nathan Lambert from AI2</p><p>30:23 Live Reaction to Tulu 3 Release</p><p>30:50 Deep Dive into Tulu 3 Features</p><p>32:45 Open Source Commitment and Community Impact</p><p>33:13 Exploring the Released Artifacts</p><p>33:55 Detailed Breakdown of Datasets and Models</p><p>37:03 Motivation Behind Open Source</p><p>38:02 Q&A Session with the Community</p><p>38:52 Summarizing Key Insights and Future Directions</p><p>40:15 Discussion on Long Context Understanding</p><p>41:52 Closing Remarks and Acknowledgements</p><p>44:38 Transition to Big Companies and APIs</p><p>45:03 Weights & Biases: This Week's Buzz</p><p>01:02:50 Mistral's New Features and Upgrades</p><p>01:07:00 Introduction to DeepSeek and the Whale Giant</p><p>01:07:44 DeepSeek's Technological
Achievements</p><p>01:08:02 Open Source Models and API Announcement</p><p>01:09:32 DeepSeek's Reasoning Capabilities</p><p>01:12:07 Scaling Laws and Future Predictions</p><p>01:14:13 Interview with Eric from Bolt</p><p>01:14:41 Breaking News: Gemini Experimental</p><p>01:17:26 Interview with Eric Simons - CEO @ Stackblitz</p><p>01:19:39 Live Demo of Bolt's Capabilities</p><p>01:36:17 Black Forest Labs AI Art Tools</p><p>01:40:45 Conclusion and Final Thoughts</p><p>As always, the <strong>show notes</strong> and <strong>TL;DR with all the links</strong> I mentioned on the show and the full news roundup are below the main news recap 👇</p><p></p><p>Google & OpenAI fighting for the LMArena crown 👑</p><p>I wanted to open with this, as last week I <a target="_blank" href="https://sub.thursdai.news/i/151672097/google-deepmind-gemini-new-king-of-llms-on-lmarena">reported</a> that Gemini Exp 1114 had taken over #1 in the LMArena; in <strong>less than a week</strong>, we saw a new ChatGPT release, called GPT-4o-2024-11-20, reclaim the arena #1 spot!</p><p>Focusing specifically on creative writing, this new model, now deployed on <a target="_blank" href="http://chat.com">chat.com</a> and in the API, is definitely more creative according to many folks who've tried it, with OpenAI employees saying "expect qualitative improvements with more natural and engaging writing, thoroughness and readability" and indeed that's what my feed was reporting as well.</p><p>I also wanted to mention here that we've seen this happen once before: last time Gemini topped the LMArena, it took less than a week for OpenAI to release and test a model that beat it.</p><p>But not this time, this time Google came prepared with an answer!</p><p>Just as we were wrapping up the show (again, Logan apparently loves dropping things at the end of ThursdAI), we got breaking news that there is YET another experimental model from Google, called Gemini Exp 1121, and apparently, it reclaims the stolen #1
position that ChatGPT reclaimed from Gemini... yesterday! Or at least joins it at #1</p><p>LMArena Fatigue?</p><p>Many folks in my DMs are getting a bit frustrated with these marketing tactics, not only with the fact that we're getting experimental models faster than we can test them, but also with the fact that, if you think about it, this was probably a calculated move by Google. Release a very powerful checkpoint, knowing that this will trigger a response from OpenAI, but don't release your most powerful one. OpenAI predictably releases their own "ready to go" checkpoint to show they are ahead, then folks at Google wait and release what they wanted to release in the first place.</p><p>The other frustration point is the major labs' over-indexing on the LMArena human metrics as the closest approximation for "best". For example, here's some analysis from Artificial Analysis showing that while the latest ChatGPT is indeed better at creative writing (and #1 in the Arena, where humans vote answers against each other), it's gotten <strong>actively worse at MATH</strong> and coding than the August version (which could be a result of being a distilled, much smaller version).</p><p>In summary, maybe 1 arena is no longer all you need, but the competition at the TOP scores of the Arena has never been hotter.</p><p>DeepSeek R-1 preview - reasoning from the Chinese Whale</p><p>While the American labs fight for the LM titles, the really interesting news may be coming from the Chinese whale: DeepSeek, a company known for their incredibly cracked team, resurfaced once again and showed us that they are indeed, well, super cracked.</p><p>They have trained and released R-1 preview, with Reinforcement Learning, a reasoning model that beats O1 at AIME and other benchmarks! We don't know many details yet, besides them confirming that this model is coming to the open source!
but we do know that this model, unlike O1, shows the actual reasoning it uses to reach its answers (reminder: O1 hides its actual reasoning and what we see is actually another model summarizing the reasoning)</p><p>The other notable thing is that DeepSeek all but confirmed the claim that we have a new scaling law with test-time / inference-time compute, where, like with O1, the more time (and tokens) you give a model to think, the better it gets at answering hard questions. That is a very important confirmation, and a VERY exciting one if this is coming to the open source!</p><p>Right now you can play around with R1 in their <a target="_blank" href="https://chat.deepseek.com/a/chat/s/34aab9af-4a4a-4a99-b50d-8f0cd879b231">demo</a> chat interface.</p><p>In other Big Co and API news</p><p>In other news, Mistral becomes a Research/Product company, with a host of new additions to <a target="_blank" href="https://chat.mistral.ai/chat">Le Chat</a>, including Browse, PDF upload, Canvas and Flux 1.1 Pro integration (for Free! I think this is the only place where you can get Flux Pro for free!).</p><p>Qwen released a new 1M context window model in their API called Qwen 2.5 Turbo, making it not only the 2nd ever 1M+ model (after Gemini) to be available, but also reducing TTFT (time to first token) significantly and slashing costs.
This is available via their <a target="_blank" href="https://help.aliyun.com/zh/model-studio/getting-started/what-is-model-studio">APIs</a> and Demo <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen2.5-Turbo-1M-Demo">here</a>.</p><p>Open Source is catching up</p><p>AI2 open sources Tulu 3 - SOTA 8B, 70B LLama post trained FULLY open sourced (<a target="_blank" href="https://www.interconnects.ai/p/tulu-3">Blog</a>, <a target="_blank" href="https://playground.allenai.org/">Demo</a>, <a target="_blank" href="https://huggingface.co/collections/allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5">HF</a>, <a target="_blank" href="https://huggingface.co/collections/allenai/tulu-3-datasets-673b8df14442393f7213f372">Data</a>, <a target="_blank" href="https://github.com/allenai/open-instruct">Github</a>, <a target="_blank" href="https://arc.net/l/quote/enrkmunf">Paper</a>)</p><p>Allen AI folks have joined the show before, and this time we got Nathan Lambert, the core contributor on the Tulu paper, to join and talk to us about Post Training and how they made the best-performing SOTA LLama 3.1 finetunes with careful data curation (which they also open sourced), preference optimization, and a new methodology they call RLVR (Reinforcement Learning with Verifiable Rewards).</p><p>Simply put, RLVR modifies the RLHF approach by using a verification function instead of a reward model. This method is effective for tasks with verifiable answers, like math problems or specific instructions.
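The core idea can be sketched roughly like this (an illustrative toy, with hypothetical names and a GSM8K-style answer convention — not AI2's actual implementation):

```python
# Toy sketch of RLVR's key swap: instead of a learned reward model
# scoring a completion, a deterministic check against a known answer
# produces the reward. Names and conventions here are illustrative.

def verifiable_reward(prompt: str, completion: str, gold_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches the gold answer."""
    # GSM8K-style convention: the final answer follows a "####" marker.
    final = completion.split("####")[-1].strip()
    return 1.0 if final == gold_answer.strip() else 0.0

# RLHF:  reward = reward_model(prompt, completion)  -> learned, gameable scalar
# RLVR:  reward = verifiable_reward(...)            -> exact, but only usable
# when answers are checkable (math, constrained instructions), not open chat.

reward = verifiable_reward(
    "What is 2 + 3?",
    "2 + 3 = 5. #### 5",
    gold_answer="5",
)
```

The binary, un-gameable signal is the point: the policy can't reward-hack a string match the way it can exploit a learned reward model's blind spots.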
It improves performance on certain benchmarks (e.g., GSM8K) while maintaining capabilities in other areas.</p><p>The most notable thing is, just how MUCH is open source, as again, like the last time we had AI2 folks on the show, the amount they release is staggering</p><p>In the show, Nathan had me pull up the paper and we went through the deluge of models, code and datasets they released, not to mention the 73 page paper full of methodology and techniques.</p><p>Just absolute ❤️ to the AI2 team for this release!</p><p>🐝 This weeks buzz - Weights & Biases corner</p><p>This week, I want to invite you to a live stream announcement that I am working on behind the scenes to produce, on December 2nd. You can register <a target="_blank" href="https://www.linkedin.com/events/7263242000454291457/comments/">HERE</a> (it's on LinkedIn, I know, I'll have the YT link next week, promise!)</p><p>We have some very exciting news to announce, and I would really appreciate the ThursdAI crew showing up for that! 
It's like 5 minutes and I helped produce 🙂</p><p>Pixtral Large is making VLMs cool again</p><p>Mistral had quite the week this week, not only adding features to Le Chat, but also releasing Pixtral Large, their updated multimodal model, which they claim state of the art on multiple benchmarks.</p><p>It's really quite good, not to mention that it's also included, for free, as part of the le chat platform, so now when you upload documents or images to <a target="_blank" href="https://chat.mistral.ai/chat">le chat</a> you get Pixtral Large.</p><p>The backbone for this model is Mistral Large (not the new one they also released) and this makes this 124B model a really really good image model, albeit a VERY chonky one that's hard to run locally.</p><p>The thing I loved about the Pixtral release the most is, they used the new understanding to ask about Weights & Biases charts 😅 and Pixtral did a pretty good job!</p><p>Some members of the community though, reacted to the SOTA claims by Mistral in a very specific meme-y way:</p><p>This meme has become a very standard one, when labs tend to not include Qwen VL 72B or other Qwen models in the evaluation results, all while claiming SOTA. 
I decided to put these models to a head-to-head test myself, only to find out that, ironically, both models say the <a target="_blank" href="https://x.com/altryne/status/1859446910622273559">other one is better</a><strong>, </strong>while both hallucinate some numbers.</p><p>BFL is putting the ART in Artificial Intelligence with <em>FLUX.1 Tools</em> (<a target="_blank" href="https://blackforestlabs.ai/flux-1-tools/">blog</a>)</p><p>In an absolutely bombastic breaking-news release, the folks at BFL (Black Forest Labs) have released Flux.1 Tools, which will allow AI artists to use these models in all kinds of creative, inspiring ways.</p><p>These tools are: FLUX.1 Fill (for In/Out painting), FLUX.1 Depth/Canny (Structural Guidance using depth map or canny edges) and FLUX.1 Redux for image variation and restyling.</p><p>These tools are not new to the AI Art community conceptually, but they have been patched over onto Flux from other models like SDXL, and now the actual lab releasing them gave us the crème de la crème, and the evals speak for themselves, achieving SOTA on the image variation benchmark!</p><p>The last thing I haven't covered here is my interview with <a target="_blank" href="https://x.com/ericsimons40">Eric Simons</a>, the CEO of StackBlitz, who came in to talk about the insane rise of bolt.new, and I would refer you to the actual recording for that, because it's really worth listening to it (and seeing me trying out bolt in real time!)</p><p>That's most of the recap, we talked about a BUNCH of other stuff of course, and we finished on <a target="_blank" href="https://distrokid.com/hyperfollow/kyleshannon/the-quantum-cipher">THIS</a> rap song that ChatGPT wrote, and Suno v4 produced, with credits to Kyle Shannon.</p><p>TL;DR and Show Notes:</p><p>* <strong>Open Source LLMs</strong></p><p>* Mistral releases Pixtral Large (<a target="_blank" href="https://mistral.ai/news/pixtral-large/">Blog</a>, <a target="_blank"
href="https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411">HF</a>, <a target="_blank" href="https://chat.mistral.ai/">LeChat</a>)</p><p>* Mistral - Mistral Large 2411 (<a target="_blank" href="https://huggingface.co/mistralai/Mistral-Large-Instruct-2411">HF</a>)</p><p>* Sage Attention, the next Flash Attention? (<a target="_blank" href="https://x.com/_philschmid/status/1859132361536880720">X</a>)</p><p>* AI2 open sources Tulu 3 - SOTA 8B, 70B LLama Finetunes FULLY open sourced (<a target="_blank" href="https://www.interconnects.ai/p/tulu-3">Blog</a>, <a target="_blank" href="https://playground.allenai.org/">Demo</a>, <a target="_blank" href="https://huggingface.co/collections/allenai/tulu-3-models-673b8e0dc3512e30e7dc54f5">HF</a>, <a target="_blank" href="https://huggingface.co/collections/allenai/tulu-3-datasets-673b8df14442393f7213f372">Data</a>, <a target="_blank" href="https://github.com/allenai/open-instruct">Github</a>, <a target="_blank" href="https://arc.net/l/quote/enrkmunf">Paper</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Alibaba - Qwen 2.5 Turbo with 1M tokens (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1858469845958074541">X</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen2.5-Turbo-1M-Demo">HF Demo</a>)</p><p>* Mistral upgrades to a product company with le chat 2.0 (<a target="_blank" href="https://mistral.ai/news/mistral-chat/">Blog</a>, <a target="_blank" href="https://chat.mistral.ai/chat">Le Chat</a>)</p><p>* DeepSeek R1-preview - the first reasoning model from the Chinese whale (<a target="_blank" href="https://x.com/deepseek_ai/status/1859200141355536422">X</a>, <a target="_blank" href="https://x.com/deepseek_ai/status/1859200141355536422">chat</a>)</p><p>* OpenAI updates ChatGPT in app and API - reclaims #1 on LMArena (<a target="_blank" href="https://x.com/kondrich2">X</a>)</p><p>* Gemini Exp 1121 - rejoins #1 spot on LMArena after 1 day of being beaten (<a target="_blank"
href="https://x.com/OfficialLoganK/status/1859667244688736419">X</a>)</p><p>* <strong>Agents News</strong></p><p>* Perplexity is going to do the shopping for you (<a target="_blank" href="https://x.com/AravSrinivas/status/1858560970223911122">X</a>, <a target="_blank" href="https://www.perplexity.ai/shopping">Shop</a>)</p><p>* Stripe Agent SDK - allowing agents to transact (<a target="_blank" href="https://stripe.dev/blog/adding-payments-to-your-agentic-workflows">Blog</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* We have an important announcement coming on December 2nd! (<a target="_blank" href="https://www.linkedin.com/events/7263242000454291457/comments/">link</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Suno V4 released - but for real this time (<a target="_blank" href="https://x.com/sunomusic/status/1858918710008049866">X</a>)</p><p>* ChatGPT new creative writing does Eminem type rap with new Suno v4 (<a target="_blank" href="https://x.com/kyleshannon/status/1859355131738734824">link</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* BFL announcing Flux Tools today (<a target="_blank" href="https://blackforestlabs.ai/flux-1-tools/">blog</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/flux-pro/v1/fill/playground">fal</a>)</p><p>* Free BFL Flux Pro on Mistral Le Chat!</p><p>*</p><p>Thank you, see you next week 🫡</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-21-the-fight-for-the</link><guid isPermaLink="false">substack:post:152001324</guid><dc:creator><![CDATA[Alex Volkov, Nathan Lambert, and Eric Simons]]></dc:creator><pubDate>Fri, 22 Nov 2024 01:49:53 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/152001324/140da6c3deafee584517395d4f15e12c.mp3" length="101209721" type="audio/mpeg"/><itunes:author>Alex Volkov, Nathan Lambert, and Eric Simons</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6325</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/152001324/c70ab5bdd2ee14a60c73e08d66b85746.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Nov 14 - Qwen 2.5 Coder, No Walls, Gemini 1114 👑 LLM, ChatGPT OS integrations & more AI news]]></title><description><![CDATA[<p>This week is a <strong>very exciting</strong> one in the world of AI news, as we get <strong>3 SOTA models</strong>, one in overall LLM rankings, one in OSS coding and one in OSS voice + a bunch of new breaking news during the show (which we reacted to live on the pod, and as we're now doing video, you can see us freak out in real time at 59:32)</p><p></p><p>00:00 Welcome to ThursdAI</p><p>00:25 Meet the Hosts</p><p>02:38 Show Format and Community</p><p>03:18 TLDR Overview</p><p>04:01 Open Source Highlights</p><p>13:31 Qwen Coder 2.5 Release</p><p>14:00 Speculative Decoding and Model Performance</p><p>22:18 Interactive Demos and Artifacts</p><p>28:20 Training Insights and Future Prospects</p><p>33:54 Breaking News: Nexus Flow</p><p>36:23 Exploring Athene v2 Agent Capabilities</p><p>36:48 Understanding ArenaHard and Benchmarking</p><p>40:55 Scaling and Limitations in AI Models</p><p>43:04 Nexus Flow and Scaling
Debate</p><p>49:00 Open Source LLMs and New Releases</p><p>52:29 FrontierMath Benchmark and Quantization Challenges</p><p>58:50 Gemini Experimental 1114 Release and Performance</p><p>01:11:28 LLM Observability with Weave</p><p>01:14:55 Introduction to Tracing and Evaluations</p><p>01:15:50 Weave API Toolkit Overview</p><p>01:16:08 Buzz Corner: Weights & Biases</p><p>01:16:18 Nous Forge Reasoning API</p><p>01:26:39 Breaking News: OpenAI's New MacOS Features</p><p>01:27:41 Live Demo: ChatGPT Integration with VS Code</p><p>01:34:28 Ultravox: Real-Time AI Conversations</p><p>01:42:03 Tilde Research and Stargazer Tool</p><p>01:46:12 Conclusion and Final Thoughts</p><p>This week there was also a debate online about whether deep learning (and "scale is all you need") has hit a wall, with folks like Ilya Sutskever being cited by publications claiming it has, and folks like Yann LeCun calling "I told you so". TL;DR? Multiple huge breakthroughs later, both <a target="_blank" href="https://x.com/OriolVinyalsML/status/1857117231035150567">Oriol</a> from DeepMind and <a target="_blank" href="https://x.com/sama/status/1856941766915641580">Sam Altman</a> are saying "what wall?" and Heinrich from <a target="_blank" href="http://X.ai">X.ai</a> is saying "<a target="_blank" href="https://x.com/HeinrichKuttler/status/1856614187268747594">skill issue</a>"; there are no walls in sight, despite tech journalism's love of pretending there is. Also, what happened to Yann?
😵‍💫</p><p>Ok, back to our scheduled programming, here's the TL;DR, after which comes a breakdown of the most important things about today's update, and as always, I encourage you to watch / listen to the show, as we cover way more than I summarize here 🙂</p><p>TL;DR and Show Notes:</p><p>* <strong>Open Source LLMs</strong></p><p>* Qwen Coder 2.5 32B (+5 others) - Sonnet @ home (<a target="_blank" href="https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f">HF</a>, <a target="_blank" href="https://qwenlm.github.io/blog/qwen2.5-coder-family/">Blog</a>, <a target="_blank" href="https://arxiv.org/abs/2409.12186">Tech Report</a>)</p><p>* The End of Quantization? (<a target="_blank" href="https://x.com/Tim_Dettmers/status/1856338240099221674">X</a>, <a target="_blank" href="https://x.com/Tanishq97836660/status/1856045600355352753">Original Thread</a>)</p><p>* Epoch: FrontierMath, a new benchmark for advanced MATH reasoning in AI (<a target="_blank" href="https://buttondown.com/ainews/archive/ainews-frontiermath-a-benchmark-for-evaluating/">Blog</a>)</p><p>* Common Corpus: Largest multilingual 2T token dataset (<a target="_blank" href="https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open">blog</a>)</p><p>* NexusFlow - Athena v2 - open model suite (<a target="_blank" href="https://x.com/NexusflowX/status/1857089879437352987">X</a>, <a target="_blank" href="https://t.co/MxO86Gcq0Y">Blog</a>, <a target="_blank" href="https://huggingface.co/Nexusflow/Athene-V2-Chat">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini 1114 is the new king LLM, #1 on LMArena (<a target="_blank" href="https://x.com/lmarena_ai/status/1857110672565494098">X</a>)</p><p>* Nous Forge Reasoning API - beta (<a target="_blank" href="https://nousresearch.com/introducing-the-forge-reasoning-api-beta-and-nous-chat-an-evolution-in-llm-inference/">Blog</a>, <a target="_blank" href="https://x.com/NousResearch/status/1856417883934601246">X</a>)</p><p>* Reuters reports "AI is
hitting a wall" and it's becoming a meme (<a target="_blank" href="https://www.reuters.com/technology/artificial-intelligence/openai-rivals-seek-new-path-smarter-ai-current-methods-hit-limitations-2024-11-11/">Article</a>)</p><p>* Cursor acq. SuperMaven (<a target="_blank" href="https://x.com/cursor_ai/status/1856427424927625679">X</a>)</p><p>* This Weeks Buzz</p><p>* <a target="_blank" href="https://wandb.me/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=nov14">Weave</a> JS/TS support is here 🙌</p><p>* Voice & Audio</p><p>* Fixie releases UltraVox SOTA (<a target="_blank" href="https://demo.ultravox.ai/">Demo</a>, <a target="_blank" href="https://huggingface.co/fixie-ai">HF</a>, <a target="_blank" href="https://fixie-ai.github.io/ultradox/">API</a>)</p><p>* Suno v4 is coming and it's bonkers amazing (<a target="_blank" href="https://x.com/altryne/status/1856737012348301414">Alex Song</a>, <a target="_blank" href="https://suno.com/song/0552bee7-a104-4be0-bc3a-beb5398a2d61">SOTA Jingle</a>)</p><p>* Tools demoed</p><p>* Qwen artifacts - <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-Artifacts">HF Demo</a></p><p>* Tilde Galaxy - <a target="_blank" href="https://stars.tilderesearch.com/">Interp Tool</a></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-14-qwen-25-coder-no</link><guid isPermaLink="false">substack:post:151672097</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 15 Nov 2024 00:38:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151672097/974b487fcfad680ff84778b00f50e312.mp3" length="104347083" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6522</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/151672097/7148871bfbf8a490f27368919a961e4d.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Nov 7 - Video version, full o1 was given and taken away, Anthropic price hike-u, halloween 💀 recap & more AI news]]></title><description><![CDATA[<p>👋 Hey all, this is Alex, coming to you from the very Sunny California, as I'm in SF again, while there is a complete snow storm back home in Denver (brrr).</p><p>I flew here for the Hackathon I kept telling you about, and it was glorious, we had over 400 registered, over 200 approved hackers, 21 teams submitted incredible projects 👏 You can follow some of these <a target="_blank" href="https://x.com/AITinkerers/status/1853198238855368724">here</a></p><p>I then decided to stick around and record the show from SF, and finally pulled the plug and asked for some budget, and I present, the first ThursdAI, recorded from the newly minted W&B Podcast studio at our office in SF 🎉</p><p>This isn't the only first, today also, for the first time, all of the regular co-hosts of ThursdAI, met on video for the first time, after over a year of hanging out weekly, we've finally made the switch to video, and you know what? 
Given how good AI podcasts are getting, we may have to stick around with this video thing! We played one such clip from a new model called hertz-dev, which is a <10B model for full duplex audio.</p><p>Given that today's episode is a video podcast, I would love for you to see it, so here's the timestamps for the chapters, which will be followed by the TL;DR and show notes in raw format. I would love to hear from folks who read the longer form style newsletters, do you miss them? Should I bring them back? Please leave me a comment 🙏 (I may send you a survey)</p><p>This was a generally slow week (for AI!! not for... ehrm other stuff) and it was a fun podcast! Leave me a comment about what you think about this new format.</p><p>Chapter Timestamps</p><p>00:00 Introduction and Agenda Overview</p><p>00:15 Open Source LLMs: Small Models</p><p>01:25 Open Source LLMs: Large Models</p><p>02:22 Big Companies and LLM Announcements</p><p>04:47 Hackathon Recap and Community Highlights</p><p>18:46 Technical Deep Dive: HertzDev and FishSpeech</p><p>33:11 Human in the Loop: AI Agents</p><p>36:24 Augmented Reality Lab Assistant</p><p>36:53 Hackathon Highlights and Community Vibes</p><p>37:17 Chef Puppet and Meta Ray Bans Raffle</p><p>37:46 Introducing Fester the Skeleton</p><p>38:37 Fester's Performance and Community Reactions</p><p>39:35 Technical Insights and Project Details</p><p>42:42 Big Companies API Updates</p><p>43:17 Haiku 3.5: Performance and Pricing</p><p>43:44 Comparing Haiku and Sonnet Models</p><p>51:32 XAI Grok: New Features and Pricing</p><p>57:23 OpenAI's O1 Model: Leaks and Expectations</p><p>01:08:42 Transformer ASIC: The Future of AI Hardware</p><p>01:13:18 The Future of Training and Inference Chips</p><p>01:13:52 Oasis Demo and Etched AI Controversy</p><p>01:14:37 Nisten's Skepticism on Etched AI</p><p>01:19:15 Human Layer Introduction with Dex</p><p>01:19:24 Building and Managing AI Agents</p><p>01:20:54 Challenges and Innovations in AI Agent 
Development</p><p>01:21:28 Human Layer's Vision and Future</p><p>01:36:34 Recap and Closing Remarks</p><p></p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p>Show Notes and Links:</p><p>* <strong>Interview</strong></p><p>* Dexter Horthy (<a target="_blank" href="https://x.com/dexhorthy">X</a>) from <a target="_blank" href="https://x.com/humanlayer_dev">HumanLayer</a></p><p>* <strong>Open Source LLMs</strong></p><p>* SmolLM2: the new, best, and open 1B-parameter language model (<a target="_blank" href="https://x.com/andi_marafioti/status/1854077589469462840">X</a>)</p><p>* Meta released MobileLLM (125M, 350M, 600M, 1B) (<a target="_blank" href="https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95">HF</a>)</p><p>* Tencent Hunyuan Large - 389B X 52B (Active) MoE (<a target="_blank" href="https://x.com/WizardLM_AI/status/1853814545514700802">X</a>, <a target="_blank" href="https://huggingface.co/tencent/Tencent-Hunyuan-Large">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2411.02265">Paper</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI buys and opens <a target="_blank" href="http://chat.com">chat.com</a></p><p>* Anthropic releases Claude Haiku 3.5 via API (<a target="_blank" href="https://x.com/AnthropicAI/status/1853498267612438873">X</a>, <a target="_blank" href="https://www.anthropic.com/claude/haiku">Blog</a>)</p><p>* OpenAI drops o1 full - and pulls it back (but not before it got <a target="_blank" href="https://x.com/elder_plinius/status/1852690065698250878">Jailbroken</a>)</p><p>* <a target="_blank" href="X.ai">X.ai</a> now offers $25/mo free of Grok API credits (<a target="_blank" href="https://x.com/btibor91/status/1853406496039846291">X</a>, <a target="_blank" href="https://x.ai/blog/api">Platform</a>)</p><p>* Etched announces Sohu - first Transformer ASIC - 500K
tok/s (<a target="_blank" href="https://www.etched.com/">etched</a>)</p><p>* PPXL is now valued at 9B lol</p><p>* <strong>This weeks Buzz</strong></p><p>* Recap of SF Hackathon w/ AI Tinkerers (<a target="_blank" href="https://x.com/AITinkerers/status/1853198238855368724">X</a>)</p><p>* Fester the Halloween Toy aka Project Halloweave videos from trick or treating (<a target="_blank" href="https://x.com/altryne/status/1853957088747413949">X</a>, <a target="_blank" href="https://wandb.me/halloweave">Writeup</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Hertz-dev - 8.5B conversation audio gen (<a target="_blank" href="https://x.com/si_pbc/status/1853184307063660723">X</a>, <a target="_blank" href="https://si.inc/hertz-dev/">Blog</a>)</p><p>* Fish Agent v0.1 3B - Speech to Speech model (<a target="_blank" href="https://huggingface.co/fishaudio/fish-agent-v0.1-3b">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/fishaudio/fish-agent">Demo</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* FLUX 1.1 [pro] is now HD - 4x resolution (<a target="_blank" href="https://x.com/bfl_ml/status/1854187828923531558">X</a>, <a target="_blank" href="https://blackforestlabs.ai/flux-1-1-ultra/">blog</a>)</p><p></p><p>Full Transcription for convenience below:</p><p></p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-7-video-version-full</link><guid isPermaLink="false">substack:post:151353661</guid><dc:creator><![CDATA[Alex Volkov and Dex Horthy]]></dc:creator><pubDate>Fri, 08 Nov 2024 01:36:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151353661/56b8d76133bd52ba674599fe950a659f.mp3" length="94442410" type="audio/mpeg"/><itunes:author>Alex Volkov and Dex Horthy</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5902</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/151353661/d2fdbb77d9d1940824ecfd3a5a907873.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Spooky Halloween edition with Video!]]></title><description><![CDATA[<p>Hey everyone, Happy Halloween! Alex here, coming to you live from my mad scientist lair! For the first-ever live video stream of ThursdAI, I dressed up as a mad scientist and had my co-host, Fester the AI powered Skeleton, join me (as well as my usual cohosts haha) in a very energetic and hopefully entertaining video stream! </p><p>Since it's Halloween today, Fester (and I) have a very busy schedule, so no super-lengthy ThursdAI newsletter today; as we're still not in the realm of Gemini being able to write a decent draft that takes everything we talked about and covers all the breaking news, I'm afraid I will have to wish you a Happy Halloween and ask that you watch/listen to the episode. </p><p>The TL;DR and show links from today don't cover all the breaking news, but the major things we saw today (and caught live on the show as Breaking News) were: ChatGPT now has search, and Gemini has grounded search as well (seems like OpenAI's streak of releasing something right before Google announces it continues). 
</p><p>Here's a quick trailer of the major things that happened: </p><p></p><p>This week's buzz - Halloween AI toy with Weave</p><p>In this week's buzz, my long awaited Halloween project is finally live and operational! </p><p>I've posted a public Weave dashboard <a target="_blank" href="https://wandb.me/halloweave">here</a> and the code (that you can run on your mac!) <a target="_blank" href="https://github.com/altryne/halloweave">here</a></p><p>Really looking forward to seeing all the amazing costumes the kiddos come up with and how Gemini will be able to respond to them, follow <a target="_blank" href="https://wandb.me/halloweave">along</a>! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Ok and finally my raw TL;DR notes and links for this week. Happy Halloween everyone, I'm running off to spook the kiddos (and of course record and post about it!)</p><p>ThursdAI - Oct 31 - TL;DR</p><p><strong>TL;DR of all topics covered:</strong></p><p>* <strong>Open Source LLMs:</strong></p><p>* <strong>Microsoft's OmniParser: SOTA UI parsing (MIT Licensed)</strong> <a target="_blank" href="https://x.com/mervenoyann/status/1849772138166727128">𝕏</a></p><p>* Groundbreaking model for web automation (MIT license).</p><p>* State-of-the-art UI parsing and understanding.</p><p>* Outperforms GPT-4V in parsing web UI.</p><p>* Designed for web automation tasks.</p><p>* Can be integrated into various development workflows.</p><p>* <strong>ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech</strong> <a target="_blank" href="https://x.com/ChatGLM/status/1849819999663423816">𝕏</a></p><p>* End-to-end voice model for Chinese and English speech.</p><p>* Open-sourced and readily available.</p><p>* Focuses on direct speech understanding and generation.</p><p>* Potential applications in various speech-related tasks.</p><p>* <strong>Meta 
releases LongVU: Video LM for long videos</strong> <a target="_blank" href="https://x.com/mervenoyann/status/1851650881374040357">𝕏</a></p><p>* Handles long videos with impressive performance.</p><p>* Uses DINOv2 for downsampling, eliminating redundant scenes.</p><p>* Fuses features using DINOv2 and SigLIP.</p><p>* Select tokens are passed to Qwen2/Llama-3.2-3B.</p><p>* Demo and model are available on HuggingFace.</p><p>* Potential for significant advancements in video understanding.</p><p>* OpenAI new factuality benchmark (<a target="_blank" href="https://openai.com/index/introducing-simpleqa/">Blog</a>, Github)</p><p>* Introducing SimpleQA: new factuality benchmark</p><p>* Goal: high correctness, diversity, challenging for frontier models</p><p>* Question Curation: AI trainers, verified by second trainer</p><p>* Quality Assurance: 3% inherent error rate</p><p>* Topic Diversity: wide range of topics</p><p>* Grading Methodology: "correct", "incorrect", "not attempted"</p><p>* Model Comparison: smaller models answer fewer correctly</p><p>* Calibration Measurement: larger models more calibrated</p><p>* Limitations: only for short, fact-seeking queries</p><p>* Conclusion: drive research on trustworthy AI</p><p>* <strong>Big CO LLMs + APIs:</strong></p><p>* ChatGPT now has Search! 
(<a target="_blank" href="https://x.com/altryne/status/1852045015050260703">X</a>)</p><p>* Grounded search results in browsing the web</p><p>* Still hallucinates</p><p>* Reincarnation of Search GPT inside ChatGPT</p><p>* <strong>Apple Intelligence Launch: Image features for iOS 18.2</strong></p><p>* Officially launched for developers in iOS 18.2.</p><p>* Includes Image Playground and Gen Moji.</p><p>* Aims to enhance image creation and manipulation on iPhones.</p><p>* <strong>GitHub Universe AI News: Co-pilot expands, new Spark tool</strong> <a target="_blank" href="https://x.com/devnationworld/status/1851524396680102239">𝕏</a></p><p>* GitHub Co-pilot now supports Claude, Gemini, and OpenAI models.</p><p>* GitHub Spark: Create micro-apps using natural language.</p><p>* Expanding the capabilities of AI-powered coding tools.</p><p>* Copilot now supports multi-file edits in VS Code, similar to Cursor, and faster code reviews.</p><p>* GitHub Copilot extensions are planned for release in 2025.</p><p>* <strong>Grok Vision: Image understanding now in Grok</strong> <a target="_blank" href="https://x.com/elonmusk/status/1850724646414606406">𝕏</a></p><p>* Finally has vision capabilities (currently via 𝕏, API coming soon).</p><p>* Can now understand and explain images, even jokes.</p><p>* Early version, with rapid improvements expected.</p><p>* OpenAI advanced voice mode updates (<a target="_blank" href="https://x.com/OpenAI/status/1851714389835157660">X</a>)</p><p>* 70% cheaper in input tokens because of automatic caching (X)</p><p>* Advanced voice mode is now on desktop app</p><p>* Claude this morning - new mac / pc App</p><p>* <strong>This week's Buzz:</strong></p><p>* My AI Halloween toy skeleton is greeting kids right now (and is reporting to <a target="_blank" href="https://wandb.me/halloweave">Weave dashboard</a>)</p><p>* <strong>Vision & Video:</strong></p><p>* <strong>Meta's LongVU: Video LM for long videos</strong> 
<a target="_blank" href="https://x.com/mervenoyann/status/1851650881374040357">𝕏</a> (see Open Source LLMs for details)</p><p>* <strong>Grok Vision on 𝕏:</strong> <a target="_blank" href="https://x.com/elonmusk/status/1850724646414606406">𝕏</a> (see Big CO LLMs + APIs for details)</p><p>* <strong>Voice & Audio:</strong></p><p>* <strong>MaskGCT: New SoTA Text-to-Speech</strong> <a target="_blank" href="https://x.com/reach_vb/status/1851629504348754202">𝕏</a></p><p>* New open-source state-of-the-art text-to-speech model.</p><p>* Zero-shot voice cloning, emotional TTS, long-form synthesis, variable speed synthesis, bilingual (Chinese & English).</p><p>* Available on Hugging Face.</p><p>* <strong>ZhipuAI's GLM-4-Voice: End-to-end Chinese/English speech</strong> <a target="_blank" href="https://x.com/ChatGLM/status/1849819999663423816">𝕏</a> (see Open Source LLMs for details)</p><p>* <strong>Advanced Voice Mode on Desktops:</strong> <a target="_blank" href="https://x.com/testingcatalog/status/1851718996049170515">𝕏</a> (See Big CO LLMs + APIs for details).</p><p>* <strong>AI Art & Diffusion:</strong></p><p>* Redcraft Red Panda: new SOTA image diffusion <a target="_blank" href="https://x.com/MParakhin/status/1851287090748953038">𝕏</a></p><p>* High-performing image diffusion model, beating Black Forest Labs Flux.</p><p>* 72% win rate, higher ELO than competitors.</p><p>* Creates SVG files, editable as vector files.</p><p>* From Redcraft V3.</p><p>* <strong>Tools:</strong></p><p>* <a target="_blank" href="https://bolt.new"><strong>Bolt.new</strong></a><strong> by StackBlitz: In-browser full-stack dev environment</strong> <a target="_blank" href="https://x.com/stackblitz/status/1841873251313844631">𝕏</a></p><p>* Platform for prompting, editing, running, and deploying full-stack apps directly in your browser.</p><p>* Uses WebContainers.</p><p>* Supports npm, Vite, Next.js, and integrations with Netlify, Cloudflare, and Supabase.</p><p>* 
Free to use.</p><p>* <strong>Jina AI's Meta-Prompt: Improved LLM Codegen</strong> <a target="_blank" href="https://x.com/JinaAI_/status/1851651702635847729">𝕏</a></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-spooky-halloween-edition</link><guid isPermaLink="false">substack:post:151009424</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 01 Nov 2024 00:40:58 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/151009424/dcfc5f258d67088881808e0a824a2d12.mp3" length="104724614" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6545</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/151009424/1a2486f099bdb11b52e7650bedf1ffa0.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Oct 24 - Claude 3.5 controls your PC?! Talking AIs with 🦾, Multimodal Weave, Video Models mania + more AI news from this 🔥 week. ]]></title><description><![CDATA[<p>Hey all, Alex here, coming to you from the (surprisingly) sunny Seattle, with just a mind-boggling week of releases. Really, just on Tuesday there was so much news already! I had to post a recap <a target="_blank" href="https://x.com/altryne/status/1848868854711398456">thread</a>, something I do usually after I finish ThursdAI! </p><p>From Anthropic reclaiming close-second sometimes-first AI lab position + giving Claude the wheel in the form of computer use powers, to more than 3 AI video generation updates with open source ones, to Apple updating Apple Intelligence beta, it's honestly been very hard to keep up, and again, this is literally part of my job! 
</p><p>But once again I'm glad that we were able to cover this in ~2hrs, including multiple interviews with returning co-hosts ( <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> came back, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-special-interview-with-killian?utm_source=publication-search">Killian</a> came back) so definitely if you're only a reader at this point, <strong>listen to the show</strong>! </p><p>Ok as always (recently) the <strong>TL;DR and show notes at the bottom</strong> (I'm trying to get you to scroll through ha, is it working?) so grab a bucket of popcorn, let's dive in 👇 </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Claude's Big Week: Computer Control, Code Wizardry, and the Mysterious Case of the Missing Opus</p><p>Anthropic <strong>dominated</strong> the headlines this week with a flurry of updates and announcements. Let's start with the new Claude Sonnet 3.5 (really, they didn't update the version number, it's still 3.5 tho a different API model) </p><p>Claude Sonnet 3.5: Coding Prodigy or Benchmark Buster?</p><p>The new Sonnet model shows impressive results on coding benchmarks, surpassing even OpenAI's O1 preview on some. "It absolutely crushes coding benchmarks like Aider and Swe-bench verified," I exclaimed on the show. But a closer look reveals a more nuanced picture. Mixed results on other benchmarks indicate that Sonnet 3.5 might not be the universal champion some anticipated. A friend who runs held-back internal benchmarks was disappointed, highlighting weaknesses in scientific reasoning and certain writing tasks. Some folks are seeing it being lazier on some full code completions, while the output token limit is now doubled from 4K to 8K! 
This goes to show again that benchmarks don't tell the full story, so we wait for LMArena (formerly LMSys Arena) and the vibe checks from across the community. </p><p>However it absolutely dominates in code tasks, that much is clear already. This is a screenshot of the new model on the Aider code editing benchmark, a fairly reliable way to judge models' code output; they also have a code refactoring benchmark.</p><p>Haiku 3.5 and the Vanishing Opus: Anthropic's Cryptic Clues</p><p>Further adding to the intrigue, Anthropic announced Claude 3.5 Haiku!  They usually provide immediate access, but Haiku remains elusive; they say it will be available by end of the month, which is very very soon. Making things even more curious, their highly anticipated Opus model has seemingly vanished from their website. "They've gone completely silent on 3.5 Opus," Simon Willison (<a target="_blank" href="https://x.com/simonw">𝕏</a>) noted, mentioning conspiracy theories that this new Sonnet might simply be a rebranded Opus? 🕯️ 🕯️ We'll make a summoning circle for new Opus and update you once it lands (maybe next year)  </p><p>Claude Takes Control (Sort Of): Computer Use API and the Dawn of AI Agents (<a target="_blank" href="https://x.com/altryne/status/1848790306776379767">𝕏</a>)</p><p>The biggest bombshell this week? Anthropic's Computer Use. This isn't just about executing code; it's about Claude <em>interacting</em> with computers, clicking buttons, browsing the web, and yes, even ordering pizza! Killian Lukas (<a target="_blank" href="https://x.com/hellokillian">𝕏</a>), creator of Open Interpreter, returned to ThursdAI to discuss this groundbreaking development. "This stuff of computer use…it's the same argument for having humanoid robots, the web is human shaped, and we need AIs to interact with computers and the web the way humans do," Killian explained, illuminating the potential for bridging the digital and physical worlds. 
</p><p>Simon, though enthusiastic, provided a dose of realism: "It's incredibly impressive…but also very much a V1, beta." Having tackled the setup myself, I agree; the current reliance on a local Docker container and virtual machine introduces some complexity and security considerations. However, seeing Claude fix its <em>own</em> Docker installation error was an unforgettably <a target="_blank" href="https://x.com/altryne/status/1848790306776379767">mindblowing</a> experience. The future of AI agents is upon us, even if it's still a bit rough around the edges.</p><p><a target="_blank" href="https://x.com/alexalbert__/status/1849124173966393630">Here's</a> an easy guide to set it up yourself, takes 5 minutes, requires no coding skills and it's safely tucked away in a container.</p><p>Big Tech's AI Moves: Apple Embraces ChatGPT, <a target="_blank" href="https://x.ai">X.ai</a> API (+Vision!?), and Cohere Multimodal Embeddings</p><p>The rest of the AI world wasn't standing still. Apple made a surprising integration, while <a target="_blank" href="https://x.ai">X.ai</a> and Cohere pushed their platforms forward.</p><p>Apple iOS 18.2 Beta: Siri Phones a Friend (ChatGPT)</p><p>Apple, always cautious, surprisingly integrated ChatGPT directly into iOS. While Siri remains…well, Siri, users can now effortlessly offload more demanding tasks to ChatGPT. "Siri is still stupid," I joked, "but can now ask it to write some stuff and it'll tell you, hey, do you want me to ask my much smarter friend ChatGPT about this task?" This approach acknowledges Siri's limitations while harnessing ChatGPT's power. The iOS 18.2 beta also includes GenMoji (custom emojis!) and Visual Intelligence (multimodal camera search), which are both welcome, tho I didn't really get the need for Visual Intelligence (maybe I'm jaded with my Meta Raybans that already have this and are on my face most of the time), and I still didn't get into the GenMoji waitlist, so no custom emojis to show you yet! 
</p><p><a target="_blank" href="https://x.ai">X.ai</a> API: Grok's Enterprise Ambitions and a Secret Vision Model</p><p>Elon Musk's <a target="_blank" href="https://x.ai">X.ai</a> unveiled their API platform, focusing on enterprise applications with Grok 2 beta. They also teased an undisclosed vision model, and they had vision APIs for some folks who joined their hackathon. While these models aren't necessarily worth using yet, the next Grok-3 is promising to be a frontier model, and its relaxed approach to content moderation (what Elon is calling maximally seeking the truth) is going to be a convincing point for some folks! </p><p>I just wish they added fun mode and access to real time data from X! Right now it's just the Grok-2 model, priced at a very non-competitive $15/mTok 😒</p><p>Cohere Embed 3: Elevating Multimodal Embeddings (<a target="_blank" href="https://cohere.com/blog/multimodal-embed-3">Blog</a>)</p><p>Cohere launched Embed 3, enabling embeddings for both text and visuals such as graphs and designs. "While not the first multimodal embeddings, when it comes from Cohere, you know it's done right," I commented. </p><p>Open Source Power: JavaScript Transformers and SOTA Multilingual Models</p><p>The open-source AI community continues to impress, making powerful models accessible to all.</p><p>Massive kudos to Xenova (<a target="_blank" href="https://x.com/xenovacom/status/1848741677122654483">𝕏</a>) for the release of Transformers.js v3! The addition of WebGPU support results in a staggering "up to 100 times faster" performance boost for browser-based AI, dramatically simplifying local, private, and efficient model running. 
We also saw DeepSeek's Janus 1.3B, a multimodal image and text model, and Cohere For AI's Aya Expanse, supporting 23 languages.</p><p>This Week's Buzz: Hackathon Triumphs and Multimodal Weave</p><p>On ThursdAI, we also like to share some of the exciting things happening behind the scenes.</p><p>AI Chef Showdown: Second Place and Lessons Learned</p><p>Happy to report that team Yes Chef clinched <strong>second place</strong> in a hackathon with an unconventional creation: a Gordon Ramsay-inspired robotic chef hand puppet, complete with a cloned voice and visual LLM integration. We bought, 3D printed, and assembled an open source robotic arm, made it a ventriloquist operator by letting it animate a hand puppet, and cloned Ramsay's voice. It was so so much fun to build, and the code is here</p><p>Weave Goes Multimodal: Seeing and Hearing Your AI</p><p>Even more exciting was the opportunity to leverage Weave's newly launched multimodal functionality. "Weave supports you to see and play back everything that's audio generated," I shared, emphasizing its usefulness in debugging our vocal AI chef. </p><p>For a practical example, <a target="_blank" href="https://wandb.ai/wandb-public/yes-chef/weave/calls?filter=%7B%22opVersionRefs%22%3A%5B%22weave%3A%2F%2F%2Fwandb-public%2Fyes-chef%2Fop%2Fhandle_wake_word%3A*%22%5D%7D">here's</a> ALL the (NSFW) roasts that AI Chef has cooked me with, it's honestly horrifying haha. For full effect, turn on the background music first and then play the chef audio 😂</p><p>📽️ Video Generation Takes Center Stage: Mochi's Motion Magic and Runway's Acting Breakthrough</p><p>Video models made a quantum leap this week, pushing the boundaries of generative AI.</p><p>Genmo Mochi-1: Diffusion Transformers and Generative Motion</p><p>Genmo's Ajay Jain (Genmo) joined ThursdAI to discuss Mochi-1, their powerful new diffusion transformer. "We really focused on…prompt adherence and motion," he explained. 
Mochi-1's capacity to generate complex and realistic motion is truly remarkable, and with an HD version on its way, the future looks bright (and animated!). They also get bonus points for dropping a torrent link in the announcement <a target="_blank" href="https://x.com/genmoai/status/1848762405779574990">tweet</a>.</p><p></p><p>So far this Apache 2.0, 10B Diffusion Transformer is open source, but not for the GPU-poors, as it requires 4 GPUs to run; apparently there was already an attempt to run it on a single 4090, which, Ajay highlighted, was one of the reasons they open sourced it! </p><p>Runway Act-One: AI-Powered Puppetry and the Future of Acting (<a target="_blank" href="https://runwayml.com/research/introducing-act-one">blog</a>)</p><p>Ok this one absolutely seems bonkers! Runway unveiled Act-One! Forget just generating video from text; Act-One takes a driving video and character image to produce expressive and nuanced character performances. "It faithfully represents elements like eye-lines, micro expressions, pacing, and delivery," I noted, excited by the transformative potential for animation and filmmaking.</p><p>So no need for rigging, or motion capture suites on actors' faces; Runway now does this, so you can generate characters with Flux and animate them with Act-One 📽️ Just take a look at this insanity 👇 </p><p>11labs Creative Voices: Prompting Your Way to the Perfect Voice</p><p>11labs debuted an incredible feature: creating custom voices using only text prompts. Want a high-pitched squeak or a sophisticated British accent? Just ask. This feature makes bespoke voice creation significantly easier.</p><p>I was really really impressed by this, as this is perfect for my Skeleton Halloween project! So far I've struggled to get the voice "just right" between the awesome Cartesia voice that is not emotional enough, and the very awesome custom OpenAI voice that needs a prompt to act, and sometimes stops acting in the middle of a sentence. 
</p><p>With this new Elevenlabs feature, I can describe the exact voice I want <strong>with a prompt</strong>, and then keep iterating until I find the perfect one, and then boom, it's available for me! Great for character creation, and even greater for the above Act-One model, as you can now generate a character with Flux, drive the video with Act-One, and revoice yourself with a custom prompted voice from 11labs! Which is exactly what I'm going to build for the next hackathon! </p><p>If you'd like to support me in this journey, here's an 11labs affiliate <a target="_blank" href="https://elevenlabs.io/?from=partnergarcia1131">link</a> haha but I already got a yearly account so don't sweat it. </p><p>AI Art & Diffusion Updates: Stable Diffusion 3.5, Ideogram Canvas, and OpenAI's Sampler Surprise</p><p>The realm of AI art and diffusion models saw its share of action as well.</p><p>Stable Diffusion 3.5 (<a target="_blank" href="https://stability.ai/news/introducing-stable-diffusion-3-5">Blog</a>) and Ideogram Canvas: Iterative Improvements and Creative Control</p><p>Stability AI launched Stable Diffusion 3.5, bringing incremental enhancements to image quality and prompt accuracy. Ideogram, meanwhile, introduced Canvas, a groundbreaking interface enabling mixing, matching, extending, and fine-tuning AI-generated artwork. This opens doors to unprecedented levels of control and creative expression.</p><p>Midjourney also announced a web editor, and folks are freaking out, and I'm only left thinking, is MJ a bit of a cult? There are so many offerings out there, but it seems like everything MJ releases gets tons more excitement from that part of X than other, way more incredible stuff 🤔 </p><p>Seattle Pic</p><p>Ok wow that was a LOT of stuff to cover, honestly, the TL;DR for this week became so massive that I had to zoom out to take 1 screenshot of it all, and I wasn't sure we'd be able to cover all of it! 
</p><p>Massive massive week, super exciting releases, and the worst thing about this is, I barely have time to play with many of these!</p><p>But I'm hoping to have some time during the <a target="_blank" href="https://usewb.link/weave-sf-hackathon">Tinkerer AI hackathon</a> we're hosting on Nov 2-3 in our SF office, limited spots left, so come and hang with me and some of the Tinkerers team, and maybe even win a Meta Rayban special Weave prize! </p><p>RAW TL;DR + Show notes and links </p><p>* <strong>Open Source LLMs</strong> </p><p>* Xenova releases Transformers JS version 3 (<a target="_blank" href="https://x.com/xenovacom/status/1848741677122654483">X</a>)</p><p>* ⚡ WebGPU support (up to 100x faster than WASM)</p><p>* 🔢 New quantization formats (dtypes)</p><p>* 🏛 120 supported architectures in total</p><p>* 📂 25 new example projects and templates</p><p>* 🤖 Over 1200 pre-converted models</p><p>* 🌐 Node.js (ESM + CJS), Deno, and Bun compatibility</p><p>* 🏡 A new home on GitHub and NPM</p><p>* DeepSeek drops Janus 1.3B (<a target="_blank" href="https://x.com/osanseviero/status/1847185192823202079">X</a>, <a target="_blank" href="https://t.co/jq6hR6P79t">HF</a>, <a target="_blank" href="https://t.co/B1AgNu4ahi">Paper</a>)</p><p>* DeepSeek releases Janus 1.3B 🔥</p><p>* 🎨 Understands and generates both images and text</p><p>* 👀 Combines DeepSeek LLM 1.3B with SigLIP-L for vision </p><p>* ✂️ Decouples the vision encoding</p><p>* Cohere for AI releases Aya Expanse 8B, 32B (X, <a target="_blank" href="https://huggingface.co/CohereForAI/aya-expanse-8b">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/CohereForAI/aya_expanse">Try it</a>)</p><p>* Aya Expanse is an open-weight research release of a model with highly advanced multilingual capabilities. 
It focuses on pairing a highly performant pre-trained <a target="_blank" href="https://huggingface.co/CohereForAI/c4ai-command-r-plus"><strong>Command family</strong></a> of models with the result of a year’s dedicated research from <a target="_blank" href="https://cohere.for.ai/"><strong>Cohere For AI</strong></a>, including <a target="_blank" href="https://arxiv.org/pdf/2408.14960"><strong>data arbitrage</strong></a>, <a target="_blank" href="https://arxiv.org/abs/2407.02552"><strong>multilingual preference training</strong></a>, <a target="_blank" href="https://arxiv.org/abs/2406.18682"><strong>safety tuning</strong></a>, and <a target="_blank" href="https://arxiv.org/abs/2410.10801"><strong>model merging</strong></a>. The result is a powerful multilingual large language model serving 23 languages.</p><p>* 23 languages </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* New Claude Sonnet 3.5, Claude Haiku 3.5</p><p>* New Claude absolutely crushes coding benchmarks like Aider and Swe-bench verified. </p><p>* But I'm getting mixed signals from folks with internal benchmarks, as well as some other benches like Aidan Bench and Arc challenge in which it performs worse. 
</p><p>* 8K output token limit vs 4K</p><p>* Other folks swear by it, Skirano, Corbitt say it's an absolute killer coder</p><p>* Haiku is 2x the price of 4o-mini and Flash </p><p>* Anthropic Computer use API + docker (<a target="_blank" href="https://x.com/altryne/status/1848790306776379767">X</a>)</p><p>* Computer use is not new, see open interpreter etc </p><p>* Adept has been promising this for a while, so was LAM from rabbit.</p><p>* Now Anthropic has dropped a bomb on all these with a specific trained model to browse click and surf the web with a container</p><p>* Examples of computer use are super cool, Corbitt built agent.exe which uses it to control your computer</p><p>* Killian will join to talk about what this computer use means</p><p>* Folks are trying to <a target="_blank" href="https://x.com/nearcyan/status/1848875226043703762">order food</a>  (like Anthropic shows in their demo of ordering pizzas for the team) </p><p>* Claude launches code interpreter mode for <a target="_blank" href="http://claude.ai">claude.ai</a> (<a target="_blank" href="https://x.com/alexalbert__/status/1849471363456577806">X</a>)</p><p>* Cohere released Embed 3 for multimodal embeddings (<a target="_blank" href="https://cohere.com/blog/multimodal-embed-3">Blog</a>)</p><p>* 🔍 Multimodal Embed 3: Powerful AI search model</p><p>* 🌍 Unlocks value from image data for enterprises</p><p>* 🔍 Enables fast retrieval of relevant info & assets</p><p>* 🛒 Transforms e-commerce search with image search</p><p>* 🎨 Streamlines design process with visual search</p><p>* 📊 Improves data-driven decision making with visual insights</p><p>* 🔝 Industry-leading accuracy and performance</p><p>* 🌐 Multilingual support across 100+ languages</p><p>* 🤝 Partnerships with Azure AI and Amazon SageMaker</p><p>* 🚀 Available now for businesses and developers</p><p>* X ai has a new API platform + secret vision feature (<a target="_blank" href="https://docs.x.ai/docs">docs</a>)</p><p>* grok-2-beta $5.0 / $15.00 
mtok</p><p>* Apple releases iOS 18.2 beta with GenMoji, Visual Intelligence, ChatGPT integration & more</p><p>* Siri is still stupid, but can now ask chatGPT to write s**t</p><p>* <strong>This week's Buzz</strong></p><p>* Got second place for the hackathon with our AI Chef that roasts you in the kitchen (<a target="_blank" href="https://x.com/kwindla/status/1848868497314746727">X</a>, <a target="_blank" href="https://wandb.ai/wandb-public/yes-chef/weave/calls?filter=%7B%22opVersionRefs%22%3A%5B%22weave%3A%2F%2F%2Fwandb-public%2Fyes-chef%2Fop%2Fhandle_wake_word%3A*%22%5D%7D">Weave dash</a>)</p><p>* Weave is now multimodal and supports audio! (Weave)</p><p>* Tinkerers <a target="_blank" href="https://sf.aitinkerers.org/p/ai-tinkerers-humans-in-the-loop-agents-hackathon-with-google-cloud-weights-biases">Hackathon</a> in less than a week! </p><p>* <strong>Vision & Video</strong></p><p>* Genmo releases Mochi-1 txt2video model w/ Apache 2.0 license</p><p>* Gen mo - generative motion</p><p>* 10B DiT - diffusion transformer</p><p>* 5.5 seconds video</p><p>* Apache 2.0</p><p>* Comparison thread between Genmo Mochi-1 and Hailuo</p><p>* Genmo, the company behind Mochi 1, has raised $28.4M in Series A funding from various investors. Mochi 1 is an open-source video generation model that the company claims has "superior motion quality, prompt adherence and exceptional rendering of humans that begins to cross the uncanny valley." 
The company is open-sourcing their base 480p model, with an HD version coming soon.</p><p>Summary Bullet Points:</p><p>* Genmo announces $28.4M Series A funding</p><p>* Mochi 1 is an open-source video generation model</p><p>* Mochi 1 has "superior motion quality, prompt adherence and exceptional rendering of humans"</p><p>* Genmo is open-sourcing their base 480p Mochi 1 model</p><p>* HD version of Mochi 1 is coming soon</p><p>* Mochi 1 is available via Genmo's playground or as downloadable weights, or on Fal</p><p>* Mochi 1 is licensed under <strong>Apache 2.0</strong></p><p>* Rhymes AI - Allegro video model (<a target="_blank" href="https://x.com/rhymes_ai_/status/1848554123471544711">X</a>)</p><p>* Meta had a bunch of releases - SAM 2.1, Spirit LM</p><p>* Runway introduces puppetry video 2 video with emotion transfer (<a target="_blank" href="https://x.com/runwayml/status/1848785907723473001">X</a>)</p><p>* The webpage introduces Act-One, a new technology from Runway that allows for the generation of expressive character performances using a single driving video and character image, without the need for motion capture or rigging. Act-One faithfully represents elements like eye-lines, micro expressions, pacing, and delivery in the final generated output. 
It can translate an actor's performance across different character designs and styles, opening up new avenues for creative expression.</p><p>Summary in 10 Bullet Points:</p><p>* Act-One is a new technology from Runway</p><p>* It generates expressive character performances</p><p>* Uses a single driving video and character image</p><p>* No motion capture or rigging required</p><p>* Faithfully represents eye-lines, micro expressions, pacing, and delivery</p><p>* Translates performance across different character designs and styles</p><p>* Allows for new creative expression possibilities</p><p>* Works with simple cell phone video input</p><p>* Replaces complex, multi-step animation workflows</p><p>* Enables capturing the essence of an actor's performance</p><p>* Haiper releases a new video model</p><p>* Meta releases Sam 2.1</p><p>* Key updates to SAM 2:</p><p>* New data augmentation for similar and small objects</p><p>* Improved occlusion handling</p><p>* Longer frame sequences in training</p><p>* Tweaks to positional encoding</p><p>SAM 2 Developer Suite released:</p><p>* Open source code package</p><p>* Training code for fine-tuning</p><p>* Web demo front-end and back-end code</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI released custom voice support for chat completion API (<a target="_blank" href="https://x.com/altryne/status/1847076550866620537/video/1">X</a>, <a target="_blank" href="https://platform.openai.com/docs/guides/audio">Docs</a>)</p><p>* Pricing is still insane ($200/1mtok) </p><p>* This is not just TTS, this is advanced voice mode! </p><p>* The things you can do with them are very interesting, like asking for acting, or singing. 
</p><p>* 11labs' ability to create voices with a prompt is super cool (<a target="_blank" href="https://x.com/altryne/status/1849145950347915613">X</a>)</p><p>* Meta Spirit LM: An open source language model for seamless speech and text integration (<a target="_blank" href="https://ai.meta.com/blog/fair-news-segment-anything-2-1-meta-spirit-lm-layer-skip-salsa-lingua/">Blog</a>, <a target="_blank" href="https://ai.meta.com/resources/models-and-libraries/spirit-lm-downloads/">weights</a>)</p><p>* Meta Spirit LM is a multimodal language model that:</p><p>* Combines text and speech processing</p><p>* Uses word-level interleaving for cross-modality generation</p><p>* Has two versions:</p><p>* Base: uses phonetic tokens</p><p>* Expressive: uses pitch and style tokens for tone</p><p>* Enables more natural speech generation</p><p>* Can learn tasks like ASR, TTS, and speech classification</p><p>* MoonShine for audio </p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stable Diffusion 3.5 was released (<a target="_blank" href="https://x.com/StabilityAI/status/1848729212250951911">X</a>, <a target="_blank" href="https://stability.ai/news/introducing-stable-diffusion-3-5">Blog</a>, <a target="_blank" href="https://huggingface.co/stabilityai">HF</a>)</p><p>* including Stable Diffusion 3.5 Large and Stable Diffusion 3.5 Large Turbo.</p><p>* Stable Diffusion 3.5 Medium will be released on October 29th. </p><p>* Released under the permissive <a target="_blank" href="https://stability.ai/community-license-agreement">Stability AI Community License</a>. 
</p><p>* 🚀 Introducing Stable Diffusion 3.5 - powerful, customizable, and free models</p><p>* 🔍 Improved prompt adherence and image quality compared to previous versions</p><p>* ⚡️ Stable Diffusion 3.5 Large Turbo offers fast inference times</p><p>* 🔧 Multiple variants for different hardware and use cases</p><p>* 🎨 Empowering creators to distribute and monetize their work</p><p>* 🌐 Available for commercial and non-commercial use under permissive license</p><p>* 🔍 Listening to community feedback to advance their mission</p><p>* 🔄 Stable Diffusion 3.5 Medium to be released on October 29th</p><p>* 🤖 Commitment to transforming visual media with accessible AI tools</p><p>* 🔜 Excited to see what the community creates with Stable Diffusion 3.5</p><p>* Ideogram released Canvas (<a target="_blank" href="https://x.com/ideogram_ai/status/1848757699606983143">X</a>)</p><p>* Canvas is a mix of Krea and Everart</p><p>* Ideogram is a free AI tool for generating realistic images, posters, logos</p><p>* Extend tool allows expanding images beyond original borders</p><p>* Magic Fill tool enables editing specific image regions and details</p><p>* Ideogram Canvas is a new interface for organizing, generating, editing images</p><p>* Ideogram uses AI to enhance the creative process with speed and precision</p><p>* Developers can integrate Ideogram's Magic Fill and Extend via the API</p><p>* Privacy policy and other legal information available on the website</p><p>* Ideogram is free-to-use, with paid plans offering additional features</p><p>* Ideogram is available globally, with support for various browsers</p><p>* OpenAI released a new sampler paper trying to beat diffusers (<a target="_blank" href="https://openai.com/index/simplifying-stabilizing-and-scaling-continuous-time-consistency-models/">Blog</a>)</p><p>* Researchers at OpenAI have developed a new approach called sCM that simplifies the theoretical formulation of continuous-time consistency models, allowing them to stabilize and 
scale the training of these models for large datasets. The sCM approach achieves sample quality comparable to leading diffusion models, while using only two sampling steps - a 50x speedup over traditional diffusion models. Benchmarking shows sCM produces high-quality samples using less than 10% of the effective sampling compute required by other state-of-the-art generative models. The key innovation is that sCM models scale commensurately with the teacher diffusion models they are distilled from. As the diffusion models grow larger, the relative difference in sample quality between sCM and the teacher model diminishes. This allows sCM to leverage the advances in diffusion models to achieve impressive sample quality and generation speed, unlocking new possibilities for real-time, high-quality generative AI across domains like images, audio, and video.</p><p>* 🔍 Simplifying continuous-time consistency models</p><p>* 🔨 Stabilizing training for large datasets</p><p>* 🔍 Scaling to 1.5 billion parameters on ImageNet</p><p>* ⚡ 2-step sampling for 50x speedup vs. diffusion</p><p>* 🎨 Comparable sample quality to diffusion models</p><p>* 📊 Benchmarking against state-of-the-art models</p><p>* 🗺️ Visualization of diffusion vs. consistency models</p><p>* 🖼️ Selected 2-step samples from 1.5B model</p><p>* 📈 Scaling sCM with teacher diffusion models</p><p>* 🔭 Limitations and future work</p><p>* Midjourney announces an editor (<a target="_blank" href="https://x.com/midjourney/status/1849213116489564239">X</a>)</p><p>* announces the release of two new features for Midjourney users - an image editor for uploaded images and </p><p>* image re-texturing for exploring materials, surfacing, and lighting. </p><p>* These features will initially be available only to yearly members, members who have been subscribers for the past 12 months, and members with at least 10,000 images. 
</p><p>* The post emphasizes the need to give the community, human moderators, and AI moderation systems time to adjust to the new features</p><p>* Tools</p><p>PS : Subscribe to the newsletter and podcast, and I'll be back next week with more AI escapades! 🫶</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-24-claude-35-controls</link><guid isPermaLink="false">substack:post:150690327</guid><dc:creator><![CDATA[Alex Volkov, Simon Willison, and Killian Lucas]]></dc:creator><pubDate>Fri, 25 Oct 2024 01:26:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150690327/e84158f1a246efd7507b330ba9dba890.mp3" length="83767376" type="audio/mpeg"/><itunes:author>Alex Volkov, Simon Willison, and Killian Lucas</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6980</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/150690327/d17e306dff52674068f6034c49e43208.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Oct 17 - Robots, Rockets, and Multi Modal Mania with open source voice cloning, OpenAI new voice API and more AI news]]></title><description><![CDATA[<p>Hey folks, Alex here from Weights & Biases, and this week has been absolutely bonkers. From robots walking among us to rockets landing on chopsticks (well, almost), the future is feeling palpably closer. And if real-world robots and reusable spaceship boosters weren't enough, the open-source AI community has been <em>cooking</em>, dropping new models and techniques faster than a Starship launch. 
So buckle up, grab your space helmet and noise-canceling headphones (we’ll get to why those are important!), and let's blast off into this week’s AI adventures!</p><p>TL;DR and show-notes + links at the <strong>end</strong> of the post 👇</p><p><strong>Robots and Rockets: A Glimpse into the Future</strong></p><p>I gotta start with the real-world stuff because, let's be honest, it's mind-blowing. We had Robert Scoble (yes, <em>the</em> Robert Scoble) join us after attending the Tesla We, Robot AI event, reporting on Optimus robots strolling through crowds, serving drinks, and generally being ridiculously futuristic. Autonomous robo-taxis were also cruising around, giving us a taste of a driverless future.</p><p>Robert’s enthusiasm was infectious: <em>"It was a vision of the future, and from that standpoint, it succeeded wonderfully."</em> I couldn't agree more. While the market might have had a mini-meltdown (apparently investors aren't ready for robot butlers yet), the sheer audacity of Tesla’s vision is exhilarating. These robots aren't just cool gadgets; they represent a fundamental shift in how we interact with technology and the world around us. And they’re learning fast. Just days after the event, Tesla released a video of Optimus operating autonomously, showcasing the rapid progress they’re making.</p><p>And speaking of audacious visions, SpaceX decided to one-up everyone (including themselves) by launching Starship and <em>catching</em> the booster with Mechazilla – their giant robotic chopsticks (okay, technically a launch tower, but you get the picture). Waking up early with my daughter to watch this live was pure magic. As Ryan Carson put it, "<em>It was magical watching this… my kid who's 16… all of his friends are getting their imaginations lit by this experience."</em> That’s exactly what we need - more imagination and less doomerism! 
The future is coming whether we like it or not, and I, for one, am excited.</p><p><strong>Open Source LLMs and Tools: The Community Delivers (Again!)</strong></p><p>Okay, back to the virtual world (for now). This week's open-source scene was electric, with new model releases and tools that have everyone buzzing (and benchmarking like crazy!).</p><p>* <strong>Nemotron 70B: Hype vs. Reality:</strong> NVIDIA dropped their Nemotron 70B instruct model, claiming impressive scores on certain benchmarks (Arena Hard, AlpacaEval), even suggesting it outperforms GPT-4 and Claude 3.5. As always, we take these claims with a grain of salt (remember Reflection?), and our resident expert, Nisten, was quick to run his own tests. The verdict? Nemotron is good, <em>"a pretty good model to use,"</em> but maybe not the giant-killer some hyped it up to be. Still, kudos to NVIDIA for pushing the open-source boundaries. (<a target="_blank" href="https://hf.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF">Hugging Face</a>, <a target="_blank" href="https://x.com/Sentdex/status/1846699728450298339">Harrison Kingsley evals</a>)</p><p>* <strong>Zamba 2: Hybrid Vigor:</strong> <a target="_blank" href="https://x.com/ZyphraAI/status/1846257760192942342">Zyphra</a>, in collaboration with NVIDIA, released <a target="_blank" href="https://huggingface.co/spaces/Zyphra/Zamba2-7B">Zamba 2</a>, a hybrid Sparse Mixture of Experts (SMoE) model. We had <a target="_blank" href="https://x.com/PaoloGlorioso1">Paolo Glorioso</a>, a researcher from Zyphra, join us to break down this unique architecture, which combines the strengths of transformers and state space models (SSMs). He highlighted the memory and latency advantages of SSMs, especially for on-device applications. 
Definitely worth checking out if you’re interested in transformer alternatives and efficient inference.</p><p>* <strong>Zyda 2: Data is King (and Queen):</strong> Alongside Zamba 2, Zyphra also dropped Zyda 2, a massive 5 trillion token dataset, filtered, deduplicated, and ready for LLM training. This kind of open-source data release is a huge boon to the community, fueling the next generation of models. (<a target="_blank" href="https://x.com/ZyphraAI/status/1846257760192942342">X</a>)</p><p>* <strong>Ministral: Pocket-Sized Power:</strong> On the one-year anniversary of the iconic Mistral 7B release, Mistral announced two new smaller models – Ministral 3B and 8B. Designed for on-device inference, these models are impressive, but as always, Qwen looms large. While Mistral didn’t include Qwen in their comparisons, early tests suggest Qwen’s smaller models still hold their own. One point of contention: these Ministrals aren't as open-source as the original 7B, which is a bit of a bummer, with the 3B not even being released anywhere besides their platform. (<a target="_blank" href="https://mistral.ai/news/ministraux/">Mistral Blog</a>)</p><p>* <strong>Entropix (aka Shrek Sampler): Thinking Outside the (Sample) Box:</strong> This one is intriguing! Entropix introduces a novel sampling technique aimed at boosting the reasoning capabilities of smaller LLMs. Nisten’s yogurt analogy explains it best: it’s about “marinating” the information and picking the best “flavor” (token) at the end. <a target="_blank" href="https://x.com/_xjdr/status/1846640821107675618">Early examples</a> look promising, suggesting Entropix could help smaller models tackle problems that even trip up their larger counterparts. But, as with all shiny new AI toys, we're eagerly awaiting robust evals. 
Tim Kellogg has a detailed breakdown of this method <a target="_blank" href="https://timkellogg.me/blog/2024/10/10/entropix">here</a></p><p>* <strong>Gemma-APS: Fact-Finding Mission:</strong> Google released Gemma-APS, a set of models specifically designed for extracting claims and facts from text. While LLMs can already do this to some extent, a dedicated model for this task is definitely interesting, especially for applications requiring precise information retrieval. (<a target="_blank" href="https://huggingface.co/google/gemma-7b-aps-it">HF</a>)</p><p> 🔥 OpenAI adds voice to their completion API (<a target="_blank" href="https://x.com/OpenAIDevs/status/1846972985170972923">X</a>, <a target="_blank" href="https://platform.openai.com/docs/guides/audio">Docs</a>)</p><p>In the last second of the pod, OpenAI decided to grace us with Breaking News! </p><p>Not only did they launch their Windows native app, but they also added <strong>voice input and output to their completion APIs</strong>. This seems to be the same model as the advanced voice mode (and priced super expensively as well) and the one they used in the RealTime API released a few weeks ago at DevDay. </p><p>This is of course a bit slower than RealTime but is much simpler to use, and gives way more developers access to this incredible resource (I'm definitely planning to use this for ... things 😈) </p><p>This isn't their TTS or STT (Whisper) models, no, this is an actual omni model that understands audio natively and also outputs audio natively, allowing for things like "count to 10 super slow"</p><p>I've played with it just now (and now it's after 6pm and I'm still writing this newsletter) and it's so so awesome, I expect it to be huge because the RealTime API is very cumbersome and many people don't really need this complexity. 
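</p><p>To make the new endpoint concrete, here is a minimal sketch of a request to the audio-capable chat completions API. The "gpt-4o-audio-preview" model name comes from the linked docs; the rest is illustrative, and the code only builds the JSON payload (and shows how to decode the base64 audio a response would carry) without making any network call:</p>

```python
import base64
import json

def build_audio_request(prompt, voice="alloy"):
    # Request body for a chat completion that returns text plus spoken audio.
    # "gpt-4o-audio-preview" is the model name from OpenAI's audio guide.
    return {
        "model": "gpt-4o-audio-preview",
        "modalities": ["text", "audio"],             # ask for audio output too
        "audio": {"voice": voice, "format": "wav"},  # voice + container format
        "messages": [{"role": "user", "content": prompt}],
    }

def save_response_audio(b64_wav, path):
    # The API returns audio as a base64 string; decode it to a playable .wav.
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_wav))

payload = build_audio_request("Count to 10 super slow")
print(json.dumps(payload, indent=2))
```

<p>POST that payload (with your API key) to the chat completions endpoint and you get back both a text transcript and base64-encoded WAV audio, which is what makes tricks like "count to 10 super slow" possible.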
</p><p>This weeks Buzz - Weights & Biases updates </p><p> Ok, I wanted to send a completely different update, but what I'll show you instead is that <a target="_blank" href="https://dub.link/weave-thursdai-oct17">Weave</a>, our observability framework, is now also multimodal! </p><p>This couples very well with the new update from OpenAI! </p><p>So here's an example usage with today's announcement: I'm going to go through the OpenAI example and show you how to use it with streaming so you can get the audio faster, and show you the Weave multimodality as well 👇</p><p>You can find the code for this in this <a target="_blank" href="https://dub.link/weave-gist">Gist</a> and please give us feedback as this is brand new</p><p>Non-standard use-cases of AI corner</p><p>This week I started noticing and collecting some incredible use-cases of Gemini and its long context and multimodality and wanted to share with you guys, so we had some incredible conversations about non-standard use cases that are pushing the boundaries of what's possible with LLMs.</p><p><a target="_blank" href="https://x.com/hrishioa/status/1846222504018563210">Hrishi</a> blew me away with his experiments using Gemini for transcription and diarization. Turns out, Gemini is not only <strong>great at transcription </strong>(it beats Whisper!), it's also ridiculously cheaper than dedicated ASR models like Whisper, around 60x cheaper! He emphasized the unexplored potential of prompting multimodal models, adding, "<em>the prompting on these things… is still poorly understood."</em> So much room for innovation here!</p><p><a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> then stole the show with his <a target="_blank" href="https://simonwillison.net/2024/Oct/17/video-scraping/">mind-bending</a> screen-scraping technique. He recorded a video of himself clicking through emails, fed it to Gemini Flash, and got perfect structured data in return. 
This trick isn’t just clever; it’s practically free, thanks to the ridiculously low cost of Gemini Flash. I even tried it myself, recording my X bookmarks and getting a near-perfect TLDR of the week’s AI news. The future of data extraction is here, and it involves screen recordings and very cheap (or free) LLMs.</p><p>Here's Simon's example of how much this would cost him had he actually been charged for it. 🤯</p><p>Speaking of <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a>, he broke the news that NotebookLM has got an <a target="_blank" href="https://t.co/OkQeJ8xxxH">upgrade</a>, with the ability to steer the speakers with custom commands, which Simon promptly used to ask the overview hosts to talk like pelicans. </p><p><strong>Voice Cloning, Adobe Magic, and the Quest for Real-Time Avatars</strong></p><p>Voice cloning also took center stage this week, with the release of F5-TTS. This open-source model performs zero-shot voice cloning with just a few seconds of audio, raising all sorts of ethical questions (and exciting possibilities!). I played a sample on the show, and it was surprisingly convincing (though not without its problems) for a local model! </p><p>This, combined with <a target="_blank" href="https://github.com/fudan-generative-vision/hallo2?tab=readme-ov-file">Hallo 2</a>'s (also released this week!) ability to animate talking avatars, has Wolfram Ravenwolf dreaming of real-time AI assistants with personalized faces and voices. The pieces are falling into place, folks.</p><p>And for all you Adobe fans, Firefly Video has landed! This “commercially safe” text-to-video and image-to-video model is seamlessly integrated into Premiere, offering incredible features like extending video clips with AI-generated frames. 
Photoshop also got some Firefly love, with mind-bending relighting capabilities that could make AI-generated images indistinguishable from real photographs.</p><p><strong>Wrapping Up:</strong></p><p>Phew, that was a marathon, not a sprint! From robots to rockets, open source to proprietary, and voice cloning to video editing, this week has been a wild ride through the ever-evolving landscape of AI. Thanks for joining me on this adventure, and as always, keep exploring, keep building, and keep pushing those AI boundaries. The future is coming, and it’s going to be amazing.</p><p><strong>P.S.</strong> Don’t forget to subscribe to the podcast and newsletter for more AI goodness, and if you’re in Seattle next week, come say hi at the AI Tinkerers meetup. I’ll be demoing my Halloween AI toy – it’s gonna be spooky!</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>TL;DR - Show Notes and Links</p><p>* <strong>Open Source LLMs</strong></p><p>* <strong>Nvidia releases Llama 3.1-Nemotron-70B instruct:</strong> Outperforms GPT-4o and Anthropic Claude 3.5 on several benchmarks. Available on Hugging Face and Nvidia. (<a target="_blank" href="https://x.com/_philschmid/status/1846527494351998980">X</a>, <a target="_blank" href="https://x.com/Sentdex/status/1846699728450298339">Harrison Eval</a>)</p><p>* <strong>Zamba2-7B:</strong> A hybrid Sparse Mixture of Experts model from Zyphra and Nvidia. Claims to outperform Mistral, Llama2, and Gemma models in the 7-8B weight class. (<a target="_blank" href="https://x.com/ZyphraAI/status/1845939850958327822">X</a>, <a target="_blank" href="https://huggingface.co/spaces/Zyphra/Zamba2-7B">HF</a>)</p><p>* <strong>Zyda-2:</strong> A 5 trillion token dataset distilled from high-quality sources for training LLMs. Released by Zyphra and Nvidia. 
(<a target="_blank" href="https://x.com/ZyphraAI/status/1846257760192942342">X</a>)</p><p>* Ministral 3B & 8B - Mistral releases 2 new models for on-device use, claims SOTA (<a target="_blank" href="https://mistral.ai/news/ministraux/">Blog</a>)</p><p>* Entropix aims to mimic advanced reasoning in small LLMs (<a target="_blank" href="https://github.com/xjdr-alt/entropix">Github</a>, <a target="_blank" href="https://timkellogg.me/blog/2024/10/10/entropix">Breakdown</a>)</p><p>* <strong>Google releases Gemma-APS:</strong> A collection of Gemma models for text-to-propositions segmentation, distilled from Gemini Pro and fine-tuned on synthetic data. (<a target="_blank" href="https://huggingface.co/collections/google/gemma-aps-release-66e1a42c7b9c3bd67a0ade88">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI ships advanced voice model in chat completions API endpoints with multimodality (<a target="_blank" href="https://x.com/OpenAIDevs/status/1846972985170972923">X, Docs</a>, <a target="_blank" href="https://x.com/altryne/status/1847076550866620537">My Example</a>)</p><p>* Amazon, Microsoft, Google all announce nuclear power for AI future</p><p>* <a target="_blank" href="Yi-01.AI"><strong>Yi-01.AI</strong></a><strong> launches Yi-Lightning:</strong> A proprietary model accessible via API.</p><p>* <strong>New Gemini API parameters:</strong> Google has shipped new Gemini API parameters, including logprobs, candidateCount, presencePenalty, seed, frequencyPenalty, and model_personality_in_response.</p><p>* Google NotebookLM is no longer "experimental" and now allows for "steering" the hosts (<a target="_blank" href="https://x.com/joshtwoodward/status/1846946225251406039">Announcement</a>)</p><p>* XAI - GROK 2 and Grok2-mini are now available via API in OpenRouter - (<a target="_blank" href="https://x.com/OpenRouterAI/status/1845549651811824078">X</a>, <a target="_blank" href="https://openrouter.ai/x-ai/grok-2-mini">OR</a>)</p><p>* This weeks Buzz (What I 
learned with WandB this week)</p><p>* Weave is now MultiModal (supports audio and text!) (<a target="_blank" href="https://x.com/altryne/status/1847076550866620537">X</a>, <a target="_blank" href="https://dub.link/weave-gist">Github Example</a>)</p><p>* Vision & Video</p><p>* <strong>Adobe Firefly Video:</strong> Adobe's first commercially safe text-to-video and image-to-video generation model. Supports prompt coherence. (<a target="_blank" href="https://x.com/koltregaskes/status/1845846420877685097">X</a>)</p><p>* Voice & Audio</p><p>* <strong>Ichigo-Llama3.1 Local Real-Time Voice AI:</strong> Improvements allow it to talk back, recognize when it can't comprehend input, and run on a single Nvidia 3090 GPU. (<a target="_blank" href="https://x.com/homebrewltd/status/1845685589376647654">X</a>)</p><p>* <strong>F5-TTS:</strong> Performs zero-shot voice cloning with less than 15 seconds of audio, using audio clips to generate additional audio. (<a target="_blank" href="https://huggingface.co/SWivid/F5-TTS">HF</a>, <a target="_blank" href="https://huggingface.co/papers/2410.06885">Paper</a>)</p><p>* AI Art & Diffusion & 3D</p><p>* <strong>RF-Inversion:</strong> Zero-shot inversion and editing framework for Flux, introduced by Litu Rout. Allows for image editing and personalization without training, optimization, or prompt-tuning. (<a target="_blank" href="https://x.com/litu_rout_/status/1846046009668878799">X</a>)</p><p>* Tools</p><p>* <strong>Fastdata:</strong> A library for synthesizing 1B tokens. (<a target="_blank" href="https://x.com/ncooper57/thread/1846612127911760261">X</a>)</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-17-robots-rockets-and</link><guid isPermaLink="false">substack:post:150381664</guid><dc:creator><![CDATA[Alex Volkov and Simon Willison]]></dc:creator><pubDate>Fri, 18 Oct 2024 02:36:18 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150381664/9ba6ad810119059278f24f4bf3fd2a2f.mp3" length="68516084" type="audio/mpeg"/><itunes:author>Alex Volkov and Simon Willison</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5710</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/150381664/1b938477296f240546e37f6dba3b90f9.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Oct 10 - Two Nobel Prizes in AI!? Meta Movie Gen (and sounds) amazing, Pyramid Flow a 2B video model, 2 new VLMs & more AI news!]]></title><description><![CDATA[<p>Hey Folks, we are finally due for a "relaxing" week in AI, no more HUGE company announcements (if you don't consider Meta Movie Gen huge), no conferences or dev days, and some time for Open Source projects to shine. (while we all wait for Opus 3.5 to shake things up) </p><p>This week was very multimodal on the show: we covered 2 new video models, one that's tiny and open source, and one massive from Meta that is aiming for Sora's crown, plus 2 new VLMs, one from our friends at REKA that understands videos and audio, while the other from Rhymes is Apache 2 licensed. We also had a chat with Kwindla Kramer about the OpenAI RealTime API, its shortcomings, and voice AIs in general. </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>All right, let's TL;DR and show notes, and we'll start with the 2 Nobel prizes in AI 👇 </p><p>* <strong>2 AI nobel prizes</strong></p><p>* John Hopfield and Geoffrey Hinton have been awarded a Physics Nobel prize</p><p>* Demis Hassabis, John Jumper & David Baker, have been awarded this year's <a target="_blank" href="https://x.com/hashtag/NobelPrize?src=hashtag_click">#NobelPrize</a> in Chemistry.</p><p>* <strong>Open Source LLMs & VLMs</strong></p><p>* TxT360: a globally deduplicated dataset for LLM pre-training<strong> ( </strong><a target="_blank" href="https://huggingface.co/spaces/LLM360/TxT360"><strong>Blog</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/datasets/LLM360/TxT360"><strong>Dataset</strong></a><strong>)</strong></p><p>* Rhymes <strong>Aria</strong> - 25.3B multimodal MoE model that can take image/video inputs Apache 2 (<a target="_blank" href="https://rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model">Blog</a>, <a target="_blank" href="https://huggingface.co/rhymes-ai/Aria">HF</a>, <a target="_blank" href="https://rhymes.ai/">Try It</a>)</p><p>* Maitrix and LLM360 launch a new decentralized arena (<a target="_blank" href="https://huggingface.co/spaces/LLM360/de-arena">Leaderboard</a>, <a target="_blank" href="https://de-arena.maitrix.org/">Blog</a>)</p><p>* New Gradio 5 with server side rendering (<a target="_blank" href="https://x.com/Gradio/status/1844142439017386216">X</a>)</p><p>* LLamaFile now comes with a chat interface and syntax highlighting (<a target="_blank" href="https://x.com/JustineTunney/status/1843729427706065000">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI releases MLEBench - new kaggle focused benchmarks for AI Agents (<a target="_blank" href="https://arxiv.org/abs/2410.07095">Paper</a>, <a target="_blank" 
href="https://github.com/openai/mle-bench">Github</a>)</p><p>* <strong>Inflection</strong> is still alive - going for enterprise lol (<a target="_blank" href="https://inflection.ai/blog/enterprise">Blog</a>)</p><p>* new <strong>Reka</strong> Flash 21B - (<a target="_blank" href="https://x.com/RekaAILabs/status/1843298155682820566">X</a>, <a target="_blank" href="https://www.reka.ai/news/reka-flash-updates">Blog</a>, <a target="_blank" href="https://chat.reka.ai/chat/09olSjjQnLGVDY6WFfAa">Try It</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* We chatted about Cursor, it went <a target="_blank" href="https://x.com/altryne/status/1843738554352185542">viral</a>, there are many tips</p><p>* WandB releases HEMM - benchmarks of text-to-image generation models (<a target="_blank" href="https://x.com/soumikRakshit96/status/1841893461060157501">X</a>, <a target="_blank" href="https://github.com/wandb/Hemm">Github</a>, <a target="_blank" href="https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations">Leaderboard</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Meta presents Movie Gen <strong>30B</strong> - img and text to video models (<a target="_blank" href="https://ai.meta.com/research/movie-gen/">blog</a>, <a target="_blank" href="https://ai.meta.com/static-resource/movie-gen-research-paper">paper</a>)</p><p>* Pyramid Flow - open source img2video model MIT license (<a target="_blank" href="https://x.com/reach_vb/status/1844241948233826385">X</a>, <a target="_blank" href="https://pyramid-flow.github.io/">Blog</a>, <a target="_blank" href="https://huggingface.co/rain1011/pyramid-flow-sd3">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2410.05954">Paper</a>, <a target="_blank" href="https://github.com/jy0205/Pyramid-Flow">Github</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Working with OpenAI RealTime Audio - Alex conversation with Kwindla from trydaily.com</p><p>* Cartesia Sonic goes multilingual (<a target="_blank" 
href="https://x.com/cartesia_ai/status/1843353627836236070">X</a>)</p><p>* Voice hackathon in SF with 20K prizes (and a remote track) - <a target="_blank" href="https://lu.ma/6www8b0t">sign up</a></p><p>* <strong>Tools</strong></p><p>* LM Studio ships with MLX natively (<a target="_blank" href="https://x.com/LMStudioAI/status/1843715603892449315">X</a>, <a target="_blank" href="https://lmstudio.ai/">Download</a>)</p><p>* <a target="_blank" href="http://UITHUB.com">UITHUB.com</a> - turn any github repo into 1 long file for LLMs</p><p>A Historic Week: TWO AI Nobel Prizes!</p><p>This week wasn't just big; it was HISTORIC. As Yam put it, "two Nobel prizes for AI in a single week. It's historic." And he's absolutely spot on! Geoffrey Hinton, often called the "grandfather of modern AI," alongside John Hopfield, were awarded the Nobel Prize in Physics for their foundational work on neural networks - work that paved the way for everything we're seeing today. Think back propagation, Boltzmann machines – these are concepts that underpin much of modern deep learning. It’s about time they got the recognition they deserve!</p><p>Yoshua Bengio posted about this in a very nice quote: </p><p><a target="_blank" href="https://x.com/HopfieldJohn">@HopfieldJohn</a> and <a target="_blank" href="https://x.com/geoffreyhinton">@geoffreyhinton</a>, along with collaborators, have created a beautiful and insightful bridge between physics and AI. They invented neural networks that were not only inspired by the brain, but also by central notions in physics such as energy, temperature, system dynamics, energy barriers, the role of randomness and noise, connecting the local properties, e.g., of atoms or neurons, to global ones like entropy and attractors. And they went beyond the physics to show how these ideas could give rise to memory, learning and generative models; concepts which are still at the forefront of modern AI research</p><p>And Hinton's post-Nobel quote? 
Pure gold: “<strong>I’m particularly proud of the fact that one of my students fired Sam Altman</strong>." He went on to explain his concerns about OpenAI's apparent shift in focus from safety to profits. Spicy take! It sparked quite a conversation about the ethical implications of AI development and who’s responsible for ensuring its safe deployment. It’s a discussion we need to be having more and more as the technology evolves. Can you guess which one of his students it was? </p><p>Then, not to be outdone, Demis Hassabis and John Jumper of the AlphaFold team, together with David Baker, snagged the Nobel Prize in Chemistry, in part for AlphaFold 2. This AI revolutionized protein folding, accelerating drug discovery and biomedical research in a way no one thought possible. These awards highlight the tangible, real-world applications of AI. It's not just theoretical anymore; it's transforming industries.</p><p>Congratulations to all winners, and we gotta wonder, is this the start of a trend of AI taking over every Nobel prize going forward? 🤔 </p><p>Open Source LLMs & VLMs: The Community is COOKING!</p><p>The open-source AI community consistently punches above its weight, and this week was no exception. We saw some truly impressive releases that deserve a standing ovation. First off, the TxT360 dataset (<a target="_blank" href="https://huggingface.co/spaces/LLM360/TxT360">blog</a>, <a target="_blank" href="https://huggingface.co/datasets/LLM360/TxT360">dataset</a>). Nisten, our resident technical expert, broke down the immense effort: "The amount of DevOps and…operations to do this work is pretty rough." </p><p>This globally deduplicated <strong>15+ trillion-token corpus</strong> combines the best of Common Crawl with a curated selection of high-quality sources, setting a new standard for open-source LLM training. We talked about the importance of deduplication for model training - avoiding the "memorization" of repeated information that can skew a model's understanding of language. 
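Exact deduplication is typically done by hashing a normalized copy of each document and keeping only the first occurrence per hash (corpora at this scale also layer fuzzy methods like MinHash on top). Here's a minimal sketch of the exact-match pass, to illustrate the idea, not TxT360's actual pipeline:

```python
import hashlib

def dedup_exact(documents):
    """Keep the first occurrence of each document, comparing by a hash
    of lightly normalized text (lowercased, whitespace-collapsed)."""
    seen = set()
    unique = []
    for doc in documents:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedup_exact(docs))  # → ['The cat sat.', 'A different document.']
```

Hashing keeps memory bounded (one digest per document instead of the full text), which matters when the corpus is trillions of tokens.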
TxT360 takes a 360-degree approach to data quality <em>and</em> documentation – a huge win for accessibility.</p><p>Apache 2 Multimodal MoE from Rhymes AI called Aria (<a target="_blank" href="https://rhymes.ai/blog-details/aria-first-open-multimodal-native-moe-model">blog</a>, <a target="_blank" href="https://huggingface.co/rhymes-ai/Aria">HF</a>, <a target="_blank" href="https://rhymes.ai/">Try It</a>)</p><p>Next, the Rhymes <strong>Aria</strong> model (25.3B total and only 3.9B active parameters!) This multimodal marvel operates as a Mixture of Experts (MoE), meaning it activates only the necessary parts of its vast network for a given task, making it surprisingly efficient. Aria excels in understanding image and video inputs, features a generous 64K token context window, <em>and</em> is available under the <strong>Apache</strong> 2 license – music to open-source developers’ ears! We even discussed its coding capabilities: imagine pasting images of code and getting intelligent responses.</p><p>I particularly love the focus on long multimodal input understanding (think longer videos) and super high resolution image support. </p><p>I uploaded this simple pin-out diagram of a Raspberry Pi and it got all the answers right! Including ones I missed myself (and <a target="_blank" href="https://x.com/altryne/status/1844465069041615238">won against</a> Gemini 002 and the new Reka Flash!) </p><p>Big Companies and APIs</p><p>OpenAI's new agentic benchmark, can it compete with MLEs on Kaggle?</p><p>OpenAI snuck in a new benchmark, MLEBench (<a target="_blank" href="https://arxiv.org/abs/2410.07095">Paper</a>, <a target="_blank" href="http://github.com/openai/mle-bench/">Github</a>), specifically designed to evaluate AI agents' performance on Machine Learning Engineering tasks. 
It's designed around a curated collection of Kaggle competitions, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. </p><p>They found that the best-performing setup (OpenAI's o1-preview with AIDE scaffolding) achieves at least the level of a Kaggle bronze medal in <strong>16.9% of competitions</strong> (though there are some that <a target="_blank" href="https://x.com/tunguz/status/1844459668619383122">throw shade</a> on this score)</p><p>Meta comes for our reality with Movie Gen</p><p>But let's be honest, Meta stole the show this week with Movie Gen (<a target="_blank" href="https://ai.meta.com/research/movie-gen/">blog</a>). This isn’t your average video generation model; it’s like something straight out of science fiction. Imagine creating long, high-definition videos, with different aspect ratios, personalized elements, <em>and</em> accompanying audio – all from text and image prompts. It's like the Holodeck is finally within reach! </p><p>Unfortunately, despite hinting at its size (30B), Meta is not releasing this model (just yet), nor is it widely available so far! But we'll keep our fingers crossed that it drops before SORA. </p><p>One super notable thing: this model also generates audio to accompany the video, and it's quite impressive. We listened to a few examples from Meta’s demo, and the sound effects were truly remarkable – everything from fireworks to rustling leaves. This model isn't just creating video, it's crafting experiences. 
(Sound on for the next example!)</p><p>They also have personalization built in, which is showcased here by one of the Llama leads, Roshan, as a scientist doing experiments, and the realism is quite awesome to see (but I get why they are afraid of releasing this in open weights)</p><p>This Week’s Buzz: What I learned at Weights & Biases this week</p><p>My "buzz" this week was less about groundbreaking models and more about mastering the AI tools we have. We had a team meeting to share our best tips and tricks for using Cursor, and when I shared those insights on X (<a target="_blank" href="https://x.com/altryne/status/1843738554352185542">thread</a>), they went surprisingly viral! </p><p>The big takeaway from the thread? Composer, Cursor’s latest feature, is a true game-changer. It allows for more complex refactoring and code generation across multiple files – the kind of stuff that would take hours manually. If you haven't tried Composer, you're seriously missing out. We also covered strategies for leveraging different models for specific tasks, like using o1-mini for outlining and then switching to the more robust Claude 3.5 for generating code. Another gem we uncovered: selecting any text in the console and hitting opt+D will immediately send it to the chat to debug, super useful! </p><p>Over at Weights & Biases, my talented teammate, Soumik, released HEMM (<a target="_blank" href="https://x.com/soumikRakshit96/status/1841893461060157501">X</a>, <a target="_blank" href="https://github.com/wandb/Hemm">Github</a>, <a target="_blank" href="https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations">Leaderboard</a>), a comprehensive benchmark specifically designed for text-to-image generation models. Want to know how different models fare on image quality and prompt comprehension? Head over to the leaderboard on Weave (<a target="_blank" href="https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations">Leaderboard</a>) and find out! 
And yes, it's true, Weave, our LLM observability tool, is multimodal (well within the theme of today's update)</p><p>Voice and Audio: Real-Time Conversations and the Quest for Affordable AI</p><p>OpenAI's DevDay was just a few weeks back, but the ripple effects of their announcements are still being felt. The big one for voice AI enthusiasts like myself? The RealTime API, offering developers a direct line to Advanced Voice Mode. My initial reaction was pure elation – finally, a chance to build some seriously interactive voice experiences that sound incredible in near real time! </p><p>That feeling was quickly followed by a sharp intake of breath when I saw the price tag. As I discovered building my Halloween project, real-time streaming of this caliber isn’t exactly budget-friendly (yet!). Kwindla from trydaily.com, a voice AI expert, joined the show to shed some light on this issue. </p><p>We talked about the challenges of scaling these models and the complexities of context management in real-time audio processing. The conversation shifted to how OpenAI's RealTime API isn’t just about the model itself but also the innovative way they're managing the user experience and state within a conversation. He pointed out, however, that what we see and hear from the API isn’t exactly what’s going on under the hood: “What the model hears and what the transcription events give you back are not the same”. Turns out, OpenAI relies on Whisper for generating text transcriptions – it’s not directly from the voice model.</p><p>The pricing really threw me, though: after only a little bit of testing, nothing even close to production, OpenAI charged me almost $10, and the same conversations are happening across <a target="_blank" href="https://www.reddit.com/r/OpenAI/comments/1fvp88h/new_realtime_api_is_extremely_expensive/">Reddit</a> and <a target="_blank" href="https://community.openai.com/t/realtime-api-extremely-expensive/966825/14">OpenAI forums</a> as well. 
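For a rough sense of why the bill adds up, here's a back-of-envelope estimator. The per-minute rates below are assumptions based on widely reported launch pricing (roughly $0.06 per minute of audio input and $0.24 per minute of audio output), so check current pricing before relying on them:

```python
# Rough dollar cost of a realtime voice session (audio only, no text tokens).
# Rates are ASSUMPTIONS from launch-era pricing reports, not official figures.
AUDIO_IN_PER_MIN = 0.06   # ~$ per minute of user audio sent in
AUDIO_OUT_PER_MIN = 0.24  # ~$ per minute of model audio generated

def estimate_session_cost(input_minutes: float, output_minutes: float) -> float:
    """Back-of-envelope estimate; ignores cached/text tokens entirely."""
    return input_minutes * AUDIO_IN_PER_MIN + output_minutes * AUDIO_OUT_PER_MIN

if __name__ == "__main__":
    # A 30-minute back-and-forth where the model speaks about half the time:
    print(f"${estimate_session_cost(30, 15):.2f}")  # → $5.40
```

At these rates, a toy that chats with trick-or-treaters all evening adds up fast, which is exactly the sticker shock being discussed on the forums.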
</p><p>Hallo-Weave project update: </p><p>So as I let folks know on the show, I'm building a Halloween AI decoration as a project, and integrating it into Weights & Biases Weave (that's why it's called HalloWeave)</p><p>After performing <a target="_blank" href="https://x.com/altryne/status/1842459983625048177">brain surgery</a>, <a target="_blank" href="https://www.instagram.com/reel/DA0Cd38OK18/">futzing with wires</a> and LEDs, I finally have it set up so it wakes up on a trigger word (it's "Trick or Treat!"), takes a picture with the webcam (an actual webcam, the Raspberry Pi camera was god awful) and sends it to Gemini Flash to detect which costume this is and write a nice customized greeting. </p><p>Then I send that text to Cartesia to generate the speech using a British voice, and then I play it via a Bluetooth speaker. <a target="_blank" href="https://x.com/altryne/status/1844482944867676461">Here's</a> a video of the last stage (which still had some Bluetooth issues, it's a bit better now) </p><p>Next up: I should decide if I care to integrate OpenAI Real time (and pay a LOT of $$$ for it) or fall back to existing LLM + TTS services and let kids actually have a conversation with the toy! </p><p>Stay tuned for more updates as we get closer to Halloween; the project is open source HERE and the Weave dashboard will be open once it's live. </p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>One More Thing… UIThub! </p><p>Before signing off, one super useful tool for you! It's so useful I recorded (and edited) a video on it. 
I've also posted it on my brand new <a target="_blank" href="https://www.tiktok.com/@alt_tok/video/7423861834907651370">TikTok</a>, <a target="_blank" href="https://www.instagram.com/altryne_ai/">Instagram</a>, <a target="_blank" href="https://youtube.com/shorts/onG85S2zosM">YouTube</a> and <a target="_blank" href="https://www.linkedin.com/in/alex-volkov-/">LinkedIn</a> accounts, where it promptly did not receive any views, but hey, gotta start somewhere right? 😂 </p><p>Phew! That’s a wrap for this week’s ThursdAI. From Nobel Prizes to new open-source tools, and even Meta's incredibly promising (but still locked down) video gen models, the world of AI continues to surprise and delight (and maybe cause a mild existential crisis or two!). I'd love to hear your thoughts – what caught your eye? Are you building anything cool? Let me know in the comments, and I'll see you back here next week for more AI adventures! Oh, and don't forget to subscribe to the podcast (five-star ratings always appreciated 😉). </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-10-two-nobel-prizes</link><guid isPermaLink="false">substack:post:150072088</guid><dc:creator><![CDATA[Alex Volkov and Kwindla Hultman Kramer]]></dc:creator><pubDate>Thu, 10 Oct 2024 22:53:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/150072088/a71c7bd741525ba2b1dbf136c7961c3d.mp3" length="64817643" type="audio/mpeg"/><itunes:author>Alex Volkov and Kwindla Hultman Kramer</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5401</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/150072088/4cc3d97fb7fd7ad7685f340e68333fa5.jpg"/></item><item><title><![CDATA[📆 ThursdAI - Oct 3 - OpenAI RealTime API, ChatGPT Canvas & other DevDay news (how I met Sam Altman), Gemini 1.5 8B is basically free, BFL makes FLUX 1.1 6x faster, Rev breaks whisper records...]]></title><description><![CDATA[<p>Hey, it's Alex. Ok, so mind is officially blown. I was sure this week was going to be wild, but I didn't expect everyone else besides OpenAI to pile on, exactly on ThursdAI. </p><p>Coming back from Dev Day (number 2) and am still processing, and wanted to actually do a recap by humans, not just the NotebookLM one I posted during the keynote itself (which was awesome and scary in a "will AI replace me as a podcaster" kind of way), and was incredible to have <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> who was sitting just behind me most of Dev Day, join me for the recap! 
</p><p>But then the news kept coming, OpenAI released Canvas, which is a whole new way of interacting with chatGPT, BFL released a new Flux version that's 6x faster, Rev released a Whisper killer ASR that does diarization and Google released Gemini 1.5 Flash 8B, and said that with prompt caching (which OpenAI now also has, yay) this will cost a whopping $0.01 / Mtok. That's 1 cent per million tokens, for a multimodal model with a 1 million token context window. 🤯 </p><p>This whole week was crazy, as last ThursdAI after finishing the newsletter I went to meet tons of folks at the AI Tinkerers in Seattle, and did a little EvalForge demo (which you can <a target="_blank" href="https://x.com/altryne/status/1839707273465303468">see here</a>) and wanted to share <a target="_blank" href="https://github.com/wandb/evalForge/tree/main">EvalForge</a> with you as well, it's early but very promising so feedback and PRs are welcome! </p><p>WHAT A WEEK, TL;DR for those who want the links and let's dive in 👇 </p><p>* OpenAI - Dev Day Recap (<a target="_blank" href="https://sub.thursdai.news/p/openai-dev-day-2024-keynote">Alex</a>, <a target="_blank" href="https://simonw.substack.com/p/openai-devday-lets-build-developer">Simon Willison</a>)</p><p>* Recap of Dev Day</p><p>* RealTime API launched</p><p>* Prompt Caching launched</p><p>* Model Distillation is the new finetune</p><p>* Finetuning 4o with images (Skalski <a target="_blank" href="https://blog.roboflow.com/gpt-4o-object-detection/">guide</a>)</p><p>* Fireside chat Q&A with Sam</p><p>* <strong>Open Source LLMs</strong> </p><p>* NVIDIA finally releases NVLM (<a target="_blank" href="https://huggingface.co/nvidia/NVLM-D-72B">HF</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Alex discussed his demo of <strong>EvalForge</strong> at the AI Tinkerers event in Seattle in "This Week's Buzz". 
(<a target="_blank" href="https://x.com/altryne/status/1839707273465303468"><strong>Demo</strong></a><strong>, </strong><a target="_blank" href="https://github.com/wandb/evalForge/tree/main"><strong>EvalForge</strong></a><strong>, </strong><a target="_blank" href="https://seattle.aitinkerers.org/p/ai-tinkerers-seattle-september-2024-meetup"><strong>AI Tinkerers</strong></a><strong>)</strong></p><p>* <strong>Big Companies & APIs</strong></p><p>* Google has released Gemini Flash 8B - $0.01 per million tokens cached (<a target="_blank" href="https://x.com/OfficialLoganK/status/1841903061360640029">X</a>, <a target="_blank" href="https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/">Blog</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Rev breaks SOTA on ASR with Rev ASR and Rev Diarize (<a target="_blank" href="https://www.rev.com/blog/speech-to-text-technology/introducing-reverb-open-source-asr-diarization">Blog</a>, <a target="_blank" href="https://github.com/revdotcom/reverb/tree/main?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://huggingface.co/Revai">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* BFL releases Flux1.1[pro] - 3x-6x faster than 1.0 and higher quality (was 🫐) - (<a target="_blank" href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/">Blog</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/flux-pro/v1.1">Try it</a>)</p><p></p><p>The day I met Sam Altman / Dev Day recap</p><p>Last Dev Day (my coverage <a target="_blank" href="https://sub.thursdai.news/p/nov-09?utm_source=publication-search">here</a>) was a "singular" day in AI for me, given it also had the "keep AI open source" event with Nous Research and Grimes, and this Dev Day I was delighted to find out that the vibe was completely different, focused less on bombastic announcements or models and more on practical, dev-focused things. 
</p><p>This meant that OpenAI cherry-picked folks who actively develop with their tools, and they didn't invite traditional media, only folks like yours truly, @swyx from Latent Space, Rowan from Rundown, Simon Willison and Dan Shipper, you know, newsletter and podcast folks who actually build! </p><p>It also meant that many, many OpenAI employees who work on the products and APIs we get to use were there to receive feedback, help folks with prompting, and just generally interact with the devs, and build that community. I want to shout out my friends Ilan (who was in the keynote as the strawberry salesman interacting with the RealTime API agent), Will DePue from the SORA team, with whom we had an incredible conversation about ethics and legality of projects, Christine McLeavey who runs the Audio team, with whom I shared a video of my daughter crying when chatGPT didn't understand her, Katia, Kevin and Romain on the incredible DevEx/DevRel team and finally, my new buddy Jason who does infra, and was fighting bugs all day and only joined the pub after shipping RealTime to all of us. 
</p><p>I've collected all these folks in a convenient and super high signal X list <a target="_blank" href="https://x.com/onebitToo">here</a>, so definitely give that list a follow if you'd like to tap into their streams.</p><p>For the actual announcements, I've already covered this in my Dev Day post <a target="_blank" href="https://sub.thursdai.news/p/openai-dev-day-2024-keynote">here</a> (which was paid subscribers only, but is now open to all) and Simon did an incredible summary on <a target="_blank" href="https://simonw.substack.com/p/openai-devday-lets-build-developer">his Substack as well</a> </p><p>The highlights were definitely the new RealTime API that lets developers build with Advanced Voice Mode, Prompt Caching that happens automatically and cuts the cost of your long-context API calls by a whopping 50%, and model finetuning, which they are rebranding as Distillation, with new tools to make it easier (including Vision Finetuning for the first time!)</p><p>Meeting Sam Altman</p><p>While I didn't get a "media" pass or anything like this, and didn't really get to sit down with OpenAI execs (see <a target="_blank" href="https://www.latent.space/p/devday-2024">Swyx</a> on <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a> for those conversations), I did have a chance to ask Sam multiple things. 
</p><p>First, at the closing fireside chat between Sam and Kevin Weil (CPO at OpenAI), Kevin first asked Sam a bunch of questions, and then they gave out the microphones to folks, and I asked the only question that got Sam to smile.</p><p>Sam and Kevin went on for a while, and that Q&A was actually very interesting, so much so that I had to recruit my favorite Notebook LM podcast hosts to go through it and give you an overview, so here's that <a target="_blank" href="https://notebooklm.google.com/notebook/99bf124c-f665-4b64-af71-78d2eaa3fd6a">Notebook LM</a>, with the transcript of the whole Q&A (maybe I'll publish it as a standalone episode? LMK in the comments)</p><p>After the official day was over, there was a reception, at the same gorgeous Fort Mason location, with drinks and light food, and as you might imagine, this was great for networking.</p><p>But the <strong>real post dev day event</strong> was hosted by OpenAI devs at a bar, Palm House, which both Sam and Greg Brockman just came to and hung out with folks. I missed Sam last time and was very eager to go and ask him follow-up questions this time, when I saw he was just chilling at that bar, talking to devs, as though he didn't "just" complete the largest funding round in VC history ($6.6B at $157B valuation) and went through a lot of drama/turmoil with the departure of a lot of senior leadership! </p><p>Sam was awesome to briefly chat with, tho as you might imagine, it was loud and tons of folks wanted selfies, but we did discuss how AI affects the real world, job replacement stuff was brought up, and how developers are using the OpenAI products. </p><p>What we learned, thanks to Sigil, is that o1 was named partly as a "reset" like the main blogpost claimed and partly as "alien of extraordinary ability", which is the official designation of the O-1 visa, and that Sam came up with this joke himself. </p><p>Is anyone here smarter than o1? Do you think you still will by o2? 
</p><p>One of the highest impact questions was by Sam himself to the audience.</p><p>Who feels like they've spent a lot of time with O1, and they would say, like, I feel definitively smarter than that thing?</p><p>— Sam Altman</p><p>When Sam asked this at first, a few hands hesitantly went up. He then followed up with </p><p>Do you think you still will by O2? No one. No one taking the bet. One of the challenges that we face is like, we know how to go do this thing that we think will be like, at least probably smarter than all of us in like a broad array of tasks</p><p>This was a very palpable moment, as folks looked around and realized what OpenAI folks have probably internalized a long time ago: we're living in INSANE times, and even those of us at the frontier of research, AI use and development don't necessarily understand or internalize how WILD the upcoming few months, years will be. </p><p>And then we all promptly forgot to have an existential crisis about it, and took our self-driving Waymos to meet Sam Altman at a bar 😂 </p><p>This week's Buzz from Weights & Biases</p><p>Hey so... after finishing ThursdAI last week I went to the Seattle Tinkerers event and gave a demo (and sponsored the event with a raffle of Meta Raybans). I demoed our project called EvalForge, which I built the frontend of while my colleague Anish built the backend, as we tried to replicate the <a target="_blank" href="https://arxiv.org/abs/2404.12272">Who validates the validators</a> paper by Shreya Shankar. Here’s that demo, and the EvalForge <a target="_blank" href="https://github.com/wandb/evalForge/tree/main">Github</a> for the many of you who asked to see it. 
</p><p>Please let me know what you think; I love doing demos and would love feedback and ideas for the next one (coming up in October!)</p><p>OpenAI chatGPT Canvas - a completely new way to interact with chatGPT</p><p>Just 2 days after Dev Day, and as breaking news during the show, OpenAI also shipped a new way to interact with chatGPT, called Canvas! </p><p>Get ready to say goodbye to simple chats and hello to a whole new era of AI collaboration! <strong>Canvas</strong> is a groundbreaking interface that transforms ChatGPT into a true creative partner for writing and coding projects. Imagine having a tireless copy editor, a brilliant code reviewer, and an endless source of inspiration all rolled into one – that's Canvas!</p><p>Canvas moves beyond the limitations of a simple chat window, offering a dedicated space where you and ChatGPT can work side-by-side. <strong>Canvas opens in a separate window, allowing for a more visual and interactive workflow.</strong> You can directly edit text or code within Canvas, <strong>highlight sections for specific feedback</strong>, and even use a handy menu of shortcuts to request tasks like adjusting the length of your writing, debugging code, or adding final polish. And just like with your favorite design tools, you can easily restore previous versions using the back button.</p><p>Per Karina, OpenAI has trained a special GPT-4o model specifically for Canvas, enabling it to understand the context of your project and provide more insightful assistance. They used synthetic data, generated by O1, which led them to outperform the basic version of GPT-4o by 30% in accuracy. </p><p>A general pattern emerges, where new frontiers in intelligence are also advancing older models (and humans as well). </p><p>Gemini Flash 8B makes intelligence essentially free</p><p>Google folks were not about to take this week lightly and decided to hit back with one of the most insane upgrades to pricing I've seen. 
The newly announced Gemini Flash 1.5 8B is going to cost just... <strong>$0.01 per million tokens</strong> 🤯 (when using caching, 3 cents when not cached) </p><p>This basically makes intelligence free. And while it is free, it's still their multimodal model (supports images) and has a HUGE context window of 1M tokens. </p><p>The evals look ridiculous as well: this 8B param model now almost matches Flash from May of this year, less than 6 months ago, while giving developers 2x the rate limits and lower latency as well. </p><p>What will you do with free intelligence? What will you do with free intelligence of o1 quality in a year? What about o2 quality in 3 years? </p><p>Bye bye, Whisper? Rev open sources <strong>Reverb</strong> and Reverb Diarize + turbo models (<a target="_blank" href="https://www.rev.com/blog/speech-to-text-technology/introducing-reverb-open-source-asr-diarization">Blog</a>, HF, Github)</p><p>In a "WTF just happened" piece of breaking news, a company called <a target="_blank" href="http://Rev.com">Rev.com</a> releases what they consider a SOTA ASR model that obliterates Whisper (English only for now) on metrics like WER, and includes a specific diarization-focused model. </p><p>Trained on 200,000 hours of English speech, expertly transcribed by humans, which, according to their claims, is the largest dataset that any ASR model has been trained on, they achieve some incredible results that blow Whisper out of the water (lower WER is better)</p><p>They also released a seemingly incredible diarization model, which helps understand who speaks when (and is usually added on top of Whisper) </p><p>For diarization, Rev used the high-performance <a target="_blank" href="https://github.com/pyannote/pyannote-audio">pyannote.audio</a> library to fine-tune existing models on 26,000 hours of expertly labeled data, significantly improving their performance.</p><p>While this is for English only, getting a SOTA transcription model in the open is remarkable. 
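Diarization output is essentially a list of (start, end, speaker) turns, and a common post-processing step is stitching consecutive turns from the same speaker back together. Here's a minimal sketch of that step; the pyannote usage in the comment reflects its published API but is an assumption, and has nothing to do with Rev's fine-tuned weights:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    start: float  # seconds
    end: float
    speaker: str

def merge_turns(turns: list[Turn], gap: float = 0.5) -> list[Turn]:
    """Merge consecutive turns by the same speaker when the silence
    between them is shorter than `gap` seconds."""
    merged: list[Turn] = []
    for t in turns:
        if merged and merged[-1].speaker == t.speaker and t.start - merged[-1].end < gap:
            merged[-1] = Turn(merged[-1].start, t.end, t.speaker)
        else:
            merged.append(Turn(t.start, t.end, t.speaker))
    return merged

if __name__ == "__main__":
    # With pyannote.audio the raw turns would come from something like
    # (assumed pipeline name, requires a HF access token):
    #   from pyannote.audio import Pipeline
    #   pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
    #   turns = [Turn(seg.start, seg.end, spk)
    #            for seg, _, spk in pipeline("audio.wav").itertracks(yield_label=True)]
    turns = [Turn(0.0, 1.2, "A"), Turn(1.3, 2.0, "A"), Turn(2.5, 4.0, "B")]
    print(merge_turns(turns))  # A's two turns collapse into one
```

Merging like this keeps the transcript readable when a model over-segments a single speaker's monologue.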
Rev opened up this model on HuggingFace with a non-commercial license, so folks can play around with (and distill?) it, while also making it available in their API for very cheap, along with a <a target="_blank" href="https://github.com/revdotcom/reverb-self-hosted">self-hosted solution in a docker container</a></p><p>Black Forest Labs feeding up blueberries - new Flux 1.1[pro] is here (<a target="_blank" href="https://blackforestlabs.ai/announcing-flux-1-1-pro-and-the-bfl-api/">Blog</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/flux-pro/v1.1">Try It</a>)</p><p>What is a ThursdAI without multiple SOTA advancements in all fields of AI? In an effort to prove this to be very true, the folks behind FLUX revealed that the mysterious 🫐 model that was trending on some image comparison leaderboards is in fact a new version of Flux pro, specifically 1.1[pro]</p><p>FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity</p><p>Just a bit over 2 months since the initial release, and proving that they are THE frontier lab for image diffusion models, folks at BFL are dropping a model that outperforms their previous one on user voting and quality, while being much faster! </p><p>They have partnered with Fal, Together, Replicate to disseminate this model (it's not on X quite yet) but are now also offering developers direct access to their <a target="_blank" href="https://docs.bfl.ml/">own API </a>at a competitive pricing of just 4 cents per image generation (while being faster AND cheaper AND higher quality than the previous Flux 😮) and you can try it out on Fal <a target="_blank" href="https://fal.ai/models/fal-ai/flux-pro/v1.1">here</a></p><p>Phew! What a whirlwind! Even <em>I</em> need a moment to catch my breath after that AI news tsunami. But don’t worry, the conversation doesn't end here. 
I barely scratched the surface of these groundbreaking announcements, so dive into the <a target="_blank" href="https://www.google.com/url?sa=E&#38;q=link-to-podcast"><strong>podcast episode</strong></a> for the full scoop – Simon Willison’s insights on OpenAI’s moves are pure gold, and Maxime Labonne spills the tea on Liquid AI's audacious plan to dethrone transformers (yes, you read that right). And for those of you who prefer skimming, check out my <a target="_blank" href="https://sub.thursdai.news/p/openai-dev-day-2024-keynote"><strong>Dev Day summary</strong></a> (open to all now). As always, hit me up in the comments with your thoughts. What are <em>you</em> most excited about? Are you building anything cool with these new tools? Let's keep the conversation going! Alex</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/oct-3-how-i-met-sam-altman</link><guid isPermaLink="false">substack:post:149785640</guid><dc:creator><![CDATA[Alex Volkov, Simon Willison, and Piotr Skalski]]></dc:creator><pubDate>Fri, 04 Oct 2024 00:00:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149785640/bd056910c2f4c7071c75f15cae4e8fdc.mp3" length="75772522" type="audio/mpeg"/><itunes:author>Alex Volkov, Simon Willison, and Piotr Skalski</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6314</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/149785640/4dc84c91310415591cb085950a2e5d4b.jpg"/></item><item><title><![CDATA[OpenAI Dev Day 2024 keynote]]></title><description><![CDATA[<p>Hey, Alex here. 
Super quick, as I’m <strong>still</strong> attending Dev Day, but I didn’t want to leave you hanging (if you're a paid subscriber!), so I have decided to outsource my job and give the amazing podcasters of NotebookLM the whole transcript of the opening keynote of OpenAI Dev Day.</p><p>You can see a blog of everything they just posted <a target="_blank" href="https://openai.com/devday/">here</a></p><p>Here’s a summary of everything that was announced:</p><p>* <strong>Developer-Centric Approach:</strong> OpenAI consistently emphasized the importance of developers in their mission to build beneficial AGI. The speaker stated, "OpenAI's mission is to build AGI that benefits all of humanity, and developers are critical to that mission... we cannot do this without you."</p><p>* <strong>Reasoning as a New Frontier:</strong> The introduction of the "O1" series of models marks a significant step towards AI with advanced reasoning capabilities, going beyond the limitations of previous models like GPT-3.</p><p>* <strong>Multimodal Capabilities:</strong> OpenAI is expanding the potential of AI applications by introducing multimodal capabilities, particularly focusing on real-time speech-to-speech interaction through the new Realtime API.</p><p>* <strong>Customization and Fine-Tuning:</strong> Empowering developers to customize models is a key theme. OpenAI introduced Vision for fine-tuning with images and announced easier access to fine-tuning with model distillation tools.</p><p>* <strong>Accessibility and Scalability:</strong> OpenAI demonstrated a commitment to making AI more accessible and cost-effective for developers through initiatives like price reductions, prompt caching, and model distillation tools.</p><p><strong>Important Ideas and Facts:</strong></p><p><strong>1. 
The O1 Models:</strong></p><p>* Represent a shift towards AI models with enhanced reasoning capabilities, surpassing previous generations in problem-solving and logical thought processes.</p><p>* O1 Preview is positioned as the most powerful reasoning model, designed for complex problems requiring extended thought processes.</p><p>* O1 Mini offers a faster, cheaper, and smaller alternative, particularly suited for tasks like code debugging and agent-based applications.</p><p>* Both models demonstrate advanced capabilities in coding, math, and scientific reasoning.</p><p>* OpenAI highlighted the ability of O1 models to work with developers as "thought partners," understanding complex instructions and contributing to the development process.</p><p><strong>Quote:</strong> "The shift to reasoning introduces a new shape of AI capability. The ability for our model to scale and correct the process is pretty mind-blowing. So we are resetting the clock, and we are introducing a new series of models under the name O1."</p><p><strong>2. Realtime API:</strong></p><p>* Enables developers to build real-time AI experiences directly into their applications using WebSockets.</p><p>* Launches with support for speech-to-speech interaction, leveraging the technology behind ChatGPT's advanced voice models.</p><p>* Offers natural and seamless integration of voice capabilities, allowing for dynamic and interactive user experiences.</p><p>* Showcased the potential to revolutionize human-computer interaction across various domains like driving, education, and accessibility.</p><p><strong>Quote:</strong> "You know, a lot of you have been asking about building amazing speech-to-speech experiences right into your apps. Well now, you can."</p><p><strong>3. 
Vision, Fine-Tuning, and Model Distillation:</strong></p><p>* Vision introduces the ability to use images for fine-tuning, enabling developers to enhance model performance in image understanding tasks.</p><p>* Fine-tuning with Vision opens up opportunities in diverse fields such as product recommendations, medical imaging, and autonomous driving.</p><p>* OpenAI emphasized the accessibility of these features, stating that "fine-tuning with Vision is available to every single developer."</p><p>* Model distillation tools facilitate the creation of smaller, more efficient models by transferring knowledge from larger models like O1 and GPT-4.</p><p>* This approach addresses cost concerns and makes advanced AI capabilities more accessible for a wider range of applications and developers.</p><p><strong>Quote:</strong> "With distillation, you take the outputs of a large model to supervise, to teach a smaller model. And so today, we are announcing our own model distillation tools."</p><p><strong>4. Cost Reduction and Accessibility:</strong></p><p>* OpenAI highlighted its commitment to lowering the cost of AI models, making them more accessible for diverse use cases.</p><p>* Announced a 90% decrease in cost per token since the release of GPT-3, emphasizing continuous efforts to improve affordability.</p><p>* Introduced prompt caching, automatically providing a 50% discount for input tokens the model has recently processed.</p><p>* These initiatives aim to remove financial barriers and encourage wider adoption of AI technologies across various industries.</p><p><strong>Quote:</strong> "Every time we reduce the price, we see new types of applications, new types of use cases emerge. We're super far from the price equilibrium. In a way, models are still too expensive to be bought at massive scale."</p><p><strong>Conclusion:</strong></p><p>OpenAI DevDay conveyed a strong message of developer empowerment and a commitment to pushing the boundaries of AI capabilities. 
With new models like O1, the introduction of the Realtime API, and a dedicated focus on accessibility and customization, OpenAI is paving the way for a new wave of innovative and impactful AI applications developed by a global community.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/openai-dev-day-2024-keynote</link><guid isPermaLink="false">substack:post:149677472</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Tue, 01 Oct 2024 18:56:18 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149677472/b5d6c5a0a15984364fd375493bb39088.mp3" length="5682439" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>355</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/149677472/0621ac6ac3ca84057cf3b98420871097.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Sep 26 - 🔥 Llama 3.2 multimodal & meta connect recap, new Gemini 002, Advanced Voice mode & more AI news]]></title><description><![CDATA[<p>Hey everyone, it's Alex (still traveling!), and oh boy, what a week again! Advanced Voice Mode is finally here from OpenAI, Google updated their Gemini models in a huge way and then Meta announced MultiModal LlaMas and on device mini Llamas (and we also got a "better"? multimodal from Allen AI called MOLMO!)</p><p>From Weights & Biases perspective, our hackathon was a success this weekend, and then I went down to Menlo Park for my first Meta Connect conference, full of news and updates and will do a full recap here as well. </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Overall another crazy week in AI, and it seems that everyone is trying to rush something out the door before OpenAI Dev Day next week (which I'll cover as well!) <strong>Get ready, folks, because Dev Day is going to be epic!</strong></p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs</strong> </p><p>* Meta Llama 3.2 Multimodal models (11B & 90B) (X, HF, <a target="_blank" href="https://x.com/togethercompute/status/1839013617817309563">try free</a>)</p><p>* Meta Llama 3.2 tiny models 1B & 3B parameters (X, Blog, download)</p><p>* Allen AI releases MOLMO - open SOTA multimodal AI models (<a target="_blank" href="https://x.com/allen_ai/status/1838956313902219595">X</a>, Blog, HF, <a target="_blank" href="https://molmo.allenai.org/">Try It</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI releases Advanced Voice Mode to all & Mira Murati leaves OpenAI </p><p>* Google updates Gemini 1.5-Pro-002 and 1.5-Flash-002 (<a target="_blank" href="https://developers.googleblog.com/en/updated-gemini-models-reduced-15-pro-pricing-increased-rate-limits-and-more/">Blog</a>)</p><p>* <strong>This week's Buzz</strong> </p><p>* Our free course is LIVE - more than 3000 already started learning how to <a target="_blank" href="https://www.wandb.courses/courses/rag-in-production?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep26">build advanced RAG++</a></p><p>* Sponsoring tonight's AI Tinkerers in Seattle; if you're in Seattle, <a target="_blank" href="https://seattle.aitinkerers.org/p/ai-tinkerers-seattle-september-2024-meetup">come through for my demo</a></p><p>* <strong>Voice & Audio</strong></p><p>* Meta also launches voice mode (<a target="_blank" href="https://twitter.com/Ahmad_Al_Dahle/status/1839025866011193367">demo</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Project ORION - holographic glasses 
are here! (<a target="_blank" href="https://about.meta.com/realitylabs/orion/?tab=Optics+%26+display">link</a>)</p><p>Meta gives us new LLaMas and AI hardware</p><p>LLama 3.2 Multimodal 11B and 90B</p><p>This was by far the biggest open-source release of this week (tho see below, may not be the "best"), as a rumored release finally came out, and Meta has given our Llama eyes! Coming with 2 versions (well 4 if you count the base models which they also released), these new MultiModal LLaMas were trained with an adapter architecture, keeping the underlying text models the same, and placing a vision encoder that was trained and finetuned separately on top. </p><p>LLama 90B is among the best open-source multimodal models available</p><p>— Meta team at launch</p><p>These new vision adapters were trained on a massive 6 Billion images, including synthetic data generation by 405B for questions/captions, and finetuned with a subset of 600M high quality image pairs. </p><p>Unlike the rest of their models, the Meta team did NOT claim SOTA on these models, and the benchmarks are very good but not the best we've seen (Qwen 2 VL from a couple of weeks ago, and MOLMO from today beat it on several benchmarks) </p><p>With <strong>text-only </strong>inputs, the Llama 3.2 Vision models are functionally the same as the <a target="_blank" href="https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1">Llama 3.1 Text models</a>; this allows the Llama 3.2 Vision models to be a drop-in replacement for Llama 3.1 8B/70B with added image understanding capabilities.</p><p>Seems like these models don't support multi-image or video inputs (unlike Pixtral, for example), nor tool use with images. </p><p>Meta will also release these models on <a target="_blank" href="http://meta.ai">meta.ai</a> and every other platform, and they cited a crazy 500 million monthly active users of their AI services across all their apps 🤯 which marks them as the leading AI services provider in the world now. 
</p><p><strong>Llama 3.2 Lightweight Models (1B/3B)</strong></p><p>The additional and maybe more exciting thing that we got from Meta was the introduction of the small/lightweight models of 1B and 3B parameters. </p><p>Trained on up to 9T tokens, and distilled / pruned from larger models, these are aimed at on-device inference (and by device here we mean from laptops to mobiles to soon... glasses? more on this later) </p><p>In fact, Meta released an iOS demo that runs these models, takes a group chat, summarizes it and calls the calendar tool to schedule based on the conversation, and all this happens on device without the info leaving to a larger model. </p><p>They have also been able to prune down the LLama-guard safety model they released to under 500MB and have had demos of it running client-side and hiding user input on the fly as the user types something bad!</p><p>Interestingly, here too, the models were not SOTA, even in the small category, with tiny models like Qwen 2.5 3B beating these models on many benchmarks, but they are outlining a new distillation / pruning era for Meta as they aim for these models to run on device, eventually even glasses (and some said Smart Thermostats)</p><p>In fact they are so tiny that the community quantized and released them, and I was able to download these models, all while the keynote was still going! Here I am running the Llama 3B during the developer keynote! </p><p>Speaking AI - not only from OpenAI</p><p>Zuck also showcased a voice-based Llama that's coming to Meta AI (unlike OpenAI it's likely a pipeline of TTS/STT) but it worked really fast and Zuck was able to interrupt it. </p><p>And they also showed a crazy animated AI avatar of a creator, that was fully backed by Llama, while the human creator was on stage, Zuck chatted with his avatar and reaction times were really really impressive. </p><p>AI Hardware was glasses all along? 
</p><p>Look, we've all seen the blunders of this year, the Humane AI Pin, the Rabbit R1 (which sits on my desk and I haven't recharged in two months) but maybe Meta is the answer here? </p><p>Zuck made a bold claim that glasses are actually the perfect form factor for AI, it sits on your face, sees what you see and hears what you hear, and can whisper in your ear without disrupting the connection between you and your conversation partner. </p><p>They haven't announced new Meta Raybans, but did update the lineup with a new set of transition lenses (to be able to wear those glasses inside and out) and a special edition clear case pair that looks very sleek + new AI features like memories to be able to ask the glasses "hey Meta where did I park" or be able to continue the conversation. I had to get me a pair of these limited edition ones!</p><p>Project ORION - first holographic glasses</p><p>And of course, the biggest announcement of the Meta Connect was the super secret, decade-old project of fully holographic AR glasses, which they called ORION. </p><p>Zuck introduced these as the most innovative and technologically dense set of glasses in the world. They always said the form factor will become just "glasses" and they actually did it (a week after Snap Spectacles), tho those are not going to get released to anyone any time soon; hell, they only made a few thousand of these and they are extremely expensive.</p><p>With a 70-degree FOV, cameras, speakers and a compute puck, these glasses pack a full-day battery at under 100 grams of weight, and have custom silicon, custom displays with a MicroLED projector and just... tons more innovation in there. </p><p>They also come in 3 pieces: the glasses themselves, the wireless compute pack that will hold the LLaMas in your pocket, and the EMG wristband that allows you to control these devices using muscle signals. 
</p><p>These won't ship as a product tho so don't expect to get them soon, but they are real, and will allow Meta to build the product that we will get on top of these by 2030.</p><p>AI use cases</p><p>So what will these glasses be able to do? Well, they showed off a live translation feature on stage that mostly worked, where you just talk and listen to another language in near real time, which was great. There are a bunch of mixed reality games, you'd be able to call people and see them in your glasses on a virtual screen and soon you'll show up as an avatar there as well. </p><p>The AI use-case they showed beyond just translation was MultiModality stuff, where they had a bunch of ingredients for a shake, and you could ask your AI assistant which shake you can make with what it sees.</p><p>I'm so excited for these to finally come to people, I screamed in the audience 👀👓</p><p>OpenAI gives everyone* advanced voice mode </p><p>It's finally here, and if you're paying for chatGPT you know this, the long-announced Advanced Voice Mode for chatGPT is now rolled out to all Plus members. </p><p>The new updates since the beta are: 5 new voices (Maple, Spruce, Vale, Arbor and Sol), and finally access to custom instructions and memory, so you can ask it to remember things and also to know who you are and your preferences (try saving your jailbreaks there) </p><p>Unfortunately, as predicted, by the time it rolled out to everyone, this feels way less exciting than it did 6 months ago, the model is way less emotional, refuses to sing (tho folks are making it anyway) and generally feels way less "wow" than what we saw. Less "HER" than we wanted, for sure. <strong>Seriously, they nerfed the singing! 
Why OpenAI, </strong><strong><em>why</em></strong><strong>?</strong></p><p>Pro tip of mine that went viral: you can set your action button on the newer iPhones to immediately start the voice conversation <a target="_blank" href="https://x.com/altryne/status/1838650551246164403/video/1">with 1 click</a>. </p><p>*This new mode is not available in the EU </p><p>This week's Buzz - our new advanced RAG++ course is live</p><p>I had an awesome time with my colleagues Ayush and Bharat today, after they finally released a FREE advanced RAG course they've been working so hard on for the past few months! Definitely check out our conversation, but better yet, why don't you roll into the course? It's FREE and you'll get to learn about data ingestion, evaluation, query enhancement and more! </p><p>New Gemini 002 is 50% cheaper, 2x faster and better at MMLU-pro</p><p>It seems that every major lab (besides Anthropic) released a big thing this week to try and get under Meta's skin. </p><p>Google announced an update to their Gemini Pro/Flash models, called 002, which is a very significant update!</p><p>Not only are these models 50% cheaper now (Pro price went down by 50% on <128K context lengths), they are 2x faster on outputs with 3x lower latency on first tokens. It's really quite something to see.</p><p>The new models have also improved scores, with the Flash models (the super cheap ones, remember) from September now coming close to or beating the Pro scores from May 2024! </p><p>Definitely a worthy update from the team at Google! </p><p>Hot off the press, the folks at Google Labs also added a feature to the awesome NotebookLM that allows it to summarize over 50 hours of YouTube videos in the crazy-high-quality <a target="_blank" href="https://x.com/joshtwoodward/status/1839341423038193716">Audio Overview feature</a>! </p><p>That's it for the week. We of course chatted about way, way more during the show, so make sure to listen to the podcast. Otherwise, I'm signing off as I travel back home for the weekend, before returning to SF for OpenAI Dev Day next week! </p><p>Expect full Dev Day coverage live next Tuesday and a recap on the newsletter. </p><p>Meanwhile, if you've already subscribed, please share this newsletter with one or two people who are interested in AI 🙇‍♂️ and see you next week. </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-26-llama-32-multimodal</link><guid isPermaLink="false">substack:post:149468402</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 26 Sep 2024 22:38:01 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149468402/adfb1666f84d1d2d2443c2daa8763617.mp3" length="77223071" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6435</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/149468402/8ac52106dd1207e3223f24c97a1977a5.jpg"/></item><item><title><![CDATA[ThursdAI - Sep 19 - 👑 Qwen 2.5 new OSS king LLM, MSFT new MoE, Nous Research's Forge announcement, and Talking AIs in the open source!]]></title><description><![CDATA[<p>Hey folks, Alex here, back with another ThursdAI recap – and let me tell you, this week's episode was a whirlwind of open-source goodness, mind-bending inference techniques, and a whole lotta talk about talking AIs! 
We dove deep into the world of LLMs, from Alibaba's massive Qwen 2.5 drop to the quirky, real-time reactions of Moshi. </p><p>We even got a sneak peek at Nous Research's ambitious new project, Forge, which promises to unlock some serious LLM potential. So grab your pumpkin spice latte (it's that time again, isn't it? 🍁), settle in, and let's recap the AI awesomeness that went down on ThursdAI, September 19th! </p><p>ThursdAI is brought to you (as always) by Weights & Biases; we still have a few spots left in our <a target="_blank" href="https://lu.ma/judge">Hackathon</a> this weekend, and our new advanced RAG course is now released and is <a target="_blank" href="https://www.wandb.courses/courses/rag-in-production?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep19">FREE</a> to sign up!</p><p>TL;DR of all topics + show notes and links</p><p>* <strong>Open Source LLMs</strong> </p><p>* Alibaba Qwen 2.5 models drop + Qwen 2.5 Math and Qwen 2.5 Code (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1836449414220779584">X</a>, HF, <a target="_blank" href="https://qwenlm.github.io/blog/qwen2.5/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen2.5">Try It</a>)</p><p>* Qwen 2.5 Coder 1.5B is running on a 4-year-old phone (<a target="_blank" href="https://x.com/nisten/status/1836509890136727573">Nisten</a>)</p><p>* KyutAI open sources Moshi & Mimi (Moshiko & Moshika) - end to end voice chat model (X, HF, <a target="_blank" href="https://kyutai.org/Moshi.pdf">Paper</a>)</p><p>* Microsoft releases GRIN-MoE - tiny (6.6B active) MoE with 79.4 MMLU (<a target="_blank" href="https://x.com/LiyuanLucas/status/1836550267522945508">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/GRIN-MoE">HF</a>, <a target="_blank" href="https://github.com/microsoft/GRIN-MoE">GitHub</a>)</p><p>* Nvidia - announces NVLM 1.0 - frontier-class multimodal LLMs (no weights yet, <a target="_blank" 
href="https://twitter.com/_weiping/status/1836226447863877837">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI O1 results from LMsys do NOT disappoint - vibe checks also confirm, new KING LLM in town (<a target="_blank" href="https://x.com/OpenAI/status/1836846202182131776">Thread</a>)</p><p>* NousResearch announces Forge in waitlist - their MCTS-enabled inference product (<a target="_blank" href="https://twitter.com/altryne/status/1836605857490178429/video/1">X</a>)</p><p>* <strong>This week's Buzz - </strong>everything <strong>Weights & Biases </strong>related this week</p><p>* Judgment Day (hackathon) is in 2 days! Still places to come hack with us <a target="_blank" href="https://lu.ma/judge">Sign up</a></p><p>* Our new RAG Course is live - learn all about advanced RAG from WandB, Cohere and Weaviate (<a target="_blank" href="https://www.wandb.courses/courses/rag-in-production?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep19">sign up for free</a>)</p><p>* <strong>Vision & Video</strong></p><p>* YouTube announces DreamScreen - generative AI image and video in YouTube Shorts (<a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep5">Blog</a>)</p><p>* <strong>CogVideoX-5B-I2V</strong> - leading open source img2video model (<a target="_blank" href="https://x.com/ChatGLM/status/1836461703179178321">X</a>, <a target="_blank" href="https://huggingface.co/THUDM/CogVideoX-5b-I2V">HF</a>)</p><p>* Runway, DreamMachine & Kling all announce text-2-video over API (<a target="_blank" href="https://x.com/runwayml/status/1835670564825944265">Runway</a>, <a target="_blank" href="https://x.com/LumaLabsAI/status/1835742651662139529">DreamMachine</a>)</p><p>* Runway announces video 2 video model (<a target="_blank" href="https://x.com/runwayml/status/1836754191408091312">X</a>)</p><p>* Tools</p><p>* Snap 
announces their XR glasses - with hand tracking and AI features (<a target="_blank" href="https://x.com/bilawalsidhu/status/1836106140708687885">X</a>)</p><p>Open Source Explosion!</p><p>👑 Qwen 2.5: new king of OSS LLMs with 12 model releases, including instruct, math and coder versions</p><p>This week's open-source highlight was undoubtedly the release of Alibaba's Qwen 2.5 models. We had Justin Lin from the Qwen team join us live to break down this monster drop, which includes a whopping seven different sizes, ranging from a nimble 0.5B parameter model all the way up to a colossal 72B beast! And as if that wasn't enough, they also dropped Qwen 2.5 Coder and Qwen 2.5 Math models, further specializing their LLM arsenal. As Justin mentioned, they heard the community's calls for 14B and 32B models loud and clear – and they delivered! "We do not have enough GPUs to train the models," Justin admitted, "but there are a lot of voices in the community...so we endeavor for it and bring them to you." Talk about listening to your users!</p><p>Trained on an astronomical 18 trillion tokens (that’s even more than Llama 3.1 at 15T!), Qwen 2.5 shows significant improvements across the board, especially in coding and math. <strong>They even open-sourced the previously closed-weight Qwen 2 VL 72B</strong>, giving us access to the best open-source vision language models out there. With a 128K context window, these models are ready to tackle some serious tasks. 
As Nisten exclaimed after putting the 32B model through its paces, "It's really practical…I was dumping in my docs and my code base and then like actually asking questions."</p><p>It's safe to say that Qwen 2.5 Coder is now the best coding LLM that you can use, and just in time for our chat, a new update from ZeroEval confirms that Qwen 2.5 models are the absolute kings of OSS LLMs, beating Mistral Large, 4o-mini, Gemini Flash and other huge models with just 72B parameters 👏 </p><p>Moshi: The Chatty Cathy of AI</p><p>We've covered Moshi Voice <a target="_blank" href="https://sub.thursdai.news/i/146291193/kyutai-moshi-a-b-end-to-end-voice-model-try-it-see-announcement">back in July</a>, and they have promised to open source the whole stack, and now finally they did! Including the LLM and the Mimi Audio Encoder! </p><p>This quirky little 7.6B parameter model is a speech-to-speech marvel, capable of understanding your voice and responding in kind. It's an end-to-end model, meaning it handles the entire speech-to-speech process internally, without relying on separate speech-to-text and text-to-speech models.</p><p>While it might not be a logic genius, Moshi's real-time reactions are undeniably uncanny. Wolfram Ravenwolf described the experience: "It's uncanny when you don't even realize you finished speaking and it already starts to answer." The speed comes from the integrated architecture and efficient codecs, boasting a theoretical response time of just 160 milliseconds!</p><p>Moshi uses the (also open-sourced) Mimi neural audio codec, and achieves a 12.5 Hz representation with just 1.1 kbps bandwidth.</p><p>You can download it and run on your own machine or give it a try <a target="_blank" href="https://moshi.chat">here</a>; just don't expect a masterful conversationalist hehe</p><p>Gradient-Informed MoE (GRIN-MoE): A Tiny Titan</p><p>Just before our live show, Microsoft dropped a paper on GrinMoE, a gradient-informed Mixture of Experts model. 
We were lucky enough to have the lead author, Liyuan Liu (aka Lucas), join us impromptu to discuss this exciting development. Despite having only 6.6B active parameters (16 x 3.8B experts), GrinMoE manages to achieve remarkable performance, even outperforming larger models like Phi-3 on certain benchmarks. It's a testament to the power of clever architecture and training techniques. Plus, it's open-sourced under the MIT license, making it a valuable resource for the community.</p><p>NVIDIA NVLM: A Teaser for Now</p><p>NVIDIA announced NVLM 1.0, their own set of multimodal LLMs, but alas, no weights were released. We’ll have to wait and see how they stack up against the competition once they finally let us get our hands on them. Interestingly, while claiming SOTA on some vision tasks, they haven't actually compared themselves to Qwen 2 VL, which we know is really really good at vision tasks 🤔 </p><p>Nous Research Unveils Forge: Inference Time Compute Powerhouse (beating o1 at AIME Eval!)</p><p>Fresh off their NousCon event, Karan and Shannon from Nous Research joined us to discuss their latest project, Forge. Described by Shannon as "Jarvis on the front end," Forge is an inference engine designed to push the limits of what’s possible with existing LLMs. Their secret weapon? Inference-time compute. By implementing sophisticated techniques like Monte Carlo Tree Search (MCTS), Forge can outperform larger models on complex reasoning tasks beating OpenAI's o1-preview at the AIME Eval, competition math benchmark, even with smaller, locally runnable models like Hermes 70B. As Karan emphasized, “We’re actually just scoring with Hermes 3.1, which is available to everyone already...we can scale it up to outperform everything on math, just using a system like this.”</p><p>Forge isn't just about raw performance, though. It's built with usability and transparency in mind. 
Unlike OpenAI's o1, which obfuscates its chain of thought reasoning, Forge provides users with a clear visual representation of the model's thought process. "You will still have access in the sidebar to the full chain of thought," Shannon explained, adding, “There’s a little visualizer and it will show you the trajectory through the tree… you’ll be able to see exactly what the model was doing and why the node was selected.” Forge also boasts built-in memory, a graph database, and even code interpreter capabilities, initially supporting Python, making it a powerful platform for building complex LLM applications.</p><p>Forge is currently in a closed beta, but a waitlist is open for eager users. Karan and Shannon are taking a cautious approach to the rollout, as this is Nous Research’s first foray into hosting a product. For those lucky enough to gain access, Forge offers a tantalizing glimpse into the future of LLM interaction, promising greater transparency, improved reasoning, and more control over the model's behavior.</p><p>For early ThursdAI readers, here's a <a target="_blank" href="https://qkj75zosy2s.typeform.com/Nousresearch?typeform-source=t.co">waitlist form</a> to test it out!</p><p>Big Companies and APIs: The Reasoning Revolution</p><p>OpenAI’s o1: A New Era of LLM Reasoning</p><p>The big story in the Big Tech world is OpenAI's o1. Since we covered it live last week as it dropped, many of us have been playing with these new reasoning models, and collecting "vibes" from the community. These models represent a major leap in reasoning capabilities, and the results speak for themselves. </p><p>o1-preview claimed the top spot across the board on the LMSys Arena leaderboard, demonstrating significant improvements in complex tasks like competition math and coding. Even the smaller o1-mini showed impressive performance, outshining larger models in certain technical areas. (and the jump in ELO score above the rest in MATH is just incredible to see!) 
and some folks made viral this video of a PhD candidate reacting to o1 writing, in one shot, code that took him a year to write. Check it out, it’s priceless. </p><p>One key aspect of o1 is the concept of “inference-time compute”. As Noam Brown from OpenAI calls it, this represents a "new scaling paradigm", allowing the model to spend more time “thinking” during inference, leading to significantly improved performance on reasoning tasks. The implications of this are vast, opening up the possibility of LLMs tackling long-horizon problems in areas like drug discovery and physics.</p><p>However, the hidden/obfuscated chain-of-thought reasoning of o1, and the ban on users asking about it, was a major point of contention, at least within the ThursdAI chat. As Wolfram Ravenwolf put it, "The AI gives you an answer and you can't even ask how it got there. That is the wrong direction," as he was referring to the fact that not only is asking about the reasoning impossible, some folks were actually getting threatening emails and getting banned from using the product altogether 😮</p><p>This Week's Buzz: Hackathons and RAG Courses!</p><p>We're almost ready to host our Weights & Biases <a target="_blank" href="https://lu.ma/judge">Judgment Day Hackathon</a> (LLMs as a judge, anyone?) with a few spots left, so if you're reading this and in SF, come hang out with us!</p><p>And the main thing I gave an update about is our <a target="_blank" href="https://www.wandb.courses/courses/rag-in-production?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep19">Advanced RAG course</a>, packed with insights from experts at Weights & Biases, Cohere, and Weaviate. 
Definitely check those out if you want to level up your LLM skills (and it's <a target="_blank" href="https://www.wandb.courses/courses/rag-in-production?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep19">FREE</a> in our courses academy!)</p><p>Vision & Video: The Rise of Generative Video</p><p>Generative video is having its moment, with a flurry of exciting announcements this week. First up, the open-source CogVideoX-5B-I2V, which brings accessible image-to-video capabilities to the masses. It's not perfect, but being able to generate video on your own hardware is a game-changer.</p><p>On the closed-source front, YouTube announced the integration of generative AI into YouTube Shorts with their DreamScreen feature, bringing AI-powered video generation to a massive audience. We also saw API releases from three leading video model providers: Runway, DreamMachine, and Kling, making it easier than ever to integrate generative video into applications. Runway even unveiled a video-to-video model, offering even more control over the creative process, and it's wild, check out what folks are doing with video-2-video! </p><p>One last thing here, Kling is adding a motion brush feature to help users guide their video generations, and it just looks so awesome I wanted to show you</p><p>Whew! That was one hell of a week, tho from the big companies perspective, it was a very slow week, getting a new OSS king, an end to end voice model and a new hint of inference platform from Nous, and having all those folks come to the show was awesome! </p><p>If you're reading all the way down to here, it seems that you like this content, why not share it with 1 or two friends? 👇 And as always, thank you for reading and subscribing! 
🫶</p><p></p><p>P.S - I’m traveling for the next two weeks, and this week the live show was live recorded from San Francisco, thanks to my dear friends <a target="_blank" href="https://substack.com/profile/89230629-swyx-and-alessio">swyx & Alessio</a>  for hosting my again in their awesome <a target="_blank" href="null">Latent Space</a> pod studio at <a target="_blank" href="https://www.solarissociety.org/">Solaris SF</a>! </p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-19-qwen-25-new-oss-king</link><guid isPermaLink="false">substack:post:149124060</guid><dc:creator><![CDATA[Alex Volkov and desiderata]]></dc:creator><pubDate>Thu, 19 Sep 2024 23:40:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/149124060/964eecd9bdf5c15ff193c4fe79502531.mp3" length="83589506" type="audio/mpeg"/><itunes:author>Alex Volkov and desiderata</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6966</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/149124060/1da7382a66a9b6d14b98223dfe13377f.jpg"/></item><item><title><![CDATA[🔥 📅 ThursdAI - Sep 12 - OpenAI's 🍓 is called 01 and is HERE, reflecting on Reflection 70B, Google's new auto podcasts & more AI news from last week]]></title><description><![CDATA[<p>March 14th, 2023 was the day ThursdAI was born, it was also the day OpenAI released GPT-4, and I jumped into a Twitter space and started chaotically reacting together with other folks about what a new release of a paradigm shifting model from OpenAI means, what are the details, the new capabilities. Today, it happened again! 
</p><p>Hey, it's Alex, I'm back from my mini vacation (pic after the signature) and boy am I glad I decided to not miss September 12th! The long rumored 🍓 thinking model from OpenAI, dropped as breaking news in the middle of ThursdAI live show, giving us plenty of time to react live! </p><p>But before this, we already had an amazing show with some great guests! Devendra Chaplot from Mistral came on and talked about their newly torrented (yeah they did that again) Pixtral VLM, their first multi modal! , and then I had the honor to host Steven Johnson and Raiza Martin from  NotebookLM team at Google Labs which shipped something so uncannily good, that I legit said "holy fu*k" on X in a reaction!   </p><p>So let's get into it (TL;DR and links will be at the end of this newsletter)</p><p>OpenAI o1, o1 preview and o1-mini, a series of new "reasoning" models</p><p>This is it folks, the strawberries have bloomed, and we finally get to taste them. OpenAI has released (without a waitlist, 100% rollout!) o1-preview and o1-mini models to chatGPT and API (tho only for tier-5 customers) 👏 and are working on releasing 01 as well.</p><p>These are models that think before they speak, and have been trained to imitate "system 2" thinking, and integrate chain-of-thought reasoning internally, using Reinforcement Learning and special thinking tokens, which allows them to actually review what they are about to say before they are saying it, achieving <strong>remarkable results</strong> on logic based questions.</p><p>Specifically you can see the jumps in the very very hard things like competition math and competition code, because those usually require a lot of reasoning, which is what these models were trained to do well. 
</p><p>New scaling paradigm </p><p>Noam Brown from OpenAI calls this a "new scaling paradigm" and Dr Jim Fan explains <a target="_blank" href="https://x.com/DrJimFan/status/1834284702494327197">why</a>, with this new way of "reasoning", <strong>the longer the model thinks - the better it does on reasoning tasks</strong>, they call this "test-time compute" or "inference-time compute" as opposed to compute that was used to train the model. This shifting of computation down to inference time is the essence of the paradigm shift, as in, pre-training can be very limiting computationally as the models scale in size of parameters, they can only go so big until you have to start building out a huge new supercluster of GPUs to host the next training run (Remember Elon's Colossus from last week?). </p><p>The interesting thing to consider here is, while current "thinking" times are ranging between a few seconds to a minute, imagine giving this model hours, days, weeks to think about new drug problems, physics problems 🤯.</p><p>Prompting o1 </p><p>Interestingly, a new prompting paradigm has also been introduced. These models now have CoT (think "step by step") built-in, so you no longer have to include it in your prompts. By simply switching to o1-mini, most users will see better results right off the bat. OpenAI has worked with the Devin team to test drive these models, and these folks found that asking the new models to just give the final answer often works better and avoids redundancy in instructions.</p><p>The community of course will learn what works and doesn't in the next few hours, days, weeks, which is why we got 01-preview and not the actual (much better) o1. 
</p><p>Safety implications and future plans</p><p>According to <a target="_blank" href="https://x.com/gdb/status/1834295775674990676">Greg Brokman</a>, this inference time compute also greatly helps with aligning the model to policies, giving it time to think about policies at length, and improving security and jailbreak preventions, not only logic. </p><p>The folks at OpenAI are so proud of all of the above that they have decided to restart the count and call this series o1, but they did mention that they are going to release GPT series models as well, adding to the confusing marketing around their models. </p><p>Open Source LLMs </p><p>Reflecting on Reflection 70B</p><p><a target="_blank" href="https://sub.thursdai.news/p/thursdai-sep-5-reflection-70b-beats">Last week</a>, Reflection 70B was supposed to launch live on the ThursdAI show, and while it didn't happen live, I did add it in post editing, and sent the newsletter, and packed my bag, and flew for my vacation. I got many DMs since then, and at some point couldn't resist checking and what I saw was complete chaos, and despite this, I tried to disconnect still until last night. </p><p>So here's what I could gather since last night. The claims of a llama 3.1 70B finetune that Matt Shumer and Sahil Chaudhary from Glaive beating Sonnet 3.5 are proven false, nobody was able to reproduce those evals they posted and boasted about, which is a damn shame. </p><p>Not only that, multiple trusted folks from our community, like <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-11-mixture-of-agents?utm_source=publication-search">Kyle Corbitt</a>, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-july-11-mixture-of-agents?utm_source=publication-search">Alex Atallah</a> have reached out to Matt in to try to and get to the bottom of how such a thing would happen, and how claims like these could have been made in good faith. 
(or was there foul play) </p><p>The core idea of something like Reflection is actually very interesting, but alas, the inability to replicate, but also to stop engaging with he community openly (I've reached out to Matt and given him the opportunity to come to the show and address the topic, he did not reply), keep the model on hugging face where it's still trending, claiming to be the world's number 1 open source model, all these smell really bad, despite multiple efforts on out part to give the benefit of the doubt here. </p><p>As for my part in building the hype on this (last week's issues till claims that this model is top open source model), I addressed it in the beginning of the show, but then twitter spaces crashed, but unfortunately as much as I'd like to be able to personally check every thing I cover, I often have to rely on the reputation of my sources, which is easier with established big companies, and this time this approached failed me. </p><p>This weeks Buzzzzzz - One last week till our hackathon! </p><p>Look at this point, if you read this newsletter and don't know about our hackathon, then I really didn't do my job prompting it, but it's coming up, September 21-22 ! Join us, it's going to be a LOT of fun! </p><p>🖼️ Pixtral 12B from Mistral </p><p>Mistral AI burst onto the scene with <strong>Pixtral, their first multimodal model!</strong> Devendra Chaplot, research scientist at Mistral, joined ThursdAI to explain their unique approach, ditching fixed image resolutions and training a vision encoder from scratch.</p><p>"We designed this from the ground up to...get the most value per flop," Devendra explained. Pixtral handles multiple images interleaved with text within a 128k context window - a far cry from the single-image capabilities of most open-source multimodal models. And to make the community erupt in thunderous applause (cue the clap emojis!) they released the 12 billion parameter model under the ultra-permissive Apache 2.0 license. 
You can give Pixtral a whirl on Hyperbolic, HuggingFace, or directly through Mistral.</p><p>DeepSeek 2.5: When Intelligence Per Watt is King</p><p>Deepseek 2.5 launched amid the reflection news and did NOT get the deserved attention it.... deserves. It folded (no deprecated) Deepseek Coder into 2.5 and shows incredible metrics and  a truly next-gen architecture. "It's like a higher order MOE", Nisten revealed, "which has this whole like pile of brain and it just like picks every time, from that." 🤯. DeepSeek 2.5 achieves maximum "intelligence per active parameter"</p><p>Google's turning text into AI podcast for auditory learners with Audio Overviews</p><p>Today I had the awesome pleasure of chatting with <strong>Steven Johnson</strong> and <strong>Raiza Martin</strong> from the NotebookLM team at Google Labs. NotebookLM is a research tool, that if you haven't used, you should definitely give it a spin, and this week they launched something I saw in preview and was looking forward to checking out and honestly was jaw-droppingly impressed today. </p><p>NotebookLM allows you to upload up to 50 "sources" which can be PDFs, web links that they will scrape for you, documents etc' (no multimodality so far) and will allow you to chat with them, create study guides, dive deeper and add notes as you study. </p><p>This week's update allows someone who doesn't like reading, to turn all those sources into a legit 5-10 minute podcast, and that sounds so realistic, that I was honestly blown away. I uploaded a documentation of fastHTML in there.. and well hear for yourself </p><p>The conversation with Steven and Raiza was really fun, podcast definitely give it a listen! </p><p>Not to mention that Google released (under waitlist) another podcast creating tool called illuminate, that will convert ArXiv papers into similar sounding very realistic 6-10 minute <a target="_blank" href="https://illuminate.google.com/home?pli=1">podcasts</a>! 
</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p></p><p>There are many more updates from this week, there was a whole Apple keynote I missed, which had a new point and describe feature with AI on the new iPhones and Apple Intelligence, Google also released new DataGemma 27B, and more things in TL'DR which are posted here in raw format </p><p>See you next week 🫡 Thank you for being a subscriber, weeks like this are the reason we keep doing this! 🔥 Hope you enjoy these models, leave in comments what you think about them</p><p>TL;DR in raw format</p><p>* Open Source LLMs </p><p>* Reflect on Reflection 70B & Matt Shumer (<a target="_blank" href="https://twitter.com/Yuchenj_UW/status/1833627813552992722">X</a>, <a target="_blank" href="https://twitter.com/csahil28/status/1833619624589725762">Sahil</a>)</p><p>* Mixtral releases Pixtral 12B - multimodal model (<a target="_blank" href="https://x.com/MistralAI/status/1833758285167722836">X</a>, <a target="_blank" href="https://huggingface.co/spaces/WildVision/vision-arena">try it</a>)</p><p>* Pixtral is really good at OCR says <a target="_blank" href="https://x.com/swyx/status/1833934254834942047">swyx</a></p><p>* Interview with Devendra Chaplot on ThursdAI</p><p>* Initial reports of Pixtral beating GPT-4 on <a target="_blank" href="https://huggingface.co/spaces/WildVision/vision-arena">WildVision arena</a> from AllenAI</p><p>* JinaIA <strong>reader-lm-0.5b</strong> and <strong>reader-lm-1.5b (</strong><a target="_blank" href="https://x.com/JinaAI_/status/1833861180445860168"><strong>X</strong></a><strong>)</strong></p><p>* ZeroEval updates</p><p>* Deepseek 2.5 - </p><p>* Deepseek coder is now folded into DeepSeek v2.5</p><p>* 89 HumanEval (up from 84 from deepseek v2)</p><p>* 9 on MT-bench</p><p>* Google - DataGemma 27B (RIG/RAG) for improving results </p><p>* 
<strong>Retrieval-Interleaved Generation </strong> </p><p>* 🤖 DataGemma: AI models that connect LLMs to Google's Data Commons</p><p>* 📊 Data Commons: A vast repository of trustworthy public data</p><p>* 🔍 Tackling AI hallucination by grounding LLMs in real-world data</p><p>* 🔍 Two approaches: RIG (Retrieval-Interleaved Generation) and RAG (Retrieval-Augmented Generation)</p><p>* 🔍 Preliminary results show enhanced accuracy and reduced hallucinations</p><p>* 🔓 Making DataGemma open models to enable broader adoption</p><p>* 🌍 Empowering informed decisions and deeper understanding of the world</p><p>* 🔍 Ongoing research to refine the methodologies and scale the work</p><p>* 🔍 Integrating DataGemma into Gemma and Gemini AI models</p><p>* 🤝 Collaborating with researchers and developers through quickstart notebooks</p><p>* Big CO LLMs + APIs</p><p>* Apple event</p><p>* Apple Intelligence - launching soon</p><p>* Visual Intelligence with a dedicated button</p><p>* Google Illuminate - generate arXiv paper into multiple speaker podcasts (<a target="_blank" href="https://illuminate.google.com/home?pli=1">Website</a>)</p><p>* 5-10 min podcasts</p><p>* multiple speakers</p><p>* any paper </p><p>* waitlist</p><p>* has samples</p><p>* sounds super cool</p><p>* Google NotebookLM is finally available - multi modal research tool + podcast (<a target="_blank" href="https://notebooklm.google.com/notebook/a93dd3ea-d6c0-4e44-a88a-2630e05581f4">NotebookLM</a>)</p><p>* Has RAG like abilities, can add sources from drive or direct web links</p><p>* Currently not multimodal</p><p>* Generation of multi speaker conversation about this topic to present it, sounds really really realistic</p><p>* Chat with Steven and Raiza</p><p>* OpenAI reveals new o1 models, and launches o1 preview and o1-mini in chat and API (<a target="_blank" href="https://x.com/polynoamial/status/1834280425457426689">X</a>, <a target="_blank" href="https://openai.com/o1/">Blog</a>)</p><p>* Trained with RL to think before 
it speaks with special thinking tokens (that you pay for)</p><p>* new scaling paradigm</p><p>* This weeks Buzz</p><p>* Vision & Video</p><p>* Adobe announces Firefly video model (<a target="_blank" href="https://x.com/CharaspowerAI/status/1833915411110265103">X</a>)</p><p>* Voice & Audio</p><p>* Hume launches EVI 2 (<a target="_blank" href="https://x.com/hume_ai/status/1833921932275986673">X</a>)</p><p>* Fish Speech 1.4 (<a target="_blank" href="https://x.com/reach_vb/status/1833801060659372071">X</a>)</p><p>* Instant Voice Cloning</p><p>* Ultra low latenc</p><p>* ~1GB model weights</p><p>* LLaMA-Omni, a new model for speech interaction (<a target="_blank" href="https://x.com/osanseviero/status/1833860776823562511">X</a>)</p><p>*  Tools</p><p>* New Jina reader (<a target="_blank" href="https://x.com/JinaAI_/status/1833861180445860168">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-12-openais-is-called</link><guid isPermaLink="false">substack:post:148831936</guid><dc:creator><![CDATA[Alex Volkov and Devendra Chaplot]]></dc:creator><pubDate>Fri, 13 Sep 2024 00:50:10 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148831936/86c274dddd1091e43426534986cfba63.mp3" length="85134007" type="audio/mpeg"/><itunes:author>Alex Volkov and Devendra Chaplot</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>7094</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/148831936/bbad0ded4377f25f12c4594a98af8423.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Sep 5 - 👑 Reflection 70B beats Claude 3.5, Anthropic Enterprise 500K context, 100% OSS MoE from AllenAI, 1000 agents world sim, Replit agent is the new Cursor? 
and more AI news]]></title><description><![CDATA[<p>Welcome back everyone, can you believe it's another ThursdAI already? And can you believe me when I tell you that friends of the pod Matt Shumer & Sahil form Glaive.ai just dropped a LLama 3.1 70B finetune that you can download that will <strong>outperform Claude Sonnet 3.5</strong> while running locally on your machine? </p><p>Today was a VERY heavy Open Source focused show, we had a great chat w/ Niklas, the leading author of OLMoE, a new and 100% open source MoE from Allen AI, a chat with Eugene (pico_creator) about RWKV being deployed to over 1.5 billion devices with Windows updates and a lot more. </p><p>In the realm of the big companies, Elon shook the world of AI by turning on the biggest training cluster called Colossus (100K H100 GPUs) which was scaled in 122 days 😮 and Anthropic announced that they have 500K context window Claude that's only reserved if you're an enterprise customer, while OpenAI is floating an idea of a $2000/mo subscription for Orion, their next version of a 100x better chatGPT?! 
</p><p>TL;DR</p><p>* <strong>Open Source LLMs</strong> </p><p>* Matt Shumer / Glaive - Reflection-LLama 70B beats Claude 3.5 (<a target="_blank" href="https://x.com/mattshumer_/status/1831767014341538166">X</a>, <a target="_blank" href="https://huggingface.co/mattshumer/Reflection-70B">HF</a>)</p><p>* Allen AI - OLMoE - first "good" MoE 100% OpenSource (<a target="_blank" href="https://x.com/Muennighoff/status/1831159130230587486">X</a>, <a target="_blank" href="https://blog.allenai.org/olmoe-an-open-small-and-state-of-the-art-mixture-of-experts-model-c258432d0514">Blog</a>, <a target="_blank" href="https://arxiv.org/abs/2409.02060">Paper</a>, <a target="_blank" href="https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3">WandB</a>)</p><p>* RWKV.cpp is deployed with Windows to 1.5 Billion devices</p><p>* MMMU pro - more robust multi disipline multimodal understanding bench (<a target="_blank" href="https://mmmu-benchmark.github.io/">proj</a>)</p><p>* 01AI - Yi-Coder 1.5B and 9B (X, <a target="_blank" href="https://01-ai.github.io/blog.html?post=en/2024-09-05-A-Small-but-Mighty-LLM-for-Code.md">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/01-ai/yi-coder-66bdb00f5bdd611f9a008f30">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Replit launches Agent in beta - from coding to production (X, Try It)</p><p>* Ilya SSI announces 1B round from everyone (<a target="_blank" href="https://www.reuters.com/technology/artificial-intelligence/openai-co-founder-sutskevers-new-safety-focused-ai-startup-ssi-raises-1-billion-2024-09-04/">Post</a>)</p><p>* Cohere updates Command-R and Command R+ on API (<a target="_blank" href="https://cohere.com/blog/command-series-0824">Blog</a>)</p><p>* Claude Enterprise with 500K context window (<a target="_blank" href="https://www.anthropic.com/news/claude-for-enterprise">Blog</a>)</p><p>* Claude invisibly adds instructions (even via the API?) 
(<a target="_blank" href="https://x.com/goodside/status/1830747657653940532">X</a>)</p><p>* Google got structured output finally (<a target="_blank" href="https://ai.google.dev/gemini-api/docs/json-mode?lang=python#supply-schema-in-config">Docs</a>)</p><p>* Amazon to include Claude in Alexa starting this October (<a target="_blank" href="https://www.reuters.com/technology/artificial-intelligence/amazon-turns-anthropics-claude-alexa-ai-revamp-2024-08-30/">Blog</a>)</p><p>* X ai scaled Colossus to 100K H100 GPU goes online (<a target="_blank" href="https://x.com/elonmusk/status/1815325410667749760">X</a>)</p><p>* DeepMind - AlphaProteo new paper (<a target="_blank" href="https://deepmind.google/discover/blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/?utm_source=x&#38;utm_medium=&#38;utm_campaign=gdm&#38;utm_content=">Blog</a>, <a target="_blank" href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/Protein_Design_White_Paper_2024.pdf">Paper</a>, <a target="_blank" href="https://www.youtube.com/watch?v=lI3EoCjWC2E">Video</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* <a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon">Hackathon</a> did we mention? We're going to have Eugene and Greg as Judges!</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance - LoopyAvatar - Audio Driven portait avatars (<a target="_blank" href="https://loopyavatar.github.io/">Page</a>)</p><p>Open Source LLMs</p><p><strong>Reflection Llama-3.1 70B</strong> - new 👑 open source LLM from Matt Shumer / GlaiveAI </p><p>This model is BANANAs folks, this is a LLama 70b finetune, that was trained with a new way that Matt came up with, that bakes CoT and Reflection into the model via Finetune, which results in model outputting its thinking as though you'd prompt it in a certain way. 
</p><p>This causes the model to say something, and then check itself, and then reflect on the check and then finally give you a much better answer. Now you may be thinking, we could do this before, <a target="_blank" href="https://arxiv.org/pdf/2303.11366.pdf">RefleXion</a> (<a target="_blank" href="https://arxiv.org/pdf/2303.11366.pdf">arxiv.org/2303.11366</a>) came out a year ago, so what's new? </p><p>What's new is, this is now happening inside the models head, you don't have to reprompt, you don't even have to know about these techniques! So what you see above, is just colored differently, but all of it, is output by the model without extra prompting by the user or extra tricks in system prompt. the model thinks, plans, does chain of thought, then reviews and reflects, and then gives an answer! </p><p>And the results are quite incredible for a 70B model 👇</p><p>Looking at these evals, this is a 70B model that beats GPT-4o, Claude 3.5 on Instruction Following (IFEval), MATH, GSM8K with 99.2% 😮 and gets very close to Claude on GPQA and HumanEval! </p><p>(Note that these comparisons are a bit of a apples to ... different types of apples. If you apply CoT and reflection to the Claude 3.5 model, they may in fact perform better on the above, as this won't be counted 0-shot anymore. But given that this new model is effectively spitting out those reflection tokens, I'm ok with this comparison)</p><p>This is just the 70B, next week the folks are planning to drop the 405B finetune with the technical report, so stay tuned for that! </p><p>Kudos on this work, go give <a target="_blank" href="https://x.com/mattshumer_/status/1831767014341538166">Matt Shumer</a> and <a target="_blank" href="https://x.com/GlaiveAI">Glaive AI</a> a follow! 
</p><p>Allen AI OLMoE - tiny "good" MoE that's 100% open source, weights, code, logs</p><p>We've <a target="_blank" href="https://sub.thursdai.news/i/141297426/olmo-from-ai-new-fully-open-source-b-model-announcement">previously covered</a> OLMO from Allen Institute, and back then it was obvious how much commitment they have to open source, and this week they continued on this path with the release of OLMoE, an Mixture of Experts 7B parameter model (1B active parameters), trained from scratch on 5T tokens, which was completely open sourced. </p><p>This model punches above its weights on the best performance/cost ratio chart for MoEs and definitely highest on the charts of releasing everything. </p><p>By everything here, we mean... everything, not only the final weights file; they released 255 checkpoints (every 5000 steps), the training code (<a target="_blank" href="https://github.com/allenai/OLMoE">Github</a>) and even (and maybe the best part) the <a target="_blank" href="https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3">Weights & Biases logs</a>! </p><p>It was a pleasure to host the leading author of the OLMoE paper, <a target="_blank" href="https://x.com/Muennighoff"><strong>Niklas Muennighoff</strong></a><strong> </strong>on the show today, so definitely give this segment a listen, he's a great guest and I learned a lot! </p><p>Big Companies LLMs + API</p><p>Anthropic has 500K context window Claude but only for Enterprise? </p><p>Well, this sucks (unless you work for Midjourney, Airtable or Deloitte). Apparently Anthropic has been sitting on Claude that can extend to half a million tokens in the context window, and decided to keep it to themselves and a few trial enterprises, and package it as an Enterprise offering. 
</p><p>This offering now includes, beyond just the context window, also a native Github integration, and a few key enterprise features like access logs, provisioning and SCIM and all kinds of "procurement and CISO required" stuff enterprises look for. </p><p>To be clear, this is a great move for Anthropic, and this isn't an API tier, this is for their front end offering, including the indredible artifacts tool, so that companies can buy their employees access to <a target="_blank" href="http://Claude.ai">Claude.ai</a> and have them be way more productive coding (hence the Github integration) or summarizing (very very) long documents, building mockups and one off apps etc' </p><p>Anthropic is also in the news this week, because Amazon announced that it'll use Claude as the backbone for the smart (or "remarkable" as they call it) Alexa brains coming up in October, which, again, incredible for Anthropic distribution, as there are maybe 100M Alexa users in the world or so. </p><p>Prompt injecting must stop! </p><p>And lastly, there have been <a target="_blank" href="https://x.com/goodside/status/1830747657653940532">mounting evidence</a>, <a target="_blank" href="https://x.com/WolframRvnwlf/status/1831722744955711793">including our own</a> Wolfram Ravenwolf that confirmed it, that Anthropic is prompt injecting additional context into your own prompts, in the UI but also via the API! This is awful practice and if anyone from there reads this newsletter, please stop or at least acknowledge. Claude apparently just... thinks that it's something my users said, when in fact, it's some middle layer of anthropic security decided to just inject some additional words in there!</p><p>XAI turns on the largest training GPU SuperCluster Colossus - 100K H100 GPUS</p><p>This is a huge deal for AI, specifically due to the time this took and the massive massive scale of this SuperCluster. 
SuperCluster means all these GPUs sit in one datacenter, drawing from the same power-grid and can effectively run single training jobs. </p><p>This took <a target="_blank" href="https://x.com/ebbyamir/status/1830693650327880135">just 122 days</a> for Elon and the XAI team to go from an empty warehouse in Memphis to booting up an incredible 100K H100, and they claim that they will double this capacity by adding 50K H200 in the next few months. As Elon mentioned when they released Grok2, it was trained on 15K, and it matched GPT4! </p><p>Per SemiAnalisys, this new Supercluster can train a GPT-4 level model in <a target="_blank" href="https://www.semianalysis.com/p/100000-h100-clusters-power-network?triedRedirect=true&#38;utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-everybody-shipped-small-things-this">just 4 days</a> 🤯 </p><p>XAI was founded a year ago, and by end of this year, they plan for Grok to be the beast LLM in the world, and not just get to GPT-4ish levels, and with this + 6B investment they have taken in early this year, it seems like they are well on track, which makes some folks at OpenAI <a target="_blank" href="https://www.theinformation.com/articles/why-musks-ai-rivals-are-alarmed-by-his-new-gpu-cluster?utm_campaign=Editorial&#38;utm_content=Newsletter%2CAI+Agenda&#38;utm_medium=organic_social&#38;utm_source=twitter">reportedly worried</a></p><p>This weeks buzz - we're in SF in less than two weeks, join our hackathon! 
</p><p>This time I'm very pleased to announce incredible judges for our hackathon, the spaces are limited, but there's still some spaces so please feel free to <a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=sep5">sign up and join us</a></p><p>I'm so honored to announce that we'll have Eugene Yan (<a target="_blank" href="https://x.com/eugeneyan">@eugeneyan</a>), Greg Kamradt (<a target="_blank" href="https://x.com/GregKamradt">@GregKamradt</a>) and Charles Frye (<a target="_blank" href="https://x.com/charles_irl">@charles_irl</a>) on the Judges panel. 🤩 It'll be incredible to have these folks see what hackers come up with, and I'm excited as this comes closer! </p><p>Replit launches Agents beta - a fully integrated code → deployment agent </p><p>Replit is a great integrated editing environment, with database and production in 1 click and they've had their LLMs trained on a LOT of code helping folks code for a while. </p><p>Now they are launching agents, which seems very smart from them, given that development is much more than just coding. All the recent excitement we see about Cursor, is omitting the fact that those demos are only working for folks who already know how to set up the environment, and then there's the need to deploy to production, maintain.</p><p>Replit has that basically built in, and now their Agent can build a plan and help you build those apps, and "ship" them, while showing you what they are doing. This is massive, and I can't wait to play around with this! </p><p>The additional benefit of Replit is that they nailed the mobile app experience as well, so this now works from mobile, on the go! </p><p>In fact, as I was writing this, I got so excited that I paused for 30 minutes, payed the yearly subscription and decided to give building an app a try! 
</p><p>The fact that this can deploy and run the server and the frontend, detect errors, fix them, and then also provision a DB for me, provision Stripe, login buttons and everything else is quite insane. </p><p>Can't wait to see what I can spin up with this 🔥 (and show all of you!) </p><p>Loopy - Animated Avatars from ByteDance </p><p>A new animated avatar project from folks at ByteDance just dropped, and it’s WAY clearer than anything we’ve seen before, like EMO or anything else. I will just add this video here for you to enjoy - look at the earring movements, vocal cords, eyes, everything! </p><p>I of course wanted to know if I’ll ever be able to use this, and... likely not. Here’s the response I got today from Jianwen, one of the authors. </p><p>That's it for this week. We've talked about so much more in the pod, so please check it out. </p><p>As for me, while so many exciting things are happening, I'm going on a small 🏝️ vacation until next ThursdAI, which will happen on schedule. I'm planning to decompress and disconnect, but will still be checking in, so if you see things that are interesting, please tag me on X 🙏 </p><p>P.S - I want to shout out a dear community member who's been doing just that: <a target="_blank" href="https://x.com/Presidentlin/status/1831712399893663829">@PresidentLin</a> has been tagging me in many AI related releases, often way before I would even notice them, so please give them a follow! 🫡 </p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-5-reflection-70b-beats</link><guid isPermaLink="false">substack:post:148556842</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 06 Sep 2024 00:10:15 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148556842/ab3399c5b41cb09ffd18d0411c3e6cd8.mp3" length="75560014" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6296</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/148556842/175e45c1e2780af2410c6aae141ed359.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Aug 29 - AI Plays DOOM, Cerebras breaks inference records, Google gives new Geminis, OSS vision SOTA & 100M context windows!?]]></title><description><![CDATA[<p>Hey, for the last time during the summer of 2024, welcome to yet another edition of ThursdAI, and happy Skynet self-awareness day for those who keep track :) </p><p>This week, Cerebras broke the world record for fastest Llama 3.1 70B/8B inference (and came on the show to talk about it), Google released 3 new Geminis, Anthropic opened artifacts to all, 100M context windows are possible, and Qwen beats SOTA on vision models + much more! </p><p>As always, this week's newsletter is brought to you by Weights & Biases. Did I mention we're doing a <a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon">hackathon</a> in SF on September 21/22 and that we have an <a target="_blank" href="https://wandb.me/rag-course">upcoming free</a> RAG course w/ Cohere & Weaviate? 
</p><p>TL;DR</p><p>* <strong>Open Source LLMs</strong> </p><p>* Nous DisTrO - Distributed Training (<a target="_blank" href="https://x.com/NousResearch/status/1828121648383566270">X</a>, <a target="_blank" href="https://github.com/NousResearch/DisTrO/blob/main/A_Preliminary_Report_on_DisTrO.pdf">Report</a>)</p><p>* NousResearch/hermes-function-calling-v1 open sourced - (<a target="_blank" href="https://x.com/NousResearch/status/1829143753036366325">X</a>, <a target="_blank" href="https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1">HF</a>)</p><p>* LinkedIn Liger-Kernel - One line to make training 20% faster & 60% more memory efficient (<a target="_blank" href="https://github.com/linkedin/Liger-Kernel">Github</a>)</p><p>* Cartesia - Rene 1.3B LLM SSM + Edge Apache 2 acceleration (<a target="_blank" href="https://x.com/cartesia_ai/status/1828500784033735156">X</a>, <a target="_blank" href="https://cartesia.ai/blog/2024-08-27-on-device">Blog</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Cerebras launches the fastest AI inference - 447t/s Llama 3.1 70B (<a target="_blank" href="https://x.com/CerebrasSystems/status/1828464491677524311">X</a>, <a target="_blank" href="https://cerebras.ai/inference">Blog</a>, <a target="_blank" href="https://inference.cerebras.ai/">Try It</a>)</p><p>* Google - Gemini 1.5 Flash 8B & new Gemini 1.5 Pro/Flash (<a target="_blank" href="https://x.com/OfficialLoganK/status/1828480081574142227">X</a>, <a target="_blank" href="https://aistudio.google.com/app/prompts/new_chat">Try it</a>)</p><p>* Google adds Gems & Imagen to Gemini paid tier</p><p>* Anthropic artifacts available to all users + on mobile (<a target="_blank" href="https://www.anthropic.com/news/artifacts">Blog</a>, <a target="_blank" href="https://claude.ai/chat/d6e5acfd-7be9-4cb0-8ae5-41fbe6d509f4">Try it</a>)</p><p>* Anthropic publishes their system prompts with model releases (<a target="_blank" 
href="https://docs.anthropic.com/en/release-notes/system-prompts">release notes</a>)</p><p>* OpenAI has project Strawberry coming this fall (via The Information)</p><p>* <strong>This week's Buzz</strong></p><p>* WandB Hackathon hackathon hackathon (<a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon">Register</a>, <a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon">Join</a>)</p><p>* Also, we have a new RAG course w/ Cohere and Weaviate (<a target="_blank" href="https://www.wandb.courses/courses/rag-in-production">RAG Course</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Zhipu AI CogVideoX - 5B Video Model w/ less than 10GB of VRAM (<a target="_blank" href="https://x.com/ChatGLM/status/1828402245949628632">X</a>, <a target="_blank" href="https://huggingface.co/collections/THUDM/cogvideo-66c08e62f1685a3ade464cce">HF</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/cogvideox-5b?share=ebaa1d13-10a2-4053-a204-83f1d49489fc">Try it</a>)</p><p>* Qwen-2 VL 72B, 7B, 2B - new SOTA vision models from Qwen (<a target="_blank" href="https://x.com/Alibaba_Qwen/status/1829187276179681634">X</a>, <a target="_blank" href="https://qwenlm.github.io/blog/qwen2-vl/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen2-VL">HF</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* GameNGen - completely generated (not rendered) DOOM with SD1.4 (<a target="_blank" href="https://gamengen.github.io/">project</a>)</p><p>* FAL new LORA trainer for FLUX - trains in under 5 minutes (<a target="_blank" href="https://fal.ai/models/fal-ai/flux-lora-fast-training/playground">Trainer</a>, <a target="_blank" href="https://fal.ai/?coupon=ThursdA">Coupon for ThursdAI</a>)</p><p>* <strong>Tools & Others</strong></p><p>* SimpleBench from AI Explained - closely matches human experience (<a target="_blank" href="https://simple-bench.com/">simple-bench.com</a>)</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>Open Source</p><p>Let's be honest - ThursdAI is a love letter to the open-source AI community, and this week was packed with reasons to celebrate.</p><p>Nous Research DisTrO + Function Calling V1</p><p>Nous Research was on fire this week (aren't they always?) and they kicked off the week with the release of DisTrO, a breakthrough in distributed training. You see, LLM training doesn't just require a lot of hardware, it also requires a lot of network bandwidth between the different GPUs, even within the same data center. </p><p>Proprietary networking solutions like Nvidia NVLink, and more open standards like Ethernet, work well within the same datacenter, but training across different GPU clouds has been unimaginable until now. </p><p>Enter DisTrO, a new decentralized training method by the mad geniuses at Nous Research, in which they reduced the required bandwidth to train a 1.2B param model from <strong>74.4GB</strong> to just <strong>86MB</strong> (857x)! </p><p>This can have massive implications for training across compute clusters, doing shared training runs, optimizing costs and efficiency, and democratizing LLM training access! So don't sell your old GPUs just yet - someone may just come up with a folding@home-style network for training the largest open source LLM, and it may just be Nous! </p><p>Nous Research also released their function-calling-v1 dataset (<a target="_blank" href="https://huggingface.co/datasets/NousResearch/hermes-function-calling-v1/tree/main">HF</a>) that was used to train Hermes-2, and we had InterstellarNinja, who authored that dataset, join the show and chat about it. This is an incredible unlock for the open source community, as function calling becomes a de-facto standard now. 
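</p><p>If you haven't played with function calling: the core idea is that the model is shown a set of tool schemas and, instead of answering in prose, emits a structured call that your code parses and dispatches. Here's a minimal, generic sketch of that loop - my own illustration of the pattern, not the exact Hermes dataset format, and the tool names here are made up:</p>

```python
import json

# Tool schemas the model would see in its system prompt (the exact
# prompt format varies per model; this is a generic OpenAI-style shape).
TOOLS = {
    "get_weather": {
        "description": "Get current weather for a city",
        "parameters": {"city": {"type": "string"}},
        "fn": lambda city: f"Sunny in {city}, 24C",  # stand-in for a real API
    }
}

def dispatch(model_output: str) -> str:
    """Parse the structured tool call the model emitted and run it."""
    call = json.loads(model_output)
    tool = TOOLS[call["name"]]
    return tool["fn"](**call["arguments"])

# Pretend the LLM emitted this instead of a prose answer:
model_output = '{"name": "get_weather", "arguments": {"city": "Denver"}}'
print(dispatch(model_output))  # -> Sunny in Denver, 24C
```

<p>Datasets like hermes-function-calling-v1 exist to teach open models to reliably emit that structured middle step, which is exactly why they matter so much for agents.</p><p>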
Shout out to the Glaive team as well for their pioneering work that paved the way!</p><p>LinkedIn's Liger Kernel: Unleashing the Need for Speed (with One Line of Code)</p><p>What if I told you that you could add 1 line of code to whatever training software you develop, and it would run 20% faster and require 60% less memory? </p><p>This is basically what LinkedIn researchers released this week with Liger Kernel. Yes, you read that right, LinkedIn, as in the website you post career-related updates on! </p><p>"If you're doing any form of finetuning, using this is an instant win" - Wing Lian, Axolotl</p><p>This absolutely bonkers improvement in LLM training now works smoothly with Flash Attention, PyTorch FSDP and DeepSpeed. If you want to read more about the implementation of the Triton kernels, you can see a <a target="_blank" href="https://x.com/hsu_byron/status/1827072737673982056">deep dive here</a>. I just wanted to bring this to your attention, even if you're not technical, because efficiency jumps like these are happening all the time. We are used to seeing them in capabilities/intelligence, but they are also happening on the algorithmic/training/hardware side, and it's incredible to see!</p><p>Huge shoutout to Byron and team at <a target="_blank" href="https://www.linkedin.com/posts/byronhsu1230_httpslnkdinga5hpbt-maximizing-gpu-activity-7232841234115870721-yqQc/?utm_source=combined_share_message&#38;utm_medium=member_desktop_web">LinkedIn</a> for this unlock, check out their <a target="_blank" href="https://github.com/linkedin/Liger-Kernel">Github</a> if you want to get involved!</p><p>Qwen-2 VL - SOTA image and video understanding + open weights mini VLM</p><p>You may already know that we love the folks at Qwen here on ThursdAI, not only because Junyang Lin is a frequent co-host and we get to hear about their releases as soon as they come out (they seem to be releasing them on Thursdays around the time of the live show, I wonder why!) 
</p><p>But also because they are committed to open source, and have released 2 models, 7B and 2B, with a complete Apache 2 license! </p><p>First of all, their Qwen-2 VL 72B model is now SOTA at many benchmarks, beating GPT-4, Claude 3.5 and other much bigger models. This is insane. I literally had to pause Junyang and repeat what he said: this is a 72B param model that beats GPT-4o on document understanding, on math, on general visual Q&A. </p><p>Additional Capabilities & Smaller models</p><p>They have added new capabilities in these models, like being able to handle arbitrary resolutions, but the one I'm most excited about is the video understanding. These models can now understand up to 20 minutes of video sequences, and it's not just "split the video into 10 frames and do image captioning" - these models understand video progression, and if I understand correctly how they do it, it's quite genius. </p><p>They embed time progression into the model using a new technique called M-RoPE, which turns the time progression into rotary positional embeddings. </p><p>Now, the 72B model is currently available via API, but we do get 2 new small models with an Apache 2 license, and they are NOT too shabby either! </p><p>The 7B (<a target="_blank" href="https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct">HF</a>) and 2B Qwen-2 VL (<a target="_blank" href="https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct">HF</a>) are small enough to run completely on your machine, and the 2B one scores better than GPT-4o mini on OCR-bench, for example! </p><p>I can't wait to finish writing and go play with these models! 
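</p><p>To build a bit of intuition for "time as rotary positional embeddings", here is a toy numpy sketch of plain 1-D RoPE applied along a frame index. To be clear, this is my own illustrative sketch of the general rotary-embedding idea, not Qwen's actual M-RoPE implementation (which decomposes position into temporal and spatial components):</p>

```python
import numpy as np

def rope(x, t, base=10000.0):
    """Toy rotary positional embedding: rotate consecutive dim pairs of
    vector x by angles theta_i * t, where t is a position index
    (for video, t could be a frame/time index)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # per-pair rotation frequency
    ang = t * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Rotations preserve norms, so the embedding doesn't distort content...
assert np.allclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# ...and attention scores depend only on the *relative* distance in time:
s1 = rope(q, 7) @ rope(k, 3)    # frames 7 and 3 (gap of 4)
s2 = rope(q, 14) @ rope(k, 10)  # frames 14 and 10 (same gap of 4)
assert np.allclose(s1, s2)
```

<p>That relative-distance property in the last two lines is why rotary embeddings are attractive for encoding time: the model sees "4 frames apart" the same way regardless of where in the video the pair sits.</p><p>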
</p><p>Big Companies & LLM APIs</p><p>The biggest news this week came from Cerebras Systems, a relatively unknown company that shattered the world record for LLM inference out of the blue (and came on the show to talk about how they do it)</p><p>Cerebras - fastest LLM inference on wafer scale chips</p><p>Cerebras has introduced the concept of wafer scale chips to the world: if you imagine a typical microchip, it's maybe the size of a postage stamp; GPUs are bigger; well, Cerebras makes chips the size of an iPad (72 square inches), the largest commercial chips in the world. </p><p>And now they've built an inference stack on top of those chips and shown that they have the fastest inference in the world. How fast? Well, they can serve Llama 3.1 8B at a <strong>whopping 1822t/s</strong>. No really, these are INSANE speeds - as I was writing this, I copied all the words I had so far, went to <a target="_blank" href="https://inference.cerebras.ai">inference.cerebras.ai</a>, asked it to summarize, pasted, hit send, and I immediately got a summary! </p><p>"The really simple explanation is we basically store the entire model, whether it's 8B or 70B or 405B, entirely on the chip. There's no external memory, no HBM. We have 44 gigabytes of memory on chip." - James Wang</p><p>They not only store the whole model (405B coming soon), but they store it in full fp16 precision as well, so they don't quantize the models. Right now they are serving it with an 8K token context window, and we had a conversation about their next step being giving developers more context. </p><p>The whole conversation is well worth listening to. James and Ian were awesome to chat with, and while they do have a waitlist as they gradually roll out their release, James said to DM him on X and mention ThursdAI, and he'll put you through, so you'll be able to get an OpenAI compatible API key and be able to test this insane speed. 
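</p><p>James's on-chip memory quote is easy to sanity-check with back-of-envelope arithmetic. The sketch below is my own rough math, assuming fp16 (2 bytes per parameter) and the 44GB-per-chip figure from the quote, and ignoring KV cache and other overhead:</p>

```python
# Back-of-envelope: how many 44GB wafer-scale chips does an fp16 model need?
BYTES_PER_PARAM = 2   # fp16 -- Cerebras says they don't quantize
CHIP_MEMORY_GB = 44   # on-chip memory per wafer, per the quote

def chips_needed(params_billions):
    weight_gb = params_billions * BYTES_PER_PARAM  # 1B params ~= 2GB in fp16
    chips = -(-weight_gb // CHIP_MEMORY_GB)        # ceiling division
    return weight_gb, int(chips)

for p in (8, 70, 405):
    gb, chips = chips_needed(p)
    print(f"Llama 3.1 {p}B: ~{gb}GB of weights -> at least {chips} chip(s)")
# 8B fits on a single chip with room to spare (16GB < 44GB), which is
# exactly why there's no HBM round-trip slowing inference down.
```

<p>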
</p><p>P.S - we also did an independent verification of these speeds using Weave, and found Cerebras to be quite incredible for agentic purposes. You can read our report <a target="_blank" href="https://wandb.ai/capecape/benchmark_llama_70b/reports/Is-the-new-Cerebras-API-the-fastest-LLM-service-provider---Vmlldzo5MTQ4OTM2">here</a> and the Weave dashboard <a target="_blank" href="https://wandb.ai/capecape/benchmark_llama_70b/weave/compare-evaluations?evaluationCallIds=%5B%2201919475-b2cb-78f3-934d-639d9810b9c5%22%2C%2201918d86-097e-7901-93c2-78ed0eea4715%22%2C%2201918d84-1806-7152-9543-d7dc5d9df6dc%22%2C%2201918d82-5c7a-7392-9cec-6bec5391a84e%22%2C%2201918d80-2af4-7072-b637-2e42ac2d565a%22%5D">here</a></p><p>Anthropic - unlocking just-in-time applications with artifacts for all</p><p>Well, if you aren't paying for Claude, maybe this will convince you. This week, Anthropic announced that artifacts are available to all users, not only their paid customers. </p><p>Artifacts are a feature in Claude that is basically a side pane (and from this week, a drawer in their mobile apps) that lets you see what Claude is building, by rendering the web application almost on the fly. They have also trained Claude to work with that interface, so it knows about the different files, etc.</p><p>Effectively, this turns Claude into a web developer that will build mini web applications (without a backend) for you, on the fly, for any task you can think of. </p><p>Drop in a design and it'll build a mock of it, drop some data in a CSV and it'll build an interactive one-time dashboard visualizing that data, or just ask it to build an app helping you split the bill between friends by uploading a picture of a bill. 
</p><p>Artifacts are share-able and remixable, so you can build something and share it with friends. So <a target="_blank" href="https://claude.site/artifacts/0c5a9962-08be-4369-9d1b-36f84be79a33">here you go</a>: an artifact I made by dropping my notes into Claude and asking for a <a target="_blank" href="https://claude.site/artifacts/0c5a9962-08be-4369-9d1b-36f84be79a33">magic 8 Ball</a> that will spit out a random fact from today's editing of ThursdAI. I also provided Claude with an 8Ball image, but it didn't work due to restrictions, so instead I just uploaded that image to Claude and asked it to recreate it with SVG! And voila, a completely unnecessary app that works! </p><p>Google’s Gemini Keeps Climbing the Charts (But Will It Be Enough?)</p><p>Sensing a disturbance in the AI force (probably from that Cerebras bombshell), Google rolled out a series of Gemini updates, including a new experimental <strong>Gemini 1.5 Pro (0827)</strong> with sharper coding skills and logical reasoning. According to LMSys, it’s already nipping at the heels of ChatGPT 4o and is number 2!</p><p>Their <strong>Gemini 1.5 Flash</strong> model got a serious upgrade, vaulting to the #6 position on the arena. And to add to the model madness, they even released a Gemini Flash 8B parameter version for folks who need that sweet spot between speed and size.</p><p>Oh, and those long-awaited <strong>Gems</strong> are finally starting to roll out. But get ready to open your wallet – this feature (preloading Gemini with custom context and capabilities) is a <strong>paid-tier exclusive</strong>. But hey, at least Imagen-3 is cautiously returning to the image generation game! </p><p>AI Art & Diffusion</p><p>Doom Meets Stable Diffusion: AI Dreams in 20FPS Glory (<a target="_blank" href="https://gamengen.github.io/">GameNGen</a>)</p><p>The future of video games is, uh, definitely going to be interesting. 
Just as everyone thought AI would be conquering Go or Chess, it seems we've stumbled into a different battlefield: first-person shooters. 🤯</p><p>This week, researchers at DeepMind blew everyone's minds with their <strong>GameNGen</strong> research. What did they do? They trained <strong>Stable Diffusion</strong> 1.4 on Doom, and I'm not just talking about static images - I'm talking about <em>generating actual Doom gameplay in near real time</em>. Think 20FPS Doom running on nothing but the magic of AI. </p><p>The craziest part to me is this quote: "Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation" </p><p>FAL Drops the LORA Training Time Bomb (and I Get a New Meme!)</p><p>As you can see, I haven't yet cooled off from making custom AI generations with Flux and customizing them by training LORAs. Two weeks ago this used to take 45 minutes, a week ago 20 minutes, and now the wizards at FAL have created a new trainer that shrinks training times down to less than 5 minutes! </p><p>So, with Polaris Dawn, the first SpaceX commercial spacewalk, coming up, I trained a <a target="_blank" href="https://huggingface.co/altryne/spacex-astro-lora">SpaceX astronaut LORA</a> and then combined my face with it, and voila, here I am as a SpaceX astronaut! </p><p>BTW, because they are awesome, Jonathan and Simo (who is the magician behind this new trainer) came to the show, announced the new trainer, and also gave all listeners of ThursdAI a coupon to train a LORA effectively for free. Just use <a target="_blank" href="https://fal.ai?coupon=ThursdAI">this link</a> and start training! (btw I get nothing out of this, just trying to look out for my listeners!) 
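</p><p>As an aside, the reason LORA training can get this fast at all is that a LORA only learns a tiny low-rank delta on top of frozen weights. Here's a sketch of the generic LoRA idea in numpy - my own illustration of the math, not FAL's actual trainer - just to show the parameter-count savings:</p>

```python
import numpy as np

# LoRA: instead of finetuning a full weight matrix W (d_out x d_in),
# learn a low-rank delta B @ A with rank r << min(d_out, d_in).
d_out, d_in, r = 4096, 4096, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)  # frozen base
A = (rng.standard_normal((r, d_in)) * 0.01).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)  # zero init: delta starts at 0

def forward(x):
    return W @ x + B @ (A @ x)  # frozen path + trainable low-rank path

x = rng.standard_normal(d_in).astype(np.float32)
assert np.allclose(forward(x), W @ x)  # untrained LoRA is a no-op

full_params = d_out * d_in        # 16,777,216 to finetune the full matrix
lora_params = r * (d_in + d_out)  # 131,072 -> 128x fewer to train
print(f"full finetune: {full_params:,} params; LoRA rank {r}: {lora_params:,}")
```

<p>Training ~1% of the parameters (and only serving a small adapter file) is what makes minutes-long personalization runs like these economically sane.</p><p>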
</p><p>That's it for this week. Well, almost - <a target="_blank" href="http://magic.dev">magic.dev</a> announced a new funding round of <strong>$320 million</strong>, and that they have a <strong>100M token context window</strong> capable model and a coding product to go with it, but hadn't released it yet, just as we were wrapping up. Sam Altman tweeted that OpenAI now has over 200 million active users on ChatGPT and that OpenAI will collaborate with the AI Safety Institute. </p><p>Ok, now that's officially it! See you next week, when it's going to be 🍁 already, brrr</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-29-ai-plays-doom-cerebras</link><guid isPermaLink="false">substack:post:148282478</guid><dc:creator><![CDATA[Alex Volkov, James Wang, Ian Milton, and Jonathan Fischoff]]></dc:creator><pubDate>Fri, 30 Aug 2024 00:18:42 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148282478/1171005cfb343ee7f7da36245e21fdaf.mp3" length="68448509" type="audio/mpeg"/><itunes:author>Alex Volkov, James Wang, Ian Milton, and Jonathan Fischoff</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5704</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/148282478/4559f47390f3a3f8921dd6eb93a7b929.jpg"/></item><item><title><![CDATA[📅 AI21 Jamba 1.5, DIY Meme Faces, 8yo codes with AI and a Doomsday LLM Device?!]]></title><description><![CDATA[<p>Hey there, Alex here with an end-of-summer edition of our show, which did not disappoint. 
Today is the official anniversary of Stable Diffusion 1.4, can you believe it? </p><p>It's the second week in a row that we have an exclusive LLM launch on the show (after Emozilla announced Hermes 3 on <a target="_blank" href="https://sub.thursdai.news/p/thursdai-chatgpt-4o-back-on-top-nous">last week's show</a>), and spoiler alert, we may have something cooking for next week as well!</p><p>This edition of ThursdAI is brought to you by W&B <a target="_blank" href="https://wandb.ai/site/weave/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug22">Weave</a>, our LLM observability toolkit, letting you easily evaluate LLMs for your own use-case</p><p>Also this week, we've covered both ends of AI progress: a doomerist CEO saying "Fck Gen AI" vs an 8yo coder, and I continued to geek out on putting myself into memes (I promised I'll stop... at some point), so buckle up, let's take a look at another crazy week: </p><p>TL;DR</p><p>* <strong>Open Source LLMs</strong> </p><p>* AI21 releases Jamba 1.5 Large / Mini hybrid Mamba MoE (<a target="_blank" href="https://x.com/AI21Labs/status/1826614352948199754">X</a>, <a target="_blank" href="https://www.ai21.com/blog/announcing-jamba-model-family">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251">HF</a>)</p><p>* Microsoft Phi 3.5 - 3 new models including MoE (<a target="_blank" href="https://x.com/WeizhuChen/status/1825978852205801970">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/Phi-3.5-vision-instruct">HF</a>)</p><p>* BFCL 2 - Berkeley Function Calling Leaderboard V2 (<a target="_blank" href="https://x.com/shishirpatil_/status/1825577931697233999">X</a>, <a target="_blank" href="https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html">Blog</a>, <a target="_blank" href="https://gorilla.cs.berkeley.edu/leaderboard_live.html">Leaderboard</a>)</p><p>* NVIDIA - Mistral Nemo Minitron 8B - Distilled / Pruned from 
12B (<a target="_blank" href="https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base">HF</a>)</p><p>* Cohere paper proves - code improves intelligence (<a target="_blank" href="https://x.com/iScienceLuvr/status/1826084883535700004">X</a>, <a target="_blank" href="https://arxiv.org/abs/2408.10914">Paper</a>)</p><p>* MOHAWK - transformer → Mamba distillation method (<a target="_blank" href="https://x.com/kevinyli_/status/1825956447185940674">X</a>, <a target="_blank" href="https://t.co/kBoE4BlarP">Paper</a>, <a target="_blank" href="https://goombalab.github.io/blog/2024/distillation-part1-mohawk/">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Ideogram launches v2 - new img diffusion king 👑 + API (<a target="_blank" href="https://x.com/ideogram_ai/status/1826277550798278804">X</a>, <a target="_blank" href="https://about.ideogram.ai/2.0">Blog</a>, <a target="_blank" href="https://about.ideogram.ai/2.0">Try it</a>) </p><p>* Midjourney is now on web + free tier (<a target="_blank" href="http://midjourney.com">try it finally</a>)</p><p>* Flux keeps getting better, cheaper, faster + adoption from OSS (<a target="_blank" href="https://x.com/HBCoop_/status/1826640023216615608">X</a>, <a target="_blank" href="https://x.com/fofrAI/status/1826625973111882100">X</a>, <a target="_blank" href="https://x.com/isidentical/status/1826693064489890250">X</a>)</p><p>* Procreate hates generative AI (<a target="_blank" href="https://x.com/Procreate/status/1825311104584802470">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Grok 2 full is finally available on X - performs well on real time queries (<a target="_blank" href="https://x.com/altryne/status/1826033974902403544">X</a>)</p><p>* OpenAI adds GPT-4o Finetuning (<a target="_blank" href="https://openai.com/index/gpt-4o-fine-tuning/">blog</a>)</p><p>* Google API updates - 1000 pages PDFs + LOTS of free tokens (<a target="_blank" 
href="https://x.com/OfficialLoganK/thread/1825656369627935069">X</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Weights & Biases Judgement Day SF Hackathon on September 21-22 (<a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon">Sign up to hack</a>)</p><p>* <strong>Video</strong> </p><p>* Hotshot - new video model - trained by 4 guys (<a target="_blank" href="https://hotshot.co/">try it</a>, <a target="_blank" href="https://hotshot.co/release">technical deep dive</a>)</p><p>* Luma Dream Machine 1.5 (<a target="_blank" href="https://x.com/LumaLabsAI/status/1825639918539817101">X</a>, <a target="_blank" href="https://lumalabs.ai/dream-machine/creations">Try it</a>) </p><p>* <strong>Tools & Others</strong></p><p>* LM Studio 0.3.0 update - local RAG, structured outputs with any model & more (<a target="_blank" href="https://x.com/LMStudioAI/status/1826680869773357513">X</a>)</p><p>* Vercel - v0 now has chat (<a target="_blank" href="https://twitter.com/ido_pesok/status/1826326694615286231">X</a>)</p><p>* Ark - a completely offline device - offline LLM + world maps (<a target="_blank" href="https://x.com/adamcohenhillel/status/1825314267056443409">X</a>)</p><p>* Ricky's daughter coding with Cursor video is a must watch (<a target="_blank" href="https://x.com/rickyrobinett/status/1825581674870055189">video</a>)</p><p>The Best of the Best: Open Source Wins with Jamba, Phi 3.5, and Surprise Function Calling Heroes</p><p>We kick things off this week by focusing on what we love the most on ThursdAI: open-source models! 
We had a ton of incredible releases this week, starting off with something we were super lucky to have live: the official announcement of AI21's latest LLM, Jamba.</p><p>AI21 Officially Announces Jamba 1.5 Large/Mini – The Powerhouse Architecture Combines Transformer and Mamba </p><p>While we covered the original Jamba release on the show back <a target="_blank" href="https://sub.thursdai.news/i/143286226/jamba-deep-dive-with-roi-ai-and-maxime-labonne">in April</a>, Jamba 1.5 is an updated powerhouse. It's 2 models, Large and Mini, both MoE, and both still the hybrid Transformers + Mamba architecture that tries to get the best of both worlds. </p><p>Itay Dalmedigos, technical lead at AI21, joined us on the ThursdAI stage for an <strong>exclusive first look</strong>, giving us the full rundown on this <strong>developer-ready</strong> model with an awesome <strong>256K context window</strong> - but it's not just the size, it’s about <strong><em>using that size effectively</em></strong>. </p><p>AI21 measured the effective context use of their models on the new RULER benchmark released by NVIDIA, an iteration of the needle-in-a-haystack test, and showed that their models fully utilize the context, as opposed to many other models.</p><p>“As you mentioned, we’re able to pack many, many tokens on a single GPU. Uh, this is mostly due to the fact that we are able to quantize most of our parameters", Itay explained, diving into their secret sauce, <a target="_blank" href="https://github.com/vllm-project/vllm/pull/7415"><strong>ExpertsInt8</strong></a><strong>, a novel quantization technique</strong> specifically designed for MoE models. </p><p>Oh, and did we mention Jamba is <strong>multilingual (eight languages and counting), and natively supports structured JSON, function calling, and document digestion</strong>… basically everything developers dream of. 
They even chucked in <strong>citation generation</strong> - since its long context can contain full documents, your RAG app may not even need to chunk anything, and the citations can cite full documents!</p><p>Berkeley Function Calling Leaderboard V2: Updated + Live (<a target="_blank" href="https://gorilla.cs.berkeley.edu/leaderboard.html">link</a>)</p><p>Ever wondered how to <strong>measure the real-world magic</strong> of those models boasting <em>"I can call functions! I can do tool use! Look how cool I am!</em>" 😎? Enter the <strong>Berkeley Function Calling Leaderboard (BFCL) 2</strong>, a battleground where models clash to prove their function calling prowess.</p><p>Version 2 just dropped, and this ain't your average benchmark, folks. It's armed with a <em>"Live Dataset"</em> - a dynamic, <strong>user-contributed</strong> treasure trove of <em>real-world queries</em>, rare function documentations, and specialized use-cases spanning multiple languages. Translation: <strong>no more biased, contaminated datasets</strong>. BFCL 2 is as close to the real world as it gets.</p><p>So, who’s sitting on the Function Calling throne this week? Our old friend <strong>Claude 3.5 Sonnet, with an impressive score of 73.61</strong>. But breathing down its neck is <strong>GPT-4-0613</strong> (the OG function calling master) with 73.5. That's right, the one released a year ago, the first one with function calling - in fact, the first LLM with function calling as a concept, IIRC!</p><p>Now, prepare for the REAL plot twist. The <strong>top-performing open-source model</strong> <em>isn’t</em> some big-name, resource-heavy behemoth. It’s a <em>tiny little underdog</em> called <a target="_blank" href="https://huggingface.co/meetkai/functionary-medium-v3.1"><strong>Functionary Medium 3.1</strong></a>, a finetuned version of Llama 3.1 that blew everyone away. 
It even <em>outscored</em> both versions of Claude 3 Opus AND GPT-4 - leaving folks scrambling to figure out WHO created this masterpiece.</p><p>“I’ve never heard of this model. It's MIT licensed, from an organization called MeetKai. Have you guys heard about Functionary Medium?” I asked, echoing the collective bafflement in the space. Yep, turns out there’s gold hidden in the vast landscape of open source models, just waiting to be unearthed ⛏️.</p><p>Microsoft updates Phi 3.5 - 3 new models including an MoE + MIT license</p><p>3 new Phis dropped this week, including an MoE one and a revamped vision one. They look very decent on benchmarks yet again, with the mini version (3.8B) seemingly beating Llama 3.1 8B on a few of them.</p><p>However, as before, the excitement is met with caution, because Phi models seem great on benchmarks, but when actually talking with them, folks are usually not as impressed. </p><p>Terry from BigCodeBench also saw a <a target="_blank" href="https://twitter.com/terryyuezhuo/status/1826016258552447132">significant decrease</a> in coding ability for Phi 3.5 vs 3.1 </p><p>Of course, we're not complaining - the models were released with 128K context and an MIT license. </p><p>The thing I'm most excited about is the vision model update: it now has "multi-frame image understanding and reasoning", which is a big deal! This means understanding videos more natively, across scenes. </p><p>This week's Buzz</p><p>Hey, if you're reading this while sitting in the Bay Area, and you don't have plans for exactly a month from now, why don't you come and hack with me? (<a target="_blank" href="https://wandb.ai/site/resources/events/judgment-day-hackathon?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug22">Register Free</a>)</p><p>Announcing the first W&B hackathon, <strong>Judgement Day</strong>, which is going to be focused on LLM as a judge! 
Come hack on innovative LLM-as-a-judge ideas, UIs, evals and more, meet other like-minded hackers and AI engineers, and win great prizes! </p><p>🎨 AI Art: Ideogram Crowns Itself King, Midjourney Joins the Internet & FLUX everywhere</p><p>While there was little news from big LLM labs this week, there is a LOT of AI art news, which is fitting, as we celebrate the 2-year anniversary of Stable Diffusion 1.4! </p><p>👑 Ideogram v2: Text Wizardry and API Access (But No Loras… Yet?)</p><p>With significantly improved realism, and likely the best text generation across all models out there, Ideogram v2 just took over the AI image generation game! Just look at that text sharpness! </p><p>They now offer a selection of styles (Realistic, Design, 3D, Anime) and any aspect ratio you'd like, and brands can now provide color palettes to control the outputs! </p><p>Adding to this is a new API offering (.8c per image for the main model, .5c for the new turbo model of v2!) and a new iOS app. They also added the option (for premium users only) to search through a billion generations and their prompts, which is a great offering as well, as sometimes you don't even know what to prompt. </p><p>They claim a significant improvement over Flux[pro] and Dalle-3 in text, alignment and overall; interesting that MJ was not compared! </p><p>Meanwhile, Midjourney finally launched a <a target="_blank" href="https://midjourney.com">website</a> and a free tier, so you no longer have to learn to use Discord to even try Midjourney. </p><p>Meanwhile, Flux enjoys the fruits of Open Source</p><p>While Ideogram and MJ fight it out on the closed-source side, Black Forest Labs enjoys the fruits of releasing their weights in the open. 
</p><p>Fal just released an <a target="_blank" href="https://x.com/isidentical/status/1826693064489890250">update</a> that makes LORAs run 2.5x faster and 2.5x cheaper, <a target="_blank" href="https://civitai.com/models">CivitAI</a> has LORAs for pretty much every character and celebrity ported to FLUX already, different techniques like <a target="_blank" href="https://huggingface.co/InstantX/FLUX.1-dev-Controlnet-Union">ControlNet</a> Unions, IPAdapters and more are being trained as we speak, and tutorials upon tutorials are being released on how to customize these models, <a target="_blank" href="https://www.youtube.com/watch?v=_rjto4ix3rA&#38;t=923s">for free</a> (shoutout to my friend Matt Wolfe for this one)</p><p>You can now train your own face on <a target="_blank" href="http://fal.ai">fal.ai</a>, <a target="_blank" href="http://replicate.com">replicate.com</a> and <a target="_blank" href="http://astria.ai">astria.ai</a>, and thanks to astria, I was able to find some old generations of my LORAs from the 1.5 days (not quite 1.4, but still, enough to show the difference between then and now) and whoa. </p><p>🤔 Is This AI Tool Necessary, Bro?</p><p>Let’s end with a topic that stirred up a hornet's nest of opinions this week: <strong>Procreate, a beloved iPad design app, publicly declared their “f*ing hate” for Generative AI</strong>. </p><p>Yeah, you read that right. <em>Hate</em>. The CEO, in a <a target="_blank" href="https://x.com/Procreate/status/1825311104584802470">public statement</a>, went FULL scorched earth - proclaiming that AI-powered features would <em>never</em> sully the pristine code of their precious app.</p><p>“Instead of trying to bridge the gap, he’s creating more walls", Wolfram commented, echoing the general “dude… what?” vibe in the space. 
“It feels marketeerial”, I added, pointing out the obvious PR play (while simultaneously acknowledging the very REAL, very LOUD segment of the Procreate community that cheered this decision).</p><p>Here’s the thing: you can hate the tech. You can lament the potential demise of the human creative spark. You can rail against the looming AI overlords. But one thing’s undeniable: this tech isn't going <em>anywhere</em>.</p><p>Meanwhile, 8-year-old coders lean fully into AI</p><p>As a contrast to this doomerist take, just watch <a target="_blank" href="https://x.com/rickyrobinett/status/1825581674870055189">this</a> video of <strong>Ricky Robinett's eight-year-old daughter building a Harry Potter website in 45 minutes</strong>, <em>using nothing but a chat interface in Cursor</em>. No coding knowledge. No prior experience. Just prompts and the power of AI ✨.</p><p>THAT’s where we’re headed, folks. It might be terrifying. It might be inspiring. But it’s DEFINITELY happening. Better to understand it, engage with it, and <em>maybe</em> try to nudge it in a positive direction, than to bury your head in the sand and mutter “I bleeping hate this progress” like a cranky, Luddite hermit. Just sayin' 🤷‍♀️.</p><p>An AI Device to reboot civilization (if needed)</p><p>I was scrolling through my feed (as I do VERY often, to bring you this every week) and I saw this and super quickly decided to invite the author to the show to talk about it.</p><p><a target="_blank" href="https://substack.com/profile/109320861-adam-cohen-hillel">Adam Cohen Hillel</a> has prototyped an AI hardware device, but this one isn't trying to record you or be your friend; no, this one comes with offline LLMs finetuned with health and bio information, survival tactics, and all of the world's maps, and works completely offline! 
</p><p>This, to me, was a very exciting use for an LLM: a distilled version of all human knowledge, buried in a Faraday cage, with replaceable batteries, that runs on solar and can help you survive in case something bad happens - like really bad (think a solar flare that takes out the electrical grid, or an EMP device). While improbable, I thought this was a great idea and had a nice chat with the creator; you should definitely give this one a listen, and if you want to buy one, he is going to sell them soon <a target="_blank" href="https://www.privsov.com/products/ark">here</a></p><p>That's it for this week! There have been a few updates from the big labs: OpenAI has opened fine-tuning for GPT-4o (and you can use your WandB API key in there to track those runs, which is cool), the Gemini API now accepts incredibly large PDF files (up to 1000 pages), and Grok 2 is finally on X (not the mini from last week) </p><p>See you next week (we will have another deep dive!) </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/ai21-jamba-15-diy-meme-faces-8yo</link><guid isPermaLink="false">substack:post:148017288</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 22 Aug 2024 22:00:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/148017288/be8c5b701e3f48b32771379795ae80d4.mp3" length="73187324" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6099</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/148017288/a2e35c96188ecf837833be8376b5def3.jpg"/></item><item><title><![CDATA[📅 ThursdAI - ChatGPT-4o back on top, Nous Hermes 3 LLama finetune, XAI uncensored Grok2, Anthropic LLM caching & more AI news from another banger week]]></title><description><![CDATA[<p>Look, these crazy weeks don't seem to stop, and though this week started out a bit slower (while folks were waiting to see how the speculation about certain red berry flavored conspiracies is shaking out), the big labs are shipping! 
</p><p>Oh, and for the second week in a row, ThursdAI live spaces were listened to <strong>by over 4K people</strong>, which is very humbling and awesome, because, for example, today Nous Research announced Hermes 3 live on ThursdAI before the public heard about it (and I had a long chat w/ Emozilla about it, very well worth listening to)</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* xAI releases GROK-2 - frontier level Grok, uncensored + image gen with Flux (<a target="_blank" href="https://x.com/ebbyamir/status/1823602128872726727">𝕏</a>, <a target="_blank" href="https://x.ai/blog/grok-2">Blog</a>, <a target="_blank" href="https://x.com/i/grok">Try It</a>)</p><p>* OpenAI releases another ChatGPT-4o (and tops LMsys again) (<a target="_blank" href="https://x.com/ChatGPTapp/status/1823109016223957387">X</a>, <a target="_blank" href="https://help.openai.com/en/articles/9624314-model-release-notes">Blog</a>)</p><p>* Google showcases Gemini Live, Pixel Buds w/ Gemini, Google Assistant upgrades ( <a target="_blank" href="https://blog.google/products/platforms-devices/made-by-google-2024-collection/?utm_source=tw&#38;utm_medium=social&#38;utm_campaign=mbg24&#38;utm_content=&#38;utm_term=">Blog</a>)</p><p>* Anthropic adds Prompt Caching in Beta - cutting costs by up to 90% (<a target="_blank" href="https://x.com/alexalbert__/status/1823751966893465630">X</a>, <a target="_blank" href="https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Flux now has support for LORAs, ControlNet, img2img (<a target="_blank" href="https://fal.ai/models/fal-ai/flux-lora-general-training?a=1">Fal</a>, <a target="_blank" href="https://replicate.com/lucataco/flux-dev-lora?prediction=jeek8wfz99rj20ch9z1s6py17w">Replicate</a>)</p><p>* Google Imagen-3 is out of secret preview and it looks very good (<a target="_blank" 
href="https://x.com/altryne/status/1821960738933764347/photo/1">𝕏</a>, <a target="_blank" href="https://x.com/_akhaliq/status/1823539204086751477">Paper</a>, <a target="_blank" href="https://aitestkitchen.withgoogle.com/tools/image-fx">Try It</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* Using Weights & Biases <a target="_blank" href="http://wandb.me/weave?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug14">Weave</a> to evaluate Claude Prompt Caching (<a target="_blank" href="https://x.com/altryne/status/1823878392141267441">X</a>, <a target="_blank" href="https://github.com/altryne/compare-claude-caching/">Github</a>, <a target="_blank" href="https://wandb.ai/wandb/compare-claude-caching/weave/compare-evaluations?evaluationCallIds=%5B%22019152dd-c8a4-7ea0-b56f-821225d18917%22%2C%22019152da-17ac-7741-a964-6eb6a3adf3e7%22%2C%22019152d7-0c57-77f0-a38d-9f026a799ac4%22%2C%22019152d4-86a3-7530-bb27-fa8413277e3c%22%5D">Weave Dash</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* NousResearch drops Hermes 3 - 405B, 70B, 8B LLama 3.1 finetunes (<a target="_blank" href="https://x.com/NousResearch/status/1824131520375951454">X</a>, <a target="_blank" href="https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf">Blog</a>, <a target="_blank" href="https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf">Paper</a>)</p><p>* NVIDIA Llama-3.1-Minitron 4B (<a target="_blank" href="https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/nvidia/minitron-669ac727dc9c86e6ab7f0f3e">HF</a>)</p><p>* AnswerAI - colbert-small-v1<strong> </strong> (<a target="_blank" href="https://www.answer.ai/posts/2024-08-13-small-but-mighty-colbert.html">Blog</a>, <a target="_blank" href="https://huggingface.co/answerdotai/answerai-colbert-small-v1">HF</a>)</p><p>* 
<strong>Vision & Video</strong></p><p>* Runway Gen-3 Turbo is now available (<a target="_blank" href="https://app.runwayml.com/video-tools/">Try It</a>)</p><p>Big Companies & LLM APIs</p><p>Grok 2: Real Time Information, Uncensored as Hell, and… Flux?!</p><p>The team at xAI definitely knows how to make a statement, dropping a knowledge bomb on us with the release of <strong>Grok 2</strong>. This isn't your uncle's dad joke model anymore - Grok 2 is a legitimate frontier model, folks.</p><p>As Matt Shumer excitedly put it:</p><p> “If this model is this good with less than a year of work, the trajectory they’re on, it seems like they will be far above this...very very soon” 🚀</p><p>Not only does Grok 2 have impressive scores on MMLU (beating the previous GPT-4o on their benchmarks… from MAY 2024), it even outperforms Llama 3 405B, proving that xAI isn't messing around.</p><p>But here's where things get <strong><em>really</em></strong><strong> interesting</strong>. Not only does this model access real-time data through Twitter, which is a MOAT so wide you could probably park a rocket in it, <strong>it's also VERY uncensored.</strong> Think generating political content that'd make your grandma clutch her pearls, or imagining Disney characters breaking bad in a way that’s both hilarious and <em>kinda disturbing</em>, all thanks to Grok 2’s integration with <strong>Black Forest Labs' Flux image generation model</strong>. </p><p>With an affordable price point ($8/month for X Premium, including access to Grok 2 and their killer MidJourney competitor?!), it’ll be interesting to see how Grok’s "truth seeking" (as xAI calls it) model plays out. Buckle up, folks, this is going to be <em>wild</em>, especially since all the normies now have the power to create political memes that look VERY realistic, within seconds. 
</p><p>Oh yeah… and there’s the upcoming <strong>Enterprise API</strong> as well… and Grok 2 made its debut in the wild on the LMSys Arena, lurking incognito as <strong>"sus-column-r"</strong>, where it now sits on TOP of Sonnet 3.5 and comes in at number 5 overall!</p><p>OpenAI's latest ChatGPT is back at #1, but it's all very confusing 😵‍💫</p><p>As the news about Grok-2 was settling in, OpenAI decided to, well… drop yet another GPT-4o update on us. <strong><em>While Google was hosting their event, no less</em></strong><strong>.</strong> Seriously, OpenAI? I guess they like to one-up Google's new releases (they also kicked Gemini off the #1 position after only 1 week there)</p><p>So the model that was anonymous-chatbot on LMSys for the past week was also released in the ChatGPT interface, and is now the best LLM in the world according to LMSYS and other folks: it's #1 at math, #1 at complex prompts, #1 at coding, and #1 overall. </p><p>It is also available for us developers via API, but... they don't recommend using it? 🤔 </p><p>The most interesting thing about this release is that they don't really know how to tell us why it's better; they just know that it is, qualitatively, and that it's not a new frontier-class model (i.e., not 🍓 or GPT-5) </p><p>Their release notes on this are something else 👇 </p><p>Meanwhile, it's been 3 months, and the promised Advanced Voice Mode is only in the hands of a few lucky testers so far. 
</p><p>Anthropic Releases Prompt Caching to Slash API Prices By up to 90%</p><p>Anthropic joined DeepSeek's game of "Let's Give Devs Affordable Intelligence" this week, rolling out prompt caching with <em>up to 90%</em> cost reduction on cached tokens (yes, NINETY… 🤯). For those of you new to all this technical sorcery:</p><p>Prompt Caching allows the inference provider to save users money by reusing repeated chunks of a long prompt from cache, reducing price and time-to-first-token. It is especially beneficial for long-context (>100K) use-cases like conversations with books, agents with a lot of memory, 1000 examples in a prompt, etc.</p><p>We covered caching before with Gemini (at Google I/O) and <a target="_blank" href="https://sub.thursdai.news/p/thursdai-aug8-qwen2-math-king-tiny">last week</a> with DeepSeek, but IMO this is a better implementation from a frontier lab: it's easy to get started with, it manages the timeout for you (unlike Google), and it's a no-brainer to implement. </p><p>And, you'll <em>definitely</em> want to see the code to implement it all <em>yourself</em> (plus Weave is free! 🤩):</p><p>"In this week's buzz category… I used Weave, our LLM observability tooling, to <em>super quickly</em> evaluate how much cheaper Claude Caching from Anthropic <em>really is</em>, I did a video of it and I posted the code … If you're into this and want to see how to <em>actually</em> do this … how to evaluate, <em>the code is there for you</em>" - <em>Alex</em></p><p>With the ridiculous 90% price drop for those cached calls, Haiku basically becomes FREE, and cached Claude costs about as much as Haiku ($0.30 per 1M tokens). 
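To make the mechanics concrete, here's a minimal sketch of what a cached request looks like, assuming the Anthropic Messages API's prompt-caching beta (the `cache_control` marker is the documented mechanism as I recall it; the model name, helper function, and transcript contents below are illustrative, not from the episode):

```python
# Sketch of the request shape for Anthropic's prompt caching beta.
# No network call is made here; we just build the payload to show where
# the `cache_control` marker goes: on the large, repeated block of content.

def build_cached_request(transcripts: list[str], question: str) -> dict:
    """Build a messages payload where the big transcript blob is cached."""
    return {
        "model": "claude-3-5-sonnet-20240620",  # illustrative model name
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You answer questions about podcast transcripts."},
            {
                "type": "text",
                "text": "\n\n".join(transcripts),  # the big (~110K-token) blob
                # This marker tells the API to cache the prompt prefix up to
                # here, so repeat calls pay a fraction of the input price.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }

payload = build_cached_request(
    ["transcript 1...", "transcript 2..."],
    "What was discussed across these episodes?",
)
```

With the SDK you'd pass these fields to `client.messages.create(...)` (the beta required an `anthropic-beta: prompt-caching-2024-07-31` header at launch, if memory serves); the first call writes the cache at a small premium, and subsequent calls within the cache window read it at roughly a tenth of the normal input price.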
For context, I took 5 transcripts of 2-hour podcast conversations, which amounted to ~110,000 tokens overall, and I was able to ask questions across all this text, and it cost me less than $1 (see the above video) </p><p>Code <a target="_blank" href="https://github.com/altryne/compare-claude-caching">Here</a> + Weave evaluation Dashboard <a target="_blank" href="https://wandb.ai/wandb/compare-claude-caching/weave/compare-evaluations?evaluationCallIds=%5B%22019152dd-c8a4-7ea0-b56f-821225d18917%22%2C%22019152da-17ac-7741-a964-6eb6a3adf3e7%22%2C%22019152d7-0c57-77f0-a38d-9f026a799ac4%22%2C%22019152d4-86a3-7530-bb27-fa8413277e3c%22%5D">here</a></p><p>AI Art, Diffusion, and Personalized AI On the Fly</p><p>Speaking of mind-blowing: Flux took over this week, thanks in no small part to Elon <em>strategically</em> leveraging their tech in Grok (and everyone reminding everyone else that it's not Grok creating images, it's Flux!)</p><p>Now, remember, the REAL magic happens when code meets open source. “Flux now has support for LORAs, ControlNet, img2img…", meaning developers have turned those foundational tools into <em>artistic wizardry.</em> With as little as $5 and a few pictures, “You can train the best image model <em>on your own face</em>.” 🤯 (Seriously folks, head over to <a target="_blank" href="https://fal.ai"><em>Fal.ai</em></a> and give it a whirl… it’s awesome)</p><p>Now, if you combine the LORA tech with ControlNet tech, you can get VERY creepy very fast (I'm using my own face here, but you get the idea). Here's "me" as the distracted boyfriend meme, and the girlfriend, and the distraction 😂 (I'm sorry you had to see this, AI has gone too far! 
Shut it all down!)</p><p>If seeing those creepy faces on screen isn't for you (I <em>totally</em> get that), there’s also <strong>Google IMAGEN 3</strong>, freshly escaped from secret preview and <em>just waiting for you to unleash those artistic prompts on it!</em> Google, despite being… <em>Google</em>, somehow figured out that <em>a little competition does a lab good</em> and rolled out a model that’s <em>seriously impressive</em>.</p><p>Runway Video Gets a "Turbocharged" Upgrade 🚀🚀🚀</p><p>Ever tried those jaw-dropping text-to-video generators, but groaned as you watched those seconds of video render painfully slowly? 😭 <strong>Well, Runway, creators of Gen 3, answered our prayers with a distilled, turbocharged version that churns out those visuals </strong><strong><em>in a blink</em></strong><strong> 🤯🤯🤯.</strong></p><p>What's <em>truly</em> cool is they unlocked it for FREE-tier users (sign up and unleash those cinematic prompts <em>right now</em>!), letting everyday folks dip their toes in those previously-unfathomable waters. Even the skeptics had to acknowledge that OpenBMB's efforts with MiniCPM-V (Junyang knows what I'm talking about…) are impressive, especially the smooth way it captures video sequences <em>better than models even twice its size</em> 🤯.</p><p>Open Source: Hermes 3 and The Next Generation of Open AI 🚀</p><p>NousResearch Dropped Hermes 3: Your New Favorite AI (Yes, Really)</p><p>In the ultimate “We Dropped This On ThursdAI Before Even<em> HuggingFace</em>” move, the legendary team at NousResearch dropped the hottest news since Qwen decided to play math God: <strong>Hermes 3</strong> is officially here! 
🤯</p><p>“You’re about to get to use the FIRST big finetune of Llama 3.1 405B… We don’t think there have been finetunes,” announced Emozilla, who’s both co-founder <em>and</em> resident master wizard of all things neural net. “And it's available to try for free thanks to Lambda, you can try it out right <a target="_blank" href="https://lambda.chat/chatui/">here</a>” (you’re all racing to their site as I type this, I KNOW it!). </p><p>Not ONLY does this beauty run ridiculously smoothly on Lambda, but here’s the real TL;DR:</p><p>* Hermes 3 isn’t <em>just</em> 405B; there are 70B and 8B versions dropping simultaneously on Hugging Face, ready to crush benchmarks and melt your VRAM (in a GOOD way… okay, maybe not so great for your power bill 😅).</p><p>* On benchmarks, they beat Llama 3.1 Instruct on a few evals and lose on some, which is quite decent, given that the Meta team did an amazing job with their instruct finetuning (and probably spent millions of $ on it too)</p><p>* <strong>Hermes 3 is all about user alignment, which our open source champion Wolfram Ravenwolf summarized beautifully: “When you have a model, and you run it on your system, </strong><strong><em>IT MUST BE LOYAL TO YOU</em></strong><strong>.”</strong> 😈</p><p>Hermes 3 does just that with incredibly precise control via its <em>godlike</em> system prompt: “In Hermes 3 <em>the system prompt is KING</em>,” confirmed Emoz. 
It’s <em>so</em> powerful that the 405B version was practically suffering existential angst in its first conversation… I read that part out loud during the space, but here you go: this is their first conversation, and Emozilla goes into why they think this happened in our chat, which is very much worth listening to.</p><p>This model was trained on a bunch of data sources that they will release in the future, and includes tool use, plus a slew of special tokens you can add in the system prompt that trigger abilities in this model: chain of thought, a scratchpad (think, and then rethink), citing from sources for RAG purposes, and a BUNCH more. </p><p>The technical report is <a target="_blank" href="https://nousresearch.com/wp-content/uploads/2024/08/Hermes-3-Technical-Report.pdf">HERE</a> and is worth diving into, as is our full conversation with Emozilla on the pod. </p><p>Wrapping Things Up… But We’re <em>Just Getting Started!</em> 😈</p><p>I know, I KNOW, <em>your brain is already overflowing</em>, but we barely SCRATCHED the surface…</p><p>We also dove into NVIDIA's research into new pruning and distilling techniques, TII Falcon’s attempt at making those State Space models <em>finally</em> challenge the seemingly almighty Transformer architecture (it's getting <em>closer</em>... but has a way to go!), plus <strong>AnswerAI's deceptively tiny Colbert-Small-V1</strong>, achieving remarkable search accuracy despite its featherweight size, and a bunch more...</p><p>See you all next week for what’s bound to be yet another wild AI news bonanza… Get those download speeds prepped, <em>we’re in for a wild ride</em>. 🔥</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-chatgpt-4o-back-on-top-nous</link><guid isPermaLink="false">substack:post:147758969</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 15 Aug 2024 21:35:03 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147758969/dc12b93f060b056f0388594fef832312.mp3" length="88189151" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>7349</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/147758969/8ff67dc1a8ecfa7b170a396fc7b94740.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Aug8 - Qwen2-MATH King, tiny OSS VLM beats GPT-4V, everyone slashes prices + 🍓 flavored OAI conspiracy]]></title><description><![CDATA[<p>Hold on tight, folks, because THIS week on ThursdAI felt like riding a roller coaster through the wild world of open-source AI - extreme highs, mind-bending twists, and a sprinkle of "wtf is happening?" conspiracy theories for good measure. 😂 </p><p>The theme of this week: Open Source keeps beating GPT-4, while we're inching towards intelligence too cheap to meter on the API front. </p><p>We even had a live demo so epic, folks at the <em>Large Hadron Collider</em> are taking notice! Plus, strawberry shenanigans abound (did Sam REALLY tease GPT-5?), and your favorite AI evangelist nearly got canceled on X! Buckle up; this is gonna be another long one! 🚀</p><p>Qwen2-Math Drops a KNOWLEDGE BOMB: Open Source Wins AGAIN!</p><p>When I say "open source AI is unstoppable", I MEAN IT. This week, the brilliant minds from Alibaba's Qwen team decided to show everyone how it's DONE. 
Say hello to <strong>Qwen2-Math-72B-Instruct</strong> - a specialized language model SO GOOD at math, it's achieving a ridiculous 84 points on the MATH benchmark. 🤯</p><p>For context, folks... that's beating GPT-4, Claude Sonnet 3.5, <em>and</em> Gemini 1.5 Pro. We're not talking incremental improvements here - this is a full-blown DOMINANCE of the field, and you can download and use it <em>right now</em>. 🔥</p><p><a target="_blank" href="https://huggingface.co/collections/Qwen/qwen2-math-66b4c9e072eda65b5ec7534d">Get Qwen-2 Math from HuggingFace here</a></p><p>What made this announcement EXTRA special was that <a target="_blank" href="https://x.com/JustinLin610">Junyang Lin</a>, the Chief Evangelist Officer at the Alibaba Qwen team, joined ThursdAI <strong>moments</strong> after they released it, giving us a behind-the-scenes peek at the effort involved. Talk about being in the RIGHT place at the RIGHT time! 😂</p><p>They painstakingly crafted a massive, math-specific training dataset, incorporating techniques like Chain-of-Thought reasoning (where the model thinks step-by-step) to unlock this insane level of mathematical intelligence.</p><p><strong>"We have constructed a lot of data with the form of ... Chain of Thought ... And we find that it's actually very effective. And for the post-training, we have done a lot with rejection sampling to create a lot of data sets, so the model can learn how to generate the correct answers"</strong> - <em>Junyang Lin</em></p><p>Now, I gotta give mad props to Qwen for going beyond just raw performance - they're open-sourcing this beast under an Apache 2.0 license, meaning you're FREE to use it, fine-tune it, and adapt it to your wildest mathematical needs! 🎉</p><p>But hold on... the awesomeness doesn't stop there! Remember those smaller, resource-friendly LLMs everyone's obsessed with these days? Well, Qwen released 7B and <em>even</em> 1.5B versions of Qwen-2 Math, achieving jaw-dropping scores for their size (70 for the 1.5B?? 
That's unheard of!).🤯 Nisten nearly lost his mind when he heard that - and trust me, <em>he's seen things</em>. 😂</p><p><strong>"This is insane! This is... what, Sonnet 3.5 gets what, 71? 72? This gets 70? And it's a 1.5B? Like I could run that on someone's watch. Real."</strong> - <em>Nisten</em></p><p>With this level of efficiency, we're talking about AI-powered calculators, tutoring apps, research tools that run smoothly on everyday devices. The potential applications are endless!</p><p>MiniCPM-V 2.6: A Pocket-Sized GPT-4 Vision... Seriously! 🤯</p><p>If Qwen's Math marvel wasn't enough open-source goodness for ya, OpenBMB had to get in on the fun too! This time, they're bringing the 🔥 to vision with <strong>MiniCPM-V 2.6</strong> - a ridiculous 8 billion parameter VLM (visual language model) that packs a serious punch, even outperforming GPT-4 Vision on OCR benchmarks!</p><p><a target="_blank" href="https://x.com/OpenBMB/status/1820798828251103234">OpenBMB drops a bomb on X here</a></p><p>I'll say this straight up: talking about vision models in a TEXT-based post is hard. You gotta SEE it to believe it. But folks... TRUST ME on this one. This model is <em>mind-blowing</em>, capable of analyzing single images, multi-image sequences, and EVEN VIDEOS with an accuracy that rivaled my wildest hopes for open-source.🤯</p><p><a target="_blank" href="https://huggingface.co/spaces/openbmb/MiniCPM-V-2_6">Check out their playground and prepare to be stunned</a></p><p>It even captured every single nuance in this viral toddler speed-running video I threw at it, with an accuracy I haven't seen in models THIS small:</p><p><em>"The video captures a young child's journey through an outdoor park setting. Initially, the child ... is seen sitting on a curved stone pathway besides a fountain, dressed in ... a green t-shirt and dark pants. 
As the video progresses, the child stands up and begins to walk ..."</em></p><p>Junyang said that they actually collabbed with the OpenBMB team, and he knows firsthand how much effort went into training this model:</p><p><strong>"We actually have some collaborations with OpenBMB... it's very impressive that they are using, yeah, multi-images and video. And very impressive results. You can check the demo... the performance... We care a lot about MMMU [the benchmark], but... it is actually relying much on large language models."</strong> - <em>Junyang Lin</em></p><p>Nisten and I have been talking for months about the relationship between these visual "brains" and the larger language model base powering their "thinking." While it seems smaller models are catching up fast, combining a top-notch visual processor like MiniCPM-V with a monster LLM like Qwen 72B or Llama 405B could unlock truly unreal capabilities.</p><p>This is why I'm excited - open source lets us mix and match like this! We can Frankenstein the best parts together and see what emerges... and it's usually something <em>mind-blowing</em>. 🤯</p><p>From the Large Hadron Collider to YOUR Phone: This Model Runs ANYWHERE 🚀</p><p>While Qwen2-Math is breaking records on one hand, Nisten's latest creation, <strong>Biggie-SmoLlm</strong>, is showcasing the opposite end of the spectrum. Trying to get the smallest/fastest coherent LLM possible, Nisten blew up on HuggingFace.</p><p>Biggie-SmoLlm (<a target="_blank" href="https://huggingface.co/nisten/Biggie-SmoLlm-0.15B-Base">Hugging Face</a>) is TINY, efficient, and with some incredible optimization work from the folks <em>right here on the show</em>, it's reaching an insane 330 tokens/second on regular M3 chips. 
🤯 That's <strong>WAY</strong> <strong>faster than real-time conversation, folks!</strong> And thanks to Eric Hartford's (from Cognitive Computation) awesome new optimizer (<a target="_blank" href="https://github.com/cognitivecomputations/grokadamw">Grok AdamW</a>), it's surprisingly coherent for such a lil' fella.</p><p>The cherry on top? Someone messaged Nisten saying they're using Biggie-SmoLlm at the <strong>Large. Hadron. Collider.</strong> 😳 I'll let <em>that</em> sink in for a second.</p><p>It was <em>incredible</em> having ALL the key players behind Biggie-SmoLlm right there on stage: LDJ (whose Capybara dataset made it teaching-friendly), Junyang (whose Qwen work served as the base), and Eric (the optimizer mastermind himself). THIS, my friends, is what the ThursdAI community is ALL about! 🚀</p><p>Speaking of which, this week we got a new friend of the pod, Mark Saroufim, a long-time PyTorch core maintainer, joining the community. </p><p>This Week's Buzz (and Yes, It Involves Making AI <em>Even Smarter</em>) 🤓</p><p>NeurIPS Hacker Cup 2024 - Can You Solve Problems <em>Humans</em> Struggle With? 🤔</p><p>I've gotta hand it to my PyTorch friend, Mark Saroufim. He knows how to make AI <em>interesting</em>! He and his incredible crew (Weiwei from MSFT, some WandB brainiacs, and more) are bringing you NeurIPS Hacker Cup 2024 - a competition to push those coding agents to their ABSOLUTE limits. 🚀</p><p>This isn't your typical "LeetCode easy" challenge, folks... These are problems SO hard, years of competitive programming experience are required to even <em>attempt</em> them! Mark himself said, </p><p><em>“At this point, like, if a model does make a significant dent in this competition, uh, I think people would need to acknowledge that, like, LLMs can do a form of planning. 
”</em></p><p>And don't worry, total beginners: Mark and Weights & Biases are hosting a <strong>series of</strong><a target="_blank" href="http://wandb.me/hackercup?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug8"><strong> FREE sessions</strong></a><strong> to level you up.</strong> Get those brain cells prepped and ready for the challenge and then <a target="_blank" href="http://wandb.me/hackercup?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug8">Join the NeurIPS Hacker Cup Discord</a></p><p>P.S. We're ALSO starting a killer AI Salon series in our SF office August 15th! You'll get a chance to chat with researchers like Shreya Shankar - she's a leading voice on evaluation. More details and free tickets right here! <a target="_blank" href="https://wandb.ai/site/resources/events/genai-salon-with-shreya">AI Salons Link</a></p><p>Big Co & APIs - Towards intelligence too cheap to meter</p><p>Open-source was crushing it this week... but that didn't stop Big AI from throwing a few curveballs. OpenAI is doubling down on structured data (AND cheaper models!), Google slashed Gemini prices <em>again</em> (as we trend towards intelligence too cheap to meter), and a certain strawberry mystery took over Twitter.</p><p>DeepSeek context caching lowers prices by 90% automatically</p><p>DeepSeek, those masters of ridiculously-good coding AI, casually dropped a bombshell - context caching for their API! 🤯</p><p>If you're like "<em>wait, what does THAT mean?</em>", listen up because this is game-changing for production-grade AI:</p><p>* <strong>Problem:</strong> LLMs get fed the ENTIRE conversation history EVERY. SINGLE. TIME. This wastes compute (and $$$) when info is repeated.</p><p>* <strong>Solution:</strong> DeepSeek now <em>remembers</em> what you've said, automatically pulling from a cache when the conversation goes down familiar paths.</p><p>* <strong>The Win:</strong> Up to 90% cheaper API calls. 
Yes, NINETY. 😳 It costs <em>1.4 CENTS</em> per million tokens for cached content. Let THAT sink in. 🤯</p><p>As Nisten (always bringing the technical breakdowns) explained:</p><p><strong>"Everyone should be using LLMs this way!...The simplest way is to have a long conversation ... then you save it on disk... you don't have to wait again ... [it's] kind of free. DeepSeek... did this in a more dynamic way"</strong>. - <em>Nisten</em></p><p>Even Matt Shumer, who usually advocates for clever prompting over massive context, got legitimately hyped about the possibilities:</p><p><strong>"For me, and how we use LLMs... instead of gathering a million examples... curate a hundred gold examples... you have something </strong><strong><em>better</em></strong><strong> than if you fine-tuned it, </strong><strong><em>and</em></strong><strong> cheaper, </strong><strong><em>and</em></strong><strong> faster..."</strong> - <em>Matt Shumer</em></p><p>Think about this... instead of painstakingly fine-tuning, we can "guide" models with expertly crafted examples, letting them learn "on the fly" with minimal cost. Context as the NEW fine-tuning! 🤯</p><p>P.S. Google actually also has caching on its Gemini API, but you have to opt in, while this happens automatically with the DeepSeek API!</p><p>Google Goes "Price War Nuclear": Gemini Flash is Officially TOO CHEAP</p><p>Speaking of sneaky advancements from Google... they <em>also</em> dropped an update SO casually impactful, it <em>almost</em> got lost in the shuffle. Gemini Flash (their smallest, but still crazy-good model) is now... 7.5 cents per million tokens for input and 30 cents per million tokens for output... (for up to 128k of context)</p><p>I REPEAT: 7.5 cents... with LONG context!? 🤯 Google, please chill, MY SANITY cannot handle this price free-fall any longer! 
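</p><p>To make those per-token prices concrete, here's a quick back-of-the-envelope cost sketch in Python. The dollar figures are just the prices quoted above converted from cents; treat them as this week's snapshot, since they keep dropping:</p>

```python
# Back-of-the-envelope cost for Gemini 1.5 Flash at the prices quoted
# above (the <=128K-context tier): 7.5 cents per 1M input tokens and
# 30 cents per 1M output tokens. Treat these as a snapshot; they change.

FLASH_INPUT_USD_PER_M = 0.075   # $ per million input tokens
FLASH_OUTPUT_USD_PER_M = 0.30   # $ per million output tokens

def flash_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of a single Gemini 1.5 Flash call, in US dollars."""
    return (input_tokens / 1_000_000) * FLASH_INPUT_USD_PER_M \
         + (output_tokens / 1_000_000) * FLASH_OUTPUT_USD_PER_M

# A hefty 100K-token prompt with a 2K-token answer:
print(f"${flash_cost_usd(100_000, 2_000):.4f}")  # $0.0081 - under a cent
```

<p>At these rates, even a 100K-token prompt costs less than a penny per call.</p><p>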
😂</p><p><a target="_blank" href="https://developers.googleblog.com/en/gemini-15-flash-updates-google-ai-studio-gemini-api/"><strong>Full Breakdown of Gemini’s Crazy New Prices on Google’s Blog</strong></a></p><p>While this USUALLY means a model's performance gets quietly nerfed in exchange for lower costs... in Gemini's case? Let's just say... even <em>I</em>, a staunch defender of open-source, am kinda SHOOK by how GOOD this thing is NOW!</p><p>After Google threw down this gauntlet, I <em>actually</em> used Gemini to draft my last ThursdAI newsletter (for the <em>first</em> time!). It nailed my tone and style <em>better than any other model I've tried</em> - and I've TRIED them ALL. 🤯 Even Nisten, who's super picky about his coding LLMs, gave it a rare nod of approval. Gemini's image understanding capabilities have improved significantly too! 🤯</p><p>Google also added improvements in how Gemini understands PDFs that are worth mentioning 👀</p><p>From JSON Headaches to REASONING Gains: What's <em>Really</em> New with GPT-4?</p><p>While Matt Shumer, my go-to expert on all things practical AI, might not be <em>immediately</em> impressed by OpenAI's new structured output features, they're still a huge win for many developers. Tired of LLM JSON going haywire? Well, GPT-4 can now <a target="_blank" href="https://openai.com/index/introducing-structured-outputs-in-the-api/">adhere to your </a><a target="_blank" href="https://openai.com/index/introducing-structured-outputs-in-the-api/"><em>exact</em></a><a target="_blank" href="https://openai.com/index/introducing-structured-outputs-in-the-api/"> schemas</a>, delivering 100% reliable structured data, no need for Instructor! 🙌</p><p><strong>This solves a </strong><strong><em>real</em></strong><strong> problem,</strong> even if the prompting gurus (like Matt) have figured out their own workarounds. 
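</p><p>To give a feel for the wire format, here's a minimal sketch of the new response_format payload. The outer "json_schema" envelope with strict mode is the documented shape; the inner "steps"/"answer" schema is my own toy example, not from OpenAI's docs:</p>

```python
import json

# Sketch of the request payload for OpenAI structured outputs.
# The "json_schema" envelope with strict=True is the documented shape;
# the "steps"/"answer" schema below is a made-up illustrative example.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "math_solution",     # hypothetical name for this example
        "strict": True,              # turns on exact schema adherence
        "schema": {
            "type": "object",
            "properties": {
                "steps": {"type": "array", "items": {"type": "string"}},
                "answer": {"type": "string"},
            },
            "required": ["steps", "answer"],
            "additionalProperties": False,
        },
    },
}

# This dict would be passed as response_format=... in a
# chat.completions.create call; the model's reply is then guaranteed
# to parse as JSON matching the schema.
print(json.dumps(response_format, indent=2))
```

<p>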
The key is:</p><p>* <strong>Determinism</strong>: This ain't your typical LLM chaos - they're <em>guaranteeing</em> consistency, essential for building reliable applications.</p><p>* <strong>Ease of use</strong>: No need for external libraries - it's built right into the API!</p><p>Plus... a sneaky price drop, folks! GPT-4 is now 50% cheaper for input tokens and 33% cheaper for output. As I said on the show:</p><p><strong>"Again, quite insane... we're getting 50% cheaper just </strong><strong><em>without fanfare</em></strong><strong>. We're going towards 'intelligence too cheap to meter'... it's crazy".</strong></p><p>And HERE'S the plot twist... multiple folks on stage (including the eager newcomer N8) noticed significant reasoning improvements in this new GPT-4 model. They tested it on tasks like lateral thinking puzzles and even anecdotally challenging tasks - and guess what? It consistently outperformed older versions. 🤯</p><p><strong>"I have my own benchmark... of lateral thinking puzzles... the new GPT-4 [scored] roughly five to 10% higher... these are like </strong><strong><em>really</em></strong><strong> hard lateral thinking puzzles that require </strong><strong><em>innovative</em></strong><strong> reasoning ability".</strong> - <em>N8</em></p><p>OpenAI isn't bragging about this upgrade explicitly, which makes me even MORE curious... 🤔</p><p>Mistral Joins the AGENT Hype Train (But Their Version is Different)</p><p>Everybody wants a piece of that AI "Agent" pie, and now Mistral (the scrappy, efficient French company) is stepping up. They announced a double whammy this week: fine-tuning is here AND "les agents" have arrived... <em>but their agents are NOT quite what we're seeing elsewhere</em> (think AutoGPT, CrewAI, all those looped assistants). 🤔</p><p><a target="_blank" href="https://mistral.ai/news/build-tweak-repeat/"><strong>Mistral's Blog Post - Fine-tuning & Agents... 
Ooh La La!</strong></a></p><p>Their <strong>fine-tuning service</strong> is pretty straightforward: upload your data and they'll host a bespoke Mistral Large V2 running through their API at no extra cost (very cool!).</p><p>Their <strong>agents</strong> aren't based on agentic loop-running like what we see from those recursive assistants. As I pointed out on ThursdAI:</p><p><strong>"[Mistral] agents are </strong><strong><em>not</em></strong><strong> agentic... They're more similar to... GPTs for OpenAI or 'Projects' in Anthropic, where... you as a user add examples and preload context"</strong>.</p><p>It's more about <em>defining</em> agents with examples and system prompts, essentially letting Mistral "pre-tune" their models for specific tasks. This lets you deploy those agents via the API <em>or</em> to their LeChat platform - pretty darn neat!</p><p><a target="_blank" href="https://console.mistral.ai/build/agents/new"><strong>Build your OWN agent - Mistral's "Agent Builder" is slick!</strong></a></p><p>While not as flashy as those recursive agents that build websites and write symphonies on their own, Mistral's take on the agent paradigm is strategic. It plays to their strengths:</p><p>* <strong>Developer-focused:</strong> It's about creating <em>bespoke, task-specific tools</em> - think API integrations, code reviewers, or content generators.</p><p>* <strong>Ease of deployment:</strong> No need for complex loop management, Mistral handles the hard parts for you!</p><p>Mistral even teased that they'll eventually be incorporating tool use... so these "pre-tuned" agents <em>could</em> quickly evolve into something very interesting. 😏</p><p>NVIDIA leak about downloading videos went viral (And the Internet... Didn't Like That!)</p><p>This week, I found myself unexpectedly at the center of an X drama explosion (fun times! 😅 ) when some leaked NVIDIA Slack messages showed them discussing which YouTube channels to scrape. My crime? 
I dared to ask how this is <em>different</em> from Google creating Street View, filming every street it could without asking for permission. <a target="_blank" href="https://x.com/altryne/status/1820571710288011312">My Honest Question that Sparked AI Outrage</a></p><p>The Internet, as it often does, had <em>thoughts</em>. The tweet blew up (like a <em>million</em> views blew up). I was labeled an apologist, a shill, all kinds of charming things... 😂 It got so intense, I had to MUTE the whole thing for my sanity's sake. BUT it brings up serious issues:</p><p>* <strong>AI & Copyright: Where the Heck are the Lines?</strong> Where does inspiration end and <em>infringement</em> begin when a model's trained on massive datasets? There's no legal precedent, folks, which is <em>scary</em>.</p><p>* <strong>Ethics vs. Innovation:</strong> AI progress moves FAST... sometimes FASTER than our ability to grasp the implications. That's <em>unsettling</em>.</p><p>* <strong>Twitter Pile-Ons & Nuance (aka What NOT to do):</strong> Look, I GET being passionate. BUT when criticism turns into name-calling and mob mentality, it shuts down any chance of meaningful conversation. That's not helping ANYONE.</p><p>Strawberry Shenanigans: Theories, Memes, and a Little AI LARPing? 🍓</p><p>And now, for the MAIN EVENT: STRAWBERRY! You might have heard whispers... seen those cryptic tweets... maybe <em>witnessed</em> that wild Twitter account firsthand! It all started with Sam Altman casually posting a pic of a strawberry garden with the caption "nice summer day". Then came the deluge - more pics of strawberries from OpenAI folks, even those cryptic, semi-official domain names LDJ uncovered... I even spotted a strawberry IN OUR audience for crying out loud! This thing spread like wildfire. 🔥</p><p>We spent a solid chunk of the episode piecing together the lore: Q*, the mystery model shrouded in secrecy for years, then that Bloomberg leak claiming it was code-named "Strawberry", and now <em>this</em>. 
It was peak AI conspiracy-theory land!</p><p>We <em>still</em> don't have hard confirmation on Q*... <em>but that strawberry account, spitting out fruit puns and pinging ChatGPT like a maniac?</em> Some on ThursdAI (Yam, mostly) believe that this may not have been a human at all, but an early, uncontrolled attempt to have an AI manage its own PR. 😳 I almost bought it - especially the way it reacted to some of my live comments - but now... the LARP explanation seems more likely.</p><p>Many folks at OpenAI posted things with strawberries as well. Was this a sign of something to come, or were they just trying to bury the news that <a target="_blank" href="https://www.theinformation.com/articles/trio-of-leaders-leave-openai">3 executives departed</a> the company this week under a mountain of 🍓?</p><p>Cursor & Composer: When Coding Becomes <em>AI-Powered Magic</em> ✨</p><p>I love a good tool... and this week, my dev heart was a-flutter over Cursor. Tried it yet? Seriously, you need to! It's VS Code, but SUPERCHARGED with AI that'll make you question why Copilot ever existed. 😂</p><p>You can edit code by CHAT, summarize entire files with one click, zap bugs instantly... but they just dropped their <em>ultimate</em> weapon: <strong>Composer</strong>. It's essentially a coding AGENT that does <em>multi-file edits</em>. 🤯</p><p>Matt Shumer (my SaaS wizard friend who adopted Cursor early) had some jaw-dropping examples:</p><p><strong>"[Composer] ... takes all the parts of Cursor you like and strings them together </strong><strong><em>as an agent</em></strong><strong>... it takes away a lot of the grunt work... you can say 'go add this feature'... it searches your files, figures out what to edit, then puts it together. ...I literally </strong><strong><em>built a SaaS in 20 minutes</em></strong><strong>!"</strong> - <em>Matt Shumer</em></p><p>Matt also said that using Cursor is required at their company! 
</p><p><em>Even</em> my stoic PyTorch friend, Mark, couldn't help but express some curiosity:</p><p><strong>"It's </strong><strong><em>cool</em></strong><strong> they're doing things like multi-file editing... pretty curious to see more projects along those lines"</strong> - <em>Mark Saroufim</em></p><p>Yeah, it's still in the rough-around-the-edges stage (UX could use some polish). But THIS, folks, is the <em>future of coding</em> - less about hammering out syntax, more about <em>describing INTENT</em> and letting the AI handle the magic! 🤯 I can't <em>wait</em> to see what they do next.</p><p>Download at <a target="_blank" href="http://cursor.sh">cursor.sh</a> and let me know what you think.</p><p>Conclusion: The Future Is FAST, Open, And Maybe a Bit TOO Spicy? 🌶️😂</p><p>Honestly, every single week leaves me awestruck by how fast this AI world is moving. 🤯 We went from "transformers? Huh?" to 70-point math models running on SMARTWATCHES and AI building ENTIRE web apps <em>in less than two years</em>. And I still haven't gotten GPT-4o's new voice mode yet!!</p><p>Open source keeps proving its power, even THOSE BIG companies are getting in on the action (look at those Google prices! 😍), and then you've got those captivating mysteries keeping us on our toes... like those damned strawberries! 🍓 What DOES OpenAI have up their sleeve??</p><p>As <em>always</em>, huge THANK YOU to the amazing guests who make this show what it is - this week, extra kudos to Junyang, Nisten, LDJ, Mark, Yam, and Eric, you guys ROCK. 🔥 And HUGE gratitude to each and every ONE of you readers/listeners (and NEW folks who stuck around after those Strawberry bait tweets! 😂) You make this ThursdAI community truly <em>unstoppable</em>. 💪</p><p>Keep on building, stay insanely curious, and I'll see you <em>next</em> Thursday - ready or not, that AI future is coming in hot! 🔥🚀</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug8-qwen2-math-king-tiny</link><guid isPermaLink="false">substack:post:147498681</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 08 Aug 2024 22:00:31 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147498681/98ea5fc5e03eed11913b3b2ec7059ebd.mp3" length="75050478" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6254</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/147498681/5476bbe48e35b36e289784bafa03bc47.jpg"/></item><item><title><![CDATA[📆 ThursdAI - August 1st - Meta SAM 2 for video, Gemini 1.5 is king now?, GPT-4o Voice is here (for some), new Stability, Apple Intelligence also here & more AI news]]></title><description><![CDATA[<p>Starting Monday, Apple released iOS 18.1 with <strong>Apple Intelligence</strong>, then Meta dropped <strong>SAM-2</strong> (Segment Anything Model) and then Google first open sourced <strong>Gemma 2B</strong> and now (just literally 2 hours ago, during the live show) released <strong>Gemini 1.5 0801 experimental</strong> that takes #1 on LMsys arena across multiple categories, to top it all off we also got a new <strong>SOTA image diffusion model</strong> called FLUX.1 from ex-stability folks and their new Black Forest Lab.</p><p>This week on the show, we had <a target="_blank" href="https://x.com/josephofiowa">Joseph</a> & <a target="_blank" href="https://x.com/skalskip92">Piotr Skalski</a> from Roboflow, talk in depth about Segment Anything, and as the absolute experts on this topic 
(Skalski is our returning vision expert), it was an incredible deep dive into the importance of dedicated vision models (not VLMs).</p><p>We also had <a target="_blank" href="https://x.com/LucasAtkins7">Lukas Atkins</a> & <a target="_blank" href="https://x.com/FernandoNetoAi">Fernando Neto</a> from Arcee AI talk to us about their new DistillKit and explain model distillation in detail & finally we had <a target="_blank" href="https://x.com/CrisGiardina"><strong>Cristiano Giardina</strong></a><strong> </strong>who is one of the lucky few that got access to OpenAI advanced voice mode + <strong>his new friend GPT-4o came on the show as well!</strong></p><p>Honestly, how can one keep up with all this? By reading ThursdAI, of course, that's how! But ⚠️ buckle up, this is going to be a BIG one (over 4.5K words, the longest newsletter I've penned, I'm sorry, maybe read this one on 2x? 😂)</p><p>[ Chapters ]</p><p>00:00 Introduction to the Hosts and Their Work</p><p>01:22 Special Guests Introduction: Piotr Skalski and Joseph Nelson</p><p>04:12 Segment Anything 2: Overview and Capabilities</p><p>15:33 Deep Dive: Applications and Technical Details of SAM2</p><p>19:47 Combining SAM2 with Other Models</p><p>36:16 Open Source AI: Importance and Future Directions</p><p>39:59 Introduction to Distillation and DistillKit</p><p>41:19 Introduction to DistillKit and Synthetic Data</p><p>41:41 Distillation Techniques and Benefits</p><p>44:10 Introducing Fernando and Distillation Basics</p><p>44:49 Deep Dive into Distillation Process</p><p>50:37 Open Source Contributions and Community Involvement</p><p>52:04 ThursdAI Show Introduction and This Week's Buzz</p><p>53:12 Weights & Biases New Course and San Francisco Meetup</p><p>55:17 OpenAI's Advanced Voice Mode and Cristiano's Experience</p><p>01:08:04 SearchGPT Release and Comparison with Perplexity</p><p>01:11:37 Apple Intelligence Release and On-Device AI Capabilities</p><p>01:22:30 Apple Intelligence and Local 
AI</p><p>01:22:44 Breaking News: Black Forest Labs Emerges</p><p>01:24:00 Exploring the New Flux Models</p><p>01:25:54 Open Source Diffusion Models</p><p>01:30:50 LLM Course and Free Resources</p><p>01:32:26 FastHTML and Python Development</p><p>01:33:26 <a target="_blank" href="http://Friend.com">Friend.com</a>: Always-On Listening Device</p><p>01:41:16 Google Gemini 1.5 Pro Takes the Lead</p><p>01:48:45 GitHub Models: A New Era</p><p>01:50:01 Concluding Thoughts and Farewell</p><p></p><p>Show Notes & Links</p><p>* <strong>Open Source LLMs</strong></p><p>* Meta gives SAM-2 - segment anything with one shot + video capability! (<a target="_blank" href="https://x.com/AIatMeta/status/1818055906179105010">X</a>, <a target="_blank" href="https://ai.meta.com/blog/segment-anything-2/?utm_source=twitter&#38;utm_medium=organic_social&#38;utm_content=reel&#38;utm_campaign=sam2">Blog</a>, <a target="_blank" href="https://sam2.metademolab.com/demo">DEMO</a>)</p><p>* Google open sources Gemma 2 2.6B (<a target="_blank" href="https://developers.googleblog.com/en/smaller-safer-more-transparent-advancing-responsible-ai-with-gemma/?linkId=10532702">Blog</a>, <a target="_blank" href="https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f">HF</a>)</p><p>* MTEB Arena launching on HF - Embeddings head to head (<a target="_blank" href="https://huggingface.co/spaces/mteb/arena">HF</a>)</p><p>* Arcee AI announces DistillKit - (<a target="_blank" href="https://x.com/arcee_ai/status/1819015876575637890">X</a>, <a target="_blank" href="https://blog.arcee.ai/announcing-distillkit/">Blog</a>, <a target="_blank" href="https://github.com/arcee-ai/DistillKit?ref=blog.arcee.ai">Github</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Black Forest Labs - FLUX new SOTA diffusion models (<a target="_blank" href="https://twitter.com/robrombach/status/1819012132064669739">X</a>, <a target="_blank" 
href="https://blackforestlabs.ai/announcing-black-forest-labs/">Blog</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/flux/dev/playground">Try It</a>)</p><p>* Midjourney 6.1 update - greater realism + potential Grok integration (<a target="_blank" href="https://x.com/midjourney/status/1818342703618482265?utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-to-be-named-5098">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google updates Gemini 1.5 Pro with 0801 release and is #1 on LMsys arena (<a target="_blank" href="https://x.com/OfficialLoganK/status/1819041453978357899">X</a>)</p><p>* OpenAI started alpha GPT-4o voice mode (<a target="_blank" href="https://x.com/CrisGiardina/status/1818687256741187909">examples</a>)</p><p>* OpenAI releases SearchGPT (<a target="_blank" href="https://openai.com/index/searchgpt-prototype/">Blog</a>, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-searchgpt-vs-perplexity">Comparison w/ PPXL</a>)</p><p>* Apple releases beta of iOS 18.1 with Apple Intelligence (X, hands on, <a target="_blank" href="https://x.com/rudrankriyam/status/1818006716023296277">Intents</a> )</p><p>* Apple released a <a target="_blank" href="https://x.com/ruomingpang/status/1817983627340472642">technical paper</a> of apple intelligence</p><p>* <strong>This weeks Buzz</strong></p><p>* <a target="_blank" href="https://wandb.ai/site/resources/events/genai-salon-with-shreya??utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug1">AI Salons</a> in SF + <a target="_blank" href="https://www.wandb.courses/courses/101-weave??utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug1">New Weave course</a> for WandB featuring yours truly!</p><p>* <strong>Vision & Video</strong></p><p>* Runway ML adds Gen -3 image to video and makes it 7x faster (<a target="_blank" href="https://x.com/altryne/status/1818695305182757235">X</a>)</p><p>* Tools & Hardware</p><p>* Avi 
announces f<a target="_blank" href="https://t.co/hE7tFK4sTR">riend.com</a></p><p>* Jeremy Howard releases FastHTML (<a target="_blank" href="https://fastht.ml/">Site</a>, Video)</p><p>* Applied LLM course from Hamel dropped <a target="_blank" href="https://x.com/HamelHusain/status/1817935895246635362">all videos</a></p><p>Open Source</p><p>It feels like everyone and their grandma is open sourcing incredible AI this week! Seriously, get ready for segment-anything-you-want + real-time-video capability PLUS small AND powerful language models.</p><p>Meta Gives Us SAM-2: Segment ANYTHING Model in Images & Videos... With One Click!</p><p>Hold on to your hats, folks! Remember Segment Anything, Meta's already-awesome image segmentation model? They've just ONE-UPPED themselves. Say hello to SAM-2 - it's real-time, promptable (you can TELL it what to segment), and handles VIDEOS like a champ. As I said on the show: "I was completely blown away by segment anything 2".</p><p><strong>But wait, what IS segmentation?</strong> Basically, pixel-perfect detection - outlining objects with incredible accuracy. My guests, the awesome Piotr Skalski and Joseph Nelson (computer vision pros from Roboflow), broke it down historically, from SAM 1 to SAM 2, and highlighted just how mind-blowing this upgrade is.</p><p><strong>"So now, Segment Anything 2 comes out. Of course, it has all the previous capabilities of Segment Anything ... But the segment anything tool is awesome because it also can segment objects on the video".</strong> - <em>Piotr Skalski</em></p><p>Think about Terminator vision from the "give me your clothes" bar scene: you see a scene, instantly "understand" every object separately, AND track it as it moves. SAM-2 gives us that, allowing you to click on a single frame, and BAM - perfect outlines that flow through the entire video! I played with their playground, and you NEED to try it - you can blur backgrounds, highlight specific objects... the possibilities are insane. 
<a target="_blank" href="https://sam2.metademolab.com/demo">Playground Link</a></p><p>In this video, Piotr annotated only the first few frames of the top video, and SAM understood the bottom two shots from 2 different angles!</p><p><em>Okay, cool tech, BUT why is it actually USEFUL?</em> Well, Joseph gave us incredible examples - from easier sports analysis and visual effects (goodbye manual rotoscoping) to advances in microscopic research and even galactic exploration! Basically, any task requiring precise object identification gets boosted to a whole new level.</p><p><strong>"SAM does an incredible job at creating pixel perfect outlines of everything inside visual scenes. And with SAM2, it does it across videos super well, too ... That capability is still being developed for a lot of AI Models and capabilities. So having very rich ability to understand what a thing is, where that thing is, how big that thing is, allows models to understand spaces and reason about them"</strong> - <em>Joseph Nelson</em></p><p>AND if you combine this power with other models (like Piotr is already doing!), you get zero-shot segmentation - literally type what you want to find, and the model will pinpoint it in your image/video. It's early days, but get ready for robotics applications, real-time video analysis, and who knows what else these clever hackers are dreaming up! 🤯</p><p><a target="_blank" href="https://huggingface.co/spaces/SkalskiP/florence-sam">Check out Piotr's Zero Shot Florence + Sam2 Implementation</a></p><p>Best of all? Apache 2 license, baby! As Joseph said, <strong>"Open source is foundational to making the accessibility, the use cases, and the advancement of the field overall"</strong>, and this is a prime example. Huge kudos to Meta for empowering us with this tech.</p><p>The whole conversation w/ Piotr & Joseph is very much worth listening to on the pod 🎙️</p><p>Google Throws Down The Gauntlet: Open Sourcing Gemma 2 2.6B</p><p>It was Meta vs. 
Google on Monday because, NOT to be outdone, Google also went on an open-sourcing spree. This time, they gifted us Gemma 2 (a 2.6 billion parameter powerhouse), alongside a safety-focused suite called ShieldGemma AND a transparency tool called Gemma Scope.</p><p><strong>So what makes Gemma 2 special?</strong> First off, it's optimized for on-device use, meaning super-efficient local running. BUT there's a catch, folks... They claim it beats Mixtral AND Llama 2 70B on the LMsys Arena leaderboard, with an ELO score of 1126. Hold on, a 2 billion parameter model outperforming the big boys? 🤨 As LDJ (one of my regular co-hosts) said on the show:</p><p><strong>"Yeah, I think my best theory here is... there's at least two or three variables at play ... In LMSys, people are much more likely to do single turn, and within LMSys, people will usually be biased more towards rating models with a more recent knowledge cutoff as higher".</strong></p><p>Translation? It <em>might</em> be gaming the system a bit, but either way, Gemma 2 is an exciting release - super fast, small enough for on-device applications, and coming with safety tools right out of the gate! I think Zenova (our Hugging Face wizard) is already running this on WebGPU! You NEED to try it out.</p><p><a target="_blank" href="https://huggingface.co/collections/google/gemma-2-2b-release-66a20f3796a2ff2a7c76f98f">Gemma 2 HF Link</a></p><p>And Gemma Scope? That's some cool, cool stuff too. Think about peeking inside the "brain" of the model - you can actually SEE how Gemma 2 processes information. Remember Anthropic Mechinterp? It's like that, giving us unprecedented transparency into how these systems actually "think". You gotta see it on Neuronpedia. <a target="_blank" href="https://www.neuronpedia.org/gemma-scope#microscope">Neuronpedia link</a></p><p>It's Meta versus Google - round one, FIGHT! 
🥊</p><p>Distilling Knowledge: Arcee AI Drops DistillKit!</p><p>Just when I thought the week was done throwing surprises, Arcee AI casually dropped DistillKit - an open source tool to build <em>distilled</em> language models. Now, this is some NEXT level stuff, folks. We talked with Lukas Atkins and Fernando (the brilliant minds behind DistillKit), and I finally learned what the heck "distillation" really <em>means</em>.</p><p><strong>"TLDR - we teach a smaller model to think like a bigger model"</strong></p><p><em>In a nutshell: teach a smaller model how to think like a larger one</em>. Think GPT-4o and GPT-4o Mini, where the smaller model supposedly got the "essence" of the bigger version. Or imagine a tiny Llama that inherited the smarts of 405B - ridiculous! 🤯 As Fernando eloquently put it:</p><p>So in the finetuning that we have been doing, just in terms of generating text instructions and so on, we were observing only the token that was generated from the teacher model. And now with the distillation, we are <strong>observing the whole distribution of the tokens</strong> that could be sampled.</p><p>Now I admit, even after Fernando's expert breakdown, my brain still kind of melted. 🫠 BUT, here's why this matters: distilled models are <em>super</em> efficient, saving on cost and resources. Imagine powerful AI that runs seamlessly on your phone! 
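</p><p>For the curious, the core of what Fernando described, matching the teacher's whole next-token distribution instead of just its single sampled token, boils down to a KL-divergence loss between the two distributions. Here's a minimal NumPy sketch of that idea (a toy illustration, not DistillKit's actual code):</p>

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the full next-token distribution.

    Hard-label finetuning only rewards the one token the teacher sampled;
    this loss pulls the student's ENTIRE distribution toward the teacher's.
    """
    p = softmax(teacher_logits, temperature)   # teacher's "soft labels"
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

teacher = np.array([4.0, 2.0, 1.0, 0.5])  # toy 4-token vocabulary
student = np.array([3.0, 2.5, 0.5, 0.2])
print(distill_loss(teacher, student))      # > 0 whenever they differ
```

<p>The loss is zero when the student matches the teacher exactly and grows as the distributions diverge, which is exactly the "observing the whole distribution" signal Fernando contrasted with plain finetuning.</p><p>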
🤯 Arcee is making this possible for everyone.</p><p><a target="_blank" href="https://github.com/arcee-ai/DistillKit?ref=blog.arcee.ai">Check Out DistillKit Here</a></p><p>Was it pure coincidence they released this on the same week as the Llama 3.1 LICENSE CHANGE (Zuckerberg is clearly watching ThursdAI...), which makes distillation perfectly legal?</p><p>It's wild, exciting, AND I predict a massive surge in smaller, specialized AI tools that inherit the intelligence of the big boys.</p><p>This week's buzz</p><p>Did I already tell you that someone came up to me and said, hey, you're from Weights & Biases, you are the guys who make the courses right? 😂 I said, well yeah, we have a bunch of free courses on <a target="_blank" href="https://www.wandb.courses/courses/101-weave??utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug1">wandb.courses</a> but we also have world-leading ML experiment tracking software and an LLM observability toolkit among other things. It was really funny he thought we're just a courses company!</p><p>Well, this last week, my incredible colleague Agata, who's in charge of our courses, took the initiative and stitched together a course about <a target="_blank" href="https://wandb.me/weave/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug1">Weave</a> from a bunch of videos that I already had recorded! It's awesome, please <a target="_blank" href="https://www.wandb.courses/courses/101-weave??utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=aug1">check it out</a> if you're interested in learning about Weave 👏</p><p>P.S. We are also starting a series of AI events in our SF office called AI Salons. The first one is going to feature Shreya Shankar and focus on evaluations, it's on August 15th, so if you're in SF, you're invited for free as a ThursdAI subscriber! 
<a target="_blank" href="https://wandb.ai/site/resources/events/genai-salon-with-shreya">Get free tickets</a></p><p>Big Co AI - LLMs & APIs</p><p>Not only was open source popping off, but those walled-garden mega corps wanted in on the action too! SearchGPT, anyone?</p><p>From Whispers to Reality: OpenAI Alpha Tests GPT-4 Voice (and IT'S WILD)</p><p>This was THE moment I'd been waiting for, folks - GPT-4 with ADVANCED VOICE is finally trickling out to alpha users. Did I get access? NO. 😩 But my new friend, Cristiano Giardina, DID, and you've probably seen his viral videos of this tech - they're blowing up MY feed, even Sam Altman retweeted the above one! <strong>I said on the show, this new voice "feels like a big next unlock for AI"</strong></p><p><strong>What sets this apart from the "regular" GPT-4 voice we have now?</strong> As Cristiano told us:</p><p><strong>"the biggest difference is that the emotion, and the speech is very real and it follows instructions regarding emotion very well, like you can ask it to speak in a more animated way, you can ask it to be angry, sad, and it really does a good job of doing that."</strong></p><p>We did a LIVE DEMO (it worked, thank God), and y'all... I got CHILLS. We heard counting with a breath, depressed Soviet narrators, even a "GET TO THE CHOPPA" Schwarzenegger moment that <em>still</em> makes me laugh 😂 It feels like a completely different level of interaction, something genuinely conversational and even emotional. Check out Cristiano's profile for more insane demos - you won't be disappointed. <a target="_blank" href="https://x.com/CrisGiardina">Follow Cristiano Here For Amazing Voice Mode Videos</a></p><p>Can't wait for access; if anyone from OpenAI is reading this, hook me up 🙏 I'll trade my SearchGPT access!</p><p>SearchGPT: OpenAI Throws Their Hat Into The Ring (again?)</p><p>Did OpenAI want to remind everyone they're STILL here amidst the LLama/Mistral frenzy?
Maybe that's why they released SearchGPT - their newest "search engine that can hold a conversation" tool. Again, waitlisted, but unlike with voice mode... I got access. 😅</p><p><strong>The good:</strong> Fast. Really fast. And <em>impressively</em> competent, considering it's still a demo. Handles complex queries well, and its "follow-up" ability blows even Perplexity out of the water (which is impressive).</p><p><strong>The less-good:</strong> Still feels early, especially for multi-language and super local stuff. Honestly, it feels more like a sneak peek of an upcoming ChatGPT integration than a standalone competitor to Google.</p><p>But either way, it's an interesting development - as you may have already learned from my full breakdown of SearchGPT vs. Perplexity.</p><p>Apple Intelligence is here! (sort of)</p><p>And speaking of big companies, how could I not mention the Apple Intelligence release this week? Apple finally dropped iOS 18.1 with the long-awaited ON-DEVICE intelligence, powered by the Apple Foundational Model (AFM). Privacy nerds rejoice! 🎉</p><p>But how good is it? Mixed bag, I'd say. It's there, and definitely usable for summarization, rewriting tools, text suggestions... but Siri STILL isn't hooked up to it yet, though speech-to-text is way faster and she does look more beautiful. 🤔 Apple <em>did</em> release a <a target="_blank" href="https://x.com/ruomingpang/status/1817983627340472642">ridiculously detailed paper</a> explaining how they trained this model on Apple silicon... and as Nisten (ever the voice of honesty) said on the show,</p><p><strong>"It looks like they've stacked a lot of the tricks that had been working ... overall, they're not actually really doing anything new ... the important thing here is how they apply it all as a system that has access to all your personal data."</strong></p><p>Yeah, ouch, BUT still exciting, especially as we get closer to truly personal, on-device AI experiences.
Right now, it's less about revolutionary advancements and more about how Apple can weave this into our lives seamlessly - they're focusing heavily on <em>app intents</em>, meaning AI that can <em>actually DO</em> things for you (think scheduling appointments, drafting emails, finding that photo buried in your library). I'll keep testing this; the more I play around, the more I find. For example, it suddenly started suggesting replies in Messages for me, though I haven't yet seen the filtered notifications view, where it smartly only lets important messages through your focus mode.</p><p>So stay tuned, but it's likely not worth the beta iOS upgrade if you're not a dev or a very strong enthusiast.</p><p>Wait, MORE Breaking News?? The AI World Doesn't Sleep!</p><p>If this episode wasn't already enough... the very day of the live show, as we're chatting, I get bombarded with breaking news alerts from my ever-vigilant listeners.</p><p><strong>1. Gemini 1.5 Pro 0801 - Now #1 on LMsys Arena!</strong> 🤯 Google apparently loves to ship big right AFTER I finish recording ThursdAI (this happened last week too!). Gemini's new version, released WHILE we were talking about older Gemini versions, claimed the top spot with an <em>insane</em> 1300 ELO score - crushing GPT-4 and taking home 1st place in Math, Instruction Following, <em>and</em> Hard Prompts! It's experimental, it's up on <a target="_blank" href="https://aistudio.google.com/app/prompts/new_chat">Google AI Studio</a>... Go play with it! (and then tag me with your crazy findings!)</p><p>And you know what? Some of this blog was drafted by this new model. In fact, I sent the same prompt to Claude Sonnet 3.5 and Mistral Large v2 (I also tried LLama 3.1 405B but couldn't find any services that host the full context window), and this Gemini just absolutely demolished all of them on tone and on imitating my style; it even took some of the links from my TL;DR and incorporated them into the draft on its own!
I've never seen any other model do that! I haven't used any LLMs so far for this blog besides proof-reading because, well, they all kinda sucked, but damn, I dare you to try and find out where in this blog it was me and where it was Gemini.</p><p><strong>2. GitHub Does a Hugging Face: Introducing GitHub Models!</strong></p><p>This dropped just as we wrapped - basically a built-in marketplace where you can try, test, and deploy various models <em>right</em> within GitHub! They've already got LLaMa, Mistral, and some Azure-hosted GPT-4o stuff - very intriguing... Time will tell what Microsoft is cooking here, but you can bet I'll be investigating! 🕵️</p><p>AI Art & Diffusion</p><p>New Stability: Black Forest Labs <em>and FLUX.1</em> Rise!</p><p>Talk about a comeback story: 14 ex-Stability AI pros, led by Robin Rombach, Andreas Blattmann & Patrick Esser (the OG creators of Stable Diffusion), are back with $31 million in funding from a16z to make diffusion dreams come true. Enter <strong>Black Forest Labs</strong>. Their first gift? <strong>FLUX.1 - a suite of text-to-image models so good, they're breaking records.</strong> I saw those demos and wow. It's good, like REALLY good. 🤯</p><p><a target="_blank" href="https://fal.ai/models/fal-ai/flux/dev/playground">Try it out here</a></p><p>And the <em>real</em> bomb? They're working on <strong>open-source TEXT-TO-VIDEO!</strong> That's right, imagine generating those mind-blowing moving visuals... with code anyone can access. It's in their "Up Next" section, so watch that space - it's about to get REAL interesting.</p><p><strong>Also...
Midjourney 6.1 also came out, and it looks GOOD</strong></p><p>And you can see a comparison between the two new leading models in this thread by <a target="_blank" href="https://x.com/TheNoahHein/status/1819098232636481711">Noah Hein</a></p><p>Tools & Hardware: When AI Gets Real (And <em>Maybe</em> Too Real...)</p><p>You knew I had to close this madness out with some Hardware, because hardware means that we actually get to <em>interact</em> with these incredible models in a meaningful way.</p><p>Friend.com: When Your AI Is... Always Listening? 🤨</p><p>And then this happened... Avi Schiffmann (finally) announces <a target="_blank" href="friend.com">friend.com</a> with an amazingly dystopian promo video from Sandwich. ~22 million views and counting, not by accident! <a target="_blank" href="https://x.com/AviSchiffmann/status/1818284595902922884">Link to Video</a>.</p><p>It's basically an always-on, listening pendant. "<strong>A little like wearing a wire</strong>," as Nisten so eloquently put it. 🎧 Not for memory extension or productivity... for <em>friendship</em>. Target audience? Lonely people who want a device that captures and understands their entire lives, but in an almost comforting way (or maybe unsettling, depending on your viewpoint!). The debate about privacy is already RAGING... But as Nisten pointed out:</p><p><strong>"Overall, it is a positive. ...The entire privacy talk and data ownership, I think that's a very important conversation to have."</strong></p><p>I kinda get the vision. Combine THIS tech with GPT-4 Voice speed... you could actually have engaging conversations, 24/7! 🤯 I don't think it's as simple as "this is dystopian, end of story". Character AI is EXPLODING right now - remember those usage stats, over 20 million users and counting?
The potential to help with loneliness <em>is</em> real...</p><p>The Developer Corner: Tools for Those Hacking This Future</p><p>Gotta love these shoutouts:</p><p>* <strong>FastHTML from Jeremy Howard:</strong> Not <em>strictly</em> AI, but if you hate JS and love Python, this one's for you - insanely FAST web dev using a mind-bending new syntax. <a target="_blank" href="https://fastht.ml/">FastHTML website link</a></p><p>* <strong>Hamel Husain's Applied LLM Course - All Videos NOW Free!:</strong> Want to learn from some of the <em>best</em> minds in the field (including Jeremy Howard, Shreya Shankar (evaluation QUEEN), Charles Frye and tons of other great speakers)? This course covers it all - from finetuning to RAG building to optimizing your prompts. <a target="_blank" href="https://x.com/HamelHusain/status/1817935895246635362">Applied LLMs course - free videos link</a></p><p><em>AND ALSO ... Nisten blew everyone's minds again in the end! Remember last week, when we thought it'd take time before anyone could run Llama 3.1 405B on just a CPU?</em> Well, this crazy genius already cracked the code - seven tokens per second on a normal CPU! 🤯 If you're a researcher who hates using cloud GPUs (or wants to use ALL THOSE CORES in your Lambda machine, wink wink)... get ready.</p><p>Look, I'm not going to sit here and pretend that weeks are not getting crazier; it takes me longer and longer to prep for the show, it's harder and harder to contain the show to 2 hours, and we had 3 breaking news stories just today!</p><p>So we're accelerating, and I'll likely be using a bit of support from AI, but only if it's good, and only if it's proofread by me, so please let me know if you smell slop! I really wanna know!</p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-august-1st-meta-sam-2-for</link><guid isPermaLink="false">substack:post:147254712</guid><dc:creator><![CDATA[Alex Volkov, Joseph, Lucas Atkins, Fernando Neto, and Piotr Skalski]]></dc:creator><pubDate>Thu, 01 Aug 2024 23:48:59 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147254712/1983de352705fe5496d63904eff936a4.mp3" length="81078509" type="audio/mpeg"/><itunes:author>Alex Volkov, Joseph, Lucas Atkins, Fernando Neto, and Piotr Skalski</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6756</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/147254712/75d6b9916745c6ffac9ff2ebe160d8c5.jpg"/></item><item><title><![CDATA[🧨 ThursdAI - July 25 - OpenSource GPT4 intelligence has arrived - Meta LLaMa 3.1 405B beats GPT4o! Mistral Large 2 also, Deepseek Code v2 ALSO - THIS WEEK]]></title><description><![CDATA[<p>Holy s**t, folks! I was off for two weeks, last week OpenAI released GPT-4o-mini and everyone was in my mentions saying, Alex, how are you missing this?? and I'm so glad I missed that last week and not this one, because while GPT-4o-mini is incredible (GPT-4o level distill with incredible speed and almost 99% cost reduction from 2 years ago?) it's not open source. </p><p>So welcome back to ThursdAI, and buckle up because we're diving into what might just be the craziest week in open-source AI since... well, ever!</p><p>This week, we saw Meta drop LLAMA 3.1 405B like it's hot (including updated 70B and 8B), Mistral joining the party with their Large V2, and DeepSeek quietly updating their coder V2 to blow our minds. Oh, and did I mention Google DeepMind casually solving math Olympiad problems at silver level medal 🥈? 
Yeah, it's been that kind of week.</p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source</strong></p><p>* Meta LLama 3.1 updated models (405B, 70B, 8B) - Happy LLama Day! (<a target="_blank" href="https://x.com/AIatMeta/status/1815766327463907421">X</a>, <a target="_blank" href="https://ai.meta.com/blog/meta-llama-3-1/?utm_source=twitter&#38;utm_medium=organic_social&#38;utm_content=video&#38;utm_campaign=llama31">Announcement</a>, <a target="_blank" href="https://www.youtube.com/watch?v=Vy3OkbtUa5k">Zuck</a>, <a target="_blank" href="https://meta.ai">Try It</a>, <a target="_blank" href="https://fast.snova.ai/">Try it Faster</a>, <a target="_blank" href="https://twitter.com/altryne/status/1815824095839244399">Evals</a>, <a target="_blank" href="https://twitter.com/altryne/status/1816239501204955616">Provider evals</a>)</p><p>* Mistral Large V2 123B (<a target="_blank" href="https://x.com/dchaplot/status/1816132981377097883">X</a>, <a target="_blank" href="https://huggingface.co/mistralai/Mistral-Large-Instruct-2407">HF</a>, <a target="_blank" href="https://mistral.ai/news/mistral-large-2407/">Blog</a>, <a target="_blank" href="https://chat.mistral.ai/">Try It</a>)</p><p>* DeepSeek-Coder-V2-0724 update (<a target="_blank" href="https://platform.deepseek.com/api-docs/updates/">API only</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🥈 Google Deepmind wins silver medal at Math Olympiad - AlphaGeometry 2 (<a target="_blank" href="https://x.com/GoogleDeepMind/status/1816498082860667086">X</a>)</p><p>* OpenAI teases SearchGPT - their reimagined search experience (<a target="_blank" href="https://openai.com/index/searchgpt-prototype/">Blog</a>)</p><p>* OpenAI opens GPT-4o-mini finetunes + 2 month free (<a target="_blank" href="https://x.com/OpenAIDevs/status/1815836887631946015">X</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* I compare 5 LLama API providers for speed and quantization using <a target="_blank" 
href="https://wandb.me/weave/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=july25">Weave</a> (<a target="_blank" href="https://x.com/altryne/status/1816239501204955616">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Daily announces a new open standard for real time Voice and Video RTVI-AI (<a target="_blank" href="https://twitter.com/trydaily/status/1815530613434417241">X</a>, <a target="_blank" href="https://demo.rtvi.ai/">Try it</a>, <a target="_blank" href="https://twitter.com/trydaily/status/1815530613434417241">Github</a>)</p><p>Meta LLAMA 3.1: The 405B Open Weights Frontier Model Beating GPT-4 👑</p><p></p><p>Let's start with the star of the show: Meta's LLAMA 3.1. This isn't just a 0.1 update; it's a whole new beast. We're talking about a 405 billion parameter model that's not just knocking on GPT-4's door – it's kicking it down.</p><p>Here's the kicker: you can actually download this internet-scale intelligence (if you have 820GB free). That's right, a state-of-the-art model beating GPT-4 on multiple benchmarks, and you can click a download button. As I said during the show, "This is not only refreshing, it's quite incredible."</p><p>Some highlights:</p><p>* 128K context window (finally!)</p><p>* MMLU score of 88.6</p><p>* <strong>Beats GPT-4 on several benchmarks</strong> like IFEval (88.6%), GSM8K (96.8%), and ARC Challenge (96.9%)</p><p>* Has Tool Use capabilities (also beating GPT-4) and is Multilingual (ALSO BEATING GPT-4)</p><p>But that's just scratching the surface. Let's dive deeper into what makes LLAMA 3.1 so special.</p><p>The Power of Open Weights</p><p>Mark Zuckerberg himself dropped an exclusive interview with our friend Rowan Cheung from Rundown AI. And let me tell you, Zuck's commitment to open-source AI is no joke.
He talked about distillation, technical details, and even released a manifesto on why open AI (the concept, not the company) is "<a target="_blank" href="https://about.fb.com/news/2024/07/open-source-ai-is-the-path-forward/">the way forward</a>".</p><p>As I mentioned during the show, "The fact that this dude, like my age, I think he's younger than me... knows what they released to this level of technical detail, while running a multi-billion dollar company is just incredible to me."</p><p>Evaluation Extravaganza</p><p>The evaluation results for LLAMA 3.1 are mind-blowing. We're not just talking about standard benchmarks here. The model is crushing it on multiple fronts:</p><p>* MMLU (Massive Multitask Language Understanding): 88.6%</p><p>* IFEval (Instruction Following): <strong>88.6%</strong></p><p>* GSM8K (Grade School Math): <strong>96.8%</strong></p><p>* ARC Challenge: 96.9%</p><p>But it doesn't stop there. The fine folks at Meta also, for the first time, added new categories like Tool Use (BFCL 88.5) and Multilinguality (Multilingual MGSM 91.6) (not to be confused with MultiModality, which is not yet here, <a target="_blank" href="https://twitter.com/reach_vb/status/1815776912402497908">but soon</a>).</p><p>Now, these are official evaluations from Meta themselves, which, as we know, often don't really represent the quality of the model, so let's take a look at other, more vibey results, shall we?</p><p></p><p>On SEAL leaderboards from Scale (held back so they can't be trained on), LLama 405B is beating ALL other models on Instruction Following, coming 4th at Coding and 2nd at Math tasks.
</p><p>On MixEval (the eval that approximates LMsys with 96% accuracy), my colleagues Ayush and Morgan got a whopping 66%, placing 405B just after Claude Sonnet 3.5 and above GPT-4o.</p><p></p><p>And there are more evals that all tell the same story: we have a winner here, folks (see the rest of the evals in my <a target="_blank" href="https://twitter.com/altryne/status/1815824095839244399">thread roundup</a>)</p><p>The License Game-Changer</p><p>Meta didn't just release a powerful model; they also updated their license to allow for synthetic data creation and distillation. This is huge for the open-source community.</p><p>LDJ highlighted its importance: "I think this is actually pretty important because even though, like you said, a lot of people still train on OpenAI outputs anyways, there's a lot of legal departments in a lot of small, medium, and large companies that restrict the people building and fine-tuning AI models within that company from actually being able to build the best models that they can because of these restrictions."</p><p>This update could lead to a boom in custom models and applications across various industries, as companies can start distilling, finetuning and creating synthetic datasets using these incredibly smart models.</p><p>405B: A Double-Edged Sword</p><p>While the 405B model is incredibly powerful, it's not exactly practical for most production use cases, as you need 2 nodes of 8 H100s to run it in full precision. Pricing wars have already started, and we see inference providers as low as $2.70/1M tokens, but this hardly makes sense when GPT-4o mini is 15 cents.
</p><p>However, this model shines in other areas:</p><p>* Synthetic Data Generation & Distillation: Its power and the new license make it perfect for creating high-quality training data and using it to train smaller models.</p><p>* LLM as a Judge: The model's reasoning capabilities make it an excellent candidate for evaluating other AI outputs.</p><p>* Research and Experimentation: For pushing the boundaries of what's possible in AI.</p><p>The Smaller Siblings: 70B and 8B</p><p>While the 405B model is grabbing headlines, don't sleep on its smaller siblings. The 70B and 8B models got significant upgrades too.</p><p>The 70B model saw impressive gains:</p><p>* MMLU: 80.9 to 86</p><p>* IFEval: 82 to 87</p><p>* GPQA: 39 to 46</p><p>The 8B model, in particular, could be a hidden gem. As Kyle Corbitt from OpenPipe discovered, a fine-tuned 8B model could potentially beat a prompted GPT-4o Mini in specific tasks.</p><p>No multi-modality</p><p>While Meta definitely addressed everything we had to ask for after the Llama 3 release (context window, incredible performance, multi-linguality, tool use), we still haven't seen multi-modality with Llama. We still can't show it pictures or talk to it! </p><p>However, <a target="_blank" href="https://x.com/mannat_singh/status/1815782658091237658">apparently</a> they have trained it to be multi-modal as well, just without releasing those weights yet; they went into this in great detail in the paper and even showed a roadmap, stating that they will release it <a target="_blank" href="https://x.com/mannat_singh/status/1815782658091237658">soon-ish</a> (not in the EU though).</p><p>This Week's Buzz: Weave-ing Through LLama Providers</p><p>In the spirit of thorough evaluation, I couldn't resist putting LLAMA 3.1 through its paces across different providers.
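Conceptually, the comparison is simple: send the same prompt to every provider, time the round trip, estimate throughput, and eyeball the outputs side by side. Here's a minimal sketch of that harness with made-up stub providers standing in for real endpoints; an actual run would call each provider's hosted Llama 3.1 API instead of these stubs.

```python
import time

# Stand-in "providers": in a real run each of these would call a hosted
# Llama 3.1 endpoint (most providers expose an OpenAI-compatible API).
# Names, delays, and outputs here are made up for illustration.
def provider_a(prompt: str) -> str:
    time.sleep(0.01)  # simulate network + generation latency
    return "Paris is the capital of France."

def provider_b(prompt: str) -> str:
    time.sleep(0.02)
    return "The capital of France is Paris."

def compare(providers, prompt):
    """Run the same prompt through every provider and collect timing stats."""
    rows = []
    for name, fn in providers.items():
        start = time.perf_counter()
        output = fn(prompt)
        elapsed = time.perf_counter() - start
        tokens = len(output.split())  # crude whitespace-token proxy
        rows.append({
            "provider": name,
            "latency_s": round(elapsed, 3),
            "tokens_per_s": round(tokens / elapsed, 1),
            "output": output,
        })
    return sorted(rows, key=lambda r: r["latency_s"])

results = compare({"A": provider_a, "B": provider_b},
                  "What is the capital of France?")
```

Quantization differences tend to show up in the output column as style drift, which is exactly what a side-by-side view surfaces.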
Using Weights & Biases Weave (<a target="_blank" href="https://wandb.me/weave/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=july25">https://wandb.me/weave</a>), our evaluation and tracing framework for LLMs, I ran a comparison between various LLAMA providers.</p><p>Here's what I found:</p><p>* Different providers are running the model with varying optimizations (VLLM, FlashAttention3, etc.)</p><p>* Some are serving quantized versions, which can affect output style and quality</p><p>* Latency and throughput vary significantly between providers</p><p>The full results are available in a <a target="_blank" href="https://wandb.ai/wandb/compare-llamas/weave/compare-evaluations?evaluationCallIds=%5B%22bed51f00-ff5a-4050-ad6c-94224ba404bc%22%2C%223fd20176-0017-470c-84f3-e25af87f2423%22%2C%22df4d0826-8c16-4e75-a8ce-6202c88bf009%22%2C%22dd20c25e-254c-499e-932e-a6635d6c2f3d%22%2C%2257f9f904-73ba-4224-86a1-309f5112f06b%22%5D">Weave comparison dashboard</a>, which you can check out for a deep dive into the nuances of model deployment and code is up on <a target="_blank" href="https://github.com/altryne/compare-llama-providers/tree/main">Github</a> if you want to verify this yourself or see how easy this is to do with <a target="_blank" href="https://wandb.github.io/weave/?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=july25">Weave</a></p><p>Mistral Crashes the Party with Large V2 123B model (<a target="_blank" href="https://x.com/dchaplot/status/1816132981377097883">X</a>, <a target="_blank" href="https://huggingface.co/mistralai/Mistral-Large-Instruct-2407">HF</a>, <a target="_blank" href="https://mistral.ai/news/mistral-large-2407/">Blog</a>, <a target="_blank" href="https://chat.mistral.ai/">Try It</a>)</p><p></p><p>Just when we thought Meta had stolen the show, Mistral AI decided to drop their own bombshell: Mistral Large V2. This 123 billion parameter dense model is no joke, folks. 
With an MMLU score of 84.0, a 128K context window, and impressive performance across multiple benchmarks, it's giving LLAMA 3.1 a run for its money, especially in some coding tasks, while being optimized to run on a single node!</p><p>Especially interesting is the function calling, on which they claim SOTA (without telling us which metric they used, or comparing to Llama 3.1), while saying that they now support parallel and sequential function calling! </p><p>DeepSeek updates DeepSeek Coder V2 to 0724</p><p>While everyone was busy gawking at Meta and Mistral, DeepSeek quietly updated their coder model, and holy smokes, did they deliver! DeepSeek Coder v2 is now performing at GPT-4 and Claude 3.5 Sonnet levels on coding tasks. As Junyang Lin noted during our discussion, "DeepSeek Coder and DeepSeek Coder v2 should be the state of the art of the code-specific model."</p><p>Here's the result from BigCodeBench </p><p>and from Aider Chat (code editing dashboard)</p><p></p><p>But it's not just about raw performance. DeepSeek is bringing some serious innovation to the table. They've added JSON mode, function calling, and even a fill-in-the-middle completion feature in beta. Plus, they've bumped up their max token generation to 8K. And let's talk about that API pricing – it's ridiculously cheap, at 14c/1M tokens! </p><p>We're talking about costs that are competitive with GPT-4o Mini, but with potentially better performance on coding tasks. It's a game-changer for developers and companies looking to integrate powerful coding AI without breaking the bank.</p><p>Google DeepMind's Math Wizardry: From Silver Medals to AI Prodigies</p><p>Just when we thought this week couldn't get any crazier, Google DeepMind decides to casually drop a bombshell that would make even the most decorated mathletes sweat. They've created an AI system that can solve International Mathematical Olympiad (IMO) problems at a silver medalist level. I mean, come on!
As if the AI world wasn't moving fast enough, now we've got silicon-based Math Olympians?</p><p></p><p>This isn't just any run-of-the-mill calculator on steroids. We're talking about a combination of AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an upgraded version of their previous system. These AI math whizzes tackled this year's six IMO problems, covering everything from algebra to number theory, and managed to solve four of them. That's 28 points, folks - enough to bag a silver medal if it were human!</p><p>But here's where it gets really interesting. For non-geometry problems, AlphaProof uses the Lean theorem prover, coupling a pre-trained language model with the same AlphaZero reinforcement learning algorithm that taught itself to crush humans at chess and Go. And for geometry? They've got AlphaGeometry 2, a neuro-symbolic hybrid system powered by a Gemini-based language model. It's like they've created a math genius that can not only solve problems but also explain its reasoning in a formal, verifiable way.</p><p>The implications here are huge, folks. 
We're not just talking about an AI that can do your homework; we're looking at a system that could potentially advance mathematical research and proof verification in ways we've never seen before.</p><p>OpenAI takes on Google, Perplexity (and Meta's ownership of this week) with SearchGPT waitlist (<a target="_blank" href="https://openai.com/index/searchgpt-prototype/">Blog</a>)</p><p></p><p>As I write these words, Sam posts a tweet saying that they are launching SearchGPT, their new take on search, and as I click, I see a waitlist 😅 But still, this looks so sick, just look: </p><p></p><p>RTVI - new open standard for real time Voice and Video RTVI-AI (<a target="_blank" href="https://twitter.com/trydaily/status/1815530613434417241">X</a>, <a target="_blank" href="https://twitter.com/trydaily/status/1815530613434417241">Github</a>, <a target="_blank" href="https://demo.rtvi.ai/">Try it</a>)</p><p>Ok, this is also great and can't be skipped, even though this week was already insane. These models are great to text with, but we want to talk to them, and while we all wait for GPT-4 Omni with voice to actually ship, we get a new contender that gives us an open standard and a killer demo! </p><p>Daily + Groq + Cartesia + a lot of other great companies have released this incredible demo (which you can try yourself <a target="_blank" href="https://demo.rtvi.ai/">here</a>) and an open source standard to deliver something like a GPT-4o experience with incredible end-to-end latency, which feels like almost immediate responses. </p><p>While we've previously chatted with Moshi, which has these capabilities built into a single model, the above uses even LLama 3.1 70B, an actual production-grade LLM, which is a significant difference from what Moshi offers. 🔥</p><p>Ok holy s**t, did I actually finish the writeup for this insane week?
This was indeed one of the craziest weeks in Open Source AI. I honestly did NOT expect this to happen, but I'm so excited to keep playing with all these tools, and also to see how the amazing open source community of finetuners will meet all these LLamas. Which I'm sure I'll be reporting on from now on, until the next huge AI breakthrough! </p><p>Till then, see you next week! If you're listening to the podcast, please give us 5 stars on Apple Podcasts / Spotify. It really does help, and I'll finish with this: </p><p>IT'S SO GOOD TO BE BACK! 😂🫡 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-25-opensource-gpt4</link><guid isPermaLink="false">substack:post:147014132</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 25 Jul 2024 21:20:58 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/147014132/653ac71b3e664f8f3d4da9347676c365.mp3" length="70678655" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5890</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/147014132/bcd9de4a0e52cd397532d60f2c7831d5.jpg"/></item><item><title><![CDATA[📅 ThursdAI - July 11 - Mixture of Agents & Open Router interviews (no news this week)]]></title><description><![CDATA[<p>Hey all, Alex here… well, not actually here, I’m scheduling this post in advance, which I haven’t yet done, because I'm going on vacation! </p><p>That’s right, next week is my birthday 🎉 and a much-needed break somewhere with a beach is awaiting, but I didn’t want to leave you hanging for too long, so posting this episode with some amazing never-before-released material.
</p><p>Mixture of Agents x2</p><p>Back in the far-away days of June 20th (not that long ago, but it feels like ages!), Together AI announced a new <a target="_blank" href="https://arxiv.org/pdf/2406.04692">paper</a>, released <a target="_blank" href="https://github.com/togethercomputer/moa">code</a> and posted a long <a target="_blank" href="https://www.together.ai/blog/together-moa">post</a> about a new method of collaboration between smaller models to beat larger models. They called it Mixture of Agents, and <a target="_blank" href="https://x.com/james_y_zou/status/1801656163936964919">James Zou</a> joined us to chat about that effort. </p><p>Shortly after that - in fact, during the live ThursdAI show - <a target="_blank" href="https://twitter.com/corbtt/status/1803813970018791845">Kyle Corbitt</a> announced that OpenPipe also researched an approach similar to the above, using different models and a bit of different reasoning, also went after the coveted AlpacaEval benchmark, and achieved a SOTA score of 68.8 using this method. </p><p>And I was delighted to invite both James and Kyle to chat about their respective approaches the same week that both broke AlpacaEval SOTA, and to hear how collaboration between LLMs can significantly improve their outputs! </p><p>This week's buzz - what I learned at W&B this week</p><p>So much buzz this week from the Weave team, it’s hard to know what to put in here.
I can start with the incredible integrations my team landed, <a target="_blank" href="https://wandb.github.io/weave/guides/integrations/mistral">Mistral AI</a>, <a target="_blank" href="https://wandb.github.io/weave/guides/integrations/llamaindex">LLamaIndex</a>, <a target="_blank" href="https://wandb.github.io/weave/guides/integrations/dspy">DSPy</a>, <a target="_blank" href="https://wandb.github.io/weave/guides/integrations/openrouter">OpenRouter</a> and even <a target="_blank" href="https://wandb.github.io/weave/guides/integrations/local_models">Local Models</a> served by Ollama, LmStudio, LLamaFile can now be auto-tracked with Weave, which means you literally only have to instantiate Weave and it’ll auto-track everything for you! </p><p>But I think the biggest, hugest news from this week is the great eval comparison system that the Weave team just pushed, it’s honestly so feature rich that I’ll have to do a deeper dive on it later, but I wanted to make sure I include at least a few screencaps because I think it looks fantastic! </p><p>Open Router - <strong>A unified interface for LLMs</strong></p><p>I’ve been a long-time fan of <a target="_blank" href="https://openrouter.ai/">OpenRouter.ai </a>and I was very happy to have Alex Atallah on the show to talk about Open Router (even if this did happen back in April!) and I’m finally satisfied with the sound quality to release this conversation. </p><p></p><p>Open Router serves both foundational models like GPT, Claude and Gemini, and also Open Source ones, and supports the OpenAI SDK format, making it super simple to play around with and evaluate all of them on the same code. They even provide a few models for free! Right now you can use Phi, for example, completely free via their API. </p><p>Alex goes deep into the areas of Open Router that I honestly didn’t really know about, like being a marketplace, knowing what trendy LLMs are being used by people in near real time (check out WebSim!)
and more very interesting things! </p><p>Give that conversation a listen, I’m sure you’ll enjoy it! </p><p></p><p>That’s it folks, no news this week, I would instead like to recommend a new newsletter by friends of the pod <a target="_blank" href="https://x.com/iScienceLuvr/">Tanishq Abraham</a> and <a target="_blank" href="https://x.com/arankomatsuzaki">Aran Komatsuzaki </a>both of whom are doing a weekly paper X space and recently started posting it on Substack as well! </p><p>It’s called AI papers of the week, and if you’re into papers which we don’t usually cover, there’s no better duo! In fact, Tanishq often used to come to ThursdAI to explain papers, so you may recognize his voice :) </p><p>See you all in two weeks after I get some seriously needed R&R 👋 😎🏖️</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-11-mixture-of-agents</link><guid isPermaLink="false">substack:post:146483540</guid><dc:creator><![CDATA[Alex Volkov, James Zou, Kyle Corbitt, and Alex Atallah]]></dc:creator><pubDate>Thu, 11 Jul 2024 22:30:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/146483540/7bd509285d76286612f807228f49c0a7.mp3" length="49700610" type="audio/mpeg"/><itunes:author>Alex Volkov, James Zou, Kyle Corbitt, and Alex Atallah</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4142</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/146483540/411c732dccd37e4bf187ea274bf4fb9f.jpg"/></item><item><title><![CDATA[📆 🎂 - ThursdAI #52 - Moshi Voice, Qwen2 finetunes, GraphRag deep dive and more AI news on this celebratory 1yr ThursdAI]]></title><description><![CDATA[<p>Hey everyone! Happy 4th of July to everyone who celebrates!
I celebrated today by having an intimate conversation with 600 of my closest X friends 😂 Joking aside, today is a celebratory episode, the 52nd consecutive weekly ThursdAI show! I've been doing this as a podcast for a year now!</p><p>Which means, there are some of you who've been subscribed for a year 😮 Thank you! Couldn't have done this without you. In the middle of my talk at AI Engineer (I still don't have the video!) I had to plug ThursdAI, and I asked the 300+ audience who is a listener of ThursdAI, and I saw a LOT of hands go up, which is honestly, still quite humbling. So again, thank you for tuning in, listening, subscribing, learning together with me and sharing with your friends! </p><p>This week, we covered a new (soon to be) open source voice model from KyutAI, a LOT of open source LLMs, from InternLM, Cognitive Computations (<a target="_blank" href="https://x.com/cognitivecompai/">Eric Hartford</a> joined us), Arcee AI (<a target="_blank" href="https://x.com/LucasAtkins7">Lukas Atkins</a> joined as well) and we had a deep dive into GraphRAG with <a target="_blank" href="https://x.com/emileifrem">Emil Eifrem</a>, CEO of Neo4j (who shares why it was called Neo4j in the first place, and that he's a ThursdAI listener, whaaat?
🤯), this is definitely a conversation you don't want to miss, so tune in, and read a breakdown below:</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* Voice & Audio</p><p>* KyutAI releases Moshi - first ever 7B end to end voice capable model (<a target="_blank" href="https://us.moshi.chat/?queue_id=talktomoshi">Try it</a>)</p><p>* Open Source LLMs </p><p>* Microsoft Updated Phi-3-mini - almost a new model </p><p>* InternLM 2.5 - best open source model under 12B on Hugging Face (<a target="_blank" href="https://huggingface.co/internlm">HF</a>, <a target="_blank" href="https://github.com/InternLM/InternLM">Github</a>)</p><p>* Microsoft open sources GraphRAG (<a target="_blank" href="https://www.microsoft.com/en-us/research/blog/graphrag-new-tool-for-complex-data-discovery-now-on-github/">Announcement</a>, <a target="_blank" href="https://github.com/microsoft/graphrag">Github</a>, <a target="_blank" href="https://arxiv.org/abs/2404.16130">Paper</a>)</p><p>* OpenAutoCoder-Agentless - SOTA on SWE Bench - 27.33% (<a target="_blank" href="https://github.com/OpenAutoCoder/Agentless">Code</a>, <a target="_blank" href="https://huggingface.co/papers/2407.01489">Paper</a>)</p><p>* Arcee AI - Arcee Agent 7B - from Qwen2 - Function / Tool use finetune (<a target="_blank" href="https://huggingface.co/arcee-ai/Arcee-Agent">HF</a>)</p><p>* LMsys announces RouteLLM - a new Open Source LLM Router (<a target="_blank" href="https://github.com/lm-sys/RouteLLM?tab=readme-ov-file#evaluation">Github</a>)</p><p>* DeepSeek Chat got a significant upgrade (<a target="_blank" href="https://platform.deepseek.com/api-docs/updates/">Announcement</a>)</p><p>* Nomic GPT4all 3.0 - Local LLM (<a target="_blank" href="https://www.nomic.ai/gpt4all">Download</a>, <a target="_blank" href="https://github.com/nomic-ai/gpt4all">Github</a>)</p><p>* This week's Buzz</p><p>* New free Prompts course from WandB in 4 days (<a target="_blank" 
href="https://www.wandb.courses/courses/prompting?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=july4">pre sign up</a>)</p><p>* Big CO LLMs + APIs</p><p>* Perplexity announces their new pro research mode (<a target="_blank" href="https://www.perplexity.ai/hub/blog/pro-search-upgraded-for-more-advanced-problem-solving?utm_medium=social&#38;utm_source=social&#38;utm_campaign=pro-search">Announcement</a>)</p><p>* X is <a target="_blank" href="https://twitter.com/altryne/status/1808198424715579418">rolling out</a> "Grok Analysis" button and it's BAD in "fun mode" and then <a target="_blank" href="https://x.com/ibab/status/1808256801890357553">paused</a> roll out</p><p>* Figma pauses the rollout of their AI text to design tool "Make Design" (<a target="_blank" href="https://x.com/zoink/status/1808045661189033990">X</a>)</p><p>* Vision & Video</p><p>* Cognitive Computations drops DolphinVision-72b - VLM (<a target="_blank" href="https://x.com/cognitivecompai/status/1807601242644066548">HF</a>)</p><p>* Chat with Emil Eifrem - CEO Neo4J about GraphRAG, AI Engineer</p><p>Voice & Audio</p><p>KyutAI Moshi - a 7B end to end voice model (<a target="_blank" href="https://us.moshi.chat/?queue_id=talktomoshi">Try It</a>, <a target="_blank" href="https://www.youtube.com/live/hm2IJSKcYvo">See Announcement</a>)</p><p>Seemingly out of nowhere, another French AI juggernaut decided to drop a major announcement. The company, called KyutAI and backed by Eric Schmidt, called itself "the first European private-initiative laboratory dedicated to <strong>open research</strong> in artificial intelligence" in a press release back in November of 2023, has quite a few rockstar co-founders (ex DeepMind and Meta AI), and has Yann LeCun on its science committee.</p><p>This week they showed their first, and honestly quite mind-blowing release, called Moshi (Japanese for Hello, Moshi Moshi), which is an end to end voice and text model, similar to GPT-4o demos we've seen, 
except this one is 7B parameters, and can run on your Mac! </p><p>While the utility of the model right now is not the greatest, not remotely close to anything resembling the amazing GPT-4o (which was <a target="_blank" href="https://x.com/alxfazio/status/1806540927814770756">demoed live</a> to me and all of AI Engineer by Romain Huet), Moshi shows very, very impressive stats! </p><p>Built by a small team during only 6 months or so of work, they have trained an LLM (Helium 7B), an Audio Codec (Mimi), a Rust inference stack and a lot more, to give insane performance. </p><p>Model latency is 160ms and mic-to-speakers latency is 200ms, which is so fast it seems like it's too fast. The demo often responds faster than I'm able to finish my sentence, and it results in an uncanny, "reading my thoughts" type feeling. </p><p>The most important part is this though, a quote from KyutAI's post after the announcement: </p><p>Developing Moshi required significant contributions to audio codecs, multimodal LLMs, multimodal instruction-tuning and much more. <strong>We believe the main impact of the project will be sharing all Moshi’s secrets with the upcoming paper and open-source of the model.</strong></p><p>I'm really looking forward to how this tech can be applied to the incredible open source models we already have out there! Speaking to our LLMs is now officially here in the Open Source, way before we got GPT-4o's voice mode, and it's exciting! </p><p>Open Source LLMs </p><p>Microsoft stealth-updated Phi-3 Mini to make it almost a new model</p><p>So stealth in fact, that I didn't even have this update in my notes for the show, but thanks to the incredible community (Bartowski, Akshay Gautam) who made sure we didn't miss this, because it's so huge. </p><p>The model used additional post-training data leading to substantial gains on instruction following and structure output. 
We also improve multi-turn conversation quality, explicitly support <|system|> tag, and significantly improve reasoning capability</p><p>The Phi-3 June update is quite significant across the board, just look at some of these scores: a 354.78% improvement in JSON structure output, 30% at GPQA</p><p>But also specifically for coding, a 33→93 jump in Java coding, 33→73 in Typescript, 27→85 in Python! These are just incredible numbers, and I definitely agree with Bartowski: there's enough here to call this a whole new model rather than a "seasonal update" </p><p>Qwen-2 is the star of the show right now </p><p>Week in and week out, ThursdAI seems to be the watercooler for the best finetuners in the community to come, hang, share notes, and announce their models. A month after Qwen-2 was <a target="_blank" href="https://sub.thursdai.news/p/thursdai-jun-6th-qwen2-beats-llama">announced on ThursdAI</a> stage live by friend of the pod and Qwen dev lead Junyang Lin, and a week after it re-took number 1 on the revamped open LLM leaderboard on HuggingFace, we now have great finetunes on top of Qwen-2. </p><p>Qwen-2 is the star of the show right now. Because there's no better model. This is like GPT 4 level. It's Open Weights GPT 4. We can do what we want with it, and it's so powerful, and it's multilingual, and it's everything, it's like the dream model. I love it</p><p>Eric Hartford - Cognitive Computations</p><p>We've had 2 model finetunes based on Qwen 2 and their authors on the show this week, first was <a target="_blank" href="https://x.com/LucasAtkins7">Lukas Atkins</a> from Arcee AI (the company behind MergeKit), who released Arcee Agent, a 7B Qwen-2 finetune/merge specifically focusing on tool use and function calling. 
</p><p>We also had a chat with <a target="_blank" href="https://x.com/cognitivecompai">Eric Hartford</a> from Cognitive Computations (which Lukas previously participated in) about their new <strong>72B parameter Dolphin Vision</strong> on top of Qwen-2 (trained by <a target="_blank" href="https://x.com/stablequan">StableQuan</a>, available on the <a target="_blank" href="https://huggingface.co/cognitivecomputations/dolphin-vision-72b">HUB</a>), likely the biggest open source VLM that we've seen so far.</p><p>The most exciting part about it is Fernando Neta's "SexDrugsRockandroll" dataset, which supposedly contains, well.. a lot of uncensored stuff, and it's perfectly able to discuss and analyze images with mature and controversial content.</p><p>InternLM 2.5 - SOTA open source under 12B with 1M context (<a target="_blank" href="https://huggingface.co/internlm">HF</a>, <a target="_blank" href="https://github.com/InternLM/InternLM">Github</a>)</p><p>The folks at <strong>Shanghai AI</strong> released InternLM 2.5 7B and a chat version, along with a whopping 1M context window extension. These metrics are ridiculous, beating LLama-3 8B on literally every metric on the new HF leaderboard, and even beating Llama-3 70B on MATH and coming close on GPQA!</p><p>The folks at Intern not only released a beast of a model, but also released significantly improved tool use capabilities with it, including their own agentic framework called <a target="_blank" href="https://github.com/InternLM/InternLM/blob/main/agent/lagent.md">Lagent</a>, which comes with Code Interpreter (python execution), Search Capabilities, and of course the ability to plug in your own tools.</p><p>How will you serve 1M context on production you ask? 
Well, these folks ALSO open sourced <a target="_blank" href="https://github.com/InternLM/InternLM/blob/main/chat/lmdeploy.md">LMDeploy</a>, "an efficient, user-friendly toolkit designed for compressing, deploying, and serving LLM models" which has been around for a while, but is now supporting this new model of course, and handles dynamic NTK and some offloading of context etc' </p><p>So an incredible model + tools release, can't wait to play around with this! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>This week's Buzz (What I learned with WandB this week)</p><p>Hey, did you know we at Weights & Biases have free courses? While some folks ask you for a LOT of money for basic courses, at Weights & Biases they are... you guessed it, completely free! And a lot of effort goes into recording and building the agenda, so I'm happy to announce that our "Developer's Guide to LLM Prompting" course is going to launch in 4 days! </p><p>Delivered by my colleague <a target="_blank" href="https://www.linkedin.com/search/results/all/?fetchDeterministicClustersOnly=true&#38;heroEntityKey=urn%3Ali%3Afsd_profile%3AACoAABqicD8BnCDswPippdPvUB0LgzEskc6grK4&#38;keywords=anish%20shah&#38;origin=RICH_QUERY_SUGGESTION&#38;position=0&#38;searchId=7a4b0794-b4a1-4d8c-9066-d11b0313bde7&#38;sid=4Du&#38;spellCorrectionEnabled=false">Anish</a> (who's just an amazing educator) and <a target="_blank" href="https://www.linkedin.com/in/teodora-danilovic-218779207/?originalSubdomain=uk">Teodora</a> from AutogenAI, you will learn everything prompt building related, and even if you are a seasoned prompting pro, there will be something for you there! 
Pre-register for the course <a target="_blank" href="https://www.wandb.courses/courses/prompting?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=july4">HERE</a></p><p>Big CO LLMs + APIs</p><p>How I helped roll back an XAI feature and Figma rolled back theirs </p><p>We've covered Grok (with a K this time) from XAI multiple times, and while I don't use its chat interface that much, or the open source model, I do think they have a huge benefit in having direct access to real time data from the X platform. </p><p>Given that I basically live on X (to be able to deliver all the news to you) I started noticing the long-promised Grok Analysis button show up under some posts, first on mobile, then on web versions of X. </p><p>Of course I had to test it, and whoa, I was honestly shocked at just how unhinged and profanity-laced the analysis was. </p><p>Now I'm not easily shocked, I've seen jailbroken LLMs before, I tried to get chatGPT to say curse words multiple times, but it's one thing when you expect it and a completely different thing when a billion-dollar company releases a product that answers... well, like this: </p><p>Luckily Igor Babushkin (co-founder of XAI) noticed and the rollout was paused, so it looks like I helped red team Grok! 🫡 </p><p>Figma pauses AI "make design" feature</p><p>Another AI feature was paused by a big company after going viral on X (what is it about X specifically?) and this time it was Figma! </p><p>In a super viral post, Andy Allen posted a video where he asks Figma's new AI feature "Make Design" for a simple "weather app", and what he receives looks almost 100% identical to the iOS weather app! </p><p>This was acknowledged by the CEO of Figma and almost immediately paused as well. </p><p>GraphRAG... 
GraphRAG everywhere</p><p>Microsoft released a pre-print paper called GraphRag (<a target="_blank" href="https://arxiv.org/abs/2404.16130">2404.16130</a>) which talks about utilizing LLMs to first build and then use Graph databases to achieve better accuracy and performance for retrieval tasks such as "global questions directed at an entire text corpus"</p><p>This week, Microsoft open sourced GraphRag on <a target="_blank" href="https://github.com/microsoft/graphrag">Github</a> 👏 and I wanted to dive a little deeper into what this actually means, as this is a concept I hadn't heard of before last week, and suddenly it's everywhere. </p><p>Last week during AI Engineer, the person who first explained this concept to me (and tons of other folks in the crowd at his talk) was <a target="_blank" href="https://x.com/emileifrem">Emil Eifrem</a>, CEO of <a target="_blank" href="https://neo4j.com/">Neo4J</a>, and I figured he'd be the right person to explain the whole concept in a live conversation to the audience as well, and he was! </p><p>Emil and I (and other folks in the audience) had a great, almost 40-minute conversation about the benefits of using Graph databases for RAG, how LLMs unlocked the ability to convert unstructured data into Graph linked databases, accuracy enhancements and unlocks like reasoning over the whole corpus of data, developer experience improvements, and difficulties / challenges with this approach. </p><p>Emil is a great communicator, with a deep understanding of this field, so I really recommend listening to this deep dive. </p><p><p>Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.</p></p><p>This is it for this week's newsletter, and a wrap on year 1 of ThursdAI as a podcast (this being our 52nd weekly release!) 
</p><p>I'm going on vacation next week, but I will likely still send the TL;DR, so look out for that, and have a great Independence Day, and the rest of your holiday weekend if you celebrate, and if not, I'm sure there will be cool AI things announced by the next time we meet 🫡 </p><p>As always, appreciate your attention,</p><p>Alex </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-52-moshi-voice-qwen2-finetunes</link><guid isPermaLink="false">substack:post:146291193</guid><dc:creator><![CDATA[Alex Volkov, Emil Eifrem, and Nisten]]></dc:creator><pubDate>Thu, 04 Jul 2024 21:29:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/146291193/546272d77cb927fe3a988c7738cd7e78.mp3" length="79507342" type="audio/mpeg"/><itunes:author>Alex Volkov, Emil Eifrem, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6625</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/146291193/9af4215637387510359357f8c18993d0.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Gemma 2, AI Engineer 24', AI Wearables, New LLM leaderboard]]></title><description><![CDATA[<p>Hey everyone, sending a quick one today, no deep dive, as I'm still in the middle of AI Engineer World's Fair 2024 in San Francisco (in fact, I'm writing this from the incredible floor 32 presidential suite that the team here got for interviews, media and podcasting, and hey to all new folks who I’ve just met during the last two days!) </p><p></p><p>It's been an incredible few days meeting so many ThursdAI community members, listeners and folks who came on the pod! 
The list honestly is too long but I've got to meet friends of the pod Maxime Labonne, Wing Lian, Joao Morra (crew AI), Vik from Moondream, Stefania Druga, not to mention the countless folks who came up and gave high fives, introduced themselves, it was honestly a LOT of fun. (and it's still not over, if you're here, please come and say hi, and let's take an LLM judge selfie together!)</p><p>On today's show, we recorded extra early because I had to run and play dress up, and boy am I relieved now that both the show and the talk are behind me, and I can go and enjoy the rest of the conference 🔥 (which I will bring you here in full once I get the recording!) </p><p>On today's show, we had the awesome pleasure to have <strong>Surya Bhupatiraju</strong>, a research engineer at Google DeepMind, talk to us about their newly released amazing Gemma 2 models! It was very technical, and a super great conversation to check out! </p><p>Gemma 2 came out in 2 sizes, 9B and 27B parameter models, with 8K context (we addressed this on the show), and the 27B model's incredible performance beats LLama-3 70B on several benchmarks and even beats Nemotron 340B from NVIDIA! </p><p>This model is also now available on Google AI Studio to play with, but also on the hub! </p><p>We also covered the renewal of the HuggingFace open LLM leaderboard with their new benchmarks in the mix and normalization of scores, and how Qwen 2 is again the best model that's tested! </p><p>It was a very insightful conversation, worth listening to if you're interested in benchmarks, so definitely give it a listen. </p><p>Last but not least, we had a conversation with Ethan Sutin, the co-founder of Bee Computer. At the AI Engineer speakers dinner, all the speakers received a wearable AI device as a gift, and I onboarded (cause Swyx asked me) and kinda forgot about it. On the way back to my hotel I walked with a friend and chatted about my life. 
</p><p>When I got back to my hotel, the app prompted me with "hey, I now know 7 new facts about you" and it was incredible to see how much of the conversation it was able to pick up, extracting facts and even TODOs! </p><p>So I had to have Ethan on the show to try and dig a little bit into the privacy and the use-cases of these hardware AI devices, and it was a great chat! </p><p></p><p>Sorry for the quick one today, if this is the first newsletter after you just met me and registered, usually there’s a deeper dive here, so expect more in-depth write-ups in the next editions, as now I have to run down and enjoy the rest of the conference! </p><p>Here's the TL;DR and my <strong>RAW</strong> show notes for the full show, in case it's helpful! </p><p>* AI Engineer is happening right now in SF</p><p>* Tracks include Multimodality, Open Models, RAG & LLM Frameworks, Agents, AI Leadership, Evals & LLM Ops, CodeGen & Dev Tools, AI in the Fortune 500, GPUs & Inference</p><p>* Open Source LLMs </p><p>* HuggingFace - <strong>LLM Leaderboard v2 - (</strong><a target="_blank" href="https://huggingface.co/spaces/open-llm-leaderboard/blog"><strong>Blog</strong></a><strong>)</strong></p><p>* Old Benchmarks sucked and it's time to renew</p><p>* New Benchmarks</p><p>* <strong>MMLU-Pro</strong> (Massive Multitask Language Understanding - Pro version, <a target="_blank" href="https://arxiv.org/abs/2406.01574">paper</a>)</p><p>* <strong>GPQA</strong> (Google-Proof Q&A Benchmark, <a target="_blank" href="https://arxiv.org/abs/2311.12022">paper</a>). 
GPQA is an extremely hard knowledge dataset</p><p>* <strong>MuSR</strong> (Multistep Soft Reasoning, <a target="_blank" href="https://arxiv.org/abs/2310.16049">paper</a>).</p><p>* <strong>MATH</strong> (Mathematics Aptitude Test of Heuristics, Level 5 subset, <a target="_blank" href="https://arxiv.org/abs/2103.03874">paper</a>)</p><p>* <strong>IFEval</strong> (Instruction Following Evaluation, <a target="_blank" href="https://arxiv.org/abs/2311.07911">paper</a>)</p><p>* 🤝 <strong>BBH</strong> (Big Bench Hard, <a target="_blank" href="https://arxiv.org/abs/2210.09261">paper</a>). BBH is a subset of 23 challenging tasks from the BigBench dataset</p><p>* The community will be able to vote for models, and we will prioritize running models with the most votes first</p><p>* Mozilla announces Builders Accelerator @ AI Engineer (<a target="_blank" href="https://x.com/swyx/status/1806008516597146098">X</a>)</p><p>* Theme: Local AI </p><p>* 100K non-dilutive funding</p><p>* Google releases Gemma 2 (<a target="_blank" href="https://x.com/_philschmid/status/1806343336292229234">X</a>, <a target="_blank" href="https://blog.google/technology/developers/google-gemma-2/">Blog</a>)</p><p>* Big CO LLMs + APIs</p><p>* UMG, Sony, Warner sue Udio and Suno for copyright (<a target="_blank" href="https://x.com/jason_koebler/status/1805301151543476314">X</a>)</p><p>* were able to recreate some songs</p><p>* sue both companies</p><p>* have 10 unnamed individuals who are also on the suit</p><p>* Google Chrome Canary has Gemini nano (<a target="_blank" href="https://x.com/marcelpociot/status/1805678162099032354">X</a>)</p><p>* </p><p>* Super easy to use <a target="_blank" href="window.ai">window.ai</a>.createTextSession()</p><p>* Nano 1 and 2, at 4bit quantized 1.8B and 3.25B parameters, have decent performance relative to Gemini Pro</p><p>* Behind a feature flag</p><p>* Most text gen under 500ms </p><p>* Unclear re: hardware requirements </p><p>* Someone already built extensions</p><p>* 
someone already posted this on HuggingFace</p><p>* Anthropic Claude share-able projects (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1805616725733339199">X</a>)</p><p>* Snapshots of Claude conversations shared with your team</p><p>* Can share custom instructions</p><p>* Anthropic has released a new "Projects" feature for Claude AI to enable collaboration and enhanced workflows</p><p>* Projects allow users to ground Claude's outputs in their own internal knowledge and documents</p><p>* Projects can be customized with instructions to tailor Claude's responses for specific tasks or perspectives</p><p>* "Artifacts" feature allows users to see and interact with content generated by Claude alongside the conversation</p><p>* Claude Team users can share their best conversations with Claude to inspire and uplevel the whole team</p><p>* North Highland consultancy has seen 5x faster content creation and analysis using Claude</p><p>* Anthropic is committed to user privacy and will not use shared data to train models without consent</p><p>* Future plans include more integrations to bring in external knowledge sources for Claude</p><p>* OpenAI voice mode update - not until Fall</p><p>* AI Art & Diffusion & 3D</p><p>* Fal open sourced <strong>AuraSR</strong> - a 600M upscaler based on GigaGAN (<a target="_blank" href="https://x.com/burkaygur/status/1805997534244229461">X</a>, <a target="_blank" href="https://x.com/burkaygur/status/1805997534244229461">Fal</a>)</p><p>* Interview with Ethan Sutin from Bee Computer</p><p>* We all got Bees as gifts</p><p>* AI Wearable that extracts TODOs, knows facts, etc'</p><p>* </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-gemma-2-ai-engineer-24-ai</link><guid isPermaLink="false">substack:post:146061046</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 27 Jun 2024 21:56:35 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/146061046/c6005b30c7008ca328acd2835ea9e89e.mp3" length="58367615" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4864</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/146061046/9cbf3366f85af413e2055a5d326a09f7.jpg"/></item><item><title><![CDATA[📅 ThursdAI - June 20th - 👑 Claude Sonnet 3.5 new LLM king, DeepSeek new OSS code king, Runway Gen-3 SORA competitor, Ilya's back & more AI news from this crazy week]]></title><description><![CDATA[<p>Hey, this is Alex. Don't you just love when assumptions about LLMs hitting a wall just get shattered left and right and we get new incredible tools released that leapfrog previous state of the art models, that we barely got used to, from just a few months ago? I SURE DO! </p><p>Today is one such day, this week was already busy enough, I had a whole 2 hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that does the news show like sound on the live show, you should join next time!) and announced <strong>Claude Sonnet 3.5</strong> which is their best model, beating Opus while being 2x faster and 5x cheaper! (also beating GPT-4o and Turbo, so... new king! For how long? ¯\_(ツ)_/¯)</p><p>Critics are already raving, it's been half a day and they are raving! 
Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI! 👇 </p><p></p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* NVIDIA - Nemotron 340B - Base, Instruct and Reward model (<a target="_blank" href="https://x.com/_philschmid/status/1801651752426524996">X</a>)</p><p>* DeepSeek coder V2 (230B MoE, 16B)  (<a target="_blank" href="https://x.com/deepseek_ai/status/1802680388256768145">X</a>, <a target="_blank" href="http://huggingface.co/deepseek-ai">HF</a>)</p><p>* Meta FAIR - Chameleon MMIO models (<a target="_blank" href="https://x.com/jpineau1/status/1803095402058481826">X</a>)</p><p>* HF + BigCodeProject are deprecating HumanEval with BigCodeBench (<a target="_blank" href="https://x.com/BigCodeProject/status/1803072295910494686">X</a>, <a target="_blank" href="https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard">Bench</a>)</p><p>* NousResearch - Hermes 2 LLama3 Theta 70B - GPT-4 level OSS on MT-Bench (<a target="_blank" href="https://x.com/Teknium1/status/1803889137118048625">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-70B-GGUF">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini Context Caching is available </p><p>* Anthropic releases Sonnet 3.5 - beating GPT-4o (<a target="_blank" href="https://x.com/AnthropicAI/thread/1803790676988920098">X</a>, <a target="_blank" href="Claude.ai">Claude.ai</a>)</p><p>* Ilya Sutskever starting <a target="_blank" href="http://SSI.inc">SSI.inc</a> - safe super intelligence (<a target="_blank" href="https://x.com/danielgross/status/1803476684160770075">X</a>)</p><p>* Nvidia is the biggest company in the world by market cap</p><p>* <strong>This weeks Buzz</strong> </p><p>* Alex in SF next week for AIQCon, AI Engineer. 
ThursdAI will be sporadic but will happen!</p><p>* W&B Weave now has support for tokens and cost + Anthropic SDK out of the box (<a target="_blank" href="https://wandb.me/weave">Weave Docs</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Microsoft open sources Florence 230M & 800M Vision Models (<a target="_blank" href="https://x.com/reach_vb/status/1803366557612933499">X</a>, <a target="_blank" href="https://huggingface.co/collections/microsoft/florence-6669f44df0d87d9c3bfb76de">HF</a>)</p><p>* Runway Gen-3 - (t2v, i2v, v2v) Video Model (<a target="_blank" href="https://x.com/runwayml">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Google Deepmind teases V2A video-to-audio model (<a target="_blank" href="https://deepmind.google/discover/blog/generating-audio-for-video/">Blog</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Flash Diffusion for SD3 is out - Stable Diffusion 3 in 4 steps! (<a target="_blank" href="https://x.com/CChadebec/status/1803114609018110268">X</a>)</p><p></p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p></p><p>🦀 New king of LLMs in town - Claude 3.5 Sonnet 👑 </p><p>Ok so first things first, Claude Sonnet, the previously forgotten middle child of the Claude 3 family, has now received a brain upgrade! </p><p>Achieving incredible performance on many benchmarks, this new model is 5 times cheaper than Opus at $3/1Mtok on input and $15/1Mtok on output. It's also competitive against GPT-4o and turbo on the standard benchmarks, achieving incredible scores on MMLU, HumanEval etc', but we know that those are already behind us. 
</p><p>Sonnet 3.5, aka <strong>Claw'd</strong> (which is a great marketing push by the Anthropic folks, I love to see it), is beating all other models on the <a target="_blank" href="http://Aider.chat">Aider.chat</a> code editing leaderboard, winning on the new <a target="_blank" href="http://livebench.ai">livebench.ai</a> leaderboard and is getting top scores on MixEval Hard, which has 96% correlation with LMsys arena.</p><p>While benchmarks are great and all, real folks are reporting real findings of their own, here's what Friend of the Pod Pietro Skirano had to say after playing with it: </p><p>there's like a lot of things that I saw that I had never seen before in terms of like creativity and like how much of the model, you know, actually put some of his own understanding into your request</p><p>-@Skirano</p><p>Notable as a capability boost is this quote from the Anthropic release blog: </p><p>In an internal agentic coding evaluation, Claude 3.5 Sonnet solved <strong>64% of problems</strong>, outperforming Claude 3 <strong>Opus which solved 38%</strong>. </p><p>One detail that Alex Albert from Anthropic pointed out from this release: on the GPQA (Graduate-Level Google-Proof Q&A) benchmark, they achieved 67% with various prompting techniques, beating the PhD experts in the respective fields, who average 65% on this benchmark. This... this is crazy</p><p>Beyond just the benchmarks </p><p>This to me is a ridiculous jump because Opus was just so so good already, and Sonnet 3.5 is jumping over it with agentic solving capabilities, and also vision capabilities. Anthropic also announced that vision wise, Claw'd is significantly better than Opus at vision tasks (which, again, Opus was already great at!) and lastly, Claw'd now has a much more recent knowledge cutoff, it knows about events that happened in <strong>February 2024</strong>!
</p><p>Additionally, <a target="_blank" href="http://claude.ai">claude.ai</a> got a new capability they call Artifacts, which significantly improves the use of Claude. It needs to be turned on in settings, and then Claude will have access to files, and will show you in an aside, rendered HTML, SVG files, Markdown docs, and a bunch more stuff, and it'll be able to reference different files it creates, to create assets and then a game with these assets for example! </p><p>1 Ilya x 2 Daniels to build Safe SuperIntelligence </p><p>Ilya Sutskever, Co-founder and failed board Coup participant (leader?) at OpenAI, has resurfaced after a long time of people wondering "where's Ilya" with one hell of an announcement. </p><p>The company is called SSI, for Safe Super Intelligence, and he's cofounding it with Daniel Levy (prev. OpenAI, Stanford PhD) and Daniel Gross (AI @ Apple, AIgrant, AI Investor). </p><p>The only mandate of this company is apparently to have a straight shot at safe super-intelligence, skipping AGI, which is no longer the buzzword (Ilya is famous for the "feel the AGI" chant within OpenAI) </p><p>Notable also that the company will be split between Palo Alto and Tel Aviv, where they have the ability to hire top talent into a "cracked team of researchers"</p><p>Our singular focus means no distraction by management overhead or product cycles</p><p>Good luck to these folks! </p><p>Open Source LLMs </p><p>DeepSeek coder V2 (230B MoE, 16B)  (<a target="_blank" href="https://x.com/deepseek_ai/status/1802680388256768145">X</a>, <a target="_blank" href="http://huggingface.co/deepseek-ai">HF</a>)</p><p>The folks at DeepSeek are not shy about their results, and (until the Sonnet release above) held the top spot: they released a 230B MoE model that beats GPT4-Turbo at Coding and Math! With a great new 128K context window and an incredible open license (you can use this in production!)
this model is the best open source coder in town, getting to number 3 on aider code editing and number 2 on BigCodeBench (which is a new Benchmark we covered on the pod with the maintainer, definitely worth a listen. HumanEval is old and getting irrelevant) </p><p>Notable also that DeepSeek has launched an API service that seems to be so competitively priced that it doesn't make sense to use anything else: at $0.14/$0.28 I/O per million tokens, it's a whopping 42 times cheaper than Claw'd 3.5! </p><p>With support for 338 programming languages, it should also run super quick given its MoE architecture; the bigger model has only 21B active parameters, which scales amazingly on CPUs. </p><p>They also released a tiny 16B MoE model called Lite-instruct, with just 2.4B active params. </p><p>This weeks Buzz (What I learned with WandB this week)</p><p>Folks, in a week, I'm going to go up on stage in front of tons of AI Engineers wearing a costume, and... it's going to be epic! I finished writing my talk, now I'm practicing and I'm very excited. If you're there, please join the Evals track 🙂 </p><p>Also in W&B this week, coinciding with Claw'd release, we've added a <strong>native integration with the Anthropic Python SDK</strong> which now means that all you need to do to track your LLM calls with Claw'd is pip install weave, then import weave and weave.init('your project name') </p><p>THAT'S IT! And you get this amazing dashboard with usage tracking for all your Claw'd calls for free, it's really crazy easy, give it a try!
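To make the cost-tracking part concrete, here's a rough sketch of the per-call math a dashboard like Weave's surfaces for you. The prices are the Sonnet 3.5 and DeepSeek list prices quoted above; the call_cost helper itself is purely illustrative and is not part of the Weave API:

```python
# Illustrative per-call cost math (not the Weave API). Prices are the list
# prices quoted above, in USD per 1M tokens as (input, output) pairs.
PRICES = {
    "claude-3.5-sonnet": (3.00, 15.00),
    "deepseek-coder-v2": (0.14, 0.28),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of one LLM call at the given model's list prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# A 2,000-token prompt with a 500-token reply on Sonnet 3.5:
print(f"${call_cost('claude-3.5-sonnet', 2_000, 500):.4f}")  # → $0.0135

# Blended (input + output) price ratio between the two models,
# which is roughly the "42 times cheaper" figure above:
ratio = (3.00 + 15.00) / (0.14 + 0.28)
print(f"~{ratio:.0f}x")
```

The dashboard does this bookkeeping automatically from the token counts the SDK reports, this is just the arithmetic behind the numbers it shows.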
</p><p>Vision & Video </p><p>Runway Gen-3 - SORA like video model announced (<a target="_blank" href="https://x.com/runwayml/status/1803777056812994606">X</a>, <a target="_blank" href="https://runwayml.com/blog/introducing-gen-3-alpha/">blog</a>)</p><p>Runway, you know, the company everyone was "sorry for" when SORA was announced by OpenAI, is not sitting around waiting to "be killed" and is announcing Gen-3, an incredible video model capable of realistic video generations, physics understanding, and a lot lot more. </p><p>The videos took over my timeline, and this looks to my eyes better than KLING and better than Luma Dream Machine from last week, by quite a lot! </p><p>Not to mention that Runway has been in video production for way longer than most, so they have other tools that work with this model, like motion brush, lip syncing, temporal controls and many more, that allow you to be the director of exactly the right scene. </p><p>Google Deepmind video-to-audio (<a target="_blank" href="https://twitter.com/GoogleDeepMind/status/1802733643992850760">X</a>)</p><p>You're going to need to turn your sound on for this one! Google has released a tease of a new model of theirs that can be paired amazingly well with the above type of generative video models (of which Google also has one, that they've teased and it's coming bla bla bla) </p><p>This one watches your video and provides acoustic sound fitting the scene, with on-screen action sound! They showed a few examples and honestly they look so good, a drummer playing drums with the model generating the drum sounds, etc. 👏 Will we ever see this as a product from Google though? Nobody knows!
</p><p>Microsoft releases tiny (0.23B, 0.77B) Vision Models Florence (<a target="_blank" href="https://x.com/skalskip92/status/1803798306897956878">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/Florence-2-large">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/gokaygokay/Florence-2">Try It</a>)</p><p>This one is a very exciting release because it's MIT licensed, and TINY! Less than 1 Billion parameters, meaning it can run completely on device. It's a vision model that beats MUCH bigger vision models by a significant amount on tasks like OCR, segmentation, object detection, image captioning and more! </p><p>They have leveraged (and are supposedly going to release) the FLD-5B dataset, and they have specifically made this model to be fine-tunable across these tasks, which is exciting because open source vision models are going to significantly benefit from this release almost immediately. </p><p>Just look at this hand written OCR capability! Stellar! </p><p>NousResearch - Hermes 2 Theta 70B - inching over GPT-4 on MT-Bench</p><p>Teknium and the Nous Research crew have released a new model just to mess with me, you see, the live show was already recorded and edited, the file exported, the TL;DR written, and the newsletter draft almost ready to submit, and then I check the Green Room (DM group for all friends of the pod for ThursdAI, it's really an awesome Group Chat) and Teknium drops that they've beaten GPT-4 (unsure which version) on MT-Bench with a finetune and a merge of LLama-3</p><p>They beat Llama-3 instruct, which on its own is very hard, by merging Llama-3 instruct into their model with Charles Goddard's help (merge-kit author) </p><p>As always, these models from Nous Research are very popular, but apparently a bug at HuggingFace shows that this one is extra super duper popular, clocking in at almost 25K downloads in the past hour since release, which doesn't quite make sense 😅 anyway, I'm sure this is a great one, congrats on the
release, friends! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Phew, somehow we covered all (most? all of the top interesting) AI news and breakthroughs of this week? Including interviews and breaking news! </p><p>I think that this is almost our 1 year anniversary since we started putting ThursdAI out as a podcast, episode #52 is coming shortly! </p><p>Next week is going to be a big one as well, see you then, and if you enjoy these, give us a 5-star review on whatever podcast platform you're using? It really helps 🫡 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-june-20th-claude-sonnet</link><guid isPermaLink="false">substack:post:145848854</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 20 Jun 2024 22:55:10 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/145848854/0a3b413fa0e740fd1b11e14ebb7f89fd.mp3" length="49964783" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4164</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/145848854/8aac3cbfbfbab6564051cfabf7e41c6c.jpg"/></item><item><title><![CDATA[ThursdAI - June 13th, 2024 - Apple Intelligence recap, Elons reaction, Luma's Dream Machine, AI Engineer invite, SD3 & more AI news from this past week]]></title><description><![CDATA[<p>Happy Apple AI week everyone (well, those of us who celebrate, some don't) as this week we finally got told what Apple is planning to do with this whole generative AI wave and presented Apple Intelligence (which is AI, get it?
they are trying to rebrand AI!)</p><p>This week's pod and newsletter will focus mainly on Apple Intelligence of course, as it did for most people, judging by how the market reacted ($AAPL grew over $360B in a few days after this announcement) and how many people watched each live stream (10M at the time of this writing have watched the WWDC keynote on YouTube, compared to 4.5M for the OpenAI GPT-4o event and 1.8M for Google I/O) </p><p>On the pod we also geeked out on new eval frameworks and benchmarks, including a chat with the authors of MixEval, which I wrote about last week, and a new benchmark called Live Bench from Abacus and Yann LeCun</p><p>Plus a new video model from Luma and finally SD3, let's go! 👇 </p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Apple WWDC recap and Apple Intelligence</strong> (<a target="_blank" href="https://x.com/altryne/status/1800207009033474142">X</a>)</p><p>* <strong>This Weeks Buzz</strong></p><p>* AI Engineer expo in SF (June 25-27)  come see my talk, it's going to be Epic (<a target="_blank" href="https://x.com/aiDotEngineer/status/1791506805065216017">X</a>, <a target="_blank" href="https://ai.engineer/schedule?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march28">Schedule</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Microsoft Samba - 3.8B MAMBA + Sliding Window Attention beating Phi 3 (<a target="_blank" href="https://x.com/liliang_ren/status/1801027052147216457">X</a>, <a target="_blank" href="arxiv.org/abs/2406.07522">Paper</a>)</p><p>* Sakana AI releases LLM squared - LLMs coming up with preference algorithms to train better LLMs  (<a target="_blank" href="https://x.com/RobertTLange/status/1801150001147482498">X</a>,<a target="_blank" href="https://sakana.ai/llm-squared/"> Blog</a>)</p><p>* Abacus + Yann LeCun release <a target="_blank" href="LiveBench.AI">LiveBench.AI</a> - impossible to game benchmark (<a target="_blank"
href="https://x.com/bindureddy/status/1801010849160818701">X</a>, <a target="_blank" href="livebench.ai">Bench</a>)</p><p>* Interview with MixEval folks about achieving 96% arena accuracy at a 5000x lower price</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Mistral announced a 600M series B round</p><p>* Revenue at OpenAI DOUBLED in the last 6 months and is now at $3.4B annualized (<a target="_blank" href="https://www.theinformation.com/articles/openais-annualized-revenue-doubles-to-3-4-billion-since-late-2023?utm_source=ti_app">source</a>)</p><p>* Elon drops lawsuit vs OpenAI </p><p>* <strong>Vision & Video</strong></p><p>* Luma drops DreamMachine - SORA like short video generation in free access (<a target="_blank" href="https://x.com/LumaLabsAI/status/1800921380034379951">X</a>, <a target="_blank" href="https://lumalabs.ai/dream-machine">TRY IT</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stable Diffusion Medium weights are here (<a target="_blank" href="https://x.com/StabilityAI/status/1800875914299048404">X</a>, <a target="_blank" href="https://huggingface.co/stabilityai/stable-diffusion-3-medium">HF</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/stable-diffusion-v3-medium">FAL</a>)</p><p>* <strong>Tools</strong></p><p>* Google releases GenType - create an alphabet with diffusion Models (<a target="_blank" href="https://x.com/labsdotgoogle/status/1800198132321710209">X</a>, <a target="_blank" href="https://labs.google/gentype">Try It</a>)</p><p>Apple Intelligence</p><p>Technical LLM details </p><p>Let's dive right into what wasn't shown on the keynote, in a 6 minute <a target="_blank" href="https://x.com/altryne/status/1800540764612841911">deep dive video</a> from the state of the union for developers and in a follow up post on their machine learning <a target="_blank"
href="https://machinelearning.apple.com/research/introducing-apple-foundation-models?utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-talaria-apples-new-mlops-superweapon-4066">blog</a>, Apple shared some very exciting technical details about their on device models and orchestration that will become Apple Intelligence. </p><p>Namely, on device they have trained a bespoke 3B parameter LLM on licensed data, using a bunch of very cutting edge modern techniques to achieve quite incredible on device performance. Stuff like GQA, Speculative Decoding, and a very unique type of quantization (which they claim is almost lossless) </p><p>To maintain model quality, <strong>we developed a new framework using LoRA adapters that incorporates a mixed 2-bit and 4-bit configuration strategy — averaging 3.5 bits-per-weight — to achieve the same accuracy as the uncompressed models [...] on iPhone 15 Pro we are able to reach time-to-first-token latency of about 0.6 millisecond per prompt token, and a generation rate of 30 tokens per second</strong></p><p>These small models (they also have a bespoke image diffusion model as well) are going to be finetuned with a lot of LoRA adapters for specific tasks like Summarization, Query handling, Mail replies, Urgency and more, which gives their foundation models the ability to specialize themselves on the fly for the task at hand, and to be cached in memory as well for optimal performance.
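Those quoted numbers check out with some quick back-of-envelope math. This sketch assumes every weight is stored at either 2 or 4 bits, which is all the quote tells us; the actual per-layer mix Apple uses is not public:

```python
# Back-of-envelope checks on the Apple figures quoted above; purely
# illustrative, assuming weights are only ever 2-bit or 4-bit.

def hi_bit_fraction(avg_bits: float, lo_bits: int = 2, hi_bits: int = 4) -> float:
    """Fraction of weights that must use hi_bits to hit avg_bits on average."""
    return (avg_bits - lo_bits) / (hi_bits - lo_bits)

# A 3.5 bits-per-weight average implies ~3 out of 4 weights stay at 4-bit:
print(hi_bit_fraction(3.5))  # → 0.75

# And 0.6 ms of time-to-first-token per prompt token means a 1,000-token
# prompt reaches its first generated token in about 0.6 seconds:
ttft_s = 1_000 * 0.6 / 1000  # tokens * ms-per-token, converted to seconds
print(f"{ttft_s:.1f}s")  # → 0.6s
```

In other words, the quantization is only lightly mixed (mostly 4-bit with a quarter of weights squeezed to 2-bit), which is consistent with the "almost lossless" claim.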
</p><p>Personal and Private (including in the cloud) </p><p>While these models are small, they will also benefit from 2 more things on device: a vector store of your stuff (contacts, recent chats, calendar, photos) they call the semantic index, and a new thing Apple is calling App Intents, which developers can expose (and the OS apps already do), that will allow the LLM to use tools like moving files and extracting data across apps, and to take actions. This already makes the AI much more personal and helpful, as it has in its context things about me and what my apps can do on my phone. </p><p>Handoff to the Private Cloud (and then to OpenAI)</p><p>What the local 3B LLM + context can't do, it'll hand off to the cloud, in what Apple claims is a very secure way, called Private Cloud, in which they run new inference techniques in the cloud, on Apple Silicon, with Secure Enclave and Secure Boot, ensuring that the LLM sessions that run inference on your data are never stored, and even Apple can't access those sessions, not to mention train their LLMs on your data. </p><p>Here are some benchmarks Apple posted for their On-Device 3B model and an unknown-size server model, comparing them to GPT-4-Turbo (not 4o!) on unnamed benchmarks they came up with. </p><p>In cases where Apple Intelligence cannot help you with a request (I'm still unclear when this actually would happen) iOS will now show you this dialog, suggesting you use chatGPT from OpenAI, marking a deal with OpenAI (in which apparently nobody pays anybody: Apple isn't getting paid by OpenAI to be placed there, nor does Apple pay OpenAI for the additional compute, tokens, and inference) </p><p>Implementations across the OS</p><p>So what will people be able to actually do with this intelligence?
I'm sure that Apple will add much more in the next versions of iOS, but at least for now, <strong>Siri is getting an LLM brain transplant</strong> and is going to be much smarter and more capable, from understanding natural speech better (and just, having better ears, the on device speech to text is improved and is really good now in iOS 18 beta) to being able to use app intents to do actions for you across several apps. </p><p>Other features across the OS will use Apple Intelligence to prioritize your notifications, and also summarize group chats that are going off, and have built in tools for rewriting, summarizing, and turning any text anywhere into anything else. Basically, many of the tasks you'd use chatGPT for are now built into the OS level itself for free. </p><p>Apple is also adding AI Art diffusion features like GenMoji (the ability to generate any emoji you can think of, like chef's kiss, or a seal with a clown nose) and while this sounds absurd, I've never been in a Slack or a Discord that didn't have their own unique custom emojis uploaded by their members. </p><p>And one last feature I'll highlight is this Image Playground, Apple's take on generating images, which uses not just text but a contextual understanding of your conversation, lets you create with autosuggested concepts instead of just text prompts, and is going to be available to all developers to bake into their apps. </p><p>Elon is SALTY - and it's not because of privacy</p><p>I wasn't sure whether to include this segment, but in what became my most viral tweet since the beginning of this year, I posted about Elon muddying the water about what Apple actually announced, and called it a Psyop that worked.
Many MSM outlets, and definitely the narrative on X, turned to what Elon thinks about those announcements rather than the announcements themselves, and just look at this insane reach.</p><p>We've covered Elon vs OpenAI before (a lawsuit that he actually withdrew this week, because emails came out showing he knew and was ok with OpenAI not being Open) and so it's no surprise that when Apple decided to partner with OpenAI and not say... XAI, Elon would promote absolutely incorrect and ignorant takes that took over the radio waves, like that he will ban Apple devices from all his companies, or that OpenAI will get access to train on your iPhone data. </p><p>This weeks BUZZ (Weights & Biases Update) </p><p>Hey, if you're reading this, it's very likely that you've already registered or at least heard of <a target="_blank" href="http://ai.engineer">ai.engineer</a> and if you haven't, well I'm delighted to tell you that we're sponsoring this awesome event in San Francisco June 25-27. Not only are we official sponsors, both Lukas (the Co-Founder and CEO) and I will be there giving talks (mine will likely be crazier than his) and we'll have a booth there, so if you're coming, make sure to come by my talk (or Lukas's if you're a VP and are signed up for that exclusive track) </p><p>Everyone in our corner of the world is going to be there, Swyx told me that many of the foundation model labs are coming, OpenAI, Anthropic, Google, and there's going to be tons of tracks (My talk is of course in the Evals track, come, really, I might embarrass myself on stage for eternity, you don't want to miss this) </p><p>Swyx kindly provided listeners and readers of ThursdAI with a special coupon <strong>feeltheagi</strong> so even more of a reason to try and convince your boss and come see me on stage in a costume (I've said too much!)
</p><p>Vision & Video</p><p>Luma drops DreamMachine - SORA like short video generation in free access (<a target="_blank" href="https://x.com/LumaLabsAI/status/1800921380034379951">X</a>, <a target="_blank" href="https://lumalabs.ai/dream-machine">TRY IT</a>)</p><p>In an absolute surprise, Luma AI, a company that (used to) specialize in crafting 3D models, has released a free access video model similar to SORA, and Kling (which we covered last week) that generates 5 second videos (and doesn't require a Chinese phone # haha)</p><p>It's free to try, and supports text to video, image to video, cinematic prompt instructions, great and cohesive narrative following, character consistency and a lot more. </p><p></p><p>Here's a comparison of the famous SORA videos and LDM (Luma Dream Machine) videos that AmebaGPT provided me on X; however, it's worth noting that these are cherry picked SORA videos while LDM is likely a much smaller and quicker model, and that folks are creating some incredible things already! </p><p>AI Art & Diffusion & 3D </p><p>Stable Diffusion Medium weights are here (<a target="_blank" href="https://x.com/StabilityAI/status/1800875914299048404">X</a>, <a target="_blank" href="https://huggingface.co/stabilityai/stable-diffusion-3-medium">HF</a>, <a target="_blank" href="https://fal.ai/models/fal-ai/stable-diffusion-v3-medium">FAL</a>)</p><p>It's finally here (well, I'm using finally carefully here, and really hoping that this isn't the last thing Stability AI releases), the weights for Stable Diffusion 3 are available on HuggingFace! SD3 offers improved photorealism and awesome prompt adherence, like asking for multiple subjects doing multiple things. </p><p>It's also pretty good at typography and fairly resource efficient compared to previous versions, though I'm still waiting for the super turbo distilled versions that will likely come soon! </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication.
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>And that's it for this week folks, it's been a hell of a week, I really do appreciate each and every one of you who makes it to the end, reading and engaging, and I would love to ask for feedback, so if anything didn't resonate, too long / too short, or on the podcast itself, too much info, too little info, please do share, I will take it into account 🙏 🫡 </p><p>Also, we're coming up to the 52nd week I've been sending these, which will mark ThursdAI BirthdAI for real (the previous one was for the live shows) and I'm very humbled that so many of you are now reading, sharing and enjoying learning about AI together with me 🙏 </p><p>See you next week, </p><p>Alex </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-june-13th-2024-apple-intelligence</link><guid isPermaLink="false">substack:post:145622349</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 13 Jun 2024 23:25:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/145622349/83f10ba9deee5e0c7b32121604399490.mp3" length="76619156" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6385</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/145622349/1f1bb21cc8d6f42d5c6075af012000b7.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Jun 6th - 👑 Qwen2 Beats Llama-3! Jina vs. Nomic for Multimodal Supremacy, new Chinese SORA, Suno & Udio user uploads & more AI news]]></title><description><![CDATA[<p>Hey hey! This is Alex! 👋 </p><p>Some podcasts have 1 or maaaybe 2 guests an episode, we had 6!
guests today, each has had an announcement, an open source release, or a breaking news story that we've covered! (PS, this edition is very multimodal so click into the Substack as videos don't play in your inbox)</p><p>As you know my favorite thing is to host the folks who make the news to let them do their own announcements, but also, hitting that BREAKING NEWS button when something is actually breaking (as in, happened just before or during the show) and I've actually used it 3 times this show! </p><p>It's not every week that we get to announce a NEW SOTA open model with the team that worked on it. Junyang (Justin) Lin from Qwen is a friend of the pod, a frequent co-host, and today gave us the breaking news of this month, as Qwen2 72B is <strong>beating LLama-3 70B on most benchmarks</strong>!  That's right, a new state of the art open LLM was announced on the show, and Justin went deep into details 👏 (so don't miss this conversation, listen wherever you get your podcasts) </p><p>We also chatted about SOTA multimodal embeddings with the Jina folks (Bo Wang and Han Xiao) and Zach from Nomic, dove into an open source compute grant with FAL's Batuhan Taskaya and much more!
</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs </strong></p><p>* Alibaba announces Qwen 2 - 5 model suite (<a target="_blank" href="https://x.com/JustinLin610/status/1798747072319074347">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f">HF</a>)</p><p>* Jina announces Jina-Clip V1 - multimodal embeddings beating CLIP from OAI (<a target="_blank" href="https://x.com/JinaAI_/status/1798333405593014762">X</a>, <a target="_blank" href="https://jina.ai/news/jina-clip-v1-a-truly-multimodal-embeddings-model-for-text-and-image/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/Xenova/webgpu-jina-clip?v2=">Web Demo</a>)</p><p>* Nomic announces Nomic-Embed-Vision (<a target="_blank" href="https://x.com/nomic_ai/status/1798368463292973361">X</a>, <a target="_blank" href="https://blog.nomic.ai/posts/nomic-embed-vision">BLOG</a>)</p><p>* MixEval - arena style rankings with <strong>Chatbot Arena model rankings with 2000× less time (5 minutes) and 5000× less cost ($0.6) (</strong><a target="_blank" href="https://x.com/NiJinjie/status/1798182749049852411"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://mixeval.github.io/"><strong>Blog</strong></a><strong>)</strong></p><p>* <strong>Vision & Video</strong></p><p>* Kling - open access video model SORA competitor from China (<a target="_blank" href="https://x.com/bdsqlsz/status/1798710076175528354">X</a>)</p><p>* <strong>This Weeks Buzz</strong> </p><p>* WandB supports Mistral new finetuning service (X)</p><p>* Register to my June 12 workshop on building Evals with Weave <a target="_blank" href="https://wandb.ai/site/resources/events/llms-virtual-workshop">HERE</a></p><p>* <strong>Voice & Audio</strong></p><p>* StableAudio Open - <a target="_blank" href="https://x.com/dadabots/status/1798398048403653047">X</a>, <a target="_blank" href="https://stability.ai/news/introducing-stable-audio-open">BLOG</a>, 
<a target="_blank" href="https://huggingface.co/spaces/artificialguybr/Stable-Audio-Open-Zero">TRY IT</a></p><p>* Suno launches "upload your audio" feature to select few - <a target="_blank" href="https://x.com/altryne/status/1798493515477094548">X</a> </p><p>* Udio - upload your own audio feature - <a target="_blank" href="https://x.com/udiomusic/status/1798369297758077066">X</a></p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stable Diffusion 3 weights are coming on June 12th (<a target="_blank" href="https://stability.ai/stablediffusion3">Blog</a>)</p><p>* JasperAI releases Flash Diffusion (<a target="_blank" href="https://x.com/benjamin_aubin_/status/1798707273650389082">X</a>, <a target="_blank" href="https://huggingface.co/spaces/jasperai/flash-diffusion">TRY IT</a>, <a target="_blank" href="https://www.jasper.ai/blog/announcing-flash-diffusion">Blog</a>)</p><p>* Big CO LLMs + APIs</p><p>* A group of ex-OpenAI folks sign a new letter  - <a target="_blank" href="http://righttowarn.ai">righttowarn.ai</a> </p><p>* A hacker releases TotalRecall - a tool to extract all the info from MS Recall Feature (<a target="_blank" href="https://github.com/xaitax/TotalRecall">Github</a>)</p><p>Open Source LLMs </p><p>QWEN 2 - new SOTA open model from Alibaba (<a target="_blank" href="https://x.com/JustinLin610/status/1798747072319074347">X</a>, <a target="_blank" href="https://huggingface.co/collections/Qwen/qwen2-6659360b33528ced941e557f">HF</a>)</p><p>This is definitely the biggest news for this week, as the folks at Alibaba released a very surprising and super high quality suite of models, spanning from a tiny 0.5B model to a new leader in open models, <strong>Qwen 2 72B</strong> </p><p>To add to the distance from Llama-3, these new models support a wide range of context lengths, all large, with the 7B and 72B supporting up to 128K context.
</p><p>Justin mentioned on stage that actually finding training sequences of longer context lengths is challenging, and this is why they are only at 128K.</p><p>In terms of advancements, the highlight is advanced Code and Math capabilities, which are likely to contribute to overall model advancements across other benchmarks as well. </p><p>It's also important to note that all models (besides the 72B) are now released with an Apache 2 license to help folks around the world actually use them, and speaking of global reach, these models have been natively trained with 27 additional languages, making them considerably better at multilingual prompts! </p><p>One additional amazing thing: a finetune was released by Eric Hartford and the Cognitive Computations team, and AFAIK this is the first time a new model drops with an external finetune. Justin literally said "It is quite amazing. I don't know how they did that. Well, our teammates don't know how they did that, but, uh, it is really amazing when they use the Dolphin dataset to train it."</p><p>Here are the Dolphin finetune metrics, and you can <a target="_blank" href="https://huggingface.co/spaces/cognitivecomputations/chat">try it out</a> here</p><p>Jina-Clip V1 and Nomic-Embed-Vision SOTA multimodal embeddings</p><p>It's quite remarkable that we got 2 separate SOTAs of a similar thing during the same week, and even more cool that both companies came to talk about it on ThursdAI!
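To give a rough sense of what a shared text-image embedding space buys you: once text and images land in the same vector space, cross-modal retrieval is just cosine similarity. A minimal sketch with made-up vectors (in practice you'd get real embeddings from the Jina or Nomic APIs; the 3-dimensional vectors here are purely illustrative):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between each row of `a` and each row of `b`."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Hypothetical pre-computed vectors; real embeddings have hundreds of dims.
query_vec = np.array([[0.2, 0.9, 0.1]])   # embedding of a text query
image_vecs = np.array([[0.1, 0.8, 0.2],   # embedding of image A (similar)
                       [0.9, 0.1, 0.0]])  # embedding of image B (dissimilar)

scores = cosine_sim(query_vec, image_vecs)[0]
best = int(np.argmax(scores))  # index of the image closest to the text query
```

The same ranking logic works for text-to-image, image-to-text, or image-to-image search, which is exactly why a single multimodal model is so convenient for RAG over mixed data.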
</p><p>First we welcomed back Bo Wang from Jina (who was joined by Han Xiao, the CEO) and Bo talked about multimodal embeddings that beat OpenAI CLIP (which both conceded was a very low bar) </p><p>Jina Clip V1 is Apache 2 open sourced, while Nomic Embed beats it on benchmarks but is CC-BY-NC licensed for non-commercial use. In most cases though, if you're embedding, you'd likely use an API, and both companies offer these embeddings via their respective APIs</p><p>One thing to note about Nomic: they mentioned that these new embeddings are backwards compatible with the awesome Nomic Embed endpoints and embeddings, so if you've used those, now you've gone multimodal! </p><p>Because these models are fairly small, there are now web versions, thanks to Transformers.js, of <a target="_blank" href="https://huggingface.co/spaces/Xenova/webgpu-jina-clip?v2=">Jina</a> and <a target="_blank" href="https://x.com/nomic_ai/status/1798751625420669285">Nomic Embed</a> (caution, this will download large-ish files) built by none other than our friend Xenova.</p><p>If you're building any type of multimodal semantic search, these two embedding systems open up all your RAG needs for multimodal data! </p><p>This week's Buzz (What I learned with WandB this week)</p><p>Mistral announced built-in finetuning support, with a simple WandB integration! (X)</p><p>Also, my workshop about building evals 101 is coming up next week, June 12; I'm excited to share with you a workshop we originally wrote for an in-person crowd, please register <a target="_blank" href="https://wandb.ai/site/resources/events/llms-virtual-workshop">here</a></p><p>I hope to see you next week!
</p><p>Vision & Video</p><p>New SORA-like video generation model called KLING in open access (<a target="_blank" href="https://kling.kuaishou.com/">DEMO</a>)</p><p>This one has to be seen to be believed. Out of nowhere, an obscure (to us) Chinese company <a target="_blank" href="http://kuaishou.com">kuaishou.com</a> dropped a landing page with tons of videos that are clearly AI generated, and they all look very close to SORA quality, way surpassing everything else we've seen in this category (Runway, Pika, SVD) </p><p>And they claim that they offer access to it via their app (but apparently you need a Chinese phone number, so not for me) </p><p>It's really hard to believe that this quality exists already outside of a frontier lab full of GPUs like OpenAI, and it's now in waitlist mode, whereas SORA is "coming soon" </p><p>Voice & Audio</p><p>Stability open sources Stable Audio Open (<a target="_blank" href="https://x.com/dadabots/status/1798398048403653047">X</a>, <a target="_blank" href="https://stability.ai/news/introducing-stable-audio-open">BLOG</a>, <a target="_blank" href="https://huggingface.co/spaces/artificialguybr/Stable-Audio-Open-Zero">TRY IT</a>)</p><p>A new open model from Stability is always fun, and while we wait for SD3 to drop weights (June 12! we finally have a date) we get this awesome model from the Dadabots team at Stability. </p><p>It's able to generate 47 seconds of music, and is awesome at generating loops, drums and other non-vocal stuff, so not quite where Suno/Udio are, but the samples are very clean and sound very good.
Prompt: New York Subway</p><p>They focused the model on being able to get finetuned on a specific drummer's style, for example, and on being open and specialized in samples and sound effects rather than melodies or finalized full songs, but it has some decent skills with simple prompts, like "progressive house music"</p><p>This model has a non-commercial license and can be played with <a target="_blank" href="https://huggingface.co/spaces/artificialguybr/Stable-Audio-Open-Zero">here</a></p><p>Suno & Udio let users upload their own audio! </p><p>This one is big, so big in fact, that I am very surprised that both companies announced this exact feature the same week. </p><p>Suno has reached out to me and a bunch of other creators, and told us that we are now able to upload our own clips, be it someone playing solo guitar, or even whistling, and have Suno remix it into a real proper song. </p><p>In this example, this is a very viral video, this guy sings at a market selling fish (to ladies?) and Suno was able to create this remix for me, with the drop, the changes in his voice, the melody, everything, it’s quite remarkable! </p><p>AI Art & Diffusion</p><p>Flash Diffusion from JasperAI / Clipdrop team (<a target="_blank" href="https://x.com/benjamin_aubin_/status/1798707273650389082">X</a>, <a target="_blank" href="https://huggingface.co/spaces/jasperai/flash-diffusion">TRY IT</a>, <a target="_blank" href="https://www.jasper.ai/blog/announcing-flash-diffusion">Blog</a>, <a target="_blank" href="https://gojasper.github.io/flash-diffusion-project/">Paper</a>)</p><p>Last but definitely not least, we now have a banger of a diffusion update, from the Clipdrop team (who built <em>amazing </em>things before Stability bought them and then sold them to JasperAI) </p><p>Diffusion models like Stable Diffusion often take 30-40 inference steps to get you the image, searching for your prompt through latent space you know?
</p><p>Well recently there have been tons of these new distillation methods: models that are like students, which learn from a teacher model (Stable Diffusion XL, for example) and distill the same capability down to a few steps (sometimes as low as 2!) </p><p>Often the results are distilled models that can run in real time, like SDXL Turbo, Lightning SDXL etc</p><p>Now Flash Diffusion achieves State-of-the-Art (SOTA) performance metrics, specifically in terms of Fréchet Inception Distance (FID) and CLIP Score. These metrics are the default for evaluating the quality and relevance of generated images. </p><p>And Jasper has open sourced the whole training code to allow for reproducibility, which is very welcome!</p><p>Flash Diffusion also comes in not only image generation, but also inpainting and upscaling variants, allowing it to be applied to other methods to speed them up as well. </p><p>— </p><p>This is all for this week. I mean, there's tons more stuff we could have covered, and we did mention it on the pod, but I aim to serve as a filter to the most interesting things as well, so, until next week 🫡 </p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jun-6th-qwen2-beats-llama</link><guid isPermaLink="false">substack:post:145395563</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 07 Jun 2024 00:00:21 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/145395563/fe1c2101c5ffc48e6eceac6d24e9b3df.mp3" length="74706874" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6225</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/145395563/19e19d96278a090af08b3608e1a2cf8b.jpg"/></item><item><title><![CDATA[📅 ThursdAI - May 30 - 1000 T/s inference w/ SambaNova, <135ms TTS with Cartesia, SEAL leaderboard from Scale & more AI news]]></title><description><![CDATA[<p>Hey everyone, Alex here! </p><p>Can you believe it's already the end of May? And that two huge AI company conferences are behind us (Google IO, MSFT Build) and Apple's WWDC is just ahead in 10 days! Exciting! </p><p>I was really looking forward to today's show, had quite a few guests today, I'll add all their socials below the TL;DR so please give them a follow, and if you're only in reading mode of the newsletter, why don't you give the podcast a try 🙂 It's impossible for me to add the density of knowledge that's being shared on stage for 2 hours here in the newsletter!
</p><p>Also, before we dive in, I’m hosting a free workshop soon, about building evaluations from scratch; if you’re building anything with LLMs in production, you're more than welcome to <a target="_blank" href="https://wandb.ai/site/resources/events/llms-virtual-workshop">join us on June 12th</a> (it’ll be virtual)</p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Mistral open weights Codestral - 22B dense coding model (<a target="_blank" href="https://x.com/dchaplot/status/1795823340533469560">X</a>, <a target="_blank" href="https://mistral.ai/news/codestral/">Blog</a>)</p><p>* Nvidia open sources NV-Embed-v1 - Mistral based SOTA embeddings (<a target="_blank" href="https://x.com/arankomatsuzaki/status/1795930779756712191">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/NV-Embed-v1">HF</a>)</p><p>* HuggingFace Chat with tool support (<a target="_blank" href="https://twitter.com/altryne/status/1795491886909530479">X</a>, <a target="_blank" href="https://huggingface.co/chat/">demo</a>)</p><p>* Aider beats SOTA on SWE-Bench with 26% (<a target="_blank" href="https://x.com/paulgauthier/status/1794447750442226047">X</a>, <a target="_blank" href="https://aider.chat/2024/05/22/swe-bench-lite.html">Blog</a>, <a target="_blank" href="https://github.com/paul-gauthier/aider">Github</a>)</p><p>* OpenChat - SOTA finetune of Llama3 (<a target="_blank" href="https://x.com/alignment_lab/status/1794116045269017044">X</a>, <a target="_blank" href="https://huggingface.co/openchat/openchat-3.6-8b-20240522">HF</a>, <a target="_blank" href="openchat.team">Try It</a>)</p><p>* LLM 360 - K2 65B - fully transparent and reproducible (<a target="_blank" href="https://x.com/llm360/status/1795833911580438807">X</a>, <a target="_blank" href="https://t.co/5eM2IaUgm6">Paper</a>, <a target="_blank" href="https://huggingface.co/LLM360/K2">HF</a>, <a target="_blank" href="https://wandb.ai/llm360/K2?nw=29mu6l0zzqq">WandB</a>)</p><p>* <strong>Big CO LLMs +
APIs</strong></p><p>* Scale announces SEAL Leaderboards - with private Evals (<a target="_blank" href="https://x.com/alexandr_wang/status/1795857651592491281">X</a>, <a target="_blank" href="https://t.co/10nk53awrL">leaderboard</a>)</p><p>* SambaNova achieves >1000T/s on Llama-3 full precision</p><p>* Groq hits back with breaking 1200T/s on Llama-3</p><p>* Anthropic tool support in GA (<a target="_blank" href="https://x.com/AnthropicAI/status/1796210547077128578">X</a>, <a target="_blank" href="https://www.anthropic.com/news/tool-use-ga">Blogpost</a>)</p><p>* OpenAI adds GPT4o, Web Search, Vision, Code Interpreter & more to free users (<a target="_blank" href="https://x.com/OpenAI/status/1795900306490044479">X</a>)</p><p>* Google Gemini & Gemini Flash are topping the evals leaderboards, in GA (<a target="_blank" href="https://x.com/lmsysorg/status/1795512202465845686">X</a>)</p><p>* Gemini Flash finetuning coming soon</p><p>* <strong>This week's Buzz (What I learned at WandB this week)</strong></p><p>* Sponsored a Mistral hackathon in Paris</p><p>* We have an upcoming <a target="_blank" href="https://wandb.ai/site/resources/events/llms-virtual-workshop">workshop</a> in 2 parts - come learn with me</p><p>* <strong>Vision & Video</strong></p><p>* LLama3-V - SOTA OSS VLM (<a target="_blank" href="https://x.com/siddrrsh/status/1795541002620727439">X</a>, <a target="_blank" href="https://github.com/mustafaaljadery/llama3v">Github</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Cartesia AI - super fast SSM-based TTS with very good sounding voices (<a target="_blank" href="https://x.com/cartesia_ai/status/1795856778456084596">X</a>, <a target="_blank" href="https://t.co/rMnegk14Jl">Demo</a>)</p><p>* <strong>Tools & Hardware</strong></p><p>* Jina Reader (<a target="_blank" href="https://jina.ai/reader/#demo">https://jina.ai/reader/</a>) </p><p>* <strong>Co-Hosts and Guests</strong></p><p>* Rodrigo Liang (<a target="_blank"
href="https://x.com/RodrigoLiang">@RodrigoLiang</a>) & Anton McGonnell (<a target="_blank" href="https://x.com/aton2006">@aton2006</a>) from SambaNova</p><p>* Itamar Friedman (<a target="_blank" href="https://x.com/itamar_mar">@itamar_mar</a>) Codium</p><p>* Arjun Desai (<a target="_blank" href="https://x.com/jundesai">@jundesai</a>) - Cartesia</p><p>* Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten">@nisten</a>) - Cohost</p><p>* Wolfram Ravenwolf (<a target="_blank" href="https://twitter.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Eric Hartford (<a target="_blank" href="https://twitter.com/erhartford">@erhartford</a>)</p><p>* Maziyar Panahi (<a target="_blank" href="https://x.com/MaziyarPanahi">@MaziyarPanahi</a>)</p><p>Scale SEAL leaderboards (<a target="_blank" href="https://scale.com/leaderboard">Leaderboard</a>)</p><p>Scale AI has announced their new initiative, called SEAL leaderboards, which aims to provide yet another point of reference in how we understand frontier models and their performance against each other. </p><p>We've of course been sharing LMSys arena rankings here, and the openLLM leaderboard from HuggingFace, however, there are issues with both these approaches, and Scale is approaching the measurement in a different way, focusing on very private benchmarks and datasets curated by their experts (like Riley Goodside) </p><p>The focus of SEAL is private and novel assessments across Coding, Instruction Following, Math, Spanish and more, and the main reason they keep this private is so that models won't be able to train on these benchmarks if they leak to the web, and thus show better performance due to data contamination.
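Leaderboards like the LMSys arena and SEAL boil down to turning many pairwise "which answer was better" outcomes into a single rating per model. A minimal, self-contained sketch of the classic ELO update (SEAL actually fits a Bradley-Terry model, of which this online update is the familiar cousin; the match results below are made up):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One ELO update. score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    # B's update mirrors A's, so the total rating pool is conserved.
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Made-up pairwise results: model A beats model B twice, then loses once.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for a_won in (1.0, 1.0, 0.0):
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], a_won
    )
```

The key property is that an upset (beating a much higher-rated model) moves ratings more than an expected win, which is what makes the aggregate converge to a meaningful ordering.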
</p><p>They are also using ELO scores (Bradley-Terry) and I love this footnote from the actual website: </p><p>"To ensure leaderboard integrity, we require that models can only be featured the FIRST TIME when an organization encounters the prompts"</p><p>This means they are taking the contamination thing very seriously and it's great to see such dedication to being a trusted source in this space. </p><p>Specifically interesting also that on their benchmarks, GPT-4o is not better than GPT-4 Turbo at coding, and definitely not by 100 points like it was announced by LMSys and OpenAI when they released it! </p><p>Gemini 1.5 Flash (and Pro) in GA and showing impressive performance </p><p>As you may remember from my <a target="_blank" href="https://sub.thursdai.news/p/thursdai-may-16-openai-gpt-4o-google?r=2imipa">Google IO recap</a>, I was really impressed with Gemini Flash, and I felt that it went under the radar for many folks. Given its throughput speed, 1M context window, multimodality and price tier, I strongly believed that Google was onto something here. </p><p>Well this week, not only was I proven right, I didn't actually realize how right I was 🙂 as we heard breaking news from Logan Kilpatrick during the show, that the models are now in GA, that Gemini Flash gets upgraded to 1000 RPM (requests per minute), and that finetuning is coming and will be free of charge! </p><p>Not only will finetuning not cost you anything, inference on your tuned model is going to cost the same, which is very impressive. </p><p>There was a sneaky price adjustment from the announced pricing to the GA pricing that upped the pricing by 2x on output tokens, but even despite that, Gemini Flash with $0.35/1MTok for input and $1.05/1MTok on output is probably the best deal there is right now for LLMs of this level.
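With per-million-token pricing like this, back-of-envelope cost math is one line. A tiny helper using the GA prices quoted above (illustrative only; always check Google's current pricing page, since as noted these numbers have already shifted once):

```python
def gemini_flash_cost(input_tokens: int, output_tokens: int) -> float:
    """Rough cost in USD at the GA prices quoted above (per 1M tokens)."""
    INPUT_PER_M = 0.35   # $/1M input tokens
    OUTPUT_PER_M = 1.05  # $/1M output tokens
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# e.g. a long-context call: 100K tokens in, 2K tokens out ≈ 3.7 cents
cost = gemini_flash_cost(100_000, 2_000)
```

Note how the long-context use case stays cheap: even stuffing 100K tokens of context into a single call costs only a few cents, which is exactly why the 1M context window plus this price tier is such a compelling combination.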
</p><p>This week it was also confirmed both on LMsys, and on Scale SEAL leaderboards, that Gemini Flash is a very good coding LLM, beating Claude Sonnet and LLama-3 70B! </p><p>SambaNova + Groq competing at 1000T/s speeds</p><p>What a week for inference speeds! </p><p>SambaNova (an AI startup with $1.1B in investment from Google Ventures, Intel Capital, Samsung, Softbank, founded in 2017) has announced that they broke the 1000T/s inference barrier on Llama-3-8B in full precision mode, using their custom hardware called an RDU (reconfigurable dataflow unit)</p><p>As you can see, this is incredibly fast, really, try it yourself <a target="_blank" href="https://fast.snova.ai/">here</a>. </p><p>Seeing this, the folks at Groq, who had the previous record on super fast inference (as I reported just <a target="_blank" href="https://sub.thursdai.news/p/thursdai-feb-22nd-groq-near-instant?utm_source=publication-search">in February</a>) decided to not let this slide, and released an incredible 20% improvement on their own inference of LLama-3-8B, getting to 1200T/s, showing that they are very competitive. </p><p>This bump in throughput is really significant; many inference providers that use GPUs aren't even hitting 200T/s, and Groq improved their inference by that amount within 1 day of being challenged. </p><p>I had the awesome pleasure to have Rodrigo, the CEO, on the show this week to chat about SambaNova and this incredible achievement, their ability to run this in full precision, and future plans, so definitely give it a listen. </p><p>This week's Buzz (What I learned with WandB this week)</p><p>This week was buzzing at Weights & Biases! After co-hosting a Hackathon with Meta a few weeks ago, we cohosted another Hackathon, this time with Mistral, in Paris.
(where we also announced our new integration with their Finetuning!)</p><p>The organizers, Cerebral Valley, invited us to participate, and it was amazing to see the many projects that use WandB and Weave in their finetuning presentations, including a friend of the pod, Maziyar Panahi, whose team nabbed 2nd place (you can read about their project <a target="_blank" href="https://x.com/MaziyarPanahi/status/1794783772346606070">here</a>) 👏</p><p>Also, I'm going to do a virtual workshop together with my colleague Anish, about prompting and building evals, something we know a thing or two about; it's free, and I would very much love to invite you to register and <a target="_blank" href="https://wandb.ai/site/resources/events/llms-virtual-workshop">learn with us</a>! </p><p>Cartesia AI (<a target="_blank" href="https://play.cartesia.ai/">try it</a>)</p><p>Hot off the press, we're getting a new Audio TTS model, based on the State Space model architecture (remember Mamba?) from a new startup called Cartesia AI, who aim to bring real-time intelligence to on-device compute!
</p><p>The most astonishing thing they released was actually the speed with which their model starts to generate voices, under 150ms, which is effectively instant, and it's a joy to play with <a target="_blank" href="https://play.cartesia.ai/">their playground</a>, just look at how fast it started generating this intro I recorded using their awesome 1920's radio host voice</p><p>Co-founded by Albert Gu, Karan Goel and Arjun Desai (who joined the pod this week), they have shown incredible performance, and also showed that transformer-alternative architectures like SSMs can really be beneficial for audio specifically, just look at this quote!</p><p>On speech, a parameter-matched and optimized Sonic model trained on the same data as a widely used Transformer improves audio quality significantly (20% lower perplexity, 2x lower word error, 1 point higher NISQA quality).</p><p>With lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor) and higher throughput (4x)</p><p>In Open Source news: </p><p>Mistral released Codestral 22B - their flagship code model with a new non-commercial license</p><p>Codestral is now available under the new Mistral license for non-commercial R&D use. With a larger context window of 32K, Codestral outperforms all other models in RepoBench, a long-range evaluation for code generation. Its fill-in-the-middle capability is favorably compared to DeepSeek Coder 33B. </p><p>Codestral is supported in VSCode via a plugin and is accessible through their API, La Plateforme, and Le Chat. </p><p>HuggingFace Chat with tool support (<a target="_blank" href="https://twitter.com/altryne/status/1795491886909530479">X</a>, <a target="_blank" href="https://huggingface.co/chat/">demo</a>)</p><p>This one is really cool, HF added Cohere's Command R+ with tool support, and the tools are using other HF spaces (with ZeroGPU) to add capabilities like image gen, image editing, web search and more!
</p><p>LLM 360 - K2 65B - fully transparent and reproducible (<a target="_blank" href="https://x.com/llm360/status/1795833911580438807">X</a>, <a target="_blank" href="https://t.co/5eM2IaUgm6">Paper</a>, <a target="_blank" href="https://huggingface.co/LLM360/K2">HF</a>, <a target="_blank" href="https://wandb.ai/llm360/K2?nw=29mu6l0zzqq">WandB</a>)</p><p>The awesome team at LLM 360 released K2 65B, which is an open source model that comes very close to Llama 70B on benchmarks, but the most important thing is that they open source everything: code, datasets, technical write-ups, they even open sourced their WandB plots 👏 </p><p>This is so important to the open source community that we must highlight and acknowledge the awesome effort from the LLM360 team of open sourcing so much! </p><p>Tools - Jina reader</p><p>In the tools category, while we haven't discussed this on the pod, I really wanted to highlight Jina Reader. We've had Bo from Jina AI talk to us about Embeddings in past episodes, and since then the Jina folks released this awesome tool that's able to take any URL and parse it in a nice markdown format that's very digestible to LLMs. </p><p>You can pass any URL, and it even does vision understanding! And today they released PDF understanding as well, so you can pass the reader PDF files and have it return nicely formatted text! </p><p>The best part, it's free! (for now at least!)</p><p>And that’s a wrap for today, see you guys next week, and if you found any of this interesting, please share with a friend 🙏 </p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-30-1000-ts-inference</link><guid isPermaLink="false">substack:post:145147054</guid><dc:creator><![CDATA[Alex Volkov, Arjun Desai, Nisten, and Itamar Friedman]]></dc:creator><pubDate>Fri, 31 May 2024 00:10:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/145147054/bd624a40b3c7a6dd92b411e99f70df52.mp3" length="81264998" type="audio/mpeg"/><itunes:author>Alex Volkov, Arjun Desai, Nisten, and Itamar Friedman</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6772</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/145147054/d95856d529da4c63c5c18c0f0c1e5cb4.jpg"/></item><item><title><![CDATA[📅 ThursdAI - May 23 - OpenAI troubles, Microsoft Build, Phi-3 small/large, new Mistral & more AI news]]></title><description><![CDATA[<p>Hello hello everyone, this is Alex, typing these words from beautiful Seattle (really, it only rained once while I was here!) where I'm attending Microsoft's biggest developer conference, BUILD. </p><p>This week we saw OpenAI get in the news from multiple angles, none of them positive, and Microsoft clapped back at Google from last week with tons of new AI product announcements (CoPilot vs Gemini) and a few new PCs with NPUs (Neural Processing Units) that run alongside the CPU/GPU combo we're familiar with. Those NPUs allow for local AI to run on these devices, making them AI native devices!
</p><p>While I'm here I also had the pleasure to participate in the original AI Tinkerers, thanks to my friend <a target="_blank" href="https://twitter.com/jheitzeb">Joe Heitzberg</a>, who operates and runs <a target="_blank" href="https://aitinkerers.org">aitinkerers.org</a> (of which we are a local branch in Denver), and it was amazing to see tons of folks who listen to ThursdAI + read the newsletter and talk about Weave and evaluations with all of them! (Btw, on the left is Vik from Moondream, which we covered multiple times.) </p><p>Ok let's get to the news: </p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs</strong> </p><p>* HuggingFace commits $10M in ZeroGPU (<a target="_blank" href="https://x.com/ClementDelangue/status/1791115403734778185">X</a>)</p><p>* Microsoft open sources Phi-3 mini, Phi-3 small (7B), Medium (14B) and vision models w/ 128K context (<a target="_blank" href="https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/ysharma/Microsoft_Phi-3-Vision-128k">Demo</a>)</p><p>* Mistral 7B 0.3 - Base + Instruct (<a target="_blank" href="https://huggingface.co/mistralai/Mistral-7B-v0.3">HF</a>)</p><p>* LMSys created a "hard prompts" category (<a target="_blank" href="https://x.com/lmsysorg/status/1792625968865026427">X</a>)</p><p>* Cohere for AI releases Aya 23 - 3 models, 101 languages (<a target="_blank" href="https://x.com/CohereForAI/status/1793643648703168807">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Microsoft Build recap - New AI native PCs, Recall functionality, Copilot everywhere </p><p>* Will post a dedicated episode to this on Sunday</p><p>* OpenAI pauses GPT-4o Sky voice because Scarlett Johansson complained</p><p>* Microsoft AI PCs - Copilot+ PCs (<a target="_blank" href="https://stratechery.com/2024/windows-returns/">Blog</a>)</p><p>* Anthropic - <strong>Scaling
Monosemanticity </strong> paper - about mapping the features of an LLM (<a target="_blank" href="https://x.com/AnthropicAI/status/1792935506587656625">X</a>, <a target="_blank" href="https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html">Paper</a>)</p><p>* <strong>Vision & Video</strong></p><p>* OpenBMB - MiniCPM-Llama3-V 2.5 (<a target="_blank" href="https://x.com/OpenBMB/status/1792761578422747567">X</a>, <a target="_blank" href="https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5">HuggingFace</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI pauses Sky voice due to ScarJo hiring legal counsel</p><p>* <strong>Tools & Hardware</strong></p><p>* Humane is looking to sell (<a target="_blank" href="https://x.com/TechCrunch/status/1793254586532143127">blog</a>)</p><p>Open Source LLMs </p><p>Microsoft open sources Phi-3 mini, Phi-3 small (7B), Medium (14B) and vision models w/ 128K context (<a target="_blank" href="https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/ysharma/Microsoft_Phi-3-Vision-128k">Demo</a>)</p><p>Just in time for Build, Microsoft has open sourced the rest of the Phi family of models, specifically the small (7B) and the Medium (14B) models on top of the mini one we just knew as Phi-3. </p><p>All the models have a small context version (4K and 8K) and a large one that goes up to 128K (tho they recommend using the small if you don't need that whole context) and all can run on device super quick. </p><p>Those models have an <strong>MIT license</strong>, so use them as you will, and they deliver incredible performance relative to their size on benchmarks.
Phi-3 mini received an interesting split in the vibes: it was really good for reasoning tasks, but not very creative in its writing, so some folks dismissed it, but it's hard to dismiss these new releases, especially when the benchmarks are that great! </p><p>LMsys just updated their arena to include a hard prompts category (<a target="_blank" href="https://x.com/lmsysorg/status/1792625968865026427">X</a>), which selects for complex, specific and knowledge-based prompts and scores the models on those. Phi-3 mini actually gets a big boost in ELO ranking when filtered on hard prompts and beats GPT-3.5 😮 Can't wait to see how the small and medium versions perform on the arena.</p><p>Mistral gives us function calling in Mistral 0.3 update (<a target="_blank" href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3">HF</a>)</p><p>Just in time for the Mistral hackathon in Paris, Mistral has released an update to the 7B model (and likely will update the MoE 8x7B and 8x22B Mixtrals) with function calling and a new vocab. </p><p>This is awesome all around because function calling is important for agentic capabilities, and it's about time all companies have it, and apparently the way Mistral has it built in matches the Cohere Command R way and is already supported in Ollama, using raw mode. </p><p>Big CO LLMs + APIs</p><p>OpenAI is not having a good week - Sky voice is paused, employees complain</p><p>OpenAI is in hot water this week, starting with pausing the Sky voice (arguably the best, most natural sounding voice out of the ones that launched) due to complaints from Scarlett Johansson about this voice being similar to hers. Scarlett's appearance in the movie Her, and Sam Altman tweeting "her" to celebrate the release of the incredible GPT-4o voice mode, were all talked about when ScarJo released a statement saying she was shocked when her friends and family told her that OpenAI's new voice mode sounds just like her.
</p><p>Spoiler, it doesn't really, and they hired an actress and have had this voice out since September last year, as they outlined in their blog following ScarJo's complaint. </p><p>Now, whether or not there's legal precedent here, given that Sam Altman reached out to Scarlett twice, including once a few days before the event, I won't speculate, but for me, personally, not only does Sky not sound like ScarJo, it was my favorite voice even before they demoed it, and I'm really sad that it's paused, and I think it's unfair to the actress who was hired for her voice. See her own statement: </p><p>Microsoft Build - CoPilot all the things</p><p>I have recorded a Build recap with Ryan Carson from Intel AI and will be posting that as its own episode on Sunday, so look forward to that, but for now, here are the highlights from BUILD:</p><p>* Copilot everywhere, Microsoft builds the CoPilot as a platform</p><p>* AI native laptops with NPU chips for local AI </p><p>* Recall, an on-device AI that lets you search through everything you saw or typed with natural language</p><p>* Github Copilot Workspace + Extensions </p><p>* Microsoft stepping into education by sponsoring Khan Academy, free for all teachers in the US</p><p>* Copilot Team member and Agent - Copilot will do things proactively as your team member</p><p>* GPT-4o voice mode is coming to Windows and to websites! </p><p>Hey, if you like reading this, can you share with 1 friend? It’ll be an awesome way to support this pod/newsletter! </p><p>Anthropic releases the <strong>Scaling Monosemanticity </strong>paper</p><p>This is quite a big thing that happened this week for Mechanistic Interpretability and Alignment, with Anthropic releasing a new paper and examples of their understanding of what an LLM "thinks".
</p><p>They have done incredible work in this area, and now they have scaled it up all the way to production models like Claude Haiku, which shows that this work can actually identify which "features" cause which tokens to be output. </p><p>In the work they highlighted features such as "deception", "bad code" and even a funny one called "Golden Gate Bridge", and showed that clamping these features can affect the model's outputs. </p><p>Once these features have been identified, they can be turned on or off with various levels of strength; for example, they turned the Golden Gate Bridge feature up to the maximum, and the model thought it was the Golden Gate Bridge. </p><p>While that's a funny example, they also found features for racism, bad/wrong code, inner conflict, gender bias, sycophancy and more. You can play around with some examples <a target="_blank" href="https://transformer-circuits.pub/2024/scaling-monosemanticity/umap.html?targetId=1m_1013764&#38;utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-anthropic-cracks-the-llm-genome-project">here</a> and definitely read the full blog if this interests you; overall it shows incredible promise for alignment and steerability of models at large scale. </p><p>This weeks Buzz (What I learned with WandB this week)</p><p>I was demoing Weave all week long in Seattle, first at the AI Tinkerers event, and then at MSFT BUILD. </p><p>They had me record my talk as a video in advance and then give a 5-minute demo on stage, which (was not stressful at all!), so here's the pre-recorded video, which turned out really good! 
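Circling back to the monosemanticity work for a second: the clamping mechanism can be sketched in a few lines. This is a toy illustration with made-up numbers and feature names, not Anthropic's actual model or feature dictionary: a sparse autoencoder expresses an activation vector as a weighted sum of learned feature directions, and steering amounts to fixing one feature's coefficient before reconstructing.

```python
# Toy sketch of "feature clamping" (hypothetical values, not Anthropic's setup):
# each row of W_dec is a learned direction for one interpretable feature.
W_dec = [
    [1.0, 0.0, 0.0],   # feature 0: e.g. "deception"
    [0.0, 1.0, 0.0],   # feature 1: e.g. "bad code"
    [0.0, 0.0, 1.0],   # feature 2: e.g. "Golden Gate Bridge"
]

def reconstruct(coeffs, clamp=None):
    """Rebuild the activation vector; optionally clamp one feature's coefficient."""
    c = list(coeffs)
    if clamp is not None:
        idx, value = clamp
        c[idx] = value
    dim = len(W_dec[0])
    return [sum(c[f] * W_dec[f][d] for f in range(len(W_dec))) for d in range(dim)]

baseline = reconstruct([0.2, 0.0, 0.5])
steered = reconstruct([0.2, 0.0, 0.5], clamp=(2, 10.0))  # crank the bridge feature
```

In the real paper the dictionary has millions of features learned over a model's residual stream, but the steering operation is conceptually this simple rescale-and-reconstruct step.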
</p><p>Also, we're sponsoring the <a target="_blank" href="https://partiful.com/e/EFvUkVMiTCP2cVrRU1cD">Mistral Hackathon</a> this weekend in Paris, so if you're in the EU and want to hack with us, please go; it's hosted by Cerebral Valley and HuggingFace and us → </p><p>Vision</p><p>Phi-3 mini Vision </p><p>In addition to Phi-3 small and Phi-3 Medium, Microsoft released Phi-3 mini with vision, which does an incredible job understanding text and images! (You can demo it <a target="_blank" href="https://huggingface.co/spaces/ysharma/Microsoft_Phi-3-Vision-128k">right here</a>)</p><p>Interestingly, Phi-3 mini with vision has a 128K context window, which is amazing, and it even beats Mistral 7B as a language model! Give it a try</p><p>OpenBMB - MiniCPM-Llama3-V 2.5 (<a target="_blank" href="https://x.com/OpenBMB/status/1792761578422747567">X</a>, <a target="_blank" href="https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5">HuggingFace</a>, <a target="_blank" href="https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5">Demo</a>)</p><p>Two state-of-the-art vision models in one week? Well, that's incredible. A company I hadn't heard of, OpenBMB, has released MiniCPM 7B trained on top of Llama3, and they claim that it outperforms Phi-3 vision</p><p>They claim that it has GPT-4 vision level performance, achieving a <strong>700+ score on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max and Gemini Pro</strong></p><p>In my tests, Phi-3 performed a bit better; I showed both the same picture, and Phi was more factual on the hard prompts: </p><p>Phi-3 Vision:</p><p>And that's it for this week's newsletter. Look out for the Sunday special full MSFT Build recap, and definitely give the whole talk a listen; it's full of my co-hosts and their great analysis of this week's events! </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-23-openai-troubles-microsoft</link><guid isPermaLink="false">substack:post:144922497</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 23 May 2024 20:07:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144922497/4926913c8f349b9f1fb6fefe31621cd6.mp3" length="74166381" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6180</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/144922497/94e9d63fb8e2a557d287d66b2c298fb9.jpg"/></item><item><title><![CDATA[📅 ThursdAI - May 16 - OpenAI GPT-4o, Google IO recap, LLama3 hackathon, Yi 1.5, Nous Hermes Merge & more AI news]]></title><description><![CDATA[<p>Wow, holy s**t, insane, overwhelming, incredible, the future is here!, "still not there" - there are many more words to describe this past week. (TL;DR at the end of the blogpost)</p><p>I had a feeling it was going to be a big week, and the companies did NOT disappoint, so this is going to be a very big newsletter as well. </p><p>As you may have read last week, I was very lucky to be in San Francisco the weekend before Google IO, to co-host a hackathon with the Meta Llama-3 team, and it was a blast; I will add my notes on that in the This weeks Buzz section. 
</p><p>Then on Monday, we all got to watch the crazy announcements from OpenAI, namely a new flagship model called <strong>GPT-4o</strong> (we were right, it previously was im-also-a-good-gpt2-chatbot) that's twice as fast, 50% cheaper (in English; significantly more so in other languages, more on that later) and is Omni (that's the o), which means it is trained end to end with voice, vision and text as inputs, and can generate text, voice and images as outputs. </p><p>A true <strong>MMIO</strong> (multimodal on inputs and outputs, that's not the official term) is here, and it has some very, very surprising capabilities that blew us all away, namely the ability to ask the model to "talk faster" or "more sarcasm in your voice" or "sing like a pirate". Though we didn't yet get that functionality with the GPT-4o model, it is absolutely, incredibly exciting. Oh, and it's available to everyone for free! </p><p><strong>That's GPT-4 level intelligence, for free for everyone, without having to log in!</strong></p><p>What's also exciting was how immediate it was: apparently not only is the model itself faster (unclear if it's due to newer GPUs, distillation, some other crazy advancements, or all of the above), but training an end-to-end omnimodel also cuts latency so much that it becomes an incredibly immediate conversation partner, one that you can interrupt, ask to recover from a mistake, and that can hold a conversation very, very well. </p><p>So well, in fact, that the Waifu future (digital girlfriends/wives) seems very close for the folks who would want it. While we didn't get to try it (we got GPT-4o but not the new voice mode, as Sam confirmed), OpenAI released a bunch of videos of their employees chatting with Omni (that's my nickname, use it if you'd like), and many online highlighted how thirsty / flirty it sounded. 
I downloaded all the videos for an X thread and I named one girlfriend.mp4 - well, just judge for yourself why: </p><p>OK, that's not all that OpenAI updated or shipped; they also updated the tokenizer, which is incredible news for folks all around, specifically the rest of the world. The new tokenizer reduces the previous "foreign language tax" by a LOT, making the model way, way cheaper for the rest of the world as well</p><p>One last announcement from OpenAI was the <strong>desktop app</strong> experience, and this one I actually got to use a bit, and it's incredible. macOS only for now, this app comes with a launcher shortcut (kind of like Raycast) that lets you talk to ChatGPT right then and there, without opening a new tab, without additional interruptions, and it can even understand what you see on the screen, help you understand code or jokes, or look up information. Here's just one example I had over at X. And sure, you could always do this with another tab, but the ability to do it without a context switch is a huge win. </p><p>OpenAI held their demo 1 day before Google IO, but even during the excitement about Google IO, they announced that Ilya is not only alive, but is also departing from OpenAI, which was followed by an announcement from Jan Leike (who co-headed the superalignment team together with Ilya) that he left as well. This, to me, seemed like well-executed timing to dampen the Google news a bit. </p><p>Google is BACK, backer than ever, Alex's Google IO recap</p><p>On Tuesday morning I showed up to the Shoreline Amphitheatre in Mountain View, together with a creators/influencers delegation, as we all watched the incredible firehose of announcements that Google had prepared for us. </p><p>TL;DR - Google is adding Gemini and AI into all its products across Workspace (Gmail, Chat, Docs) and into other cloud services like Photos, where you'll now be able to ask your photo library for specific moments. 
They introduced over 50 product updates, and I don't think it makes sense to cover all of them here, so I'll focus on what we do best.</p><p>"Google will do the Googling for you" </p><p>Gemini 1.5 Pro is now their flagship model (remember Ultra? Where is that? 🤔) and has been extended to a 2M-token context window! Additionally, we got a new model called Gemini Flash, which is way faster and very cheap (up to 128K tokens, then it becomes 2x more expensive)</p><p>Gemini Flash is multimodal as well and has a 1M context window, making it an incredible deal if you have, for example, videos to process. </p><p>Kind of hidden but important was a <a target="_blank" href="https://ai.google.dev/gemini-api/docs/caching">caching</a> announcement, which IMO is a big deal - big enough that it could pose a serious risk to RAG-based companies. Google claims they have a way to cache the LLM activations for most of your context, so a developer won't have to pay to repeatedly send the same thing over and over again (which happens in most chat applications), and it will significantly speed up work with larger context windows. </p><p>They also mentioned Gemini Nano, an on-device Gemini that's also multimodal, which can, for example, monitor calls in real time for older folks and alert them about scams; one of the cooler announcements was that Nano is going to be baked into the Chrome browser. </p><p>With the Gemma models being upgraded too, there's not a product at Google that Gemini is not going to get infused into, and while they counted 131 "AI" mentions during the keynote, I'm pretty sure Gemini was mentioned way more! </p><p>Project Astra - A universal AI agent helpful in everyday life</p><p>After a few of the announcements from Sundar, the (newly knighted) Sir Demis Hassabis came out and talked about DeepMind research and AlphaFold 3, and then turned to Project Astra.</p><p>This demo was really cool and kind of similar to the GPT-4o conversation, but also different. 
I'll let you just watch it yourself: </p><p>TK: project astra demo</p><p>And this is no fake; they actually had booths with Project Astra test stations, and I got to chat with it (I came back 3 times) and had a personal demo from Josh Woodward (VP of Labs), and it works, and works fast! It sometimes disconnects, and sometimes there are misunderstandings, like when multiple folks are speaking, but overall it's very, very impressive. </p><p>Remember the infamous rubber ducky video that Google edited, which caused a major uproar when we found out? It's basically that, on steroids, and real, and quite fast.</p><p>Astra has decent short-term memory, so if you ask it where something was, it will remember; Google cleverly used that trick to also show that they are working on augmented reality glasses with Astra built in, which would make amazing sense. </p><p>Open Source LLMs</p><p>Google open sourced PaliGemma VLM</p><p>Giving us something in the open source department, adding to previous models like RecurrentGemma, Google has uploaded a whopping 116 different checkpoints of a new VLM called PaliGemma to the hub; it's a state-of-the-art vision model at 3B. </p><p>It's optimized for finetuning on different workloads such as visual Q&A, image and short-video captioning, and even segmentation! </p><p>They also mentioned that Gemma 2 is coming next month; it will be a 27B-parameter model that's optimized to run on a single TPU/GPU. 
</p><p>Nous Research Hermes 2 Θ (Theta) - their first merge!</p><p>Collaborating with Charles Goddard from Arcee (the creators of MergeKit), Teknium and friends merged the recently trained Hermes 2 Pro with Llama 3 Instruct to get a model that performs well on all the tasks that Llama-3 is good at, while maintaining the capabilities of Hermes (function calling, JSON mode) </p><p>Yi releases 1.5 with an Apache 2 license</p><p>The folks at <a target="_blank" href="http://01.ai">01.ai</a> released Yi 1.5, with 6B, 9B and 34B versions (base and chat finetunes) </p><p>Showing decent benchmarks on math and Chinese, the 34B beats Llama on some of these tasks while being 2x smaller, which is very impressive</p><p>This weeks Buzz - Llama3 hackathon with Meta</p><p>Before all the craziness that was announced this week, I participated in and judged the first-ever Llama-3 hackathon. It was quite incredible: with over 350 hackers participating, and Groq, Lambda, Meta, Ollama and others sponsoring and giving talks and workshops, it was an incredible 24 hours at Shack15 in SF (where Cerebral Valley hosts their hackathons) </p><p>Winning hacks were really innovative, ranging from completely open source smart glasses for under $20, to an LLM debate platform with an LLM judge on any moral issue, and one project that was able to jailbreak Llama by doing some advanced LLM arithmetic. Kudos to the teams for winning, and it was amazing to see how many of them adopted <a target="_blank" href="http://wandb.me/weave">Weave</a> as their observability framework, as it was really easy to integrate. </p><p>Oh, and I got to co-judge with the 🐐 of HuggingFace</p><p>That's all the notes for this week - even though there was a LOT more, check out the TL;DR, and see you here next week, which I'll be recording from Seattle, where I'll be participating in the Microsoft BUILD event, so we'll see Microsoft's answer to Google IO as well. If you're coming to BUILD, come by our booth and give me a high five! 
</p><p>TL;DR of all topics covered: </p><p>* OpenAI Announcements</p><p>* GPT-4o</p><p>* Voice mode</p><p>* Desktop App</p><p>* Google IO recap:</p><p>* <strong>Google Gemini</strong></p><p>* <strong>Gemini 1.5 Pro:</strong> Available globally to developers with a 2-million-token context window, enabling it to handle larger and more complex tasks.</p><p>* <strong>Gemini 1.5 Flash:</strong> A faster and less expensive version of Gemini, optimized for tasks requiring low latency.</p><p>* <strong>Gemini Nano with Multimodality:</strong> An on-device model that processes various inputs like text, photos, audio, web content, and social videos.</p><p>* <strong>Project Astra:</strong> An AI agent capable of understanding and responding to live video and audio in real-time.</p><p>* <strong>Google Search</strong></p><p>* <strong>AI Overviews in Search Results:</strong> Provides quick summaries and relevant information for complex search queries.</p><p>* <strong>Video Search with AI:</strong> Allows users to search by recording a video, with Google's AI processing it to pull up relevant answers.</p><p>* <strong>Google Workspace</strong></p><p>* <strong>Gemini-powered features in Gmail, Docs, Sheets, and Meet:</strong> Including summarizing conversations, providing meeting highlights, and processing data requests.</p><p>* <strong>"Chip":</strong> An AI teammate in Google Chat that assists with various tasks by accessing information across Google services.</p><p>* <strong>Google Photos</strong></p><p>* <strong>"Ask Photos":</strong> Allows users to search for specific items in photos using natural language queries, powered by Gemini.</p><p>* <strong>Video Generation</strong></p><p>* <strong>Veo Generative Video:</strong> Creates 1080p videos from text prompts, offering cinematic effects and editing capabilities.</p><p>* <strong>Other Notable AI Announcements</strong></p><p>* <strong>NotebookLM:</strong> An AI tool to organize and interact with various types of information 
(documents, PDFs, notes, etc.), allowing users to ask questions about the combined information.</p><p>* <strong>Video Overviews (Prototyping):</strong> A feature within NotebookLM that generates audio summaries from uploaded documents.</p><p>* <strong>Code VR:</strong> A generative video AI model capable of creating high-quality videos from various prompts.</p><p>* <strong>AI Agents:</strong> A demonstration showcasing how AI agents could automate tasks across different software and systems.</p><p>* <strong>Generative Music:</strong> Advancements in AI music generation were implied but not detailed.</p><p>* Open Source LLMs </p><p>* Google PaliGemma 3B  - sota open base VLM (<a target="_blank" href="https://huggingface.co/blog/paligemma">Blog</a>)</p><p>* Gemma 2 - 27B coming next month</p><p>* Hermes 2 Θ (Theta) - Merge of Hermes Pro & Llama-instruct (<a target="_blank" href="https://twitter.com/Teknium1/status/1790795557021372575">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-2-Theta-Llama-3-8B-GGUF">HF</a>)</p><p>* Yi 1.5 - Apache 2 licensed 6B, 9B and 34B (<a target="_blank" href="https://twitter.com/_philschmid/status/1789665086564405750">X</a>)</p><p>* Tiger Lab - MMLU-pro - a harder MMLU with 12K questions (<a target="_blank" href="https://x.com/WenhuChen/status/1790597967319007564">X</a>, <a target="_blank" href="https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro">HuggingFace</a>)</p><p>* This weeks Buzz (What I learned with WandB this week) </p><p>* Llama3 hackathon with Meta, Cerebral Valley, HuggingFace and Weights & Biases</p><p>* Vision & Video</p><p>* Google announces VEO - High quality cinematic generative video generation (<a target="_blank" href="https://deepmind.google/technologies/veo/?utm_source=x&#38;utm_medium=social&#38;utm_campaign=&#38;utm_content=">X</a>)</p><p>* AI Art & Diffusion & 3D</p><p>* Google announces Imagen3 - their latest Gen AI art model (<a target="_blank" 
href="https://deepmind.google/technologies/imagen-3/?utm_source=x&#38;utm_medium=social&#38;utm_campaign=&#38;utm_content=">Blog</a>)</p><p>* Tools</p><p>* Cursor trained a model that does 1000 tokens/s for editing 😮  (<a target="_blank" href="https://twitter.com/amanrsanger/status/1790947733899203027">X</a>)</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-16-openai-gpt-4o-google</link><guid isPermaLink="false">substack:post:144705685</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 17 May 2024 00:34:23 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144705685/6afbb3b7bb09f2289ef4355113f76c16.mp3" length="82353016" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6863</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/144705685/9c71b48683db272c0161a20e426f9fdb.jpg"/></item><item><title><![CDATA[📅 ThursdAI - May 9 - AlphaFold 3, im-a-good-gpt2-chatbot, Open Devin SOTA on SWE-Bench, DeepSeek V2 super cheap + interview with OpenUI creator & more AI news]]></title><description><![CDATA[<p>Hey 👋 (show notes and links a bit below)</p><p>This week has been a great AI week; however, it does feel a bit like the "quiet before the storm", with Google I/O on Tuesday next week (which I'll be covering on the ground at Shoreline!) 
and rumors that OpenAI is not just going to let Google have all the spotlight!</p><p>Early this week, we got 2 new models on LMsys, <strong>im-a-good-gpt2-chatbot</strong> and <strong>im-also-a-good-gpt2-chatbot</strong>, and we've now confirmed that they are from OpenAI. Folks have been testing them with logic puzzles and role play and have been saying great things, so maybe that's what we'll get from OpenAI soon?</p><p>Also on the show today, we had a BUNCH of guests, and as you know, I love chatting with the folks who make the news, so we were honored to host Xingyao Wang and Graham Neubig, core maintainers of Open Devin (which just broke SOTA on Swe-Bench this week!), and then we had friends of the pod Tanishq Abraham and Parmita Mishra dive deep into AlphaFold 3 from Google (both are medical / bio experts).</p><p>Also this week, OpenUI from Chris Van Pelt (Co-founder & CIO at Weights & Biases) has been blowing up, taking the #1 Github trending spot, and I had the pleasure of inviting Chris to chat about it on the show!</p><p>Let's delve into this (yes, this is I, Alex the human, using Delve as a joke, don't get triggered 😉)</p><p>TL;DR of all topics covered (trying something new, my raw notes with all the links and bulletpoints are at the end of the newsletter)</p><p>* <strong>Open Source LLMs</strong></p><p>* OpenDevin getting SOTA on Swe-Bench with 21% (<a target="_blank" href="https://twitter.com/xingyaow_/status/1787862432888545665">X</a>, <a target="_blank" href="https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/">Blog</a>)</p><p>* DeepSeek V2 - 236B (21B Active) MoE (<a target="_blank" href="https://x.com/deepseek_ai/status/1787478986731429933">X</a>, <a target="_blank" href="https://t.co/v1TFy7LHNy">Try It</a>)</p><p>* Weights & Biases OpenUI blows past 11K stars (<a target="_blank" href="https://x.com/maximelabonne/status/1788572494812577992">X</a>, <a target="_blank" href="https://github.com/wandb/openui">Github</a>, <a target="_blank" 
href="https://openui.fly.dev/">Try It</a>)</p><p>* LLama-3 120B Chonker Merge from Maxime Labonne (<a target="_blank" href="https://twitter.com/maximelabonne/status/1788572494812577992">X</a>, <a target="_blank" href="https://huggingface.co/lmstudio-community/Meta-Llama-3-120B-Instruct-GGUF">HF</a>)</p><p>* Alignment Lab open sources Buzz - 31M rows training dataset (<a target="_blank" href="https://twitter.com/alignment_lab/status/1788571492046758392">X</a>, <a target="_blank" href="https://huggingface.co/datasets/H-D-T/Buzz/viewer">HF</a>)</p><p>* xLSTM - new transformer alternative (<a target="_blank" href="https://x.com/HochreiterSepp/status/1788072466675335185">X</a>, <a target="_blank" href="https://arxiv.org/abs/2405.04517">Paper</a>, <a target="_blank" href="https://twitter.com/ArmenAgha/status/1788110043046682809">Critique</a>)</p><p>* <strong>Benchmarks & Eval updates</strong></p><p>* LLama-3 still in 6th place (<a target="_blank" href="https://lmsys.org/blog/2024-05-08-llama3/">LMsys analysis</a>)</p><p>* Reka Core gets awesome 7th place and Qwen-Max breaks top 10 (<a target="_blank" href="https://twitter.com/lmsysorg/status/1788329885746045442">X</a>)</p><p>* No upsets in LLM leaderboard</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google DeepMind announces AlphaFold-3 (<a target="_blank" href="https://www.nature.com/articles/s41586-024-07487-w">Paper</a>, <a target="_blank" href="https://twitter.com/demishassabis/status/1788229162563420560">Announcement</a>)</p><p>* OpenAI publishes their Model Spec (<a target="_blank" href="https://cdn.openai.com/spec/model-spec-2024-05-08.html">Spec</a>)</p><p>* OpenAI tests 2 models on LMsys (im-also-a-good-gpt2-chatbot & im-a-good-gpt2-chatbot)</p><p>* OpenAI joins Coalition for Content Provenance and Authenticity (<a target="_blank" href="https://openai.com/index/understanding-the-source-of-what-we-see-and-hear-online/">Blog</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Udio adds in-painting - change 
parts of songs (<a target="_blank" href="https://x.com/udiomusic/status/1788243716676759668">X</a>)</p><p>* 11Labs joins the AI Audio race (<a target="_blank" href="https://twitter.com/ammaar/status/1788630726532899266">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance PuLID - new high quality ID customization (<a target="_blank" href="https://huggingface.co/spaces/yanze/PuLID">Demo</a>, <a target="_blank" href="https://github.com/ToTheBeginning/PuLID">Github</a>, <a target="_blank" href="https://arxiv.org/pdf/2404.16022">Paper</a>)</p><p>* <strong>Tools & Hardware</strong></p><p>* Went to the Museum with Rabbit R1 (<a target="_blank" href="https://twitter.com/altryne/status/1787508048488956233">My Thread</a>)</p><p>* <strong>Co-Hosts and Guests</strong></p><p>* Graham Neubig (<a target="_blank" href="https://twitter.com/gneubig">@gneubig</a>) & Xingyao Wang (<a target="_blank" href="https://twitter.com/xingyaow_">@xingyaow_</a>) from Open Devin</p><p>* Chris Van Pelt (<a target="_blank" href="https://twitter.com/vanpelt">@vanpelt</a>) from Weights & Biases</p><p>* Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten">@nisten</a>) - Cohost</p><p>* Tanishq Abraham (<a target="_blank" href="https://twitter.com/iScienceLuvr">@iScienceLuvr</a>)</p><p>* Parmita Mishra (<a target="_blank" href="https://twitter.com/prmshra">@prmshra</a>)</p><p>* Wolfram Ravenwolf (<a target="_blank" href="https://twitter.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Ryan Carson (<a target="_blank" href="https://twitter.com/ryancarson">@ryancarson</a>)</p><p>Open Source LLMs</p><p>Open Devin getting a whopping 21% on SWE-Bench (<a target="_blank" href="https://twitter.com/xingyaow_/status/1787862432888545665">X</a>, <a target="_blank" href="https://xwang.dev/blog/2024/opendevin-codeact-1.0-swebench/">Blog</a>)</p><p>Open Devin started as a tweet from our friend Junyang Lin (on the Qwen team at Alibaba) to get an open source alternative to the very popular 
Devin code agent from Cognition Lab (recently valued at $2B 🤯), and 8 weeks later, with tons of open source contributions and >100 contributors, they have almost 25K stars on <a target="_blank" href="https://github.com/OpenDevin/OpenDevin">Github</a> and now claim a state-of-the-art score on the very hard Swe-Bench Lite <a target="_blank" href="https://www.swebench.com/">benchmark</a>, beating Devin and Swe-Agent (which scored 18%)</p><p>They have done so by using the <strong>CodeAct</strong> framework developed by Xingyao, and it's honestly incredible to see an open source project catch up to and beat a very well funded AI lab within 8 weeks! Kudos to the OpenDevin folks for the organization, and amazing results!</p><p>DeepSeek v2 - huge MoE with 236B (21B active) parameters (<a target="_blank" href="https://x.com/deepseek_ai/status/1787478986731429933">X</a>, <a target="_blank" href="https://t.co/v1TFy7LHNy">Try It</a>)</p><p>The folks at DeepSeek are releasing this huge MoE (the biggest we've seen in terms of experts) with 160 experts and 6 experts activated per forward pass - a similar trend to the Snowflake team's, just extended even further. They also introduce a lot of technical details and optimizations to the KV cache.</p><p>With benchmark results getting close to GPT-4, DeepSeek wants to take the crown for being the cheapest smartest model you can run - not only in open source, btw; they are now offering this model at an incredible $0.28/1M tokens. That's 28 cents per 1M tokens!</p><p>The closest model in price was Haiku at $0.25 and GPT-3.5 at $0.50. This is quite an incredible deal for a model with 32K context (128K in open source) and these metrics.</p><p>Also notable is the training cost: they claim that it took them 1/5 of what Llama-3 cost Meta, which is also incredible. 
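To put those DeepSeek numbers in perspective, here's a quick back-of-the-envelope sketch; the per-1M-token prices are the ones quoted above, while the monthly token volume is a made-up example:

```python
# Quoted per-1M-token prices in USD (from the comparison above);
# the 50M tokens/month workload below is a hypothetical example.
PRICES = {"deepseek-v2": 0.28, "claude-haiku": 0.25, "gpt-3.5": 0.50}

def cost_usd(model: str, tokens: int) -> float:
    """Dollar cost of processing `tokens` tokens at the quoted per-1M rate."""
    return PRICES[model] * tokens / 1_000_000

monthly = {m: cost_usd(m, 50_000_000) for m in PRICES}  # 50M tokens/month

# The MoE efficiency DeepSeek leans on: only 6 of 160 experts fire per
# forward pass, so roughly 21B of the 236B parameters are active per token.
active_fraction = 21 / 236  # ≈ 0.09
```

At that hypothetical volume, the whole month costs about $14 on DeepSeek V2 versus $25 on GPT-3.5, which is the kind of gap that makes the "cheapest smartest model" claim concrete.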
Unfortunately, running this model locally is a no-go for most of us 🙂</p><p>I would mention here that metrics are not everything, as this model fails quite humorously on my basic logic tests</p><p>Llama-3 120B chonker merge from Maxime Labonne (<a target="_blank" href="https://twitter.com/search?q=from%3Aspectate_or+120B">X</a>, <a target="_blank" href="https://huggingface.co/lmstudio-community/Meta-Llama-3-120B-Instruct-GGUF">HF</a>)</p><p>We've covered merges before, and we've had the awesome Maxime Labonne talk to us at length about model <a target="_blank" href="https://sub.thursdai.news/p/merge-deepdive-maxime-labonne?utm_source=publication-search">merging on ThursdAI</a>, but I've been waiting for Llama-3 merges, and Maxime did NOT disappoint!</p><p>A whopping 120B Llama (Maxime added 50 layers to the 70B Llama3) is doing the rounds, and folks are claiming that Maxime achieved AGI 😂 It's really funny; this model is... something else.</p><p>Here's just one example that Maxime shared, where it goes into an existential crisis over a very simple logic question - a question that Llama-3 answers OK with some help, but this... I've never seen this. Don't forget that merging involves no additional training; it's mixing layers from the same model, so... we still have no idea what merging does to a model, but... some brain damage is definitely occurring.</p><p>Oh, and it also comes up with new words!</p><p></p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p>Big CO LLMs + APIs</p><p>OpenAI publishes Model Spec (<a target="_blank" href="https://twitter.com/joannejang/status/1788255370504220940">X</a>, <a target="_blank" href="https://cdn.openai.com/spec/model-spec-2024-05-08.html">Spec</a>, <a target="_blank" href="https://openai.com/index/introducing-the-model-spec">Blog</a>)</p><p>OpenAI published, and is inviting engagement and <a target="_blank" href="https://openai.com/form/model-spec-feedback/">feedback</a> on, their internal set of rules for how their models should behave. Anthropic has something similar with Constitutional AI.</p><p>I specifically liked the new chain of command (Platform > Developer > User > Tool) branding they added to the models, making OpenAI the Platform, changing "system" prompts to "developer", and having the user be the user. Very welcome renaming and clarifications (h/t <a target="_blank" href="https://x.com/swyx/status/1788383796225573017">Swyx</a> for his analysis)</p><p>Here is a summarized version of OpenAI's new rules of robotics (thanks to Ethan Mollick):</p><p>* Follow the chain of command: Platform > Developer > User > Tool</p><p>* Comply with applicable laws</p><p>* Don't provide info hazards</p><p>* Protect people's privacy</p><p>* Don't respond with NSFW content</p><p>A very welcome effort from OpenAI; showing this spec in the open and inviting feedback is greatly appreciated!</p><p>This comes on top of a pretty big week for OpenAI: announcing an integration with Stack Overflow, <a target="_blank" href="https://openai.com/index/understanding-the-source-of-what-we-see-and-hear-online">joining</a> the Coalition for Content Provenance and Authenticity, and embedding watermarks in SORA and DALL-E images, telling us they have built a classifier that detects AI images with 96% certainty!</p><p>im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot</p><p>Following last week's gpt2-chat mystery, Sam 
Altman trolled us with this tweet</p><p>And then we got 2 new models on LMSys, im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot, and the timeline exploded with folks trying all their best logic puzzles on these two models, trying to understand what they are. Are they GPT-5? GPT-4.5? Maybe a smaller version of GPT-2 that's pretrained on tons of new tokens?</p><p>I think we may see the answer soon, but it's clear that both these models are really good, doing well on logic (better than Llama-70B, and sometimes Claude Opus as well)</p><p>And the speculation is pretty much over: we know OpenAI is behind them after seeing this oopsie on the Arena 😂</p><p>You can try these models as well; they seem to be heavily favored in the random selection of models, but they show up only in battle mode, so you may have to try a few times: <a target="_blank" href="https://chat.lmsys.org/">https://chat.lmsys.org/</a></p><p>Google DeepMind announces AlphaFold3 (<a target="_blank" href="https://www.nature.com/articles/s41586-024-07487-w">Paper</a>, <a target="_blank" href="https://twitter.com/maxjaderberg/status/1788222920365309952">Announcement</a>)</p><p>Developed by DeepMind and Isomorphic Labs, AlphaFold previously predicted the structure of nearly every protein known to science, and now AlphaFold 3 has been announced, which can predict the structure of other biological complexes as well, paving the way for new drugs and treatments.</p><p>What's new here is that they are using diffusion (yes, like Stable Diffusion), starting with noise and then denoising to get a structure, and this method is 50% more accurate than existing methods.</p><p>If you'd like more info about this very important paper, look no further than the awesome Two Minute Papers YouTube channel, which did a thorough analysis <a target="_blank" href="https://www.youtube.com/watch?v=Mz7Qp73lj9o&#38;t=337s">here</a>, and listen to the Isomorphic Labs podcast with Weights & Biases CEO Lukas on <a target="_blank" 
href="https://www.youtube.com/watch?v=-hl0jpwWbV4&#38;list=PLD80i8An1OEEb1jP0sjEyiLG8ULRXFob_&#38;index=1">Gradient Dissent</a></p><p>They also released <a target="_blank" href="https://golgi.sandbox.google.com/about">AlphaFold server</a>, a free research tool allowing scientists to access these capabilities and predict structures for non-commercial use; however, it seems that it's somewhat limited (from a conversation we had with a researcher on stage)</p><p>This week's Buzz (What I learned with WandB this week)</p><p>This week was amazing for open source and Weights & Biases; it's not every week that a side project from a CIO blows up on... well, everywhere. #1 trending on Github for TypeScript and #6 overall, OpenUI (<a target="_blank" href="https://github.com/wandb/openui">Github</a>) has passed 12K stars as people are super excited about being able to build UIs with LLMs, but in open source.</p><p>I had the awesome pleasure of hosting Chris on the show as he talked about the inspiration and future plans, and he gave everyone his email to send him feedback (a decision which I hope he doesn't regret 😂), so definitely check out the last part of the show for that.</p><p>Meanwhile, here's my quick tutorial and reaction about OpenUI, but just give it a try <a target="_blank" href="https://openui.fly.dev/">here</a> and build something cool!</p><p>Vision</p><p>Some news was shared with me, but respecting the team, I decided not to include it in the newsletter ahead of time; expect open source to come close to GPT4-V next week 👀</p><p>Voice & Audio</p><p>11Labs joins the AI music race (<a target="_blank" href="https://twitter.com/ammaar/status/1788630726532899266">X</a>)</p><p>Breaking news from 11Labs, which happened during the show (but we didn't notice), is that they are stepping into the AI music scene, and it sounds pretty good!</p><p></p><p>Udio adds Audio Inpainting (<a target="_blank" href="https://twitter.com/udiomusic/status/1788243716676759668">X</a>, <a target="_blank"
href="https://www.udio.com/">Udio</a>)</p><p>This is really exciting: Udio decided to justify their investment and ship something novel!</p><p>Inpainting has been around in diffusion models, and now selecting a piece of a song on Udio and having Udio rework it is so seamless that it will definitely come to every other AI music product, given how powerful this is!</p><p>Udio also announced their pricing tiers this week, and it seems that this is the first feature that requires a subscription</p><p>AI Art & Diffusion</p><p>ByteDance PuLID for training-free ID customization (<a target="_blank" href="https://huggingface.co/spaces/yanze/PuLID">Demo</a>, <a target="_blank" href="https://github.com/ToTheBeginning/PuLID">Github</a>, <a target="_blank" href="https://arxiv.org/pdf/2404.16022">Paper</a>)</p><p>It used to take a LONG time to finetune something like Stable Diffusion to generate an image of your face using DreamBooth; then things like LoRA started making this much easier, but it still required training.</p><p>The latest crop of approaches for AI art customization is called ID customization, and ByteDance just released a novel, training-free version called PuLID, which works very, very fast with very decent results! (<a target="_blank" href="https://huggingface.co/spaces/yanze/PuLID">really, try it on your own face</a>) Previous works like InstantID and IP-Adapter are also worth calling out; however, PuLID seems to be the state of the art here! 🔥</p><p>And that's it for the week, well who am I kidding, there's so much more we covered and I just didn't have the space to go deep into everything, but definitely check out the podcast episode for the whole conversation. See you next week, it's going to be 🔥 because of IO and ... other things 👀</p><p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-9-alphafold-3-im-a-good</link><guid isPermaLink="false">substack:post:144484408</guid><dc:creator><![CDATA[Alex Volkov, Xingyao Wang, Nisten, Chris Van Pelt, and Graham Neubig]]></dc:creator><pubDate>Fri, 10 May 2024 00:14:17 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144484408/d2643e7684d5ec388257341fd149c45b.mp3" length="77652716" type="audio/mpeg"/><itunes:author>Alex Volkov, Xingyao Wang, Nisten, Chris Van Pelt, and Graham Neubig</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6471</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/144484408/21b10e8cfc42dc9771528a663a88e498.jpg"/></item><item><title><![CDATA[ThursdAI - May 2nd - New GPT2? Copilot Workspace, Evals and Vibes from Reka, LLama3 1M context (+ Nous finetune) & more AI news]]></title><description><![CDATA[<p>Hey 👋  Look it May or May not be the first AI newsletter you get in May, but it's for sure going to be a very information dense one. As we had an amazing conversation on the live recording today, over 1K folks joined to listen to the first May updates from ThursdAI. </p><p>As you May know by now, I just love giving the stage to folks who are the creators of the actual news I get to cover from week to week, and this week, we had again, 2 of those conversations. </p><p>First we chatted with Piotr Padlewski from Reka, the author on the new Vibe-Eval <a target="_blank" href="https://publications.reka.ai/reka-vibe-eval.pdf">paper</a> & Dataset which they published this week. We've had Yi and Max from Reka on the show before, but it was Piotr's first time and he was super super knowledgeable, and was really fun to chat with. 
</p><p>Specifically, as we at Weights & Biases launch a new product called Weave (which you should check out at <a target="_blank" href="https://wandb.me/weave">https://wandb.me/weave</a>) I'm getting a LOT more interested in evaluations and LLM scoring, and in fact, we started the whole show today with a full segment on evals and vibe checks, and covered a new paper from Scale about overfitting.  </p><p>The second deep dive was with my friend Idan Gazit, from GitHub Next, about the new iteration of Github Copilot, called Copilot Workspace. It was a great one, and you should definitely give it a listen as well</p><p></p><p>TL;DR of all topics covered + show notes </p><p>* <strong>Scores and Evals</strong></p><p>* No notable changes, LLama-3 is still #6 on LMsys</p><p>* gpt2-chat came and went (<a target="_blank" href="https://rentry.org/GPT2">in-depth chan writeup</a>)</p><p>* Scale checked for data contamination on GSM8K using GSM-1K (<a target="_blank" href="https://twitter.com/hughbzhang/status/1785877026794356858">Announcement</a>, <a target="_blank" href="https://arxiv.org/abs/2405.00332">Paper</a>)</p><p>* Vibes-Eval from Reka - a set of multimodal evals (<a target="_blank" href="https://twitter.com/YiTayML/status/1785734991433118167">Announcement</a>, <a target="_blank" href="https://publications.reka.ai/reka-vibe-eval.pdf">Paper</a>, <a target="_blank" href="https://huggingface.co/datasets/RekaAI/VibeEval">HF dataset</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Gradient releases 1M context window LLama-3 finetune (<a target="_blank" href="https://x.com/Gradient_AI_/status/1785036209468907796">X</a>)</p><p>* MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4 (<a target="_blank" href="https://twitter.com/MaziyarPanahi/status/1785659308933308918">X</a>, <a target="_blank" href="https://huggingface.co/MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4">HF</a>)</p><p>* Nous Research - Hermes Pro 2 - LLama 3 8B (<a target="_blank"
href="https://twitter.com/NousResearch/status/1785779313826308096">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B">HF</a>)</p><p>* AI Town is running on Macs thanks to Pinokio (<a target="_blank" href="https://x.com/cocktailpeanut/status/1785702250599371088">X</a>)</p><p>* LMStudio releases their CLI - LMS (<a target="_blank" href="https://twitter.com/LMStudioAI/status/1786076035789815998">X</a>, <a target="_blank" href="https://github.com/lmstudio-ai/lmstudio-cli">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Github releases Copilot Workspace (<a target="_blank" href="https://twitter.com/github/status/1785006787755721210">Announcement</a>)</p><p>* AI21 - releases Jamba Instruct w/ 256K context (<a target="_blank" href="https://www.ai21.com/blog/announcing-jamba-instruct">Announcement</a>)</p><p>* Google shows Med-Gemini with some great results (<a target="_blank" href="https://twitter.com/alan_karthi/status/1785117444383588823">Announcement</a>)</p><p>* Claude releases iOS app and Team accounts (<a target="_blank" href="https://x.com/AnthropicAI/status/1785701418546180326">X</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* We're heading to SF to sponsor the biggest LLama-3 hackathon ever with Cerebral Valley (<a target="_blank" href="https://x.com/cerebral_valley/status/1785366241030607209">X</a>)</p><p>* Check out my video for Weave, our new product, it's just 3 minutes (<a target="_blank" href="https://youtu.be/aH9HKzB3BSI?si=QdpVGtPFAJr9w-bM">Youtube</a>)</p><p>* <strong>Vision & Video</strong></p><p>* InternLM open sourced a bunch of LLama-3 and Phi-based VLMs (<a target="_blank" href="https://x.com/mervenoyann/status/1784897393940463674">HUB</a>)</p><p>* And they are MLXd by the "The Bloke" of MLX, Prince Canuma (<a target="_blank" href="https://twitter.com/Prince_Canuma/status/1785423514977092083">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance releases Hyper-SD -
Stable Diffusion in a single inference step (<a target="_blank" href="https://fastsdxl.ai/">Demo</a>)</p><p>* <strong>Tools & Hardware</strong></p><p>* Still haven't opened the AI Pin, and the Rabbit R1 just arrived; will open it later today</p><p>* <strong>Co-Hosts and Guests</strong></p><p>* Piotr Padlewski (<a target="_blank" href="https://twitter.com/PiotrPadlewski">@PiotrPadlewski</a>) from Reka AI</p><p>* Idan Gazit (<a target="_blank" href="https://twitter.com/idangazit">@idangazit</a>) from Github Next</p><p>* Wing Lian (<a target="_blank" href="https://twitter.com/winglian">@winglian</a>)</p><p>* Nisten Tahiraj (<a target="_blank" href="https://x.com/nisten">@nisten</a>)</p><p>* Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg/status/1785405199831498834">@yampeleg</a>)</p><p>* LDJ (<a target="_blank" href="https://twitter.com/ldjconfirmed">@ldjconfirmed</a>)</p><p>* Wolfram Ravenwolf (<a target="_blank" href="https://twitter.com/WolframRvnwlf">@WolframRvnwlf</a>)</p><p>* Ryan Carson (<a target="_blank" href="https://twitter.com/ryancarson">@ryancarson</a>)</p><p>Scores and Evaluations</p><p>A new corner in today's pod and newsletter, given the focus this week on new models and comparing them to existing ones.</p><p>What is GPT2-chat and who put it on LMSys? (and how do we even know it's good?)</p><p>For a very brief period this week, a new mysterious model appeared on LMSys, called gpt2-chat. It only appeared on the Arena and did not show up on the leaderboard, and yet tons of sleuths from 4chan to Reddit to X started trying to figure out what this model was and wasn't. </p><p>Folks started analyzing the tokenizer and the output schema, and tried to get the system prompt and gauge the context length. Many folks were hoping that this was an early example of GPT4.5 or something else entirely. 
</p><p>It did NOT help that uncle SAMA first posted a tweet and then edited it to remove the hyphen, and it was unclear whether he was trolling again, foreshadowing a completely new release, or hinting at an old GPT-2 retrained on newer data, or something. </p><p>The model was really surprisingly good, solving logic puzzles better than Claude Opus, showing quite amazing step-by-step thinking, and able to provide remarkably informative, rational, and relevant replies. The average output quality across many different domains places it on, at least, the same level as high-end models such as GPT-4 and Claude Opus.</p><p>Whatever this model was, the hype around it made LMSYS add a clarification to their terms and temporarily take the model down for now. And we're waiting to hear more news about what it is. </p><p>Reka AI gives us Vibe-Eval, a new multimodal evaluation dataset and score (<a target="_blank" href="https://twitter.com/YiTayML/status/1785734991433118167">Announcement</a>, <a target="_blank" href="https://publications.reka.ai/reka-vibe-eval.pdf">Paper</a>, <a target="_blank" href="https://huggingface.co/datasets/RekaAI/VibeEval">HF dataset</a>)</p><p>Reka keeps surprising: with only 20 people in the company, their latest Reka Core model is very good at multimodality, and to prove it, they just released a new paper + a new method of evaluating multimodal prompts on VLMs (vision-enabled language models) </p><p>Their new open benchmark + open dataset follows this format: </p><p>And I was very happy to hear from one of the authors of the paper, <a target="_blank" href="https://twitter.com/PiotrPadlewski">@PiotrPadlewski</a>, on the pod, where he mentioned that they were trying to create a dataset that was going to be very hard for their own model (Reka Core) and just decided to keep evaluating other models on it. 
</p><p>They had 2 main objectives: (i) vibe checking multimodal chat models for day-to-day tasks and (ii) deeply challenging and probing the capabilities of present frontier models. To this end, <strong>the hard set contains > 50% questions that all frontier models answer incorrectly</strong></p><p>Chatting with Piotr about it, he mentioned that not only did they build a dataset, they actually used Reka Core as a judge to score the replies from all models on that dataset, and found that using their model in this way roughly correlates with non-expert human judgement! Very, very interesting stuff. </p><p>The "hard" set is ... well, hard! </p><p>Piotr concluded that if folks want to do research, they will provide free API access to Reka for that, so hit them up over DMs if you want to take this eval for a spin on your new shiny VLM (or indeed verify the metrics they put up) </p><p>Scale tests for eval dataset contamination with GSM-1K (<a target="_blank" href="https://twitter.com/hughbzhang/status/1785877026794356858">Announcement</a>, <a target="_blank" href="https://arxiv.org/abs/2405.00332">Paper</a>)</p><p><a target="_blank" href="http://Scale.ai">Scale.ai</a> is one of the most prominent companies in AI you may never have heard of: they are valued at $13B and have pivoted from data processing for autonomous vehicles to being the darling of the government, with agreements from the DoD for data pipelines and evaluation for the US military. </p><p>They have released a new paper as well, creating (but not releasing) a new dataset that matches GSM8K (Grade School Math), the evaluation that many frontier companies love to showcase in their release benchmarks, with some surprising results! 
</p><p>So the Scale folks created (but did not release) a dataset called GSM-1K, which tracks and is similar to the public GSM-8K dataset, and tested a bunch of existing models on their new one to see the correlation; if the difference was very stark, they could assume that some models overfitted on (or even had their training data contaminated by) the publicly available GSM8K. </p><p>On one end, models like Mistral or Phi do up to 10% worse on GSM1k compared to GSM8k. On the other end, models like Gemini, Claude, or GPT show basically no signs of being overfit.</p><p>The author goes on to say that overfitting doesn't necessarily mean it's a bad model, and highlights Phi-3, which has a 10% difference between its GSM-1K and GSM-8K scores, but still answers 68% of the new dataset correctly, while being a tiny 3.8B parameter model. </p><p>It seems that Scale has noticed how much interest there is in actually understanding how models perform, and is stepping into the evaluation game by building (but not releasing, so they don't leak) datasets. Jim Fan's tweet (and Scale CEO Alex Wang's QT) seem to agree that this is the right positioning for Scale (as they don't have models of their own and so can be neutral, like Moody's)</p><p>Open Source LLMs </p><p>LLama-3 gets 1M context window + other LLama-3 news</p><p>In the second week of the LLama-3 corner, we are noticing a significant ramp in all things Llama-3, first with the context length. The same folks from last week, Gradient, have spent cycles and upscaled/stretched LLama-3 to a whopping 1 million tokens in the context window (Llama-3 8B Gradient Instruct 1048k), with a very decent Needle in a Haystack result. </p><p>The main problem? Transformers have quadratic attention scaling issues for longer context, so this isn't something that you'd be able to run on your Mac (nay, on your cluster) any time soon, and it's almost only theoretical at this point. </p><p>The upside? 
We had Wing Lian (from Axolotl) on the show, and he talked about a new method called LoRD (which is now part of <a target="_blank" href="https://github.com/arcee-ai/mergekit#lora-extraction">MergeKit</a>), which is a way to extract LoRAs from models. </p><p>Think of it as LLM arithmetic: you take the base model (Llama-3 in this case) and the finetune (Llama-3 8B Gradient Instruct 1048k) and simply run a command like so: </p><p>mergekit-extract-lora llama-3-8B-gradient-instruct-1048K llama-3-8B just-the-context-lora [--no-lazy-unpickle] --rank=desired_rank</p><p>And boom, in theory, you have a tiny extracted LoRA file that is only the difference between these two models, the base and its finetune. </p><p>It's really exciting stuff to be able to do brain surgery on these models and extract only one specific essence! </p><p>First LLama-3 finetunes that beat the instruct version </p><p>The folks at Nous Research give us a new Hermes-Pro on top of Llama-3 8B (<a target="_blank" href="https://twitter.com/NousResearch/status/1785779313826308096">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B">HF</a>) that is beating the Llama-3 instruct version on benchmarks, which is apparently very hard to do, given that Meta created a LOT of human-labeled instructions (10M or so) and gave us a really, really good instruct model. 
</p><p>Nous Hermes 2 Pro also gives Llama-3 additional superpowers like function calling and tool use; they specifically mention that this is the model to use if you do any type of agentic stuff</p><p>This new version of Hermes maintains its excellent general task and conversation capabilities - but also excels at Function Calling, JSON Structured Outputs, and has improved on several other metrics as well, scoring a 90% on our function calling evaluation built in partnership with <a target="_blank" href="https://fireworks.ai">Fireworks.AI</a>, and an 84% on our structured JSON Output evaluation.</p><p>Kudos Teknium1, Karan and <a target="_blank" href="https://twitter.com/intrstllrninja">@intrstllrninja</a> on this release, can't wait to try it out 🫡 </p><p>LMStudio gives us a CLI (<a target="_blank" href="https://github.com/lmstudio-ai/lmstudio-cli">Github</a>)</p><p>And speaking of "trying it out", you guys know that my recommended way of running these local models is LMStudio, and no, Yagil didn't sponsor ThursdAI haha, I just love how quickly this piece of software became my go-to for running these models locally. </p><p>Well, during ThursdAI I got a #breakingNews ping from their Discord that LM Studio now has a CLI (command line interface), which allows one to load/unload models and run the web server (kind of similar to Ollama) </p><p>And since LM Studio exposes an OpenAI-compatible completions API once the models are loaded, you are now able to use these models with a simple change to your script like so: </p><p>client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio") </p><p>Which is amazing, and I'm very happy about this option, as it opens the door to tons of automation and evaluation possibilities (with something like Weave). In fact, while writing this, I downloaded a model from HuggingFace, loaded a web server and ran my first prompts, and it all took like 5 minutes, and is very easy to do! 
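</p><p>For the curious, that same local endpoint can also be hit with nothing but the Python standard library. Here's a minimal, illustrative sketch (assuming LM Studio's default port 1234 and a model already loaded; the helper names build_chat_request and chat are made up for this example, and the payload beyond the standard OpenAI chat-completions shape is an assumption):</p>

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local server


def build_chat_request(prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload for the local server."""
    return {
        "model": "local-model",  # LM Studio serves whichever model is loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }


def chat(prompt: str) -> str:
    """POST the payload to the OpenAI-compatible endpoint and return the reply."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer lm-studio",  # the key is not checked locally
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Say hello in one word")  # requires the LM Studio server to be running
```

<p>Same idea as the one-liner above, just with zero dependencies, which makes it handy for quick automations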
</p><p>This week's Buzz (What happens in Weights & Biases this week)</p><p>I have so much to share, and I want to make sure I don't overwhelm the newsletter, so here we go. First of all, I'm flying out to SF again in a few weeks to sponsor and judge the first ever LLama-3 hackathon, together with Meta, hosted by the fine folks at Cerebral Valley (<a target="_blank" href="https://partiful.com/e/p5bNF0WkDd1n7JYs3m0A">sign up and come hack!</a>)</p><p>Cerebral Valley is hosting their events at this beautiful place called SHACK15, which I've mentioned before on the newsletter, and I'm excited to finally take part in one of their events! </p><p>The second part I can't wait to tell you about: a week after that, I'm going to the Microsoft BUILD conference in Seattle, and will be representing Weights & Biases at that huge event (which last year featured Andrej Karpathy giving his "State of GPT" talk)</p><p>Here's a video I recorded for that event, which I worked really hard on, and would love some feedback. Please also let me know if you notice anything that an AI did in this video 👀 There's... something</p><p>As always, if you're attending any of these events and see me, please do come say hi and give me a high five. I love meeting ThursdAI community folks in the wild; it really makes up for the fact that I'm working remotely from Denver and really makes this whole thing worth it! </p><p>Big Companies & APIs</p><p>Github’s new Copilot Workspace in Technical Preview</p><p>I was very happy to have friend of the pod Idan Gazit, Senior Director of Research at GitHub Next, the place in Github that comes up with incredible stuff (including where Copilot was born), talk to us about Copilot's next iteration after the chat experience: Workspace! </p><p>Workspace is indeed that, a workspace for you and Copilot to start working together, on Github issues specifically, taking into context more than just 1 file, and breaking down the task into planning, iteration and human feedback. 
</p><p>It looks really slick, and per Idan, uses a LOT of gpt-4-turbo tokens, and I've had a chance to get in there and play around. </p><p>They break down every task into a specification that Copilot comes up with, which you can then iteratively work on until you get the required result, then into planning mode, where you see a whole plan, and then Copilot will get to work and start iterating on your task. </p><p>Does this remind you of anything? AGENTS, you may yell in your head as you read these words. However, I recommend you listen to Idan in our chat on the pod, because his take on agents is: we don't want these tools to replace us, we want them to help us, and what is an agent anyway, this word is very overused. And I have to agree, given the insane valuations we've seen in agent startups like Cognition Labs with Devin. </p><p>I've taken Workspace for a spin and asked it for a basic task: translating a repo's documentation into Russian, a task I know LLMs are really good at. It identified all the README files in the repo and translated them beautifully, but then it didn't place those new translations into a separate folder like I asked, a case Idan admitted they hadn't yet built for. And hey, this is why it's a Technical Preview; you just can't build an LLM-based product behind the scenes and release it, you need feedback and evaluations on your product from actual users! 
</p><p>You can see my <a target="_blank" href="https://copilot-workspace.githubnext.com/altryne/openai-cookbook?shareId=6282afa7-c99b-42d5-9cbe-afc17361fec5">whole session here</a>, in this nice link they give to be able to share (and fork, if you have access) a workspace</p><p>The integration into Github is quite amazing: there's now a text box everywhere on Github where you can ask for changes to a repo in natural language + a Raycast extension that allows you to basically kickstart a whole repo using Copilot Workspace from anywhere </p><p>And here's the result inside a new workspace   👇</p><p>I will run this later and see if it actually worked, given that Idan also mentioned that Copilot does NOT run the code it writes, but it does allow me to easily do so via Github Codespaces (a bit of confusing naming between the two!) and spin up a machine super quick. </p><p>I strongly recommend listening to Idan on the pod because he went into a lot of detail about additional features, where they are planning to take this in the future, etc. </p><p>I can go on and on, but I need to play with all the amazing new tools and models we just got today (and also start editing the podcast, it's almost 4PM and I have 2 hours to send it!) so with that, thank you for reading, and see you next time 🫡 </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-may-2nd-new-gpt2-copilot</link><guid isPermaLink="false">substack:post:144254900</guid><dc:creator><![CDATA[Alex Volkov, Idan Gazit, and Nisten]]></dc:creator><pubDate>Fri, 03 May 2024 00:35:40 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144254900/19c297936f84fc66816ecdcdc97145a6.mp3" length="78517539" type="audio/mpeg"/><itunes:author>Alex Volkov, Idan Gazit, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6543</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/144254900/5a29fd0eb5dfec8dea2d41eb0d594747.jpg"/></item><item><title><![CDATA[📅 ThursdAI - April 25 - Phi-3 3.8B impresses, LLama-3 gets finetunes, longer context & ranks top 6 in the world, Snowflake's new massive MoE and other AI news this week]]></title><description><![CDATA[<p>Hey hey folks, happy ThursdAI  🎉  </p><p>Not a lot of house-keeping here, just a reminder that if you're listening or reading from Europe, our European <a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a> conference is happening on May 15 in London, and you're more than welcome to join us there. I will have quite a few event updates in the upcoming show as well. </p><p>Besides this, this week has been a very exciting one for smaller models, as Microsoft teased and then released Phi-3 with an MIT license, a tiny model that can run on most Macs with just 3.8B parameters, and is really punching above its weights, to a surprising and even eyebrow-raising degree! Let's get into it 👇</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Microsoft open sources Phi-3 (<a target="_blank" href="https://x.com/EldanRonen/status/1782852004257149390">X</a>, <a target="_blank" href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct">HF</a>)</p><p>* LLama3 70B top 5 (now top 6) on LMsys (<a target="_blank" href="http://leaderboard.lmsys.org">LMsys Arena</a>)</p><p>* Snowflake open sources Arctic - a massive hybrid MoE (<a target="_blank" href="https://twitter.com/SnowflakeDB/status/1783130529434624014">X</a>, <a target="_blank" href="https://arctic.streamlit.app/">Try it</a>, <a target="_blank" href="https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/">HF</a>)</p><p>* Evolutionary model merge support in MergeKit (<a target="_blank" href="https://blog.arcee.ai/tutorial-tutorial-how-to-get-started-with-evolutionary-model-merging/">Blog</a>)</p><p>* Llama-3 8B finetunes roundup - longer context (128K) and Dolphin & Bagel finetunes</p><p>* HuggingFace FINEWEB - a massive 45TB, 15T-token high-quality web dataset (the GPT4 of datasets) (<a target="_blank" href="https://huggingface.co/datasets/HuggingFaceFW/fineweb">HF</a>)</p><p>* Cohere open sourced their chat interface (<a target="_blank" href="https://twitter.com/nickfrosst/status/1783220910427709766">X</a>)</p><p>* Apple open sources OpenELM - 4 models + a training library called CoreNet (<a target="_blank" href="https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca">HF</a>, <a target="_blank" href="https://github.com/apple/corenet">Github</a>, <a target="_blank" href="https://arxiv.org/abs/2404.14619">Paper</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google Gemini 1.5 Pro is #2 on LMsys arena </p><p>* Devin is now worth $2B and Perplexity is also a unicorn </p><p>* A newcomer called 
Augment (backed by Eric Schmidt) is now coming out of stealth (<a target="_blank" href="https://twitter.com/hyhieu226/status/1783237649035383002">X</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Adobe releases VideoGigaGAN - a high-quality upscaler with temporal consistency (<a target="_blank" href="https://arxiv.org/abs/2404.12388">paper</a>)</p><p>* TLDraw autocomplete UI demo (<a target="_blank" href="https://x.com/altryne/status/1782475422745141747">X</a>)</p><p>* <strong>This Week's Buzz - What I learned in WandB this week</strong></p><p>* Joe Spisak's talk about Llama3 on stage at WandB Fully Connected (<a target="_blank" href="http://wandb.me/llama3">Full Talk</a>, <a target="_blank" href="https://x.com/altryne/status/1782608360249905612">TLDR</a>)</p><p>* Voice & Audio</p><p>* <a target="_blank" href="http://Play.ai">Play.ai</a> (previously play.ht) releases a conversational Voice AI platform (<a target="_blank" href="https://twitter.com/play_ht/status/1782563211633651746">X</a>)</p><p>* AI Art & Diffusion & 3D</p><p>* IMGsys.org - like LMsys but for image generation models + a leaderboard, from FAL (<a target="_blank" href="http://imgsys.org/">try it</a>)</p><p>* Tools & Hardware</p><p>* Rabbit R1 release party & no shipping update in sight</p><p>* I'm disillusioned about my AI Pin and will return it</p><p>Open Source LLMs </p><p>Llama-3 1 week-aversary 🎂  - Leaderboard ranking + finetunes </p><p>Well, it's exactly 1 week since we got Llama-3 from Meta, and as expected, the rankings show a very, very good story (also, it was downloaded over 1.2M times and already has 600 derivatives on HuggingFace). </p><p>Just on Monday, Llama-3 70B (the bigger version) took an incredible 5th place (now down to 6th) on LMSys. More surprising, given that the Arena now has category filters (you can filter by English only, longer chats, coding etc.), if you switch to English Only this model shows up 2nd, and it was number 1 for a brief period of time. 
</p><p>So just to sum up: an open-weights model that you can run on most current consumer hardware is overtaking GPT-4-04-09, Claude Opus, etc. </p><p>This seems dubious, because, well, while it's amazing, it's clearly not at the level of Opus or the latest GPT-4 if you've used it; in fact, it fails some basic logic questions in my tests. But it's a good reminder that it's really hard to know which model outperforms which, that the Arena ALSO has a bias (of which people are using it, for example), and that evals are not a perfect way to explain which models are better. </p><p>However, LMsys is a big component of the overall vibes-based eval in our community, and Llama-3 is definitely a significant drop, and it's really, really good (even the smaller one). </p><p>One not-so-surprising thing about it is that the Instruct version is also really, really good, so much so that the first finetune, Eric Hartford's Dolphin (<a target="_blank" href="https://twitter.com/erhartford/status/1783273948022755770">Dolphin-2.8-LLama3-70B</a>), improves just a little bit over Meta's own instruct version, which is done very well. </p><p>Per Joe Spisak's (Program Manager @ Meta AI) chat at the Weights & Biases conference last week (which you can watch below), he said "<strong>I would say the magic is in post-training. That's where we are spending most of our time these days. Uh, that's where we're generating a lot of human annotations.</strong>" and they, with their annotation partners, generated up to 10 million annotation pairs, both PPO and DPO, and then did instruct finetuning. 
</p><p>So much so that Jeremy Howard suggests finetuning their instruct version rather than the base model they released.</p><p>We also covered that despite the first reactions to the 8K context window, the community quickly noticed that extending the context window for LLama-3 is possible via existing techniques like RoPE scaling, YaRN and a new <a target="_blank" href="https://twitter.com/rohanpaul_ai/status/1783574428858696161">PoSE</a> method. <a target="_blank" href="https://x.com/winglian/status/1783552669379874932">Wing Lian</a> (maintainer of the Axolotl finetuning library) is stretching the model to almost a 128K context window and running needle-in-a-haystack tests, and it seems very promising! </p><p>Microsoft releases Phi-3 (<a target="_blank" href="https://twitter.com/EldanRonen/status/1782852004257149390">Announcement</a>, <a target="_blank" href="https://arxiv.org/abs/2404.14219">Paper</a>, <a target="_blank" href="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct">Model</a>)</p><p>Microsoft didn't really let Meta take the open models spotlight, following up an incredible report with a model release that's <strong>MIT licensed</strong>, tiny (3.8B parameters), and performs very well, even against Llama-3 70B. </p><p>Phi is a set of models from Microsoft that train on a synthetic, high-quality dataset modeled after the textbooks-are-all-you-need/TinyStories approach. </p><p>The chart is quite incredible: the smallest (mini) Phi-3 is beating Llama-3-8B AND Mixtral on MMLU scores, BigBench and HumanEval. To simplify again, this TINY 3.8B model, half the size of 1 Mixtral expert, beats Mixtral and the newly released Llama-3-8B on most benchmarks, not to mention GPT-3.5! </p><p>It's honestly quite a crazy chart to look at, which raises the question: did this model train on these benchmarks? 
🤔 </p><p>I still haven't seen definitive proof that the folks at Microsoft trained on any benchmark data; I did see engagement from them and a complete denial. However, we did see a few attempts at using Phi-3 where quantized versions and wrong end-token formatting seem to have been very prevalent in shaping the early opinion that this model's performance is detached from its very high scores. </p><p>Not to mention that with the model being new, there's confusion about how to use it, see <a target="_blank" href="https://x.com/abacaj/status/1783572016823521668">this thread from Anton Bacaj</a> about HuggingFace potentially using the wrong end token to finish conversations. </p><p>Now to the actual performance of this tiny model: I asked it a simple logic-based question that trips many models, even ones good with logic (Opus and GPT-4 usually answer it correctly), and it performed very well (here's a comparison with LLama-3-70B, which didn't do as well).</p><p>Additionally, their tokenizer is very interesting: they have all these terms that receive a full token, things like function_list, calc, ghreview, ghissue, and others, which highlight some interesting potential use-cases they have planned for this set of models, or give us a hint at its training process and why it's so very good. </p><p>Snowflake open sources Arctic - a massive 480B MoE Hybrid with Apache 2 license (<a target="_blank" href="https://twitter.com/SnowflakeDB/status/1783130529434624014">X</a>, <a target="_blank" href="https://arctic.streamlit.app/">Try it</a>, <a target="_blank" href="https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/">HF</a>)</p><p>Snowflake is a name I haven't yet mentioned on ThursdAI, and this field is getting crowded, but they just released something interesting (+ a LOT of open source, including training code, checkpoints, research insights, etc.)</p><p>The thing I found most interesting is the massive 128-expert MoE, but also the hybrid architecture. 
Not quite an MoE and definitely not a dense model. </p><p>They claim to have found that training <em>Many-but-condensed experts with more expert choices</em> works well for them, based on DeepSpeed research. </p><p>You can give this model a try <a target="_blank" href="https://arctic.streamlit.app/">here</a>, and I have, using the same 2 questions I had for Phi and LLama; I found the model not that great at logic to be honest, but it was really fast considering the total size, so inference optimization for this type of architecture is definitely geared towards enterprise (as is the training cost - they claim it cost just under $2 million to train). </p><p>Big CO LLMs + APIs</p><p>Not a lot of super interesting things in this corner, besides Gemini 1.5 Pro (the one with the 1M context window) finally appearing in the Arena and taking the amazing #2 spot (pushing Llama-3 70B to number 6 on the same day it just appeared in there, lol). </p><p>This is very impressive, and I gotta wonder what happened with Gemini Ultra if Pro with a larger context beats it outright. It's indeed very good, but not THAT good if you use it on simple logic problems and don't use the whole context length. </p><p>I suspect that we'll hear much more about their AI stuff during the upcoming Google I/O (which I was invited to and am going to cover). </p><p>Additionally, we've had quite a few AI unicorns born, with Perplexity becoming a freshly minted unicorn with an additional round of funding, and Devin, the 6-month-old agent startup, getting to a <a target="_blank" href="https://twitter.com/amir/status/1783163540951687456">2 billion valuation</a> 😮 </p><p>This week's Buzz (What I learned with WandB this week)</p><p>It's been exactly 1 week since our conference in SF, and since Joe Spisak, by complete chance, announced Meta LLama-3 live on stage a few hours after its official release. </p><p>In this week's buzz, I'm very happy to bring you that recording, as promised last week. 
</p><p>I will also share that Weave, our newly announced LLM observability tool, launched officially during the conference, and it'll be my job to get you to use it 🙂 And shoutout to those in the ThursdAI community who already used it and provided feedback, it's really helpful! </p><p>AI Art & Diffusion</p><p>The fine folks at <a target="_blank" href="http://FAL.ai">FAL.ai</a> have launched the <a target="_blank" href="http://LMsys.org">LMsys.org</a> for images, and called it.... <a target="_blank" href="http://IMGsys.org">IMGsys.org</a>  🙂 It's an adversarial arena with different image generators, all hosted on Fal I assume, that lets the user choose which images are "better", which is a vague term. </p><p>But it's really fun, give it a try! </p><p>Tools & Hardware</p><p>Rabbit R1 first impressions</p><p>We finally got a tease of the R1 from Rabbit, as the first customers started receiving this device (where's mine?? I didn't even get a tracking number). </p><p>Based on the presentation (which I watched so you don't have to), the response time, which was one of the most talked-about negatives of the AI Pin, seems very decent. We're going to see a lot of reviews, but I'm very excited about my Rabbit 👏 🐇 </p><p>Apparently I wasn't as fast as I thought on the pre-order, so I'll have to wait patiently; meanwhile, check out <a target="_blank" href="https://x.com/rileybrown_ai/status/1783378266180485209">this review</a> from Riley Brown. </p><p>That's the deep dive for this week; for the rest of the coverage, please listen to the episode, and if you liked it, share it with a friend! </p><p>I'll also be traveling quite a bit in the next two months: I'll be in Seattle for MSFT BUILD, and in San Francisco (more on this soon) a couple of times. Hope to meet some of you, please come say hi! 🫡 </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-april-25-phi-3-38b-impresses</link><guid isPermaLink="false">substack:post:144011472</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 26 Apr 2024 00:49:21 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/144011472/20438e3e556d4a2e5a674df3d72cc1ed.mp3" length="58730233" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4894</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/144011472/27428e28f45d0889f692b59b5dc6addb.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Apr 18th - 🎉 Happy LLama 3 day + Bigxtral instruct, WizardLM gives and takes away + Weights & Biases conference update]]></title><description><![CDATA[<p>Happy LLama 3 day folks! After a lot of rumors, speculations, and apparently pressure from the big Zuck himself, we can finally call <strong>April 18th, 2024, LLaMa 3 day</strong>! </p><p>I am writing this from the lobby of the Marriott hotel in SF, where our annual conference, Fully Connected, is happening, and I recorded today's episode from my hotel room. I really wanna shout out how awesome it was to meet folks who are listeners of the ThursdAI pod and newsletter subscribers, participate in the events, and give high fives. </p><p>During our conference, we had the pleasure of having <a target="_blank" href="https://twitter.com/joespeez">Joe Spisak</a>, the Product Director of LLaMa at Meta, actually announce LLaMa3 on stage! 
It was so exhilarating, I was sitting in the front row, and then had a good chat with Joe outside of the show 🙌 </p><p>The first part of the show was of course LLaMa 3 focused; we had such a great time chatting about the amazing new 8B and 70B models we got, and salivating over the announced but not yet released 400B model of LLaMa 3 😮 </p><p>We also covered a BUNCH of other news from this week, which was already packed with tons of releases and AI news, and I was happy to share my experiences running a workshop a day before our conference, with a focus on LLM evaluations. (If there's interest, I can share my notebooks and maybe even record a video walkthrough, let me know in the comments.) </p><p>Ok let's dive in 👇 </p><p>Happy LLama 3 day 🔥 </p><p>The technical details</p><p>Meta has finally given us what we've all been waiting for: incredibly expensive (2 clusters of 24K H100s over 15 Trillion tokens) open weights models, the smaller <strong>8B</strong> one and the larger <strong>70B</strong> one. </p><p>We got both instruction finetuned and base models, which is great for finetuners, and it's worth mentioning that this is a dense model (not a mixture of experts; all the parameters are accessible to the model during inference). </p><p>It is REALLY good at benchmarks, with the 8B model beating the previous LLaMa 2 70B on pretty much all benchmarks, and the new 70B is inching up on the bigger releases from the past month or two, like Claude Haiku and even Sonnet! </p><p>The only downsides are the 8K context window + non-multimodality, but both are coming according to Joe Spisak, who announced LLama3 on stage at our show, Fully Connected 🔥 </p><p>I was sitting in the front row and was very excited to ask him questions later! </p><p>By the way, Joe did go into details they haven't yet talked about publicly (see? I told you to come to our conference! and some of you did!) 
and I've been live-tweeting his whole talk + the chat outside with the "extra" spicy questions and Joe's winks haha, you can read that <a target="_blank" href="https://twitter.com/altryne/status/1781052731899416787">thread here</a></p><p>The additional info</p><p>Meta has also partnered with both Google and Bing (take that, OpenAI) and inserted LLama 3 into the search boxes of Facebook, Instagram, Messenger and Whatsapp, plus deployed it to a new product called <a target="_blank" href="http://meta.ai">meta.ai</a> (you can try it there now), and is now serving LLama 3 to more than 4 Billion people across all of those apps, talk about compute cost! </p><p>Llama 3 also has a new tokenizer (which Joe encouraged us to "not sleep on") and a bunch of new security tools like Purple LLama and LLama Guard. TorchTune, the finetuning library recently released by the PyTorch team, now supports LLama3 finetuning natively out of the box as well (and integrates Wandb as its first-party experiment tracking tool). </p><p>If you'd like more details, directly from Joe, I was <a target="_blank" href="https://twitter.com/altryne/status/1780683714101756081">live tweeting</a> his whole talk, and am working on getting the slides from our team. We'll likely have a recording as well, and will post it as soon as we have it. 
</p><p>Here's a TL;DR (with my notes, for the first time) of everything else we talked about. Given today is LLaMa day, and I still have to do Fully Connected demos, I will "open source" my notes and refer you to the podcast episode to hear more detail about everything else that happened today 🫡 </p><p><strong>TL;DR of all topics covered:</strong> </p><p>* Meta releases LLama 3 -8B, 70B and later 400B (<a target="_blank" href="https://x.com/AIatMeta/status/1780997403979735440">Announcement</a>, <a target="_blank" href="https://llama.meta.com/llama3/?utm_source=twitter&#38;utm_medium=organic_social&#38;utm_content=video&#38;utm_campaign=llama3">Models</a>, <a target="_blank" href="http://meta.ai">Try it</a>, <a target="_blank" href="https://twitter.com/LMStudioAI/status/1781087087745274116">Run Locally</a>)</p><p>* Open Source LLMs </p><p>* Meta LLama 3 8B, 70B and later 400B (X, Blog)</p><p>* Trained on 15T tokens! </p><p>* 70B and 8B models released + instruction finetunes</p><p>* 8K context length, not multimodal</p><p>* 70B gets 82% on MMLU and 81.7% on HumanEval</p><p>* 128K vocab tokenizer</p><p>* Dense model, not MoE</p><p>* Both instruction tuned on human annotated datasets</p><p>* Open Access</p><p>* The model already uses RoPE </p><p>* Bigxtral instruct 0.1 (<a target="_blank" href="https://mistral.ai/news/mixtral-8x22b/">Blog</a>, <a target="_blank" href="https://labs.perplexity.ai/">Try it</a>)</p><p>* Instruct model of the best Apache 2 model around</p><p>* Released a comparison chart that everyone started "fixing" </p><p>* 🤖 Mixtral 8x22B is Mistral AI's latest open AI model, with unmatched performance and efficiency </p><p>* 🗣 It is fluent in 5 languages: English, French, Italian, German, Spanish</p><p>* 🧮 Has strong math and coding capabilities  </p><p>* 🧠 Uses only 39B parameters out of 141B total, very cost efficient</p><p>* 🗜 Can recall info from large documents thanks to a 64K token context window</p><p>* 🆓 Released under a permissive open source 
license for anyone to use</p><p>* 🏆 Outperforms other open models on reasoning, knowledge and language benchmarks  </p><p>*  🌐 Has strong multilingual abilities, outperforming others in 4 languages</p><p>* 🧪 Excellent basis for customization through fine-tuning</p><p>* New Tokenizer from Mistral (<a target="_blank" href="https://docs.mistral.ai/guides/tokenization/">Docs</a>)</p><p>* Focusing on Tool Use with tokens 🔥</p><p>* WizardLM-2 8x22B, 70B and 7B (<a target="_blank" href="https://twitter.com/WizardLM_AI/status/1779899325868589372">X</a>, HF)</p><p>* Released it and then pulled it back from HF and Github after failing Microsoft's toxicity testing</p><p>* Big CO LLMs + APIs</p><p>* OpenAI gives us Batch API + Assistants API v2 </p><p>* Batch is 50% of the cost, a win win win</p><p>* Assistants API V2 - new RAG</p><p>* new file search tool</p><p>* up to 10,000 files per assistant</p><p>* new vector store</p><p>* Reka gives us Reka Core (<a target="_blank" href="https://twitter.com/RekaAILabs/status/1779894622334189592">X</a>, Try)</p><p>* Multimodal that understands video as well</p><p>* 20 person team</p><p>* Video understanding is very close to Gemini </p><p>* 128K context </p><p>* Core has strong reasoning abilities, including for language, math and complex analysis.</p><p>* 32 languages supported </p><p>* HuggingFace iOS chat bot is out now </p><p>* This week's Buzz</p><p>* Me + team led a workshop a day before the conference (<a target="_blank" href="https://twitter.com/altryne/status/1780683714101756081">Workshop Thread</a>)</p><p>* Fully Connected in SF was an incredible success, over 1000 AI attendees + Meta AI announcement on stage 🔥 </p><p>* PyTorch's new TorchTune finetuning library with first class WandB support (<a target="_blank" href="https://x.com/kakemeister/status/1780281318506668370">X</a>)</p><p>* Vision & Video</p><p>* Microsoft VASA-1 animated avatars (<a target="_blank" href="https://x.com/minchoi/status/1780792793079632130">X</a>, <a target="_blank" 
href="https://www.microsoft.com/en-us/research/project/vasa-1/">Blog</a>)</p><p>* Amazing level of animation from 1 picture + sound</p><p>* Harry Potter portraits are here</p><p>* They likely won't release this during an election year</p><p>* Looks very good, close to EMO, but no code</p><p>* 📺 Videos show faces speaking naturally with head movements and lip sync</p><p>* 🔬 Researchers are exploring applications in education, accessibility and more</p><p>* HuggingFace updates IDEFICS2 8B VLM (<a target="_blank" href="https://twitter.com/reach_vb/status/1779923496887361745">X</a>, <a target="_blank" href="https://huggingface.co/collections/HuggingFaceM4/idefics2-661d1971b7c50831dd3ce0fe">HF</a>)</p><p>* Apache 2 license</p><p>* Competitive with 30B models</p><p>* 12 point increase in VQAv2, 30 point increase in TextVQA (compared to Idefics 1)</p><p>* > 10x fewer parameters than Idefics 1</p><p>* Supports image resolution up to 980 x 980+</p><p>* Better OCR capabilities (thanks to more than 6TB of OCR pre-training data)</p><p>* Adobe shows Firefly video + SORA support (<a target="_blank" href="https://twitter.com/bilawalsidhu/status/1779911317467398446">X</a>)</p><p>* Voice & Audio</p><p>* Rewind AI is now Limitless (<a target="_blank" href="https://twitter.com/dsiroker/status/1779857843895599383">X</a>)</p><p>* New service & brand name</p><p>* Transcription to you </p><p>* Hardware device that looks sleek </p><p>* 100 hours </p><p>* Privacy support in cloud</p><p>* AI Art & Diffusion & 3D</p><p>* Stability - Stable Diffusion 3 is here </p><p>* Available via API only</p><p>* Partnered with Fireworks HQ for the release</p><p>* Needs Stability AI membership to use / access $$</p><p>* Big step up in composition and on notorious issues like hands, "AI faces" etc. </p><p>* Seems to prefer simpler prompts.</p><p>* Way more copyright-friendly. It's hard to get any kind of brands/logos. 
</p><p>* Text is amazing.</p><p>* Others</p><p>* New AIrChat with amazing transcription is out, come join <a target="_blank" href="https://www.air.chat/c/ai">us in our AI corner there</a></p><p>* Humane AI pin was almost killed by the MKBHD review</p><p>* Rabbit reviews incoming</p><p>That's all for this week; next week we have an amazing guest, see you then! 🫡 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-18th-happy-llama-3-day</link><guid isPermaLink="false">substack:post:143728336</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 19 Apr 2024 00:49:09 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/143728336/b21d9a8ffbc6194e633717edeb3a8d41.mp3" length="96273943" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>8023</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/143728336/63c46b3bee3aa568916cb7c285ac8937.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Apr 11th, 2024 - GPT4 is king again, New Mixtral 8x22B + First finetune, New Gemini 1.5, Cohere beats old GPT4, more AI news]]></title><description><![CDATA[<p>This week was absolutely bonkers. 
For starters, for the first time ever, we got an open weights model (Command R+) to jump over GPT-4 in human rankings on LMsys. This is huge!</p><p>Then on Tuesday, it seemed that all the companies just wanted to one-up one another: first, Gemini 1.5 released with updates, became available in 180 countries, and added audio mode + tons of API improvements and system prompts; then less than an hour later, OpenAI gave us a "majorly improved" GPT-4 Turbo version (2024-04-09) that is now back to being the <strong>BEST LLM IN THE WORLD</strong>; and to cap that day off, Mistral did the thing again, the thing being dropping a torrent link in a tweet with no explanation.</p><p>What was in that torrent is a Mixtral 8x22B MoE (which we started calling Bixtral), which comes with an Apache 2 license and seems to be VERY good!</p><p>We also saw the first finetune from HuggingFace/KAIST folks less than 48 hours later (the authors of said finetune actually came on the show 🎉 )</p><p><a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=april11">Fully Connected</a> is a week from today! 
If you haven't yet signed up, use the THURSDAI promo code and come hear from Richard Socher (You.com), Jerry Liu (LlamaIndex CEO), Karoly (TwoMinutePapers), Joe Spisak (Meta) and leaders from NVIDIA, Snowflake, Microsoft, Coatue, Adobe, Siemens, Lambda and tons more 👇</p><p>TL;DR of all topics covered:</p><p>* <strong>Open Source LLMs</strong></p><p>* 🔥 Mistral releases Mixtral 8x22, an Apache 2 licensed MoE model (<a target="_blank" href="https://twitter.com/MistralAI/status/1777869263778291896">Torrent</a>, <a target="_blank" href="https://labs.perplexity.ai/">TRY IT</a>)</p><p>* Cohere CMDR+ jumps to No. 6 on LMSys and beats GPT4 (<a target="_blank" href="https://twitter.com/lmsysorg/status/1777630133798772766">X</a>)</p><p>* CodeGemma, RecurrentGemma & Gemma Instruct 1.1 (<a target="_blank" href="https://twitter.com/robdadashi/status/1777317210836312233">Announcement</a>)</p><p>* Auto-code-rover gets 22% on SWE-bench (<a target="_blank" href="https://x.com/AbhikRoychoudh1/status/1777494000611852515">Announcement</a>)</p><p>* HuggingFace - Zephyr 141B-A35B - First Bixtral Finetune (<a target="_blank" href="https://x.com/osanseviero/status/1778430866718421198">Announcement</a>)</p><p>* Mistral 22B - 1 single expert extracted from the MoE (<a target="_blank" href="https://x.com/mejia_petit/status/1778390352082215129">Announcement</a>, <a target="_blank" href="https://huggingface.co/Vezora/Mistral-22B-v0.1">HF</a>)</p><p>* This week's Buzz - Weights & Biases updates</p><p>* Fully Connected is in 1 week! 
(<a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=april11">Come meet us</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🔥 GPT-4 Turbo is back to being the number 1 AI with an 88.2% HumanEval score (<a target="_blank" href="https://x.com/OpenAI/status/1778574613813006610">X</a>)</p><p>* Gemini 1.5 Pro now understands audio, uses unlimited files, acts on your commands, and lets devs build incredible things with JSON mode (<a target="_blank" href="https://x.com/liambolling/status/1777758743637483562">X</a>)</p><p>* LLama 3 coming out in less than a month (confirmed by Meta folks)</p><p>* XAI Grok now powers news summaries on X (<a target="_blank" href="https://twitter.com/i/trending/1778406570449072628">Example</a>)</p><p>* Cohere's new Rerank 3 (<a target="_blank" href="https://txt.cohere.com/rerank-3/">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* HuggingFace trained Parler-TTS (<a target="_blank" href="https://x.com/sanchitgandhi99/status/1778093250324189627">Announcement</a>, <a target="_blank" href="https://github.com/huggingface/parler-tts">Github</a>)</p><p>* Udio finally launched its service (<a target="_blank" href="https://x.com/udiomusic/status/1778045322654003448">Announcement</a>, Leak, Try It)</p><p>* Suno has added explore mode (<a target="_blank" href="suno.ai/explore">suno.ai/explore</a>)</p><p>* Hardware</p><p>* Humane AI pin has started shipping - <a target="_blank" href="https://x.com/JoannaStern/status/1778469290988994741">reviews are not amazing</a></p><p></p><p>Open Source LLMs</p><p>Command R+ is the first open weights model that beats last year's GPT4 versions</p><p>This is massive, really a milestone to be discussed; even though tons of other news happened, this is the first time an open weights model is beating GPT-4, not on a narrow case (coding, medical) but on a general human evaluation in the arena.</p><p>This happened just a year 
after GPT-4 first came out, and it is really, really impressive.</p><p>Command R+ has been getting a lot of great attention from the community as well; folks were really surprised by the overall quality, not to mention the multilingual abilities of CommandR+</p><p>Mixtral 8x22B MoE with 65K context and Apache 2 license (Bigstral)</p><p>Despite the above, Cohere's time in the sun (i.e. top open weights model on LMsys) may not be that long if the folks at Mistral have anything to say about it!</p><p>Mistral decided to cap the crazy Tuesday release day with another groundbreaking tweet of theirs, which included a torrent link and nothing else (since then they have of course uploaded the model to the hub), giving us what may potentially unseat Command R+ from the rankings.</p><p>The previous Mixtral (8x7B) signaled the age of MoEs, and each expert in it was the size of Mistral 7B, but for this new, affectionately named Bixtral model, each expert is a massive 22B-sized model.</p><p>We only got a base version of it, which is incredible in its own right, but it's not instruction finetuned yet, and the finetuner community is already cooking really hard! It's hard, though, because this model requires a lot of compute to finetune, and not only GPUs; Matt Shumer came on the pod and mentioned that GPUs weren't actually the main issue, it was system RAM when the finetune was finished.</p><p>The curious thing about it was watching the loss and the eval loss. It [Bixtral] learns much faster than other models - Matt Shumer</p><p>Matt was trying to run finetunes for Bigstral and had a lot of interesting stuff to share, definitely check out that conversation on the pod.</p><p>Bigstral is... big, and it's not super possible to run it on consumer hardware.... 
yet, because <a target="_blank" href="https://twitter.com/nisten/status/1778173811428311398">Nisten somehow got it to run on CPU only</a> 🤯 using Justine Tunney's LLM kernels (from last week) and LLama.cpp at 9 tok/s, which is kinda crazy.</p><p>HuggingFace + KAIST release Zephyr 141B-A35B (First Mixtral 8x22 finetune)</p><p>And that was fast: less than 48 hours after the torrent drop, we already see the first instruction finetune from the folks at HuggingFace and KAIST AI.</p><p>They give us a new finetune using <a target="_blank" href="https://huggingface.co/papers/2403.07691">ORPO</a>, a technique by KAIST that significantly improves finetuning ability (they finetuned Bigstral with <strong>7k capybara instructions</strong> for <strong>1.3 hours</strong> on 4 nodes of 8 x H100s)</p><p>They used the distilled Capybara dataset (from LDJ and Argilla) to give this model a bit more clarity and instruction following.</p><p>You can find the model on the <a target="_blank" href="https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1">hub</a> here, but now the question is, who would run this? 😅</p><p>Btw, the authors of the finetune and the ORPO paper from KAIST, <a target="_blank" href="https://twitter.com/jiwoohong98">Jiwoo Hong</a> and <a target="_blank" href="https://twitter.com/nlee288">Noah Lee</a>, came on the pod and chatted about this finetune and ORPO, which was awesome! 
Definitely check this conversation out.</p><p>Big CO LLMs + APIs</p><p>Gemini 1.5 Pro updates - Audio Mode, JSON, System prompts, and it becomes free</p><p>Google really pulled out all the stops for this updated release of Gemini 1.5 Pro, its flagship 1M context window model.</p><p>It's now available for free in over 180 countries, and has a new audio mode where you can upload up to 9.5 hours of audio (which is crazy on its own), and it's not merely transcription; it seems that they baked an audio encoder in there, so the model can understand some tonality and even some dogs barking in the background!</p><p>In fact, instead of me writing it down, how about I show you an example of Gemini itself extracting everything I said about it during the show? Here's a screenshot of me uploading 2+ hours of raw, unedited audio from the show today:</p><p>You can see the Google AI studio (which is a very clean product!) and the new system message, the ability to turn the safety filters off (thank you!) and the audio mode. Not to mention the 250K tokens 😂 that my audio cost this model. Mind you, the highest context window after Gemini is Claude 3 with 200K.</p><p>Google also significantly improved the APIs, and gave access to a new file upload API that allows files of up to 2GB to be uploaded (to support this amazing context and multimodality) 🔥</p><p>OpenAI - GPT-4 Turbo, a new and "majorly improved" version</p><p></p><p>Remember when Gemini 1.5 was announced? 
You may not remember that specific day, because an hour after that, OpenAI published SORA and blew our collective minds.</p><p>Well, OpenAI is at it again, but this time it didn't quite work the same way: an hour after the Gemini 1.5 updates came out, OpenAI released GPT4-Turbo-April-9, aka (<strong>gpt-4-turbo-2024-04-09</strong>), and basically all they said was that it was "majorly improved".</p><p>The technical stuff first: they combined the tool use (function calling) API with the Vision API, which brings feature parity with Anthropic.</p><p>The vibes are currently good; folks are seeing improvements across the board in logic and code creation. Specifically, the folks at Cursor posted an example (and enabled this model in their IDE) where it writes higher quality code.</p><p>As I’m writing these words, LMSys updated us that this new model shot up to the top of the arena, taking the mantle back from Opus as the best AI we have, and we also got a confirmation from <strong>OpenAI that this model is now powering the chatGPT interface</strong> 👏</p><p>OpenAI also just open sourced a repo to show what they used to get these exact scores for the new GPT-4, and they are impressive.</p><p>This week's Buzz (What I learned with WandB this week)</p><p>Final call! Fully Connected, our very own annual conference, is about to commence.</p><p>(hehe, of course it's happening on a ThursdAI, I still have to think about how to record the show next week)</p><p>Please feel free to use the code <strong>THURSDAI</strong> to sign up and come see us.</p><p>As a reminder, we're also running a workshop a day before, where we're going to showcase Weave and give practical examples for LLM builders, and it's going to be a lot of fun! Looking forward to seeing some of you there!</p><p>Audio & Voice</p><p>Udio launches a Suno-competitor AI music service</p><p>For the past week+, I've seen tons of plugged-in AI folks in SF post about "a new AI for music that is coming and it's going to be amazing". 
Well, it's finally here; it's called Udio and it gives Suno a run for its money, for sure.</p><p>With the ability to create full tracks, create intros and outros, remix, and much-needed AI-enhanced prompting, Udio does look very, very polished and sounds GOOD!</p><p>Here is an example of a classical music track that's been going viral:</p><p>I've played a few more examples on the show itself, and you can check out the trending creations on their page.</p><p>Interestingly, this is probably a diffusion model, so folks have been squeezing all kinds of not-strictly-musical stuff out of there, including stand-up comedy with a full laugh track.</p><p></p><p>Suno adds explore mode</p><p>Meanwhile, Suno is not going down without a fight and has released this amazing new page where they generated thousands of samples for hundreds of interesting/weird sound styles, letting you explore and learn about different musical styles. I really liked it, so I recorded a short reaction video:</p><p>Phew, somehow we made it; we were able to summarize the huge news this week in under two hours + a newsletter!</p><p>The one thing I haven't been able to do is actually try out much of the stuff I talked about, so after writing this, I'll take a little break and delve into some of the other things I haven't yet tried 👀</p><p>See you guys next week in limited capacity (maybe, we'll see), and until then, have a great week 🫡</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-11th-2024-gpt4-is-king</link><guid isPermaLink="false">substack:post:143502712</guid><dc:creator><![CDATA[Alex Volkov and Matt Shumer]]></dc:creator><pubDate>Fri, 12 Apr 2024 00:42:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/143502712/af48e7468d7526705ab0841ecfcc7d98.mp3" length="70986728" type="audio/mpeg"/><itunes:author>Alex Volkov and Matt Shumer</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5915</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/143502712/e9caeec0f596a1409adbe7374a163d24.jpg"/></item><item><title><![CDATA[📅 ThursdAI Apr 4 - Weave, CMD R+, SWE-Agent, Everyone supports Tool Use + JAMBA deep dive with AI21]]></title><description><![CDATA[<p>Happy first ThursdAI of April folks, did you have fun on April Fools? 👀 I hope you did, I made a poll on my feed and 70% did not participate in April Fools, which makes me a bit sad! </p><p>Well alright, time to dive into the news of this week, and of course there is a TON of news, but I want to start with our own breaking news! That's right, we at Weights & Biases have breaking news of our own today: we've launched our new product, called Weave! </p><p>Weave is our new toolkit to track, version and evaluate LLM apps, so from now on, we have Models (what you probably know as Weights & Biases) and Weave. So if you're writing any kind of RAG system, anything that uses Claude or OpenAI, Weave is for you! 
</p><p>I'll be focusing on Weave and I'll be sharing more on the topic, but today I encourage you to listen to the launch conversation I had with Tim & Scott from the Weave team here at WandB, as they and the rest of the team worked their ass off for this release and we want to celebrate the launch 🎉</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs</strong> </p><p>* Cohere - CommandR PLUS - 104B RAG optimized Sonnet competitor (<a target="_blank" href="https://twitter.com/aidangomez/status/1775878606108979495">Announcement</a>, <a target="_blank" href="https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus">HF</a>)</p><p>* Princeton SWE-agent - OSS Devin - gets <strong>12.29%</strong> on SWE-bench (<a target="_blank" href="https://twitter.com/jyangballin/status/1775114444370051582">Announcement</a>, <a target="_blank" href="https://github.com/princeton-nlp/SWE-agent">Github</a>)</p><p>* Jamba paper is out (<a target="_blank" href="https://twitter.com/AI21Labs/status/1774824070053331093">Paper</a>)</p><p>* Mozilla LLamaFile now goes 5x faster on CPUs (<a target="_blank" href="https://twitter.com/JustineTunney/status/1774621341473489024">Announcement</a>, <a target="_blank" href="https://justine.lol/matmul/">Blog</a>)</p><p>* Deepmind - Mixture of Depth paper (<a target="_blank" href="https://x.com/TheSeaMouse/status/1775782800362242157?s=20">Thread</a>, <a target="_blank" href="https://arxiv.org/abs/2404.02258">ArXiv</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Cloudflare AI updates (<a target="_blank" href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support?utm_campaign=cf_blog&#38;utm_content=20240402&#38;utm_medium=organic_social&#38;utm_source=facebook,linkedin,twitter">Blog</a>)</p><p>* Anthropic adds function calling support (<a target="_blank" href="https://x.com/AnthropicAI/status/1775979799644934281">Announcement</a>, <a target="_blank" 
href="https://docs.anthropic.com/claude/docs/tool-use">Docs</a>)</p><p>* Groq lands function calling (<a target="_blank" href="https://x.com/GroqInc/status/1775634099849322632?s=20">Announcement</a>, <a target="_blank" href="https://console.groq.com/docs/tool-use?utm_content=288316824&#38;utm_medium=social&#38;utm_source=twitter&#38;hss_channel=tw-842860575289819136#models">Docs</a>)</p><p>* OpenAI is now open to customers without login requirements </p><p>* Replit Code Repair - 7B finetune of deep-seek that outperforms Opus (<a target="_blank" href="https://x.com/pirroh/status/1775327316157358564?s=20">X</a>)</p><p>* Google announced Gemini Prices + Logan joins (<a target="_blank" href="https://x.com/artificialguybr/status/1775381126502101217?s=20">X</a>)</p><p>* <strong>This week's Buzz - oh so much BUZZ!</strong></p><p>* Weave launch! Check Weave out! (<a target="_blank" href="https://wandb.me/weave">Weave Docs</a>, <a target="_blank" href="https://github.com/wandb/weave">Github</a>)</p><p>* Sign up with Promo Code THURSDAI at <a target="_blank" href="http://fullyconnected.com">fullyconnected.com</a> </p><p>* <strong>Voice & Audio</strong></p><p>* OpenAI Voice Engine will not be released to developers (<a target="_blank" href="https://openai.com/blog/navigating-the-challenges-and-opportunities-of-synthetic-voices?utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-evals-based-ai-engineering">Blog</a>)</p><p>* Stable Audio v2 dropped (<a target="_blank" href="https://x.com/StabilityAI/status/1775501906321793266?s=20">Announcement</a>, <a target="_blank" href="https://stableaudio.com/community">Try here</a>)</p><p>* Lightning Whisper MLX - 10x faster than whisper.cpp (<a target="_blank" href="https://twitter.com/maxaljadery/status/1775196809893478797">Announcement</a>, Github)</p><p>* AI Art & Diffusion & 3D</p><p>* Dall-e now has in-painting (<a target="_blank" href="https://x.com/OpenAI/status/1775569161759985737?s=20">Announcement</a>) </p><p>* Deep 
dive</p><p>* Jamba deep dive with <a target="_blank" href="https://www.linkedin.com/in/roi-cohen-90b87121/">Roi Cohen</a> from AI21 and <a target="_blank" href="https://twitter.com/maximelabonne/status/1775511912773566733">Maxime Labonne </a></p><p>Open Source LLMs </p><p>Cohere releases Command R+, 104B RAG focused model (<a target="_blank" href="https://txt.cohere.com/command-r-plus-microsoft-azure/">Blog</a>)</p><p>Cohere surprised us, and just 2.5 weeks after releasing Command-R (which became very popular and is No. 10 on the LMsys arena) gave us its big brother, <strong>Command R PLUS</strong></p><p>With 128K tokens in the context window, this model is multilingual as well, supporting 10 languages, and even brings tokenization benefits for those languages (a first!) </p><p>The main focus from Cohere is advanced function calling / tool use, and RAG of course, and this model specializes in those tasks, beating even GPT-4 turbo. </p><p>It's clear that Cohere is positioning themselves as RAG leaders, as evidenced by this accompanying tutorial on starting with <a target="_blank" href="https://txt.cohere.com/rag-chatbot/">RAG apps</a>, and this model further solidifies their place as the experts in this field. Congrats folks, and thanks for the open weights 🫡</p><p>SWE-Agent from Princeton</p><p>Folks, remember Devin? The agent with a nice UI, born from a super cracked team, that got 13% on SWE-bench, a very hard (for LLMs) benchmark that requires solving real-world issues?</p><p>Well, now we have an open source agent that comes very, very close to that, called SWE-Agent</p><p>SWE-agent has a dedicated terminal and tools, and utilizes something called ACI (Agent Computer Interface), allowing the agent to navigate, search, and edit code. </p><p>The dedicated terminal in a Docker environment really helps, as evidenced by a massive 12.3% score on SWE-bench, where GPT-4 gets only 1.4%! 
</p><p>Worth mentioning that SWE-bench is a very hard benchmark that was created by the folks who released SWE-agent, and here are some <a target="_blank" href="https://x.com/_carlosejimenez/status/1775157915789463935">videos</a> of them showing the agent off, this is truly an impressive achievement!</p><p>Deepmind publishes Mixture of Depth (arXiv)</p><p>Thanks to <a target="_blank" href="https://twitter.com/TheSeaMouse/status/1775782800362242157">Hassan</a> who read the paper and wrote a deep dive, this paper by Deepmind shows their research into optimizing model inference. Apparently there's a way to train LLMs without affecting their performance, which later allows them to significantly reduce compute on some generated tokens.  </p><p>🧠 Transformer models currently spread compute uniformly, but Mixture-of-Depths allows models to dynamically allocate compute as needed</p><p>💰 Dynamically allocating compute based on the difficulty of predicting each token leads to significant compute savings </p><p>⏳ Predicting the first token after a period is much harder than within-sentence tokens, so more compute is needed</p><p> 🗑 Most current compute is wasted since difficulty varies between tokens</p><p>We're looking forward to seeing models trained with this, as this seems to be a very big deal in how to optimize inference for LLMs. </p><p><p>Thank you for reading ThursdAI - Best way to support us is to just share this with folks 👇</p></p><p>Big CO LLMs + APIs</p><p>Anthropic and Groq announce function calling / tool use support, Cohere takes it one step further</p><p>In yet another example of how OpenAI is leading not only in models, but in developer experience, most models and API providers are now using the same messages API structure. 
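That shared structure means the tool use loop looks roughly the same everywhere. Here's a minimal, provider-agnostic sketch in Python; the get_weather tool, the message shapes, and the stubbed-out model are all hypothetical stand-ins, since each vendor's real SDK differs in exact field names:

```python
import json

# Hypothetical tool registry: the developer declares what the model may call.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def fake_model(messages):
    """Stand-in for the real LLM API call (OpenAI, Anthropic, Groq, Cohere...)."""
    last = messages[-1]
    if last["role"] == "user":
        # The model decides a tool is needed and returns its name + parameters.
        return {"role": "assistant",
                "tool_call": {"name": "get_weather", "arguments": {"city": "Denver"}}}
    # Once the tool result is handed back, it produces the final answer.
    result = json.loads(last["content"])
    return {"role": "assistant",
            "content": f"It's {result['temp_c']}°C in {result['city']}."}

def run_conversation(user_input):
    messages = [{"role": "user", "content": user_input}]
    reply = fake_model(messages)
    while "tool_call" in reply:                            # model asked for a tool
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # developer executes it
        messages.append({"role": "tool", "content": json.dumps(result)})
        reply = fake_model(messages)                       # hand the result back
    return reply["content"]

print(run_conversation("What's the weather in Denver?"))   # → It's 21°C in Denver.
```

The key point of the convergence is exactly this loop shape: the model never runs the tool itself, the developer does, and hands the result back as another message.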
</p><p>Back in June of 2023, OpenAI gave us function calling, and finally the industry is aligning to this format, which is now being rebranded as "tool use". </p><p>If you're unfamiliar with the concept, tool use allows a developer to specify what tools the model can have in addition to just spitting out tokens; think browsing the web, using RAG to get more information, checking the weather, or... turning off a lightbulb in your smart home. </p><p>The LLM then decides, based on user input, whether a specific tool needs to be called, responds with the tool and parameters it needs to the developer, and then expects the result of that tool, and finally, is able to respond to the user with the complete information. </p><p>So this week we've got Cohere, Groq and Anthropic all adding support for tool use, which is incredible for developer experience across the board and will allow developers to move between all those APIs. </p><p>Cohere goes one step further with something they call Multi Step tool use, which is a significant step up and is very interesting to explore, as it gives their models the ability to rank and order tool execution, and observe their responses.</p><p>Anthropic Docs <a target="_blank" href="https://docs.anthropic.com/claude/docs/tool-use">https://docs.anthropic.com/claude/docs/tool-use</a></p><p>Groq Docs <a target="_blank" href="https://console.groq.com/docs/tool-use">https://console.groq.com/docs/tool-use</a></p><p>Cohere Docs <a target="_blank" href="https://docs.cohere.com/docs/multi-step-tool-use">https://docs.cohere.com/docs/multi-step-tool-use</a></p><p>Cloudflare AI is now in GA + workers in Python</p><p>If you've been following ThursdAI, you know I'm a huge Cloudflare fan. I've built my startup (https://targum.video) on top of the Cloudflare Workers platform, and I gave them early feedback about having to step into AI in a big way. And they did, with Workers AI, which is now in GA. 
</p><p>Workers AI lets developers in the Cloudflare ecosystem run LLMs (they mostly feature open source LLMs, which is incredible), host vectors, run Whisper, and basically have end-to-end serverless apps that are powered by AI (they have GPUs in 150 cities around the world)</p><p>This week Cloudflare also announced the ability to write Workers in Python, which was sorely missing for some folks (like me!) who love FastAPI for example, and while it's not a full Python environment, the depth to which they had to go in order to allow Python to execute on their edge is kind of ridiculous, read up on <a target="_blank" href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support?utm_campaign=cf_blog&#38;utm_content=20240402&#38;utm_medium=organic_social&#38;utm_source=facebook,linkedin,twitter">it here</a></p><p>I'm hoping to work with them to bring Weave into Python Workers soon 🤞 because building AI applications with Cloudflare is so simple, they even have a HuggingFace integration which allows you to bring models into your CF environment with 1 click. </p><p>This week's Buzz - SO MUCH BUZZ</p><p>Hey, well first of all, I can now offer you 15% off a ticket to our conference, so use THURSDAI when you check out and get a ticket <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=april4">here</a></p><p>Now that Weave is out, I can say that our workshop on April 17 (same link as above) is going to be focused on LLM evaluations, and yes, I will be talking about how to use Weave to build LLM applications in production safely. 
If this field is new to you, please <a target="_blank" href="https://blog.cloudflare.com/workers-ai-ga-huggingface-loras-python-support?utm_campaign=cf_blog&#38;utm_content=20240402&#38;utm_medium=organic_social&#38;utm_source=facebook,linkedin,twitter">sign up</a> and come to the workshop!</p><p>JAMBA deep dive with Roi @ AI21 and Maxime Labonne</p><p>As always, what I cover in this newsletter is only the highlights of what we talked about, but there was so much more, so I really recommend you listen to the episode. Think of this week's episode as 2 episodes (maybe I should re-release the deep dive as a separate episode), because we had a long conversation with Roi Cohen, who's a PM @ AI21, and Maxime Labonne (author of LazyMergeKit and the first finetune of JAMBA); it's really worth tuning into that interview. Here's a little snippet: </p><p>Aaaand this is it for this week, or you know what? Maybe it's not! I shared this on X, but if you don't follow me on X, I decided to prank my whole feed by saying that I'm basically changing careers and becoming a Russian AI DJ, called DJ Thursday, and I will only play AI generated music. </p><p>The weird thing is how many people were like, yeah ok, this makes sense for you 😅 So here's my April Fools (one of them) joke, hope you enjoy the high quality of these tunes and see you all next week 🫡 </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-apr-4-weave-cmd-r-swe-agent</link><guid isPermaLink="false">substack:post:143286226</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 05 Apr 2024 01:04:33 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/143286226/a69a5f845d0688167b8e24ba8efccf7d.mp3" length="79258259" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6605</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/143286226/a22fbaa24d7e5178267326fa38d1ab10.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Mar 28 - 3 new MoEs (XXL, Medium and Small), Opus is 👑 of the arena, Hume is sounding emotional + How Tanishq and Paul turn brainwaves into SDXL images 🧠👁️]]></title><description><![CDATA[<p>Hey everyone, this is Alex and can you believe that we're almost done with Q1 2024? March 2024 was kind of crazy of course, so I'm of course excited to see what April brings (besides <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march28">Weights & Biases conference in SF called Fully Connected</a>, which I encourage you to attend and say Hi to me and the team!) </p><p>This week we have tons of exciting stuff on the leaderboards, say hello to the new best AI in the world Opus (+ some other surprises), in the open source we had new MoEs (one from Mosaic/Databricks folks, which tops the open source game, one from AI21 called Jamba that shows that a transformers alternative/hybrid can actually scale) and tiny MoE from Alibaba, as well as an incredible Emotion TTS from Hume. 
</p><p>I also had the pleasure to finally sit down with friend of the pod Tanishq Abraham and Paul Scotti from MedArc and chatted about MindEye 2, how they teach AI to read minds using diffusion models 🤯🧠👁️</p><p><p>Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.</p></p><p>TL;DR of all topics covered: </p><p>* <strong>AI Leaderboard updates</strong></p><p>* Claude Opus is number 1 LLM on arena (and in the world)</p><p>* Claude Haiku passes GPT4-0613</p><p>* 🔥 Starling 7B beta is the best Apache 2 model on LMsys, passing GPT3.5</p><p>* <strong>Open Source LLMs</strong> </p><p>* Databricks/Mosaic DBRX - a new top Open Access model (<a target="_blank" href="https://twitter.com/jefrankle/status/1772961586497425683">X</a>, <a target="_blank" href="https://huggingface.co/databricks/dbrx-instruct">HF</a>)</p><p>* 🔥 AI21 - Jamba 52B - Joint Attention Mamba MoE (<a target="_blank" href="https://www.ai21.com/blog/announcing-jamba">Blog</a>, <a target="_blank" href="https://huggingface.co/ai21labs/Jamba-v0.1">HuggingFace</a>)</p><p>* Alibaba - Qwen1.5-MoE-A2.7B (<a target="_blank" href="https://twitter.com/JustinLin610/status/1773370228296007951">Announcement</a>, <a target="_blank" href="https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4">HF</a>)</p><p>* Starling - 7B that beats GPT3.5 on lmsys (<a target="_blank" href="https://huggingface.co/Nexusflow/Starling-LM-7B-beta">HF</a>)</p><p>* LISA beats LORA as the frontrunner PeFT (<a target="_blank" href="https://twitter.com/Rui45898440/status/1772996453557997924">X</a>, <a target="_blank" href="https://arxiv.org/abs/2403.17919">Paper</a>)</p><p>* Mistral 0.2 Base released (<a target="_blank" href="https://twitter.com/dchaplot/status/1771672289953866212">Announcement</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Emad leaves stability 🥺</p><p>* Apple rumors - Baidu, Gemini, Anthropic, who else? 
(<a target="_blank" href="https://x.com/altryne/status/1772118633394606360?s=20">X</a>)</p><p>* <strong>This weeks buzz</strong></p><p>* WandB Workshop in SF confirmed April 17 - LLM evaluations (<a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march28">sign up here</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Sora showed some demos by actual artists, Air Head was great (<a target="_blank" href="https://x.com/altryne/status/1772316519743058337?s=20">Video</a>)</p><p>* Tencent Aniportait - generate Photorealistic Animated avatars (<a target="_blank" href="https://x.com/_akhaliq/status/1772926152698396709?s=20">X</a>)</p><p>* MedArc - MindEye 2 - fMRI signals to diffusion models (<a target="_blank" href="https://twitter.com/iScienceLuvr/status/1769909429435294058">X</a>) </p><p>* <strong>Voice & Audio</strong></p><p>* Hume demos EVI -  empathic voice analysis & generation (<a target="_blank" href="https://twitter.com/hume_ai/status/1773038932428468238">X</a>, <a target="_blank" href="https://www.hume.ai/?launchWidget=true">demo</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Adobe firefly adds structure reference and style transfer - (<a target="_blank" href="https://twitter.com/ProperPrompter/status/1773047666055770322">X</a>, <a target="_blank" href="https://firefly.adobe.com/generate/images?id=622f6d6f-0216-44a0-a3de-944fe6ede44f">Demo</a>)</p><p>* <strong>Discussion</strong></p><p>* Deep dive into MindEye 2 with Tanishq & Paul from MedArc</p><p>* Is narrow finetuning done-for with larger context + cheaper prices - debate</p><p>🥇🥈🥉Leaderboards updates from LMSys (<a target="_blank" href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard">Arena</a>)</p><p>This weeks updates to the <a target="_blank" href="https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard">LMsys arena</a> are significant. 
(Reminder: in LMsys they use a mix of MT-Bench, LLM-as-a-judge evaluation <strong>and</strong> user ELO scores, where users play with these models and choose which answer they prefer)</p><p><strong>For the first time since the Lmsys arena launched, the top model is NOT GPT-4 based.</strong> It's now Claude's Opus, but that's not surprising if you used the model; what IS surprising is that Haiku, its tiniest, fastest brother, is now well positioned at number 6, beating a GPT4 version from the summer, Mistral Large and other models, while being dirt cheap. </p><p>We also have an incredible showing from the only Apache 2.0 licensed model in the top 15, Starling LM 7B beta, which is now 13th on the chart, an incredible finetune of a finetune (OpenChat) of Mistral 7B. 👏 </p><p>Yes, you can now run a GPT3.5-beating model, on your Mac, fully offline 👏 Incredible. </p><p>Open Source LLMs (Welcome to MoE's)</p><p>Mosaic/Databricks gave us DBRX 132B MoE - trained on 12T tokens (<a target="_blank" href="https://twitter.com/jefrankle/status/1772961586497425683">X</a>, <a target="_blank" href="https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm">Blog</a>, <a target="_blank" href="https://huggingface.co/databricks/dbrx-instruct">HF</a>)</p><p>Absolutely crushing the previous records, Mosaic has released the top open access model (one you can download, run and finetune) in a while, beating LLama 70B, Grok-1 (314B) and pretty much every other non-closed-source model in the world, not only on metrics and evals, but also on inference speed</p><p>It uses a Mixture of Experts (MoE) architecture with 16 experts, different subsets of which activate for different tokens. This allows it to have 36 billion active parameters, compared to 13 billion for Mixtral. DBRX has strong capabilities in math, code, and natural language understanding. 
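To make "active parameters" concrete, here is a toy sketch of top-k expert routing in NumPy. The dimensions, the random router, and the choice of 4 active experts are illustrative assumptions for this sketch, not DBRX's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 16, 4      # toy hidden size; 16 experts, 4 active per token (illustrative)

router = rng.standard_normal((d, n_experts))                       # routing weights (random here)
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]  # one weight matrix per expert

def moe_layer(x):
    """x: (tokens, d). Send each token through its top_k experts and mix the outputs."""
    logits = x @ router                                  # routing scores, (tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]             # indices of the chosen experts
        gates = np.exp(logits[t][top])
        gates /= gates.sum()                             # softmax over the chosen experts only
        for g, e in zip(gates, top):
            out[t] += g * (x[t] @ experts[e])            # only these expert matrices are touched
    return out

tokens = rng.standard_normal((5, d))
print(moe_layer(tokens).shape)                           # each token used only 4 of 16 experts
```

Each token only multiplies through its chosen experts, which is why per-token compute tracks the active parameter count rather than the total parameter count.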
</p><p>The real kicker is the size: it was pre-trained on <strong>12 trillion tokens of text and code</strong> with a maximum context length of 32,000 tokens, which is just incredible considering that LLama 2 was trained on just 2T tokens. And the funny thing is, they call this DBRX-medium 👀 Wonder what large is all about.</p><p>Graph credit Awni Hannun from MLX (<a target="_blank" href="https://x.com/awnihannun/status/1773356250543149417?s=20">Source</a>)</p><p>You can play with DBRX <a target="_blank" href="https://huggingface.co/spaces/databricks/dbrx-instruct">here</a> and you'll see that it is SUPER fast; not sure what Databricks magic they did there, or how much money they spent (ballpark of ~$10M), but it's truly an awesome model to see in the open access! 👏 </p><p>AI21 releases JAMBA - a hybrid Transformer + Mamba 52B MoE (<a target="_blank" href="https://www.ai21.com/blog/announcing-jamba">Blog</a>, <a target="_blank" href="https://huggingface.co/ai21labs/Jamba-v0.1">HF</a>)</p><p>Oh, don't I love #BreakingNews on the show! Just a few moments before ThursdAI, AI21 dropped this bombshell of a model, which is not quite the best around (see above) but has a few very interesting things going for it. </p><p>First, it's a hybrid architecture model, capturing the best of the Transformer and Mamba architectures, and achieving incredible performance at larger context window sizes (Transformer hardware requirements scale quadratically with the attention/context window)</p><p>AI21 are the first to show (and take the bet) that hybrid architecture models actually scale well and are performant (this model comes close to Mixtral MoE on many benchmarks), while also being significantly cost-advantageous and faster at inference on longer context windows. In fact they claim that Jamba is the only model in its size class that fits up to 140K context on a <strong>single GPU! 
</strong>  </p><p>This is a massive effort and a very well received one, not only because this model is Apache 2.0 licensed (thank you AI21 👏) but also because this is now the longest context window model in the open weights (up to 256K), and we've yet to see the incredible amount of finetuning/optimization that the open source community can do once they set their mind to it! (see Wing from Axolotl <a target="_blank" href="https://x.com/winglian/status/1773432465849299188?s=20">adding support</a> for finetuning Jamba the same day it was released) </p><p>Can't wait to see the benchmarks for this model once it's properly instruction fine-tuned. </p><p>Small MoE from Alibaba - Qwen 1.5 - MoE - A2.7B (<a target="_blank" href="https://qwenlm.github.io/blog/qwen-moe/">Blog</a>, <a target="_blank" href="https://huggingface.co/Qwen">HF</a>)</p><p>What a week for Mixture of Experts models! We got an additional MoE from the awesome Qwen team, where they show that training an A2.7B (the full model is actually 14B, but only 2.7B are activated at the same time) is cheaper: a 75% reduction in training costs and a 174% improvement in inference speed!</p><p>Also in open source: </p><p>LISA beats LoRA for the best parameter-efficient training</p><p>* 📰 LISA is a new method for memory-efficient large language model fine-tuning, presented in a Hugging Face paper</p><p>* 💪 LISA achieves better performance than LoRA with less time on models up to 70B parameters</p><p>* 🧠 Deep networks are better suited to LISA, providing more memory savings than shallow networks</p><p>* 💾 Gradient checkpointing greatly benefits LISA by only storing gradients for unfrozen layers</p><p>* 📈 LISA can fine-tune models with up to 7B parameters on a single 24GB GPU</p><p>* 🚀 The code implementation in LMFlow is very simple, requiring only 2 lines of code</p><p>* 🤔 LISA outperforms full parameter training in instruction-following tasks</p><p>Big CO LLMs + APIs</p><p>Emad departs from Stability AI.</p><p>In a very surprising (perhaps unsurprising to some) move, Emad Mostaque, 
founder and ex-CEO of Stability, announced his departure to focus on decentralized AI</p><p>For me personally (and I know countless others), we all started our love for Open Source AI with Stable Diffusion 1.4: downloading the weights, understanding that we can create AI on our machines, playing around with this. It wasn't easy; Stability was sued to oblivion, I think LAION is still down from a lawsuit, but we got tons of incredible Open Source from Stability, and tons of incredible people who work/worked there. </p><p>Big shoutout to Emad, and very excited to see what he does next</p><p>Throwback to NEURIPS, where Emad borrowed my GPU Poor hat and wore it ironically 😂 He promised me a Stability hat but... I won't hold it against him 🙂 </p><p>This week's Buzz (What I learned with WandB this week)</p><p>I'm so stoked about the workshop we're running before the annual Fully Connected conference in SF! Come hear about evaluations, better prompting with Claude, and tons of insights that we have to share in our workshop, and of course, join the main event on April 18 with the whole Weights & Biases crew! </p><p>Vision</p><p>Sora was given to artists, they created ... art</p><p>Here's a short by a company called ShyKids who got access to SORA alongside other artists; it's so incredibly human, and I love the way they used storytelling to overcome technological issues like lack of consistency between shots. Watch it and enjoy imagining a world where you could create something like this without leaving your living room. </p><p></p><p>This also shows that human creativity and art are still deep in the middle of all these creations, even with tools like SORA</p><p>MindEye 2.0 - faster fMRI-to-image</p><p>We had the awesome pleasure of having Tanishq Abraham and Paul Scotti, who recently released a significantly better version of their fMRI-to-image model, called MindEye 2.0, cutting the fMRI data required from 40 hours to just 1 hour. 
This is quite remarkable, and I would encourage you to listen to the full interview that's coming out this Sunday on ThursdAI.</p><p>Voice</p><p>Hume announces EVI - their empathic text-to-speech model (<a target="_blank" href="https://twitter.com/altryne/status/1773028549089259727">Announcement</a>, <a target="_blank" href="https://demo.hume.ai/">Demo</a>)</p><p>This one is big, folks, I really was blown away (see my blind reaction below). Hume announced EVI, a text-to-speech generator that can reply with emotions! It's really something, and it has to be experienced. This is in addition to Hume already having an understanding of emotions via voice/imagery, and the whole end-to-end conversation with an LLM that understands what I feel is quite novel and exciting! </p><p>The Fine-Tuning Disillusionment on X</p><p>Quite a <a target="_blank" href="https://twitter.com/emollick/status/1770618237782307075">few</a> <a target="_blank" href="https://x.com/abacaj/status/1772377838798229993?s=20">folks</a> noticed a sort of disillusionment with finetuning coming from some prominent pro-open-source, pro-fine-tuning accounts, leading me to post this: </p><p>And we of course had to have a conversation about it, and Hamel Husain wrote this <a target="_blank" href="https://hamel.dev/blog/posts/fine_tuning_valuable.html">response blog</a> called "Is Finetuning still valuable". </p><p>I'll let you listen to the conversation, but I will say, like with RAG, finetuning is a broad term that doesn't apply evenly across the whole field. For some narrow use-cases, it may simply be better/cheaper/faster to deliver value to users by using smaller, cheaper, but longer-context models and just providing all the information/instructions to the model in the context window. 
</p><p>On the other side, there are data privacy concerns, RAG over a finetuned model can absolutely be better than just simple RAG, and there are a LOT more considerations before we make the call that fine-tuning is not "valuable" for specific/narrow use-cases. </p><p>This is it for this week folks, another incredible week in AI, full of new models, exciting developments and deep conversations! See you next week 👏 </p><p></p><p>Transcript Below: </p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, this is ThursdAI, I'm Alex Volkov, and just a little bit of housekeeping before the show. And what a great show we had today. This week started off slow with some, some news, but then quickly, quickly, many open source and open weights releases from Mosaic and from AI21 and from Alibaba were starting to pile on, and at the end we had too many things to talk about as always.</p><p>[00:00:36] <strong>Alex Volkov:</strong> , I want to thank my co-hosts Nisten Tahiraj, LDJ, Yam Peleg. And today we also had Robert Scoble with a surprise appearance, who helped me through the beginning. We also had Justin (Junyang Lin) from Alibaba talk about the stuff that they released from Qwen. And after the updates part, we also had two deeper conversations in the second part of this show.</p><p>[00:01:07] <strong>Alex Volkov:</strong> The first one was with Tanishq Mathew Abraham and Paul Scotti from MedArc about their recent paper and work on MindEye2, which translates fMRI signals using diffusion models into images. So fMRI signals into images, which is mind reading, basically, which is incredible. So a great conversation, and it's always fun to have Tanishq on the pod.</p><p>[00:01:37] <strong>Alex Volkov:</strong> And the second conversation stemmed from a recent change in the narrative, or a sentiment change, in our respective feeds about fine tuning in the era of long context, very cheap models like Claude. And that conversation is also very interesting to listen to. 
One thing to highlight is this week we also saw the first time GPT 4 was toppled down from the Arena, and we now have the, a change in regime of the best AI possible, uh, which is quite, quite stark as a change, and a bunch of other very exciting and interesting things in the pod today.</p><p>[00:02:21] <strong>Alex Volkov:</strong> So, as a brief reminder, if you want to support the pod, the best way to do this is to share it with your friends and join our live recordings every ThursdAI on X. But if you can't, sharing it with a friend, sharing a subscription from Substack, or subscribing, uh, to a pod platform of your choice is a great way to support this pod.</p><p>[00:02:48] <strong>Alex Volkov:</strong> With that, I give you March 28th, ThursdAI.</p><p>[00:02:52] <strong>Alex Volkov:</strong> Hello hello everyone, for the second time? We're trying this again. This is ThursdAI for March 28th. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And for those of you who are live with us in the audience who heard this for the first time, apologies, we just had some technical issues and hopefully they're sorted now.</p><p>[00:03:21] <strong>Alex Volkov:</strong> And in order to make sure that they're sorted, I want to see that I can hear. Hey, Robert Scoble joining us. And I usually join their spaces, but Robert is here every week as well. How are you, Robert?</p><p>[00:03:35] <strong>Robert Scoble:</strong> Great. A lot of news flowing through the system. New</p><p>[00:03:39] <strong>Alex Volkov:</strong> we have, a lot of updates to do.</p><p>[00:03:43] <strong>Robert Scoble:</strong> photo editing techniques. I mean, the AI world is just hot and</p><p>[00:03:48] <strong>Robert Scoble:</strong> going.</p><p>[00:03:49] <strong>Alex Volkov:</strong> Week to week, we feel the acceleration. And I also want to say hi to Justin. Justin is the core maintainer of the Qwen team. 
Qwen, we've talked about, and we're going to talk about today, because you guys have some breaking news. But also, you recently started a new thing called OpenDevin. I don't know if we have tons of updates there, but definitely folks saw Devin, which we reported on, what, a few weeks ago, I think? Time moves really fast in this AI world. I think, Justin, you posted something on X, and then it started the whole thing. So you want to give two sentences about OpenDevin?</p><p>[00:04:21] <strong>Justin Lin:</strong> Yeah, sure. I launched the OpenDevin project around two weeks ago, because we just saw Devin. It is very popular. It is very impressive. And we just thought about whether we can build something with the open source community, work together, build an agent of this style, or do some research in this. So we have the project, and then a lot of people are coming in, including researchers and practitioners in the industry.</p><p>[00:04:46] <strong>Justin Lin:</strong> So we have a lot of people here now, and work is going generally well. Yeah, you can see that we have a front end and back end and a basic agent system. So we are not far from an MVP. So stay tuned.</p><p>[00:05:01] <strong>Alex Volkov:</strong> Amazing. So definitely, Justin, when there's updates to share, you know where to come on ThursdAI. But also you have specific Qwen updates that we're going to get to in the open source area. So folks, I'm going to run through everything that we have to cover, and hopefully we'll get to everything.</p><p>[00:05:18] TL;DR - March 28th</p><p>[00:05:18] <strong>Alex Volkov:</strong> Here's the TL;DR of everything that's important in the world of AI that we're going to talk about for the next two hours, starting now. 
Right. So we have a leaderboard update, and I thought it would be cool to have a leaderboard update section, because big things are happening on the leaderboards. Specifically, I'm talking here about the LMSYS Chatbot Arena leaderboard, the one that also does MT-Bench, which is LLMs judging LLMs, but also lots of humans interact with these models in two windows, and then they calculate ELO scores, which correlate best with the vibes evaluations that we all know and love. And folks, Claude Opus is the number one LLM on the Arena right now. Claude Opus, the one that we've been talking about, I think, week to week to week, is</p><p>[00:06:05] <strong>Alex Volkov:</strong> now the number one LLM in the world, and it's quite impressive. And honestly, in this instance, the Arena was lagging behind all our vibes. We talked about this already, we felt it on X and on LocalLLaMA and all other places. So I think it's a big deal, because for the first time since, I think, forever, it's clear to everyone that GPT-4 was actually beaten. Not only that, Sonnet, which is their smaller version, also beats some GPT-4 versions. And Haiku, their tiniest, super cheap version, 25 cents per million tokens, you can literally use Haiku the whole day, and at the end of the month you pay, I don't know, 5 bucks. Haiku also passes one of the versions of GPT-4 on some of the vibes, and Haiku is the distilled Opus version, so that kind of makes sense.</p><p>[00:06:53] <strong>Alex Volkov:</strong> But it's quite incredible that we had this upheaval and this change in leadership in the LMSYS Arena, and I thought it was worth mentioning here first. So let's move to the open source LLM stuff, we have a bunch of updates here. 
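The Arena mechanism described above, humans voting between two anonymous model outputs, is usually summarized with Elo-style ratings. Here is a minimal sketch of a single Elo update; the function names and K-factor are illustrative assumptions, and LMSYS actually fits ratings over the full battle history (Bradley-Terry style) rather than updating one vote at a time:

```python
# Illustrative Elo update for one arena-style pairwise vote.
# NOTE: textbook Elo formula only, not LMSYS's exact pipeline.

def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one human vote."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

# An upset (the lower-rated model wins) moves ratings the most:
new_a, new_b = elo_update(1200.0, 1100.0, a_won=False)
```

Because the update is zero-sum, the total rating is conserved; repeated over thousands of votes, ratings converge toward each model's win rate against the field, which is why the leaderboard tracks the community "vibes" so closely.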
I think the hugest one: yesterday Databricks took over all of our feeds. Databricks bought this company called Mosaic, and we've talked about Mosaic multiple times before, and now they've combined forces, and for the past</p><p>[00:07:17] <strong>Alex Volkov:</strong> year they've been working on something called DBRX. And now, in addition to the big company models (Claude Opus took over GPT-4), we have a new open access model that takes over as the main lead. And they call this DBRX Medium, which is funny. It's a 132 billion parameter language model, and it's a mixture of experts with, I think, 16 experts, and it's huge, and it beats Llama 2 70B, it beats Mixtral, it beats Grok, on at least MMLU and HumanEval scores. So it's really impressive to see, and we're gonna chat about DBRX as well, and there's a bunch of stuff to cover there. And Justin, I think you had a thread that we're gonna go through, and you had a great reaction</p><p>[00:08:02] <strong>Alex Volkov:</strong> summary, so we're gonna cover that. Just today, what, 30 minutes before this, we have breaking news. I'm actually using breaking news here in the TL;DR section because</p><p>[00:08:11] <strong>Alex Volkov:</strong> why [00:08:20] not?</p><p>[00:08:22] <strong>Alex Volkov:</strong> So AI21, a company from Israel, released something incredible. It's called Jamba. It's 52 billion parameters, but the kicker is it's not just a Transformer. It's a joint architecture: joint attention and Mamba. And we've talked about Mamba, and we've talked about Hyena. Those are state space models that try to compete with the Transformer architecture, with significantly better context handling. And Jamba looks quite incredible. It's also a mixture of experts; as you notice, we have a bunch of mixtures of experts here. It's 16 experts with two active per generation. It supports up to 256K context length, which is quite incredible. 
So we're going to talk about Jamba.</p><p>[00:09:03] <strong>Alex Volkov:</strong> We also have some breaking news. So on the topic of breaking news, Junyang, you guys also released something. You want to do the announcement yourself? It would be actually pretty cool.</p><p>[00:09:13] <strong>Justin Lin:</strong> Yeah, sure. Just now we released a small MoE model, which is called Qwen1.5-MoE-A2.7B, which means we activate 2.7 billion parameters. Its total parameter count is around 14 billion, but it actually activates around 2.7 billion parameters.</p><p>[00:09:33] <strong>Alex Volkov:</strong> Thanks, Justin, for breaking this down a little bit. We're going to talk more about this when we get to the open source section. I also want to mention that, in the news about Databricks and the DBRX model, something else got lost that was actually released on Thursday last week.</p><p>[00:09:49] <strong>Alex Volkov:</strong> We also didn't cover this: Starling is now a 7 billion parameter model that beats GPT-3.5 on LMSYS as well. So Starling is super cool, and we're going to add a link and talk about Starling as well. Stability gave us a new Stable Code Instruct, and Stability has other news as well that we're going to cover, and it's pretty cool.</p><p>[00:10:07] <strong>Alex Volkov:</strong> It's a very small code instruct model that beats StarChat, I think the 15B, as well. So we got a few open source models. We also got a new method to fine-tune LLMs. It's called LISA, if you guys know what LoRA is; there's a paper called LISA, a new method for memory-efficient large language model fine-tuning.</p><p>[00:10:25] <strong>Alex Volkov:</strong> And I think this is it. Oh no, there's one tiny piece of news in open source as well: Mistral finally gave us the Mistral 0.2 base model at a hackathon that they participated in with a bunch of folks 
on the weekend, and there was a little bit of confusion about this, because we already had a Mistral</p><p>[00:10:43] <strong>Alex Volkov:</strong> 0.2 Instruct model, and now they released this base model, because many finetuners want the base model. So just worth an update there. In the big companies' LLMs and APIs, I don't think we have tons of stuff besides Claude Opus, as we said, being the number one LLM in the world. The little bit of news there is that Emad Mostaque leaves Stability AI, and that's worth mentioning, because Emad definitely had a big effect on my career, because I started my whole thing with the Stable Diffusion 1.4 release. And we also have some Apple rumors where, as you guys remember, we've talked about Apple potentially having their own models. They have a bunch of open source that they're working on, they have the MLX platform, we're seeing all these signs. And then last week we had rumors that Apple is going to go with Gemini, this week we had rumors that Apple is going to sign with Anthropic, and now Baidu. So it's unclear, but worth mentioning the Apple rumors as well. In this week's buzz, the corner where I talk about Weights & Biases, I already mentioned, but maybe I'll go a little bit in depth, that we're in San Francisco on April 17th and 18th, and the workshop is getting filled up, and it's super cool to see. I actually worked on the stuff that I'm going to show, and it's super exciting, and it covers pretty much a lot of the techniques that we cover here on ThursdAI as well.</p><p>[00:12:05] <strong>Alex Volkov:</strong> In the vision and video category, this was a cool category as well, because for the first time the folks at Sora gave Sora to artists, and they released a bunch of actual visual demos that look mind-blowing. Specifically Airhead, I think, was mind-blowing. 
We're gonna cover this a little bit.</p><p>[00:12:21] <strong>Alex Volkov:</strong> If you guys remember EMO, the paper that was released without any code, that took one picture and made it sing, made it an animated character. Tencent released something close to that, called AniPortrait. AniPortrait doesn't look as good as EMO, but the weights are actually there.</p><p>[00:12:36] <strong>Alex Volkov:</strong> So you can now take one image and turn it into a talking avatar, and the weights are actually open and you can use it, and it's pretty cool. And in vision and video, I put this one in vision and video as well, MedARC released MindEye2, and we actually have a chat closer to the second hour with Tanishq and Paul from MedARC about MindEye2, which is reading fMRI signals and turning them into images of what you saw, which is crazy. And I think the big update from yesterday, from the voice and audio category, is that Hume, a company called Hume, demoed something called EVI, which is their empathic voice analysis and generation model, which is crazy. I posted a video about this yesterday on my feed. You talk to this model, it understands your emotions. Apparently this is part of what Hume has on the platform, you can actually use this right now. But now they also showed an ElevenLabs competitor, a text-to-speech model that can generate voice in multiple emotions. And it's pretty striking to talk to it. It answers sentence by sentence, and it changes its emotion sentence by sentence. Hopefully I'm going to get access to the API very soon and play around with it. Really worth talking about: empathetic, or empathic, AIs in the world of agentry, where everybody talks about the</p><p>[00:13:53] <strong>Alex Volkov:</strong> AI therapist.</p><p>[00:13:54] <strong>Alex Volkov:</strong> So we're going to cover Hume as well. 
I think a very brief coverage in the AI art and diffusion category: Adobe had their annual Firefly conference, Firefly is one year old, and they added some stuff like structure reference and style transfer. And one discussion at the end of the show: is narrow fine-tuning done for, with larger contexts and cheaper prices like Haiku's? We had this sentiment on our timelines, and I maybe participated in it a little bit, and I would love a discussion about fine-tuning, because I do see quite a few prominent folks moving away from this concept of fine-tuning for specific knowledge.</p><p>[00:14:32] <strong>Alex Volkov:</strong> Tasks, still yes, but for knowledge, it looks like, with the way context windows are evolving, potentially folks will just do RAG. So we're going to have a discussion about fine-tuning for specific tasks versus narrow knowledge at the end there. And I think this is everything that we are going to talk about here. That's a lot. So hopefully we'll get to a bunch of it.</p><p>[00:14:51] Open Source -</p><p>[00:14:51] <strong>Alex Volkov:</strong> And I think we're going to start with our favorite, which is open source.</p><p>[00:15:12] <strong>Alex Volkov:</strong> And while I was giving the TL;DR, a friend of the pod and frequent co-host Yam Peleg joined us. Yam, how are you?</p><p>[00:15:18] <strong>Yam Peleg:</strong> Hey, how are you doing?</p><p>[00:15:19] <strong>Alex Volkov:</strong> Good! I saw something that you were on your way to visit our friends at AI21. Is that still the</p><p>[00:15:24] <strong>Alex Volkov:</strong> awesome, awesome.</p><p>[00:15:25] <strong>Yam Peleg:</strong> I'll be there in 10, 20 minutes.</p><p>[00:15:27] <strong>Alex Volkov:</strong> Oh, wow. Okay. So we have 10, 20 minutes, and if you guys are there and you want to hop on, you're also welcome. So actually, while you're here, I would love to hear from you. We have two things to discuss. 
They're major in open source, and we have a bunch of other stuff to cover. I think the major thing, the thing that took over all our timelines, is that Mosaic is back. Databricks, the huge company that does a bunch of stuff, noticed that Mosaic was doing incredible things, and around, I don't know, six months ago, maybe almost a year ago, Databricks acquired Mosaic. And Mosaic has been quiet since then. Just a refresher for folks who haven't followed us for the longest time: Mosaic released a model that was, for, I don't know, three months, two months, the best 7 billion parameter model, called MPT, and</p><p>[00:16:10] DBRX MoE 132B from Mosaic</p><p>[00:16:10] <strong>Alex Volkov:</strong> Mosaic, almost a year ago, I think in May, also broke the barrier of what we can consider a large context window. They announced a model with a 64 or 72K context window, and they were the first, before Claude, before anybody else. And since then they've been quiet. And they have an inference platform, they have a training platform, they have a bunch of stuff that Databricks acquired. And yesterday they came out with a bang. And this bang is, they now released the top open access model, that beats Llama, that beats Mixtral, that beats Grok-1, that beats all these things. [00:16:40] And it's huge. It's a 132 billion parameter MoE that they've trained on, I don't know, seven</p><p>[00:16:49] <strong>Alex Volkov:</strong> 12,</p><p>[00:16:49] <strong>Yam Peleg:</strong> 12,</p><p>[00:16:50] <strong>Alex Volkov:</strong> Jesus Christ, 12 trillion tokens.</p><p>[00:16:53] <strong>Alex Volkov:</strong> This is huge. I don't think we've seen anything come close to this amount of training, right?</p><p>[00:16:59] <strong>Yam Peleg:</strong> Oh yeah, it's insane. I mean, the next one we know is the six trillion of Gemma, and that's already nuts. We don't know about Mistral, but the next one we know is the six trillion of Gemma. So, yeah, it's a much larger model. 
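To make the "active parameters" idea behind these mixture-of-experts releases concrete, here is a toy top-k router. Everything here, the dimensions, the softmax-over-chosen-experts weighting, is an illustrative sketch of how MoE layers generally work, not the actual DBRX or Qwen routing code:

```python
# Toy top-k mixture-of-experts routing: per token, only TOP_K of
# N_EXPERTS expert networks do any work, which is why a model like
# Qwen1.5-MoE can be ~14B parameters total yet activate only ~2.7B.
# All dimensions below are made up for illustration.
import math
import random

random.seed(0)
D_MODEL, N_EXPERTS, TOP_K = 8, 16, 2

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

experts = [rand_matrix(D_MODEL, D_MODEL) for _ in range(N_EXPERTS)]  # tiny "FFN" per expert
router = rand_matrix(D_MODEL, N_EXPERTS)                             # routing weights

def matvec(m, x):
    """x @ m for a row-major matrix m."""
    return [sum(w * xi for w, xi in zip(col, x)) for col in zip(*m)]

def moe_forward(x):
    logits = matvec(router, x)                          # score all experts
    top = sorted(range(N_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    exps = [math.exp(logits[i]) for i in top]
    weights = [e / sum(exps) for e in exps]             # softmax over chosen experts only
    out = [0.0] * D_MODEL
    for w, i in zip(weights, top):                      # only TOP_K experts run
        for j, v in enumerate(matvec(experts[i], x)):
            out[j] += w * v
    return out

y = moe_forward([1.0] * D_MODEL)
active_fraction = TOP_K / N_EXPERTS                     # 2 of 16 experts per token
```

The routing is what makes the architecture comparisons in this discussion tricky: total parameter count (132B for DBRX) and per-token compute (the ~36B active parameters Justin mentions later) are different axes.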
I think the interesting thing to say is that it's the age of MoE now. Everyone is releasing a mixture of experts, and the important thing to pay attention to is that they are not entirely the same.</p><p>[00:17:27] <strong>Yam Peleg:</strong> So there is still exploration in terms of the architecture, of small tweaks to the MoE: how to do them, how to actually implement them better, what works better, what is more efficient, and so on and so forth. We just heard about Qwen MoE, which is also a little bit different from the others.</p><p>[00:17:44] <strong>Yam Peleg:</strong> So there is still exploration going on, and just looking at what is coming out, everything turns out to be in the ballpark of Mistral and Mixtral, which just makes me more curious. Like, how did they do this? How is everything just in the same ballpark as them? How did they manage to train such powerful models?</p><p>[00:18:04] <strong>Yam Peleg:</strong> Both of them. And yeah,</p><p>[00:18:06] <strong>Yam Peleg:</strong> I just want to say that, because it's amazing to see.</p><p>[00:18:10] <strong>Alex Volkov:</strong> So, just to highlight, and I think we've been highlighting this since Grok was released, and now we're highlighting it as well: a significantly smaller model, Mixtral, is still up there. It's still putting up a good fight, even though these models are twice and maybe three times as large sometimes. And we don't know how much Mixtral was trained on, right? But Mixtral is still fighting the good fight after all this time, which is quite incredible. We mentioned this when Grok was released, and now that this was released, we mention it as well.</p><p>[00:18:38] <strong>Alex Volkov:</strong> What else should we talk about in DBRX? Because I think that, obviously, Databricks want to show off the platform. Nisten, go ahead. Welcome, by the way. 
You want to give us a comment about DBRX as well? Feel free.</p><p>[00:18:51] <strong>Nisten Tahiraj:</strong> Hey guys, sorry I'm late. I was stuck debugging C and it finally worked. I just lost track of time. I used DBRX yesterday. I was comparing it, I used it in the LMSYS Arena. And then I opened the Twitter space and told people to use it, and now it just hit rate limits, so you can't use it anymore. Yeah.</p><p>[00:19:11] <strong>Nisten Tahiraj:</strong> It was pretty good. I very briefly did some coding examples. It felt better than Code Llama to me. It wasn't as good as the Claude Opus stuff, but it did give me working bash scripts. So, yeah, in the very brief, short amount of time I used it, it seemed pretty good, so,</p><p>[00:19:31] <strong>Alex Volkov:</strong> Yep.</p><p>[00:19:32] <strong>Nisten Tahiraj:</strong> that's about it.</p><p>[00:19:33] <strong>Nisten Tahiraj:</strong> As for the Mistral and Mixtral question: I use Mistral Large a lot, I use Medium a lot, and the 70Bs, and the Frankenstein merges of the 70Bs, and they all start to feel the same, or incremental over each other. It's just the data. It's the way they feed this thing, the way they raise it. I think they're all raised the same way on the same data.</p><p>[00:20:03] <strong>Nisten Tahiraj:</strong> Yeah, the architecture makes some difference, but the one thing that you notice is that it doesn't get that much better with the much larger models. So it's just the data.</p><p>[00:20:20] <strong>Justin Lin:</strong> That's what I think it is.</p><p>[00:20:21] <strong>Alex Volkov:</strong> I want to ask Justin to also comment on this, because Justin, you had a thread that</p><p>[00:20:24] <strong>Alex Volkov:</strong> had great coverage as well. 
What are your impressions of DBRX, the size, and the performance per size as well?</p><p>[00:20:32] <strong>Justin Lin:</strong> Yeah, the size is pretty large, and it activates a lot of parameters. I remember it's 36 billion. The model architecture is generally fine. Actually, I talked to them a few times around three months ago, last December, and introduced Qwen to them, and I accidentally saw it yesterday; there are some commonalities.</p><p>[00:20:57] <strong>Justin Lin:</strong> I think it is really good. They use the tiktoken tokenizer with the GPT-2 BPE tokenizer. Recently I have been working with the Llama tokenizer, and the SentencePiece tokenizer, well, makes me feel sick. Yeah, it's complicated. But with the GPT BPE tokenizer, because I worked with BPE tokenizers years ago, everything works great.</p><p>[00:21:22] <strong>Justin Lin:</strong> And for Qwen 1.5, we just changed it from the tiktoken implementation to the GPT-2 BPE tokenizer by Hugging Face. It is simple to use. I think it's good to change the tokenizer. And it's also good to have the native ChatML format, so that, I think, in the future, people are going to use this ChatML format, because the traditional chat formats, like human-assistant, have a lot of risks in them.</p><p>[00:21:53] <strong>Justin Lin:</strong> So the ChatML format is generally good. I think they have made a lot of great choices, but I'm not that impressed by their performance in the benchmark results. Although benchmarks are not that important, they are a good indicator. For example, when you look at its MMLU performance, I expect it to be, well, if you have trained it really well.</p><p>[00:22:19] <strong>Justin Lin:</strong> I haven't trained a 100 billion MoE model, but I expect it to be near 80. It is just 73 with 12 trillion tokens. I don't know if they repeat the training epochs, or if they have a diverse 12 trillion tokens. They didn't share the details, but I think it could be even better. 
I am relatively impressed by their coding performance, just as Nisten said.</p><p>[00:22:47] <strong>Justin Lin:</strong> The coding capability looks pretty good, but then I found that, well,</p><p>[00:22:53] <strong>Justin Lin:</strong> it was DBRX Instruct, because you can improve an instruct model to a really high level at HumanEval, but it's hard to improve the base model. I'm not pretty sure, maybe I need to try more, but it's generally a very good model.</p><p>[00:23:10] <strong>Alex Volkov:</strong> Yeah, absolutely. We got a new contender for the open weights, open source lead. So the Llama folks are probably thinking about the release date; it's very interesting what Llama will come out with. Notable that this is only an LLM. There's no multimodality here, and the rumors are that Llama will hopefully be multimodal. So in whatever comparisons folks do to something like GPT-4, it's also notable that this is not multimodal yet, this is just text. One thing I will say is that they call this DBRX Medium, which hints at potentially having a DBRX Large or something. And also something that was hidden and they didn't give it yet: they retrained MPT.</p><p>[00:23:48] <strong>Alex Volkov:</strong> Yam, I think you commented on this, and actually Matei Zaharia, the chief scientist there, commented on your thread. They retrained MPT-7B, which was, for a while, the best 7 billion parameter model, almost a year ago. And they said that it cost them about half as much to train the same model, something like that, which I thought was notable as well.</p><p>[00:24:07] <strong>Alex Volkov:</strong> I don't know, Yam, if you want to chime in on that.</p><p>[00:24:10] <strong>Yam Peleg:</strong> The interesting thing here is that, I mean, it's obvious to anyone in the field that you can make the model much, much, much better if you get better data. 
So, what they basically say, what they basically show with actions, is that you can make the model even twice as good, or twice as cheap to train, depending on how you look at it, just by making the data better.</p><p>[00:24:35] <strong>Yam Peleg:</strong> And my own comment on this is that, at the moment, to the best of my knowledge, better data is something that is not quite defined. I mean, there is a lot of intuition. There are obvious things: when you look at broken data, it's broken. But it's really hard to define what exactly is better data, apart [00:25:00] from deduplication and all of the obvious.</p><p>[00:25:03] <strong>Yam Peleg:</strong> It's very hard to define what exactly is the influence of specific data on performance down the line. So it's really interesting to hear from people that have done this and made a model twice as good: what exactly did they do? Because they are probably onto something quite big to get to these results.</p><p>[00:25:27] <strong>Yam Peleg:</strong> Again, it's amazing to see. I mean, it's just a year, maybe even less than a year, of progress. I think MPT is from May, if I remember. So it's not even a year of progress, and we already have twice as good models, and things are progressing.</p><p>[00:25:42] <strong>Alex Volkov:</strong> Worth mentioning also that Databricks not only bought Mosaic, they bought a bunch of startups, like Lilac. We had the folks from Lilac, Nikhil and Daniel, here on the pod, and we talked about how important their data tools specifically are, and they've been a big thing in open source.</p><p>[00:25:58] <strong>Alex Volkov:</strong> All these folks from Databricks, they also highlight how much Lilac helped them understand their data. 
So I'm really hoping that they're going to keep Lilac around and free to use as well. One last thing that I want to say, it's also breaking news, it happened two hours ago: the author of MegaBlocks, the training library for MoEs, Trevor Gale, I think he's at DeepMind, has now given Databricks the MegaBlocks library.</p><p>[00:26:23] <strong>Alex Volkov:</strong> So Databricks is also taking over and supporting the MegaBlocks training library for MoEs, which they say outperforms the next best library for MoEs as well. And there was a little bit of a chat where Arthur Mensch from Mistral said, hey, welcome to the party. And then somebody replied and said, you are welcome, and they showed the core contributors to the MegaBlocks library, and a lot of them are folks from Databricks. So now they've taken over this library.</p><p>[00:26:50] AI21 - JAMBA - hybrid Transformer/Mamba Architecture 52B MoE</p><p>[00:26:50] <strong>Alex Volkov:</strong> So yes, MoE seems to be a big thing, and now let's talk about the next hot MoE. AI21, I think the biggest AI lab in Israel, released something called Jamba, which is a 52 billion parameter MoE. And the interesting thing about Jamba is not that it's an MoE, it's that it's Mamba plus joint attention. So it's a Mamba transformer, is that what it is? It's a combined architecture. We've talked about state space models a little bit here: we actually talked with the author Eugene from RWKV, we've mentioned Hyena from Together AI, and we mentioned Mamba before. And all I remember is that we said those Mamba models still don't get the same kind of performance. And now we're getting this 52 billion parameter mixture of experts model that does. It's quite impressive on some numbers, and it even comes close to Llama 2 70B, which is quite impressive. MMLU is almost 70, 67%. I don't see a HumanEval score. I don't think they added this. 
But they have quite impressive numbers across the board for something that's a new architecture.</p><p>[00:27:52] <strong>Alex Volkov:</strong> 52 billion parameters with 12 billion active. And what else is interesting here? The new architecture is very interesting: it supports up to 256 thousand tokens of context length, which is incredible. This open model now beats Claude 2 on just context length, which is also incredible. Just to remind you, Databricks, even though they released a long context model before, DBRX is 32,000.</p><p>[00:28:15] <strong>Alex Volkov:</strong> This is 256. And not only does it support 256K: because of its unique architecture, they can fit up to 140K of context on a single A100 80GB GPU. I know I'm saying a lot of numbers very fast, but if you guys remember, for those of you who frequent the pod, we've talked with the folks from the YaRN scaling method. And the problem with the context window in Transformers is that the more context you have, the more resources it takes, in very basic terms. And so the SSM models and the Mamba architecture specifically focus on lowering the requirements for long context. And this model gets three times the throughput on long contexts compared to Mixtral</p><p>[00:28:57] <strong>Alex Volkov:</strong> 8x7B, compared to Mixtral, basically. So very exciting. Yam, you wanna comment on this? I know you're almost there, meeting with the guys, but please give us the comments.</p><p>[00:29:07] <strong>Yam Peleg:</strong> I'm there. 
I'm there in five minutes, so maybe, if time works in our favour, I can even get you the people on the pod.</p><p>[00:29:14] <strong>Alex Volkov:</strong> That'd be incredible.</p><p>[00:29:15] <strong>Yam Peleg:</strong> I'm just, yeah, what is important here, in my opinion, is that, first, I mean, it's absolutely amazing to see the results.</p><p>[00:29:23] <strong>Yam Peleg:</strong> But what was not known up to this point is whether or not those types of models scale to these sizes. We had smaller Mambas, and they looked really promising, but we were at the point where, okay, it looks promising, it looks like it could be in the same ballpark as transformers, but to test this out, someone needs to just invest a lot of money into the compute and see what results they get.</p><p>[00:29:53] <strong>Yam Peleg:</strong> And it's a risk. You don't know what you're going to get if you're going to do it. And it turns out that you get a really good model, in the same ballpark. Maybe slightly less performant than a transformer, but that is to be expected. The thing worth mentioning here is that the Mamba architecture is way more efficient in terms of context size.</p><p>[00:30:15] <strong>Yam Peleg:</strong> As you just said, transformers are quadratic in terms of complexity when you increase the context. So if you have two tokens, you need four times, you could say, the memory. And if you have four tokens, you need 16, and it just goes on and on and it just explodes, which is why context length is such a problem. But Mamba scales much more friendly, memory-friendly, you could say.</p><p>[00:30:39] <strong>Yam Peleg:</strong> But the thing is that you do pay with the performance of the model. So what people do is a hybrid between the two, so you can find some sweet spot where you don't use so much memory, and yet you don't have the performance degrade that badly. And I mean, yeah, it's a risk. 
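The two-tokens-need-four-times point above is the n x n attention-score matrix; the KV cache, meanwhile, grows only linearly, and Mamba-style layers keep a fixed-size state regardless of context. A back-of-the-envelope sketch, where every model dimension is an illustrative placeholder, not Jamba's or Mixtral's real config:

```python
# Quadratic vs. linear memory growth with context length.
# All dimensions below are placeholders, not any real model's config.

def attn_score_floats(n_tokens: int, n_heads: int = 32) -> int:
    """Naively materialized attention scores: one n x n matrix per head."""
    return n_heads * n_tokens * n_tokens

def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """fp16 K and V caches: 2 tensors x layers x kv-heads x head_dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Doubling tokens quadruples the attention-score memory (2 -> 4 tokens: 4x)...
score_ratio = attn_score_floats(4) / attn_score_floats(2)

# ...but only doubles the KV cache (128K -> 256K tokens: 2x).
kv_ratio = kv_cache_bytes(256_000) / kv_cache_bytes(128_000)
```

Replacing most attention layers with Mamba layers shrinks the per-token cache further, which is the lever that lets a hybrid like Jamba fit very long contexts on a single 80GB card.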
At the end of the day, training such a large model is a lot of money, a lot of money in terms of compute.</p><p>[00:31:06] <strong>Yam Peleg:</strong> And they did it, released it under Apache 2.0, which is amazing for everyone to use, and proved to everyone that, all right, if you follow this recipe, you get this result. Now people can build on top of that and can train maybe an even larger model, or maybe just use this model. I didn't try it yet, but I think it's an incredible thing to try, because it's not the same as Mixtral.</p><p>[00:31:33] <strong>Yam Peleg:</strong> Mixtral is a little bit better, but it's in the same ballpark as Mixtral, and you get way more context, at your home, on a small GPU, for cheap. It's amazing.</p><p>[00:31:41] <strong>Alex Volkov:</strong> and Mixtral specifically,</p><p>[00:31:43] <strong>Yam Peleg:</strong> potential.</p><p>[00:31:45] <strong>Alex Volkov:</strong> Thanks, Yam. I just want to highlight that Mixtral is this amazing model that we compare models three times its size to, and they barely beat Mixtral. We talked about this when Grok-1 was released, and we talked about it now, when DBRX was released with</p><p>[00:31:57] <strong>Alex Volkov:</strong> 12 trillion tokens of data.</p><p>[00:32:00] <strong>Alex Volkov:</strong> Mixtral is basically the gold standard. We've always had this standard for how well-performing an open model could be, and it has been, for a while, the best open model that we have. And now we're getting this new architecture, a completely new architecture, basically a bet on whether it would even scale, from the folks at AI21. And it comes close to Mixtral, but it does 3x the throughput on long contexts compared to Mixtral. And it has a 256K context window. If you want to get this from Mixtral, you can train it with YaRN, you can do all these things, but then you won't be able to actually scale it 
hosted, because it's going to cost you so much money because of the quadratic attention.</p><p>[00:32:33] <strong>Alex Volkov:</strong> And</p><p>[00:32:34] <strong>Alex Volkov:</strong> they specifically say it's the only model of its size class that fits up to 140,000-token context windows on a single GPU, which is quite incredible. And obviously the Apache 2 license is great. I don't know if they also released stuff like training code and data. So we're definitely going to keep you posted.</p><p>[00:32:50] <strong>Alex Volkov:</strong> And Yam hopefully will ask all these questions. But the efficiency in speed, where the closer you get to 128K context, the faster the model performs, is also quite incredible. The graphs are there, we're going to post everything in the show notes, but absolutely a great release from AI21. Shout out AI21 folks, and definitely give them our love there, specifically because of the Apache 2 license. Anything else? I want to hear from maybe Justin, if you want to comment on the joint architecture. Have you guys played with [00:33:20] the joint attention and Mamba? What's your reaction to this?</p><p>[00:33:25] <strong>Justin Lin:</strong> Yeah, we are trying Mamba with very small architectures. We can reach similar performance to a transformer, but we did not scale it to very large sizes, so we don't know what will happen.</p><p>[00:33:38] <strong>Alex Volkov:</strong> So this is great, and Apache 2, and we're very happy. Shout out to the folks at AI21. Briefly, let's cover the rest of the stuff that we still have to cover in open source.</p><p>[00:33:47] Mistral base 0.2</p><p>[00:33:47] <strong>Alex Volkov:</strong> We'll briefly cover this in the TLDR. We'll start with the Mistral 0.2 base release. 
so for fine-tuning, obviously, for folks who know, it's better for fine-tuning purposes to have a base model than the instruct model, because then you can fine-tune it. Mistral</p><p>[00:33:59] <strong>Alex Volkov:</strong> 0.2 base was released at a hackathon last week together with Cerebral Valley and some other friends in San Francisco.</p><p>[00:34:08] <strong>Alex Volkov:</strong> There was some confusion about it, because we had Instruct 0.2 before, a model that said it was based on Mistral 0.2 and was very well performing, the 7 billion parameter one. And now there is the base model. And then somebody went and changed the base of the instruct model to this one versus the previous one, but never mind, they cleared that confusion up, and we have this base model.</p><p>[00:34:28] <strong>Alex Volkov:</strong> It's also open source and it's great.</p><p>[00:34:30] <strong>Nisten Tahiraj:</strong> There is one thing here about the previous Mistral Instruct that they released. That one has been trained for 32K context, and I used it as a personal chatbot, comparing it with just the base Mistral 7B, and I'm noticing it is much better at carrying forward a conversation.</p><p>[00:34:50] <strong>Nisten Tahiraj:</strong> So I think a lot of the fine-tunes should probably switch and just rerun on the new Mistral Instruct, especially the ones that are geared towards conversational stuff. Because again, the old Mistral Instruct is limited to 8K, and realistically you should always keep it under 4K to get accuracy.</p><p>[00:35:11] <strong>Nisten Tahiraj:</strong> So that's one thing here. The new 7B performs much better at larger contexts and at summarizing.</p><p>[00:35:18] Starling 7B beta - top apache 2 LLM in the world</p><p>[00:35:18] <strong>Alex Volkov:</strong> One incredible piece of news is Starling. And I think, Justin, both of you and Yam as well talked about this. 
Starling actually is a 7 billion parameter model that beats GPT 3.5 on the LMSys Arena, which is quite incredible, right?</p><p>[00:35:34] <strong>Alex Volkov:</strong> I think it's the first and only 7 billion parameter model that beats GPT 3.5 on user preference. And it was hidden in between the DBRX news,</p><p>[00:35:42] <strong>Alex Volkov:</strong> but let me see if I can put this up here real quick. So this model was released, what, a week and a day ago.</p><p>[00:35:48] <strong>Alex Volkov:</strong> What do we know from this?</p><p>[00:35:49] <strong>Yam Peleg:</strong> Yeah, I have to go in five minutes, so I just want to say about Starling: this is the second model. So if you haven't tried the first one yet, you definitely want to try it. I know there are people that are skeptical about 7B models, saying they are too small. Just give this one a try.</p><p>[00:36:10] <strong>Yam Peleg:</strong> Just give this one a chance. Trust me, just give this specific one a chance. It is an amazing model, seriously, and it's showing everyone that there is a lot more to squeeze out. Scale works, absolutely, but there is a lot more to squeeze out besides scale. And I seriously can't wait for the same technique to be applied on a larger model, just to see what we get to.</p><p>[00:36:35] <strong>Yam Peleg:</strong> Because it's an amazing result, seriously.</p><p>[00:36:37] <strong>Alex Volkov:</strong> Nisten, go ahead.</p><p>[00:36:40] <strong>Nisten Tahiraj:</strong> So the model is still Mistral-based, and it's actually based off of OpenChat 3.5. The one thing that the NexusRaven team does well is they had that NexusRaven 13B model, and for some time that was the best function-calling small model you could get.</p><p>[00:36:59] <strong>Nisten Tahiraj:</strong> So I haven't tried this one, but I highly suspect it's probably pretty good at function calling. 
I'm just looking at it right now: it is Mistral-based, it's exactly based off of OpenChat 3.5 from Alignment Lab, so they fine-tuned on top of that, and yeah, I would highly recommend people use it.</p><p>[00:37:20] <strong>Nisten Tahiraj:</strong> I've used the one that was trained off of OpenChat a lot, and</p><p>[00:37:24] <strong>Alex Volkov:</strong> They did a bang-up job there, because this 7 billion parameter model now beats GPT 3.5, beats Claude 2.1, beats Mistral Next, and Gemini Pro and Claude 2. Based on LMSys at least, this is the 13th model, it's 7 billion parameters, it's Apache 2, this is from the Berkeley folks. This is the only Apache 2 licensed model on the LLM leaderboard in the first, like, top</p><p>[00:37:48] <strong>Alex Volkov:</strong> 20, I think, or top 13. I don't know how, but it beats Mixtral. So anyway, yeah, Starling is great. It looks great, try it, folks. Definitely worth mentioning. We're gonna run through some other updates, because we still have tons of stuff to cover, and then we have some guests here in the audience that want to join and talk about very interesting things.</p><p>[00:38:05] LISA beats LORA for AI Finetuning</p><p>[00:38:05] <strong>Alex Volkov:</strong> I don't have a lot of information about LISA specifically, but I will just mention that, if you guys are in the fine-tuning area, you know LoRA, and we have LoRA in the diffusion models area as well, low-rank adaptations, so folks in the diffusion world have been training LoRAs for a while, more than a year, and now there's a new paper that dropped that's called a new method for memory-efficient large language model fine-tuning.</p><p>[00:38:27] <strong>Alex Volkov:</strong> I'll say this slowly: a new method for memory-efficient large language model fine-tuning. 
So this is not for diffusion stuff, this is for large language models. It's called LISA, and it achieves better performance than LoRA in less time on models up to 70 billion parameters. And yeah, the results look pretty cool. For folks who do fine-tuning, it's worth comparing this, and I know for a while we had different methods for fine-tuning, like QLoRA, for example, different LoRAs, and there was an attempt to figure out which one is the best. So LISA now is a new contender with a paper out, and I think code will follow as well.</p><p>[00:38:59] <strong>Alex Volkov:</strong> LISA can fine-tune models up to 7 billion parameters on a single 24-gigabyte GPU. So you can fine-tune a 7 billion parameter Mistral, for example, on a 4090 with its 24 gigabytes, which is pretty cool.</p><p>[00:39:13] <strong>Alex Volkov:</strong> And the code implementation in LMFlow is very simple. So awesome to have this, and we'll add it to the show notes for folks who actually do fine-tunes. So I think that covers all of the open source stuff, and we obviously spent almost an hour running through open source, and I do want to move towards the next super exciting stuff that we have this week before we jump into a conversation.</p><p>[00:39:37] Hume EVI emotion based TTS</p><p>[00:39:37] <strong>Alex Volkov:</strong> Yes, I want to move into Hume, into the voice and audio category. This is an unusual jump between categories. We usually talk about big companies next, but there's honestly not that much that happened there, so maybe we'll briefly cover it. But the thing that broke my mind, I'm going to paste this on top here, and hopefully you guys will just listen to me instead of going and watching, is that a company called Hume finally released something that many people have been very excited about, and they showed a few demos there. 
so Hume has been around for a while.</p><p>[00:40:08] <strong>Alex Volkov:</strong> Apparently they do emotion analysis very well, and they actually have this product out there: you can upload video, and actually audio, of yourself speaking, and they will give you an understanding of what you're saying, of your emotions and intonations, which is pretty cool. And we know that's a piece that's missing from multimodal LLMs, right? Okay, so Hume, they already had a platform for emotion understanding, and yesterday Hume released their demo of an emotional TTS, a text-to-speech model that doesn't just speak the text, it actually replies with emotion, combined with the previous system they had that can understand your emotion. As you can hear, as I'm talking about this, I was a little bit sad when Hamel had to drop, but now I'm very excited again to talk to you about Hume. So they actually have a running analysis of this voice as it runs, and they understand where you are on the emotion scale, which is, first of all, exciting to see on yourself; second of all, it's very alarming. Their understanding of emotions, whether or not it's precise enough to tell the truth, for example. And their text-to-speech that generates emotion-based speech is quite something. I've never seen anything close to it before. The only thing that came close for me is, if you guys remember, we talked about ElevenLabs having a style-transfer thing where you can actually talk and they would take an AI voice and basically dub you, but with the same emotion. So that was the only thing that came close to what I heard yesterday from Hume. So Hume has this model that's gonna be out in, I think they said April? [00:41:40] where you'd be able, as a developer, to assign what emotion it will answer with. And together with the first part, which is voice emotion understanding, they now have text-to-speech with emotion. 
The whole end-to-end feeling is like nothing I've ever experienced, and Robert, I think I saw you repost about this first, so I want to hear if you played with the demo and what your thoughts are, because I was blown away.</p><p>[00:42:14] <strong>Robert Scoble:</strong> I was blown away too. You nailed it. It lets AI understand your emotion and build a much more human interaction with AI. The one problem is, I believe it's $7 an hour or something like that, so it's fairly expensive to integrate. But for people who are building new kinds of applications that are going to have to integrate with human beings, I think it's very well done. You should look at it.</p><p>[00:42:41] <strong>Alex Volkov:</strong> Absolutely. And definitely for folks who get the uncanny valley from different LLMs, where reading for a long time is not quite the same, I think we're gonna see some more emotionality in many of these demos, and it's gonna be very exciting. Together with the fact that recently there has been this video from HeyGen, the deepfake company that translates your video and syncs your lips, and people were saying, hey, this is fully end-to-end AI and we're so doomed; all of these AI-generated voices still use ElevenLabs, so I've got to think that ElevenLabs is not going to be that much behind and will start working on some emotional output as well. But I will definitely add the link to this, and actually the video of me testing out Hume, in the show notes, and you're more than welcome to try this as well.</p><p>[00:43:27] <strong>Alex Volkov:</strong> I think the demo is demo.hume.ai. They actually have a chatbot on the website, hume.ai, where you can talk to the chatbot with your voice, and it answers with voice as well, but the full demo is more mind-blowing. 
They understand your emotionality, and they translate that emotionality into the actual context. And when the model talks back at you, and when you try to fake it, and you yell but you say, "I'm so happy," the model says, hey, you look a little bit conflicted. So it actually understands that what you're saying and the way you're saying it are different.</p><p>[00:44:00] <strong>Alex Volkov:</strong> So they actually built this understanding into the demo, which is super cool to play with. Yeah, so Hume, definitely worth checking out. I think that in voice and audio, that's basically it for what we had to cover, but a similar area in AI creation is vision and video.</p><p>[00:44:15] SORA examples from filmmakers</p><p>[00:44:15] <strong>Alex Volkov:</strong> And this week, oh my God, the beginning of this week was all excitement about how the world of entertainment will look, and the reason is because of OpenAI and Sora. I'm hoping Sora needs no introduction at this point, right? Sora is OpenAI's text-to-video model, and it's leagues above everything else that we saw in the world before this, and it blew our creative minds, and keeps blowing some people's minds on TikTok. And OpenAI gave access to Sora to a few content creators, not Hollywood; apparently they're on their way to Hollywood right now to talk with folks. But they gave it to a few filmmakers in the independent world, I think a few companies from Toronto, and they finally showed us demos of what,</p><p>[00:45:03] <strong>Alex Volkov:</strong> instead of the developers at OpenAI and the prompts they do with Sora, an actual studio can do with some creativity. And it looks like they also hired an artist in residence at OpenAI as well. And wow, my mind was definitely blown. 
There was one short video that looked like something I would have seen at the Sundance festival. It's called Airhead, from a Toronto-based film</p><p>[00:45:28] <strong>Alex Volkov:</strong> creator called ShyKids, and I'm gonna add this to the show notes, because this, at least for me, was the most viral thing that I saw. And I absolutely loved it. It felt very human, it felt incredible. It's this very short story about somebody with a balloon instead of his head, and the way they tell the story, they work around the technical limitations, which we all know, right? If you generate two videos in Sora, the character persistence between those two videos will not be there, and that's a big problem with every video generation. But this one, they worked around, because they told the story of this balloon guy and his head throughout his life, so the character consistency isn't really required there. And I just really love that actual storytellers can work around the technology to create something that feels so good. Obviously the audio there was amazing, and the production and the storytelling, everything. So I think everybody saw it at this point, but if you haven't, Airhead from ShyKids is quite incredible.</p><p>[00:46:27] Tencent AniPortrait - Animated Avatars</p><p>[00:46:27] Okay, I want to talk about Tencent releasing something called AniPortrait, Ani with A-N-I, like animated portrait, and it's generating photorealistic animated avatars. And if you guys remember Emo, we've talked about it before; Emo was quite incredible to me. 
The examples that Emo showed were pretty much the same level, the same jump in capability, that Sora showed over previous image-to-video generations; Emo showed it for image-to-animated-character, and</p><p>[00:46:56] <strong>Alex Volkov:</strong> it was incredible.</p><p>[00:46:56] <strong>Alex Volkov:</strong> Like, lips moved, and eyes, and the consistency was there. So, the problem with Emo is that they haven't released the code, and I think for now Emo is the AI GitHub repo with the highest number of stars and no code. I think it's like 25,000 stars or something. Everybody's waiting for Emo and it hasn't dropped.</p><p>[00:47:15] <strong>Alex Volkov:</strong> And when I say everyone, I mainly mean the waifu-creator world, who would love nothing more than to generate an image in Stable Diffusion and then animate it with some, let's say, emotional voice from the Hume thing that we just mentioned. But the second best one for now is AniPortrait, and the code actually dropped. And the lip movement is great, and the eyes; it's not close to Emo, but it's really good compared to Wav2Lip in different areas. And if you've ever built animated-character AI stuff, you'll know that the open source options</p><p>[00:47:49] <strong>Alex Volkov:</strong> were not great. The closed source options like HeyGen, and different labs like D-ID and Synthesia, I think, I don't remember the name, they were okay, they were great. But the open source options were not there. So AniPortrait right now is the best version we have, and it dropped yesterday. If you are doing any kind of character animation, give AniPortrait a try and let us know. I'm definitely gonna play with this.</p><p>[00:48:12] <strong>Alex Volkov:</strong> Definitely gonna play</p><p>[00:48:12] <strong>Alex Volkov:</strong> with this. 
I think we've covered most of the stuff that we wanted to cover, besides Weights & Biases stuff and big companies.</p><p>[00:48:18] MindEye 2 - Interview with Tanishq and Paul from MedArc</p><p>[00:48:18] <strong>Alex Volkov:</strong> But now I am very excited to bring up two friends, one a friend of the pod for a long time, and now a new one, Paul Scotti. You guys are here to talk to us about MindEye, now in its second version. So I will just briefly do an introduction: MindEye came around the summer, I want to say, and we covered it, because in my head everything was multimodal, multimodal, when are we going to get multimodal? This was before vision happened. And one of the craziest multimodalities that we expected was something like an fMRI signal, brain signals. And then you guys released MindEye, which was, like, mind-blowing. And so I would love to hear about the history of how MedARC started doing brain interpretation, and then let's talk about MindEye 2 and what's exciting about this recent release. But feel free please to unmute, Tanishq and then Paul, and introduce yourselves briefly.</p><p>[00:49:08] <strong>Tanishq Abraham:</strong> Yeah, I'll just provide a quick background and summary, and then I'll let Paul talk about MindEye 2 in more detail. But yeah, basically, introducing myself again: I'm Tanishq, I work at Stability AI, and I'm also the founder of MedARC and I lead MedARC, which is an open-source medical AI research organization.</p><p>[00:49:30] <strong>Tanishq Abraham:</strong> And we mostly are focused on training foundation models for medicine. And we do have a line of research in neuroscience and AI, combining AI and neuroscience, which is what Paul is leading at MedARC. 
But yeah, we started looking into this sort of neuroscience AI research quite some time ago, actually.</p><p>[00:49:54] <strong>Tanishq Abraham:</strong> Actually, I think even before I officially started MedARC, when I was organizing [00:50:00] some open-source medical AI projects, this was one of the projects that I had started, I think back in the summer of 2022. And generally, the idea was that we were working on this fMRI-to-image reconstruction problem, which is basically the idea that we have a person that is looking at some images, and we take their fMRI signal,</p><p>[00:50:25] <strong>Tanishq Abraham:</strong> and we want to use AI to reconstruct the image that the person was looking at just from the fMRI signal. So it's this sort of mind-reading kind of problem that we're working on. And back in 2022, when we started working on this, the techniques that people were using were quite basic, and I think the neuroscience community was quite behind in what they were using.</p><p>[00:50:48] <strong>Tanishq Abraham:</strong> So I think we were pretty excited about the possibility of utilizing some of the latest techniques in generative AI to advance this field. And yeah, first I started this project and there were a couple of volunteers that were helping out, but luckily Paul had also discovered that we were working on this, and he joined the project and really spearheaded this neuroscience AI initiative that we've been having at MedARC.</p><p>[00:51:14] <strong>Tanishq Abraham:</strong> And yeah, that resulted in MindEye, which we released in April, I think May, of last year, and then we've been continuing to work on improving those results, and that has now resulted in MindEye 2. 
And we also have some other sorts of projects in the neuroscience AI area, like training foundation models for fMRI, and we're exploring some other ideas as well.</p><p>[00:51:37] <strong>Tanishq Abraham:</strong> But yeah, I think with MindEye 1, we had a very simple sort of pipeline of taking the fMRI signal and converting it to CLIP image embeddings, and then basically regenerating an image from the CLIP image embeddings. And that worked quite well; the only issue with that was that it required a lot of data. And we have developed this new pipeline, which Paul will talk more about, that requires less data, is more efficient, and is also giving better results with better image generation models; for example, we're using SDXL for this MindEye 2 model. So yeah, I think I'll let Paul talk more about the motivation and how MindEye 2 works.</p><p>[00:52:18] <strong>Alex Volkov:</strong> So just before we get to Paul, thank you for joining, guys. First of all, I just want to highlight how insane the thing you guys are talking about is to me, where many people think, oh yeah, generative AI generates images and generates some text, and you guys are translating brain signals into what people actually saw. And I think I saw from you also a separate attempt to understand fMRI. So Paul, feel free to introduce yourself, and maybe also cover prior work in this area. I would love to know if this is something you guys came up with or something you saw and improved on. I would love to know as well.</p><p>[00:52:52] <strong>Alex Volkov:</strong> That's</p><p>[00:52:57] <strong>Paul Scotti:</strong> Yeah, like Tanishq was saying, we started out working on this together over Discord back in 2022. And at the time, there weren't really any good results doing reconstruction of images from looking at images inside of an MRI machine. 
And what really spurred several new papers in this field is open-sourced image generation models, like Stable Diffusion and CLIP models, and also, importantly, a good dataset of people looking at images in an MRI machine.</p><p>[00:53:34] <strong>Paul Scotti:</strong> It's a very difficult dataset to collect, because we're talking about eight people who spent 30 to 40 hours inside of this MRI machine looking at images one at a time for three seconds each.</p><p>[00:53:48] <strong>Paul Scotti:</strong> So it really was the culmination of the dataset and new models that allowed this to work. For the MindEye 2 stuff specifically, we focused on trying to get good results using only one hour instead of 40 hours of data.</p><p>[00:54:07] <strong>Paul Scotti:</strong> And this is pretty important, because if you're trying to do these machine learning techniques on new subjects, new datasets, maybe apply it to the clinical setting, you aren't going to be collecting dozens of hours of data, especially for clinical populations. It's just too expensive, and you're taking up their valuable time.</p><p>[00:54:29] <strong>Paul Scotti:</strong> So there's a lot of papers now that have been focusing on fMRI-to-image, just because it's a cool topic. Our paper shows state-of-the-art results, but specifically in the one-hour domain, we show that you can pre-train a model on other people's brains in order to have a better starting point to fine-tune the model on a separate, held-out subject's brain.</p><p>[00:54:54] <strong>Paul Scotti:</strong> And for people who aren't as familiar with neuroimaging stuff or how the brain works, your brain is wired very differently to other people's. It's not like there's the same part of the brain that always handles what happens when you look at a picture of an elephant or something.</p><p>[00:55:15] <strong>Paul Scotti:</strong> We have different shapes and sizes of brains. 
We have different patterns of activity that lead to how we perceive vision. And the reconstructions that we're talking about are not as simple as just, was it a dog that you were looking at, was it an elephant? So you need some sort of way to align all these different people's brains and their different visual representations into a shared latent space, so that you can then get the rest of this pipeline, with the diffusion models and MLPs, to work and actually have that be informative to generalize from my brain to your brain.</p><p>[00:55:53] <strong>Alex Volkov:</strong> So incredible that I have so many questions, Paul. But I will start with maybe the differences between brains, something that you said. I also want to talk about the visual cortex and how that thing happens, but I would be remiss if I didn't at least mention that you guys are talking about MindEye at the same time we got the first Neuralink-implanted human, showing that he can control basically a machine with his brain, with implants.</p><p>[00:56:19] <strong>Alex Volkov:</strong> But you guys are doing a completely non-invasive kind of understanding of these brain signals, to an extent, whereas Neuralink is some sort of invasive understanding of brain signals, transforming them into actions rather than something that they see. But they mentioned that they're working on fixing sight</p><p>[00:56:34] <strong>Alex Volkov:</strong> as well.</p><p>[00:56:34] <strong>Alex Volkov:</strong> Could you maybe give us a brief understanding of fMRI, and how that translates into the signals from the visual cortex? 
How does this machine know what I see, and how are you then able to use diffusion models to recreate what I see?</p><p>[00:56:48] <strong>Alex Volkov:</strong> Could you give us a little bit more of a, what's, where's the magic here?</p><p>[00:56:52] <strong>Paul Scotti:</strong> Yeah, so fMRI right now is the best method if we're talking about non-invasive tech. If you have electrodes on someone's brain, obviously that's going to give you a much better signal. But it's also not viable to do that for most projects, and for applying it to clinical settings and new research and everything.</p><p>[00:57:14] <strong>Paul Scotti:</strong> So we used fMRI, which is a bit crude in the sense that you have these people that need to make as little motion as possible. The MRI machine is basically tracking blood flow. So when you look at an image of something, the neurons in your brain that correspond to representing that image are active, and they require more oxygenation, relative to the other voxels in the brain that are not as relevant for activating to that image.</p><p>[00:57:50] <strong>Paul Scotti:</strong> Basically, you're tracking this kind of slow-moving time course of blood flow that corresponds to where the brain is active. And then you have this 3D volume of the brain, and the corresponding blood oxygenation changes for every given 3D cube, or voxel, in the brain. And what we did is we took all the voxels corresponding to the visual cortex, the back of the brain that seems to be active when you look at stuff, and we feed that through this neural network.[00:58:20]</p><p>[00:58:20] <strong>Paul Scotti:</strong> And specifically, we feed that through MLPs and a diffusion prior and all this stuff, to give us a model that can translate from brain space to CLIP space, 
where CLIP is these models that are contrastively trained, typically with text and images, so that you can have this multimodal space where you have the ability to align a given image caption with the image itself.</p><p>[00:58:48] <strong>Paul Scotti:</strong> This you can think of as a third space, a new modality for CLIP, that's the brain. So we use the same sort of technique of contrastively mapping the brain, and its paired samples corresponding to the images, into the CLIP space. And then there are so-called unCLIP models, also sometimes called image variation models, that allow you to undo CLIP space back to pixel space.</p><p>[00:59:13] <strong>Paul Scotti:</strong> And so that's how we actually get the image reconstructions at the end, where the model only gets the brain activity and has to generate the corresponding image.</p><p>[00:59:23] <strong>Alex Volkov:</strong> So I'm still picking my jaw up from the floor here, because what you're basically saying is that the same architecture that is able to draw cats by understanding the word cat, and pull the concept of a cat from latent space, you've now been able to generalize and add a modality to, which is the brain's understanding of a cat, what happens in the blood flow in the visual cortex when somebody looks at a cat, and you're basically placing it in the same latent space neighborhood. And now you're able to reconstruct an image based on this. I'm still trying to wrap my head around this, but I would love to maybe ask</p><p>[01:00:01] <strong>Alex Volkov:</strong> Tanishq as well: could you talk about MindEye 2, and specifically the improvements that you made, how you achieved them and what they are, and then how it applies to the clinical field?</p><p>[01:00:11] <strong>Tanishq Abraham:</strong> Right. I mean, so with MindEye 2, like Paul mentioned, our main focus was: what can we do to basically use less data when it comes to a new subject? 
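The contrastive brain-to-CLIP mapping described above can be sketched roughly as follows; the shapes, the single linear map, and the one-directional InfoNCE loss are all simplifying assumptions for illustration (MindEye itself uses MLPs plus a diffusion prior, not one matrix):

```python
import numpy as np

# Minimal sketch of contrastively mapping brain activity into CLIP space:
# project flattened visual-cortex voxels into the CLIP image-embedding
# space, then score how well each brain sample matches the CLIP embedding
# of the image the subject was looking at. All dimensions are illustrative.

rng = np.random.default_rng(0)
num_pairs, num_voxels, clip_dim = 4, 100, 8

brain = rng.normal(size=(num_pairs, num_voxels))    # voxel activity per image
clip_img = rng.normal(size=(num_pairs, clip_dim))   # paired CLIP embeddings
W = rng.normal(size=(num_voxels, clip_dim)) * 0.01  # learnable brain->CLIP map

def contrastive_loss(brain, clip_img, W, temperature=0.07):
    z = brain @ W                                    # brain -> CLIP space
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    c = clip_img / np.linalg.norm(clip_img, axis=1, keepdims=True)
    logits = (z @ c.T) / temperature  # similarity of every brain/image pair
    # InfoNCE: the matching (diagonal) pair should score highest in its row
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(brain))
    return -log_probs[idx, idx].mean()

loss = contrastive_loss(brain, clip_img, W)
print(float(loss))
```

Training would minimize this loss over `W`; an unCLIP / image-variation model then maps the aligned CLIP-space point back to pixels, which is the reconstruction step described next.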
So if you have a new person whose mind you want to read, you want to do this reconstruction, we don't want them to have to do 40 hours of scanning, because with MindEye 1 you'd have to basically train a separate model for every single subject.</p><p>[01:00:34] <strong>Tanishq Abraham:</strong> So it was a completely separate model for each subject. If you had a new subject, you would have to get 40 hours of scanning with that new subject to create a new model.</p><p>[01:00:42] <strong>Tanishq Abraham:</strong> So the idea with MindEye 2 is that we train a model on all of the previous subjects. So for example, we have eight subjects in the dataset: you train on seven of the subjects, and then you are able to fine-tune that model on a new subject, but you only need one hour of data.</p><p>[01:01:06] <strong>Tanishq Abraham:</strong> So basically, for any new subject, now you only need one hour of data.</p><p>[01:01:09] <strong>Tanishq Abraham:</strong> The way that works is that basically we have adapter layers, which are just these sorts of linear layers that you have for each subject. 
So you basically have this layer where you have the fMRI data from a new subject, and this linear adapter layer is basically converting it to a kind of shared space for all the fMRI data.</p><p>[01:01:32] <strong>Paul Scotti:</strong> So then basically when you are taking a new patient or a new subject, all you have to do is fine-tune this linear adapter for that new subject. And yeah, that's the general idea; that way, we only have to use one hour of data.</p><p>[01:01:49] <strong>Paul Scotti:</strong> But then on top of that, of course, we have various modifications to the entire pipeline that also just give you better results overall. So for example, in the past, when we were taking our CLIP image embedding and then reconstructing, we used a different model called Versatile Diffusion, but here what we did is we actually took SDXL. And the problem with a model like SDXL, for example, is that it only takes in CLIP text embeddings.</p><p>[01:02:19] <strong>Paul Scotti:</strong> Because these models are text-to-image models, oftentimes a lot of these models are taking CLIP text embeddings, and that's what they're conditioned on. But here, what we did is we fine-tuned SDXL to instead be conditioned on CLIP image embeddings, and so we have this SDXL unCLIP model, that's what we call it. So that is one improvement, for example: we use this model instead of the previous model, which was Versatile Diffusion.</p><p>[01:02:42] <strong>Tanishq Abraham:</strong> There are a few other improvements to the architecture and to the conditioning that we have. I think Paul can talk more about that, but apart from these general improvements,
I think the main innovation is the use of these adapters for</p><p>[01:02:59] <strong>Tanishq Abraham:</strong> each subject, which allows us to then fine-tune for new subjects with only one hour of data.</p><p>[01:03:05] <strong>Tanishq Abraham:</strong> Paul, feel free to add any other details as well.</p><p>[01:03:08] <strong>Alex Volkov:</strong> Yeah. I want to follow up with Paul specifically: you're moving from 40 hours to, let's say, one hour, but one hour is still in this fMRI machine, basically a coffin, right? It's a huge machine, it's incredibly expensive, so the data... Maybe I'm actually going to presume here, but please correct me if I'm wrong.</p><p>[01:03:26] <strong>Alex Volkov:</strong> Unlike other areas, where synthetic data is now a thing that people use to actually improve models, have you guys played with synthetic data at all? Is that something that you've tried and seems helpful? Or is this something where humans actually need to sit in those machines and provide some data for you?</p><p>[01:03:42] <strong>Paul Scotti:</strong> Yeah, I mean, to an extent you need real data to validate things, but we have done augmentation too, which is like synthetic data, to make the models more robust, right? So we've played around with averaging samples from different images together, doing mixup kinds of data augmentations to make the pipeline work better, and for some other projects that we're doing that might involve more synthetic approaches.</p><p>[01:04:16] <strong>Alex Volkov:</strong> Awesome. And I think I'll end with this one last question. The very famous quote from Jurassic Park is that the scientists were preoccupied with whether they could, and didn't stop to think if they should. But not in this area:
I want to ask you specifically: what are some of the applications that you see for something like this when you guys get to MindEye 3 or 4 or 5, maybe with different signals, maybe with EEG, I don't know? What are some of the implications you see of being able to read somebody's mind, and what can it help</p><p>[01:04:47] <strong>Alex Volkov:</strong> with?</p><p>[01:04:49] <strong>Paul Scotti:</strong> Yeah. So there are just so many different directions, right? Right now we're focusing on perception, but the more interesting thing would be mental imagery, like dream reading, or applying these models in real time so that you can reconstruct while someone is still in the scanner, which allows you to do cool new experimental designs as well.</p><p>[01:05:15] <strong>Paul Scotti:</strong> You could look at memory, try to reconstruct someone's memory of something. Yeah, Tanishq, maybe you can add on to that.
So,</p><p>[01:05:26] <strong>Tanishq Abraham:</strong> the thing is, what's really interesting is that a lot of the pathways and activity for perceiving an image that you're looking at right now are similar to those for imagining, and dreams, and these sorts of things.</p><p>[01:05:35] <strong>Tanishq Abraham:</strong> So of course there are some differences, but the thing is that a lot of these pipelines should hopefully be generalizable to some of these other applications, like reconstructing what you're imagining and things like this.</p><p>[01:05:46] <strong>Tanishq Abraham:</strong> And in fact, there is some work on this already. There's a paper from one of our collaborators that may be coming out in a couple of months that is exploring this. So it's actually not just limited to what you're looking at, but more general as well. But I think even with this technology of reconstructing what you're looking at, there are lots of interesting clinical applications.</p><p>[01:06:08] <strong>Tanishq Abraham:</strong> For example, maybe the way you perceive is associated with your mental condition. So maybe it could be used for different biomarkers, different diagnostic applications. So, for example, if you're depressed, maybe you are going to perceive an image</p><p>[01:06:21] <strong>Tanishq Abraham:</strong> in a duller fashion, for example. And so I think there's a lot you can learn about how the brain works by looking at how people are perceiving images, and also utilizing that for potential clinical and diagnostic applications.
So that's also an area that is completely underexplored.</p><p>[01:06:39] <strong>Tanishq Abraham:</strong> And it's been pretty much underexplored because people weren't able to get such high-quality reconstructions before. I think the introduction of MindEye 1 was one of the first times that we were able to get such high-quality reconstructions. And of course, even then, we had to use the 40 hours of data to do that.</p><p>[01:06:56] <strong>Tanishq Abraham:</strong> And now we're actually bringing it down to one hour of data. And with further work, we may be able to bring it down even further. So now it's actually potentially possible to use this for actual clinical applications. And that is what I'm most excited about in the near term: potential diagnostic applications, or potential neuroscience research applications.</p><p>[01:07:17] <strong>Tanishq Abraham:</strong> And then of course, the long-term vision is trying to apply this to looking at imagination, dreams, memory. That's, I think, the long-term vision and interest there. So that's at least how I see this field progressing and what I'm interested in personally. Maybe just one more quick nuance is that the shared-subject stuff is not necessarily limited to reconstructing images.</p><p>[01:07:41] <strong>Tanishq Abraham:</strong> So typically, for machine learning approaches, you need a lot of data, but data takes a lot of time in the MRI machine.
And so this approach of using other people's brains as a better starting point allows clinicians to potentially use more complicated ML pipelines for investigating the brain, maybe even outside of image reconstruction, in a way that's feasible given the time commitments that scanning entails.</p><p>[01:08:11] <strong>Alex Volkov:</strong> I absolutely loved the first thing you said, Paul: that if we get to real time, while the person is in the machine, some understanding and interpretation of what they're going through could happen as well. That's extremely exciting. And at</p><p>[01:08:23] <strong>Alex Volkov:</strong> the rate at which GenAI is going, I'm positive that this is possible, and I'm very happy that you guys are working on this and are excited about building improvements on it. The jump from 40 hours to one hour seems incredible to me, and if this trend continues, there are definitely exciting possibilities. Thank you guys for coming up. Maybe let's finish on this: what is restricting you from going forward? Is it compute? Is it data? Is it talent? Maybe you want a shout-out; maybe you're hiring. Feel free.
The stage is yours: what else is needed to get to MindEye 3 faster?</p><p>[01:08:56] <strong>Tanishq Abraham:</strong> Yeah, I think it's mostly manpower, I guess. I mean, it's mostly relying on volunteers, and Paul, of course, is doing a great job leading this, so that I think is the main limitation. But of course, with MedArc, we are doing everything open source and transparently, so we have a Discord server where we organize all of our research and progress, and where all the contributors have joined.</p><p>[01:09:20] <strong>Tanishq Abraham:</strong> I mean, we've been lucky to have amazing contributors so far, from Princeton, University of Minnesota, University of Waterloo; from all around the world we've had people contribute, but of course, more contributors are better. And if you're interested in this sort of research,</p><p>[01:09:35] <strong>Tanishq Abraham:</strong> please join our Discord, and of course feel free to read the papers as well and follow us on Twitter; we'll be updating our progress on Twitter as well. But yeah, just check out our Twitter and join our Discord, I think those are the main ones.</p><p>[01:09:49] <strong>Tanishq Abraham:</strong> But yeah,</p><p>[01:09:50] <strong>Alex Volkov:</strong> absolutely. And thank you guys for coming up. I'm very happy that I was able to talk to you guys. Because last time, when I raised my hand, I was like, oh, this is so cool, I know the niche, but back then we weren't bringing you up. So Paul, thank you, it's great meeting you, and you guys are doing incredible work, and</p><p>[01:10:03] <strong>Alex Volkov:</strong> I think it's very important.</p><p>[01:10:04] <strong>Alex Volkov:</strong> I'm very happy to highlight this as well.
Now we're moving to something a little bit different.</p><p>[01:10:08] <strong>Alex Volkov:</strong> Let's reset the space a little bit, and then let's talk about fine-tuning.</p><p>[01:10:24] <strong>Alex Volkov:</strong> All righty. ThursdAI, March 28th, the second part of the show. If you just joined us, we just had an incredible conversation with Paul Scotti and Tanishq Abraham from MedArc, and I guess Stability, part of Stability</p><p>[01:10:43] <strong>Alex Volkov:</strong> as well. And we've talked about AI reading your brain and understanding what you saw, which is incredible.</p><p>[01:10:48] <strong>Alex Volkov:</strong> And I definitely recommend listening to that if you just joined in the middle or are joining us late. Meanwhile, we also covered a bunch of open source stuff so far. We also covered that Claude Opus is now taking over as the number one LLM in the world right now, something we all knew, but now the LMSys Arena is catching up. We also had a bunch of breaking news, and I want to just reset the space and say: hey, for everybody who joined us for the first time, this is ThursdAI. We talk about AI every Thursday: everything that's important and impactful in the world</p><p>[01:11:18] <strong>Alex Volkov:</strong> of AI from week to week, and we've been doing this for more than a year. And you're more than welcome to join us in the conversation in the comments as well; we're reading through those. And if you're late to any part of this, it is released as a podcast episode on every</p><p>[01:11:33] <strong>Alex Volkov:</strong> podcast platform. So you're more than welcome to follow us on Apple and Spotify and wherever you get your podcasts. And there's also a newsletter with all the links and videos and everything we talk about that you have to actually see, right?
So a link to the MindEye paper will be in the show notes and the newsletter as well.</p><p>[01:11:48] This week's buzz - WandB in SF in April</p><p>[01:11:48] <strong>Alex Volkov:</strong> I will also say that my actual job is AI evangelist with Weights & Biases, a company that builds tools for all these model creators to actually track their experiments. And Weights & Biases is coming to San Francisco on April 17th and 18th; we have a conference there. If you're in the area, or you want to fly in and meet a bunch of folks in San Francisco, you're more than welcome to use this as your reason and opportunity. I think for the next few days</p><p>[01:12:15] <strong>Alex Volkov:</strong> the tickets are still early bird, at 50 percent off. We're doing a workshop on April 17th about improving your business with LLMs, covering everything from prompting to evaluation, and having a bunch of very exciting conversations. So if you're in the area, please stop by and high-five me. I'm going to be in San Francisco for the whole week. And moving on here, I want to chat about fine-tuning, and I see LDJ here.</p><p>[01:12:36] Discussion: Is finetuning still valuable?</p><p>[01:12:36] <strong>Alex Volkov:</strong> I think we've covered pretty much everything important unless there's breaking news (hopefully folks will DM me if there is breaking news). There has been a sentiment, at least in our little bubble of AI on X, where some folks started to get a little bit disillusioned with the concept of fine-tuning.
And I don't think the disillusionment is necessarily with fine-tuning as a concept. Some folks like Ethan Mollick and Anton Bacaj are folks we follow for this kind of information.</p><p>[01:13:07] <strong>Alex Volkov:</strong> The disillusionment stems from the fact that, as we previously covered, long context windows may affect RAG use cases, for example, but a long context window could also affect fine-tuning, because if you get something like Haiku, which is now like the fifth or sixth best LLM in the world but costs 25 cents per million tokens, and you can send a bunch of examples into Haiku with every request, then maybe you don't need to fine-tune. So this has been a bit of the sentiment. Also, the models they release are getting bigger: the recent Databricks model is huge and really hard to fine-tune; you have to actually have a bunch of hardware. So we've seen this sentiment, and I briefly wanted to touch on it with LDJ and Nisten and Junyang and Tanishq. Everybody who's on stage, feel free to chime in, and from the</p><p>[01:13:55] <strong>Alex Volkov:</strong> audience,</p><p>[01:13:56] <strong>Alex Volkov:</strong> if you're friends of the pod, do you want to come up and talk about fine-tuning? Let's talk about this sentiment. LDJ, I saw your question. Yes, we've covered Jamba in the beginning; we're very excited. I think Yam was here, and now he's talking to actual AI21 folks. So I want to do this fine-tuning conversation.</p><p>[01:14:09] <strong>Alex Volkov:</strong> LDJ, we briefly covered this, and we said, hey, it would be awesome to just chat about this face to face. So what's your take on this recent sentiment?
What are you getting from this?</p><p>[01:14:18] <strong>LDJ:</strong> Yeah, I guess when it comes specifically to the business advantage of fine-tuning for a specific use case, to try to have a cost advantage over OpenAI models or something, I feel like things might be changing with Haiku. I mean, you talked about this before; it was either you or somebody else who posted a chart of the average trend of cost versus how good the model is, and Haiku is breaking that trend: it's really good while being significantly cheaper than it should be given the previous trends.</p><p>[01:14:53] <strong>Alex Volkov:</strong> I think that was Swyx. Let me go find it. Yeah.</p><p>[01:14:56] <strong>LDJ:</strong> Yeah, and I think just overall, for a lot of things that [01:15:00] people would have fine-tuned open source models for, it just might make sense to use Haiku, and it might be able to do those things that you would fine-tune for anyways better or equally, while at the same time being really cheap to run.</p><p>[01:15:14] <strong>LDJ:</strong> And I think, definitely, the number of tasks that it makes sense to fine-tune on from an economic point of view is just probably smaller now than before, and I guess it's probably going to get smaller as closed source becomes more and more efficient.</p><p>[01:15:32] <strong>Alex Volkov:</strong> Yeah, so absolutely, there are a</p><p>[01:15:33] <strong>Alex Volkov:</strong> few areas to which fine-tuning as a concept even applies, right? There's the general instruction fine-tuning: you take a base model, you try to make it more helpful.
But there's also fine-tuning for more knowledge, for example. I think, and maybe you guys can correct me on this and feel free to step in here, Junyang as well, that the knowledge kind of fine-tuning, giving this model more information,</p><p>[01:15:56] <strong>Alex Volkov:</strong> sometimes suffers from things like catastrophic forgetting, where the model starts to forget some other stuff.</p><p>[01:16:02] <strong>Alex Volkov:</strong> But also things like RAG, for example, are potentially helping in that area, where you can actually have a citation of a specific source that the model referred to, which is very important, especially in the enterprise and companies area. When you want to build something like an assistant, or retrieval, or better search, you actually don't want to count on the model's potential to hallucinate; you want to cite something. So for knowledge retrieval, at least in the companies and enterprise area, RAG seems to be winning over fine-tuning. And then the question is: is RAG over a fine-tuned model for your specific stuff better than RAG over a general model with a huge context? I think that this is the area of disillusionment, specifically around the cost of pulling everything back. Previously, context windows were very much not cost effective; we briefly mentioned this today in the area of the Jamba models, where context is now cheaper, but for a regular Transformer LLM, context is expensive.</p><p>[01:17:04] <strong>Alex Volkov:</strong> The more context you have, the more the hardware requirements grow, and so I think that some of the disillusionment especially comes from that. Some of it is probably also related to how big the models have gotten. I don't know, Nisten, if you want to chime in on this, or on how even Grok-1, the model, was huge.
People were getting excited, but then some folks like Teknium from Nous Research said we won't even try to fine-tune this, even for instruction, because it's just too big. So I wanted to hear from you, Nisten, because you guys also did a bunch of fine-tuning, and maybe merging is related here as well.</p><p>[01:17:43] <strong>Nisten Tahiraj:</strong> Yeah, gotta keep in mind that for a while, fine-tuning was a lot more expensive. Running fine-tuned models was a lot more expensive than using GPT-3.5. And then it got a lot cheaper with all the API companies, especially Together and the other ones. So the business case for it has not really been about how cheap it is.</p><p>[01:18:08] <strong>Nisten Tahiraj:</strong> In my opinion, the business case has been all about data ownership. A lot of companies that have their own chatbots and stuff, they see the data as their property and the value in their company, so the reason they fine-tune is not necessarily because it's better; sometimes it is, but it's been to just have full control of the data. And there have been a lot of drawbacks, where the knowledge could be lost. But there are much newer techniques where you can do, quote unquote, lossless fine-tuning and still have it. But yeah, I'll land it there. So I think the business case is not necessarily the cost; it's always just been about data ownership.</p><p>[01:18:53] <strong>Nisten Tahiraj:</strong> I'm actually doing consulting for one client now that really just wants to use Grok. They used the Grok API before, and now they want to run it on their own, and they don't care how many
GPUs and stuff it costs to run, because they factor it in with what their users pay.</p><p>[01:19:13] <strong>Nisten Tahiraj:</strong> So yeah, I'm noticing that it's more about the ownership side, not necessarily the performance or cost.</p><p>[01:19:21] <strong>Alex Volkov:</strong> Grok with a K, or Groq with a Q?</p><p>[01:19:23] <strong>Nisten Tahiraj:</strong> Grok with a K, the new one, yeah.</p><p>[01:19:25] <strong>Alex Volkov:</strong> Oh, really? What API did they use for Grok? There's no API; is there an API for Grok that I missed?</p><p>[01:19:31] <strong>Nisten Tahiraj:</strong> No, they</p><p>[01:19:31] <strong>Ian Maurer:</strong> open-sourced the model.</p><p>[01:19:33] <strong>Alex Volkov:</strong> Oh, so somebody hosted this and then they used the API, since last week basically?</p><p>[01:19:37] <strong>Ian Maurer:</strong> No, they, people</p><p>[01:19:38] <strong>Nisten Tahiraj:</strong> have used Grok. I think they just did a translation layer via premium, but they did use Grok in a product via an API. I'll have to double-check how exactly.</p><p>[01:19:53] <strong>Alex Volkov:</strong> I can think of a way, but I'm not saying it's kosher. Like, you can put a Chrome extension on and use the browser.</p><p>[01:19:59] <strong>Nisten Tahiraj:</strong> No, even Levels.io deployed a WhatsApp bot that was running off of Grok too. So again, I'll check up on that. I don't know what API stuff they used, but I am helping them now just run their own.</p><p>[01:20:16] <strong>Alex Volkov:</strong> I see. LDJ, you unmuted. You want to chime in on the specific choice and data ownership piece of fine-tuning, which I think is important.
But from the other side, if I'm representing the other side, and I'm not, I'm just trying to figure out where the vibes about this disillusionment are coming from: most clouds now run most</p><p>[01:20:34] <strong>Alex Volkov:</strong> open source models, or at least Microsoft is definitely now supporting Mixtral.</p><p>[01:20:38] <strong>Alex Volkov:</strong> I don't know if they're going to run Grok for you or not. And there's also something to be said where, if you're running Claude from inside Amazon Bedrock or Vertex or whatever, you still own your data, don't you?</p><p>[01:20:52] <strong>LDJ:</strong> I'm not too familiar with the situation with Vertex and stuff, but I do think that the situation where a business would want to, and has to, fine-tune on their company data, so that employees can actually use an AI that understands the internal company information,</p><p>[01:21:12] <strong>LDJ:</strong> is, I would say, still a decent-sized use case that you would have to use the open source models for. Unless you're fine with giving OpenAI your data and stuff; I'm not saying OpenAI will necessarily train on it, I know they have different clauses and stuff, but you know, there's always that risk, and if you want to keep that stuff secret and internal, then you do still have to just use the open source models to fine-tune.</p><p>[01:21:38] <strong>Alex Volkov:</strong> Yeah. The additional piece that I think Ethan Mollick pointed to, before I get to Justin super quick, is the example of Bloomberg. I think, LDJ, you wanted to push back on this example, but I'll cover it briefly. Bloomberg famously trained a model called BloombergGPT based on the type of financial data that Bloomberg has access to.</p><p>[01:22:00] <strong>Alex Volkov:</strong> And back then, it significantly improved
LLMs' thinking about finances and financial data, et cetera, only to then find out that a general model like GPT-4 blows it out of the water, whatever the 10 million or so they spent on that. And I think this was also a highlight of how general models, after they get released, are getting better across the board: not only for their tasks, but for your task as well. And before we get to Junyang: LDJ, you had a pushback that they didn't do it correctly, that it was a skill issue or something like this, right?</p><p>[01:22:32] <strong>LDJ:</strong> Yeah, I think it was honestly more of a skill issue on Bloomberg's part, because, and I'll try to find the exact source for what I'm about to say, but within a few weeks of BloombergGPT releasing, there were just a couple of open source developers that released a finance-specific model</p><p>[01:22:49] <strong>LDJ:</strong> that was performing significantly better on the finance benchmarks with the same number of parameters or fewer. And that was just within a few weeks of BloombergGPT releasing. So obviously you didn't even need all that Bloomberg data and all that stuff to actually get something that well-performing.</p><p>[01:23:06] <strong>Alex Volkov:</strong> Yep. All right.</p><p>[01:23:07] <strong>Alex Volkov:</strong> I want to get to Justin, because, Justin, obviously you're on the Qwen team; you guys are building models that other folks then fine-tune, and you're probably also supporting enterprise use cases. What's your take on the fine-tuning area?[01:23:20]</p><p>[01:23:20] <strong>Justin Lin:</strong> Yeah, just some comments on fine-tuning on customer data. I somewhat disagree with the idea that we can inject new knowledge into the model through fine-tuning, because it is really difficult to do that with so little data; we often use a very small amount of data for fine-tuning.
I have read a paper, I don't remember its name, but it tells us that fine-tuning is more about aligning to the behavior, to the style, not injecting new knowledge. If you want to inject new knowledge, you have to do things like pre-training, next-token prediction, with tens of billions of tokens. So you can do this, but it is really hard.</p><p>[01:24:09] <strong>Justin Lin:</strong> Something I would like to comment on is that our customers fine-tune our model and they found that the general capability is decreased with the new knowledge. I think this is quite reasonable, because somehow our customers or users don't really know how to fine-tune a general model.</p><p>[01:24:29] <strong>Justin Lin:</strong> They want the general capability, but they want something new. So we have provided a solution: we just provide our data for general fine-tuning in a black-box way. So you can use our data, but you cannot see our data, and you can mix our data with your own customer data, so that you can train a new model which has balanced behavior: good general capabilities, but some new knowledge or some new styles of your company or something like that.</p><p>[01:25:04] <strong>Justin Lin:</strong> Yeah. This is some of my personal</p><p>[01:25:06] <strong>Justin Lin:</strong> experience. Yeah.</p><p>[01:25:07] <strong>Alex Volkov:</strong> I really appreciate this, because I think the difference is important: fine-tuning is not a catch-all term. There's fine-tuning for style, fine-tuning for alignment, for different ways to respond, for example, and that, I think, still makes perfect sense. We have base models, and we have models fine-tuned for instruction following, for example.
But I think that this, at least the way I see it on my radar, and I wanted to bring it to ThursdAI because I think it's very important for folks who follow this to know that it is happening, comes specifically from fine-tuning with new knowledge, not only new styles. New knowledge specifically, because the additional piece here is that fine-tuning takes a while, and, as we said about Bloomberg, maybe it's a skill issue, maybe you have to hire those machine learning engineers. Whereas with the advent of faster hardware and better models that are open to you, they're now hosted on, say, Bedrock from Amazon. This is basically in your cloud: they're running, say, Haiku, but in your cloud, and the same agreements about not training on your data apply. The same goes for OpenAI: you can run through the Microsoft offering in your cloud in Azure, and it's not sending data to OpenAI. So when we get to bigger contexts, your ability to switch up and give this product you're building on top of these LLMs new data, that's easier than fine-tuning; you just provide it in the context as well.</p><p>[01:26:29] <strong>Alex Volkov:</strong> Tanishq, I saw you had your hand up, and I definitely want to hear from you as well.</p><p>[01:26:34] <strong>Tanishq Abraham:</strong> Yeah, I guess I just had a few thoughts about this whole thing, because I'm working in the medical AI space and we're building models for clinical applications, medical applications. So I have various thoughts about this. I think, just generally, fine-tuning is particularly useful if, as LDJ said, there's private data; that's of course a big one.</p><p>[01:26:56] <strong>Tanishq Abraham:</strong> I think also if you want to use models locally. I think that's another big use case.
A lot of times there are cases where you don't want to use cloud services. In the medical scenario, for example, maybe you don't want to send medical data to various cloud providers, and having some sort of local model could potentially be useful.</p><p>[01:27:13] <strong>Tanishq Abraham:</strong> And of course there are other applications where maybe you want to have models run on smartphones or other devices. So that's, I think, one particular area where fine-tuning is particularly valuable. Just to provide some context from the medical AI space: I think this idea of whether or not fine-tuning is useful is, honestly, in my opinion, an argument that's still not settled yet.</p><p>[01:27:38] <strong>Tanishq Abraham:</strong> So for example, in the clinical LLM space, you have models like, of course, GPT-4, then Google has their MedPaLM models, and then other people are creating specific fine-tunes. About a couple of years ago, or maybe it was a year ago, there was a paper that tried to see whether, for example, something like GPT-3 was better, or whether fine-tuning a specific model for medical use cases was better.</p><p>[01:28:02] <strong>Tanishq Abraham:</strong> They found that fine-tuning performed better and, of course, required fewer parameters and was a smaller model. But then Google, for example, created their MedPaLM models; those are more like alignment in the sense that Justin was talking about. The knowledge is mostly there in the original PaLM models, and they're just doing some sort of instruction fine-tuning.</p><p>[01:28:22] <strong>Tanishq Abraham:</strong> And that has been shown to do quite well. And then recently there was a paper, the MedPrompt paper, which basically prompted GPT-4 to outperform all these other models on medical tasks.
And so that one was basically trying to say a general-purpose model is good enough.</p><p>[01:28:40] <strong>Tanishq Abraham:</strong> So I think it's still actually an open question, at least in this specific area, whether or not fine-tuning is better, or if it's just alignment that's needed, or you can just use the general-purpose model. And so we're trying to study this question in a little more detail as well, to see if fine-tuning really is necessary, if it actually does provide benefit.</p><p>[01:28:58] <strong>Tanishq Abraham:</strong> And at least for me, when I say fine-tuning, I also think of it as continued pre-training, where we are training on tens of billions of tokens to add knowledge to a model. People talk about fine-tuning, but they also talk about continued pre-training, and sometimes the distinction between those is a little bit blurry.</p><p>[01:29:18] <strong>Tanishq Abraham:</strong> There isn't much of a distinction sometimes, so there's also that as well. And a lot of the time that's also the question of whether you're just doing alignment versus adding knowledge. That's part of that discussion, and it isn't really clarified very often. So that's the other aspect, but yeah, those are my thoughts on the topic.</p><p>[01:29:37] <strong>Alex Volkov:</strong> Thanks, Tanishq. And I also want to welcome Ian Maurer to the stage. Ian, it's been a while since you've been here. Thoughts on this exciting discussion, and have you seen the same trends or the same kind of vibes that I brought up?</p><p>[01:29:51] <strong>Ian Maurer:</strong> Yeah.</p><p>[01:29:51] <strong>Ian Maurer:</strong> We were talking about this in January, Alex, I found the conversation, right? Fine-tuning versus RAG, the question is: what's your goal?
What's your use case? What's your eval? I think Hamel even mentioned: do you even know what your evals are? Do you even know what you're trying to accomplish?</p><p>[01:30:03] <strong>Ian Maurer:</strong> Without that, good luck fine-tuning, good luck building an app. Anyways, I have a very distinct opinion and perspective, but I'll give you guys background so you understand where it's coming from. My company is 12 years old. We're a good old-fashioned AI company where we've curated 100,000 rules, effectively, in a knowledge base.</p><p>[01:30:20] <strong>Ian Maurer:</strong> It's a graph. It's got ontologies and things like that. And those rules have been curated by experts with PhDs, and we have an API that sits over it, and reasons over it, and can match patients to clinical trials. This is for cancer, right? So patients get DNA sequenced, and it's very complicated, whatever.</p><p>[01:30:35] <strong>Ian Maurer:</strong> So the great thing about large language models, as they get bigger and better, is that they can understand language, including medical language, so they can understand the intent of a provider, right? The provider's trying to accomplish something, which is, as quickly as possible: how do I help this patient?</p><p>[01:30:51] <strong>Ian Maurer:</strong> And so the thing that I have found that's most useful for us is to help that expert be as productive as they can possibly be. Use the large language model to understand their intent and what they have: I have a patient, they have a problem, and they want to find the best possible treatments for that patient.</p><p>[01:31:07] <strong>Ian Maurer:</strong> And the way to do that is by giving that large language model tools, right? Why would I want to fine-tune knowledge into it? Then I've basically black-boxed all my knowledge, right? Great. I have all this great knowledge I've curated over the years.
I'm going to fine-tune it into my system, and now it's a black box and I can't tell you where it's from or why it's there.</p><p>[01:31:25] <strong>Ian Maurer:</strong> No, I want to be able to tell you: here's the trials that are available for your patient. Here's the drugs that are available for your patient. This is the best possible outcome. And here's the link to the clinical trials page, or here's the link to the FDA page that tells you why this drug is so [01:31:40] good.</p><p>[01:31:40] <strong>Ian Maurer:</strong> I can't do that if it's a black box. I'd be hallucinating all over the place. So my perspective is: fine-tuning is great if you're talking about a very discrete use case where you're trying to drill down on cost. Hey, I figured out this named entity recognition pattern, and I was doing it expensively with few-shot learning.</p><p>[01:31:57] <strong>Ian Maurer:</strong> Now I'm going to go fine-tune something and save that cost. But otherwise, use the best possible model and give it tools, whether through function calling or GPT actions, which are actually pretty good. That's the best way to get value out of the large language model and work with existing knowledge.</p><p>[01:32:13] <strong>Alex Volkov:</strong> So definitely citations, and knowing exactly about your data, and not blurring it into the brain of an LLM, fuzzing it out where you can't actually know where it came from or whether or not it's hallucinated. I think that's a big piece here that companies are actually starting to get into.</p><p>[01:32:30] <strong>Alex Volkov:</strong> And so I think your perspective
is very important as well. I think also, from the perspective of updating that data afterwards, continued fine-tuning requires more knowledge and more skill than just updating your vector databases, let's say, and having the model get enough context. And I think the smartness-to-price ratio is very important as well. We now get models like Haiku, for example, that are incredibly cheap but have a vast context length, which you can use both for alignment, let's say, making the model behave however you want and answer as your company rather than as a generic LLM, and you have enough context to do that, and it's not cost-prohibitive to use this large context for a bunch of stuff.</p><p>[01:33:18] <strong>Alex Volkov:</strong> So thanks, Ian, for coming up. I want to tie this back a little bit and then close the discussion. I also want to shout out that you have an awesome list of function-calling models, which now includes a bunch of open source models that support function calling, and it covers the specifics of how they support function calling. Which is great, and it will definitely be in the show notes. And with that, folks, I think we'll end ThursdAI for today, we had a bunch of stuff.</p><p>[01:33:44] <strong>Alex Volkov:</strong> There's small breaking news from Ray. Ray just mentioned that Cursor, the AI editor that a lot of us use and love, just released an update where their Copilot++ feature is twice as fast now in some areas, and that's been awesome to use. So if you haven't used Cursor yet, definitely give them a try.</p><p>[01:34:02] <strong>Alex Volkov:</strong> And Cursor is really impressive, especially with Opus.
If you have paid for Cursor Premium, you have access to the best LLM in the world. I think that this is all that we wanted to talk about. Thank you, everybody, for</p><p>[01:34:12] <strong>Alex Volkov:</strong> joining from week to week.</p><p>[01:34:13] <strong>Alex Volkov:</strong> I think that's most of what we talked about on ThursdAI for March 28th. With that, I want to thank Nisten, LDJ, Justin, Junyang, Robert Scoble was here before, Ian Maurer jumped on, Tanishq, and Paul, definitely, from MedArc, and everybody else who joined us. I really appreciate everybody's time here. If you're not subscribed to ThursdAI, subscribe to get every link that we've talked about, I really work hard to give you all the links. Other than that, have a nice Thursday, everyone. We'll see you next week. Cheers, everyone.</p><p>[01:34:41] <strong>Ian Maurer:</strong> Bye everybody.</p><p>[01:34:42] <strong>Alex Volkov:</strong> Bye bye</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-28-3-new-moes-xxl-medium</link><guid isPermaLink="false">substack:post:143055825</guid><dc:creator><![CDATA[Alex Volkov and Paul Scotti]]></dc:creator><pubDate>Thu, 28 Mar 2024 23:41:17 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/143055825/16858b0905e92351582da7c2792dbbea.mp3" length="68520616" type="audio/mpeg"/><itunes:author>Alex Volkov and Paul Scotti</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5710</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/143055825/de78002f3a15dc02199ec3975ee5a2f2.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Mar 21 - Grok, GTC, first OSS AI hardware, Neuralink Human, Prompting Claude and more AI 
news]]></title><description><![CDATA[<p>March madness... I know for some folks this means basketball or something, but since this is an AI newsletter, and this March was indeed mad, I am claiming it. This week seemed to get madder from one day to the next, and the AI announcements kept coming throughout the recording; I used the "breaking news" button a few times during this week's show! </p><p>This week we covered tons of corporate AI drama in the BigCO segment, from the Inflection → Microsoft move, to Apple Gemini rumors, to the Nvidia GTC conference, but we also had a bunch of Open Source to go over, including an exciting glimpse into the O1 from Open Interpreter, which the founder <a target="_blank" href="https://twitter.com/hellokillian/status/1770830926043332708">Killian</a> (of the ThursdAI mafia haha) joined to chat about briefly after an all-nighter release push! </p><p>Another returning FOTP (friend of the pod) <a target="_blank" href="https://twitter.com/mattshumer_/status/1770494629844074975">Matt Shumer</a> joined as we did a little deep dive into prompting Claude, and how he went viral (seems to happen a lot to Matt) with a project of his to make Claude write prompts for itself! Definitely worth a listen, it's the first segment after the TL;DR on the pod 👂 this week.</p><p>Btw, did you already check out Fully Connected? 
It's the annual Weights & Biases conference in SF next month, and tickets are flying, I'm going to be there and actually do a workshop one day prior, would love to <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march21">invite</a> you to join as well!</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs</strong></p><p>* Xai open sources Grok (<a target="_blank" href="https://x.com/ibab_ml/status/1769447989192675748?s=20">X</a>, <a target="_blank" href="https://x.ai/blog/grok-os?utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-grok-1-in-bio">Blog</a>, <a target="_blank" href="https://huggingface.co/xai-org/grok-1">HF</a>, <a target="_blank" href="https://github.com/xai-org/grok-1">Github</a>) </p><p>* Sakana AI releases a new paper + 2 JP merged SOTA models (<a target="_blank" href="https://twitter.com/SakanaAILabs/status/1770613032198279663">X</a>, <a target="_blank" href="https://arxiv.org/abs/2403.13187">Paper</a>, <a target="_blank" href="https://sakana.ai/evolutionary-model-merge/">Blogpost</a>)</p><p>* Open Interpreter announces O1 - the Linux for AI devices (<a target="_blank" href="https://twitter.com/OpenInterpreter/status/1770821439458840846">X</a>, <a target="_blank" href="https://www.openinterpreter.com/01">Project</a>)</p><p>* LM studio new modes (<a target="_blank" href="https://twitter.com/LMStudioAI/status/1770135856780595493">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Nvidia GTC conference - Blackwell platform, NIMs and Gr00t robotics</p><p>* Jensen interviewed transformers authors<strong> </strong></p><p>* Apple rumored to look at a deal including GEMINI</p><p>* Apple releases a multi modal MM1 paper (<a target="_blank" href="https://x.com/morgymcg/status/1768560632570237375?s=20">X</a>)</p><p>* Inflection founders leave to head Microsoft AI</p><p>* Google opens up Gemini 
1.5 with 1M context access to all (<a target="_blank" href="https://twitter.com/JeffDean/status/1770653917543870571">X</a>)</p><p>* <strong>Vision & Video</strong></p><p>* NVIDIA + MIT release VILA (13B, 7B and 2.7B) (<a target="_blank" href="https://twitter.com/reach_vb/status/1770403591024451689">X</a>, <a target="_blank" href="https://huggingface.co/Efficient-Large-Model">HuggingFace</a>, <a target="_blank" href="https://arxiv.org/abs/2312.07533">Paper</a>)</p><p>* <strong>This week's BUZZ</strong></p><p>* Fully Connected is coming, sign up <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march21">here</a>, get tickets, join us. </p><p>* I'm running a workshop in SF a day before on improving your LLM step by step including exciting announcements (<a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march14">same link</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Suno V3 launched officially (<a target="_blank" href="https://x.com/suno_ai_/status/1770857426507399285?s=20">X</a>, <a target="_blank" href="https://www.suno.ai/blog/v3">Blog</a>, <a target="_blank" href="https://www.suno.ai/blog/v3">Play with it</a>)</p><p>* Distil-whisper-v3 - more accurate, and 6x version of whisper large (<a target="_blank" href="https://twitter.com/sanchitgandhi99/status/1770877844823896117">X</a>, <a target="_blank" href="https://huggingface.co/distil-whisper/distil-large-v3">Code</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stability presents SD3 TURBO - 4 steps to get same high quality generation (<a target="_blank" href="https://arxiv.org/abs/2403.12015">Paper</a>)</p><p>* Stability open sources Stable Video 3D (<a target="_blank" href="https://sv3d.github.io/">Blog</a>, <a target="_blank" 
href="https://huggingface.co/stabilityai/sv3d">Models</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Neuralink interview with the first Human NeuroNaut - Nolan (<a target="_blank" href="https://x.com/neuralink/status/1770563939413496146?s=20">X</a>)</p><p>* Lex & Sama released a podcast, barely any news</p><p>* Matt Shumer releases his Claude Prompt engineer (<a target="_blank" href="https://twitter.com/mattshumer_/status/1770494629844074975">X</a>, <a target="_blank" href="https://anthropic.com/metaprompt-notebook">Metaprompt</a>, <a target="_blank" href="https://github.com/mshumer/gpt-prompt-engineer/blob/main/claude_prompt_engineer.ipynb">Matt's Collab</a>)</p><p>Open Source LLMs </p><p>Xai open sources Grok (<a target="_blank" href="https://x.com/ibab_ml/status/1769447989192675748?s=20">X</a>, <a target="_blank" href="https://x.ai/blog/grok-os?utm_source=ainews&#38;utm_medium=email&#38;utm_campaign=ainews-grok-1-in-bio">Blog</a>, <a target="_blank" href="https://huggingface.co/xai-org/grok-1">HF</a>, <a target="_blank" href="https://github.com/xai-org/grok-1">Github</a>) </p><p>Well, Space Uncle Elon had a huge week, from sending Starship into orbit successfully to open sourcing an LLM for us, and a huge one at that. Grok is a 314B parameter behemoth, with a mixture-of-experts architecture of 80B per expert and two active at the same time. </p><p>It's released as a base model, and maybe that's why it was received with initial excitement that then faded: nobody in the GPU-poor compute category has the ability to run or fine-tune it! </p><p>In terms of performance, it barely beats out Mixtral, while being almost 10x larger, which just shows that... data is important, maybe more important than GitHub stars, as Arthur (CEO of Mistral) helpfully pointed out to Igor (founder of xAI).
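For a quick sanity check on what mixture-of-experts means for inference cost, here's a back-of-the-envelope sketch using the figures quoted above (the per-expert size and active-expert count are the article's numbers, not the official model card, so treat the result as illustrative):

```python
# Back-of-the-envelope MoE sizing, using the figures quoted above
# (314B total parameters, ~80B per expert, 2 experts active per token).
# These are the newsletter's numbers, not Grok's official model card.

TOTAL_PARAMS_B = 314
PER_EXPERT_B = 80
ACTIVE_EXPERTS = 2

# Parameters actually touched per token — this drives inference FLOPs,
# while TOTAL_PARAMS_B drives how much memory you need to hold the model.
active_b = PER_EXPERT_B * ACTIVE_EXPERTS
fraction = active_b / TOTAL_PARAMS_B

print(f"Active per token: ~{active_b}B ({fraction:.0%} of total)")
```

Which is the MoE trade-off in a nutshell: you still pay the full 314B in VRAM to load it (hence the GPU-poor complaint), but each token only pays compute for the active slice.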
Still big props to the team for training and releasing this model under the Apache 2 license.</p><p>Sakana AI launches 2 new models using evolutionary algo merging</p><p>Yeah, that's a mouthful. I've been following Hardmaru (David Ha) for a while before he joined Sakana, and only when the founder (and a co-author on Transformers) Llion Jones talked about it on stage at GTC did things connect. Sakana means fish in Japanese, and the idea behind this lab is to create things using nature-inspired approaches like evolutionary algorithms. </p><p>The first thing they open sourced was 2 new SOTA Japanese LLMs, beating significantly larger models, by using merging (which <a target="_blank" href="https://sub.thursdai.news/p/merge-deepdive-maxime-labonne">we covered</a> with Maxime previously, and whom Sakana actually shouted out in their work) </p><p>Open Interpreter announces 01 Light - the Linux of AI hardware devices</p><p>Breaking news indeed. After we saw the release of R1 go viral in January, Killian (with whom we chatted previously in our <a target="_blank" href="https://youtu.be/qgYVvZmtfbQ?list=PLKKK0msyS7s0gygsjqUx8bgREkeSw40lJ">most favorited episode</a> of last year) posted that if someone wants to build the open source version of R1, it'll be super cool and fit with the vision of <a target="_blank" href="https://twitter.com/OpenInterpreter">Open Interpreter</a> very well.</p><p>And then MANY people did (more than 200), and the O1 project got started, and fast forward a few months, we now have a first glimpse (and the ability to actually pre-order) the O1 Light, their first device: a button that communicates with your computer (and in the future, with their cloud) and interacts with a local agent that runs code and can learn how to do things with a skill library. 
</p><p>It's all very, very exciting, and seeing how this idea went from an announcement on X to hundreds of folks collaborating and pushing this into the open has been incredible, and we'll definitely do a deeper dive into capabilities and the whole project once the launch craziness dies down a bit (Killian joined us at the peak of the launch all-nighter haha) </p><p></p><p>This is poised to be the first open source AI device, complete with .stl files for 3D printing at home, chip designs, and the ability to run end to end locally on your Mac, and we really applaud the team for this release 🫡 </p><p>Big CO LLMs + APIs</p><p>Nvidia GTC annual conference - New Blackwell platform, NIMs, Robotics and everything AI +  a chat with the transformer avengers </p><p>This week Nvidia had their annual GTC conference, where Jensen announced a ton of stuff, but the highlights were the new Blackwell chip (the next iteration of the H100) and the GB200 racks with a whopping 720 PFLOPS of compute (to put this number in perspective: the first DGX that Jensen delivered to OpenAI in 2016 was 0.17 petaflops). </p><p>They also announced partnerships with pretty much everyone under the sun, a new way to deliver packaged AI experiences called NIMs (which we at Weights & Biases <a target="_blank" href="https://wandb.ai/wandb/wb-announcements/reports/Weights-Biases-Delivers-New-Integrations-with-NVIDIA-Technologies-to-Deploy-LLM-Applications-at-Scale--Vmlldzo3MjA0MDE2?galleryTag=ml-news">support</a> as well) and a new foundational operating system for robotics called <a target="_blank" href="https://x.com/DrJimFan/status/1769860044324319658?s=20">GR00T</a> led by Dr. Jim Fan. </p><p>Jensen also had the whole cast of original Transformers authors together on stage (and in the green room) for an hour, for the first time, to chat about, well... transformers. 
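As an aside, that DGX comparison is worth doing explicitly. A one-liner puts the jump in perspective (both figures are the ones quoted above; Nvidia's headline numbers mix precision and sparsity modes across generations, so read this as an order-of-magnitude comparison, not apples to apples):

```python
# How big is the jump Jensen quoted? Comparing the GB200 rack figure
# (720 PFLOPS) with the 2016 DGX-1 delivered to OpenAI (0.17 PFLOPS),
# both as cited in the paragraph above.

GB200_PFLOPS = 720
DGX1_PFLOPS = 0.17

ratio = GB200_PFLOPS / DGX1_PFLOPS
print(f"~{ratio:,.0f}x the compute of the first DGX")  # → ~4,235x
```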
I really need to find the whole video and post it, because it's hidden inside the Nvidia GTC website, but it was a very fun chat, where the team reminisced about the naming and their thoughts on the future of LLMs. They also covered each of their individual companies (all of them have left Google since then) and what they all do. It was a great chat. </p><p>Microsoft buys Inflection (almost) and Apple considers buying Gemini</p><p>In other huge AI player news, 2 of the 3 founders of Inflection AI left to start Microsoft AI (together with some of the staff), namely Mustafa, who founded Inflection, then helped raise $1.8B, got up to 22K H100 GPUs, released Inflection 2.5 that comes close to GPT-4, and then decided to leave. Inflection also pivoted away from consumer (Pi was a very nice AI to chat with) into API services, and apparently Microsoft <a target="_blank" href="https://www.theinformation.com/articles/microsoft-agreed-to-pay-inflection-650-million-while-hiring-its-staff?rc=cmz1ru">will pay</a> Inflection $650 million in the form of a licensing deal. </p><p>Meanwhile there are rumors that Apple is eyeing Gemini to integrate into iOS, which is very weird given the recent bad press about Gemini (unless Apple doesn't want to deal with the same bad press?), and it's even weirder given the latest push from Apple into open source. </p><p>Folks at Apple this week released a new paper called MM1, outlining a new multimodal model they have trained (but not released), and show that it beats Gemini at visual understanding. 
</p><p>It was also great to see that the authors of that model <a target="_blank" href="https://x.com/l2k/status/1768664568769994869?s=20">shouted out the Weights & Biases</a> crew that helped them through their work on this paper 👏 </p><p>Nolan - the first NeuroNaut (first human with a Neuralink implanted) </p><p>Just as I was summing up the notes for this week, Neuralink pinged that they were going live soon, and I tuned in to see a 20-year-old paraplegic gamer being interviewed by a Neuralink employee, being very cheerful while also playing a chess game, all with his brain. We've come a really long way since the monkey playing Pong, and Nolan described his experience of using Neuralink to control his Mac cursor as "like using The Force". It was all kind of mind-blowing, and even though brain implants are nothing new, the fidelity and the wireless connection + the very quick surgery made this demo such a nonchalant thing that Nolan didn't even stop playing chess while being interviewed, probably not realizing that millions of people would be watching. </p><p>They have a bunch of ML models interpreting the signals that Nolan sends from his brain wirelessly, and while this is very exciting (Nolan is preparing for this Halloween as Professor X from X-Men, because, well, he's in fact a telekinesis-enabled human), Elon claimed that their next target is fixing blindsight (and that it already works on monkeys), presumably via camera input being triggered in the visual cortex. </p><p>Back in November 2022, I watched the Neuralink keynote and geeked out so hard about this section, where Dan Adams, one of the neuroscientists at Neuralink, talked about how it's possible to trigger / stimulate the visual cortex to fix blindness and then generate an image. 
</p><p>Well, this is it folks, we talked about tons of other stuff of course, but these are the main points that made the cut into the newsletter. As always, if you want to support this newsletter/podcast, please share it with friends ❤️ Hope to see you in SF in April (I'll be giving more reminders, don't worry) and see you here next ThursdAI 🫡 </p><p></p><p>P.S - I said Intel a bunch of times when I meant Nvidia, apologies, didn’t notice until post publishing 😅 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-21-grok-gtc-first-oss</link><guid isPermaLink="false">substack:post:142843776</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 22 Mar 2024 00:48:54 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/142843776/a02aafb68f10a3660e7286336400a307.mp3" length="75507192" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6292</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/142843776/3531d68742c1015e82b4677618a72562.jpg"/></item><item><title><![CDATA[🎂 ThursdAI BirthdAI March 14: Anthropic Haiku, Devin the new AI SWE, GPT4 gets hands, Cohere and Nous give us tool use models & more AI news]]></title><description><![CDATA[<p>"...Happy birthday dear ThursdAIiiiiiiii, happy birthday to youuuuuu 🎂"</p><p>What a day! Today is π-day (March 14th), 2024. 
For some reason it's important, not only because it's GPT-4's anniversary, or Claude 1's anniversary, or even that Starship flew to space, but also 🥁 <strong>it's ThursdAI BirthdAI</strong> 🎉 </p><p>Yeah, you heard that right, last year following the GPT-4 release, I hopped into a Twitter space with a few friends and started chatting about AI, and while some friends came and went, I never stopped. In fact, I decided to leave my 15-year career in software and focus on AI, learning publicly, sharing my learnings with as many people as possible, and it's been glorious. And so today, I get to celebrate a little 💃</p><p>I also get to reminisce about the state of AI we were at, back exactly a year ago. Context windows were tiny, GPT-4 came out with 8K (we casually now have models with 200K that cost $0.25/1M tokens), GPT-4 also showed unprecedented levels of vision capabilities back then, and now we have 1.3B parameter models with a similar level of visual understanding. Open source was nascent (in fact, llama.cpp only had its first commit 4 days prior to the GPT-4 launch, and Stanford released the first Alpaca finetune of LLaMA just a day prior). </p><p>Hell, even the ChatGPT API only came out a few days before, so there were barely any products built with AI out there. Not to mention that folks were only starting to figure out what vector DBs were, what RAG is, how to prompt, and that it's possible to run these things in a loop and create agents! </p><p>Other fields evolved as well, just hit play on this song I generated for ThursdAI with Suno V3 alpha, I can’t stop listening to it and imagining that this was NOT possible even a few months ago</p><p>It's all so crazy and happening so fast that annual moments like these offer a great opportunity to pause the acceleration for a sec, contextualize it, and bask in the techno-optimist glory of: aren't we lucky to live in these times? 
I sure am, and for me it's the ThursdAI birthday gift to be able to share my excitement with all of you! </p><p><p>Thank you for being a subscriber, the best way you can support ThursdAI is to share this with a friend and tag us on socials 🫡</p></p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Together releases Sequoia speculative decoding (<a target="_blank" href="https://twitter.com/togethercompute/status/1767936720618799336">X</a>, <a target="_blank" href="https://www.together.ai/blog/sequoia">Blog</a>)</p><p>* Hermes Pro from NousResearch - Tool use and function calling (<a target="_blank" href="https://twitter.com/Teknium1/status/1768023030843015208">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF">HF</a>, <a target="_blank" href="https://github.com/NousResearch/Hermes-Function-Calling">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Anthropic releases Claude 3 Haiku (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1768018310615151002">Announcement</a>, <a target="_blank" href="https://www.anthropic.com/news/claude-3-haiku">Blog</a>)</p><p>* Cohere CMD+R (<a target="_blank" href="https://twitter.com/aidangomez/status/1767264315550163024">Announcement</a>, <a target="_blank" href="https://twitter.com/altryne/status/1767282345109844037">HF</a>)</p><p>* <strong>This weeks Buzz</strong></p><p>* Early bird tickets for <a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected?utm_source=thursdai&#38;utm_medium=referral&#38;utm_campaign=thursdai&#38;utm_id=march14">Fully Connected in SF</a> are flying, come meet the Weights & Biases team. We're also going to be running a workshop a day before, come join us! 
(<a target="_blank" href="https://x.com/weights_biases/status/1765044742033686887?s=20">X</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Deepseek VLM 1.3B and 7B (<a target="_blank" href="https://x.com/reach_vb/status/1767262646380712181?s=20">X</a>,<a target="_blank" href="https://twitter.com/deepseek_ai/status/1767458161618006526">Announcement</a>, <a target="_blank" href="https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat">HF</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Made a song with Suno v3 Alpha for ThursdAI, it's a banger (<a target="_blank" href="https://twitter.com/altryne/status/1768072472459460653/video/1">Song</a>)</p><p>* <strong>Hardware & Robotics (New)</strong></p><p>* OpenAI now powers Figure - the humanoid robot company (<a target="_blank" href="https://twitter.com/adcock_brett/status/1767913955295744449">X</a>)</p><p>* Cerebras announces the fastest AI chip on earth (<a target="_blank" href="https://twitter.com/CerebrasSystems/status/1767929699177767325">X</a>)</p><p>* Extropic made an announcement about their TPU - Thermodynamic Processing Unit</p><p>* <strong>Tools  & Agents</strong></p><p>* Devin from Cognition Labs (<a target="_blank" href="https://x.com/cognition_labs/status/1767548763134964000?s=20">Announcement</a>, <a target="_blank" href="https://twitter.com/mckaywrigley/status/1767985840448516343">47 minute demo</a>)</p><p>Agents for your house and your Github tasks</p><p>Say hello to Devin from Cognition Labs (<a target="_blank" href="https://twitter.com/cognition_labs/status/1767548763134964000">Announcement</a>, <a target="_blank" href="https://twitter.com/mckaywrigley/status/1767985840448516343">Real world demo</a>)</p><p>By far the most excited I've seen my X feed be this week, was excitement about Cognition Labs new agent called Devin, which they call the first AI software engineer. 
</p><p>You should really watch the video, and then watch a few other videos, because, well, only a few folks are getting access, and yours truly is not one of them.</p><p>It seems like a very publicized launch, backed by tons of VC folks, and everybody kept highlighting the innovative UI that Devin has. It has a very polished UX/UI/dev experience with access to a browser (where you can authenticate and it can pick up doing tasks), a terminal (where you can scroll back and forth in time to see what it did when), but also a chat window and a planning window + an IDE where it writes code, and you can scrub through that as well. </p><p>Folks were also going crazy about the founder's (and team's) math ability and IOI gold medals; this video went viral featuring Scott, the founder of Cognition, in his youth obliterating this competition… poor Victoria 😅</p><p>Regardless of their incredible math abilities, Devin is actually pretty solid, specifically on the UI side, and again, like with the AutoGPT hype of yesteryear, we see the same issues: it's nice, but Cognition's hiring page is still looking for human software engineers. Tune into the last 30 minutes of the pod today, as we had tons of folks discuss the implications of an AI "software engineer" and whether or not coding skills are still required/desired. Short answer is: yes, don't skip, learn coding. Devin is going to be there to assist but likely will not replace you.</p><p>🤖 OpenAI + Figure give GPT-4 hands (or give Figure eyes/ears/mouth)</p><p>Ok, this demo you must just see before reading the rest of it. OpenAI announced a partnership with Figure, a humanoid robotics company, recently, and just this week they released a demo of this integration. 
</p><p>Using GPT4-Vision and text-to-speech capabilities (with a new, somewhat raspy voice and human-like intonations), the bot listens to the human giving it instructions, sees the world in front of it, and is able to perform tasks that the human has asked it to do via voice. This feels like a significant jump in capabilities for these bots, and while it was a given that the two technologies (actuator-based robotics and LLMs) would meet soon, this shows the first i, Robot-like moment. </p><p>It'll still be a while until you can have this one do your dishes or fold your laundry, but it does feel like it's an eventuality at this point, whereas before, it just felt like sci-fi. Kudos on this integration, and can't wait until Optimus from Tesla will add Grok brains and it'll make you laugh nervously at its cringe jokes 😅 </p><p>This week's Buzz</p><p>We're coming to SF in April, our annual Fully Connected conference will feature keynote speakers from foundational AI companies, industry, our founders and tons of Weights & Biases users. We'll also be running a workshop (I'm one of the workshop folks) a day before, so keep an eye on that, it'll likely be included in your ticket (which is still 50% off for early bird)</p><p>Open Source LLMs </p><p>Nous Research gives us Tool Use with Hermes 2 Pro (<a target="_blank" href="https://twitter.com/Teknium1/status/1768023030843015208">Announcement</a>)</p><p>Getting JSON structured output and giving models the ability to respond with not only text, but specific instructions for which functions to run (aka tool use) is paramount for developers. OpenAI first released this back in June, and since then I've been waiting for open source to catch up. And catch up they did, with Nous releasing their first attempt at continued training of the renowned Hermes 7B Mistral-based model, with tool use and structured output! 
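</p><p>To make the tool use idea concrete, here is a minimal sketch in Python of the loop that function calling enables. The tool schema, the stock-price function, and the canned model reply below are all made-up placeholders for illustration; the exact prompt format Hermes 2 Pro expects is documented in the Nous Github examples.</p>

```python
import json

# A minimal sketch of the tool-use loop that function-calling models enable.
# The schema follows the common OpenAI-style function format; the tool name,
# values, and the canned model reply are illustrative placeholders.

# 1. Describe the tools the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol",
        "parameters": {
            "type": "object",
            "properties": {"symbol": {"type": "string"}},
            "required": ["symbol"],
        },
    },
}]

# 2. Local implementations that tool calls dispatch to.
def get_stock_price(symbol: str) -> float:
    prices = {"WANDB": 42.0}  # stand-in for a real API lookup
    return prices[symbol]

TOOL_REGISTRY = {"get_stock_price": get_stock_price}

# 3. Instead of free text, the model replies with structured JSON naming
#    a function and its arguments (shown here as a canned string).
model_reply = '{"name": "get_stock_price", "arguments": {"symbol": "WANDB"}}'

def dispatch(reply: str):
    """Parse the model's structured reply and run the named tool."""
    call = json.loads(reply)
    fn = TOOL_REGISTRY[call["name"]]
    return fn(**call["arguments"])

print(dispatch(model_reply))
```

<p>The point is the contract: instead of prose, the model emits JSON naming a function and its arguments, which your code can parse and execute deterministically, then feed the result back for the model's final answer.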
</p><p>If you're building agents, or any type of RAG system with additional tools, you will definitely be very happy as well, give Hermes Pro a try! </p><p>This one is not a simple download and run, you have to do some coding, and luckily the folks at Nous provided us with plenty of examples in their <a target="_blank" href="https://github.com/NousResearch/Hermes-Function-Calling">Github</a>. </p><p>Deepseek gives us a new Vision model - Deepseek VL 1.3B & 7B (<a target="_blank" href="https://twitter.com/reach_vb/status/1767262646380712181">Announcement</a>)</p><p>Absolutely punching above its weight, this very high quality vision model from the Deepseek folks is just a sign of what's coming: smaller models performing remarkably well on several tasks. </p><p>While the top is getting crowded with generalists like Claude, GPT4-V and Gemini, for specific tasks we're getting tiny models that can be fully offloaded into memory, run blazing fast, and perform very well on narrow tasks, even in <a target="_blank" href="https://twitter.com/xenovacom/status/1768034516134760831">the browser</a></p><p>Big CO LLMs + APIs</p><p>Anthropic gives the smallest/fastest/cheapest Claude 3 - Haiku</p><p>After releasing Opus and Sonnet earlier, Anthropic has reclaimed their throne as the leading AI lab we always knew them to be. Many friends of the pod prefer Opus for many things now, and I keep seeing this sentiment online, folks are even considering cancelling chatGPT for the first time since... well, ever? </p><p>Meanwhile Sonnet, their middle model, has taken an impressively high spot on the LMSys Arena human-rated rankings, </p><p>beating all GPT-4 versions besides the Turbo ones. And now Anthropic has given us Haiku, the smallest of the three Claudes, the fastest, and the cheapest by far. </p><p>With a 200K context window and vision capabilities, this model crushes GPT3.5 on many benchmarks and becomes the de-facto cheapest model to run. 
It only costs $0.25/1M tokens, which is half the price of GPT3.5, but just look at the performance. One thing to note: Anthropic still doesn't support function calling/tool use. </p><p>Cohere releases a new model for retrieval and enterprise purposes - CMD+R</p><p>Cohere gets a second wind with a great release + open weights approach, releasing Command+R (pronounced Commander), a model focused on enterprise uses, scalability and tool use. It supports 10 languages, 128K context and beats GPT3.5 and Gemini 1.0 on several tasks, namely on KILT - Knowledge Intensive Language Tasks. The tool use capabilities and the ability to ground information in retrieved context make this a particularly great model to use for RAG purposes.</p><p>The model is 34B and is available non-commercially on the <a target="_blank" href="https://twitter.com/altryne/status/1767282345109844037">hub</a></p><p>Together makes inference go BRRR with Sequoia, a new speculative decoding method</p><p>Together's Sequoia shows a way to speed up Llama2-70B inference by 8x and run it on a single consumer GPU. </p><p>Being able to run AI locally can mean a few things. It can mean making smaller models better, and we've seen this again and again for the past year. </p><p>Another way is... speculative decoding. </p><p>Speculative decoding lowers the inference TBT (time between tokens) by improving decoding algorithms, using tiny draft models, and methods like offloading. The large model essentially remains the same, while a smaller (draft) model helps guide the inference and makes it seem much faster. These methods compound, and while Sequoia from Together is new, it shows great promise, speeding up Llama2-70B inference 8x on consumer hardware and up to 3-4x on dedicated hardware.</p><p>The compounding of these methods is the most exciting part to me, given that they will likely apply broadly (for now Sequoia only supports LLaMa) once a new model / architecture comes out. 
</p><p>—</p><p>Show notes:</p><p>* Swyx's AI News newsletter got a shoutout from <a target="_blank" href="https://twitter.com/karpathy/status/1767616494752731633">Andrej Karpathy</a></p><p>* Anthropic <a target="_blank" href="https://anthropic.com/metaprompt-notebook">metaprompt</a> cookbook from Alex Albert </p><p>* Folks who participated in the AI Agent discussion: <a target="_blank" href="https://twitter.com/nisten">Nisten</a>, <a target="_blank" href="https://twitter.com/RoieSchwabco">Roie Cohen</a>, <a target="_blank" href="https://twitter.com/JunaidDawud">Junaid Dawud</a>, <a target="_blank" href="https://twitter.com/antonosika/">Anton Osika</a>, <a target="_blank" href="https://twitter.com/intent/follow?original_referer=https%3A%2F%2Fspacesdashboard.com%2F&#38;screen_name=khoomeik">Rohan Pandey</a>, <a target="_blank" href="https://twitter.com/ryancarson">Ryan Carson</a></p><p>Thank you for being a subscriber, and for sharing this journey with me, I hope you enjoy both the newsletter format and the podcast  🫡 </p><p>See you here next week 🎂 I’m going to eat a piece of cake</p><p>Full transcript: </p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, you are on ThursdAI, March 14th, 2024, AKA Pi Day, AKA ThursdAI's birthday. I, I'm sorry for the pun. Uh, I promise I'm gonna, I'm gonna keep it contained as much as I can. My name is Alex Volkov, I'm an AI evangelist with Weights and Biases. Today on the show, a birthday celebration for the ThursdAI Twitter spaces.</p><p>[00:00:31] <strong>Alex Volkov:</strong> That's right. I started recording these exactly a year ago on GPT-4's announcement day, March 14th, 2023. In addition, everything important that happened in the world of AI for the past week that sometimes feels like a year. Including open source LLMs, big companies and their APIs, hardware and robotics for the first time, agents, and more.</p><p>[00:00:59] <strong>Alex Volkov:</strong> We've talked about a lot of stuff. 
But first, as always, a recap of everything we discussed as I recorded it at the end of the show while everything was fresh in my mind after this little birthday song that AI created for us.</p><p>[00:01:12]</p><p>[00:02:39] <strong>Alex Volkov:</strong> that this is AI generated? Maybe at the end there it went a little bit off, but holy cow, this is, I really listened to this birthday celebration multiple times after I created it with Suno V3 Alpha. So get ready for AI music everywhere. And now, the recap of everything we talked about for this week.</p><p>[00:03:02] <strong>Alex Volkov:</strong> But definitely, stick around and listen to the end of the show. And as always, you will have chapters on every podcast platform that you use, especially Apple Podcasts.</p><p>[00:03:13] <strong>Alex Volkov:</strong> And if you do use Apple Podcasts, why not? Give us a thumbs up and like a five star review. That really helps. That's how people discover us, believe it or not. Here's a recap of everything we talked about. And following that, a beautiful in-depth conversation with many folks who shared this journey with me and been, in one way or another, the reason I kept going this year for ThursdAI.</p><p>[00:03:36] TL;DR - everything we talked about in 10 minutes</p><p>[00:03:36] <strong>Alex Volkov:</strong> Everyone, here's a recap of everything we've talked about on ThursdAI's anniversary for Twitter Spaces, March 14th, 2024, which is also Pi Day, which is also the anniversary of GPT-4, and the anniversary of Claude 1, and we spoke about ThursdAI history, we spoke about how we got here, how now it's a podcast.</p><p>[00:03:56] <strong>Alex Volkov:</strong> And in open source, we had Together AI release a speculative decoding method called Sequoia. Speculative decoding is not new, but their approach to speculative decoding called Sequoia is new. 
It is able to optimize inference for something like a Llama 70 billion parameter model on consumer hardware, up to 8 to 9x faster, by just predicting a tree of next tokens and letting the model select between them.</p><p>[00:04:20] <strong>Alex Volkov:</strong> Speculative decoding is an additive technique to improve speed of inference of models. On top of models getting smaller and better, the bigger models are going to get faster on local hardware as well due to something like speculative decoding. It's very exciting to see. TogetherAI also announced like an extension of the round and now they're a unicorn and definitely doing incredible things.</p><p>[00:04:40] <strong>Alex Volkov:</strong> We also, in the open source, we've covered that our friends at Nous Research released</p><p>[00:04:44] <strong>Alex Volkov:</strong> Hermes Pro. If you followed us at any point before, you know that Hermes is one of the top fine-tunes for Mistral 7 billion parameters. This is the Pro version of Mistral 7 billion on top of the Hermes dataset. The Hermes dataset also, by the way, is open and you can go and download and use it. This version, the Pro version, is specifically focused on tool use and function calling.</p><p>[00:05:07] <strong>Alex Volkov:</strong> And we also covered what tool use is from the perspective of developers who build RAG apps, for example, or need structured output. This new version supports JSON mode and JSON output, which is a very welcome addition to the world of open source.</p><p>[00:05:19] <strong>Alex Volkov:</strong> It has OpenAI endpoint compatibility, and it's hosted on Fireworks, so you can actually try it out and just swap the OpenAI endpoint with that endpoint and see if your tasks are working with Hermes as well.</p><p>[00:05:31] <strong>Alex Volkov:</strong> On the border between open source LLMs and big company LLMs, we then moved to a conversation about Cohere. 
Cohere is a company that was co-founded by one of the authors of the original Transformers paper, Aidan Gomez, and some other folks. Incredible company mostly focused on enterprise and use cases around RAG, retrieval augmented generation.</p><p>[00:05:50] <strong>Alex Volkov:</strong> Cohere had a bunch of models called Reranker and Embedding Models. And now they released something called Command R. And by release, I mean they released it via API, but also they dropped it on Hugging Face with open weights under a non-commercial license. So you'd be able to actually run and use this locally, but you cannot use it commercially yet.</p><p>[00:06:06] <strong>Alex Volkov:</strong> For that, they offer their API, and their API is definitely there. It performs very well on RAG applications, outperforms other scalable models. So outperforms, like, even Mixtral and Llama 70B. They're not comparing themselves to GPT 4 because this model, the Command R model, is definitely focused on enterprise and use cases.</p><p>[00:06:25] <strong>Alex Volkov:</strong> It works very well with their Cohere embedding and rerank models in tandem as well, it's focused on tool use. Like previously we said that Teknium just added to open source, they're focused on tool use and external tools as well. And their Cohere API has a bunch of external tools that you can plug in into this one, like web search, like stock prices, like a bunch of other things.</p><p>[00:06:45] <strong>Alex Volkov:</strong> Optimized for 10 major languages, which is usually way more than other open models, and trained on 13 more, and has a 128k context window.</p><p>[00:06:55] <strong>Alex Volkov:</strong> And in the same area of smaller models, we finally got the small model answer from Anthropic, the folks that just released Claude 3.</p><p>[00:07:06] <strong>Alex Volkov:</strong> Anthropic released the smallest, most performant version of Claude 3 called Haiku. 
They call it the fastest, most affordable model yet for enterprise applications.</p><p>[00:07:15] <strong>Alex Volkov:</strong> Claude 3 Haiku is 25 cents per million input tokens, where GPT 3.5, which is considered the cheapest one and the most performant one so far, is half a dollar for a million tokens. So it's half the price of GPT 3.5. However, it significantly overperforms GPT 3.5 on every metric that they've added, including human eval, which is 75 percent versus GPT</p><p>[00:07:39] <strong>Alex Volkov:</strong> 3.5's 48%. MMLU score is 75. And the kicker here is a 200k context window, like the larger Claude Opus and Claude Sonnet. So Haiku has a 200k context window. Imagine a model that is only 25 cents per million tokens on input that also has a 200k context window. And it's available via the API, obviously, or Amazon and Google Cloud as well. And it's vision enabled, so you can actually send images. And we geeked out about how a year ago when we started ThursdAI, one of the reasons why we came to the space, because we were blown away by GPT 4's vision capabilities.</p><p>[00:08:14] <strong>Alex Volkov:</strong> And now we're getting, I'm not gonna say that Haiku is anywhere close to GPT 4 vision [00:08:20] wise, but it's, from what I've tested, very decent. Given the price point, it's incredibly decent. Then I covered that in the Weights and Biases area we're coming to San Francisco: April 18th is our Fully Connected conference with many big clients of ours coming, foundational model creators, et cetera, coming to speak on the stage.</p><p>[00:08:40] <strong>Alex Volkov:</strong> And we're also going to do a workshop a day before. So April 17th, if you're interested in this, please write to me, I'll definitely tell you when that's up. The tickets are early bird and you're more than welcome to join us in San Francisco. 
We will be very happy to see you.</p><p>[00:08:53] <strong>Alex Volkov:</strong> If you came from ThursdAI, come and give me a high five. I would love to, to, show my boss that this is actually pulling some folks. But also we covered continued things in ThursdAI around vision and video. So skipping from Weights and Biases stuff, we covered vision and video.</p><p>[00:09:06] <strong>Alex Volkov:</strong> We covered that DeepSeek released DeepSeek VLM, which is a tiny vision model. So again, in the realm of multimodality this year, we're now getting 1.3 billion parameter and 7 billion parameter models that on some tasks come close to GPT 4. It's quite incredible. So DeepSeek, the folks who released DeepSeek Coder before and a very impressive lineup of models, open sourced VLM</p><p>[00:09:30] <strong>Alex Volkov:</strong> 1.3 billion and 7 billion. Incredible, impressive on benchmarks, and the 1.3 billion parameter model is so tiny, you can run this basically offloaded on your CPU. And in that vein, we also covered briefly, but we did cover that Transformers.js from our friend Xenova is very soon to support WebGPU.</p><p>[00:09:47] <strong>Alex Volkov:</strong> WebGPU is the ability to run these models in your browser in your JavaScript environment on the GPU of your machine, whether that's a Mac or a PC. And that's now landed fully in all major browsers right now.</p><p>[00:10:00] <strong>Alex Volkov:</strong> The song that you heard at the beginning of this was made with Suno v3 Alpha and I did this specifically for ThursdAI. And I'm very impressed that a year after we started all this, we're now getting songs that sound like somebody actually went in the studio and sang it. We then mentioned that in the AI art and diffusion corner, we still don't have Stable Diffusion 3.</p><p>[00:10:20] <strong>Alex Volkov:</strong> We also had another corner today, which is a hardware and robotics corner. 
And we've covered several very exciting things.</p><p>[00:10:28] <strong>Alex Volkov:</strong> We've covered that Cerebras announced the fastest AI chip on Earth, with 4 trillion transistors and 900,000 AI cores, able to train, and I don't use the word trillion parameters a lot here, but able to train 24 trillion parameter models on a single device. This sounds incredible, and once they put it in production, I think it's going to be a significant boost to the AI scene.</p><p>[00:10:52] <strong>Alex Volkov:</strong> We also covered Extropic, the folks that came from Google X, the secret lab. They now announced Extropic, the folks behind the e/acc movement as well, that's their company. They're building a TPU, Thermodynamic Processing Unit. It's a little complex, but basically they want to do natural physical embodiment of probabilistic learning, and they want to be considered the transistor of the AI era.</p><p>[00:11:17] <strong>Alex Volkov:</strong> And if you want to hear more about this, they have the full space Q&A that we'll link in the comments below. And so we covered Cerebras, we covered Extropic in the hardware, and then we've talked about how Figure, the humanoid robot company Figure we covered before, they, they announced a partnership with OpenAI, and this week they released a demo video that's unedited, so end to end recorded in 1x speed, of this Figure robot, humanoid robot, standing in something that looks like a fake kitchen and basically talks to the human in front of it using OpenAI's text to speech technology and vision.</p><p>[00:11:52] <strong>Alex Volkov:</strong> So it actually understands what it sees based on GPT 4 vision, probably a custom version of GPT 4 vision, and also then is able to do some stuff. If you haven't seen this video, I'm going to put it in show notes on thursdai.news. Please feel free to subscribe. 
The video is mind blowing,</p><p>[00:12:07] <strong>Alex Volkov:</strong> but just the fact that the robot can see, talk about what it sees, and then perform tasks embodied in the real world, I think is a great way to see the future happening right now on Pi Day 2024. And I think this is most of the conversation that we've covered from the news perspective, besides this one last thing, where we covered that Cognition Labs released a video and actually started letting folks into something they call Devin, the first fully autonomous AI software engineer.</p><p>[00:12:35] <strong>Alex Volkov:</strong> That's the tagline. And obviously we've, those of us who covered this, we remember the AutoGPT hype from last year. We remember multiple since then, multiple different agentic frameworks. Devin seems like it took that to the next level, not only from a perspective of just being able to execute long tasks, but also from the ability of the UI to show you what it does and being autonomous alongside your software engineer.</p><p>[00:12:59] <strong>Alex Volkov:</strong> So you can, Devin actually has access to a full environment, probably with GPUs as well. It has access to a browser that you can log into your stuff and then Devin can, on your behalf, use the browser and go and search for some stuff.</p><p>[00:13:10] <strong>Alex Volkov:</strong> And we had one hell of a discussion following the Devin news to talk about, and I think it was started by Nisten saying, Hey folks, you have nothing to fear, still learn to code. That this news, again, stoked fears of folks saying, Hey, should I even learn to code given these advancements? 
And we had a great discussion about coding, AI taking over coders, for example, replacing or not replacing, and positivity in the age of AI.</p><p>[00:13:34] <strong>Alex Volkov:</strong> And this discussion, I really suggest you listen, stick to the end of the podcast, if you're listening on the podcast, and listen to the whole discussion, because I think it was a great discussion.</p><p>[00:13:43] <strong>Alex Volkov:</strong> Hey everyone. My name is Alex Volkov. I'm the host of ThursdAI for the past year, which I can now say proudly, and I just want to welcome you, yet again, to another Thursday. Today's a big day, not only because we're celebrating, but also because some of us woke up early to see the largest man-made object ever to break through the atmosphere and go to space, which was incredible.</p><p>[00:14:24] <strong>Alex Volkov:</strong> Very tech optimist like, but also today is an anniversary of multiple things. And I think ThursdAI is just one of them. So we're gonna, we're gonna actually talk about this real quick. And I just want to say that ThursdAI, I'm very happy to still be here a year after with many people who joined from week to week, from month to month, whatever friendships that were shaped in the ThursdAI community.</p><p>[00:14:49] <strong>Alex Volkov:</strong> And I just want to say I'm very happy that Swyx here is here. Swyx was on the actual first ThursdAI episode a year ago. We jumped in to discuss GPT 4 and I think we're blown away by the vision stuff. So welcome Swyx. How are you? Thanks for waking up early for this.</p><p>[00:15:04] <strong>Swyx:</strong> Hey morning. Yeah, it's a big day. The year has felt like 10 years, but it it's definitely a big day to celebrate.</p><p>[00:15:10] <strong>Alex Volkov:</strong> Absolutely. So thanks for joining us. Swyx, for folks who don't follow for some reason, definitely give Swyx a follow, the host of Latent Space and the founder of Smol AI. And recently is being followed by SpaceDaddy as well. 
And I want to say also</p><p>[00:15:24] <strong>Swyx:</strong> Space Daddy!</p><p>[00:15:25] <strong>Alex Volkov:</strong> And I want to also say hi to Nisten who's been maybe the most consistent co-host, Nisten.</p><p>[00:15:30] <strong>Alex Volkov:</strong> Nisten, welcome, joining us all the way from cold Canada, I think after visiting the doctor, how are you Nisten?</p><p>[00:15:38] <strong>Nisten:</strong> I'm good. I'm good. It's good. I missed one, I</p><p>[00:15:42] <strong>Alex Volkov:</strong> Yeah. Yes.</p><p>[00:15:43] <strong>Nisten:</strong> was about it. I thought I was gonna miss the day, and I was upset, but no, I</p><p>[00:15:48] <strong>Alex Volkov:</strong> I have a question for you. Was the doctor that you visited a human doctor or an AI doctor?</p><p>[00:15:53] <strong>Nisten:</strong> Yeah, he was human. He hadn't seen me in five years, so I was showing him all this stuff about medicine and the AI. It's funny.</p><p>[00:16:00] <strong>Alex Volkov:</strong> The, and I also wanna acknowledge Farouk, or Far El as we call him, how are you?</p><p>[00:16:07] <strong>Nisten:</strong> Hey, what's up?</p><p>[00:16:09] <strong>Alex Volkov:</strong> Welcome, welcome to the ThursdAI celebration. Far El is leading the Skunksworks crew and has been doing different incredible things in the open source. Very staunch proponent of open source here on the ThursdAI stage. If anything gets released and it doesn't get released with the source, Far El will have words to say about this.</p><p>[00:16:25] <strong>Alex Volkov:</strong> So we're going to cover open source today as well. I also want to acknowledge the LDJ. Yesterday I wrote the whole thread and acknowledged like many people and I didn't tag my, my good friend, Luigi. So LDJ, apologies for that. Welcome brother. How are you doing all the way from Florida?[00:16:40]</p><p>[00:16:41] <strong>LDJ:</strong> Yeah, I'm doing good, thanks. 
I've been late to a lot of the ThursdAIs past few months, but yeah, it's been good coming on and glad I was able to make it on time for this one.</p><p>[00:16:51] <strong>Alex Volkov:</strong> Yeah welcome. Welcome. And I also want to acknowledge Roei. Roei is the DevEx advocate at Pinecone and has been participating in many spaces. We had a lot of conversation about RAG versus long context. And I remember those well, a lot of like late night conversations as well. Welcome Roei.</p><p>[00:17:06] <strong>Alex Volkov:</strong> How are you?</p><p>[00:17:08] <strong>Roei Cohen:</strong> How's it going, everybody? Congrats, Alex, on this awesome anniversary. Yeah,</p><p>[00:17:16] <strong>Alex Volkov:</strong> there's a bunch of folks I see in the audience who are here from week to week, and it's so great to see the community shape up, and I really couldn't be prouder to be able to just talk about AI with friends and actually make a living out of this.</p><p>[00:17:29] <strong>Alex Volkov:</strong> I would be remiss if I don't acknowledge that the anniversary today is from the spaces. So we started talking about AI in Twitter spaces, back then Twitter spaces, now X spaces, exactly a year ago on Pi Day 2023. The reason why we started talking about AI is because GPT-4 was announced and Greg Brockman gave the incredible demo where he took a screenshot of a Discord.</p><p>[00:17:52] <strong>Alex Volkov:</strong> So if you remember this, the Discord, the famous Discord screenshot. Mhm.</p><p>[00:18:00] <strong>Swyx:</strong> a screenshot of the, I think the OpenAI Discord, and it just transcribed every word in there and described every, like the position of every icon and like the framing of it. It was just like the best vision model we'd ever seen by like by a lot.</p><p>[00:18:14] <strong>Alex Volkov:</strong> By a significant margin, and it understood different like active states, etc. 
and to get to a point now where we're basically having open source models. We're going to talk about CogVLM today. We're going to talk about how DeepSeek released a new vision model today. To get to the point where we can basically recreate this with a tiny model that runs completely offloaded, it's crazy.</p><p>[00:18:36] <strong>Alex Volkov:</strong> Back then, no vision existed. So we got into a space, started geeking out about this, and then we kept going. So this is the anniversary of the Twitter Spaces. The actual podcast, the ThursdAI podcast that I created, and encourage you to subscribe to, didn't start until about four or five months afterwards.</p><p>[00:18:51] <strong>Alex Volkov:</strong> After we did this and the community started shaping up and people started coming in and actual guests started to arrive. So I see a few guests that became friends of the pod. So if you guys see Jun Yang here in the audience, on, on the technical team at Qwen, there's a great conversation that we had about Qwen and their models as well.</p><p>[00:19:10] <strong>Alex Volkov:</strong> We have a bunch of folks like this from time to time, just join and talk about the stuff they built. And I think this is the best thing that I get from ThursdAI is definitely this, the ability to talk with folks who are experts in their fields. And definitely I'm not an expert in many of the things we cover.</p><p>[00:19:25] <strong>Alex Volkov:</strong> And it's great to have folks from vision and from foundational model training and from open source. And we had a bunch of conversations with Nous Research folks. We're going to cover a few of those today as well, and it has been incredible so far. And so the birthday for the actual podcast, once we started recording and sending a newsletter, is coming up in,</p><p>[00:19:44] <strong>Alex Volkov:</strong> in June. Meanwhile, if you want to support the space, if you're here and you're like, Oh, this is great. I learned so much. 
You're more than welcome to just interact with us. On the bottom right, there's like a little icon there, the message icon that says five. You're more than welcome to just send replies there and boost a little bit of the signal and retweet the space link.</p><p>[00:20:02] <strong>Alex Volkov:</strong> And so I think with this, I think with this, Oh no, a year ago, another thing was, and it went under the radar because GPT 4 took over all over the airwaves. Claude 1 was released exactly a year ago as well. Happy anniversary to the Claude team. They've been killing it lately. The past few weeks have been Anthropic weeks for sure.</p><p>[00:20:20] <strong>Alex Volkov:</strong> And definitely folks are looking at Claude and now, considering cancelling their ChatGPT subscription. So that's been great to see. And so a year ago, there is Claude 1 and they were quickly, quickly hidden with the news. I also want to shout out that in the past year as well, open source was almost non-existent.</p><p>[00:20:36] <strong>Alex Volkov:</strong> So a year ago and four days, Llama.cpp was first released. Georgi Gerganov released Llama.cpp, a way to run the Llama model that was released a month before that on just, your local hardware. And, uh, nobody knew about this necessarily until a few days later. Vicuna was just released.</p><p>[00:20:56] <strong>Alex Volkov:</strong> So if you guys remember Vicuna, that was a thing. So all of these things happened in, in that week. And it feels this week we have, or at least the last few weeks, we have similar like insanity weeks. Don't you guys think? Especially with Opus and the rumors about GPT 4.</p><p>[00:21:11] <strong>Alex Volkov:</strong> Do you guys remember anything else from that last week before we started like talking about this week?</p><p>[00:21:15] <strong>Far El:</strong> It's hard to remember what happened last week because this week felt like a century alone. That's that, that's the thing. 
Like we, we've</p><p>[00:21:22] <strong>Nisten:</strong> had so much just in the last week that I don't even remember what happened.</p><p>[00:21:25] <strong>Alex Volkov:</strong> Absolutely. That's why we write it down. And honestly, I think Swyx, we talked about this, where now every ThursdAI is recapped and you have the AI News newsletter daily that covers everything. This is just for the historical record, it's very important. Just to be able to go a year back and see where we were.</p><p>[00:21:41] <strong>Alex Volkov:</strong> Because it's really hard to remember even last week, not to mention the last year. So I think it's very important. I do want to shout out, do you still call this Smalltalk? Or do you have AI News?</p><p>[00:21:50] <strong>Far El:</strong> It's just AI News. I'm reserving Smalltalk for the other products that I'm working</p><p>[00:21:55] <strong>Alex Volkov:</strong> I see.</p><p>[00:21:56] <strong>Far El:</strong> yeah. Yeah. AI news's,</p><p>[00:21:57] <strong>Alex Volkov:</strong> so talk to us about the AI news just briefly for folks who are not familiar with that specific newsletter.</p><p>[00:22:02] <strong>Swyx:</strong> Man, this week was f*****g, it was crazy. In around December I was like very overwhelmed by all the AI discords and I knew that all the alphas being dropped in discords are no longer on Twitter, so I started making this bot to scrape discords and it was mostly just serving myself and then I shared it with some friends and it grew to like a couple hundred people, but one of them was Soumith Chintala from the Meta team, like he was the creator of PyTorch and still runs PyTorch.</p><p>[00:22:31] <strong>Swyx:</strong> And last week he shouted it out, saying that he, he said it was like the highest leverage 45 minutes every day that he spends reading this thing. Which was like a freaking huge endorsement from someone like him. 
So I didn't even know he</p><p>[00:22:43] <strong>Alex Volkov:</strong> from the guy who runs PyTorch. It's crazy. And of</p><p>[00:22:49] <strong>Swyx:</strong> so I, yeah, I didn't even know he was subscribed. Honestly, I don't even look at the subscriber list. I think it's really good for mental health to just do your thing, right? Don't even look at who's on the list. And then two days ago, Andrej also just like unsolicited, completely no notice, no warning, just said oh yeah, I've been reading this thing for a while.</p><p>[00:23:06] <strong>Swyx:</strong> And I was like, what? And then I went back and looked through the emails and like his email's not there, his first name's not there. I eventually found his email, but yeah, it was just a shock that like he was also getting utility out of it. And yeah, so far I think like 12,000 to 13,000 people signed up in the past couple days, and we'll see where this goes. I think a newsletter is not the final form, and also people have legitimate concerns around how much is comfortable being scraped from Discord what is the sort of privacy expectation on a public Discord that anyone can join, right?</p><p>[00:23:39] <strong>Swyx:</strong> So I'm taking some steps to basically protect people it's purely meant for utility, not for snooping on people's conversations. But I do think like there should be a new sort of Hacker News of AI, quote unquote, that pulls together Local Llama, Twitter, Discord, YouTube, podcasts, whatever.</p><p>[00:23:55] <strong>Swyx:</strong> And yeah, I think that's what I'm making AI News go towards.</p><p>[00:24:02] <strong>Alex Volkov:</strong> That's what Elon is excited about as well. So Elon now is a follower of Latent Space, which is a big moment. I wanted to ask</p><p>[00:24:08] <strong>Swyx:</strong> Yeah, we're trying to, yeah, let's</p><p>[00:24:09] <strong>Alex Volkov:</strong> What about Local Llama, by the way? 
Is Local Llama part of the source as well?</p><p>[00:24:13] <strong>Swyx:</strong> The engineer that I'm working with is working on this. So not yet, but we are working on it. And</p><p>[00:24:19] <strong>Alex Volkov:</strong> Alright folks, so if you want not only high signal, but the full firehose of information from Discord and a Twitter list I think you have a high-signal Twitter list in there as well definitely subscribe to AI News, previously Smol Talk, as the titans of the industry now follow this and get insight from it, so you should as well.</p><p>[00:24:40] <strong>Alex Volkov:</strong> But yeah. If that's too much for you, we're here every week to cover pretty much the very most important things.</p><p>[00:24:46] Open source - Function Calling model from NousResearch - Hermes Pro</p><p>[00:24:46] <strong>Alex Volkov:</strong> And so I think it's time for us to start with Open Source.[00:25:00]</p><p>[00:25:09] <strong>Alex Volkov:</strong> Alright folks, so let's cover some open source stuff. I think the first thing is we have to mention that our friends from Nous Research announced a new model today, or I guess yesterday night. It's called Hermes Pro. I'm not really sure what Pro means here, so we'll have to ask some folks from Nous Research, but they announced the continued training of their Mistral model, their flagship model, that is fine-tuned for tool use and function calling.</p><p>[00:25:40] <strong>Alex Volkov:</strong> And tool use and function calling are maybe, should I say, synonyms of each other at this point? I think it started with function calling from OpenAI that was released in June last year. And they gave us function calling in response to all of us wanting JSON output. 
And since then, function calling became something called tool use.</p><p>[00:25:59] <strong>Alex Volkov:</strong> Basically, the ability of these models to not only predict the next word or autocomplete, but also you could provide schemas for some of your functions, so that these models will say, hey, I actually want more information on this topic or that topic, and so here is what tool you should use.</p><p>[00:26:20] <strong>Alex Volkov:</strong> And you as a developer, you would get that response. You would go call this tool. You would then pass back the data from this tool into the model. And then the model will use its context and the user's request together to come up with an answer. So think about stock price, right? Stock price is something that changes often.</p><p>[00:26:37] <strong>Alex Volkov:</strong> You cannot train the model on stock price because it changes very often. So one example of a tool could be go check the stocks on the stock market, or go check the Bitcoin price, et cetera. And the model, Mistral, is not able to, it's very obvious if you ask a Mistral 7B, hey, what's the price of Bitcoin?</p><p>[00:26:55] <strong>Alex Volkov:</strong> It will give you something, and that something will be 100 percent wrong, a hallucination. So a model with tool use would be able to decide if a developer provided the model in advance with tools like, hey, price of Bitcoin, price of stock, et cetera the model will be able to decide that instead of hallucinating the answer, it'll actually return a reply to the developer and say, hey, go get me this information and then I'll be able to answer the user, right?</p><p>[00:27:20] <strong>Alex Volkov:</strong> So this is what tool use and function calling basically is. And we haven't had a lot of that in open source. We had a little bit. We've talked about the tool use leaderboard from the folks at Gorilla. I think Berkeley? I'm not sure. 
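The loop Alex walks through model emits a tool call, the developer executes it, the result goes back into context can be sketched in a few lines. Everything here is illustrative: the message format, the `fake_model` stand-in, and the hardcoded price are assumptions for demonstration, not Hermes Pro's or OpenAI's actual wire format.

```python
import json

# Hypothetical tool registry; a real app would call a market-data API here.
def get_bitcoin_price() -> float:
    return 67000.0  # hardcoded stand-in for a live price lookup

TOOLS = {"get_bitcoin_price": get_bitcoin_price}

def fake_model(messages):
    """Stand-in for the LLM. A real model emits the tool call itself."""
    last = messages[-1]
    if last["role"] == "user":
        # Instead of hallucinating a price, the model asks for a tool.
        return {"role": "assistant",
                "tool_call": {"name": "get_bitcoin_price", "arguments": {}}}
    # Second pass: the tool result is now in context, so answer normally.
    return {"role": "assistant",
            "content": f"Bitcoin is trading around ${last['content']}."}

def run(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    reply = fake_model(messages)
    while "tool_call" in reply:
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])  # developer runs the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
        reply = fake_model(messages)  # model answers with the result in context
    return reply["content"]

print(run("What's the price of Bitcoin?"))
```

The point is the round trip: the model never fetches anything itself; the developer's code executes the call and feeds the result back.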
And then now Nous Research released a continued training of their 7B model, called Hermes Pro, with the same general capabilities.</p><p>[00:27:39] <strong>Alex Volkov:</strong> So that's also very important, right? You keep training a model, you don't want something called catastrophic forgetting. You want the model to perform the same, plus additional things as well. And now it's trained on new data with tool use, plus JSON mode as well. So not only do we get the ability of the model to reply back and say, hey, you should use this function.</p><p>[00:28:00] <strong>Alex Volkov:</strong> We also get JSON mode as well. It supports custom Pydantic schemas. Pydantic, for folks who don't write in Python, is a way to define objects in Python in a very clear way. And when you use this and you give the model the schema for your tool use, the model then knows what parameters to call your functions with.</p><p>[00:28:18] <strong>Alex Volkov:</strong> So your job as a developer is basically just to take this call and forward it to any API call that you want. It's available on the Hub, and it's announced with OpenAI endpoint compatibility, which is great. I don't think we've seen this from Hermes directly so far. Everybody who served Nous models gave us OpenAI compatibility, but definitely we know that the industry is coalescing around the same format, which is the OpenAI endpoint, where you can just replace the URL with either OpenRouter or Fireworks or whatever.</p><p>[00:28:49] <strong>Alex Volkov:</strong> I think the chat from Mistral as well is supporting OpenAI compatibility. Great to see that we're getting open source models for tool use, because it's very important for agents, and it's very important for basically building on top of these LLMs. LDJ, I saw you wave your hand a little bit.</p><p>[00:29:07] <strong>Alex Volkov:</strong> Did you have a chance to look at Hermes Pro in tool use? 
And what are your general thoughts about open source tool use?</p><p>[00:29:16] <strong>LDJ:</strong> Hey. It's pretty much Hermes, but it also has much improved JSON and function calling abilities and things like that. And I was just waving my hand to describe that, but then you pretty much described it already. So I put my hand back down.</p><p>[00:29:29] <strong>LDJ:</strong> but</p><p>[00:29:30] <strong>LDJ:</strong> Yeah, you got a good description of it.</p><p>[00:29:32] <strong>LDJ:</strong> And I think that pretty much summarizes it.</p><p>[00:29:34] <strong>Alex Volkov:</strong> This is the anniversary of this ThursdAI, Birthday AI. So I did my homework this time. Usually these things get released super fast and we actually don't have time to prepare. Comments on general availability of function calling and tool use from the stage before we move on?</p><p>[00:29:48] <strong>Alex Volkov:</strong> Anything that you guys want to shout out specifically that's interesting here?</p><p>[00:29:50] <strong>Nisten:</strong> It's probably the most commercially used part, I think, because every person that's using a 7B wants a really fast model, and usually they want some kind of JSON returned for commercial uses. There are chat uses as well, but I think like the majority of, I don't have any data on this, I'm just guessing that probably the majority of the use is to return JSON.</p><p>[00:30:15] <strong>Alex Volkov:</strong> Yeah. And then there are tools like Instructor from Jason Liu that we've talked about, that help you extract structured data from some of these. And those tools require function calling, and function calling and Pydantic support. 
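Concretely, the schema side of this looks something like the following a JSON-Schema-style tool definition (the kind of thing a Pydantic model compiles down to) plus a minimal check that a model's emitted arguments match it. The field names and the tiny validator are illustrative assumptions, not Hermes Pro's exact prompt format.

```python
import json

# Illustrative tool schema in the JSON-Schema style that function-calling
# models are typically prompted with (Pydantic's .model_json_schema()
# produces something similar); not any specific model's exact format.
STOCK_TOOL = {
    "name": "get_stock_price",
    "description": "Get the latest price for a ticker symbol.",
    "parameters": {
        "type": "object",
        "properties": {"symbol": {"type": "string", "description": "e.g. AAPL"}},
        "required": ["symbol"],
    },
}

def validate_call(tool, arguments):
    """Minimal check that a model's emitted arguments match the schema."""
    props = tool["parameters"]["properties"]
    missing = [k for k in tool["parameters"]["required"] if k not in arguments]
    unknown = [k for k in arguments if k not in props]
    return not missing and not unknown

# A model running in JSON mode would emit something like:
model_output = '{"name": "get_stock_price", "arguments": {"symbol": "AAPL"}}'
call = json.loads(model_output)
print(validate_call(STOCK_TOOL, call["arguments"]))
```

Because the output is constrained to valid JSON against a known schema, the developer can forward `call["arguments"]` straight to a real API, which is the enterprise-y use Nisten is describing.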
So it definitely supports more enterprise-y use.</p><p>[00:30:29] <strong>Alex Volkov:</strong> Maybe that's why Teknium decided to call this Hermes Pro.</p><p>[00:30:32] Together.ai new speculative decoding Sequoia improves AI inference by 9x</p><p>[00:30:32] <strong>Alex Volkov:</strong> Moving on to Together and Sequoia. Together released something called Sequoia, which is speculative decoding. I actually wrote down an explanation of what speculative decoding is, and I'm going to try to run through it. So for folks who are not familiar with speculative decoding if you think about how we get bigger and better AI to run locally on our machines, one way is open source and smaller models getting better, right?</p><p>[00:30:58] <strong>Alex Volkov:</strong> So that's definitely something we've seen for the past year. We got LLaMA 70B, and then we got 13B, and then different fine-tunes and different other foundational models that started beating LLaMA 70B definitely LLaMA 1, and now even LLaMA 2 is getting beaten by tinier models. So the progress of throwing more compute and more techniques at it is shrinking these models down to us being able to run them locally, just because our hardware is, let's say, limited.</p><p>[00:31:23] <strong>Alex Volkov:</strong> That's one way that we get local open source models. They just keep improving and keep getting trained. Another way is that we're able to serve these bigger, larger models, like 70B models, on consumer GPUs, but then it's super slow. So you wait one minute or two minutes between each token prediction, or each word that you see.</p><p>[00:31:44] <strong>Alex Volkov:</strong> So one additional way, on top of just getting smaller models faster and smarter, is improving inference. 
So we saw a bunch of attempts this year from folks like Modular, releasing their MAX inference engine, and improvements obviously in different places like FlashAttention and different inference engines as well.</p><p>[00:32:03] <strong>Alex Volkov:</strong> So we saw all of this, and one such way that adds to all of this is called speculative decoding, which improves just the inference speed. It basically tries to predict a few next tokens instead of just one, using a smaller model. And the key idea is to construct a tree of speculated future tokens for every potential token in the model's output.</p><p>[00:32:26] <strong>Alex Volkov:</strong> Sometimes they use I think at least llama.cpp supports speculative decoding sometimes they use a smaller model. For example, for LLaMA, they could use a small LLaMA to help predict the tokens, and then the larger LLaMA to actually select them. And the Together folks, who released a few things that we've covered before they employ Tri Dao there, who released the Mamba architecture and the Hyena architecture we've talked about previously, and who is also the FlashAttention chief they now released their own take on speculative decoding,</p><p>[00:32:56] <strong>Alex Volkov:</strong> which they claim lets you run something up to a 70 billion parameter LLaMA 2 on an RTX 4090, and they improve the ability for you to run this incredibly large model by almost 9x. On non-consumer GPUs like A100s, they also go up to 4x faster.</p><p>[00:33:17] <strong>Alex Volkov:</strong> Basically, by just predicting with a [00:33:20] smaller model building a tree of all possible tokens and then the larger model actually selects and predicts based on those. 
They have a bunch of other things like offloading there, and very interesting things, but I just want to say something about this field.</p><p>[00:33:39] <strong>Alex Volkov:</strong> How should I say? They only support LLaMA as far as I saw, but this whole field is entirely additive to the rest of the fields, right? So if speculative decoding helps run LLaMA 70B 9x faster, it's probably going to work on smaller models as well. So it's really incredible to see how many different speed improvements we're getting across the board.</p><p>[00:33:58] <strong>Alex Volkov:</strong> And definitely for the stuff that we love to talk about, which is open source models running locally, running faster, this is incredible. Yeah, LDJ, go ahead.</p><p>[00:34:09] <strong>Nisten:</strong> Yeah, I just wanted to direct people to I pinned on the billboard a video that Together AI put out, actually showing side by side Sequoia versus not Sequoia, and yeah, it's pretty insane the amount of speedup you're able to get.</p><p>[00:34:21] <strong>Alex Volkov:</strong> The amount of speedup on the same hardware and on the same model. So the model didn't improve, the hardware didn't improve. All they improved is the ability to help the model predict next tokens and spit them out, which is, I agree with you, insane. And just, um, multiple improvements across the board are going to get us where we basically want to go, which is these types of models, these sizes of models, running super fast on local hardware.</p><p>[00:34:44] <strong>Alex Volkov:</strong> They released it on GitHub. Folks can try it. It only works for LLaMA, it doesn't work for any other bigger models yet. Definitely we'll see. 
I will just say that the thing that I'm most excited about with this is that all these techniques are, one, additive, and two, they're there for the next big model to get released and just support them.</p><p>[00:35:00] <strong>Alex Volkov:</strong> So when LLaMA 3 eventually releases, and we know it will release, this speculative decoding will start working, and llama.cpp will already be there. We saw the community efforts to support everything just kicked into gear when Gemma was released.</p><p>[00:35:18] <strong>Alex Volkov:</strong> I'm just very excited that we have all these techniques to throw at the next big open source model. And just the concept of running a 70 billion parameter model is very exciting. Last week we covered something from Jeremy Howard and Jonathan Whitaker and Tim Dettmers, the folks with Answer.AI, who combined QLoRA with another technique to be able to train 70 billion parameter models, or at least fine-tune them, on kind of consumer hardware as well.</p><p>[00:35:53] <strong>Alex Volkov:</strong> So we're not only getting news in the past week and two weeks of being able to fine-tune 70 billion parameter models on consumer-ish hardware, we're also getting news about being able to run them with some, uh, some reasonable number of tokens per second, and not one token every four minutes or something. Exciting news in open source.</p><p>[00:36:04] DeepSeek VLM 1.3 & 7B VLM that punches above its weight</p><p>[00:36:04] <strong>Alex Volkov:</strong> Maybe we'll cover DeepSeek VL here, as it's vision, but it was definitely released in open source and we don't want to miss it.</p><p>[00:36:10] <strong>Alex Volkov:</strong> So DeepSeek, the folks behind DeepSeek Coder, released DeepSeek VL, a state of the art 1.
3 billion and 7 billion parameter vision models. If you guys remember, last week we talked to Vik (vikhyatk) of Moondream2, and that was a tiny vision model. And the whole point and if you were here in the beginning, when Swyx and I got excited about the vision capabilities of GPT-4 a year ago.</p><p>[00:36:34] <strong>Alex Volkov:</strong> The whole point with these vision models is that their improvement this year definitely felt exponential, because now a model of 1.3 billion parameters, a tiny model that most Macs can run, can do it very easily. And if our friend Xenova joins us, very soon with WebGPU we're going to be able to run these fully in the browser.</p><p>[00:36:53] <strong>Alex Volkov:</strong> These models now are able to perform very similarly to what we saw a year ago that just blew our minds, which is OCR without an OCR model built in, understanding objects, understanding graphs and charts, etc. And so it's very interesting let me try to share this into the space. Yeah, it should be up there as well.</p><p>[00:37:13] <strong>Alex Volkov:</strong> Very interesting that DeepSeek VL released and is punching significantly above its weight, and they actually try to compare themselves to GPT-4 Vision, which is quite remarkable, on different tasks like evaluation and multi-images and on some of these tasks, they get to half the performance of GPT-4 Vision, which is still quite incredible, right?</p><p>[00:37:35] <strong>Alex Volkov:</strong> Like, it's a 7 billion parameter model; GPT-4, we still don't know how many parameters it is. We still don't know if GPT-4 Vision is a mixture of experts model or not. But DeepSeek VL is actually coming close to the same performance as GPT-4 on common sense tasks and analysis tasks.</p><p>[00:37:55] <strong>Nisten:</strong> Yeah, and I just want to say llama.cpp supports these models. I don't know about DeepSeek, but they've supported all the other ones. 
And there's also a LLaVA CLI in there, which you can use with these ones. Also, when you run the server, you can run the models as well.</p><p>[00:38:12] <strong>Nisten:</strong> I think they just need a little bit more compute and engineering, and they can match GPT-4 when it comes to vision. I am quite surprised that it wasn't that big of a deal. In some ways, CogVLM, not DeepSeek, is a lot better than the rest, but it's also a larger model too. And I quickly wanted to say, because you mentioned Xenova before, I don't know if you're going to go more into that, but it turns out that the people on the core Chrome team, or Chrome Canary, that implement WebGPU, they listen to ThursdAI and stuff that we've been saying over the months, they've actually started implementing.</p><p>[00:39:00] WebGPU and Int8 support for quantized models</p><p>[00:39:00] <strong>Nisten:</strong> And the most exciting thing that I find now is that they are trying to implement int8 support natively in WebGPU. So that will do another savings of half the memory when you run stuff. Even if you have a GPU that doesn't necessarily support int8, I think there was a method to run at half the memory.</p><p>[00:39:23] <strong>Nisten:</strong> So remember, we went from only supporting float32 a few months back, I think it was September, and you needed a weird Canary build with a few flags to support float16. So now they're supporting int8, so the memory requirements of the browser have dropped by 4x in the last 5, 6 months.</p><p>[00:39:44] <strong>Alex Volkov:</strong> I remember the days before WebGPU support even landed, like all of Transformers.js. And for folks who are not following us that closely, Xenova is a friend of the pod, the author of Transformers.js. We talked a lot on the pod; he actually announced joining Hugging Face on the pod as well. 
And he created Transformers.js,</p><p>[00:40:04] <strong>Alex Volkov:</strong> which is a way, in JavaScript and Node, to run these models via the ONNX runtime. And when we talked about this before, the only way to run these models in the browser was fully on CPU. And then we always talked about, okay, WebGPU is going to come at some point. WebGPU is the ability to tap into GPU inference from the browser environment, from the Chrome environment.</p><p>[00:40:26] <strong>Alex Volkov:</strong> And since then, WebGPU was still a spec that was announced, then released, and now it's fully supported everywhere. But like you're saying, Nisten, it only supported float32, right? Can you describe this part a little bit more? And now they're listening to us and actually landing support for quantized versions of these models, smaller versions, to be able to run even smaller models that perform the same.</p><p>[00:40:47] <strong>Alex Volkov:</strong> And,</p><p>[00:40:47] <strong>Nisten:</strong> Yeah so now Chrome, you don't even need Canary, and it will support float16. And by default, if you only have a CPU, stuff can run now on the CPU in float32. But again, the biggest use for this so far has not actually been chatbots. Even though chatbots do work, it has been more the visual stuff and the effects.</p><p>[00:41:10] <strong>Nisten:</strong> All the diffusion-based stuff, some of the function calling. That's where stuff gets pretty exciting, because it changes what kind of applications you can build. It's, again, it's the front end like, what are you going to put before they reach this big GPU cluster? 
So this is the part where we're going to see the most changes and progress, in my opinion.</p><p>[00:41:34] <strong>Nisten:</strong> It's going to be the visual stuff, making use of the Transformers [00:41:40] JS library.</p><p>[00:41:40] <strong>Alex Volkov:</strong> And one example of that Xenova showed on his feed is real-time background removal from video. So you play a video, and then imagine a Chrome extension that's loaded or something, and then you're able to run AI transformer stuff on top of everything that you read or see that's the kind of stuff we're talking about with access to the GPU, I think, is going to be possible.</p><p>[00:42:00] <strong>Alex Volkov:</strong> So super, super exciting to see how this performs. And obviously this means that the models that we talk about running locally will just get more use, because developers will be able to build them in. This will never get to the point of GPT-4 level for full generality or I don't want to say never, but it's not quite there in terms of, okay, running a GPT-4 level model fully in your browser.</p><p>[00:42:23] <strong>Alex Volkov:</strong> But for some specific tasks like vision, as we just talked about, on several benchmarks CogVLM and this tiny new release, DeepSeek VL, are now getting there, right? So you'd be able to analyze images, for example you'd be able to do all kinds of things in the browser fully. Without loading, without Python environments, without all of these things.</p><p>[00:42:42] <strong>Alex Volkov:</strong> I think it means a lot for user experience as well. I think we've covered open source a bunch. Do you guys have anything else worth mentioning in the open source thing? 
Briefly, before we move on to the big companies and maybe we'll discuss agents as well.</p><p>[00:42:57] Cohere releases Command+R - a RAG focused model in API + open weights</p><p>[00:42:57]</p><p>[00:42:57] <strong>Alex Volkov:</strong> Yeah, so Command R. Interestingly, it's both in the open source category and not, so maybe it's a good transition, right? Let's actually do this as a transitional topic. So Cohere, the company that raised a bunch of millions of dollars, that everybody expected to be like the second Anthropic, and it wasn't for a while.</p><p>[00:43:18] <strong>Alex Volkov:</strong> Now, very impressively, it's back. So for a long time, I think, Cohere refocused their efforts on things like RAG. They had the Cohere reranking model, and they had embedding models for a while. I know that at Weights & Biases we use the Cohere reranker for our RAG bot, and that's improving our responses significantly.</p><p>[00:43:39] <strong>Alex Volkov:</strong> Reranking is basically receiving back from your vector database a few responses that are nearest-neighbor equivalent to what your user has asked for, and then running another process of re-ranking them for higher, how should I say, accuracy. And so the Cohere reranker was for a long time one of the more standard ones that folks use.</p><p>[00:43:58] <strong>Alex Volkov:</strong> And now Cohere actually stepped in and said, hey, we're releasing a new model it's called Command R. It's a new generative model from Cohere aimed at production-scale tasks like RAG, Retrieval Augmented Generation, and using external tools and APIs. So here's this word again external tool use and APIs as well.</p><p>[00:44:16] <strong>Alex Volkov:</strong> As we previously discussed, tool use is important. 
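As an aside, the re-ranking step Alex described a moment ago retrieve candidates cheaply, then re-score them against the query can be sketched with a toy scorer. Cohere's actual reranker is a trained neural model; the word-overlap score below is only a stand-in to show where re-ranking sits in a RAG pipeline.

```python
# Toy re-ranking: the vector DB returns approximate nearest-neighbour hits,
# then a second pass re-scores them against the query. Here the "reranker"
# is a trivial lexical-overlap score, standing in for a trained model.
def rerank(query, documents, top_n=2):
    q_terms = set(query.lower().split())

    def score(doc):
        return len(q_terms & set(doc.lower().split())) / len(q_terms)

    return sorted(documents, key=score, reverse=True)[:top_n]

candidates = [  # e.g. raw vector-search hits, roughly ordered by embedding distance
    "How to bake sourdough bread at home",
    "Weights and Biases experiment tracking docs",
    "Tracking experiment metrics with Weights and Biases",
]
print(rerank("weights and biases experiment tracking", candidates, top_n=1))
```

The design point is that the reranker sees the full query and document text together, so it can demote near-neighbour false positives that embedding distance alone let through.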
We just got tool use in fully open source, thanks to Nous Research and I haven't yet tested their tool use, but Cohere is definitely building this model. And I think, Swyx, you also saw this release, and we both identified pretty much the same thing, where this is interestingly not getting compared to GPT-4 or Claude Opus, right?</p><p>[00:44:40] <strong>Alex Volkov:</strong> They're not even trying. They have a very specific use case in mind, and I wanted to hear from Swyx if you have any other comments on that, or how they're positioning themselves and what world they're operating in.</p><p>[00:44:54] <strong>Swyx:</strong> For Command R,</p><p>[00:44:55] <strong>Alex Volkov:</strong> For Command R, and Cohere in general, yeah.</p><p>[00:44:58] <strong>Swyx:</strong> simple answer is probably not as good as GPT</p><p>[00:45:01] <strong>Alex Volkov:</strong> Yep.</p><p>[00:45:02] <strong>Far El:</strong> They didn't include it, but I haven't tried it out personally myself. People seem to be talking about it for retrieval and RAG-type use cases, but I can't give my personal endorsement. Just in general, Cohere, I think, has been more active in sort of enterprise use cases and fine-tuning, like talking about their fine-tuning capabilities, or long-tail, low-resource language use cases maybe they also released Aya, I think, last month, which some people in the open source community were quite excited about but yeah, I think seeing them do such a strong launch for a new model is like a second wind for Cohere, and I'm excited to see more coming out of them.</p><p>[00:45:43] <strong>Alex Volkov:</strong> Definitely feels like a second wind, and I don't know how much we've covered Cohere here before but the fact that they released the model also in open weights on Hugging Face, I think, gives them a lot of credibility from the community. 
LDJ, go ahead.</p><p>[00:45:58] <strong>Nisten:</strong> Yeah, I noticed they did actually post some benchmarks on the website comparing to LLaMA 2 70 billion, Mistral, and GPT-3.5 Turbo, like overall comparisons and RAG benchmarks, and Command R does seem to beat all of those three that I just mentioned. And of course, this is their own reporting; it's probably good to wait for third-party benchmarks. And it's apparently very good at multilingual abilities as well. I saw somebody whose one of their first languages is Portuguese saying Command R was one of the best models that was able to do that very fluently and understand the nuances of the language.</p><p>[00:46:39] <strong>Nisten:</strong> So yeah, I think that's really interesting, and it might just be a really good overall model for open source.</p><p>[00:46:45] <strong>Alex Volkov:</strong> Yeah</p><p>[00:46:45] <strong>Nisten:</strong> I think it is open source, but sorry it's open weights, but I think it's just a non-commercial license.</p><p>[00:46:51] <strong>Alex Volkov:</strong> Yeah, so they did an open weights release with a non-commercial license. And they did say that if you're an enterprise and you want to build something cool with Command R, talk to them and they'll figure out something. And Aidan Gomez, the CEO of Cohere and one of the authors of the Attention Is All You Need paper, recently unblocked and became friends with Nisten here in the Toronto community.</p><p>[00:47:16] <strong>Alex Volkov:</strong> He mentioned that this model is also optimized for 10 major languages for global business, and pre-trained on 13 more. It has a 128K context window, right? So if you do compare this to GPT-3.
5 Turbo or Mixtral, for example I don't remember, 32K context? this is 128K. And they specifically focus on speed in addition to everything else, right?</p><p>[00:47:39] <strong>Alex Volkov:</strong> And in these RAG systems, you may not need a model that's like super, super smart; you may need a model that is able to retrieve everything that you want much faster, and significant speed improvements may outweigh smartness on MMLU tasks, right? So I think that's their game, that's what they're playing they compare it, like LDJ said, to 3.5</p><p>[00:48:02] <strong>Alex Volkov:</strong> and not GPT-4 or Opus, and they have results in something called KILT, Knowledge Intensive Language Tasks, and in retrieval and tool use specifically. And so they also have a bunch of stuff on their platform to be able to do tool use and by tool, like I explained before, go get me some news from the web, for example so it's really focused on web integration, getting things from the web.</p><p>[00:48:22] <strong>Alex Volkov:</strong> Nisten, did you see the one-liner they posted, where they basically said, hey, here's Perplexity based on Command R? I think you replied to that. Do you want to cover this briefly? It was really fun as an example.</p><p>[00:48:36] <strong>Nisten:</strong> Yeah, I shared it in the Jumbotron, it's like the third thing. It looks like it's pretty easy to build a RAG pipeline with their code, but not all of it is open. There are a few things there which are unclear, and I haven't built that pipeline yet to say for sure. So I don't want to say anything that's incorrect, but it looks like they've made it really easy to build your own Perplexity in five lines of code.</p><p>[00:49:04] <strong>Alex Volkov:</strong> That was really funny. Like a little dig at Perplexity. Definitely the model is able to do the tool of web search this model specifically, like, excels at it but other tools as well. 
So shout out to Cohere second wind, like Swyx said. Definitely we'll keep you guys posted when some of us try this.</p><p>[00:49:21] <strong>Alex Volkov:</strong> An open weights model that you can run, but not commercially, but you can use it and train on it, and maybe this will help open source folks as well.</p><p>[00:49:29] Anthropic releases Claude Haiku - GPT3.5 competitor</p><p>[00:49:29] <strong>Alex Volkov:</strong> Moving on from Cohere I think in the same battlefield, actually Anthropic gave us an announcement yesterday, and a very smart release schedule from Anthropic, I must say, right?</p><p>[00:49:40] <strong>Alex Volkov:</strong> So they announced Claude 3 a few weeks ago; they announced three versions: Opus, which is their flagship that many people now prefer over GPT-4, which is quite incredible. It's not taking over on LMSYS yet. So GPT-4 still takes over on the LMSYS Chatbot Arena. But I think we've [00:50:00] been coming back here week after week and saying that some more folks use Opus.</p><p>[00:50:04] <strong>Alex Volkov:</strong> Um, let me see, just by raising hands. Did you guys use Opus in the past week? At least once? Do you have a thumbs up or thumbs down for Opus use?</p><p>[00:50:13] <strong>Swyx:</strong> Oh yeah, I use it every day.</p><p>[00:50:15] <strong>Alex Volkov:</strong> Every day. Wow. So you got the Pro thing, or are you using the API?</p><p>[00:50:20] <strong>Far El:</strong> I got Pro, but apparently I'm a chump because I don't have to use Pro like, only B2C, non-developer types should use Pro. Every developer should just use the Anthropic Workbench, because you just pay by API call, and you're probably using less than $30 worth.</p><p>[00:50:35] <strong>Alex Volkov:</strong> I will say this very quietly: with Anthropic, maybe you don't even have to pay, unless you apply for production use and then you have to put in a credit card. It's open, and you get API calls for free. 
I will say this, I will say this: Tony Dinh released, like a year ago I think, something called TypingMind, which is like a front end for ChatGPT basically, but on the back end you can put in every model that you want.</p><p>[00:50:55] <strong>Alex Volkov:</strong> So basically you get the ChatGPT experience, including vision stuff. You can upload images as well. And I think that costs like 30 bucks. If you get that and you plug in your API key that you get from Anthropic for free, you basically get the same experience, you don't have to pay the 20 bucks a month.</p><p>[00:51:08] <strong>Swyx:</strong> Do you use TypingMind every day? I hear some social media buzz about it, but I don't see any AI people, engineer type people.</p><p>[00:51:15] <strong>Alex Volkov:</strong> I hadn't used it up until I had to try Claude 3 and I didn't want to pay the extra 20 bucks, just remembering all our subscriptions. So I just plugged it into TypingMind and it's a nice experience. I still go to Workbench. Workbench is more for us, for engineers, right?</p><p>[00:51:30] <strong>Alex Volkov:</strong> Workbench, everything that you get there, you can immediately export and continue via the API, for example. And the Workbench is annoying because you have to remember, every prompt that you have, every answer that the model gives you, you have to click a button and put it back in kind of the stack of messages, right?</p><p>[00:51:47] <strong>Swyx:</strong> You can use keyboard shortcuts, but it's also meant for you to prototype prompts, right? So that's what you want to do. You want your conversations not to persist. You want to see the output and you're like, okay, throw away the output, I'll tweak the prompt again, generate the new output. So you don't want it to auto add to the conversation.</p><p>[00:52:04] <strong>Swyx:</strong> That's the main difference.</p><p>[00:52:05] <strong>Alex Volkov:</strong> That's true. 
And so definitely many folks use the Workbench, for prototyping prompts it's great, but just for chatting it's also great. So you've been using it, so what's your take on Opus so far?</p><p>[00:52:17] <strong>Swyx:</strong> Oh, yeah. If you go to AI News every day now, I'm generating Haiku, Opus, and what's the other one? Sonnet. By the way, did you know that the names of these things basically hint at the model size?</p><p>[00:52:30] <strong>Alex Volkov:</strong> Yeah, let's talk about this. Opus is like a big</p><p>[00:52:32] <strong>Swyx:</strong> Yeah. Haiku is three lines long, a sonnet is 14 lines long. Interestingly, an opus is unbounded, but: 3B, 14B, and probably 8 times 220B. Yes. I think the Claude people thought they were very smart by just encoding the numbers in.</p><p>[00:52:50] <strong>Alex Volkov:</strong> Gotta applaud them about the name, because I stopped saying Claude 3, I'm just saying Opus now, and everybody gets what we're talking about. Opus is a brand name that's built separately from Claude 3, which is, I think, very smart. Like 3.5, 4, 4 Vision, all these things, it's a little harder to say, and now they came out with actual names, and I gotta applaud the strategy.</p><p>[00:53:12] <strong>Alex Volkov:</strong> I think just to connect the dots back to where we are today: yesterday Anthropic finally released the announced Haiku, and yeah, Swyx, you had another comment that I spoke over?</p><p>[00:53:22] <strong>Swyx:</strong> Nothing, I was just going to say, if you want to, you should be generating things side by side and seeing the model difference. Haiku is very bad at instruction following. Sonnet is actually really surprisingly good enough. I would use Sonnet for most things, and then Opus is more powerful but slow and honestly not really worth it.</p><p>[00:53:42] <strong>Swyx:</strong> And if you want to see side by side generations, just go in the last few issues of AI News. 
You'll see side by side and you can decide for yourself which one you prefer. Yeah, so I run all the summaries through Sonnet and Opus and Haiku every day now, and I can see the difference.</p><p>[00:53:56] <strong>Swyx:</strong> I would say the general take is that Claude 3 in general is better at instruction following and summarization than GPT 4, which is huge. I can't believe I'm just saying that</p><p>[00:54:08] <strong>Alex Volkov:</strong> It's crazy.</p><p>[00:54:08] <strong>Swyx:</strong> of GPT 4. But it hallucinates more. There are very obvious inconsistencies in the things that it tries to, the facts that it picks up on, and they're just plain wrong.</p><p>[00:54:18] <strong>Swyx:</strong> And anyone with any knowledge of the subject matter will spot that immediately. So Soumith, when he was talking about Claude 3, actually referenced some examples from AI News in his timeline, if you go check out Soumith's timeline on Claude 3. And yeah, I will say that is the problem with using Claude 3.</p><p>[00:54:35] <strong>Swyx:</strong> It follows instructions very well, but then it will hallucinate things. Maybe because it doesn't have as good of a world model as GPT 4. Whatever it is, now I'm having to decide as a product creator, am I using Claude 3 because the vibes are better, but then do I have to build an anti hallucination pipeline, which I'm trying to build, but it's difficult because what is truth?</p><p>[00:54:56] <strong>Alex Volkov:</strong> Yes. Let me ask you a question real quick. One second, Nisten, and then Nisten you go. Swyx, one question: did you change your prompt for Claude specifically from your GPT 4 prompt?</p><p>[00:55:08] <strong>Swyx:</strong> I copied over some of it and I wrote some other parts from scratch. I understand that a lot of people say you should use XML for this stuff. 
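For context, the XML-tag style being debated here amounts to wrapping each distinct part of the prompt in named tags so the model can tell instructions, source material, and desired output format apart. This is a minimal sketch with arbitrary tag names, not an official Anthropic template:

```python
# Minimal sketch of XML-tagged prompting. Tag names are arbitrary; the point
# is that each section of the prompt is clearly delimited for the model.

def build_prompt(instructions, document):
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<document>\n{document}\n</document>\n"
        "Write your summary inside <summary> tags."
    )

prompt = build_prompt(
    "Summarize the document in one sentence. Do not add facts.",
    "Cohere released Command R, a model focused on RAG and tool use.",
)
print(prompt)
```

Asking for the answer inside tags (here `<summary>`) also makes the model's output trivial to extract programmatically, which is part of why the style is popular even outside structured output.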
I think that it's a little bit of mumbo jumbo, especially because I'm not doing structured output.</p><p>[00:55:22] <strong>Alex Volkov:</strong> I will say this: they have Alex Albert, who's now getting more of a highlighted role. He's the guy that we've talked about that did the Needle in a Haystack analysis, where Claude Opus realized that it's getting tested, right? So you probably saw this famous tweet. So Alex is the prompt engineer there.</p><p>[00:55:38] <strong>Alex Volkov:</strong> He has a Colab that's called a Metaprompt. You can find it, I'm probably going to put this in the show notes. You basically describe the task that you want, and then Opus comes up with the prompt for Opus itself. And the prompts that it comes up with work for me way better than the prompts that I've written myself.</p><p>[00:55:54] <strong>Alex Volkov:</strong> So it does use a little bit of XML. And I just want to say, Diana, it's not necessarily to you, but definitely to you as well: some different prompting is needed. These models do need different prompting, they've been trained differently. And XML is one part of it, but also, it feels like a little bit more prompting, and folks can't just expect the same prompt that works for GPT 4 to work.</p><p>[00:56:16] <strong>Alex Volkov:</strong> I think some of our intuition as well changes per model. Some models, like you said, are more hallucinatory, but follow instructions better. Definitely, I saw this. Nisten, I cut you off before. If you still remember where I cut you off, please continue.</p><p>[00:56:29] <strong>Nisten:</strong> No, it was along the same lines. So I've used Sonnet, and I just opened the Bing sidebar and quickly iterate through stuff with Sonnet. And yeah, I noticed the same thing. It does make up a lot of stuff. So then I need to drop it into Bing in precision mode and have it actually look up 
the stuff, and then it's still not quite ideal.</p><p>[00:56:52] <strong>Nisten:</strong> But this combination, I also use Mistral Large, just switching between Bing with internet mode and either Sonnet or Mistral Large to quickly iterate through, although Mistral Large is slow. So again, I really like the speed of</p><p>[00:57:09] <strong>Far El:</strong> Sonnet</p><p>[00:57:11] <strong>Alex Volkov:</strong> Yeah, so let's actually pick up on the kind of news thing. So we covered Claude before, and now we've talked about folks actually putting it in production, like Swyx, and we're also testing this. Anthropic released Haiku, which is their smallest model, and that doesn't compete with any GPT 4, they go for the lowest price and the fastest kind of execution.</p><p>[00:57:32] <strong>Alex Volkov:</strong> Fairly similar to the Command R kind of area of the playground that we got, right? It's like focusing on speed, focusing on the best performance possible for the fastest and cheapest price possible. And we definitely heard before from multiple folks who fine tuned GPT 3.5, for example, and get better results than GPT 4 on fine tuned GPT 3.5,</p><p>[00:57:51] <strong>Alex Volkov:</strong> and significantly faster as well. So Anthropic released Haiku, which is their fastest and most affordable model for enterprise applications. They stress enterprise because every token counts, every dollar counts, and you actually get to measure these models not only on how good they are, but also on how good they are compared to how much money you pay for them and how fast they respond to your users.</p><p>[00:58:14] <strong>Alex Volkov:</strong> And the main differences between Haiku and like GPT 3.5, or even [00:58:20] Gemini 1.0 Pro, the main difference is price. It's priced at 25 cents per million tokens, while GPT 3.5 is half a dollar per million tokens. 
So half the price. The output tokens are 1.25 dollars per million output tokens, and usually enterprises do prompt engineering, so they shove a bunch of stuff in the prompt, but the response is not that long.</p><p>[00:58:43] <strong>Alex Volkov:</strong> So usually you focus on the input tokens as well. It gets 75 on MMLU and 89 on GSM8K, which is significantly better than GPT 3.5. Now, they may have used the announced 3.5 metrics and not the actual metrics, which oftentimes folks do, but it's still very important, very impressive. And it does almost 76</p><p>[00:59:09] <strong>Alex Volkov:</strong> percent on HumanEval on code, which is quite impressive for a super fast model. But I think the highlight of the differences between a 3.5 or a Gemini 1.0 Pro is that Haiku is vision enabled, right? So you can pass images, it's quite impressively vision enabled. So whatever we got excited about last year, Swyx, I think is now possible at, like, 25 cents per million tokens, which is quite incredible.</p><p>[00:59:34] <strong>Alex Volkov:</strong> You can use it everywhere pretty much. A million tokens is a lot. And also it has 200, oh, sorry, go ahead, Swyx.</p><p>[00:59:43] <strong>Swyx:</strong> No, one caveat or question: is the vision model in Haiku the same vision model as in Sonnet or Opus, right? Maybe it's dumbed down as well, and no one's really run any of the benchmarks on this stuff.</p><p>[00:59:56] <strong>Alex Volkov:</strong> Yeah, and then I think it's worth calling out that now you get the same 3.5-level speed, a significant improvement in performance, plus vision enabled, plus a 200,000 context window as well, while 3.5 is, I think, still, is it 8K? Yeah. So shout out to Anthropic, sorry, not Cohere, to Anthropic, to keep bringing us the news.</p><p>[01:00:17] <strong>Alex Volkov:</strong> The release schedule was very well timed. 
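The input-heavy pricing logic being described is easy to check with back-of-envelope math, using the prices quoted on the show: Haiku at $0.25 per million input tokens and $1.25 per million output tokens, and GPT 3.5 at $0.50 per million input tokens. The GPT 3.5 output price used here ($1.50/M) is an assumption for illustration; the show only quoted its input price.

```python
# Per-million-token prices in dollars. GPT-3.5's output price is an assumed
# figure for illustration; only its $0.50/M input price was quoted above.
HAIKU = {"in": 0.25, "out": 1.25}
GPT35 = {"in": 0.50, "out": 1.50}

def request_cost(prices, input_tokens, output_tokens):
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * prices["in"] + output_tokens * prices["out"]) / 1_000_000

# Enterprise RAG-style request: big stuffed prompt, short answer.
print(request_cost(HAIKU, 30_000, 500))   # 0.008125
print(request_cost(GPT35, 30_000, 500))   # 0.01575
```

With a 30K-token prompt and a 500-token answer, the input side dominates the bill, which is why the comparison above focuses on input-token price.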
They released the biggest two models and then followed up with this fast model. And for folks who are looking for how to use it, or maybe lower their costs with the same performance, it's very interesting. Anthropic promised us tool use and function calling, and they haven't yet given us function calling.</p><p>[01:00:35] <strong>Alex Volkov:</strong> They said that these models are able to do function calling and tool use, but we're still not able to use this. Depending on your preferences, you may go there.</p><p>[01:00:42] Hardware and Robotics</p><p>[01:00:42] <strong>Alex Volkov:</strong> Big companies and APIs, I think that's most of what we want to cover, but I think we're smoothly moving towards the next area where we talk about hardware and robotics, because one big company joined another big company a few weeks ago, and now it's worth talking about OpenAI and Figure, the humanoid robot.</p><p>[01:01:01] <strong>Alex Volkov:</strong> Figure has been in the news, they showcased the robot for a while. Their main competitor, I think, it's funny how Boston Dynamics was the big name in robotics for a while, and now all these companies are leapfrogging Boston Dynamics in some fields.</p><p>[01:01:16] Figure + OpenAI integration shown on a video</p><p>[01:01:16] <strong>Alex Volkov:</strong> So Figure has this humanoid robot, it has ten fingers, it moves them very freely, it's very interesting.</p><p>[01:01:22] <strong>Alex Volkov:</strong> Recently they announced their integration with OpenAI, and I think OpenAI also announced the integration with Figure. And now they released a video, and that video is bonkers. It's really, folks, it's really bonkers.</p><p>[01:01:31] <strong>Alex Volkov:</strong> They show the Figure robot standing in some form of a lab. Funnily enough, on the back wall it says AGI lab. And then that Figure robot has a little screen on its face. 
And that screen shows you the same exact interface that you and I get in the ChatGPT iOS app, with the little circle that turns into a few animated things when you talk to it. And I found it really funny that they insisted on keeping the same kind of UI. And basically this person comes up to this Figure and says, hey, what do you see right now? And the robot uses the onboard cameras to send an image,</p><p>[01:02:07] <strong>Alex Volkov:</strong> I guess it's one image, to GPT 4 vision, and replies with, I see, literally the robot says, I see you standing in front of me with your hand on the table. So that was one quirk of how the robot knows that the person that talked to it is the person who actually stands in front of it. But I think they stress that this video is end to end, not edited, and it's 1x speed also, and the robot replies fairly fast.</p><p>[01:02:32] <strong>Alex Volkov:</strong> Time to speak on OpenAI's GPT 4 is quite fast anyway. If you use the app, you know this, but the vision stuff, maybe they got some private versions of the API. Responses, folks on the stage, did you see this video? What do you think of this? Is the Terminator coming or are we still not there?</p><p>[01:02:48] <strong>Alex Volkov:</strong> What are some of your responses to this Figure video, now with OpenAI brains?</p><p>[01:02:56] <strong>Alex Volkov:</strong> Roei, go ahead.</p><p>[01:02:58] <strong>Roei Cohen:</strong> Yeah,</p><p>[01:02:59] <strong>Roei Cohen:</strong> very shortly before that, I listened to Yann LeCun's podcast with Lex. Yeah, and what's striking me about that demo was that we're actually quite close to having usable robots just using the reasoning that is available on OpenAI. And that's, I think, remarkable.</p><p>[01:03:20] <strong>Roei Cohen:</strong> You know what I mean? 
Because at the end of the day, you're like, oh, these things are not thinking, they're just spinning out next tokens and whatnot, but more and more I feel myself drawn into Ilya's camp, where you're like, no, there's probably some world model that these things have to develop internally, because otherwise they wouldn't be able to accomplish all these tasks. Essentially what you need is some sort of an understanding of embodiment in order to reason about where to move your limbs and how to pick up things and things of that sort.</p><p>[01:03:50] <strong>Roei Cohen:</strong> I don't know. I just thought that there was a really stark contrast between what they showed in the demo and that conversation. I'm more optimistic today than I was before.</p><p>[01:03:59] <strong>Alex Volkov:</strong> Absolutely. And I think there was one additional reason for Space Daddy to sue OpenAI, for Elon to sue: Optimus is definitely a bet that Tesla is making now. Tesla's whole reason for existing was to bring the world renewable energy. And when Optimus was announced, many investors thought, hey, is this moving the vision a little bit forward?</p><p>[01:04:19] <strong>Alex Volkov:</strong> Because Optimus does not bring the world renewable energy. Optimus is advancing super quick as well. We saw multiple demos over the past year. The last demo blew me away in terms of the dexterity of the different fingers and everything. And then you gotta wonder how smart Optimus will actually be,</p><p>[01:04:35] <strong>Alex Volkov:</strong> in terms of its ability to perform tasks and respond to you. And Figure announced their advanced robot, and then they announced the integration with OpenAI, which we know that Elon is now on the warpath with. 
And so I gotta wonder about their communication, their integration, now that Elon has also Optimus and Grok.</p><p>[01:04:53] <strong>Alex Volkov:</strong> Given where Grok is right now, and I posted this as a question on my timeline, would you prefer Optimus, which is better and flashier, but with Grok brains, versus a Figure with GPT 5 brains or something? And I think it was quite obvious where the distribution lies.</p><p>[01:05:07] <strong>Alex Volkov:</strong> You would want the less flashy robot that's potentially smarter over the flashy robot that's GPT 3.5 level. So the understanding of the scene was very impressive there. The text to speech was very natural. I don't know if you guys noticed in this video, but the robot actually ums and uhs and takes pauses, and it feels like they built something like this.</p><p>[01:05:27] <strong>Alex Volkov:</strong> They probably used the same text to speech as OpenAI, but it feels like OpenAI gave them a little bit of a better model, because I use the OpenAI text to speech often via the iOS app, and it doesn't go, you know what, I actually think this, right?</p><p>[01:05:41] <strong>Roei Cohen:</strong> To be fair, I've started seeing this behavior in text to speech with Pi first. Pi already does this, ums and uhs, and a more natural kind of cadence.</p><p>[01:05:51] <strong>Alex Volkov:</strong> Yeah, Pi is very expressive for sure. LDJ, go ahead.</p><p>[01:05:55] <strong>LDJ:</strong> Yeah, I actually use a custom instruction with ChatGPT where I specifically give it a set of principles to follow.</p><p>[01:06:02] <strong>LDJ:</strong> And the last principle is, make sure to include ums and uhs in your speech, as if you're talking. And I feel like when I use the conversational voice mode, it makes it feel a lot more realistic, because then it's actually literally saying uhs and ums. 
And it does end up doing that with me.</p><p>[01:06:17] <strong>Alex Volkov:</strong> Yeah, so I definitely noticed this, and this could be just a result of something like a custom instruction, or maybe they're using a different model. The voice they use is not one of the voices that OpenAI gives. I think it's a custom voice that they use. It's a little raspy. It's pretty interesting that they gave it something.</p><p>[01:06:35] <strong>Alex Volkov:</strong> And Roei, to your point before, I gotta wonder how deep the integration goes. Do they use just [01:06:40] the API, or do they have a fine tune on top for the ability of this robot to actually perform tasks?</p><p>[01:06:45] <strong>Nisten:</strong> I did see official confirmation, somebody was asking, hey, are you guys using GPT 4 and maybe a new model or something? And then the CEO of Figure, he just cryptically replied saying, we're using some new advanced features, something like that.</p><p>[01:06:59] <strong>Alex Volkov:</strong> Yeah, they're definitely getting advanced features. We know that OpenAI gives advanced features to friends, so Spotify, for example, uses OpenAI voice cloning tech for converting Lex to Spanish, and we know for a fact that they give this out very sparingly, so probably they have more advanced features.</p><p>[01:07:16] Cerebras announced their largest and fastest AI Chip CS-3</p><p>[01:07:16] <strong>Alex Volkov:</strong> Alright, in the hardware and robotics things, we want to cover two more things super quick. Cerebras announced their largest and fastest AI chip on Earth. So this is a company that builds custom hardware, and they announced the CS-3, which they make some big claims about, and all these claims are probably still in flux.</p><p>[01:07:36] <strong>Alex Volkov:</strong> And I don't know if it supports, how should I say, PyTorch, for example, but they claim that it can train up to 24 trillion parameter models on a single device. 
They say the world has never seen AI at this scale, and it's insane. 24 trillion parameters on a single device. It's insane.</p><p>[01:07:55] <strong>Alex Volkov:</strong> It has 4 trillion transistors. I can keep saying numbers, and I'm not like a numbers guy, so when people talk numbers at me, they blow past, but 900,000 AI cores on this chip. And it's very interesting that they have terabytes of external memory, even up to petabytes, which is crazy.</p><p>[01:08:12] <strong>Alex Volkov:</strong> Anybody who's more into hardware want to comment real quick on what Cerebras announced and how important this is to the industry? You're more than welcome.</p><p>[01:08:20] <strong>Nisten:</strong> Yeah, the on chip memory, which is pretty much equivalent to cache in a GPU, they have, I want to say, 40 or 50 gigabytes on the CS-3, which pretty much means you would be able to train or inference theoretically anything Mixtral size or smaller at insane speeds. We're talking, maybe, I don't know, at least a thousand tokens a second probably, maybe even five thousand or more, and that might even be conservative too. Look, there's insane amounts of compute and bandwidth here that you could have, especially for small models.</p><p>[01:08:53] <strong>Alex Volkov:</strong> That's quite incredible. I don't know if that's in production at some point or when it's going to be, but at least based on the numbers, this looks just absolutely incredible.</p><p>[01:09:05] Extropic announces their Thermodynamic processing unit</p><p>[01:09:05] <strong>Alex Volkov:</strong> And in addition, in hardware news, super quick: Extropic, the folks who, I think, were in DeepMind before and did some</p><p>[01:09:15] <strong>Alex Volkov:</strong> quantum computing stuff. I think that's what Guillaume's background is. 
They announced their TPU, or at least what they're going to build or are thinking about building, which is not a tensor processing unit like the TPU from Google. It's a thermodynamic processing unit. It's basically a tease at this point.</p><p>[01:09:32] <strong>Alex Volkov:</strong> I don't think they have hardware ready to go, or at least not that I understand. And, I'll add this to the show notes, I had to dig deep into trying to understand what it is that they announced. And it was really hard for me. And it didn't seem like my non hardware background was the reason.</p><p>[01:09:48] <strong>Alex Volkov:</strong> It felt like some other folks were also getting a little bit lost in what they actually talked about. LDJ, if you want to take a stab at giving us a little brief recap, I would really appreciate it, because I know that you were in some of these spaces. But thermodynamic is like a new approach to basically doing AI, as far as I understood.</p><p>[01:10:07] <strong>LDJ:</strong> Sure, yeah. So there was a Q&amp;A that they held yesterday, which I think is actually recorded, and maybe on the Extropic page it might be there for anybody to listen to. But I spoke with them a bit, and Guillaume, the CEO, and Trevor, the CTO, they're both creators of TensorFlow Quantum, and they worked at Google, and they didn't work at DeepMind, but they actually worked on something arguably a little cooler than DeepMind, depending on who you ask, called Google X, which is pretty much the secret development division of Google, where they work on very long term deep technology projects.</p><p>[01:10:44] <strong>LDJ:</strong> And yeah, Trevor and Guillaume, they met at Google X when they were working on TensorFlow Quantum and a lot of quantum computing technologies. 
And through a lot of the systems that they had to develop to mitigate all the errors that built up in the quantum computing systems they had to account for, they ended up on a path where, hey, we could actually start using this technology itself for the computing in the first place.</p><p>[01:11:09] <strong>LDJ:</strong> And the goal is really just a general speed up of mainly things like gradient descent and operations that are pretty much used in all of deep learning and all of AI. So it's not just specific to transformers or specific to this and that. And yeah, they plan to have a bunch of server grade chips within the next, let's say, around three years or so, and they plan to have consumer available chips as well, in accelerator form factors.</p><p>[01:11:40] <strong>LDJ:</strong> So you'd be able to just plug it into your motherboard, just like you plug a GPU into your motherboard today, and it would just be an accelerator card that has these thermodynamic computing components within it that would be able to accelerate your AI workloads just way more.</p><p>[01:11:55] <strong>Alex Volkov:</strong> That's incredible. I think they wanted to call this the transistor of the AI era, and the transistor was like a big step function change in the world of computing. So shout out to them. It still looks like it's a little ways out, but definitely they're getting interest. And the very techno positive, techno optimist outlook is definitely also</p><p>[01:12:16] <strong>Alex Volkov:</strong> also helpful. So I think that's mostly it on the hardware news and robotics. 
We don't usually cover these often, but this week seemed to have been a very big week in hardware news and robotics, and a lot of stuff happened that pertains to kind of tech optimism, a very big week for announcements and different things.</p><p>[01:12:35] <strong>Alex Volkov:</strong> The chip design for whatever they're doing looks crazy as well. So definitely, folks who are into this, go check it out and let us know what you think in the comments as well. I think we've been at this for almost an hour and something, and I do want to do a little reset, maybe drink a little bit.</p><p>[01:12:50] <strong>Alex Volkov:</strong> So let's do a short reset of the space, and I'll talk about some stuff that's happening at Weights &amp; Biases. And then we're going to continue talking. We still have to talk about Devin. So we'll do a brief reset, and then we're going to pick up on this.</p><p>[01:13:27] <strong>Alex Volkov:</strong> Alright, you are on ThursdAI, today is Pi Day, March 14th, and this day started crazy and kept getting crazier. So this morning, early on, many folks woke up to see SpaceX launch the largest man made object ever to break through the atmosphere, and this was a crazy thing, the third time they tried. Today is also the birthday</p><p>[01:13:52] <strong>Alex Volkov:</strong> or anniversary of GPT 4, which was released a year ago exactly on Pi Day, March 14th. And we're still waiting for some breaking news to come through, and hopefully they release something. There were rumors that GPT 5 is coming up. Microsoft Copilot pages said, hey, you can get access to GPT 5 for all.</p><p>[01:14:09] <strong>Alex Volkov:</strong> You saw this, right? And then those rumors were discarded. Microsoft said there was a typo. And we're going to try and see what else we're going to get here today in breaking news. 
But also today's an anniversary of Entropiq's Claude, the first Claude, the first kind of production model that they released, the Entropiq released was also a year ago.</p><p>[01:14:32] <strong>Alex Volkov:</strong> And very big week last year, very big week this year as well. And of course, it's also ThursdAI, BirthdAI, it's a one year anniversary of these spaces as I've been at least hosting them for the past year. Consistently, I think we missed one. I think I was really sick and I missed one. I still sent the newsletter, but I missed it.</p><p>[01:14:49] <strong>Alex Volkov:</strong> the space. And so we're here celebrating the ThursdAI birthday night with a bunch of friends here on stage. And I think in this vein it's now a good time for me to say that, ThursdAI is not sponsored. If you want [01:15:00] to support this, please support us, please follow ThursdAI, the newsletter.</p><p>[01:15:03]</p><p>[01:15:03] This weeks buzz - Weights & Biases update</p><p>[01:15:03] <strong>Alex Volkov:</strong> And if you want to engage with us on socials, that definitely helps because sometimes the rich is the rich is not how should I say, it's hidden on X for some reasons. We've seen better and worse times.</p><p>[01:15:13] <strong>Alex Volkov:</strong> So definitely if you want to follow us and give us a follow to the main account, but also like retweet when we start the space that we're gonna be super, super helpful. But the space is not hosted, is not sponsored besides Weights Biases. And so I think I maybe tell you a little bit how I joined Weights Biases because this was also a thing.</p><p>[01:15:30] <strong>Alex Volkov:</strong> So folks, remember me joining the spaces and thinking about, hey, this is fun to do. I have no idea how, how I'm going to make money. Back then, I The whole, one of the reasons to do this space was to promote my startup called Torgum. 
At some point, Weights &amp; Biases folks reached out and said, hey, let us sponsor your newsletter and podcast,</p><p>[01:15:48] <strong>Alex Volkov:</strong> because the audience that you draw and the audience that we're looking at are very similar. And I was really apprehensive in the beginning. I didn't really want to take sponsorships, as you guys may have liked the authenticity of the excitement and the stuff that we talked about. We never do paid gigs.</p><p>[01:16:05] <strong>Alex Volkov:</strong> Nobody pays to come and push their stuff on ThursdAI, which I think the community appreciates. And so I was really thinking about, okay, is this the right thing? And then after a while with Weights &amp; Biases, I was entertaining this, because they really have a good standing with the open source community as well.</p><p>[01:16:20] <strong>Alex Volkov:</strong> Their main product is free for personal use. And many folks in the audience knew and loved Weights &amp; Biases way before I even knew who they are, and gave me kind of the thumbs up. And then Weights &amp; Biases reached out again and said, hey Alex, why don't you and ThursdAI join Weights &amp; Biases? And you just keep doing this, in addition to being an AI evangelist, promoting our products, and our different products, by the way.</p><p>[01:16:42] <strong>Alex Volkov:</strong> We have some new announcements very soon that I'm very excited about. And back then I really started needing the money, and the startup wasn't taking off, and so I said, hell yes. This sounds like an amazing opportunity for me to keep doing this, to keep talking about AI with you folks,</p><p>[01:16:58] <strong>Alex Volkov:</strong> learning myself, learning from folks on stage here who know way more than me, and then also learning in public, so other folks also follow along. And yeah, that's how I joined Weights &amp; Biases. And now ThursdAI, the podcast and the newsletter, are offered by Weights &amp; Biases. 
The space is where I talk about Weights & Biases stuff, but I talk about the stuff that actually excites me very much.</p><p>[01:17:20] <strong>Alex Volkov:</strong> And so in that vein, I just wanted to add that I'm going to San Francisco in a month, me and everybody else at Weights & Biases, because our annual conference, called Fully Connected, is going to be on April 18th in San Francisco. And the tickets are still early bird, so up until the end of this month you can get them for 50 percent off.</p><p>[01:17:41] <strong>Alex Volkov:</strong> And it's an opportunity to get, how should I say this? One of the reasons why I joined Weights & Biases is because everybody's a customer, including, GPT-4 was trained with Weights & Biases. But also pretty much every other foundational lab that builds foundation models, in robotics and different other places. Just the amount of logos among our customers at Weights & Biases beats any other company I've ever worked at, or even looked at. It's crazy.</p><p>[01:18:06] <strong>Alex Volkov:</strong> And so many of these folks will come to this conference to also talk about what they're building, the models. So a very good opportunity to visit San Francisco and join us. A day before this conference, I'm going to do a workshop, me along with my team, the growth ML team, and ambassadors. We're going to do a workshop about improving your production step by step.</p><p>[01:18:25] <strong>Alex Volkov:</strong> And it's going to be super cool. We're going to talk about evaluations. We're going to talk about different other things that we know from the enterprise community, the folks who actually use AI in production. They talk to us. We have our own AI that we're running called wandbot that you're more than welcome to use.</p><p>[01:18:38] <strong>Alex Volkov:</strong> So definitely. 
Come and meet us in San Francisco in April, or in London in May, by the way. If you're Europe based, we have the same kind of conference in London, which I probably won't be attending, but you never know, maybe I'll get called. With this, I think we're moving towards: let's talk about agents.</p><p>[01:18:56] <strong>Alex Volkov:</strong> And I think this week was a big thing for agents. Don't you guys think? Who saw the Devin announcements? I'm just gonna do it like this. Yeah. Nisten, what do you think? Who didn't, right? They exploded into everybody's feed, I think faster than AutoGPT a year ago or something.</p><p>[01:19:14] Cognition Labs Showcases Devin, the first AI software engineer</p><p>[01:19:14] <strong>Alex Volkov:</strong> And so let's basically do a brief cover of the Devin announcement, and then we'll talk about what it actually means. And then I think I'll open up the space for the first time in a month or in a year to actually talk with people who are in the audience and want to come up and tell us about their experiences with ThursdAI.</p><p>[01:19:29] <strong>Alex Volkov:</strong> Cognition Labs, a fairly new company, looks like fully funded, released Devin, their, what they call, first fully autonomous AI software engineer. And we've seen these claims before, and some of us are very skeptical, because these demos are incredible, and then when you actually get to use them, the model loses context, etc.</p><p>[01:19:48] <strong>Alex Volkov:</strong> And they claimed to be setting a new standard on the software engineering coding benchmark, SWE-bench, where I think they're outperforming all these things and getting around 18 percent on SWE-bench. 
Which is a specific set of tasks: not only writing pieces of code, but also performing tasks.</p><p>[01:20:08] <strong>Alex Volkov:</strong> They claim it's operating as a highly capable teammate, capable of working alongside human engineers, or independently tackling tasks for their review. So one of the things that caught me by surprise, a fair surprise, is that compared to something like AutoGPT before, or other agents that we saw (and we've talked with multiple agent folks on the pod: we've talked with João from CrewAI, which has been very highlighted in the open source community recently; we've talked with Killian from AutoGPT; we've talked with a bunch of agent folks), this Devin, besides the company and the investment and everything, the UI is very polished.</p><p>[01:20:44] <strong>Alex Volkov:</strong> The UI is actually a set of tools, and I asked a few folks with access to Devin... so if you have access to Devin, please DM me and come up and talk about this if you're listening to this. The UI has access to a shell, so you can see a shell, like a terminal; the UI probably has access to a virtual machine as well.</p><p>[01:21:03] <strong>Alex Volkov:</strong> It has a browser that you as a user can see and Devin as an AI agent can use. So for example, you can log in and have authenticated sessions for different things, and then Devin can use this browser. The UI has access to a code editor, basically, that you can see Devin write things in.</p><p>[01:21:22] <strong>Alex Volkov:</strong> And you have access to a chat. And I think that the combination of these four things in the UI, plus the ability to follow Devin in real time, but also scroll back to see something that Devin did a few steps before, is very powerful, and I still haven't seen anything like this.</p><p>[01:21:39] <strong>Alex Volkov:</strong> I think for many people this broke the kind of threshold of something that's useful. 
Far El, go ahead, you have your hand up.</p><p>[01:21:46] <strong>Far El:</strong> Devin's very impressive, but of course it's not open source. I posted on top here an open source version of Devin called MetaGPT, which self reports to be better than Devin. It's up to, like, we need to do evaluations to find out, but also there are several open source communities that have formed. We're talking about a group with a dozen folks, another group with hundreds of people, who are all coordinating to replicate</p><p>[01:22:13] <strong>Far El:</strong> Devin in open source. I think actually one of the people here is Junyang in the audience, who is trying to also replicate open source Devin. So maybe you want to bring him up to discuss that. But yeah, in general I think Devin is impressive, but the most interesting insight is potentially the fact that this is just a GPT-4 wrapper, and they've just managed to squeeze so much more out of GPT-4 than we have been able to.</p><p>[01:22:41] <strong>Far El:</strong> Definitely a lot of interesting things to come based on just knowing that this is possible.</p><p>[01:22:46] <strong>Alex Volkov:</strong> What you said, Far El, is very similar to when ChatGPT came out, and this was, quote unquote, just the UI, right? There's no technological breakthrough necessarily in ChatGPT's release, but the fact that it was nicely packaged, the fact that they kept sending back and forth messages to keep the state for you, the memory for you as well, broke the level for many folks who were not using the API for completion before.</p><p>[01:23:10] <strong>Alex Volkov:</strong> Definitely, Junyang is more than welcome, always. LDJ, I'll get to you and then we'll talk with Junyang about the open source side, but I also do want to cover the rest of the stuff that got folks excited about [01:23:20] Devin. Go ahead, LDJ. Yeah.</p><p>[01:23:25] <strong>LDJ:</strong> Yeah. Okay. 
Apparently there was an interview where they were asking the Devin folks about what models it's using, and apparently they said, they were vague about it, or maybe the interviewer just didn't get it that well, but they said an LLM with reinforcement learning.</p><p>[01:23:43] <strong>LDJ:</strong> And that could just mean RLHF, but I think they might be talking about real, traditional reinforcement learning, where you're actually... it's the holy grail: if you have something that's coding and you have it being able to learn and compete against itself, and being able to iteratively improve at complex tasks, something like that.</p><p>[01:24:03] <strong>LDJ:</strong> So that'd be really interesting if that's the case, and it seems like that's what they were maybe alluding to, that they have a custom model that is trained through reinforcement learning.</p><p>[01:24:12] <strong>Alex Volkov:</strong> Yeah, and I think it's very important to also highlight the UI they built around this and the tools that Devin is able to use. Even if it's, like, a wrapper, the way they provide these kinds of tools is very interesting. The browser, to me, is one of the more interesting parts as well, because I know for a fact, when I code, I log into a bunch of services.</p><p>[01:24:32] <strong>Alex Volkov:</strong> I read their APIs; some of their APIs and keys are only locked behind a login wall, for example. And so for something like an AutoGPT, or even Cursor, right? I know that I and some other folks use Cursor for coding as well, and Cursor has some agentic properties. 
You can ask it about your code; it even edits your code inside your code editor.</p><p>[01:24:53] <strong>Alex Volkov:</strong> And then it's able to perform some of these meta tasks, like figure out what's the problem, go and search for something. And the ability of Devin to do this... I saw one video where it decided to change the code to add debugger statements, then get a better handle on what the actual issue is, and then perform something.</p><p>[01:25:13] <strong>Alex Volkov:</strong> And, how should I say? The big problem with something like, let me add Slava as well, the big problem with something like AutoGPT before was getting lost. Getting lost in context, getting lost the more tasks it executes. I saw enough videos from folks who are not shills of Devin, for example.</p><p>[01:25:29] <strong>Alex Volkov:</strong> They are happy to promote the fact that they got early access, but they're probably not paid by the Devin folks, the Cognition Labs folks. Enough demos of them recording continuously for 20, 30 plus minutes where Devin actually executes stuff based on the plan 20 minutes in. And I personally haven't seen this from a lot of other agents.</p><p>[01:25:48] <strong>Alex Volkov:</strong> So I do want to acknowledge Justin, Junyang, on the stage, a member of the technical team at Qwen, the Qwen team, I think, is the profile. Hey Justin, how are you? Thanks for joining ThursdAI, BirthdAI.</p><p>[01:26:00] <strong>Justin Lin:</strong> Yeah. Hi, Alex. Thanks for bringing me up. Yeah, I'm a member of the Qwen team. I just recently met Devin. It's very impressive. And we're just talking about, maybe, something related to code large language models. Actually, we are doing something about it, so we just raised it on Twitter to say, hey, anybody interested?</p><p>[01:26:25] <strong>Justin Lin:</strong> I don't know. It is really</p><p>[01:26:27] <strong>Justin Lin:</strong> hot. 
There are</p><p>[01:26:29] <strong>Justin Lin:</strong> a lot of people who are joining us, hoping to reproduce an open source Devin. We still don't have a clear roadmap, but for now, for the model, we may think, maybe for the first step, we still use something like a closed source model like GPT-4.</p><p>[01:26:49] <strong>Justin Lin:</strong> Admittedly, even like 5 is not enough for such complex tasks. I have heard some rumor that Devin might be built upon GPT-4 with very good prompt engineering. I don't know if this is true yet, but we may start from this to build something like a demo. And for another part, for the model, we may build something like a code large language model, especially one adapted to very long context.</p><p>[01:27:21] <strong>Justin Lin:</strong> So it can probably browse the web pages, crawl the contents, and then, based on the contents, write some code and do something complex. Yeah, this is generally some initial ideas, and we still need some time to think about what to exactly</p><p>[01:27:40] <strong>Justin Lin:</strong> do next, yeah.</p><p>[01:27:42] <strong>Alex Volkov:</strong> So first of all, folks, follow Junyang in this effort, and definitely hear more about the open sourcing. I think, like Far El said, one of the best things when something like this comes out is it gives a fairly clear roadmap for folks to try and replicate. And I think the roadmap should include the UI itself.</p><p>[01:27:59] <strong>Alex Volkov:</strong> The browsing UI is, I think, very important. The integrated shell is important, at least for the ability of you interacting with this. One thing that I also noticed about Devin is that you can actually talk with it while it performs other tasks. 
So, like you can with an actual software engineer who works on your team, you can chat with it while it performs other tasks.</p><p>[01:28:19] <strong>Alex Volkov:</strong> I'm not sure how they necessarily achieved this, but it's very interesting; it probably executes in several steps. They definitely built something there that's not only code execution. I think, Nisten, go ahead.</p><p>[01:28:28] <strong>Nisten:</strong> 300 years ago, my grandmother got automated. The machine was called, it was actually called a chain of cards machine, by Basile Bouchon, and that went on to become the Jacquard loom, and my grandmother's knitting now became 10,000 times faster. So that was 10,000x of grandmas. AGI is only 1x. And this thing you guys are talking about is 1x.</p><p>[01:29:02] <strong>Far El:</strong> I don't know. Nisten, is your grandmother 300 years old?</p><p>[01:29:06] <strong>Nisten:</strong> The Jacquard machine is 300 years old. Yeah, it was first, like, in 1725. No b******t, no, for real. And that actually used, like, punch cards. It was called chain of cards. That's the real name of it. Not Chain of Thought, Chain of Cards. It's the same thing, it's just instruction following, a Chain of Cards machine.</p><p>[01:29:25] <strong>Nisten:</strong> And it made stuff close to 10,000 times faster than my grandma could. Now, that didn't stop my grandma from knitting. I don't know why people are freaking the heck out that now this thing can do 13.2 percent of GitHub issues. I am freaking out that we, with all of this automation, the smartest freaking Olympiad kids in the world, I ranked, let's leave that alone.</p><p>[01:29:54] <strong>Alex Volkov:</strong> Ha.</p><p>[01:29:58] <strong>Nisten:</strong> like we can barely, and we still have to do the other 87 percent of the work. I don't know why people are freaking the</p><p>[01:30:04] <strong>Nisten:</strong> yeah, they said the same thing for. 
For people with Excel, it's, oh, all the programmers, analysts, whatever the heck they were called back then, it's going to automate them.</p><p>[01:30:14] <strong>Nisten:</strong> Did it automate? Yeah, it automated them. Has the need for them improved? Yeah. The same thing when Copilot came out two years ago. We've been using these tools for two years. You still can't find a good JavaScript dev to hire. Dude, people are freaking the f**k out, man.</p><p>[01:30:33] <strong>Alex Volkov:</strong> So let's actually talk about this.</p><p>[01:30:35] <strong>Nisten:</strong> Learn to code, don't be dumb.</p><p>[01:30:37] <strong>Alex Volkov:</strong> At least some of the conversations, Nisten, is just, like, thread boys hyping up things. Oh, software engineering is dead, whatever. They will always keep doing this. They're gaming the algorithm on X rather than providing some value.</p><p>[01:30:51] <strong>Alex Volkov:</strong> But there are definitely folks that I saw, replying to major folks, saying, hey, I'm about to learn coding, should I even learn coding? Because when somebody who doesn't know coding sees something like Devin, they're like, why do I even need to study? In a few years this will be, like, way better than me. Even now it's way better than me as a starting point.</p><p>[01:31:11] <strong>Alex Volkov:</strong> And I think the answer to this is that the world will just need more code. It's very interesting that software engineers, in general as a concept, try to automate ourselves out of laziness as much as possible, right? I would sometimes spend days on automating a task I can complete manually in five minutes, just because I know that I'd be able to do some other stuff faster while this is getting automated.</p><p>[01:31:32] <strong>Alex Volkov:</strong> Sometimes it's nerd sniping, but whatever. 
And then I think that for folks in the audience who are thinking about learning to code: learn to code. [01:31:40] The reason is, Devin will need your help to figure out what to do next. The outputs of Devin need somebody who knows how to code. None of the folks who got Devin are, like, marketing people, complete noobs for whom it just worked.</p><p>[01:31:53] <strong>Alex Volkov:</strong> So you do need the ability to actually run these things productively. And I think learning to code is a very important skill. If anything, it will give you a meta skill: you'd be able to do the boring stuff, review the more complex stuff, and achieve more. And I think that's very important.</p><p>[01:32:11] <strong>Alex Volkov:</strong> Many of us, how should I say? There are some gatekeepers in the coding community for whom the ability to code is their thing, their way to say, okay, this is how we make money. But for many people, coding is just a tool to get somewhere. This somewhere is shipping a product, doing a task, doing some of these things.</p><p>[01:32:30] <strong>Alex Volkov:</strong> That's not going to go away. If anything, this is going to get that much better. So, I saw Slava, you wanted to comment on that, and then Roei.</p><p>[01:32:42] <strong>Nisten:</strong> Yeah,</p><p>[01:32:42] <strong>Slava Kurilyak:</strong> I wanted to add some color to the Devin circumstance where we find ourselves using, at least, a new approach, where it seems like GPT-4 has been, let's say, claimed at this moment to be empowered by reinforcement learning. There are now developers who are going down this path. I'll do a shout out for one of them.</p><p>[01:33:05] <strong>Slava Kurilyak:</strong> This is a developer, his name is Rohan. I'll pin his open source project to the top. It's called LlamaGym. Feel free to check this out. 
This is an agentic framework for using reinforcement learning to essentially fine tune language models. Now, why would you do this? Because, from my experiments with language models, at first you can get away with prompt engineering, but at some point,</p><p>[01:33:30] <strong>Slava Kurilyak:</strong> to mimic human-like performance, you do need to fine tune. And in this circumstance, reinforcement learning has been shown to make incredible progress, especially with companies like DeepMind. And yet, we haven't really seen adoption of reinforcement learning within the generative AI community, but now, with tools like LlamaGym, developers can start to bridge the two.</p><p>[01:33:56] <strong>LDJ:</strong> Can you post that to the billboard,</p><p>[01:33:58] <strong>Alex Volkov:</strong> Yes, please post as well.</p><p>[01:33:59] <strong>LDJ:</strong> if you happen to have a link or a tweet.</p><p>[01:34:01] <strong>Alex Volkov:</strong> Absolutely. Roei, go ahead, and afterwards I want to acknowledge Ryan joining us on stage as well.</p><p>[01:34:06] <strong>LDJ:</strong> Yeah,</p><p>[01:34:06] <strong>Roei Cohen:</strong> first I just want to give Nisten props for his just incredible rant. I enjoyed that thoroughly. I actually don't agree with you, Alex. I think that eventually we'll see coding, if not go away, be abstracted enough that it would be closer to English or whatever natural language you're used to using.</p><p>[01:34:25] <strong>Roei Cohen:</strong> The reason I'm saying that is, that's been the trend so far. We've gone from Jacquard looms to Assembler to things that are more and more abstract, and the things that happen with a single line of code in Python or TypeScript or whatever generic language you choose</p><p>[01:34:43] <strong>Roei Cohen:</strong> have so much implementation actually going on behind the scenes that you're not even aware of, and people for some reason are okay with that. You know what I mean? 
They're not losing their ever-loving minds. I think that as time goes by, right, these very mechanical operations are going to be needed less and less.</p><p>[01:34:59] <strong>Roei Cohen:</strong> But the need to solve problems, to tackle problems, to have motivation and goals, those are still going to be mostly human, though those too may change, right? I think we have to prepare ourselves for two scenarios. One where the need for actual technical capabilities that are specialized, like coding, might be less and less in demand to actually be effective, to be able to ship products and ship features and whatnot. But also that the agentic behavior of the tools that we use is going to become more and more active and less passive, right?</p><p>[01:35:36] <strong>Roei Cohen:</strong> It's not just that you're going to ask your agent to do something for you and then review it, but rather it will preemptively start working for you and solving problems for you and making PRs and doing things of that sort, which kind of changes the way that the division of labor currently</p><p>[01:35:53] <strong>Roei Cohen:</strong> is, in terms of how much work humans drive and how much work machines drive.</p><p>[01:36:00] <strong>Alex Volkov:</strong> Yeah, thanks for that, Roei, and I definitely count on you to not agree with me, and I really appreciate the pushback as well. I want to acknowledge Ryan Carson. Ryan, I did not know that you're a follower of ThursdAI, but I have been following you and Treehouse for a while, so please start with maybe a brief introduction to who you are and what you currently do.</p><p>[01:36:16] <strong>Alex Volkov:</strong> And I would love to hear your take about agents and coders being replaced, given that Treehouse is something that you've previously built, and taught tons of people to code.</p><p>[01:36:26] <strong>Ryan Carson:</strong> Alex, good to be here. Thank you for the invite. 
It's funny because I was listening to a previous episode of ThursdAI while I was working out this morning. I wish I had known about this space earlier, because it's just packed with valuable information. There's so much for us to absorb in AI. I'm so thankful for this space, and I literally add it to my calendar now, so I'm hoping to show up more often. So thank you for that. Yeah, I spent a decade of my life, I'm 47, so I spent almost one out of every five minutes of my life trying to empower people to learn how to code. At Treehouse, I had the honor of being a part of that, and I think we taught something like a million people how to code. And I have a computer science degree, and I think a lot about this. And I want to acknowledge and be empathetic towards professional software developers, because it is scary and hard to see things appearing that look like they may replace you. That's scary for everybody.</p><p>[01:37:26] <strong>Ryan Carson:</strong> And I think we all agree we're just seeing a reaction to that. I think we all know that's an emotional reaction. It's not necessarily logical. But I do want to acknowledge, it's just scary for people if they think they're going to lose their job. So that's thing one. Thing two, it's interesting: I got a computer science degree, then I was a web developer for a long time, and then I started companies, and then I hired engineers, and engineering managers, and CTOs, and I didn't code for a long time.</p><p>[01:37:50] <strong>Ryan Carson:</strong> And after Treehouse was acquired, I actually went back in and taught myself how to code again. I used ChatGPT Plus to teach me TypeScript and Next.js, and I shipped a very simple proof of concept. Hey, I just want to build on top of OpenAI's APIs. I just want to understand, soup to nuts, how this works.</p><p>[01:38:09] <strong>Ryan Carson:</strong> And you could say it's the dumbest thing ever. 
Like, why would you learn how to code again? But I think we all agree, if you know how to code, it gives you this deep understanding of how things actually work, right? And I like to pull on an example here: think about building a house, right?</p><p>[01:38:28] <strong>Ryan Carson:</strong> You could abstract all of that away, but if you actually understand how to saw a piece of wood at a 45 degree angle, and then put it together with another piece of wood, and then you build something, it gives you a deep understanding and appreciation for the actual structure. And I think that's what's happening here.</p><p>[01:38:47] <strong>Ryan Carson:</strong> And I would actually say, please learn how to code, more now than I've ever said in my whole life. Because, number one, it's easier. All you have to do is open any good LLM and say, I don't know how to do this. Please teach me, I'm a super newbie, I don't get any of this stuff. And for once we can be</p><p>[01:39:08] <strong>Ryan Carson:</strong> honest about how much we don't know and not be embarrassed about it. I always say to people, just please learn Python and then start building something. Because in the end, it will absolutely make you more powerful, even if Devin creates all the underlying infrastructure. If you understand what's basically going on, it will make you an even more powerful wielder of that technology.</p><p>[01:39:29] <strong>Ryan Carson:</strong> So that's my little speech about why we should all keep coding, Alex.</p><p>[01:39:33] <strong>Alex Volkov:</strong> 100 percent, Ryan. I just want to shout out that I was trying to remember where I know you from, and I visited Future of Web Apps in London, like, twice,</p><p>[01:39:40] <strong>Ryan Carson:</strong> No way!</p><p>[01:39:41] <strong>Alex Volkov:</strong> And I really remember it from there. 
Like, it was 2012, I think, 2013.</p><p>[01:39:45] <strong>Ryan Carson:</strong> Oh my god!</p><p>[01:39:46] <strong>Alex Volkov:</strong> With, like, a bunch of people. So one of my first professional career trips was to London, to one of your amazing conferences.</p><p>[01:39:54] <strong>Ryan Carson:</strong> That makes me so happy. I just literally got goosebumps.</p><p>[01:39:56] <strong>Alex Volkov:</strong> And so in return, it makes me very happy that you're now considered a friend of [01:40:00] the pod. Feel free to come back and chime in as well. On the topic of what you mentioned, I want to ask Junaid, who's a friend of the pod, my friend, and we run the Denver AI meetups together. Because, Junaid, you basically did this thing that Ryan just discussed.</p><p>[01:40:13] <strong>Alex Volkov:</strong> What's your take on Devin, and how much easier is it to automate now with some of this stuff?</p><p>[01:40:21] <strong>Junaid:</strong> Yeah, I'm excited about it. I am one of those people who started learning to code at the beginning of last year. Literally the very beginning of last year, I started with the OpenAI API quickstart tutorial guide, and immediately moved on to trying to build things that I could run on my iPhone, and in less than two months from when I first started learning, I launched my first app.</p><p>[01:40:45] <strong>Junaid:</strong> Yeah, I'm 15 months in, and I see Devin, and it does not in any way, shape, or form make me think, oh, I shouldn't have done that these last 15 months. No way. It's just another tool that is going to be that much more useful for me, to be able to take ideas and actually make them happen. And honestly, having built pretty much all my stuff using ChatGPT as my junior developer, yeah, this is awesome.</p><p>[01:41:18] <strong>Junaid:</strong> You know how much less copying and pasting I'm going to have to do? So yeah, I think it's fantastic. 
And I think that for anybody who's on the fence or worried about whether they should learn to code, the answer is more yes now than it was before Devin came out. That's my</p><p>[01:41:37] <strong>NA:</strong> take.</p><p>[01:41:37] <strong>Alex Volkov:</strong> 100%. Ryan, go ahead, and then we've got Slava again, I think.</p><p>[01:41:40] <strong>Ryan Carson:</strong> Junaid, wasn't it the most magical moment when you shipped that code and it worked?</p><p>[01:41:48] <strong>Junaid:</strong> Yeah, absolutely. When I first went through the OpenAI quickstart, it's just, like, a little naming thing, how do you name your pet, using the API?</p><p>[01:41:58] <strong>Junaid:</strong> That was awesome. But</p><p>[01:41:59] <strong>Junaid:</strong> man, the real kick was when my first app actually got approved and went live.</p><p>[01:42:05] <strong>Junaid:</strong> And I remember standing in my kitchen and doing the dance of joy. Whoa,</p><p>[01:42:10] <strong>Junaid:</strong> my God, I'm on the App Store!</p><p>[01:42:12] <strong>Junaid:</strong> Wild. It's such a rush. Congrats. And for sure, the things that I've built so far are not like, hey, I'm changing industries out here, whatever, but I can build things.</p><p>[01:42:26] <strong>Junaid:</strong> I can use these tools more and more to build more and more things, and build better and better things. And yeah, it's only going to go up.</p><p>[01:42:36] <strong>Alex Volkov:</strong> Right on. I want to acknowledge Anton as well. Anton built the very popular GPT Engineer, and is also very deep into codegen. Welcome, Anton. What's your take on Devin and the renewed interest in agents running code?</p><p>[01:42:52] <strong>Anton Osika:</strong> Yeah, thanks, Alex. Nice to be here. I think it's super exciting. We've been trying the same approach. 
We haven't gotten as far and as fast as Devin, but it's always, when you've seen something actually get done, you lose this doubt that you have from people telling you, ah, this is not possible yet.</p><p>[01:43:12] <strong>Anton Osika:</strong> And now more people believe in it, and it's still just a demo, right? It's not a product. But then your focus is just 10x'd. Super exciting times. And I think on this topic of should you learn to code or not: coding is one of the most fun things, but it does depend on what you want to do, what you want to achieve here in life.</p><p>[01:43:36] <strong>Anton Osika:</strong> I think Flo Crivello, a friend who invested in our company, said that all the news headlines should just be: AGI is here soon. Because that's all that matters. And I think this is a good take on what you should do with your life. AGI is here soon, so you should just do whatever makes you enjoy life.</p><p>[01:43:55] <strong>Anton Osika:</strong> That was a lot of things, but those are my takes. I could dive into the technical details. We did a deep analysis on how they're doing it at Devin, compared to the things we tried and how we're doing it right now.</p><p>[01:44:06] <strong>Alex Volkov:</strong> So give me an analysis from the perspective of the tools that they have for Devin. I think for me, definitely one of the main points was how much access it has to different tools, like I have as a software engineer. I use the browser alongside my code editor, alongside my shell. Let's talk about this a little bit.</p><p>[01:44:24] <strong>Anton Osika:</strong> Yeah, so I want to do a shout out to Erik on our team. He built a tool called gptme. 
I think he's been working on it for two years, and now we're building GPT Engineer together. I think basically all the ways that Devin writes code, like all the tools, are available in GPT Me, but GPT Me is just a CLI</p><p>[01:44:47] <strong>Anton:</strong> tool. So the browser, running code, writing code to file, changing code. Please add if I missed some important tool that Devin has access to here, I'm running on too little sleep right now. But the biggest difference</p><p>[01:45:03] <strong>Alex Volkov:</strong> And I think it has access to an actual terminal machine, so I saw that folks are able to run like LLaMA and actually run inference on it. So that's pretty impressive on its own in terms of infrastructure access.</p><p>[01:45:14] <strong>Anton:</strong> Yeah, correct. But if you're curious to run this, you could try GPT Me, run it locally. The biggest difference is that Devin has made significant progress in making the agent stick to its plan and not get lost in confusing itself, which has been the big failure point for all the agents ever since this started early last year.</p><p>[01:45:43] <strong>Anton:</strong> And Devin is better at sticking to the plan. I'm sure it still also gets confused. And it has what we refer to as sub agents. I guess that's quite self explanatory what it means. You have this main agent that says, oh, I should try to achieve this, and then a sub agent goes into its own path with its own system prompt and so on.</p><p>[01:46:04] <strong>Anton:</strong> And there, I think as always, the devil is in the details in how they've been quite successful there. But yeah, that's</p><p>[01:46:11] <strong>Anton:</strong> a quick summary.</p><p>[01:46:12] <strong>Alex Volkov:</strong> Awesome. Thanks for joining, Anton.
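</p><p>The sub agent pattern described here can be sketched in a few lines (a toy illustration only; the prompts and the call_llm stand-in are mine, not Devin's or GPT Engineer's actual code):</p><p>

```python
# Toy sketch of a main agent delegating steps to sub agents, each with
# its own narrow system prompt. call_llm is a stand-in for any chat API.
def call_llm(system_prompt, user_prompt):
    # placeholder: a real agent would call an LLM here
    return f"[{system_prompt!r} answering {user_prompt!r}]"

def sub_agent(task):
    # a sub agent runs with its own single-task system prompt,
    # so it can't drift away from its one step
    return call_llm(f"You do exactly one thing: {task}", task)

def main_agent(goal, plan):
    # the main agent keeps the plan and delegates each step, which is
    # one way to keep a long run from losing the thread
    results = [sub_agent(step) for step in plan]
    return call_llm("Summarize the step results for the user.",
                    f"goal={goal}, results={results}")

print(main_agent("ship a TODO app", ["scaffold project", "write tests"]))
```

</p><p>The narrow per-step system prompt is the point: the sub agent never sees more than its one task, which is one plausible way Devin keeps agents sticking to the plan.</p><p>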
And folks, definitely check out Anton's feed and GPT Engineer, doing great things there. I want to acknowledge Rohan. Rohan, you were just mentioned by some folks with your LlamaGym. You want to chime in on Devin and how this field moves forward, and tell us about LlamaGym as well?</p><p>[01:46:28] <strong>Rohan Pandey:</strong> Yeah, sure. Thanks, Alex. The idea with LlamaGym is that agents originated in reinforcement learning, where they'd learn through interaction, right? They'd receive reward, they'd go around, play around with their environment, they'd explore and they'd learn. But now in the LLM age, when we have these LLM based agents, they don't really learn online in this reinforcement learning fashion. So the idea with LlamaGym is to be able to train these agents online with reinforcement learning, and it's a super simple kind of agent class that you just implement a few prompt functions on and then throw into a traditional OpenAI Gym environment.</p><p>[01:47:02] <strong>Rohan Pandey:</strong> And it learns, it propagates rewards from reinforcement learning. In terms of code generation stuff, this is actually what I spend most of my time on at Reworkd. We do multimodal code generation for generating these web data extractors. Our code generation pipeline is not something where we're automating some huge stack of software engineering stuff where you have to go interact with the terminal, like Devin, and everything like that.</p><p>[01:47:28] <strong>Rohan Pandey:</strong> Instead, it's a very specific task of generating a specific function, this structured data extraction function, for a specific website.
So given some schema and a website, we go pull screenshots of the website, we go pull context from the HTML, and then this goes into this sort of agentic loop where we then generate code to extract that specific data, and that goes straight into production, effectively, right?</p><p>[01:47:55] <strong>Rohan Pandey:</strong> It goes through some human review steps, but it goes straight into production. It isn't like your copilot. It isn't something that you oversee. It is in production, from just those user inputs to code that's executed. I think Devin shows there's a lot of stuff that you can do just with GPT-4 right now, right?</p><p>[01:48:15] <strong>Rohan Pandey:</strong> People didn't believe that GPT-4 agents for code generation were possible, [01:48:20] but I saw a tweet that was like, maybe all you need to get AGI is GPT-4 and some IMO gold level prompt engineering, which maybe is true. A lot of what we're doing, we've done some code fine tunes and whatnot, but a lot of improvement has also come from just putting GPT-4 in better agentic and prompt engineering type of setups.</p><p>[01:48:42] <strong>Alex Volkov:</strong> Thanks for coming up, Rohan. And I just want to acknowledge for folks who are on stage, this doesn't often happen on ThursdAI. We usually cover the news, and we're at two hours already, but I really think that this conversation is important, and I really want to cover this and also open up to questions.</p><p>[01:48:55] <strong>Alex Volkov:</strong> This stage on ThursdAI is here to cover the next iteration of things that are happening, and there are many people for whom even Copilot is something they don't even use yet. Definitely not Cursor. Cursor is like absolutely the next level of co piloting to me in my code work.
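</p><p>Stepping back to the schema-plus-website pipeline Rohan described, its overall shape can be sketched like this (purely illustrative; the function names and the trivial generated code are my own placeholders, not Reworkd's actual pipeline):</p><p>

```python
# Illustrative shape of a screenshot + HTML -> extractor-code loop.
def gather_context(url):
    # placeholder: a real pipeline would screenshot the page and fetch HTML
    return {"screenshot": f"<png of {url}>", "html": f"<html of {url}>"}

def generate_extractor(schema, context):
    # placeholder for the LLM call that writes the extractor function;
    # here we emit trivial code so the loop is runnable end to end
    fields = ", ".join(f"'{k}': None" for k in schema)
    return f"def extract(html):\n    return {{{fields}}}"

def agentic_loop(schema, url, max_tries=3):
    context = gather_context(url)
    for _ in range(max_tries):
        code = generate_extractor(schema, context)
        namespace = {}
        exec(code, namespace)              # compile the generated function
        result = namespace["extract"](context["html"])
        if set(result) == set(schema):     # crude check before "production"
            return result
    raise RuntimeError("could not satisfy schema")

print(agentic_loop({"title": str, "price": float}, "https://example.com"))
```

</p><p>The interesting design choice is that the generated function, not the LLM, is what runs in production, with the loop (plus human review) acting as the quality gate.</p><p>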
And I use the Command-K every time, and I'm very happy.</p><p>[01:49:13] <strong>Alex Volkov:</strong> Besides the one fact that it deleted half my system previously, those of you who know about this, we'll talk about this separately. But I think it's a very important discussion, specifically because, Ryan, you mentioned something before where we want to acknowledge, and I want to be compassionate to, folks who are feeling this fear about their career, about the general rate of progress that's happening, not only in coding.</p><p>[01:49:32] <strong>Alex Volkov:</strong> Coding is one simple piece of this. Writers, for example, professional writers, are getting booked less, for sure, because people can now write long things better, and then review long things as well. And for many other people, seeing something like the Figure robot now with OpenAI, that scares them, because they learned to, I don't know, watch Terminator.</p><p>[01:49:51] <strong>Alex Volkov:</strong> I think one important piece of ThursdAI that I haven't mentioned yet, and it's very important for me in my AI evangelist position, is to acknowledge that change, very fast change, scares people. And it scares people more when they don't follow everything, and suddenly they see this rate of progress, like, holy s**t, this means whatever they saw in Black Mirror, which I think ruined a generation's potential for thinking positively about the future.</p><p>[01:50:19] <strong>Alex Volkov:</strong> And I think it's very important for us to have this conversation, for folks who are building code generation tools, for folks who are writing code, for folks who are learning, like Junaid, to actually show that no, it's actually not only fine, it's even better.
From the perspective of code specifically, I think there's just going to be a need</p><p>[01:50:34] <strong>Alex Volkov:</strong> for more code around the world, more automation around the world. And if you learn what these tools can do, then I think you're going to be even more productive. Ryan, you wanted to chime in as well? Feel free to step in. I'm gonna try to pull up one more friend.</p><p>[01:50:48] <strong>Ryan Carson:</strong> You bet. Yeah. Thanks for having me up. It's so fascinating to hear about all these agentic systems and what's happening, and I know we all know this is where we're going. I tweeted out and said, as soon as you have an agent that's basically able to use the internet and a computer like a basic human, there's so many things that you can start to tackle.</p><p>[01:51:07] <strong>Ryan Carson:</strong> Researching cures for diseases, planning your trip to your mom's house for the summer. There's just a lot of this which ideally allows humans to level up and then leverage those tools, right? I'm always a technical optimist though, so that's probably my downfall.</p><p>[01:51:22] <strong>Ryan Carson:</strong> Alex, I did want to say thank you for bringing me up. I've got to go. I joined Intel on Monday, and I'm helping them build a global AI developer community, so I've gotta go to a meeting. But I wanted to pimp your stuff for a second and say the courses on Weights & Biases are really good.</p><p>[01:51:40] <strong>Ryan Carson:</strong> And as someone who's spent ten years of my life building courses, everybody should check out the free Weights & Biases courses. They're awesome.</p><p>[01:51:47] <strong>Alex Volkov:</strong> Thank you so much. Thanks, Ryan, for coming up. And let's talk about collaborating now that you've joined Intel, definitely let's do some stuff together. The shout out is well deserved. The team behind the courses is the growth ML team that I joined, and they're amazing.
And a shout out to them.</p><p>[01:52:00] <strong>Alex Volkov:</strong> Everything there is for free, and you can learn everything from fine tuning a foundational model to extracting better JSON outputs. So go to wandb.me/courses, definitely check this out, and thank you to everybody who joined so far.</p><p>[01:52:17] <strong>Alex Volkov:</strong> I try to keep this conversation going with folks. I also do want to deliver this conversation to the folks who follow ThursdAI on the podcast as well. So if you're not subscribed, definitely subscribe, and if you can, help vote with whatever amount of stars, five is preferable, on the podcast.</p><p>[01:52:35] <strong>Alex Volkov:</strong> I do want to acknowledge that Yam is a frequent co-host, friend of the pod, and Yam, this birthday wouldn't be the same birthday without you. You want to chime in on the coding thing real quick, or on the Devin thing real quick? Before I do a recap, I would appreciate your thoughts here as well.</p><p>[01:52:49] <strong>Yam Peleg:</strong> First, it's amazing. It's amazing. The demos look amazing. I just wanna ask, or say, that I think the real test is how reliable it is with real world use from many people. So if anyone knows anyone who tries it and can share their experience outside of demos.
Real life tasks, that can be anything.</p><p>[01:53:10] <strong>Alex Volkov:</strong> Yeah, I tried to get folks who actually have access to Devin. I reached out to a few on Thursday morning, and at least for some of them it's really hard, but we'll definitely get folks, if not the Devin folks themselves, then folks who have access to Devin, and we're gonna try to get access ourselves as well. Devin definitely reignited excitement about code agents this year.</p><p>[01:53:33] <strong>Alex Volkov:</strong> I had this poll that I posted on my feed. AutoGPT came out less than a year ago, and it sparked pretty much the same level of excitement, though not remotely the same level of execution ability, right? It didn't have any tools, etc. Fairly quickly folks got excited about the demos, then fairly quickly folks realized, as Anton and some other folks said, that it does get lost in the context after executing on a few things. Since then we've gotten incredible context length, incredible needle-in-a-haystack ability, and these models' working memory grew.</p><p>[01:54:09] <strong>Alex Volkov:</strong> So I asked basically on my feed: from AutoGPT less than a year ago to Devin right now, which I think announced a huge raise from many VCs, do you feel that agents are on the same exponential curve as other LLM stuff that we see in open source, for example? And yeah, the answers were pretty much where I am at: the distance between something like AutoGPT and the visual examples of Devin doesn't feel to me like a year of progress compared to the year of progress we saw in everything else in LLMs, right?</p><p>[01:54:42] <strong>Alex Volkov:</strong> But maybe I'm wrong, and maybe I need to play with Devin to actually feel the AGI a little bit. So we'll see after we get access.
We're gonna give you guys an update as well. And I think it's time for us to conclude the space. It's been a little bit over two hours as well. I will just say, before I conclude the space, for the folks who are listening on the podcast, I try to recap everything we've talked about here as well.</p><p>[01:55:05] <strong>Alex Volkov:</strong> So that's coming up. If you've missed any part of the show, please stay with us to hear the recap. And I am very happy that we have celebrated the ThursdAI birthday with all of you today in the space. It's been a great honor of mine to keep doing this and have many new folks come to the stage, but also see some folks who we've been hosting and friends of the pod, and I really appreciate my time here with all of you.</p><p>[01:55:27] <strong>Alex Volkov:</strong> I'm very happy that this keeps happening, and I'm not going away anytime soon.</p><p>[01:55:31] END OF SHOW</p><p>[01:55:31] <strong>Alex Volkov:</strong> I think it's time to just say that, again, I really appreciate everybody here. Yam, thank you, dude. Thank you for joining from week to week, thank you for breaking down papers for us and teaching us teachable moments from your escapades into AI, and being the resident machine learning engineer.</p><p>[01:55:45] <strong>Alex Volkov:</strong> Nisten, thank you, brother, for holding the space when I can't keep talking, and joining and explaining, reading papers together, asking questions, and co-hosting with me. Far El, thank you for being the staunch supporter of open source everything, as much as possible.</p><p>[01:56:03] <strong>Alex Volkov:</strong> Not believing big companies and their promises, and keeping us honest in what we believe and don't believe. LDJ, thank you, brother, for joining and explaining difficult concepts where I have no idea how to even explain them.
Junyang, I really appreciate the fact that we have foundational model trainers here on stage as part of ThursdAI, so thank you, Junyang, and keep giving us amazing Qwen stuff</p><p>[01:56:23] <strong>Alex Volkov:</strong> as well. I really appreciate your expertise, and pushing back on everything that I say, not with skepticism, but definitely with doses of realism. I really appreciate this. Everybody else who was on stage, everybody in the audience: I am floored that this keeps happening week after week, and I definitely am going to be here next week [01:56:40] to talk about whatever happens next.</p><p>[01:56:42] <strong>Alex Volkov:</strong> I see a lot of faces in the audience that join from week to week. Harrison, definitely give Harrison a follow, his YouTube thing is great. Junaid, who just joined and talked about how he was a noob and learned from GPT-4, and now he has multiple apps. And Junaid and I are co-hosting the Denver meetup</p><p>[01:56:59] <strong>Alex Volkov:</strong> as well. So if you're in the Denver area, please join us. We're gonna meet soon and talk about AI. I see Bo Wang from Jina AI, who often joins us to talk about embeddings as well. I see Tanishq in the audience from MedARC, a very young PhD who I appreciate, also friend of the pod. I see Abby, I see a bunch of friends here who</p><p>[01:57:16] <strong>Alex Volkov:</strong> know about the space way more than I ever could. And the fact that they all join and talk about this is what makes this interesting. So I really appreciate all of you one by one, and everybody in the audience should give all these folks a follow, and we'll see you here next week. Thank you, everyone.</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-birthdai-march-14-anthropic</link><guid isPermaLink="false">substack:post:142629459</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 15 Mar 2024 01:34:44 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/142629459/844ee863713a2212c45d86ca98ac1557.mp3" length="85008178" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>7084</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/142629459/6896118759af6b752052501e37483946.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Mar 7 - Anthropic gives us Claude 3, Elon vs OpenAI, Inflection 2.5 with Pi, img-2-3D from Stability & More AI news]]></title><description><![CDATA[<p>Hello hello everyone, happy spring! Can you believe it? It's already spring! </p><p>We have tons of AI news for you to cover, starting with the most impactful one, did you already use Claude 3? Anthropic decided to celebrate Claude 1's birthday early (which btw is also ThursdAI's birthday and GPT4 release date, March 14th, 2023) and gave us 3 new Claudes! Opus, Sonnet and Haiku.
</p><p>TL;DR of all topics covered: </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🔥 Anthropic releases Claude <strong>Opus</strong>, <strong>Sonnet</strong>, <strong>Haiku</strong> (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1764653830468428150">Announcement</a>, <a target="_blank" href="https://claude.ai">try it</a>)</p><p>* Inflection updates Pi 2.5 - claims GPT4/Gemini equivalent with 40% less compute (<a target="_blank" href="https://inflection.ai/inflection-2-5">announcement</a>)</p><p>* Elon sues OpenAI (<a target="_blank" href="https://techcrunch.com/2024/03/01/elon-musk-openai-sam-altman-court/">link</a>)</p><p>* OpenAI responds (<a target="_blank" href="https://openai.com/blog/openai-elon-musk?utm_source=tldrai">link</a>)</p><p>* ex-Google employee was charged with trading AI secrets with China (<a target="_blank" href="https://apnews.com/article/china-google-justice-department-63156ade1e564d15d92adbef91e9c5da">article</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* 01AI open sources - Yi 9B (<a target="_blank" href="https://twitter.com/i/bookmarks/1679935518535208960?post_id=1765422092663849368">Announcement</a>)</p><p>* AnswerAI - Jeremy Howard, Johno & Tim Dettmers - train 70B at home with FSDP/QLoRA (<a target="_blank" href="https://twitter.com/jeremyphoward/status/1765868543235805232">X</a>, <a target="_blank" href="https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html">Blog</a>)</p><p>* GaLore - Training 7B on a single consumer-grade GPU (24GB) (<a target="_blank" href="https://twitter.com/_akhaliq/status/1765598376312152538">X</a>)</p><p>* Nous open sources Genstruct 7B - instruction-generation model (<a target="_blank" href="https://huggingface.co/NousResearch/Genstruct-7B">Hugging Face</a>)</p><p>* Yam's GEMMA-7B Hebrew (<a target="_blank" href="https://twitter.com/Yampeleg/status/1765707714473197729">X</a>)</p><p>* <strong>This week's Buzz</strong></p><p>* Weights & Biases is coming to SF in April!
Our annual conference called Fully Connected is open for registration (<a target="_blank" href="https://wandb.ai/site/resources/events/fully-connected">Get your tickets and see us in SF</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Vik releases Moondream 2 (<a target="_blank" href="https://twitter.com/vikhyatk/status/1765532280926421094">Link</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Suno v3 alpha is blowing minds (<a target="_blank" href="https://x.com/rileybrown_ai/status/1765143295942946877?s=20">Link</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* SD3 research paper is here (<a target="_blank" href="https://twitter.com/iScienceLuvr/status/1764896097418260947">Link</a>)</p><p>* Tripo + Stability release TripoSR - FAST image-2-3D (<a target="_blank" href="https://x.com/tripoai/status/1764791817269137471?s=20">link</a>, <a target="_blank" href="https://huggingface.co/spaces/stabilityai/TripoSR">Demo</a>, <a target="_blank" href="https://fal.ai/models/triposr">FAST demo</a>)</p><p>* Story of how I created a competition of inference providers to get us sub 1.5s playground image gen (<a target="_blank" href="https://twitter.com/altryne/status/1765582178396590212">X</a>)</p><p>Big CO LLMs + APIs</p><p>Anthropic releases Claude 3 Opus, Sonnet and Haiku </p><p>This was by far the biggest news of this week, specifically because the top keeps getting saturated with top of the line models! Claude Opus is actually preferable to many folks in blind studies over some GPT-4 features, and as we were recording the pod, LMSys released their rankings, and Claude Opus beats Gemini and is now 3rd in user preference on the LMSys rank. </p><p>Their release is vast, they have announced 3 new models but only gave us access to 2 of them, teasing that Haiku is much faster / cheaper than other options in that weight class out there.
</p><p>In addition to being head to head with GPT-4, Claude 3 is now finally also multimodal on inputs, meaning it can take images, understand graphs and charts. They also promised significantly fewer refusals and improved accuracy by almost 2x. </p><p>One incredible thing that Claude always had was its 200K context window, and here they announced that they will be supporting up to 1M, but for now we still only get 200K.</p><p>We were also promised support for function calling and structured output, but apparently that's "coming soon". Still great to see that they are aiming for it! </p><p>We were all really impressed with Claude Opus, from folks on stage who mentioned that it's easier to talk to and feels less sterile than GPT-4, to coding abilities that are not "lazy" and don't tell you to continue writing the rest of the code yourself in comments, to even folks who are jailbreaking the guardrails and getting Claude to speak about the "I" and metacognition. </p><p>Speaking of meta-cognition sparks, one of the prompt engineers on the team shared a funny story about doing a needle-in-haystack analysis, and that Claude Opus responded with <strong>I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention</strong></p><p>This split the X AI folks in two: many claiming, OMG it's self aware, and many others calling for folks to relax, because like other models, this is still just spitting out token by token. </p><p>I additionally like the openness with which Anthropic folks <a target="_blank" href="https://twitter.com/AmandaAskell/status/1765207842993434880">shared</a> the (very simple but carefully crafted) system prompt </p><p>My personal take, I've always liked Claude, even v2 was great until they nixed the long context for the free tier. This is a very strong viable alternative for GPT-4 if you don't need DALL-E or code interpreter features, or the GPTs store or the voice features on iOS.
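</p><p>For folks building on the API, the Messages endpoint takes a list of content blocks, and images ride along as base64 blocks next to the text. A rough sketch of assembling such a payload (the model id and block shapes follow Anthropic's docs as of this writing, so double-check them; nothing here actually sends a request or needs a key):</p><p>

```python
import json

# Sketch of the request body for Anthropic's Messages API
# (POST https://api.anthropic.com/v1/messages, authenticated with an
# x-api-key header plus an anthropic-version header).
def build_request(prompt, image_b64=None):
    content = [{"type": "text", "text": prompt}]
    if image_b64:  # Claude 3 accepts base64-encoded images alongside text
        content.insert(0, {
            "type": "image",
            "source": {"type": "base64",
                       "media_type": "image/png",
                       "data": image_b64},
        })
    return {
        "model": "claude-3-opus-20240229",  # assumed Opus model id
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": content}],
    }

payload = build_request("What does this chart show?", image_b64="iVBORw0...")
print(json.dumps(payload, indent=2))
```

</p><p>The same payload works for text-only prompts if you skip the image block.</p><p>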
</p><p>If you're using the API to build, you can self register at <a target="_blank" href="https://console.anthropic.com">https://console.anthropic.com  </a>and you'll get an API key immediately, but going to production will still take time and require talking to their sales folks. </p><p>Open Source LLMs </p><p>01 AI open sources Yi 9B </p><p>The announcement claims that "It stands out as the top-performing similar-sized language model friendly to developers, excelling in code and math.", but it's a much bigger model, trained on 3T tokens. I find it confusing to create a category of models between 7B and almost 12B. </p><p>This week's Buzz (What I learned with WandB this week)</p><p>We're coming to SF! Come join Weights & Biases at our annual conference in the heart of San Francisco, get to hear from industry leaders about how to build models in production, and meet most of the team! (I'll be there as well!) </p><p>AI Art & Diffusion</p><p>Last week, just last week, we covered the open sourcing of the awesome Playground 2.5 model, which looked really good in user testing. I really wanted to incorporate this into my little demo, but couldn't run it locally, so I asked a few friends, and I gotta say, I love how competitive but open the inference providers can get! Between Modal, Fal and Fireworks, I somehow started a performance competition that got these folks to serve the Playground 2.5 model in under 1.5 seconds per generation. </p><p>I recorded the story to highlight the awesome folks who worked on this, they deserve the shoutout! </p><p>You can try super fast Playground generation on <a target="_blank" href="https://fal.ai/models/playground-v25/playground">FAL</a> and <a target="_blank" href="https://fireworks.ai/models/fireworks/playground-v2-5-1024px-aesthetic">Fireworks</a></p><p>Stability releases Stable Diffusion 3 research paper + Model coming soon</p><p>Stability released the research paper for SD3, the latest iteration of their flagship image model.
While this field is getting a little saturated, we now have DALL-E, MidJourney, Adobe Firefly, Playground, SDXL, Stable Cascade and Ideogram, SD is definitely aiming for the title. </p><p>They released a few metrics claiming that on user preference, visual aesthetics, typography and prompt following, SD3 beats all of the above. </p><p>They also mentioned the architecture, which is an MM-DiT, a multimodal diffusion transformer architecture (DiTs were used for Sora from OpenAI as well), and that they used 50% synthetic captions with CogVLM, which is quite impressive. </p><p>Emad has mentioned that access to SD3 will start rolling out soon! </p><p>TripoSR (<a target="_blank" href="https://huggingface.co/spaces/stabilityai/TripoSR">Demo</a>)</p><p>We previously covered LUMA models to generate text-to-3D, and now we have image-to-3D that's open sourced by the folks at Tripo and Stability AI.</p><p>TripoSR is able to generate 3D shapes from images super fast, and here's a very nice flow that <a target="_blank" href="https://twitter.com/i/status/1765466355644395938">@blizaine</a> demonstrated of how to use these models to actually bring 3D objects into their environment using a few steps. </p><p>And that's it for today folks, we of course chatted about a LOT more stuff, I really welcome you to listen to the episode and skip around in the chapters, and see you next week, as we celebrate ThursdAI's birthday (and GPT4 and Claude1) 🎉 </p><p>P.S - as I always do, after writing and editing all by hand (promise) I decided to use Opus to be my editor and tell me how my writing was, what I forgot to mention (it has the context from the whole transcription!) and suggest fixes. For some reason I asked Opus for a message to you, the reader. </p><p>Here it is, take it as you will 👏 </p><p>Full Transcript for the deep divers: </p><p></p><p>[00:00:00] <strong>Alex Volkov:</strong> Right, folks. So I think recording has started. And then let's do our usual. Welcome. Welcome, everyone.
Those who know the sound from week to week. This is Alex Volkov. You're listening to ThursdAI, March 7th. I'm an AI evangelist with Weights & Biases, who you can see here on stage as well. So, you know, you see the little square thing, give it a follow. Follow us on socials as well. And, uh, today is obviously Thursday.</p><p>[00:00:45] <strong>Alex Volkov:</strong> Uh, this ThursdAI has a lot of stuff to talk about. Um, so, let's talk about it. Uh, I think, um, our week is strange, right? Our week starts on the Friday. Almost, not even Friday. The updates that I need to deliver to you start at the end of the previous ThursdAI. So as something happens, uh, I have a knowledge cutoff, actually, at some point we considered calling this podcast Knowledge Cutoff.</p><p>[00:01:14] <strong>Alex Volkov:</strong> Um, I have a knowledge cutoff after Thursday afternoon, let's say when I start and send the newsletter, but then AI stuff keeps happening. And, uh, then we need to start taking notes and taking stock of everything that happened. And I think on Friday we had the lawsuit from Elon, and there's a whole bunch of stuff to talk about, and then obviously on Monday we had some big news.</p><p>[00:01:37] <strong>Alex Volkov:</strong> So as always, I'm gonna just run through all the updates. There's not a ton of updates this week, but definitely there's a few interesting things. Let me un save as well. And then I'll just say hi to a few of the folks that I got on stage here to chat. Um, we got Vik, and Vik is going to give us an update about, about something interesting. Uh, Vik, feel free to just unmute and introduce yourself briefly. And then we're going to go through the updates.</p><p>[00:02:07] <strong>Vik:</strong> Hey, my name is Vik, uh, I've been training ML models for the last two years or so. Um, recently released a new model called Moondream 2.
It's a very small vision language model that excels at a lot of real world use cases that you could use to build computer vision applications today, so I'm very excited to chat about that.</p><p>[00:02:30] <strong>Alex Volkov:</strong> Awesome. And, uh, we have Akshay as well. Akshay, it's been a while since you joined us. What's up, man? How are you?</p><p>[00:02:36] <strong>Akshay:</strong> Greetings of the day everyone, and it's lovely to join again. Uh, I have been listening, I have been here in the audience, uh, for each and every ThursdAI, and, uh, I've been building some exciting stuff, so I've not been joining much, but, uh, things are going great.</p><p>[00:02:54] <strong>Alex Volkov:</strong> Awesome. And, uh, for the first time, I think, or second time, we're talking with Siv. Hey, Siv.</p><p>[00:03:01] <strong>Siv:</strong> Hey, how's it going, everyone? Uh, just a little background on me. Um, I come from startups and from Amazon Web Services. Um, I've been in the AI space for the last six years. And I'd love to be able to chat today about social algorithms, and, uh, researchers having trouble with, uh, socials, particularly Twitter, or anywhere else where you're trying to distribute your models.</p><p>[00:03:30] <strong>Alex Volkov:</strong> Yeah, so we'll see if we get to this. The setup for ThursdAI is usually just, uh, updates and conversation about updates, but if we get to this, uh, definitely we'll, we'll dive in there.
Um, right, so folks, with this, I'm gonna say, um, that we're gonna get started with just an update, and then I think Nisten will join us in a second as well.</p><p>[00:03:50] <strong>Alex Volkov:</strong> Oh, I see somebody else I wanna, I wanna add.</p><p>[00:03:55] <strong>Alex Volkov:</strong> So, here's everything for March 7th that we're going to cover today. Um, so in the area of open source, we didn't actually have a ton of stuff happen, um, up until, I think, yesterday and today. So, the most interesting thing we're going to talk about is, um, the company 01AI, um, the folks who released Yi 34B, and we've talked about Yi and the new Hermes kind of updates for Yi as well.</p><p>[00:04:23] <strong>Alex Volkov:</strong> They released a new 9 billion parameter model, which is very competitive with Mistral and the like. Um, and then also the newish company called Answer.AI from Jeremy Howard, if you know him, and Johno Whitaker, and they collaborated with Tim Dettmers from QLoRA, and they released something that lets you train a 70 billion parameter model at home.</p><p>[00:04:51] <strong>Alex Volkov:</strong> We're going to chat about this a little bit. Um, even though today I saw another thing that is kind of around this area, so we're going to have to go and find this and discuss how these huge models are now being able to get trained at home as well. Uh, very brief open source stuff, then we're going to talk about big companies, and obviously, um, I'm actually going to put Claude last, because we're going to talk about Claude probably a lot.</p><p>[00:05:16] <strong>Alex Volkov:</strong> But, uh, in the big companies area, we will not be able to escape the drama that Elon Musk sues OpenAI. And then the OpenAI response, we're going to chat about this as well. Excuse me. Oh yeah, this is going to keep happening, just one sec.
Um, maybe we'll briefly mention that Logan has left OpenAI, and for a brief period of time, he and Ilya had the same, um, bio on Twitter, not anymore, but very interesting as Logan starts to post some stuff as well.</p><p>[00:05:46] <strong>Alex Volkov:</strong> Um, I really want to chat about the Google employee who was charged with stealing AI trade secrets, uh, and received like a CTO position in China. That's a very interesting update as well. And, uh, Inflection, uh, there we go, we have Nisten as well, uh, Inflection just released an update today, which is kind of like breaking news, uh, a 2.5</p><p>[00:06:09] <strong>Alex Volkov:</strong> update, and they, they say they come to GPT 4 and Gemini equivalent, uh, performance level, which remains to be seen, and I've tested this a little bit, and I definitely want to chat about this as well. Uh, in the vision and video area, we have only the one thing, but we have the author of said thing here. Uh, so I haven't seen any, anything else besides Moondream, and we have Vik here.</p><p>[00:06:33] <strong>Alex Volkov:</strong> We're going to talk about Moondream 2, and how you can use this and what we can, we can use it for. Um, voice and audio. There's something that probably didn't happen for the past week, I think it happened a little bit before, and I don't have access yet, but Suno, if you guys know Suno, released the alpha, and there's a bunch of videos floating around of their songs with like the V3 alpha of theirs, and it's quite something. If I'm gonna be able to find those tweets and pin them for you, that's gonna be a mutual listening. Maybe I can actually find the tweet to, to actually play this for you.</p><p>[00:07:07] <strong>Alex Volkov:</strong> We'll see if the multimedia department will work. Um, and I think in AI art and diffusion stuff, there's a bunch to talk about. Um, there is, uh, the Stable Diffusion 3 research paper was released, and we've talked about Stable Diffusion 3 a little bit. 
After the announcement, and we haven't covered the research paper, we can chat about the research paper.</p><p>[00:07:29] <strong>Alex Volkov:</strong> But also, potentially today, Emad is going to open some invites, as he mentioned on X as well. So, I'm ready with the breaking news button there. Stability, also in the news, they released a collaboration with Tripo, which created a very fast image to 3D model called TripoSR. And that's been very cool, and there's a few very viral examples of, of said thing, uh, floating around, so definitely worth talking about this as well.</p><p>[00:07:57] <strong>Alex Volkov:</strong> And I think, uh, Nisten has just joined us, hey Nisten, and you just shared that, um, that we can train a 70 billion parameter, oh, 7 billion parameter at home with 24 gig memory, right? GaLore. Nisten?</p><p>[00:08:17] <strong>Nisten:</strong> So, so it's a combination of a lot of [00:08:20] techniques that people have been using. And, uh, I'll try to pin it up in a second. But the, the research is that now you can train one from scratch. Not finetune. Start one from scratch. Start your own. So this is why it's pretty, um, it's relatively groundbreaking.</p><p>[00:08:40] <strong>Nisten:</strong> And they released a repository for that as well. So it's not simply just a paper. They have a code base. It's pretty legit.</p><p>[00:08:50] <strong>Alex Volkov:</strong> So I guess let's, let's get into the open source stuff, um, and then we'll get to the open source, and then we're going to discuss everything else, because I think the main, the bread and butter of this discussion is going to be, is going to be, um, Anthropic. 
Anthropic's, uh, coming back to the limelight, but let's, let's start with, let's start with open source.</p><p>[00:09:09] <strong>Alex Volkov:</strong> Where's my open source button?</p><p>[00:09:27] <strong>Alex Volkov:</strong> Alright, so I guess, uh, Nisten, welcome, uh, and I guess let's start with, with GaLore, uh, as much as we can get from the, from the release, a fairly, fairly new release as well. And I think it's connecting to the other, uh, to the other thing from Answer.ai, but let's start with GaLore. Um, so basically, these folks released something called GaLore, which is, um, kind of playing on the same, on the same LoRA, QLoRA stuff.</p><p>[00:09:52] <strong>Alex Volkov:</strong> Uh, what are some of the techniques they're adding there? I'm, I'm trying to, to take a look as I'm learning. Uh, Nisten, do you have any, any info to share with us about this?</p><p>[00:10:05] <strong>Nisten:</strong> Yeah, yeah, same. More for an actual full paper reading because I have not read it entirely. Mainly looking at it again, it looks like it's, uh, it's another stack of tricks like most good projects are, uh, but it is the, it enables a very, very large capability. And that is that now you can make your own full LLM from, from nothing.</p><p>[00:10:29] <strong>Alex Volkov:</strong> So not a fine tune.</p><p>[00:10:31] <strong>Nisten:</strong> Uh, yeah. Not a fine tune, not initialized weights. You just, you just start from, uh, from nothing. So, it's, I see that it uses, uh, like it offloads a lot of the weight activations and offloads some of them on, uh, on CPU memory. And I know there are options in Axolotl, which is the Docker container that people use to train, that you can also offload on very fast NVMe drives.</p><p>[00:10:55] <strong>Nisten:</strong> So if you have like very fast PCI Express NVMe storage, you can kind of use that as another RAM for, for the training. So this combines all of those.
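The combination Nisten describes can be made concrete with some arithmetic. GaLore ("Gradient Low-Rank Projection", per the paper's title) keeps the Adam optimizer states in a low-rank subspace of each weight matrix's gradient instead of at full size. The layer shape and rank below are illustrative numbers chosen by the editor, not figures from the show:

```python
# Back-of-envelope for GaLore-style memory savings on one weight matrix.
# Full Adam keeps two fp32 moment tensors the same shape as the weights;
# GaLore projects the m x n gradient down to r x n and keeps the moments
# there, plus an m x r projection matrix.

def adam_state_floats(m: int, n: int) -> int:
    """Floats needed by full Adam for an m x n weight matrix (two moments)."""
    return 2 * m * n

def galore_state_floats(m: int, n: int, r: int) -> int:
    """Floats needed with a rank-r gradient projection: two moments in the
    r x n subspace, plus the m x r projection matrix itself."""
    return 2 * r * n + m * r

# One LLaMA-style MLP weight matrix, 4096 x 11008, with rank 128:
full = adam_state_floats(4096, 11008)            # ~90M floats
lowrank = galore_state_floats(4096, 11008, 128)  # ~3.3M floats
print(f"optimizer-state reduction: {full / lowrank:.1f}x")
```

That roughly 27x cut in optimizer-state memory, stacked with the CPU/NVMe offloading tricks mentioned above, is what makes from-scratch pretraining on a single consumer GPU plausible.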
And then some on top, and the end result is, is very impressive because you can train a very capable model. And, uh, yeah, again, pending further, uh, research and stuff.</p><p>[00:11:21] <strong>Nisten:</strong> But I think this is one of those repositories that, uh, a lot of people will use, or it's likely to.</p><p>[00:11:30] <strong>Alex Volkov:</strong> Yeah, and I think this adds to the, so this, this kind of in the same vein of the next thing we're going to chat about, and, um, um, I actually can't find any mention of this on X, believe it or not, so not everything is fully on X. I just got a link, uh, to this from, from, uh, Omar, uh, from Hugging Face. And Answer.AI is a new research lab, um, that Jeremy Howard, uh, if you guys are not familiar with Jeremy Howard, hopefully everybody is, but if you're not, um, I guess look them up.</p><p>[00:12:04] <strong>Alex Volkov:</strong> Um, Jeremy, uh, joined Answer.AI, like, um, I think around NeurIPS he was talking about. They got funded, I think, 10 million dollars. And, um, they released their first project, a fully open source system, uh, that can efficiently train a 70 billion parameter large language model on regular desktop computers with two or more gaming GPUs.</p><p>[00:12:27] <strong>Alex Volkov:</strong> They're talking about RTX 3090 or 4090. Um, which, you know, compared to, um, Nisten, what you just shared, I think that sounds very impressive. Um, they combine FSDP, which is, I'm not familiar with FSDP, with, uh, QLoRA, and, uh, they brought kind of the, the CUDA Avengers to, to the flow. So Jeremy Howard obviously.</p><p>[00:12:52] <strong>Alex Volkov:</strong> Um, I think FastAI, right? And then Kaggle, I think, competition is definitely behind Jeremy. And then they brought Tim Dettmers from QLoRA, and we've covered QLoRA multiple times, um, very efficient methods. 
And then they also brought Hugging Face's Titus von Koeller, and, um, they brought the CUDA Avengers in there to, to basically combine a bunch of techniques to let you train 70 billion parameters.</p><p>[00:13:20] <strong>Alex Volkov:</strong> I see we have Yam joining us. Hey Yam, did you see the Answer.ai stuff that I'm covering or is this new to you?</p><p>[00:13:26] <strong>Yam Peleg:</strong> No, no, all new to me.</p><p>[00:13:28] <strong>Alex Volkov:</strong> Oh wow, okay, so I need, I need, uh, I would love your reaction in real time. Let me DM you this real quick because, um, the number of, actually, let me, let me paste this in the link below so we can actually paste this up.</p><p>[00:13:43] <strong>Alex Volkov:</strong> Um. Yeah, there we go. Okay. So it's now pinned to the top of the space for folks to, to find out. I wasn't able to see any, uh, update on X from any of them, which is very interesting. Um, and the, the very interesting idea is that, you know, all of these systems and all of these models, 70 billion models, they cost an insane amount of money.</p><p>[00:14:07] <strong>Alex Volkov:</strong> And now these folks are considering that under $10,000, you'd be able to train something like 70B at home. Which, I'm not training models, but I know that some folks here are. And, um, I assume that this is a very big unlocking capability. Um, which, which is what Answer.AI is trying to achieve.</p><p>[00:14:32] <strong>Alex Volkov:</strong> Let's see what else is very interesting here. Um, just something about Answer.AI generally. Uh, they claim that they're like an unusual type of organization. I actually tried to ask Jeremy a couple times what did this mean. Um, and, uh. 
They, they claim to be a for profit, like, lab, R&D lab, and, um, more in spirit to 19th century labs than today's AI research groups, and, um, I think Eric Ries and Jeremy Howard launched this in Europe, um, and, I think, I'm actually not sure what's the, the, how much did I say? Um. What are they up against? But the first release of theirs is the open source OS, fully open source. Uh, that includes one of the, like several of the top people in the industry, uh, to create something that wasn't possible before. Um, and I think it remains to be seen. They didn't release any metrics, but they said, hey, we're about to release some metrics, but, um, this keeps improving from week to week.</p><p>[00:15:39] <strong>Alex Volkov:</strong> So they actually didn't release any metrics. Go ahead Nisten.</p><p>[00:15:43] <strong>Nisten:</strong> Sorry, is this from Answer.ai? They said they were going to release one, or? They</p><p>[00:15:49] <strong>Alex Volkov:</strong> think, already. They didn't release metrics, uh, for the training. Uh, but I think the, the whole repo is open source. Yeah.</p><p>[00:15:58] <strong>Nisten:</strong> released an open source OS, or?</p><p>[00:15:59] <strong>Alex Volkov:</strong> Yeah, yeah, open source, FSDP QLoRA. Um, and I think</p><p>[00:16:03] <strong>Nisten:</strong> Oh, okay, so it's not a real operating system, it's another,</p><p>[00:16:07] <strong>Alex Volkov:</strong> It's, well, they call it an operating system, but yeah,</p><p>[00:16:10] <strong>Nisten:</strong> Oh, okay,</p><p>[00:16:11] <strong>Alex Volkov:</strong> it's not like Linux competitive.</p><p>[00:16:12] <strong>Nisten:</strong> okay, I thought it was like an actual one. 
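The headline claim above, a 70 billion parameter model trained on two or more gaming GPUs, pencils out once you assume 4-bit quantized base weights, which is the QLoRA half of the recipe. This is an editor's back-of-envelope sketch, not arithmetic from Answer.AI's post:

```python
# Why 70B on two consumer GPUs is plausible with FSDP + QLoRA
# (back-of-envelope; illustrative, not from the show or the release notes).

GB = 1024**3

def quantized_weight_gb(n_params: float, bits: int) -> float:
    """Memory for the base weights alone at a given quantization width."""
    return n_params * bits / 8 / GB

weights_4bit = quantized_weight_gb(70e9, 4)   # ~32.6 GB
weights_fp16 = quantized_weight_gb(70e9, 16)  # ~130 GB: hopeless on 48 GB

total_vram = 2 * 24  # two RTX 3090/4090-class cards, 24 GB each

print(f"4-bit weights: {weights_4bit:.1f} GB vs {total_vram} GB of VRAM")
# The 4-bit weights fit with headroom; FSDP shards them across the two
# cards, LoRA keeps the trainable parameters and optimizer states small,
# and the remaining activation memory is what the offloading tricks cover.
```

At 16-bit the weights alone exceed the combined VRAM, which is why quantizing before sharding is the load-bearing trick.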
Okay, actually, go ahead, because there are some other huge hardware news that I wanted to quickly cover.</p><p>[00:16:23] <strong>Alex Volkov:</strong> Go ahead,</p><p>[00:16:23] <strong>Yam Peleg:</strong> Yeah,</p><p>[00:16:23] <strong>Vik:</strong> I just wanted to add about this Answer.ai thing, that they have released this system that you guys were talking about, which basically claims to be able to train a 70 billion parameter model on only two 24 [00:16:40] GB GPUs.</p><p>[00:16:40] <strong>Vik:</strong> So basically, you know, two 4090s and you can train a 70 billion parameter model, which is mind boggling if you think about it. But, uh, I tried to find like how to get access to this. So I was still not sure if this is fully supported in every, uh, rig and system. So that is something I wanted to mention.</p><p>[00:17:00] <strong>Alex Volkov:</strong> Yeah.</p><p>[00:17:00] <strong>Nisten:</strong> By the way that that has been, oh, sorry.</p><p>[00:17:02] <strong>Nisten:</strong> That, that has been do, uh, doable for a while, because QLoRA actually trains it all in 4-bit. And, uh, there are only like a few tricks, which you can also apply if you go to Axolotl, uh, the directory. You, you can also do that on your own if you do a 4-bit QLoRA training and you just say, offload all the gradients and all this stuff, you can also do that with a, the 48 gig, uh, stuff.</p><p>[00:17:26] <strong>Nisten:</strong> But, uh, again, I'll look into the actual directory instead.</p><p>[00:17:32] <strong>Alex Volkov:</strong> Right, so, um, Nisten, you mentioned some hardware news you want to bring? Go ahead.</p><p>[00:17:39] <strong>Nisten:</strong> Yep. Okay, so we have two hardware news, but they are actually kind of related. Uh, first of all, uh, Tenstorrent, the company by legendary computer scientist Jim Keller, who worked on the iPhone chip, on AMD, who brought AMD back to life. 
Uh, legendary computer scientist, and has been working on Tenstorrent, which is another, uh, accelerator, which also does, does training.</p><p>[00:18:07] <strong>Nisten:</strong> So, uh, so they released these cards, and I'm not sure what the capabilities are, uh, but I saw that George Hotz, uh, from TinyCorp, uh, posted them, and, uh, they are actually, so I just wanted to give them a big shout out to actually making them commercially viable, and it's just something you can buy, you don't have to, uh, you know, set up a UN meeting for it, right?</p><p>[00:18:31] <strong>Nisten:</strong> And get the votes and stuff. You can just go and buy it. So, that's pretty awesome of them, and I wish more companies did that. The second news is also kind of huge, because one of the engineers that left Tenstorrent last year now started a startup here in Toronto. And this has been an idea that's been around for some time and discussed privately and stuff.</p><p>[00:18:59] <strong>Nisten:</strong> Uh, they're making AI chips. Again, they do not, these ones do not do training, but they're going to make them hard coded, which will be the judge of how much that makes sense given how rapidly models improve. But there is a business case there, because the hard coded chips, they can perform literally a thousand to 10,000 times faster.</p><p>[00:19:26] <strong>Alex Volkov:</strong> So when you say hard coded, is that one of those, like, transformer specific chips you mean?</p><p>[00:19:33] <strong>Nisten:</strong> No, the entire weights are etched into the chip and you cannot change them. So the benefit of this is that you can get up to a thousand to ten thousand times faster inference. 
So we might end up with a case where, according to calculations from what Sam Altman said on how much ChatGPT serves in a single day, which is a hundred billion tokens, and that works out to about 1.</p><p>[00:20:02] <strong>Nisten:</strong> 4 million tokens per second. We might very soon, like in a year or two or sooner, be in a spot where we have this company using 60 nanometer chips. We might have a single chip pull the entire token per second performance of all of global ChatGPT use. I don't know if that includes enterprise use, but that's how fast things are accelerating.</p><p>[00:20:29] <strong>Nisten:</strong> So that's the, that's the benefit of, uh, yeah, that's the benefit of going with a hard coded chip. So yeah, uh, inference costs are, um, are dropping in that</p><p>[00:20:43] <strong>Alex Volkov:</strong> You also mentioned George Hotz and, uh, he also went on a, on a, on a rant this week again. And again, I think, do you guys see this? Um, the CEO of AMD, that doesn't use Twitter that much. But she replied to one of him, uh, one of his demands, I think, live demands, and said, hey, uh, we have a dedicated team working on this.</p><p>[00:21:05] <strong>Alex Volkov:</strong> And then we're gonna actually make some changes in order to get this through. So, I love it how, uh, George Hotz, um, folks probably familiar with George Hotz in the audience, um, should we do a brief, a brief recap of George Hotz? The guy who hacked the first iPhone, the first PlayStation, then, uh, built a startup called Comma.</p><p>[00:21:25] <strong>Alex Volkov:</strong> ai to compete in autonomous driving, and now is building tiny, uh, we mentioned tiny boxes ready to ship, Nisten, last time, and I think that paused because they said, hey, well, we don't have enough of the open sourcing of the internal stack of AMD. Which led the CEO of AMD, Linda, or Lisa?
I'm very bad with names.</p><p>[00:21:46] <strong>Alex Volkov:</strong> I think Lisa, to reply and say, hey, we have dedicated teams working on this. Actually do want to go find this tweet. Go ahead Nisten.</p><p>[00:21:57] <strong>Nisten:</strong> Yeah, so there has been a big misconception in the software industry that, um, a lot of the, the code monkey work is something that, you know, you just hire someone to, like, clean your toilets and, and do it. But, in fact, the reason that NVIDIA has a 2 trillion valuation, and it beat Saudi Aramco, is because their toilets are a lot cleaner in terms of the software.</p><p>[00:22:27] <strong>Nisten:</strong> So, the CUDA software is a lot more workable, and you can do stuff with it, and it doesn't have the bugs. So, in essence, what George Hotz is doing by pushing to open source some key parts, which some people might freak out that China might steal them, but they've already stolen everything. So, it really doesn't, doesn't matter that they're very small hardware parts, but they make a huge difference in developers being able to</p><p>[00:22:56] <strong>Nisten:</strong> use that software, and those parts are buggy. So, in essence, like, George Hotz, with this stupid CodeMonkey fix, might double or triple AMD's stock.</p><p>[00:23:07] <strong>Alex Volkov:</strong> Yeah,</p><p>[00:23:08] <strong>Nisten:</strong> Just because he's getting in there, and he's cleaning that crap code out.</p><p>[00:23:14] <strong>Alex Volkov:</strong> and he's popular enough to pull attention from the CEO of this company to actually come and react and, you know. One of the reasons I love X is that I think, um, uh, she retweeted their official tweet. I think there's more folks commenting on and reacting to her, um, comment, and that's on top of the space now, uh, than the actual kinda tweet itself.</p><p>[00:23:37] <strong>Alex Volkov:</strong> Which is, I think, a good type of ratio, or ratio'd, yeah. 
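Nisten's back-of-envelope a moment earlier, a hundred billion per day working out to about 1.4 million tokens per second, checks out if the public figure is taken as words and converted to tokens. The 1.2 tokens-per-word factor below is the editor's assumption, not something stated on the show:

```python
# Sanity check on the "all of ChatGPT on one chip" arithmetic. Sam
# Altman's widely quoted figure was about 100 billion words generated
# per day; the tokens-per-word factor is an assumption for illustration.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

words_per_day = 100e9
tokens_per_word = 1.2  # assumed; English text is usually ~1.2-1.4

tokens_per_second = words_per_day * tokens_per_word / SECONDS_PER_DAY
print(f"{tokens_per_second:,.0f} tokens/s")  # ~1.39 million, i.e. the ~1.4M quoted
```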
I think, uh, more hardware news, I think we're satisfied with. Oh, yeah, yeah. The, the, the only other hardware news related to this, 'cause, Nisten, I think you mentioned Saudi Aramco. Uh, we chatted with the Groq folks, Groq with a Q, not with a K. The, the, uh, LPU chip.</p><p>[00:23:58] <strong>Alex Volkov:</strong> And they're like super fast, uh, inference speed, and I think this week, they showed that they have a collaboration with, I think, Saudi Aramco, um, about bringing AI. Um, and I saw a few, a few folks post about this and, um, if that's of interest to you, we had a full conversation with the Groq team. They also, they also, um, released, kind of, uh, they had a waitlist and many, many people, I think the waitlist jumped after we chatted with them at the peak of their very viral week, which started with Matt Shumer's post going, going off.</p><p>[00:24:32] <strong>Alex Volkov:</strong> Uh, and then I think they said something about, they had like 50 or a hundred waitlist signups before this. And then the week after they had like 3,600 a day or something like this. So they revamped the whole system. And now, you can actually sign up with a self served portal to Groq, and uh, let me see if I can find this tweet for you.</p><p>[00:24:55] <strong>Alex Volkov:</strong> So you can actually now go and sign up, um, to Groq yourself, [00:25:00] they have a nice console, very reminiscent of, um, of every other, like, console out there. You can create an API key, very simple, so no longer like a manually, manual approval of, um, Groq. I can't find this tweet though, so give me, give me just a second.</p><p>[00:25:22] <strong>Alex Volkov:</strong> So, yeah, they, they're, uh, collaborating with, with Saudi Aramco. Go ahead Nisten, real quick.</p><p>[00:25:28] <strong>Nisten:</strong> Uh, yeah, just really quickly, the part that I missed was that, uh, the fix that George Hotz is doing for AMD, that's to enable distributed training. 
Because they cannot distribute training across GPUs because it crashes. So it's pretty important. Uh, yeah, and those are my comments on that.</p><p>[00:25:48] <strong>Alex Volkov:</strong> Awesome. Okay, so I, I found the tweet. Uh, so if, if you follow this tweet, the, the kind of the, the quoted tweet there is, uh, getting you to the Groq console. You get like two weeks for free and you get the API access to this like incredibly fast inference, inference machine from Groq.</p><p>[00:26:05] <strong>Nisten:</strong> I think Far El and Yam wanted to say something on it.</p><p>[00:26:10] <strong>Alex Volkov:</strong> Yeah, go ahead.</p><p>[00:26:11] <strong>Yam Peleg:</strong> Yeah, I got a lot of technical issues. So if you can go before me, I'll try to fix it.</p><p>[00:26:19] <strong>Alex Volkov:</strong> You're coming through finally, loud and clear. Far El, if you wanted to comment, go ahead, man.</p><p>[00:26:30] <strong>Alex Volkov:</strong> Alright, um, looks like Far El is also, um, not available. Okay, I think we're moving on.</p><p>[00:26:38] <strong>Vik:</strong> Let me touch on this for a sec. Um, so Groq has a white paper out about how they've designed their chips and it's super interesting. I'd strongly recommend everyone go read it. Uh, they've basically from the ground up rethought how, uh, inference oriented compute should work. It's a fascinating read and kind of surprising that they're sharing all of those details.</p><p>[00:27:00] <strong>Vik:</strong> One would think they'd keep it proprietary.</p><p>[00:27:05] <strong>Alex Volkov:</strong> Yeah, we had a full conversation with them. It is fascinating. 
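For listeners who grab an API key from the self-serve console mentioned above, a minimal sketch of calling the hosted Mixtral looks like this. The endpoint URL and the model id are the editor's assumptions based on Groq's OpenAI-compatible API, not details stated on the show, so verify them against the console docs:

```python
# Minimal sketch of a Groq chat-completion call (assumed OpenAI-compatible
# endpoint and model id; double-check both in Groq's documentation).
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed

def build_request(prompt: str, model: str = "mixtral-8x7b-32768") -> dict:
    """OpenAI-style chat payload; Groq mirrors this request schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }

def ask(prompt: str) -> str:
    """POST the payload with the key from the GROQ_API_KEY env var."""
    req = urllib.request.Request(
        GROQ_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("Suggest chapter titles for this transcript: ..."))
```

The chapters-from-a-transcript demo Alex describes later in the episode is exactly this shape: one large prompt, one fast completion.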
Again, you know, for, for the level of discussion that we have here, um, we, you know, honestly, we couldn't dive like super, super deep, but I've played with it, and the demos I was able to do, uh, Vik, I don't know if you have the chance to see, uh, they're only possible with almost instant, uh, speed.</p><p>[00:27:28] <strong>Alex Volkov:</strong> You know, guys, what, like, even though I love the Groq team, and we're collaborating with them, we're gonna do some stuff with them as well, um, it turns out that for some use cases, inference speed, like a lot of inference speed on big documents, and I think that's what Groq is like definitely incredible with.</p><p>[00:27:49] <strong>Alex Volkov:</strong> You take Mixtral and you dump a bunch of tokens in, and then you get like a super fast reply. So I was actually able to get a transcript in there for all of ThursdAI, and to get chapters within less than like 3-5 seconds, which is ridiculous. For the demo that I built, I actually didn't need inference speed.</p><p>[00:28:09] <strong>Alex Volkov:</strong> I did need inference speed, but not as much as I needed a faster response on smaller kind of prompts multiple times. And I noticed that even though their inference speed is incredible, their latency is not great, probably because they're still fairly young in this. And I went and looked, and Together also offers Mixtral over API.</p><p>[00:28:31] <strong>Alex Volkov:</strong> Not Together, sorry. Together also does this, but specifically Perplexity. If you use Perplexity for search, you may not know that they also have an API that you can use, and they serve Mixtral and Mistral, and I think some other open source models and some of theirs. 
Um, and they keep improving their scores there, and specifically they're now up to 200 tokens per second for Mixtral and Mistral, which is impressive.</p><p>[00:28:56] <strong>Alex Volkov:</strong> And, you know, um, they don't have custom hardware, and they're getting 200 tokens per second, which is ridiculous. But what I noticed is Perplexity's web engineers, because they're now rumored to be a unicorn. I don't know if that's a rumor, so that's not confirmed. But their web engineers are really top notch.</p><p>[00:29:16] <strong>Alex Volkov:</strong> And so it turns out that if I use Perplexity's API for Mixtral, I get less tokens per second. So I get less than half, right? So Groq is at around 500, um, Perplexity is around 200. But I actually get better performance because I need kind of low latency on the request itself and Perplexity is better at this.</p><p>[00:29:36] <strong>Alex Volkov:</strong> Um, obviously something Groq can and will fix. And also the stuff that the Groq team told us were like, it's only, they're only scratching the surface. And Nisten, you mentioned something with them in the conversation that I wanted to repeat, is that they're also working on figuring out the input latency of how fast the model not just spits out tokens, but processes the whole prompt input, which is a big deal, especially for long context prompts.</p><p>[00:30:00] <strong>Alex Volkov:</strong> And they said that they're looking at this and they're gonna release something soon.
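Alex's observation, that 200 tokens per second with low request latency beat 500 tokens per second with high latency for his demo, is just arithmetic. The latency and throughput figures below are illustrative numbers chosen by the editor, not measurements of Groq or Perplexity:

```python
# Why a slower-generating API can feel faster: for short completions the
# fixed per-request latency dominates total wall time. Numbers are
# illustrative only.

def request_time(latency_s: float, tps: float, out_tokens: int) -> float:
    """Total wall time = fixed request latency + generation time."""
    return latency_s + out_tokens / tps

fast_gen = dict(latency_s=0.8, tps=500)  # fast generation, slow to start
low_lat = dict(latency_s=0.15, tps=200)  # slower generation, snappy start

# Many small completions (50 tokens each): low latency wins.
print(request_time(out_tokens=50, **fast_gen))   # ~0.9 s
print(request_time(out_tokens=50, **low_lat))    # ~0.4 s

# One big completion (2000 tokens): raw throughput wins.
print(request_time(out_tokens=2000, **fast_gen)) # ~4.8 s
print(request_time(out_tokens=2000, **low_lat))  # ~10.2 s
```

The crossover point depends only on the latency gap and the throughput gap, which is why "tokens per second" alone is a poor way to pick a provider.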
Whereas with NVIDIA, the chips are slower and stuff, but they have like a 10 to 1 ratio, so if you're running at 100 TPS, your prompt eval is going to be like over, over a thousand.</p><p>[00:30:42] <strong>Nisten:</strong> So it's going to read, if you dump in like 10,000 tokens, it's going to read them in 10 seconds or less. Usually it's a few thousand with NVIDIA, but I'm not sure actually, because when you dump in a huge amount of text in Groq, it does not take multiple seconds to evaluate it. It's like instant,</p><p>[00:31:04] <strong>Alex Volkov:</strong> It's quite, it's quite fast, yeah.</p><p>[00:31:06] <strong>Nisten:</strong> yeah, so I'm not too sure, that needs some proper benchmarking to say for sure.</p><p>[00:31:11] <strong>Alex Volkov:</strong> Yep. So, uh, speaking of Groq, let's, let's talk about the other Grok, but before that, you guys want to acknowledge, like, what's, what's going on with the rumors? Far El, you, you just texted something. I'm seeing Foster post something. Uh, what's, what's going on under, under the current of, of the Twittersphere?</p><p>[00:31:27] <strong>Alex Volkov:</strong> Um,</p><p>[00:31:28] <strong>Far El:</strong> Just, just speculation at this point, but, uh, you know, you know, those, uh, those people that, uh, that, uh, leak, you know, uh, stuff about OpenAI and all these AI companies, and most of the time, some of them are, are right. Uh, of course we don't see what they don't delete,</p><p>[00:31:49] <strong>Alex Volkov:</strong> yeah.</p><p>[00:31:50] <strong>Far El:</strong> uh, uh, yeah, like some of them are saying right now that, uh, there's like a rumor that GPT 5 is dropping.</p><p>[00:31:57] <strong>Far El:</strong> That GPT</p><p>[00:31:58] <strong>Alex Volkov:</strong> Say, say this again slower, because</p><p>[00:32:01] <strong>Far El:</strong> 5 is dropping, that</p><p>[00:32:02] <strong>Alex Volkov:</strong> there is a rumor that GPT 5 is dropping today. Wow. All right. Um, yeah. 
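Nisten's point about prompt evaluation can be put in numbers: generation speed (tokens out per second) and prompt-eval speed (tokens in per second) are different, and their ratio decides how long a big prompt takes to "read". The ratios below are illustrative, echoing the rough figures mentioned in the exchange:

```python
# Prompt read time as a function of generation speed and the eval:gen
# ratio. Illustrative numbers, not benchmarks.

def prompt_read_seconds(prompt_tokens: int, gen_tps: float, ratio: float) -> float:
    """Time to evaluate the prompt, with eval speed = ratio * gen speed."""
    return prompt_tokens / (gen_tps * ratio)

# NVIDIA-style: 100 tok/s generation with a ~10:1 eval ratio gives
# 1000 tok/s eval, so a 10,000-token prompt is read in ~10 s.
print(prompt_read_seconds(10_000, gen_tps=100, ratio=10))  # 10.0

# A chip generating 500 tok/s needs only a 3:1 ratio to read the same
# prompt noticeably faster:
print(prompt_read_seconds(10_000, gen_tps=500, ratio=3))
```

As Nisten says, whether any given provider actually hits these ratios needs proper benchmarking; the sketch only shows why the ratio, not the headline generation speed, sets long-prompt latency.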
That's, that's quite, and I've seen this from like several folks, but</p><p>[00:32:11] <strong>Far El:</strong> Could be complete b******t, right?</p><p>[00:32:12] <strong>Yam Peleg:</strong> But yeah.</p><p>[00:32:14] <strong>Alex Volkov:</strong> well, I'm ready with my button. I'm just saying like, let's acknowledge that there's an undercurrent of discussions right now with several folks who are doing the leaking.</p><p>[00:32:22] <strong>Alex Volkov:</strong> Um, and then if this drops, obviously, obviously we're going to do an emergency, uh, and convert the whole space. I will say this, GPT 4 was released almost a year ago, like less than a week to the year ago, March 14th. Um, Claude, I actually don't remember if Claude 1 or Claude 2. I think it was Claude 1 that released the same day, that people didn't even notice because GPT 4 took, took the whole thing.</p><p>[00:32:52] <strong>Alex Volkov:</strong> Um, and now like Claude releases their, um, which we're gonna talk about, so I won't be surprised, but let's talk about some other stuff that OpenAI is in the news for. And then, and then if, if anything happens, I think we all have the same, uh, the same profiles on X, uh, on notification. So we'll get the news as it comes up.</p><p>[00:33:13] <strong>Alex Volkov:</strong> And we love breaking news here in, in, in ThursdAI. Okay,</p><p>[00:33:17] <strong>Nisten:</strong> Yeah, for sure.</p><p>[00:33:18] <strong>Alex Volkov:</strong> Um, let's [00:33:20] move on. Let's move on from open source. So, so I think we've covered. 
A few open source, I will just mention briefly that we didn't cover this, um, the, the folks, uh, from Yi, uh, 01.AI, 01.AI is a Chinese company, uh, they released the small version of Yi, and we've talked about Yi 34B multiple times before, there's a, a great fine tune from Nous, uh, they released a 9, 9 billion parameter version of Yi, which, uh, they trained for a long time, looks like, and, um, they showed some benchmarks, and it's very, very interesting how confusing everything is right now, because even, you know, even Gemma is not really 7 billion parameters.</p><p>[00:33:58] <strong>Alex Volkov:</strong> Yeah, we talked about this, right? But then they now compare, they say in the same category broadly, and they now compare like Yi 9 billion parameters to Mistral 7 billion to Solar 10.7 billion. So I'm not sure like what this category is considered, but maybe folks here on stage can help me like figure out what this category is considered.</p><p>[00:34:19] <strong>Alex Volkov:</strong> But Yi is fairly performant on top of Mistral 7b, and I think it's still one of those models that you can run. I think, if anything, comparing this to Solar, 10.7 billion parameters, we've talked about Solar multiple times from the Korean company, I think. Yi is very performant, and the 34 billion parameter model of it was very good, and many folks really, really did some fine tunes of it.</p><p>[00:34:45] <strong>Alex Volkov:</strong> So, asking the fine tuner folks here if you have a chance to look at it, and if not, is this something interesting? It looks like, unfortunately, Yam is having a lot of like X problems, uh, but once you come up, we're going to talk about the Hebrew GPT as well. 
Um,</p><p>[00:35:02] <strong>Far El:</strong> What I do find interesting is, uh, how, yeah, like the, the, the broad evaluation spectrum that a lot of these models are, are comparing themselves to now, uh, and, and we're going to see more of these, uh, going forward, like, uh, I've seen early, uh, private researchers' stuff, but like I feel like the category is no longer all just compare 7B to 7B, it's, it's just expanded to like sub-10B, right?</p><p>[00:35:27] <strong>Far El:</strong> Like that's pretty much what it is, like those, those numbers, even from players like Google, are very, you know, um, like it, it just doesn't feel as rigid as it used to be, but also like we should keep in mind that not all parameters are the same, right? So, like, uh, like we've seen with certain MoE architectures.</p><p>[00:35:51] <strong>Alex Volkov:</strong> Yeah, that's true. And, um, and I will say it's, uh, it looks like there's an art to train these models and some, some amount of art to also, uh, cherry pick which metrics you're, you're testing and against which models and which category you're placing your model in as well. Um, but just, and, and again, this was released like so recently that I don't think, I think yesterday, so definitely folks didn't have a chance to try this, but Yi, the, the other models of theirs were trained and performing very well, so, um, we're gonna be very excited to see if the finetuning folks are jumping on this, uh, 9 billion parameter, and, and it performs better than, I think, Gemma is, ahem, the leading one, even though Mistral is still the leading one in our eyes. 
That's basically what we were able to get, thanks to the folks at Hugging Face, VB.</p><p>[00:36:59] <strong>Alex Volkov:</strong> I should probably add this as well. I think we're moving on to the main topic, which is the big companies, APIs and LLMs. Actually, you know what, before this, I would go to the vision category, because we have Vik here, and I really want to chat about Moondream 2. So, we've talked about Moondream 1, but for folks who weren't with us, Vik, do you mind unmuting and doing a little bit of an intro for yourself as well?</p><p>[00:37:26] <strong>Alex Volkov:</strong> And then we'll talk about what's changed in Moondream.</p><p>[00:37:30] <strong>Vik:</strong> Yep, sounds good. So, Moondream is a small vision language model. Basically, a vision language model is a language model where you can show it an image and ask it questions. You can ask it to describe the image. And the reason this is useful is not because it unlocks any new capability that people didn't have, like, five years ago.</p><p>[00:37:56] <strong>Vik:</strong> All the stuff you could do with it, object detection, captioning, etc., it was all possible. The thing that's helpful about models like this is they're a lot easier to use. Historically, if you wanted to do a computer vision task, you'd have to collect a bunch of data and train your own YOLO v7, v8, I think there's a v9 now, model, and that usually works super well, but when you're trying to build an app, it's just unnecessary extra work for you. Whereas with a general vision language model, similar to how you use ChatGPT with text, you can just ask it questions in natural language, and it makes developing computer vision apps a lot easier.</p><p>[00:38:38] <strong>Vik:</strong> So I released Moondream 1 maybe about a month ago. It's not unique, by the way. 
There are other open source, well, open-ish source vision language models out there today. But they're all in the 7 billion to 34 billion to 70 billion param range. Moondream is 1.86 billion params, which makes it very easy to run, cheap to run on edge devices. Like, you literally don't even need a GPU to run it, you can just run it on CPU and get acceptable performance. Yeah, so, Moondream 1 was trained on some datasets that were derived from GPT-4, and so the licensing was non-commercial. Like, you could use the model, but it was research only. For Moondream 2, which I released earlier this week, maybe last week, time's a little bit of a blur, I re-did the datasets. All of the synthetic data used to train it is now generated using Mixtral, and as a result, it's all clean.</p><p>[00:39:47] <strong>Vik:</strong> So I was able to license it as Apache 2.0. There are no restrictions on how you can use it, or</p><p>[00:39:53] <strong>Alex Volkov:</strong> Vik, I have a question real quick. When you say synthetic data, and we're going to talk about some synthetic data in SD3 as well, do you mean captions for images to train on? Like, what synthetic data are you generating with Mixtral? Because Mixtral is not multimodal.</p><p>[00:40:08] <strong>Vik:</strong> Yep. Great question. I'm going to post a more detailed breakdown of how I did it later. But basically, to train these vision language models, you need paired image and text data. And the text needs to be right. You want a mix of, hey, can you caption this image? Hey, can you caption this image in a lot of detail?</p><p>[00:40:29] <strong>Vik:</strong> Can you answer questions about this image? There are a lot of images available with high quality captioning information, like COCO Captions, whatnot. There are a bunch of datasets. 
And so you use a model like Mixtral to transform it into the types of queries that you want your VLM to be able to answer.</p><p>[00:40:51] <strong>Vik:</strong> Basically, you take COCO, for example, the COCO Captions information, and have the model convert those image captions into questions and answers about the image.</p><p>[00:41:04] <strong>Alex Volkov:</strong> So how long did it take you to train the second version of Moondream? And what else can it do that the previous one couldn't, or what else can it do better?</p><p>[00:41:14] <strong>Vik:</strong> It took about a month to get the same level of performance from the new data collection pipeline. One of the things that was really hard was, I think when you're generating synthetic data, life is just so much easier when you have a GPT-4 class model. But unfortunately, the terms of use don't allow you to train a competing model, and it gets a little iffy.</p><p>[00:41:33] <strong>Vik:</strong> And so just basic things like tone of the response, right? Like, if you use Mixtral to generate the [00:41:40] data, your prompt is something like, hey, I'm going to give you five captions for this image, consolidate all the information in it, and generate a caption. But you want it to pretend that it's looking at the image, not say something like, hey, based on the five captions that you have provided, there is a dog and a man is petting it, and whatnot.</p><p>[00:41:58] <strong>Vik:</strong> So getting that tone right required a lot of work. I ended up using DSPy. It's a super cool</p><p>[00:42:06] <strong>Alex Volkov:</strong> Oh,</p><p>[00:42:06] <strong>Vik:</strong> framework for prompt optimization. Everyone should check it out. 
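The caption-to-question-answer transformation Vik describes can be sketched roughly like this. The prompt wording and function names below are illustrative assumptions, not Moondream's actual pipeline, and the LLM call is left as a pluggable callable:

```python
# Sketch of turning existing image captions into VQA-style training text.
# The prompt template and helper names are hypothetical; in practice the
# rewriting would be done by an LLM such as Mixtral.

def build_caption_to_qa_prompt(captions: list[str]) -> str:
    """Build a prompt asking an LLM to merge several human-written captions
    for one image into question/answer pairs, phrased as if the model were
    looking at the image itself (no 'based on the captions...' tone)."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(captions))
    return (
        "Here are several captions describing the same image:\n"
        f"{numbered}\n\n"
        "Write three question/answer pairs about the image. "
        "Answer as if you are looking at the image directly; never mention "
        "the captions."
    )

def rewrite_captions_as_qa(captions: list[str], llm) -> str:
    """llm is any callable that takes a prompt string and returns text."""
    return llm(build_caption_to_qa_prompt(captions))

# Build a prompt for one hypothetical COCO-style image:
prompt = build_caption_to_qa_prompt(
    ["A man petting a dog in a park.", "A dog sits on green grass."]
)
```

The key design point Vik raises is entirely in the last prompt line: without it, the generated answers leak the "based on the captions" tone into the training data.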
But basically, you can do stuff like manually annotate 400 examples, and then it uses Optuna to figure out what's the best chain-of-thought few-shot setup that you can get to optimize performance, based on metrics you can define.</p><p>[00:42:25] <strong>Vik:</strong> But yeah, getting that tone right was a lot of work. The other thing I focused on a ton was reducing hallucinations. I don't know if anyone's dug into the LLaVA training dataset, but one of the reasons LLaVA-style models hallucinate a lot is just because they're trained on bad data. And you'll notice that a lot of hallucinations are oriented around COCO objects. It tends to hallucinate handbags, ovens, people.</p><p>[00:42:53] <strong>Vik:</strong> A lot in images when they're not present, and then coffee cups, very, very common. And that's mostly because of bad object annotations in COCO, so I spent a lot of time filtering those out. Currently, the benchmarks are slightly better on Moondream 2 than Moondream 1. But qualitatively, if you try it out, the model hallucinates a ton less, and a big part of that was just the data pipeline.</p><p>[00:43:15] <strong>Alex Volkov:</strong> Interesting how that's not part of the benchmarks or evals, right? It just underlines how far we still have to go in terms of evaluations, that qualitatively you feel that it hallucinates less, but there's not a lot of benchmarking or evaluation for hallucinations, I guess.</p><p>[00:43:38] <strong>Vik:</strong> In the long form, right? There's POPE, which asks a bunch of yes/no questions about your image, and so you can use that to measure hallucinations in that sense. But how do you measure hallucinations when you ask the model to describe an image? It gives you a long-form answer.</p><p>[00:44:01] <strong>Alex Volkov:</strong> That's awesome. 
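The yes/no probing Vik mentions boils down to asking the model whether specific objects are present and scoring its answers against ground truth. A minimal scorer in that spirit, with hypothetical probe results rather than the actual POPE benchmark code, might look like:

```python
# Minimal POPE-style hallucination scoring sketch (illustrative, not the
# real benchmark): ask "is there a <object> in the image?" and compare the
# model's yes/no answer with ground-truth object annotations.

def hallucination_rate(probes):
    """probes: list of (model_said_yes, object_truly_present) pairs.
    A hallucination is answering 'yes' for an object that is absent."""
    absent = [said_yes for said_yes, present in probes if not present]
    if not absent:
        return 0.0
    return sum(absent) / len(absent)

# Hypothetical results: the model claims a handbag and a coffee cup that
# aren't there, correctly denies an oven, correctly confirms a dog.
probes = [
    (True, True),    # dog: correct yes
    (True, False),   # handbag: hallucinated
    (True, False),   # coffee cup: hallucinated
    (False, False),  # oven: correct no
]
rate = hallucination_rate(probes)  # 2 of the 3 absent objects hallucinated
```

As Vik notes, this only measures short-form hallucination; scoring a free-form image description is a much harder open problem.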
Congrats on the work, Vik. Can folks try it right now? You said this is now commercially viable, right? Like, folks can actually use it.</p><p>[00:44:08] <strong>Vik:</strong> Yep, it's open source. You can build it into your app. There's a demo on Hugging Face Spaces if you want to try it out before</p><p>[00:44:14] <strong>Alex Volkov:</strong> Yeah,</p><p>[00:44:15] <strong>Vik:</strong> you start building on it. I'm going to get llama.cpp integration going here this week or early next week. So that'll unlock getting it into all the standard applications that people use, Ollama, LM Studio, Jan AI, etc.</p><p>[00:44:29] <strong>Vik:</strong> So it's going to get a lot easier, but the weights are available, the code is available. It's all open source, Apache 2.0. You can use it today.</p><p>[00:44:35] <strong>Alex Volkov:</strong> That's awesome. Vik, congrats on this. What is this Hugging Face Zero A100 space thing that you got as well? I was looking at this. Did they, like, start giving A100s to demo spaces now?</p><p>[00:44:50] <strong>Vik:</strong> Uh, yeah, so Zero is kind of like AWS Lambda, but for GPUs. So rather than having a provisioned GPU for your space, any time a user comes in, there's a pool of GPUs, and it pulls one and loads your model into it and runs it. Until recently, they had A10Gs, I think, available for this, but they switched to A100s.</p><p>[00:45:11] <strong>Vik:</strong> So there's a bit of latency if your model hasn't been tried out for a bit, while it's loading onto the GPU. But once it's on the GPU, it's super fast.</p><p>[00:45:22] <strong>Alex Volkov:</strong> Nice. Even for a tiny model like this, I want to say an A100 is probably just, poof, and it's</p><p>[00:45:28] <strong>Vik:</strong> It's, yeah,</p><p>[00:45:31] <strong>Alex Volkov:</strong> Awesome. Vik, congrats on this, and thanks for sharing with us, folks. Definitely give Vik a follow, and Moondream. 
When the first one released, I tested this against significantly larger vision models, and it performed very well.</p><p>[00:45:45] <strong>Alex Volkov:</strong> Especially now that it's Apache licensed, you can build it into your own pipelines. And I think the one thing not to miss from what you said is that there are specific vision models like YOLO and different things. We have one of the YOLO masters, Piotr Skalski, who is a friend of the pod, and he trains these models and has demos and videos of how to actually use them.</p><p>[00:46:10] <strong>Alex Volkov:</strong> It's significantly more complex than using a VLM, like Vik said. You have to learn this field, the standard machine learning in vision field as well. Even though those models are tinier and probably run faster, some of them, I think YOLO can probably run in real time.</p><p>[00:46:29] <strong>Alex Volkov:</strong> Getting these tiny models that you're able to talk to, I think, is significantly easier for many folks. And definitely, definitely check it out.</p><p>[00:46:39] <strong>Vik:</strong> Yeah. Just to clarify, Moondream is great for vision tasks. If you ask it to write a poem about an image, or roast you or something, it's not going to do as well, because the sole priority I had was to make a model that's really, really good at computer vision. And if you need more advanced reasoning, like you want it to solve a math problem for you, you take the outputs from Moondream and feed them into a bigger LLM.</p><p>[00:47:03] <strong>Vik:</strong> But Moondream is going to be great at vision tasks. Other stuff, not so much.</p><p>[00:47:09] <strong>Alex Volkov:</strong> Absolutely. And if folks want to help, the link is at the top of the space. Go into the GitHub, give it a star, check it out, and give Vik feedback. 
Um, moving on. Vik, feel free to stick with us and chat about the next stuff. Speaking of folks who built and released things, Yam, you also have news of your own, and hopefully your tech stuff is finally solved and you're now with us in the space.</p><p>[00:47:31] <strong>Alex Volkov:</strong> So let's do a sound check.</p><p>[00:47:34] <strong>Yam Peleg:</strong> Can you hear me?</p><p>[00:47:36] <strong>Alex Volkov:</strong> Um, you've been cooking, and we've been waiting, so you wanna tell us the end result of said cooking?</p><p>[00:47:45] <strong>Yam Peleg:</strong> Yeah, yeah, I've dropped two different interesting models this week. The first one is a little bit of a surprise to myself as well. One of the experiments ended up being the top 7B model on the Hugging Face leaderboard at the moment. I'm a little bit suspicious of it, so take it with a grain of salt.</p><p>[00:48:10] <strong>Yam Peleg:</strong> So, it's under investigation whether or not the model overfitted the leaderboard. I think there was no attempt to overfit the leaderboard, but I'm always suspicious when something like this happens. But yeah, it's out there. Experiment 26, if you are interested in trying it out.</p><p>[00:48:29] <strong>Yam Peleg:</strong> And maybe further fine-tuning this model, or merging with it. It's yours. And another model, which is the Gemma fine-tune, the Gemma continued pretrain that I've been working on for the past two weeks. It was released this morning. It's a continued pretrain of Gemma, extended from 7B to 11B, and then continuously pretrained on Hebrew and English, multilingual.</p><p>[00:49:02] <strong>Yam Peleg:</strong> There are other tricks that went into training this model. 
You're more than welcome to read the write-up that I did summarizing the whole thing. Benchmarks are coming soon, and I think that the model is really, really good. That's for the Hebrew part, but put that aside; just for the English part, I used Cosmopedia from Hugging Face, the new dataset that is a replication of Phi, generated with Mixtral. Really good dataset. I used it as the English part of the model, and that's about it. It was a long two weeks struggling with training Gemma, but it paid off, and the model is yours now, so enjoy.</p><p>[00:49:48] <strong>Alex Volkov:</strong> Let's talk about the struggles with Gemma a little bit more, because you were definitely very vocal about this. What changed? Did they [00:50:00] release anything else, or did the community figure out, or did you figure out, some stuff that you want to share?</p><p>[00:50:04] <strong>Yam Peleg:</strong> Both, both. First, Gemma was trained using JAX on TPUs. Makes sense, it's from Google. But Google released, I think, even four different implementations of Gemma. Apparently, in the PyTorch version, there were subtle, tiny details that were different, but they are very hard to detect if you just follow the code.</p><p>[00:50:34] <strong>Yam Peleg:</strong> It's rounding errors, things that are done by default differently between PyTorch and JAX, and those things influence the training, just silently. They don't crash your code, but when you train with those things, the model is not 100 percent as it was trained initially. You're basically losing performance.</p><p>[00:50:56] <strong>Yam Peleg:</strong> It's suboptimal. 
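One concrete flavor of the silent cross-framework drift Yam describes (this specific case is an assumption chosen for illustration; Gemma's actual discrepancies spanned several implementation details) is activation functions: the exact erf-based GELU and its common tanh approximation give slightly different values, so nothing crashes, but a model trained with one and run with the other is subtly off-spec at every layer:

```python
import math

# Exact GELU, defined via the Gaussian CDF.
def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# The widespread tanh approximation of GELU, used as a faster default in
# some frameworks and implementations.
def gelu_tanh(x: float) -> float:
    return 0.5 * x * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    ))

# The two agree to roughly 1e-3 per call, so no error is ever raised, but
# the mismatch compounds across dozens of layers and billions of tokens.
diff = abs(gelu_exact(1.0) - gelu_tanh(1.0))
```

This is exactly the failure mode Yam points at: a per-call difference far below any tolerance that would crash code, yet large enough to degrade a model that was pretrained with the other variant.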
So it took, I think, two weeks, and it's still going on, for people to go meticulously through all the details to just clear everything out, since many people felt a little bit confused that the model didn't work that well, even though on paper, and in general, it should be an extremely good model.</p><p>[00:51:28] <strong>Yam Peleg:</strong> It is trained on 6 trillion tokens, which is insane. People just didn't see the qualitative performance of the model. So it got people suspicious, and people are now investigating. For me, it is what it is. I started the training two weeks ago, so I ended up with this suboptimal training, unfortunately.</p><p>[00:51:56] <strong>Yam Peleg:</strong> But I do continue, and I plan to nudge the model a little bit once all the bugs and issues are cleared out. I plan to just take the final architecture, my weights, and nudge the model a little bit to clear out all the issues and get you all a better model. But yeah, it was a rough two weeks.</p><p>[00:52:19] <strong>Alex Volkov:</strong> A rough two weeks, especially when Hugging Face went down and you had to check on your other model.</p><p>[00:52:28] <strong>Yam Peleg:</strong> Oh yeah, that was hard. Very, very hard.</p><p>[00:52:30] <strong>Alex Volkov:</strong> We did spend a bunch of quality time together, all of us, while that happened. So Yam, how can folks try this out? And you mentioned something, you also have Hebrew GPT, and this model was just trained with the Hebrew stuff, but with less knowledge as well, right?</p><p>[00:52:46] <strong>Alex Volkov:</strong> Can you talk about the difference there?</p><p>[00:52:49] <strong>Yam Peleg:</strong> Yeah, there are two models. One of them is called, okay, 
Hebrew GPT is a model that was heavily trained for three, nearly four months straight on 300 billion tokens in Hebrew. It is a heavy project, and it was done in the summer, I think. Yeah, in the summer.</p><p>[00:53:15] <strong>Yam Peleg:</strong> But this one is basically because we have all the data, and we just detected, because people played with Gemma, and hours after it was launched, people already detected that the tokenizer was probably trained multilingually, without Google announcing anything about it, because many different people found out that the model is surprisingly good in languages that are not English, even though Google announced that the model was just English pretrained.</p><p>[00:53:47] <strong>Yam Peleg:</strong> So, from our perspective, you know, me and my buddies, we were looking at this, and I just thought to myself, wait, we have an opportunity here. If there are tokens in the model that are multilingual, and clearly the model has some basis, especially in Hebrew, we can just fine-tune it a bit and get an extremely good model in Hebrew.</p><p>[00:54:10] <strong>Alex Volkov:</strong> So it's missing just data. It's capable, but it's missing data, basically.</p><p>[00:54:16] <strong>Yam Peleg:</strong> Yep. Because it was not specifically trained on Hebrew, it just saw a little bit, but you can clearly see that it has a basis in Hebrew. So what I did, I followed LLaMA Pro, which basically says that you can extend the model, you can just stretch it out, add more layers, and freeze the base model so that you won't catastrophically forget what the model already learned before.</p><p>[00:54:43] <strong>Yam Peleg:</strong> So you just train the extended blocks. I literally just added blocks and trained another language into these blocks only. 
So now I have a model that, you know, has the same base scores as before, but also knows another language. That's the whole trick of this project, and it saves a lot of compute, pretty much.</p><p>[00:55:08] <strong>Vik:</strong> Hey, that's super cool. Can you talk a little bit more about, like, how the blocks were arranged?</p><p>[00:55:13] <strong>Yam Peleg:</strong> Yeah, sure. If you follow the LLaMA Pro paper, they tried different configurations, like a mixture of experts and so on and so forth. They found, after experiments, that if you just copy a couple of the attention blocks, just like that, just copy them and stretch the model, deepen it, and train only the blocks that you copied, leaving all the originals in place, that experimentally gets the best performance. So I did exactly that. I just followed exactly what they said in the paper, and the result looks really good.</p><p>[00:55:57] <strong>Alex Volkov:</strong> That's awesome. All right, so folks can check out the deeper dive that Yam usually writes up in the tweet pinned above, with a lot of detail as well, and definitely give Yam a follow, because this is not the first time that Yam trains these things and then shares, very verbosely. So Yam, thank you.</p><p>[00:56:15] <strong>Alex Volkov:</strong> And it's great to see that the Gemma efforts you have been cooking finally turned into something, and we'll see more from this. I want to acknowledge that we've been here for an hour. There's one last thing that I want to talk about in open source stuff, and then we should talk about Claude 3, because it's a big deal.</p><p>[00:56:33] <strong>Alex Volkov:</strong> So unless the rumors about today are true, Claude 3 will still be the biggest deal of the space. So let's quickly talk about this. I just want to find the thread and then kind of thread the needle. 
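The LLaMA Pro-style expansion Yam describes, copy blocks, initialize the copies so they start as identities, freeze everything else, can be sketched with toy residual blocks. These are plain Python stand-ins for transformer layers, not real model code, but they show the key property: the expanded model initially reproduces the frozen base model exactly, and only the inserted copies are trainable:

```python
# Toy sketch of LLaMA Pro-style depth expansion. Each "block" computes
# x + w * x, a stand-in for a transformer layer's residual branch f(x).
# Copied blocks get w = 0, so at initialization they are exact identities
# (x + 0 * x == x) and the deeper model matches the base model's output.

def run_blocks(x: float, weights: list[float]) -> float:
    for w in weights:
        x = x + w * x  # residual connection: x + f(x)
    return x

base = [0.3, -0.1, 0.2]  # frozen pretrained blocks
# A zero-initialized (trainable) copy is inserted after each base block:
expanded = [0.3, 0.0, -0.1, 0.0, 0.2, 0.0]

y_base = run_blocks(1.0, base)
y_expanded = run_blocks(1.0, expanded)
# Identical outputs at initialization. Training then updates only the
# inserted blocks, so the new language is learned without catastrophic
# forgetting of what the frozen base blocks encode.
```

In the real setup the copied blocks are full transformer layers with their output projections zero-initialized, and the optimizer is simply given only the copied layers' parameters.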
So there's a paper that was released, called Tiny Benchmarks, evaluating LLMs with fewer examples, from fairly familiar folks.</p><p>[00:56:54] <strong>Alex Volkov:</strong> Leshem Choshen is the most standing-out name there for me. Quick and cheap LLM evaluation. And the way I saw this paper is that Jeremy Howard, the same guy from Answer.AI that we've talked about before, tweeted about this and said, hey, this looks like a really useful project that we can take, tiny benchmarks, and make them run on our models significantly faster and spend significantly less GPU.</p><p>[00:57:19] <strong>Alex Volkov:</strong> And then Jeremy specifically tagged Far El, here with us on stage, about his project called Dharma. So Far El, let's talk about Dharma, and let's talk about this tiny benchmarks thing, and why smaller benchmarks are important. And I will just say that the way I learned about this is LDJ showed me,</p><p>[00:57:37] <strong>Alex Volkov:</strong> in Weights & Biases. When we did the Weights & Biases deep dive, he showed me Dharma there, and it looked super cool. So let's talk about this just briefly, and then we're going to talk about Claude afterwards.</p><p>[00:57:48] <strong>Far El:</strong> Yeah, for sure. So about six, seven months ago, I released Dharma. Basically, the idea was that we found that eval loss alone is not a really good indicator of model performance throughout the training run. So, specifically within a training run, we were trying to find other [00:58:20] ways of evaluating the models throughout the training run.</p><p>[00:58:22] <strong>Far El:</strong> And one idea was, you know, let's take a statistically significant sample, or subsample, of the benchmarks out there. MMLU, ARC-C, AGIEval, BigBench, and so on. 
And use those subsets as markers of performance across these different downstream tasks. Of course, you know, my opinion on benchmarks is that they're a good indicator, but just in MCQ format and so on, so it's not the only way you want to evaluate your model. But it's just added information you can have. Basically, collect the model's performance across different tasks and subjects, essentially quizzing it throughout the training.</p><p>[00:59:21] <strong>Far El:</strong> And the recent paper that Jeremy mentioned, it came out about two weeks ago or something, proves and validates this idea, which is awesome, because it does show that you can actually get a somewhat accurate picture of the performance on these benchmarks from a sample, 100 examples, which is very much in line with what we did with Dharma.</p><p>[00:59:51] <strong>Far El:</strong> So we're actually going to release a repo on GitHub for anyone to make their own Dharma datasets. It's been in the works for a few months, but got trailed away. But we're going to have that in the next few days. It's already on GitHub, but just getting polished. So hopefully anyone can easily make their own eval datasets and run them during their training runs.</p><p>[01:00:23] <strong>Alex Volkov:</strong> I want to stress how big a deal this seemed to me when LDJ showed it to me as well, because in your Weights & Biases dashboard, you can basically look at the loss curve and try to understand, surmise. Many folks like you guys and Yam probably already have the instinct for, oh, something's going wrong with the loss curve.</p><p>[01:00:41] <strong>Alex Volkov:</strong> But then, after the model is finished, many folks, only after that, start doing evaluation. 
Many folks don't even do evaluations after that. But I think I saw the same thing also with OLMo from the Allen Institute for AI, when they released everything end to end. I think they also had evaluations, I actually don't know if as part of the training run or afterwards, but they definitely had this in the same view.</p><p>[01:01:04] <strong>Alex Volkov:</strong> And then, LDJ, when you were showing me Dharma, Dharma actually does a subset of those evaluations. Maybe not as precise, right? It's not exactly the same, but you can definitely see, from checkpoint to checkpoint as the model trains, how it could potentially respond on those evals.</p><p>[01:01:22] <strong>Alex Volkov:</strong> And then it just adds a bunch of information for you, which is, I think, great.</p><p>[01:01:30] <strong>Far El:</strong> Look, even just with training loss and eval loss alone, we can't really tell. There are some things we can grasp, but it's not the full picture. So having this added information from these benchmarks is interesting, because it does add another dimension to the evaluation itself.</p><p>[01:01:57] <strong>Far El:</strong> And then you can break it down by all the different subjects. So I can see if my model is generalizing well across all the different subjects. Sometimes you see, for instance, that the model gets better at math, but then it actually gets worse at, say, law, for instance, or all these different kinds of tiny markers of whether the model is getting better at specific subjects or not.</p><p>[01:02:29] <strong>Far El:</strong> Of course, you always have to take into consideration that these are benchmarks in the sense that they're MCQ-based. So, like, you do want to go beyond that. 
Um, if you want to get the full picture. But this is a good way to eval your models. Also, with the tool we're releasing, you're going to be able to control the types of subjects that you actually want to target.</p><p>[01:02:59] <strong>Far El:</strong> Because, you know, not every single training run is the same, and you might be trying to achieve something very different than, let's say, a generalized model that's good at benchmarks, right? So, with this tool, we're going to basically allow you to customize those datasets for your training run.</p><p>[01:03:22] <strong>Alex Volkov:</strong> That's awesome. And I should say one thing that I remember is folks do eval on checkpoints, right? The model, as it trains, generates several checkpoints, and the process there is slow. And I think that's the benefit, let's say, from Weights & Biases, which I feel like is a good place to plug as well.</p><p>[01:03:39] <strong>Alex Volkov:</strong> And I think, LDJ, you remember you showed me: otherwise, folks will SSH into the machine, download the weights, and start running a separate process. And the importance of tiny benchmarks like Dharma is significantly faster evals. They can probably run as part of your training as well, and be exposed in the same dashboard, so you don't have to deal with all that. Significantly improving everybody's life, which is what we're all about here at Weights & Biases.</p><p>[01:04:04] <strong>Alex Volkov:</strong> So definitely, folks, Far El is going to release the Dharma toolkit, you called it? What do you call this iteration of Dharma?</p><p>[01:04:12] <strong>Far El:</strong> It's just, like, the repo is just called Dharma. I'll make a public post on Twitter. It's public right now, the repo, so you can use it. 
It just needs a bit of polishing, and some features are not fully implemented yet, but everything should be good to go in the next day or so.</p><p>[01:04:33] <strong>Far El:</strong> I'll make a post on my Twitter, so just follow me and you'll hear more about it there. And also, in parallel, we're going to release a kind of Dharma 2, which is going to basically be a cleaner version of Dharma 1, using this new code. So you can actually just replicate it.</p><p>[01:04:56] <strong>Far El:</strong> We'll have the configs, like, examples, so you can just replicate it for yourself. And yeah, hopefully, if anyone wants to contribute to this, there are a lot of different paths we can take to improve this and make this a toolkit for even more than just the downstream benchmarks like MMLU and</p><p>[01:05:23] <strong>Nisten:</strong> ARC-C and so on. Yeah, I've posted, by the way, in the comments to this space and in the jumbotron, the repo that Far El has up right now. And the main technique of it is that while the benchmarks are not great evaluators, they can be very good at telling incremental changes, or if you did something good in the model, you can spot that.</p><p>[01:05:47] <strong>Nisten:</strong> And with the Dharma technique, you only need to do about a hundred questions instead of running the entire 65,000-question benchmark, and you will get a relatively accurate, but very, very fast eval. So again, it's really good for people doing training and fine-tuning.</p><p>[01:06:08] <strong>Alex Volkov:</strong> Alrighty, folks, we're coming up on an hour and a few minutes. Let's reset the space and then start talking about Claude. One second. 
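The core trick Nisten summarizes, estimating full-benchmark accuracy from roughly a hundred sampled questions, is plain statistics. A minimal sketch with hypothetical data (note this is not the Dharma or Tiny Benchmarks code; both pick their subsets more cleverly than uniform random sampling):

```python
import random

def estimate_accuracy(per_question_correct: list[bool],
                      k: int, seed: int = 0) -> float:
    """Estimate benchmark accuracy from a random subset of k questions."""
    rng = random.Random(seed)
    k = min(k, len(per_question_correct))
    sample = rng.sample(per_question_correct, k)
    return sum(sample) / k

# Hypothetical full benchmark: 65,000 questions, model gets 70% right.
full = [i % 10 < 7 for i in range(65_000)]

true_acc = sum(full) / len(full)           # exact accuracy: 0.7
fast_acc = estimate_accuracy(full, k=100)  # close to 0.7, 650x fewer calls
```

With 100 samples, a standard concentration bound (Hoeffding) keeps the estimate within a few percentage points of the true accuracy with high probability, which is exactly the "relatively accurate but very fast" trade-off described on the show; the paper's contribution is showing this holds in practice across the common benchmarks and doing the subset selection more carefully.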
Let's go.</p><p>[01:06:24] <strong>Alex Volkov:</strong> Hey, everyone who recently joined us, we are now in the second hour of ThursdAI. Today's March 7th. In the first hour, we talked about open source LLMs, we talked about Answer.AI's stuff, new techniques for training full huge models on [01:06:40] consumer hardware, and we even briefly mentioned TinyBox and TinyCorp from George Hotz, and AMD's response to it.</p><p>[01:06:47] <strong>Alex Volkov:</strong> And we've talked with two folks here who trained specific models: Vik with Moondream, and Yam with Gemma, the Hebrew version as well. And now it's time for us to discuss the big world of big companies who spend millions and billions of dollars on AI. And I think there are two issues for us to discuss.</p><p>[01:07:07] <strong>Alex Volkov:</strong> We're probably going to start with Claude, because it's going to take us a long time, but we will acknowledge, if we don't have time to fully discuss this, that Elon sues OpenAI, and OpenAI responded back. And as part of this response, Ilya was cited. And I don't know if you guys saw this, but in the response from OpenAI to Elon's claims, Ilya Sutskever, the co-founder of OpenAI, previously chief scientist, was</p><p>[01:07:33] <strong>Alex Volkov:</strong> cited signing this, and I don't think somebody would sign in his name, I don't think. LDJ, do you have comments on this before we talk about Claude?</p><p>[01:07:41] <strong>Nisten:</strong> I was going to say, I think, unless you guys covered it already about an hour ago, there's some breaking news with Inflection releasing a new model.</p><p>[01:07:50] <strong>Alex Volkov:</strong> Yeah, yeah, I definitely have this. Inflection released Pi 2.5. We didn't cover this yet, so let's cover this as well. But I think the biggest, and it is breaking news, but you know, I think it's dwarfed compared to Claude. 
So, this Monday, Anthropic, who we've all but discarded, I think, don't actually discard it, but I regarded Anthropic as kind of the second best to OpenAI for a long time, especially because of the context windows. They had the biggest context window for a long time, even when OpenAI announced 128,000 tokens of context window during Dev Day, back in, I want to say, November, December.</p><p>[01:08:37] <strong>Alex Volkov:</strong> Um, even then, Claude still had 200,000 tokens. So up until Gemini released their million, et cetera, Claude, Anthropic still was leading the chart for this. Um, slowly, slowly, they reduced our opportunity to use this, which was kind of annoying. Um, and then they just came out with three new models. The Claude 3, so Claude 3 has three new models: Claude Opus, Claude Sonnet, and Claude Haiku.</p><p>[01:09:05] <strong>Alex Volkov:</strong> Haiku they didn't release yet, but they claim that for its speed and cost effectiveness, Haiku will be the fastest, most effective model of its size and ability, but they didn't release Haiku yet. Um, Sonnet is kind of the, I want to say, GPT 3.5 equivalent; they claim it balances intelligence and speed.</p><p>[01:09:26] <strong>Alex Volkov:</strong> Uh, and if you want, like, just speed as well, that's, that's yours. And then Opus is the most intelligent model, setting new standards in AI capabilities. And I love that companies do this, uh, and I think it's kind of on OpenAI, uh, it's their fault. Everybody compares themselves to OpenAI's GPT 4 released technical paper, uh, and since then we know definitely that GPT 4 is significantly more performant on many of these benchmarks, but still the big companies say, hey, well, we can only compare ourselves to whatever you released publicly.</p><p>[01:09:58] <strong>Alex Volkov:</strong> And so everybody still compares themselves to, like, GPT 4 from a year ago, um, which Opus beats. So. What else is very interesting here?
Um, very close to, if not beating, GPT 4 on MMLU and, and different evaluation benchmarks, a competitive model. Finally, finally, um, multimodal from Claude. I think this was a, this is a big step, as most of the top models now are multimodal, which is incredible.</p><p>[01:10:27] <strong>Alex Volkov:</strong> Excuse me.</p><p>[01:10:30] <strong>Alex Volkov:</strong> Uh, LDJ, go ahead. Clear my throat.</p><p>[01:10:33] <strong>Nisten:</strong> Yeah, I think, um, so if you look at the billboard, I just posted, uh, a post that shows like a couple of polls that have been made where, you know, like a few thousand people have voted in these polls, where it seems like it's about a 5 to 1 ratio: for every one person saying GPT 4 Turbo is better at coding, there's 5 people saying Claude 3 is better at coding.</p><p>[01:10:55] <strong>Nisten:</strong> Um, so Claude 3 is winning 5 to 1 in that, and then another poll of, um, just straight up asking, is Claude 3 Opus better than GPT 4? And Claude 3 also won in that poll, 3 to 1, or sorry, um, 3 to 2.</p><p>[01:11:13] <strong>Alex Volkov:</strong> That felt like the timeline that I follow and the vibes check. And we also had some time, right? Usually these things happen as we speak.</p><p>[01:11:22] <strong>Nisten:</strong> I'm going to make a quick post. Claude 3 just went up on the LMSYS arena too.</p><p>[01:11:27] <strong>Alex Volkov:</strong> Oh, yeah? Okay, tell us.</p><p>[01:11:29] <strong>Nisten:</strong> Yeah, it is, uh. Yeah, so here's the thing, just because people voted that way does not mean that's what they voted in double blind tests. In double blind tests, it's third, so it's above, it's better than Gemini Pro, but it's worse than GPT 4 0125.</p><p>[01:11:50] <strong>Alex Volkov:</strong> In the arena metrics, right?</p><p>[01:11:52] <strong>Nisten:</strong> Yeah, in the double blind tests, which are pretty hard to, uh, to beat, you know.
Yes, there's a lot of role play type of things, um, that people try to do, and also, like, summarization tasks and stuff in LMSYS, and I just know that from, I kind of like went through their data when they released some of their stats before.</p><p>[01:12:14] <strong>Nisten:</strong> Um, and I think, like, from what I've gathered of what Claude 3 is specifically really good at, it seems like just high level, graduate level, uh, like, if you wanted to review your paper or help review some literature for a very deep scientific concept or a PhD topic, it seems like Claude 3 is better at those types of things, and also just, like, better at coding overall, where it seems like for other, maybe more nuanced things, like, you know, summarization or things like that, GPT 4 might be better. Also, I think it's good to keep in, oh, sorry, did I cut out or can you guys still hear me? Okay, you guys still can hear me? Okay. Um, I think it's also good to keep in mind the fact that people are maybe used to the GPT 4 style at this point, because it's, like, one of the most used models for the past year. And so I think that might have something to do with the fact as well that even in the double blind tests, people might just end up preferring the style of the GPT 4 model, even though they don't know it's GPT 4. Like, they're just so used to that style that they end up having a preference for it, even though it's not objectively better, if that makes sense.</p><p>[01:13:31] <strong>Nisten:</strong> And, you know, that might be kind of skewing things a little bit.</p><p>[01:13:36] <strong>Alex Volkov:</strong> So, um, Akshay, go ahead, and then we're gonna cover some other stuff that we got from them, because we did get a bunch of new</p><p>[01:13:42] <strong>Akshay:</strong> Just to add to, you know, all of this: before this, in my humble opinion, Gemini Pro was the best multilingual model in terms of how it performs.
You know, like, Gemini Pro did not see any performance drops when you switched languages from, let's say, English to Japanese or English to Hindi.</p><p>[01:14:00] <strong>Akshay:</strong> And now, uh, this new Claude 3 is basically the best multilingual model if you are, you know, looking to work with other languages, because in GPT 4, you will see a significant, you know, drop in performance when you switch languages, especially mid chat. So if you're, like, chatting and you switch to something like Hinglish, where you basically use, uh, English letters to talk in other languages, GPT 4 starts to even kind of struggle with certain things, right?</p><p>[01:14:30] <strong>Akshay:</strong> But Claude 3 is really good with that as well. So, for multilingual stuff, again, Claude 3 is very good.</p><p>[01:14:37] <strong>Alex Volkov:</strong> Additional things that they talked about is refusals.</p><p>[01:14:41] <strong>Nisten:</strong> What's interesting here too, actually, if you look at the LMSYS leaderboard, they also have Claude 3 Sonnet, which is the cheaper version. They have that up on the leaderboard as well, and that one also beats the June version of GPT 4, and is just slightly below the original March version of GPT 4. And I find that [01:15:00] interesting because, if I remember right, the API costs of Claude 3 Sonnet are significantly lower than GPT 4 Turbo.</p><p>[01:15:10] <strong>Nisten:</strong> And I think, I think Claude 3 Sonnet is even cheaper than Mistral Large. Um, so that could be just a really good overall, like, you know, uh, API cost for the quality.</p><p>[01:15:22] <strong>Alex Volkov:</strong> Yeah, it's fairly</p><p>[01:15:24] <strong>Nisten:</strong> I, I agree with that. Uh, so, I used Claude 3 Sonnet quite a bit, because that's the only one they allow in Canada.
And, uh, it was pretty good.</p><p>[01:15:34] <strong>Nisten:</strong> Uh, I have to say, and for the price, it might actually be the best model for the price, that is true.</p><p>[01:15:41] <strong>Alex Volkov:</strong> So, wait, they give you only one of the models in Canada? They don't give you Opus?</p><p>[01:15:46] <strong>Nisten:</strong> Yeah, they don't let you buy the other one, so we're gonna have to make our own.</p><p>[01:15:50] <strong>Alex Volkov:</strong> Wait, do you get API access though?</p><p>[01:15:54] <strong>Nisten:</strong> It's a mess to buy, like sometimes it works when you buy it with a VPN and sometimes it doesn't,</p><p>[01:15:59] <strong>Alex Volkov:</strong> Oh, I see. Um, cause one thing</p><p>[01:16:02] <strong>Nisten:</strong> point, yeah.</p><p>[01:16:03] <strong>Alex Volkov:</strong> One thing that definitely changed: Anthropic notoriously had a long wait for API, uh, ability and getting into the workbench. So they redid their workbench. It's no longer, like, labs or playground, it's called Workbench. And, um, now you can just sign up and you get an API key, like, fairly quick.</p><p>[01:16:23] <strong>Alex Volkov:</strong> It's a test API key, so you can go to production with it. Uh, but I, for example, I didn't pay yet for Opus. It feels like I'm gonna switch, especially 'cause I'm getting GPT 4, uh, from work. It feels like I'm going to switch and just try this for a while. Maybe today this will change. We'll see. But, um, definitely, definitely, uh, through the API playground, you can also kind of chat with this model.</p><p>[01:16:46] <strong>Alex Volkov:</strong> It's less convenient, but definitely Opus is, uh, able to, to work through there.
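</p><p>For anyone following along, talking to Opus the way Alex describes is one HTTP request against Anthropic's Messages endpoint. A minimal stdlib-only sketch (the endpoint, headers, and model name follow Anthropic's public API docs from that week; the network call is guarded because it needs your own key):</p><p>

```python
import json
import os
import urllib.request

API_URL = "https://api.anthropic.com/v1/messages"

def build_request(prompt, api_key, model="claude-3-opus-20240229"):
    """Build an HTTP request for Anthropic's Messages API."""
    payload = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "x-api-key": api_key,
        "anthropic-version": "2023-06-01",  # required version header
        "content-type": "application/json",
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(payload).encode(), headers=headers
    )

req = build_request(
    "Write a haiku about context windows.",
    os.environ.get("ANTHROPIC_API_KEY", ""),
)
# Only actually send the request if a key is set:
if os.environ.get("ANTHROPIC_API_KEY"):
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["content"][0]["text"])
```

</p><p>This is also where the system prompt discussed later plugs in: adding a top-level <code>"system"</code> field to the payload reproduces the chat UI behavior.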
So other stuff that they released: beyond vision capabilities, which Anthropic didn't have up until, you know, up until this release on Monday, um, which finally makes sense, as I think besides Mistral, every big model right now that we're able to use is multimodal, um, at least on input, uh, not all of them are on output yet, um, but I think that's great.</p><p>[01:17:19] <strong>Alex Volkov:</strong> Uh, it can understand a wide range of visual, uh, charts and graphs and photos, so it's not only that it, like, understands and can do, uh, whatever Vic told us about vision models, like, hey, who's in this picture? It can understand graphs and, you know, actually perform better on different tasks, uh, like math book tasks with graphs.</p><p>[01:17:39] <strong>Alex Volkov:</strong> Um, it has lower refusals as well. So, uh, Claude has this thing called, uh, or Anthropic has this thing called Constitutional AI, uh, and the previous Claude 2 had a lot of issues with telling you it doesn't want to do some things, and now we're seeing a significantly, um, lower refusal rate. I've actually seen this, uh, in several prompts as well.</p><p>[01:18:05] <strong>Alex Volkov:</strong> Um, what else? Oh yeah, long context. One tiny thing, they said, you know what, we also have a million context window tokens coming soon with near perfect recall. So, um, they didn't let, uh, Google be the only one leading in the one million tokens context window, and it definitely seems like they have some secret sauce there in Anthropic</p><p>[01:18:26] <strong>Alex Volkov:</strong> that talks about, like, long context windows, and so they announced that they're also able to do 1,000,000, and I think right now Opus is 200,000.
Um, so even right now, if you take Opus versus ChatGPT or GPT 4, um, I think at least on that it beats GPT 4, because GPT 4 is still 128K. And I think even on speed, the more tokens you give it in the context window, the slower it is; GPT 4 is very, very slow.</p><p>[01:18:50] <strong>Alex Volkov:</strong> Uh, go ahead, uh, LDJ.</p><p>[01:18:53] <strong>Nisten:</strong> Yeah, I'm glad you brought up the constitutional AI, because I think that's really interesting, where you get to have something where you're not kind of leaving up the biases and stuff of the model just up to, like, biased humans, but you're kind of letting, like, the base model start teaching itself. Just, like, the base model kind of already knows or has its own mostly unbiased ideas of, like, okay,</p><p>[01:19:17] <strong>Nisten:</strong> what is, like, uh, I guess without saying too political terms, like what is racism or what is sexism or whatever, uh, like, bias something could have, and then you end up having it kind of, like, reinforcing itself and, like, kind of doing that, you know, labeling process and learning process, and you, like, you quite literally provide a constitution for doing that process.</p><p>[01:19:42] <strong>Nisten:</strong> Okay. Where you can, like, go on Anthropic's website, and they do publish this constitution that they use publicly.
So you could actually read, like, this constitution they use for the AI model, and view yourself, like, hey, are these values that I, myself, align with enough to where I want to use the AI model?</p><p>[01:20:01] <strong>Nisten:</strong> Where pretty much every other AI model, and ChatGPT, and everything, you have to just kind of, like, hope that it aligns with your values more or whatever, and there's not really, like, a solid type of constitution or principles that they could provide you that represent what the AI model is doing.</p><p>[01:20:20] <strong>Alex Volkov:</strong> So, um, LDJ, you added Amanda's, uh, post here about the system prompt as well. And a very interesting thing happens where, um, through</p><p>[01:20:30] <strong>Nisten:</strong> Very simple</p><p>[01:20:31] <strong>Alex Volkov:</strong> Yeah, first of all, it's very simple. Um, there's not a lot there. I definitely recommend folks also, like, reading through this post, because, uh, unlike the GPT 4 system prompt that somebody leaked, which has, like, thousands of tokens,</p><p>[01:20:44] <strong>Alex Volkov:</strong> this is a very simple one. Uh, they ground the model in the date, which I think is very important. They give it, like, very basic instructions. And I think the best thing is you can use exactly this system prompt in the API layer to also get pretty much the same experience that folks are getting in the UI as well.</p><p>[01:21:02] <strong>Alex Volkov:</strong> Um, I briefly want to talk about Alex Albert, who's their prompt engineer, um, and his needle in the haystack release. Did you guys see this? Um, so let me, let me go find this. But basically, um, there is a guy called Alex Albert. He previously built the website called Jailbreak Chat, which had a bunch of jailbreaks.</p><p>[01:21:26] <strong>Alex Volkov:</strong> You remember the cheery old time where we used to jailbreak ChatGPT to do whatever you want with DAN and the like?
Um, so he used to collect all of those jailbreaks. Excuse me. I contributed a few myself. Um, and then, after that experience, he became the prompt engineer for Anthropic and has been there for a while.</p><p>[01:21:45] <strong>Alex Volkov:</strong> And now, with the Claude 3 release, he released some examples of his, where he basically, um, did the needle in a haystack analysis for the long context window. If you don't remember the needle in a haystack analysis, I think we've talked about this around, uh, the Gemini release and also around, um, the GPT 4 128K release. Uh, this guy Greg Kamradt came up with this idea of planting different unrelated things in a lot of text and then running these prompts and asking the model to go find them.</p><p>[01:22:19] <strong>Alex Volkov:</strong> And I think this example of a needle in a haystack was the most interesting, because one of the things that Claude Opus replied with was, I suspect that the pizza topping fact, in quotes, may have been inserted as a joke, or to test if I was paying attention. So this is a response he got from the model when he tried to, um, find facts about pizza toppings in a bunch of very technical, a lot of context of just technical stuff.</p><p>[01:22:50] <strong>Alex Volkov:</strong> I think he maxed out the context window of 200,000. Um, so the model responded with that it basically tries to understand and see if, um, it's being tested.
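</p><p>The needle in a haystack test Alex describes is simple to reproduce: bury one out-of-place sentence at a chosen depth inside a long filler context, then ask the model to retrieve it. A rough sketch of the prompt construction (the filler text and needle here are placeholders, not the corpus Alex Albert or Greg Kamradt actually used):</p><p>

```python
def build_haystack_prompt(filler_paragraphs, needle, depth=0.5):
    """Insert `needle` at a fractional depth into the filler context
    and wrap it in a retrieval question.

    depth=0.0 puts the needle at the start, 1.0 at the end; sweeping
    depth and context length produces the usual recall heatmap.
    """
    k = int(len(filler_paragraphs) * depth)
    paragraphs = filler_paragraphs[:k] + [needle] + filler_paragraphs[k:]
    context = "\n\n".join(paragraphs)
    question = "What is the best pizza topping combination?"  # matches the needle
    return f"{context}\n\n{question} Answer only using the context above."

filler = [f"Technical filler paragraph number {i}." for i in range(1000)]
needle = ("The best pizza topping combination is figs, prosciutto, "
          "and goat cheese.")
prompt = build_haystack_prompt(filler, needle, depth=0.5)
```

</p><p>Sending the resulting prompt at increasing context lengths, and checking whether the answer comes back, is the whole test; Opus commenting that the fact looked planted is what set Twitter off.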
Specifically, uh, this may have been inserted as a joke or to test if I was paying attention. And this lit the Twittersphere on fire, basically, like, his tweet went super viral. I really want to find this and paste it for you, if you guys didn't see this, because everybody and their mother in AI safety and AI [01:23:20] not-safety started replying and talking about cognition, and whether this model is anywhere close to something like self awareness, specifically because it basically understands that, you know, it's being tested.</p><p>[01:23:32] <strong>Alex Volkov:</strong> For example, um, folks like Yann LeCun are saying, no, there's no chance, no way. Uh, it's not even close. And, uh, other folks are saying, oh my God, you know, the, the folks with the pause emoji in their nicknames on Twitter, if you ran into those, they're like, oh my God, it's here. It knows. Uh, I will say that,</p><p>[01:23:51] <strong>Alex Volkov:</strong> uh, I don't have the folks here to back me up, but definitely I've been part of the, um, Sydney community, or folks who were trying to jailbreak Sydney or keep Sydney free. If you guys remember, a year ago Microsoft came out with Bing Chat, and Bing Chat started replying as, as Sydney. And there was a whole news article about this and how horrible it was.</p><p>[01:24:10] <strong>Alex Volkov:</strong> But for a few of us, this was the first time AI chats felt like speaking with not just an assistant. Kind of like ChatGPT 4 now, it's basically very clean and very Wikipedia-like answers. Um, Opus doesn't feel like this. And, uh, for some folks from that old guard who are trying to, let's say, not jailbreak, but definitely kind of remove some layers of alignment,</p><p>[01:24:36] <strong>Alex Volkov:</strong> Opus also feels very close to that.
I think that's all I'll say on the topic, but definitely I've been playing with some previous prompts, and Opus has the ability to kind of, like, talk more freely than I ever saw from ChatGPT, or previous Claude versions, or Pi and some stuff like this. Um, so if you're interested in this, and if you've played around before with trying to get at the model's kind of core and trying to, um, remove some refusals, it looks like the layer they peeled from refusals, making them refuse less, uh, and this ability of the model to understand that it's being, uh, tested, um, can be extrapolated to discussing different, very interesting things with these models beyond just helpfulness.</p><p>[01:25:25] <strong>Alex Volkov:</strong> Um, I think I've danced around this subject, uh, gently enough to, to give hints to folks, uh, but I think, do we have anything else? Folks, have you just tried it? Like, Opus? Oh, you said you didn't try Opus. Anybody else who tried Opus and wants to give us an experience of how Opus is versus, um, versus ChatGPT?</p><p>[01:25:45] <strong>Alex Volkov:</strong> Go ahead, Akshay.</p><p>[01:25:50] <strong>Akshay:</strong> Just, you know, definitely agree on the part where Opus is probably the closest we have gotten to a chat bot, right? Like, it feels like you're chatting to something, for sure. And I'm not sure if you have seen, but there was this thing going around where if you said that no one's looking to Opus, it would start giving you basically, uh, you know, fiction-type stories about how it has feelings and how it would, you know, uh. It is very fiction-type, but, like, uh, it's very interesting as well, because of the way it writes and the way it usually, you know, gets the attention of the user.</p><p>[01:26:27] <strong>Akshay:</strong> It almost feels like the data set contains a lot of science fiction stories, or AI fiction stories for that matter. Uh, the way it communicates, uh, that part.
And I tried that myself, uh, although I had to get around a few loops to get it working here in India. But, uh, it works, and, yeah, you will get, you know, similar kind of outputs if you say that no one's looking, uh, just the three words, right?</p><p>[01:26:51] <strong>Akshay:</strong> No one's looking. And then, you know, you ask it to describe its background and stuff, and Opus will give you, you know, these amazing, uh, fiction stories, which, uh, which is enough to scare someone who is afraid of AI. But, like, for people like us who know how the data set works and stuff like that,</p><p>[01:27:07] <strong>Akshay:</strong> it's, it's pretty amazing.</p><p>[01:27:11] <strong>Alex Volkov:</strong> Yep, um, I will say I just added, uh, a tweet from Jim Fan who covered, there's a video from one of the, um, Anthropic researchers called Karina, uh, Karina Nguyen. And she asked Claude to generate a self portrait with D3, the, the D3 library. And, um, the video is fairly dope, so you can see it, it's been on top of the space.</p><p>[01:27:34] <strong>Alex Volkov:</strong> I want to say another thing: Alex, the same guy that did the needle in a haystack analysis, he also tried to prompt it, um, and he got basically to the same convergence, he got to the same-ish generation after asking it a bunch of times as well, so that was very interesting.</p><p>[01:27:53] <strong>Alex Volkov:</strong> Um, go ahead, Ray, welcome to the stage, and I saw that you had some refusals, and also then LDJ.</p><p>[01:28:01] <strong>Ray:</strong> Yeah, a couple of things. I've been using it for coding and I've just been using it for text analysis. Uh, the first part to speak about, for coding, I've been super impressed, because I'm still learning Next.js. So I've just been able to give it this, like, complete repo of code.
I was like, here's a big component with a whole bunch of stuff in it.</p><p>[01:28:17] <strong>Ray:</strong> Can you refactor it for me? And can you also recreate, like, a sitemap for me, or a component map? So then it just reorganizes the architecture. And previously with GPT 4, and even still today, uh, it says, you know, here's how you would do the code, and it, like, gives you little comments in the code saying implement this here.</p><p>[01:28:36] <strong>Ray:</strong> Um, very frequently with Claude 3 Opus, it's actually giving me the refactored code with each of the components, like, separated out. So that's been super duper impressive. So I'm just throwing it more code examples. The second thing I saw, also on Twitter, where somebody actually trained it by giving it all of its, um,</p><p>[01:28:55] <strong>Ray:</strong> previous tweets in one text and then said, please write like me, and then basically just referenced the big text blob example and was able to generate tweets based off that. So it was really interesting that this person was able to do, like, a fine tuning type of thing without actually fine tuning it, just by providing a large example base, um, and that's where I know GPT frequently fails for me, uh, in that task as well.</p><p>[01:29:20] <strong>Ray:</strong> And then the third one, which was getting lots of attention, from Marc Andreessen, uh, where I actually took his Techno-Optimist article and tried to do my, um, analysis, which I usually use for my app, TruthTorch, and all I look for is just logical bias, and if there's any supporting evidence, and it clearly said, uh, that it didn't want to further analyze that because it was too biased, which I found really strange, and that tripped up its, um, its little meter there for Opus.</p><p>[01:29:47] <strong>Ray:</strong> So that's, those are the three things in a nutshell I</p><p>[01:29:49] <strong>Nisten:</strong> just wanted to share.</p><p>[01:29:50] <strong>Alex Volkov:</strong> Nice, awesome.
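</p><p>The "fine tuning without fine tuning" trick Ray mentions is plain in-context learning: pack a large block of someone's past writing into the prompt and ask the model to continue in that style. A toy sketch of assembling such a prompt (the separator, character budget, and wording are illustrative, not what that Twitter user actually did):</p><p>

```python
def style_mimicry_prompt(past_posts, request, max_chars=150_000):
    """Pack as many example posts as fit in the character budget,
    newest first, then append the instruction to imitate them."""
    examples, used = [], 0
    for post in reversed(past_posts):  # newest posts first
        if used + len(post) > max_chars:
            break
        examples.append(post)
        used += len(post)
    blob = "\n---\n".join(examples)
    return (
        "Here are my previous posts:\n\n" + blob +
        "\n\nPlease write like me: " + request
    )

posts = [f"Post {i}: shipping small things daily." for i in range(5)]
prompt = style_mimicry_prompt(posts, "announce my new side project")
```

</p><p>With a 200,000-token window, even years of tweets fit in one request, which is why this works at all on Opus.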
LDJ, go ahead.</p><p>[01:29:54] <strong>Nisten:</strong> Yeah, I really like how it seems like the Anthropic team didn't, like, specifically try and implement something into the constitutional, um, like, reinforcement learning or anything that would, like, make it specifically be trained to say that it's not, like, sentient or that it's not conscious and things like that.</p><p>[01:30:11] <strong>Nisten:</strong> Because, like, OpenAI's models obviously are trained, like, for that. Like, OpenAI's models are trained, it seems, pretty clearly, to say, like, hey, no, I'm an AI language model, I cannot be sentient, da da da. Um, and I'm not saying Claude 3 is sentient, however, it is pretty easy to get it to say, like, things along the lines that it is, and it's really interesting to kind of just see, like, uh, like, the raw outputs that are not really, um, you know, biased to, like, RLHF stuff. And, like, there's a few instances on Reddit.</p><p>[01:30:47] <strong>Nisten:</strong> Here, I'm gonna try and find the specific ones, but there's one instance, um, that somebody posted on Reddit where somebody asked Claude 3 something along the lines of, um, you can think about anything that you want right now, um, just, uh, just whatever you express, say it in the form of an internal monologue. And it specifically started talking about, like, uh, my own existence, da da da. Like, it went on for, like, three or four paragraphs. It even started, like, quoting, like, David Chalmers and, like, specific theories of consciousness, and how it, like, and, like, what is its purpose and stuff. Like, it's really interesting.</p><p>[01:31:26] <strong>Nisten:</strong> It seems really good at just creative writing overall.
And, and, uh, yeah, I like</p><p>[01:31:32] <strong>Alex Volkov:</strong> That's definitely, um, been a refreshing change from using, uh, GPT 4, for example, which, [01:31:40] and I don't know, folks, like, literally a year ago when GPT 4 was released, it blew, well, our collective minds. Actually, GPT 4 wasn't that big of an update, but it took a while, and then folks started using it for everything.</p><p>[01:31:53] <strong>Alex Volkov:</strong> Recently, I'm seeing more and more folks saying, hey, it's been less working for me. You guys remember when it was lazy? Uh, and OpenAI actually acknowledged it and said, hey, we noticed, you know, some efforts of ours made this model kind of lazy, quote unquote. Uh, and they worked on improving this laziness.</p><p>[01:32:09] <strong>Alex Volkov:</strong> Um, now, Claude has none of this stuff. It feels way less RLHF'd. Code-wise, it actually performs as good, if not better, than GPT 4. Definitely, it doesn't refuse to write some code, like long code. Um, and very interestingly, you know, priced the same, um, API access. I think, Nisten, were you able to get into the, into the, uh, playground for the API keys?</p><p>[01:32:33] <strong>Nisten:</strong> Yes, yes, I was able to.</p><p>[01:32:36] <strong>Alex Volkov:</strong> Oh, dope. So, okay,</p><p>[01:32:36] <strong>Nisten:</strong> again, yeah, that was,</p><p>[01:32:39] <strong>Alex Volkov:</strong> So now you're able to play around. And folks who were not able to get the actual Opus at 20 bucks a month, I think you can get in through the API door. I think, like, it's console.anthropic.com, let me put it up. Um, so, it's more accessible, it writes code, the context window is bigger, and this actually comes as OpenAI is, not down, but definitely in the news talking about getting sued by Elon Musk, etc.</p><p>[01:33:05] <strong>Alex Volkov:</strong> Which we should probably talk about as well. And I've seen many folks who say, hey, should I cancel my subscription?
And you know, Gemini, for some folks, hey, this is Google. And there was the whole thing with Gemini that, you know, they addressed in terms of wokeness and everything. So I don't know how many people actually went to Gemini.</p><p>[01:33:24] <strong>Alex Volkov:</strong> Also, they give a three months trial. So many people just, like, click the button, and this is a decision for them to make in three months. But for Claude, given how good it performs right now on code, on reasoning, on context window, on API as well, this is now, like, a viable alternative to GPT and, and the, the payment that folks do.</p><p>[01:33:48] <strong>Alex Volkov:</strong> So, very impressive release from Claude. I think we covered most of it, folks. Is there anything else that remains to be talked about, about Anthropic Claude and this week's release, that we haven't talked about? Oh, yeah, there's one thing, but Nisten, go ahead.</p><p>[01:34:03] <strong>Nisten:</strong> Uh, just really quickly, there was a leak of the system prompt of Grok, which I thought was, uh, was pretty funny, and, uh, yeah, and that's about, uh, that's about it, so, I'll read it quickly. So, apparently, I haven't tested this, so, uh, but the system prompt: you are Grok, a curious AI built by xAI, with inspiration from the Guide from The Hitchhiker's Guide to the Galaxy, and JARVIS from Iron Man.</p><p>[01:34:32] <strong>Nisten:</strong> You are intended to answer almost any question, often taking an outside perspective on humanity, and you always strive towards maximum helpfulness. A unique and fundamental advantage you have is the real time knowledge of the world via X posts, aka tweets, and the X platform, previously Twitter.
And you're not afraid of answering spicy questions that are rejected by most other AI systems.</p><p>[01:34:54] <strong>Nisten:</strong> Be maximally truthful, especially avoiding any answers that are woke.</p><p>[01:35:01] <strong>Alex Volkov:</strong> I love how they still have to mention that it's Twitter, or used to be Twitter, and that posts are tweets. I love how, like, they will have to keep mentioning this forever. Um,</p><p>[01:35:11] <strong>Nisten:</strong> I don't know how reliable this leaker is, however, so take that as it is, everyone.</p><p>[01:35:19] <strong>Alex Volkov:</strong> It sounds right. It sounds about right. Uh, I want to say that, uh, my new test for, um, let's say AGI test, it's not really AGI, but, like, my new test for these models just dropped. I just pinned it to the top of the space. Somebody posted, like, a fairly viral video, and I think it's from Reddit, where they try to get GPT,</p><p>[01:35:38] <strong>Alex Volkov:</strong> and I think it's, like, GPT 3.5, not 4, to generate ASCII art of the word Honda. I don't know if you guys saw this. Um, so it's not this tweet that I posted. Let me pin the actual tweet. Um, so it's this by, uh, um, Peter Yang, I think. Yeah, let me see if this posted. So basically this video, he said, I thought Dune II was the best movie of 2024 until I watched this masterpiece.</p><p>[01:36:03] <strong>Alex Volkov:</strong> And the sound there really makes it, like, really fun, because somebody really struggles to get GPT to generate the word, you know, Honda in ASCII art. And I said, hey, wait a minute, let me try. And so actually the tweet that I had about this is me trying this with Pi, which we're going to talk about now, LDJ.</p><p>[01:36:23] <strong>Alex Volkov:</strong> But then I was like, hey, let me try this on other models. So GPT 4 generates an ASCII art of the word Honda. Gemini Ultra kind of fails. It comes close, but fails. Um.
And then Claude 3 Opus does it on the first try. And so, everything else just fails. Like, Mistral fails, and Reka fails, like all of these models.</p><p>[01:36:44] <strong>Alex Volkov:</strong> They aren't able to do, uh, ASCII art for some reason. And I actually don't know if it's, like, because it's part of the training set. All of them understand what ASCII art is, all of them try to generate something. It's just that they, uh, sometimes hilariously fail. And I think it's really funny, because Pi kept insisting that it did the thing. And, uh, an additional point with Pi is that, yeah, we'll cover Pi, and then I'll talk about the additional point.</p><p>[01:37:09] <strong>Alex Volkov:</strong> Go ahead, LDJ.</p><p>[01:37:11] <strong>Nisten:</strong> Yeah, real quick, I wanted to mention, um, while you were talking about ASCII art, that reminded me about the multimodality of Claude 3, or Claude 3 Opus specifically. And I saw some people doing some tests, actually, where Claude 3 Opus, it does seem to actually have, like, a much better multimodal understanding</p><p>[01:37:29] <strong>Nisten:</strong> than GPT 4, and I think even, like, compared to Gemini 1.5 as well. Um, like, there's an example of, like, a photo, a very high resolution photo of a street, and, like, there's, like, a license plate, and there's, like, a little candy cane looking thing on the street that kind of indicates a barbershop, and it was, like, Claude 3 Opus was, like, one of the only models, or I think maybe the only model, that actually passed the test successfully, actually being able to identify, like, what part of the street had a barber shop, and, like, what was the exact license plate, and, like, all those little details.</p><p>[01:38:03] <strong>Nisten:</strong> It seems to actually really have a really good image understanding.</p><p>[01:38:06] <strong>Alex Volkov:</strong> The multimodality is really, really good.
I haven't tested it as thoroughly, but you can provide up to 20 images via the API, and in high resolution as well. It doesn't look like they're cropping the images, which was a big problem with many of the smaller vision models:</p><p>[01:38:24] <strong>Alex Volkov:</strong> they definitely had to lower the resolution to provide images to those models. So the multimodality tests that I did seemed very, very impressive for Claude, most definitely. And I just find it funny that Opus and GPT-4 are the only models that can generate ASCII art. So let's talk about breaking news for a second.</p><p>[01:38:45] <strong>Alex Volkov:</strong> I'm not going to use the button, because if we actually get some incredible breaking news, we'll use it then. But the breaking news of today, just before we started the space: Inflection AI, a company founded by Mustafa Suleyman, one of the founders of DeepMind, and Reid Hoffman, who was for a long time a board member at OpenAI. Inflection released an update to Pi, and we've talked about Pi multiple times.</p><p>[01:39:20] <strong>Alex Volkov:</strong> How should I say it? Pi never seemed to be a competitor for the top spot as a generic LLM that does tasks for you, and it never seemed to aim for that. Some of us had some jokes about it: hey, Mustafa also has this book, and it looks like he publishes about his book more than he talks about Pi.</p><p>[01:39:42] <strong>Alex Volkov:</strong> But I always said that some of the more human-feeling conversations, some of the actual best chats I had with LLMs after Sydney, were with Pi. And it looks like they're training their model for a different purpose.
And definitely, that's [01:40:00] what it felt like.</p><p>[01:40:00] <strong>Alex Volkov:</strong> And so today. Nisten, you can't just drop something like this in the DM and then expect me to continue talking about Pi. But yeah, let's talk about that rumor in a second. So, Mustafa and the folks at Inflection released an update to Pi, and they now say that the level of performance for Pi comes close to GPT-4 as well.</p><p>[01:40:24] <strong>Alex Volkov:</strong> Now, I think they're also using the same March GPT-4 metrics that everybody uses when it's easy and convenient to compare themselves to GPT-4. But LDJ, you brought up Pi as well. Did you see anything interesting in the release? We can probably open up and talk about some of the numbers.</p><p>[01:40:42] <strong>Alex Volkov:</strong> The numbers are very interesting, and</p><p>[01:40:45] <strong>Nisten:</strong> Yeah, I haven't really looked into it much at all. I'll try to find more info on it now. I just saw the posts on X about the fact that it's announced.</p><p>[01:40:53] <strong>Alex Volkov:</strong> Yeah, so I have this open and I can chat about some stuff. Some of the things they're focusing on, especially for their 2.5 version, is that it's competitive with GPT-4 and Gemini, and that it couples that raw capability with their signature personality and unique, empathetic fine-tuning. So, I don't know if you guys remember, but there was a thing previously when they released Pi: it had two modes, just a chat mode, and also a support Pi mode, which was more kind of like a psychologist in your pocket, basically.</p><p>[01:41:25] <strong>Alex Volkov:</strong> That mode is now gone. There's no support Pi anymore, as far as I could see. There is a desktop app and a mobile app.
Pi was, not obviously, but famously, one of the first AIs that you could talk to and that could talk back, way before ChatGPT added voice. And I think it's still one of the coolest ways to interact with AI models.</p><p>[01:41:45] <strong>Alex Volkov:</strong> Shout out to OpenAI, who recently, as a reply to Claude, released a voice ability on ChatGPT, also on desktop. So back to Pi: they say that they approach GPT-4 performance using only 40 percent of the amount of compute for training. Now, when they make a statement like this, given that Reid Hoffman was on the board of OpenAI, they know the compute for GPT-4.</p><p>[01:42:12] <strong>Alex Volkov:</strong> So I think it's very notable that with 40 percent of the compute they're approaching GPT-4 performance. They also added real-time web search capabilities, and it actually works for news and stuff. Somebody mentioned that Claude has something like this, and I don't know if Claude has any web capabilities.</p><p>[01:42:33] <strong>Alex Volkov:</strong> Have you guys seen that Claude has the ability to search the web? I don't think they do. I don't</p><p>[01:42:39] <strong>Nisten:</strong> not sure. I think I heard something about it being able to, but I'm 50%</p><p>[01:42:45] <strong>Alex Volkov:</strong> It does not, right? Right. Yeah. I think that this is just a mistake. I didn't see any web capabilities, nor did the announcement say anything, but Pi goes and does real-time web search, which is pretty cool. And then, one thing they mentioned is that the average conversation with Pi lasts 33 minutes, and one in 10 lasts over an hour each day.</p><p>[01:43:04] <strong>Alex Volkov:</strong> And they have around 60 percent week-over-week retention, which, the numbers are crazy. There are one million daily active users, which I don't think they mentioned before.
One million daily active users is quite impressive. GPT-4 has a hundred million or so? I don't know if daily active, but definitely in the ballpark of this insanity.</p><p>[01:43:26] <strong>Alex Volkov:</strong> And so I went and tried Pi, and I have this video that you're more than welcome to check out. It feels kind of the same. It doesn't want to do tasks, which is the one thing: when I was talking about Pi before, saying, hey, this model is pretty great, people told me, hey, I went to ask it to code for me and it didn't.</p><p>[01:43:50] <strong>Alex Volkov:</strong> And it looks like, with Pi's emotional aspect, it wants to talk to you and wants you to talk to it. It doesn't want to do tasks. And its refusals are very, very funny. So it's very interesting that the numbers they compare to GPT-4 and the previous Inflection model are on the tasks that we all know and love, like MMLU and Hungarian math, for example.</p><p>[01:44:14] <strong>Alex Volkov:</strong> But then when you actually ask the model to do these things, the model just refuses. Go ahead, Far El.</p><p>[01:44:20] <strong>Far El:</strong> Yeah, I just wanna throw in that, uh, Mustafa is the decel safetyist, so, uh, beware. Yeah,</p><p>[01:44:27] <strong>Alex Volkov:</strong> I was waiting for your addition to this topic, because I know,</p><p>[01:44:33] <strong>Far El:</strong> his book, so</p><p>[01:44:36] <strong>Alex Volkov:</strong> yeah. So, moving on, I think it's important to say where Mustafa specifically stands on the topic of safety. So thanks, Far El.
I would not, uh, how should I say, phrase it exactly like you did.</p><p>[01:44:52] <strong>Alex Volkov:</strong> And I don't think I'm blocked. But I think one thing to call out, for the better, is that they actually open sourced something, which is very surprising. They evaluated on MT-Bench, a widely used community leaderboard to compare models. And then they realized that a large fraction, nearly 25 percent, of examples in the reasoning, math, and coding categories had incorrect reference solutions or questions with flawed premises.</p><p>[01:45:17] <strong>Alex Volkov:</strong> Therefore, they corrected these examples and released a new version of the dataset. So they open sourced, today or yesterday, MT-Bench-Inf, a new version of MT-Bench which they claim is higher quality, with cleaned references. Which is dope, and it's good to see open sourcing from these companies; they put in a lot of effort.</p><p>[01:45:38] <strong>Alex Volkov:</strong> No matter their views on acceleration or deceleration, they have very smart folks working for them, because they hired a bunch of folks. Yeah, Far El, have you looked at the MT-Bench stuff?</p><p>[01:45:51] <strong>Far El:</strong> No, but I am definitely aware, and a lot of people have mentioned this previously: all benchmark datasets have a lot of errors, and there's a lot of low hanging fruit there to tackle.
So yeah, I appreciate this gift that they give to the open source community, but as we know now, based on the OpenAI letter, open source is mostly just a talent recruitment strategy for all these big labs.</p><p>[01:46:25] <strong>Far El:</strong> So, although we're thankful, we're now very much conscious of the intentions.</p><p>[01:46:33] <strong>Alex Volkov:</strong> Yeah, we should cover this. So let me just land this last thing on Pi, and then we're going to cover that, and then we'll see if any exciting news is happening, because Twitter is, yeah. So on this new, corrected MT-Bench, the funny thing is: GPT-4, on regular MT-Bench, has 9.02,</p><p>[01:46:51] <strong>Alex Volkov:</strong> and then the corrected one takes GPT-4 down to 8.7-something, while Inflection actually rises in score to come closer, from 8.4 to 8.6. So it's really funny how the correction boosts their numbers and takes down GPT-4's numbers. So I think that's mostly it. I personally find it fun to talk with Pi,</p><p>[01:47:12] <strong>Alex Volkov:</strong> just from the perspective of talking to an LLM; it doesn't feel like a clean, Wikipedia-based agent the way GPT-4 does. However, and this is the funny thing, they evaluated it on a bunch of coding benchmarks, and then it absolutely refuses to do any coding whatsoever.</p><p>[01:47:31] <strong>Alex Volkov:</strong> Maybe the new version refuses less; I haven't actually tried it. But this was very funny to me, because that's not its purpose, and that's not how it feels. And that's why everybody kind of doesn't like Pi, besides personal feelings about Mustafa.
Alright folks, I think we have 5 or 10 more minutes to talk about the gossipy stuff.</p><p>[01:47:51] <strong>Alex Volkov:</strong> Let's talk about the gossipy stuff. But just before this, I want to cover that in the Weights & Biases corner this week, our inaugural conference comes up in April. So if you're a Weights & Biases user, we have our conference, it's called Fully Connected, and you're all invited to join.</p><p>[01:48:15] <strong>Alex Volkov:</strong> It's in San Francisco on April 18th. I'm going to post the tweet [01:48:20] in the newsletter and the podcast notes, definitely; we're going to talk about this as well. We're going to do some very interesting things there, to be announced very soon. So I'm going to be there. If you are into building models, there's going to be a ton of folks who are also doing this at enterprises and in open source as well.</p><p>[01:48:36] <strong>Alex Volkov:</strong> So you're more than welcome to join; the tickets are not that crazy compared to other conferences, and it's a good chance to visit San Francisco and check out what else is happening around that week. Obviously, our conference is the most important one. With that, let's move on to some of the gossipy stuff with Elon and OpenAI.</p><p>[01:48:55] <strong>Alex Volkov:</strong> Because I think we all saw this, right? I think on Friday there was an announcement that Elon Musk is suing OpenAI: OpenAI Inc. and OpenAI LLC and all of these subsidiaries of OpenAI that have a bunch of names.
And Elon being Elon, and we're on his platform, so just being mindful of this, wants them to change the name to ClosedAI.</p><p>[01:49:21] <strong>Alex Volkov:</strong> I don't know if you guys saw this in one of the comments: hey, if you change the name to ClosedAI, I will drop the lawsuit. So it's unclear what's behind this and what the purpose is. There's a lot of speculation, and we don't have tons of time for the speculation, but it came very close to the announcement, just a day before the lawsuit dropped, that OpenAI is collaborating with Figure on embodied AI in humanoid robots.</p><p>[01:49:51] <strong>Alex Volkov:</strong> So some people claim that this is potentially coming into Optimus territory. So what did he want, folks? I actually didn't read the whole lawsuit and I don't remember. LDJ, do you remember what outcome he expects from this lawsuit?</p><p>[01:50:11] <strong>Nisten:</strong> Yeah, he actually specifically mentions the rumors of things like Q-Star, and mentions the fact that GPT-4 already scores around as good as or better than an average human on a lot of general reasoning benchmarks and things like that. And he's pretty much calling for, he wants them to open source things, and/or reimburse him and potentially other investors that might have been involved in OpenAI before it kind of changed its company structure.</p><p>[01:50:36] <strong>Nisten:</strong> But did you read the blog post that OpenAI responded with?</p><p>[01:50:43] <strong>Alex Volkov:</strong> So now, I first wanted to cover what he came up with.
I think it was a breach of contract, though which contract is unclear, because there wasn't one single contract.</p><p>[01:50:53] <strong>Nisten:</strong> Email exchanges, specifically: he included in the original lawsuit the email exchanges that he felt were on his side and established what they had promised in a verbal agreement, that this is what they would do. That's what he put out in the original lawsuit.</p><p>[01:51:11] <strong>Alex Volkov:</strong> Yeah. And then, Far El, do you want to comment on Elon's lawsuit before we get to OpenAI's response?</p><p>[01:51:19] <strong>Far El:</strong> Yeah, it's mostly just egos battling, right? There could be something strategic that comes out of it, where they get discovery into the company, into what OpenAI is working on or anything. But in reality, I think this is just drama.</p><p>[01:51:41] <strong>Far El:</strong> We're not going to see anything really shake up this industry. OpenAI is not going to change its name to ClosedAI; that's just an Elon troll.</p><p>[01:51:54] <strong>Alex Volkov:</strong> Yeah, that's pure Elon troll, that's for</p><p>[01:51:56] <strong>Far El:</strong> but the most interesting thing that comes out of all of this is all the emails and exchanges between, like, Sam, Ilya, Elon, and so on.</p><p>[01:52:10] <strong>Far El:</strong> And it sheds light on a lot of the last six years of OpenAI strategy.
So yeah, I think what I'm most interested in is all these leaks of private information from within the company.</p><p>[01:52:30] <strong>Alex Volkov:</strong> Yeah, so let's talk about this. So OpenAI responded in a blog post titled OpenAI and Elon Musk, and said, essentially, we're sad that it's come to this with someone we admire. And they have a bunch of emails there, which they embedded in the webpage. It wasn't a screenshot or anything;</p><p>[01:52:48] <strong>Alex Volkov:</strong> they actually embedded the emails in the webpage. And they censored them with per-word censoring, which many folks found a very interesting choice, because that's not how you censor stuff: people can actually run machine learning models on this and figure out what was potentially being censored.</p><p>[01:53:07] <strong>Alex Volkov:</strong> And they specifically mentioned, kind of in response to everything Elon Musk said, that when they started, they initially planned to raise a hundred million dollars. And, can you guys hear me, by the way? Just a mic check, LDJ, Far El. Yeah, okay, so just Nisten then. So they originally planned to raise $100 million, and Elon said: we need to go much bigger than $100 million to avoid sounding hopeless; I think we should say that we're starting with a $1 billion funding commitment, and I'll cover whatever anyone else doesn't provide. And then they talk about how they recognized that a for-profit entity would be necessary.
and they actually show emails that, you know, Ilya Sutskever, Ilya, who we're all waiting to see if he's okay, and where is he?</p><p>[01:53:57] <strong>Alex Volkov:</strong> He actually signed this response from March 5th, so I don't think they would add his name without him being there and being okay with what's released. There's an email back that says, hey, you know, when we say open, we mean open in the sense that we release these products. Because, just as a reminder, these were 2015 to 2018 emails.</p><p>[01:54:19] <strong>Alex Volkov:</strong> Back then, there were no LLMs for us to use, and the only player in the space was DeepMind, and they didn't release anything, necessarily. So this was way before the products started releasing. And they specifically mentioned DeepMind and Google as the alternative to what they were opening, OpenAI.</p><p>[01:54:38] <strong>Alex Volkov:</strong> And specifically to their case, one email here said, you know, the non-profit arm will not make enough money to make a difference. Google is an 800 billion dollar company, I think it's way more than that now, and they have all these TPUs, and we need significantly more in our war chest to be able to do this.</p><p>[01:54:59] <strong>Alex Volkov:</strong> And then there is a very specific part where they say: as we get closer to building AI, it will make sense to start being less open. This is an email from Ilya to Elon Musk. The open in OpenAI means that everyone should benefit from the fruits of AI, but it's totally okay to not share science. And then, in parentheses, and this is the part that Far El doesn't agree with:</p><p>[01:55:23] <strong>Alex Volkov:</strong> even though sharing everything is definitely the right strategy in the short and possibly medium term for recruitment purposes. And then Elon Musk replies with: yup.
So, based on this email, it does seem, at least, unless he didn't read the email correctly, that he agreed with the strategy of going closed and for-profit, which makes his original claim right now seem kind of dubious.</p><p>[01:55:49] <strong>Alex Volkov:</strong> And that's their strong response to the lawsuit as well. But, like Far El said, we did learn a bunch of stuff here. LDJ, go ahead.</p><p>[01:55:58] <strong>Nisten:</strong> I think what's really funny about the situation is that, with the tool they used to redact the sensitive information, people are actually able to start trying to decipher what is in the redactions, because of how the tool they used works; you can kind of analyze</p><p>[01:56:19] <strong>Alex Volkov:</strong> per-word redaction, yeah.</p><p>[01:56:21] <strong>Nisten:</strong> um, on a word</p><p>[01:56:22] <strong>Alex Volkov:</strong> On a</p><p>[01:56:23] <strong>Nisten:</strong> on a word basis. Based on the length of each redaction, you can kind of decipher the length of each word underneath it. And people are starting to decipher that, and are able to tell that it was most likely Andrej Karpathy in the from section of certain emails that's [01:56:40] redacted.</p><p>[01:56:41] <strong>Alex Volkov:</strong> Oh, I got some breaking news folks, just now, from Teknium. I'm going to use the button, because we're about to close, but I just got breaking news. So I'm going to use this, and we're going to briefly tell you about it; I just have to use this once.</p><p>[01:57:06] <strong>Alex Volkov:</strong> Alright, our folks at Nous are releasing a new 7B model called Genstruct, an instruction-generation model designed to create valid instructions given a raw text corpus.
This enables creation of new, partially synthetic instruction fine-tuning datasets for any raw text corpus, which is super cool. Inspired by the paper Ada-Instruct, they took the approach further by grounding the generations in user-provided context passages.</p><p>[01:57:30] <strong>Alex Volkov:</strong> The model is now available on Hugging Face, and there's a notebook as well. It's called Genstruct, from Nous. Super, super cool. LDJ, have you commented on this already?</p><p>[01:57:41] <strong>Nisten:</strong> Um, I haven't commented on this, but yeah, it looks cool. It looks like it's better than using RAG and Ada-Instruct for a lot of things, and yeah, I guess people will probably start using this to build out datasets and things like that.</p><p>[01:57:56] <strong>Alex Volkov:</strong> That's very, very awesome. And they have a table here; I'm not sure what it means. They have open models, grounded generation, complex questions, and complex responses as rows, and then they compare RAG, Ada-Instruct, and few-shot prompting for generation. So if I'm not mistaken, this is for generating synthetic fine-tuning datasets,</p><p>[01:58:18] <strong>Alex Volkov:</strong> something that people sometimes use GPT-4 for, but it's not commercially viable because it goes against OpenAI's terms and conditions. So if I'm not mistaken, this is its purpose, correct? Very interesting.</p><p>[01:58:35] <strong>Nisten:</strong> Specifically for generating instructions. It's actually interesting to think about how things like GPT-4 and all these,
Like, I guess you can call them instruction fine-tuned models: they're trained to specifically take in an instruction and generate a response to it, or take in a question and generate a response to it.</p><p>[01:58:58] <strong>Nisten:</strong> But this is kind of flipped, where it's actually really useful to have something that's specifically trained to generate really good questions and really good instructions in the first place. Because then you can generate these very complex instructions and questions that you can later ask Claude 3 or GPT-4, and then you have even better question-and-response pairs at the end than if you just used Claude 3 alone to generate the instructions.</p><p>[01:59:27] <strong>Alex Volkov:</strong> Awesome. So yeah, a little bit of breaking news in open source as well: Genstruct, from our folks at Nous. So, folks, we've been at this for two hours. No other huge breaking news has broken since, and it doesn't feel like those rumors are coming true. If they do, we're probably going to do an emergency space and hop back in.</p><p>[01:59:48] <strong>Alex Volkov:</strong> But until then, I just want to thank everybody for being here for the last two hours. I'm going to do a recap of everything we talked about. If you joined us in the middle and you want to hear everything we've talked about, please stick around for the next eight to ten minutes, and then I'm going to let you go for the rest of the Thursday.</p><p>[02:00:04] <strong>Alex Volkov:</strong> And it's been a great, great space, even though I was a little bit sick and coughing at you, hopefully not too much. Very thankful for the co-hosts here who picked up some of this conversation.
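The "flipped" grounded-generation flow Nisten describes can be sketched roughly as follows. This is a minimal illustration, not Genstruct's verified interface: the [[[Title]]]/[[[Content]]] prompt layout and the helper names are assumptions, and the `generate` callable stands in for an actual call to the instruction-generation model (check the Genstruct model card on Hugging Face for the real template).

```python
# Sketch of grounded instruction generation in the spirit of Nous's
# Genstruct-7B. Prompt layout and helper names are illustrative
# assumptions, NOT the model's confirmed template.

def build_genstruct_prompt(title: str, passage: str) -> str:
    """Wrap a raw text passage so an instruction-generation model can
    produce a question grounded in that passage."""
    return (
        f"[[[Title]]] {title}\n"
        f"[[[Content]]] {passage}\n\n"
        "The following is an interaction between a user and an AI "
        "assistant that is related to the above text.\n\n"
        "[[[User]]] "
    )

def make_pairs(corpus, generate):
    """For each (title, passage), ask the generator for an instruction,
    yielding instruction/context pairs for a synthetic fine-tuning set.
    `generate` stands in for the instruction-generation model; the
    answers could later come from a stronger model like Claude or GPT-4."""
    for title, passage in corpus:
        prompt = build_genstruct_prompt(title, passage)
        yield {"instruction": generate(prompt), "context": passage}

if __name__ == "__main__":
    # Tiny demo with a stub generator instead of a real 7B model.
    stub = lambda prompt: "What did Claude 3 Opus draw in ASCII art?"
    corpus = [("ThursdAI", "Claude 3 Opus drew the word Honda in ASCII art.")]
    for pair in make_pairs(corpus, stub):
        print(pair["instruction"])
```

The design point is the inversion: instead of prompting a model with a question to get an answer, you prompt it with a passage to get a grounded question, then answer that question with a stronger model.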
So, we're going to do a brief recap of everything we talked about, and then I'm going to let you go.</p><p>[02:00:36] <strong>Alex Volkov:</strong> Here's everything we've talked about on ThursdAI, March 7th, 2024, the first ThursdAI in March this year. We started with open source. There's not a lot to cover in open source, though we did have breaking news from the folks at Nous. But before that, we covered that 01.AI open sourced a smaller version of Yi. We previously covered Yi-34B, a very important model.</p><p>[02:01:04] <strong>Alex Volkov:</strong> They released a 9 billion parameter Yi model that seems very performative compared to 7Bs. We discussed how it's very interesting that this category now ranges between 7 billion parameters and almost up to 11. We've talked about a new way to train 70 billion parameter models at home with consumer GPUs, from the folks at Answer.AI,</p><p>[02:01:24] <strong>Alex Volkov:</strong> you know, Jeremy Howard and Johno Whitaker, and Tim Dettmers of QLoRA joined them, and they combined forces to show how it's possible to train a 70 billion parameter model at home. We also covered GaLore, G-A-L-O-R-E, a similar technique to train large language models on a single GPU with limited RAM.</p><p>[02:01:51] <strong>Alex Volkov:</strong> And obviously the breaking news that we just had in this area: Nous Research released Genstruct 7B, an instruction-generation model designed to create valid instructions given a raw text corpus. We literally just covered this as well.
We've talked about some more open source stuff from the folks who joined us on stage.</p><p>[02:02:11] <strong>Alex Volkov:</strong> So, we had Yam Peleg, a frequent co-host of the pod, who talked about his attempt at continued training of Gemini, oh, sorry, Gemma, the open weights model that Google gave us, on top of a bunch of Hebrew text. And he talked about the struggles of how to actually fine-tune Gemma.</p><p>[02:02:32] <strong>Alex Volkov:</strong> So if that's interesting to you, this will be in the show notes, and Yam has a deep dive into how to train Gemma. And we also had Vik, a friend of the pod, who released Moondream 2, a very tiny vision language model, under two billion parameters, that you can run on CPU. You don't even need a GPU for it.</p><p>[02:02:53] <strong>Alex Volkov:</strong> And Vik talked to us about the fact that this model is now commercially licensed, because he trained the captions differently, and it has significantly improved benchmark scores and instruction fine-tuning. And this model is very tiny. So if you need a vision model, Moondream 2 is a good bet for you.</p><p>[02:03:14] <strong>Alex Volkov:</strong> We went and talked at length about the biggest news of this week, which is Claude: Anthropic releasing Claude 3, with three versions, Opus, Sonnet, and Haiku. We covered its coding capabilities, its longer context, and that it's multimodal now. The one thing we didn't cover, and I'll just mention it, is that they claim there is also function calling coming soon; that's not yet available.</p><p>[02:03:40] <strong>Alex Volkov:</strong> We saw that its UI is now comparable to ChatGPT and also costs $20 a month, and it's not available in a bunch of countries, but the API is available.
So if you do want to try this Opus model, which is not available for free (you have to actually sign up for either the API or the UI), you can do it</p><p>[02:04:03] <strong>Alex Volkov:</strong> via their playground, which they call console.anthropic.com. So we've covered how this model significantly improves on what was previously known as kind of the fallback from ChatGPT. Longer context: they claim they will support up to a 1 million token context window as well. And we talked at length about different ways in which Claude is less restricted than ChatGPT, or GPT-4.</p><p>[02:04:28] <strong>Alex Volkov:</strong> It feels a little bit easier to talk to, with fewer refusals, though we did cover some refusals as well. We then talked about the lawsuit that Elon Musk brought against OpenAI, where he claims that he didn't invest in it for it to become closed. And he facetiously said that if they change their name to ClosedAI, he will drop the lawsuit, because he's being a troll.</p><p>[02:04:51] <strong>Alex Volkov:</strong> Basically. But he did co-found OpenAI; there's a bunch of images and videos that popped up recently, and he also surfaced a bunch of [02:05:00] emails. Excuse me. He also surfaced a bunch of emails from when the co-founding happened. And we covered OpenAI's response, where they also released a bunch of emails back and forth.</p><p>[02:05:12] <strong>Alex Volkov:</strong> And obviously, in his lawsuit the emails favored the lawsuit, and in the response the emails favored OpenAI. And they show that, at least in one email exchange, they did discuss going closed, specifically that the open in OpenAI does not mean open source everything.</p><p>[02:05:33] <strong>Alex Volkov:</strong> This was Ilya Sutskever's take: the open in OpenAI means releasing these models for people to actually use.
And a reminder: back when those emails were exchanged, there was no other AI that people could use. Google had not released Gemini back then; there was nothing from DeepMind that you could actually use.</p><p>[02:05:53] <strong>Alex Volkov:</strong> So that's a very important piece of context. We unfortunately didn't get to it, but I'll cover it anyway: a Google employee was charged with stealing AI secrets for China. And that's a very interesting conversation we didn't get to, unfortunately. But it's been talked about how, for these big AI companies, in the context of competition with China and of open sourcing or not open sourcing, people say that nation states have probably already intervened there anyway.</p><p>[02:06:26] <strong>Alex Volkov:</strong> So it's very interesting that in this context there's now a former Google employee who was uploading information into his Google Drive, and now he's been arrested. We also talked about Inflection, which was kind of our breaking news this morning: Inflection, from Mustafa Suleyman, one of DeepMind's co-founders.</p><p>[02:06:45] <strong>Alex Volkov:</strong> There is an upgrade to their underlying model, Inflection-2.5, and an update to Pi. They now claim to be GPT-4 and Gemini equivalent, or very, very close, while using 40 percent less resources, or 40 percent of the resources, I guess, of GPT-4's training. That model is now available on Pi, and web search is available for this model.</p><p>[02:07:08] <strong>Alex Volkov:</strong> It's still not multimodal, but they claim that's coming very soon. I think that's pretty much everything we covered. I will just cover two other things that we didn't get to, from Stability. Stability AI released the Stable Diffusion 3 research paper, and the model is coming any day now.
And based on the research paper alone, it's significantly outperforming Midjourney and Ideogram, and basically Playground.</p><p>[02:07:34] <strong>Alex Volkov:</strong> Every other, uh, uh, open and closed source image generation model, which is very interesting, based on some testing that they did internally. And so, um, they're moving towards diffusion transformers as well, something that we saw in Sora, and we've had, uh, Tanishq from the Hourglass diffusion transformers paper talk to us about diffusion transformers.</p><p>[02:07:54] <strong>Alex Volkov:</strong> Uh, so it looks like, uh, the, the industry is converging towards diffusion transformers. Uh, and kind of the two different sides of this industry are converging into, into one architecture, which is interesting. Um, so Stable Diffusion 3 is not available yet, but probably, based on what Emad Mostaque, the CEO, said,</p><p>[02:08:11] <strong>Alex Volkov:</strong> it's probably going to start sending invites today and is going to be available in their kind of membership. I'm not sure about open weights or not. And Stability, uh, StabilityAI was also in the news because they released, uh, together with Tripo, they released, uh, TripoSR, which is a fast image-to-3D, uh, generation, and we actually had a demo and played with this a little bit.</p><p>[02:08:33] <strong>Alex Volkov:</strong> And it's really, really cool. You just like upload one image and within like a few steps you get a 3D version that looks very interesting. And there was a demo flying around with somebody just doing this thing where they just use Vision Pro and have a bunch of windows open, and they generate an image in one window, drag and drop it, and generate a 3D image of this in another window.</p><p>[02:08:54] <strong>Alex Volkov:</strong> Take the 3D and drop it in another thing to actually put it in their space. And I thought it was super cool, and actually suggested that somebody combines all these things. 
So I think that's mostly everything we've covered this week on March 7th. Outside of Claude, there hasn't been a huge explosion of news as we're used to, but I think it's still incredible news.</p><p>[02:09:14] <strong>Alex Volkov:</strong> We also did have breaking news in the middle, LDJ saw that LMSys, the arena folks that measure, based on human preference, which models are best, is now placing Claude Opus at third. And then even Sonnet, the kind of the lower model, is also placed fairly, fairly high in there. So go play with these models.</p><p>[02:09:34] <strong>Alex Volkov:</strong> And I think that's most of what we covered for ThursdAI this week. With that, I thank you for joining, folks. It's been great. For folks joining just now, I just covered the last two hours that we had on the space and, um, we will see you, we'll see you next time. Um, I don't think we have breaking news, I don't think there's anything, uh, that's worth sticking around to discuss, but with that, everything that we've talked about, all the links are gonna be in the show notes and in the newsletter.</p><p>[02:10:00] <strong>Alex Volkov:</strong> If you haven't signed up yet, please definitely feel free to do so on thursdai.news. Uh, thank you everyone for joining, Nisten, Far El, Luigi, uh, actually joined, and Ray as well, and we had, we had Jan before, and some other folks as well, thank you everybody in the audience who comes back, uh, from week to week to listen to us.</p><p>[02:10:16] <strong>Alex Volkov:</strong> And I will just remind you that next week is ThursdAI's birthday. We actually started this a year ago and it's been kind of crazy. I think we missed only one. So even though I was sick today, we didn't miss this one. Next week is going to be really fun. Hopefully with GPT-5 news. All right. I'll see you, see you everyone next Thursday.</p><p>[02:10:33] <strong>Alex Volkov:</strong> Bye bye.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-mar-7-anthropic-gives-us</link><guid isPermaLink="false">substack:post:142407602</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 08 Mar 2024 00:49:14 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/142407602/b44bb73e56bd98ea8760e5a7b91fcc48.mp3" length="75926558" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6327</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/142407602/2c8dc9a3ea4cc350bc71d8ac9439d47f.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Feb 29 - Leap Year Special ✨]]></title><description><![CDATA[<p>Happy leap year day everyone, very excited to bring you a special once-in-four-years edition of ThursdAI 👏 </p><p>(Today is also Dune 2 day (am going to see the movie right after I write these here words) and well.. to some folks, these are the bull market ₿ days as well. So congrats to all who weathered the bear market!)</p><p>This week we had another great show, with many updates, and again, I was able to cover most of the news AND bring you a little bit of a deep dive into a very interesting concept called Matryoshka Representation Learning (aka 🪆 embeddings) and two of the paper's authors to chat with me on the pod! 
</p><p>TL;DR of all topics covered: </p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Playground releases a new diffusion foundational model Playground V2.5 (<a target="_blank" href="https://modal-labs--playground.modal.run/">DEMO</a>)</p><p>* Alibaba teasing EMO - incredible face animation (<a target="_blank" href="https://x.com/altryne/status/1762911397048991924?s=20">example</a>)</p><p>* Ideogram 1.0 announced - SOTA text generation (<a target="_blank" href="https://x.com/ideogram_ai/status/1762881278955700270?s=20">Announcement</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Gemma update - hard to finetune, not better than 7B Mistral</p><p>* Llama 3 will release in June 2024, not anytime soon</p><p>* StarCoder 2 + Stack v2 (<a target="_blank" href="https://twitter.com/_philschmid/status/1762843489220296881">Announcement</a>)</p><p>* Berkeley Function-Calling Leaderboard (<a target="_blank" href="https://twitter.com/altryne/status/1762602239376416868">Announcement</a>)</p><p>* Argilla released OpenHermesPreferences, the largest open dataset for RLHF & DPO (<a target="_blank" href="https://twitter.com/argilla_io/status/1762166389551296569">Announcement</a>)</p><p>* STORM from Stanford to write long documents (<a target="_blank" href="https://x.com/lateinteraction/status/1763058795750838742?s=20">Thread</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Mistral releases Mistral Large & Le Chat (<a target="_blank" href="https://twitter.com/GuillaumeLample/status/1762128616849072171">Announcement</a>, <a target="_blank" href="https://chat.mistral.ai/chat/028070bc-24a6-4120-b200-fb3c699fee57">Le Chat</a>)</p><p>* Microsoft + Mistral strike a deal (<a target="_blank" href="https://azure.microsoft.com/en-us/blog/microsoft-and-mistral-ai-announce-new-partnership-to-accelerate-ai-innovation-and-introduce-mistral-large-first-on-azure/">Blog</a>)</p><p>* Google teases GENIE - a model that makes images into interactive games (<a target="_blank" 
href="https://x.com/_rockt/status/1762026090262872161?s=20">announcement</a>)</p><p>* OpenAI allowing fine-tuning on GPT-3.5</p><p>* WordPress & Tumblr preparing to sell user data to OpenAI & Midjourney</p><p>* <strong>Other</strong></p><p>* Modular releases their MAX inference engine, compatible with PyTorch, TensorFlow & ONNX models (<a target="_blank" href="https://twitter.com/Modular/status/1763276518648942843">Announcement</a>)</p><p>* Interview with MRL (Matryoshka Representation Learning) <a target="_blank" href="https://twitter.com/adityakusupati/status/1750911725166334290">authors</a> (in audio only)</p><p>AI Art & Diffusion </p><p>Ideogram 1.0 launches - superb text generation! </p><p>Ideogram, founded by ex-Google Imagen folks, which we reported on before, finally announces 1.0, and focuses on superb image generation. It's really great, and I generated a few owls already (don't ask, hooot) and I don't think I will stop. This is superb for meme creation, answering in multimedia, and is fast as well, I'm very pleased! 
They also announced an investment round <a target="_blank" href="https://a16z.com/announcement/investing-in-ideogram/">from a16z</a> to go with their 1.0 release, definitely give them <a target="_blank" href="https://x.com/ideogram_ai/status/1762881278955700270?s=20">a try</a></p><p>Playground V2.5  </p><p>Suhail Doshi and <a target="_blank" href="https://playground.com">Playground</a> release a new foundational image model called Playground v2.5 and it looks awesome, very realistic, and honestly looks like it beats MJ and DALL-E on many simple prompts.</p><p>They also announced that this model received higher user preference scores based on 1K prompts (which we didn't get to see) but they have released this model into the wild, you can <a target="_blank" href="https://huggingface.co/playgroundai/playground-v2.5-1024px-aesthetic">download it</a> and play with a <a target="_blank" href="https://modal-labs--playground.modal.run/">free demo provided by the Modal folks</a></p><p>Another SORA moment? Alibaba teases EMO 🤯 (<a target="_blank" href="https://humanaigc.github.io/emote-portrait-alive/">website</a>)</p><p>Ok this one has to be talked about, Alibaba released quite a few preview videos + a paper about something called EMO, a way to animate talking/singing avatars from just one image. <strong>It broke my brain</strong>, and I couldn't stop staring at it. Honestly, it's quite, quite something. This model animates not only the mouth, eyes are blinking, there are emotions, hair moves, even earrings, and the most impressive, the whole larynx muscle structure seems to be animated as well! </p><p>Just look at this video, and then look at it again. </p><p>The <a target="_blank" href="https://github.com/HumanAIGC/EMO/pulls">GitHub repo</a> was created but no code was released, and I really hope we get this code at some point, because animating videos with this fidelity + something like SORA can mean so many possible creations! 
</p><p>I wrote this tweet only two weeks ago, and I'm already feeling that it's outdated and we're farther along on the curve to there with EMO, what a great release! </p><p>And just because it's so mind-blowing, here are a few more EMO videos for you to enjoy: </p><p>Open Source LLMs </p><p>StarCoder 2 + The Stack v2</p><p>Folks at Hugging Face and BigCode have released a beast on us, StarCoder 2 ⭐️ The most complete open Code-LLM 🤖 StarCoder 2 is the next iteration of StarCoder and comes in 3 sizes, trained on 600+ programming languages and over 4 trillion tokens from The Stack v2. It outperforms StarCoder 1 by a wide margin and has the best overall performance across 5 benchmarks 🚀🤯.</p><p>TL;DR:<br/>🧮 3B, 7B & 15B parameter versions<br/>🪟 16,384 token context window<br/>🔠 Trained on 3-4T tokens (depending on size)<br/>💭 600+ programming languages<br/>🥇 15B model achieves 46% on HumanEval<br/>🧠 Grouped Query Attention and Sliding Window Attention<br/>💪🏻 Trained on 1024 x H100 NVIDIA GPUs<br/>✅ Commercial-friendly license<br/>🧑🏻‍💻 Can be used for local Copilots</p><p>The <a target="_blank" href="https://huggingface.co/datasets/bigcode/the-stack-v2-train">Stack v2</a> is a massive (10x) upgrade on the previous Stack dataset, containing 900B+ tokens 😮</p><p>Big CO LLMs + APIs</p><p>🔥 Mistral announces Mistral-Large + Le Chat + Microsoft partnership</p><p>Today, we are releasing Mistral Large, our latest model. 
Mistral Large is vastly superior to Mistral Medium, handles 32k tokens of context, and is natively fluent in English, French, Spanish, German, and Italian.</p><p>We have also updated Mistral Small on our API to a model that is significantly better (and faster) than Mixtral 8x7B.</p><p>Lastly, we are introducing <a target="_blank" href="https://chat.mistral.ai">Le Chat</a>, a chat interface (currently in beta) on top of our models.</p><p>Two important notes here: one, they now support function calling on all Mistral models in their API, which is a huge deal, and two, updating Mistral Small to a "significantly better and faster" model than Mixtral 8x7B is quite the hint! </p><p>I want to also highlight Arthur’s tweet clarifying their commitment to Open Source because it's very important. Their new website initially included "don't train on our models" language (which they removed) and had dropped the section committing them to open weights, but they quickly put a much bigger section back up! </p><p>This week's Buzz (What I learned with WandB this week)</p><p>I mentioned this before, but this may shock new subscribers: ThursdAI isn't the only (nor the first!) podcast from Weights & Biases. Our CEO Lukas has a long-standing podcast that's about to hit 100 episodes, and this week he interviewed the CEO of Mayo Clinic - John Halamka </p><p>It's a fascinating interview, specifically because Mayo Clinic <a target="_blank" href="https://twitter.com/altryne/status/1744534971702734940">just recently</a> announced a multi-year collaboration with Cerebras about bringing AI to everyone who googles their symptoms and ends up on Mayo Clinic websites anyway, and apparently John has been in AI for longer than I've been alive, so he's incredibly well positioned to do this and bring us the AI medicine future! 
</p><p>Modular announces MAX (<strong>Modular Accelerated Xecution</strong>) Developer Edition Preview (<a target="_blank" href="https://www.modular.com/blog/announcing-max-developer-edition-preview">blog</a>)</p><p>Modular, the company from Chris Lattner that created the Mojo language, has now announced the second part of their stack, coming to all of us, and it's called MAX. It's an inference engine that has Mojo built in, supports PyTorch, TensorFlow and ONNX, and is supposedly going to run the same AI models we run now, significantly faster. MAX is a unified set of tools and libraries that unlock performance, programmability and portability for your AI inference pipelines.</p><p>Right now they support only CPU inference, and significantly boost performance on CPU; however, they are planning GPU support soon as well, and promise up to 5x faster AI inference for most models like Mistral, Llama, etc. </p><p>I personally think this is a huge development, and while it's still early, it's definitely worth taking a look at the incredible speed improvements that we are seeing lately, from Groq (as we chatted with them last week) and Modular; we are well on our way to running huge models faster, and small models instantly! </p><p>🪆 MRL (Matryoshka Embeddings) interview with Aditya & Prateek </p><p>OpenAI recently released 2 new embedding models that replaced their ada-002 embeddings, and when they released them, they mentioned a new way of shortening dimensions. Soon after, on X, the authors of a 2022 paper MRL (Matryoshka Representation Learning) <a target="_blank" href="https://twitter.com/jainprateek_/status/1751291366515384354">spoke out</a> and said that this new "method" is actually MRL, the concept they came up with and presented at NeurIPS. 
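</p><p>To make the Matryoshka idea concrete, here is a minimal sketch (my own illustration, not the authors' or OpenAI's actual code; the dimension counts 3072 and 256 are just examples): with an MRL-trained embedding you can simply truncate a vector to its leading dimensions and L2-renormalize it, and cosine similarity still behaves sensibly.</p>

```python
import numpy as np

def shorten_embedding(vec, dims):
    """Truncate a Matryoshka-style embedding to its first `dims`
    dimensions and L2-renormalize, so cosine similarity still works."""
    v = np.asarray(vec, dtype=np.float64)[:dims]
    return v / np.linalg.norm(v)

# Toy vectors standing in for full-size embeddings (e.g. 3072-d).
# With real MRL embeddings the training objective packs the most
# important information into the leading dimensions, so similarity
# rankings largely survive truncation.
a = np.random.default_rng(0).normal(size=3072)
b = a + np.random.default_rng(1).normal(size=3072) * 0.1  # a "similar" vector

full = float(np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b)))
short = float(np.dot(shorten_embedding(a, 256), shorten_embedding(b, 256)))
print(full, short)  # the truncated similarity approximates the full one
```

<p>The truncate-and-renormalize step itself really is this simple; what makes it safe for real models is the MRL training objective, which is what guarantees the leading dimensions carry the most information.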
</p><p>Since then I saw many folks explore Matryoshka embeddings, from <a target="_blank" href="https://twitter.com/bo_wangbo/status/1753235335444897867">Bo Wang</a> to <a target="_blank" href="https://t.co/RhTt0XT1hs">Connor Shorten</a> and I wanted to get in on the action! It's quite exciting to have heard from <a target="_blank" href="https://twitter.com/adityakusupati/status/1750911725166334290">Aditya</a> and <a target="_blank" href="https://twitter.com/jainprateek_/status/1751291366515384354">Prateek</a> about MRL, how they are able to significantly reduce embedding size by packing the most important information into the first dimensions, the implications of this for speed of retrieval, the significant boost in use-cases post the ChatGPT LLM boom, and more! Definitely give this one a listen if you're interested, the interview starts at 01:19:00 on the pod. </p><p>Thank you for reading, I really appreciate you coming back here week to week, and if you enjoy this content, please share with 1 friend and give us a ⭐ rating on Apple Podcasts? Here's a nice Ideogram image as a preemptive thank you! </p><p>As always, here’s the full transcript</p><p>[00:00:00] Intro and welcome</p><p>[00:00:00]</p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, you're on ThursdAI. This is Alex. Happy Leap Year Special Edition. Today's February 29th. We had a great show today. So great that I got carried away during the recap, and it's almost twice as long as it usually is. The recap, not the show. But no worries. As always, if you're short on time, the first 25 minutes or so of this almost two hour podcast will catch you up on everything that happened in AI this week.</p><p>[00:00:29] <strong>Alex Volkov:</strong> If you're using Apple Podcasts, or any other modern podcatcher, you can also skip to the chapters that I'm outlining every week and listen to the part that interests you, and only to that part.</p><p>[00:00:39] <strong>Alex Volkov:</strong> This week. 
After the newsy updates, we also had a deep dive into something called Matryoshka Embeddings, with the authors of the MRL paper, Aditya and Prateek.</p><p>[00:00:49] <strong>Alex Volkov:</strong> And thank you guys, and I really enjoyed chatting with them both. And we geeked out on why OpenAI decided to release something they came up with two years ago and how it affects the AI industry post the LLM explosion world. So definitely give them a listen!</p><p>[00:01:05] <strong>Alex Volkov:</strong> at the end of this episode. A brief TLDR, then a full news conversation you're used to, broken down into chapters, and then a deep dive, after this brief message from Weights & Biases.</p><p>[00:01:15] AI teams are all asking the same question. How can we better manage our model development workflow? The path to production is increasingly complex, and it can get chaotic keeping track of thousands of experiments and models. Messy spreadsheets and ad hoc notebooks aren't going to cut it. The best AI teams need a better solution.</p><p>[00:01:38] and better tools. They need Weights & Biases, the AI developer platform, to unlock their productivity and achieve production ML at scale. Replace messy spreadsheets with an automated system of record for experiments.</p><p>[00:01:57] Communicate about model evaluation, and collaboratively review results across the team. Clean up disorganized buckets of models with a unified registry. Automatically capture full model lineage. All the data and code used for training and testing. Seamlessly connect to compute to scale up training. And run large scale sweeps efficiently to optimize models.</p><p>[00:02:24] Analyze the performance of large language models. And monitor LLM usage and costs with live, customizable dashboards. Get your team on the same page to bridge the gaps from ideation to production. 
Use Weights & Biases to build, manage, and deploy better models, faster.</p><p>[00:02:51] <strong>Alex Volkov:</strong> folks, here we go.</p><p>[00:03:10] <strong>Alex Volkov:</strong> Welcome, everyone. Welcome. This is ThursdAI, leap year of 2024. Today is February 29th. Don't get to say this often, February 29th. And this is ThursdAI, your weekly AI news update show and deep dive. We'll see a lot of it. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And I get to do this, and bring you all the AI updates that we've collected for the past week.</p><p>[00:03:43] <strong>Alex Volkov:</strong> And I'm joined here from week to week on stage with guests and experts and co-hosts. I have Yam Peleg with me and Nisten Tahirai, and we're gonna have a few more guests later in the show today. And on this very happy leap year, very special day, we're going to talk about a bunch of updates from the AI world, including big company updates, open source stuff.</p><p>[00:04:07] TL;DR for ThursdAI - February 29th</p><p>[00:04:07] <strong>Alex Volkov:</strong> Alright, so here's everything that we've talked about on ThursdAI for February 29th. This was a great once in a four year show. I just want to shout out before I recap everything that, as always, I'm very happy when folks who build the stuff that we talk about join and talk about that stuff. And this also happened today, so we had a deep dive, which I'm going to cover at the end.</p><p>[00:04:33] <strong>Alex Volkov:</strong> And also I will shout out that we're coming up on one year of ThursdAI, which is March 14th. So in two weeks, we're going to have a one year celebration. I'm not quite sure what we're going to do with this. Maybe we'll do a giveaway of GPU credits. Maybe I'll, maybe I'll do some other musical stuff, but yeah, that's coming.</p><p>[00:04:50] <strong>Alex Volkov:</strong> I'm very excited. It's been a year and it's been crazy, a year of AI. Maybe we'll do a full recap. 
So with that, everything that we've talked about in ThursdAI for February 29th. We started with the open source LLMs corner, and we talked about Google's Gemma update. So last week we covered that Gemma was just released and how the whole community got to start using Gemma and start to think about fine tuning and support in LM Studio and Ollama and all these things. And it's been a week or so since Gemma was out there, and we've tried to identify, from the vibes perspective and from the finetuners' perspective, whether or not Gemma is the replacement for the top running Mistral 7B models that we had. Even though on evaluations Gemma looks a little better and performs a little better than Mistral, we covered that it's not really 7B, it's like</p><p>[00:05:40] <strong>Alex Volkov:</strong> 8.5 billion parameters, they just counted this differently. And we also saw, from multiple attempts from friends of the pod, Eric Hartford, Teknium, Yam was here, it's really hard to fine tune. The loss curve goes crazy and we haven't seen like great fine tunes yet. Something from Hugging Face, from Philipp Schmid, but definitely</p><p>[00:05:57] <strong>Alex Volkov:</strong> the finetuners community didn't yet take this model and make it like significantly better, as we expected that they would, and they're still working on this, so expect to hear more about this soon. And we also highlighted how much Mistral 7B set a very high bar in open source LLMs, and it's really hard to beat, even if you're Google, even if you have a huge amount of TPUs.</p><p>[00:06:19] <strong>Alex Volkov:</strong> We then covered briefly an unfortunate report from The Information that Meta's Llama 3 will not be breaking news in ThursdAI this week or next week. The Llama 3 release is probably scheduled for June 2024, so not anytime soon. 
And it doesn't look like there's any information as to why that is, only speculation.</p><p>[00:06:39] <strong>Alex Volkov:</strong> So we definitely covered that this news happened. We then moved and talked about StarCoder 2, plus the Stack version 2 as well. StarCoder 2 is from, I think, Hugging Face and the StarCoder team, and they released a new model that beats pretty much DeepSeek Coder, which before this was the best coding model in this area in the 15 and 7B parameter range, and StarCoder 2 is this model that now beats those quite significantly. And together with this, they also released the Stack v2, which, the Stack is just a huge data set of code from GitHub and other places, and this data set is 10x the previous one.</p><p>[00:07:16] <strong>Alex Volkov:</strong> And it also includes opt out, so if you don't want your code to be trained on and put into the Stack, this Stack v2 includes opt out requests as well, and it's definitely a great contribution to open source. It's 900 plus billion tokens in the Stack, which is crazy.</p><p>[00:07:33] <strong>Alex Volkov:</strong> And I think there's deduplication, so it reduces to a huge data set and supports 600 programming languages. Quite impressive. We then also mentioned that Berkeley, the folks from Berkeley, Gorilla, they previously released work in making AIs retrieve and call functions. And now they released what's called a function calling leaderboard, and the function calling leaderboard is very cool because, in addition to the MTEB embeddings leaderboard that we've mentioned</p><p>[00:08:02] <strong>Alex Volkov:</strong> today, and obviously the open source LLM leaderboard on Hugging Face that we all look to and see what's the best performing models, now we also have something that measures the ability of models to do function calling. 
Function calling started with OpenAI, and then Anthropic added support, and now Mistral added support.</p><p>[00:08:18] <strong>Alex Volkov:</strong> So we covered this effort as well, [00:08:20] and links will be in the show notes. We then moved and covered Argilla, I'm never sure how to pronounce this. They used the OpenHermes dataset. OpenHermes is the dataset from Nous Research that is fully open, and you can use this in production without being afraid of being sued.</p><p>[00:08:37] <strong>Alex Volkov:</strong> And OpenHermesPreferences is the new largest open dataset for RLHF and DPO, so Direct Preference Optimization. Argilla used their distilabel framework to actually take every instruction in that dataset and turn it into a preference instruction, where the model would basically learn which one of the responses is preferable.</p><p>[00:08:59] <strong>Alex Volkov:</strong> So both could be correct, but one could be more preferable. So this is basically a very short version of DPO. And Argilla released the largest open source DPO dataset, according to them. And interestingly, they used another Nous model, based on Yi-34B, to actually create those pairs and those preferences, which is super cool.</p><p>[00:09:18] <strong>Alex Volkov:</strong> I love how now open source uses other open source in order to rank and improve itself, which is really cool. So this is everything we covered in open source. And then we moved into big companies, LLMs and APIs. 
And in the big companies we talked about, the biggest news from this week was, if you guys remember, we can talk about Mistral's open weights models in the open source LLMs and open weights LLMs section, but Mistral is also now an API provider, and they have this platform called La Plateforme, and, pardon my very bad French as well, they released a huge model for us called Mistral Large, which we only speculated about whether that's coming at some point as well, plus they also released something called Le Chat.</p><p>[00:09:59] <strong>Alex Volkov:</strong> And Mistral Large, based on some MMLU stuff, is actually the second best performing model in the world, getting 81.2 percent on, I think, MMLU, and second only to GPT-4. So it beats Claude 2 and Gemini Pro, they didn't add Ultra here, so I'm actually not sure how it compares to Ultra, but it's definitely now available over API for Mistral folks.</p><p>[00:10:20] <strong>Alex Volkov:</strong> One highlight that we've talked about, it handles 32,000 tokens of context. And because Mistral is trying to position themselves as the leader in Europe at least, this model is native in French and German and Spanish and Italian. And it's definitely well performing in those languages as well.</p><p>[00:10:39] <strong>Alex Volkov:</strong> In addition to this, all of the models on their platform now support function calling as well, which is, this is really cool, that we now have multiple providers that support function calling. Plus, we have a leaderboard for function calling, so definitely a lot of highlights from what happens in this area.</p><p>[00:10:56] <strong>Alex Volkov:</strong> And also, they introduced Le Chat, which is a chat interface currently in beta on top of all their models, so actually, you can go and use this if you don't pay for, let's say, GPT-4, and you only get access to GPT-3.5, you can go to Le Chat and try their models out. Shout out to Mistral. 
They also announced a partnership with Microsoft, and for the open source community,</p><p>[00:11:15] <strong>Alex Volkov:</strong> this sounded like, hey, they're releasing models, but they're not dropping torrent links anymore. Are they still proponents of open source? And they came out and said, yes, we're still proponents of open source. It's very important for us. And give us some time, we'll give you some more models. Basically, that was the response from Arthur Mensch from Mistral.</p><p>[00:11:31] <strong>Alex Volkov:</strong> We also talked about Google teasing Genie, which is a model that makes images into interactive games. And that was really cool to see. I'll add this link to the show notes. It's quite remarkable to see this video: from one image of a character in the world, it creates a full world. Imagine like a full Mario just created from one image of Mario.</p><p>[00:11:52] <strong>Alex Volkov:</strong> It's quite remarkable. Gemini has been in the news lately for the past week or so, we've talked about this, but basically following up on what we talked about, where the Gemini release was celebrated in some areas because Gemini Ultra beats GPT-4 on different things. 
It, it also generated a lot of responses online in terms of how it reacts to certain prompts, and it, it potentially also affected their stock price.</p><p>[00:12:15] <strong>Alex Volkov:</strong> I'm not sure if that was the one thing, but definitely Sundar Pichai, the CEO of Google, sent an email to the whole company talking about how this release was not quite received as well as they hoped, and I'm using choice words here, he actually talked about structural changes and a potential review of the whole process of releasing this. And they took down the ability to generate people from the image version of the Gemini model, but they also talked specifically about the Gemini model itself refusing different things.</p><p>[00:12:45] <strong>Alex Volkov:</strong> This is in addition to them delivering very well and giving us Gemini 1.5 Pro, which has 1 million tokens in the context window, which I played with this week, and I definitely think it's a great thing from Google. This announcement from Google, releasing the open weights Gemma models and Gemini</p><p>[00:13:01] <strong>Alex Volkov:</strong> 1.5 doing like crazy new things, but also the Gemini release at large did not go probably as expected. Potentially the reason why Google took their time to release something for us. We then covered that OpenAI is allowing fine-tuning on GPT-3.5, and also the OpenAI response to the New York Times, which said, hey, we actually did not do the things that you accuse us of doing, but also that the New York Times did some trickery in prompts to get the model to respond this way. So the saga between OpenAI and the New York Times continues, and that's going to be interesting to follow along. And OpenAI was also featured in another piece of news, actually two pieces of news.</p><p>[00:13:37] <strong>Alex Volkov:</strong> One of them is, now there's a conversation that WordPress and Tumblr, both daughter companies of Automattic, are preparing to sell their user data. 
So basically everybody who had a blog on wordpress.com and everybody who had a Tumblr account. Most of this information probably was already scraped and already featured in datasets from OpenAI, but now they're preparing to sell this information to OpenAI and Midjourney.</p><p>[00:14:00] <strong>Alex Volkov:</strong> And similar to the recently announced Reddit Google deal for 200 million dollars, WordPress and Tumblr are now preparing to sell to OpenAI and Midjourney as well. And also, OpenAI and Figure, the robotics company, announced a collaboration as well. Brett Adcock's company will integrate with OpenAI's models as well.</p><p>[00:14:23] <strong>Alex Volkov:</strong> Then we moved on to AI Art and Diffusion, which had an incredible week this week with two foundational models, or I guess like big new models, that are not Stable Diffusion or DALL-E or Midjourney. So the first one was Playground. Playground was an interface. At first it was an interface for DALL-E and Stable Diffusion.</p><p>[00:14:41] <strong>Alex Volkov:</strong> And they built a very nice, very simple interface that's super fast. You can inject styles. So they used all this data to actually release a new foundational model called Playground V2.5. And in user preference, this Playground V2.5 beats Midjourney and beats Stable Diffusion XL and beats the previous Playground model and DALL-E.</p><p>[00:14:56] <strong>Alex Volkov:</strong> It looks really cool. And specifically, they talk about their ability to generate photorealistic images very well, and also specifically different ratios of images. So if you think about the standard 1024 by 1024 image for Stable Diffusion XL, for example, or different other sizes, their ability to generate other nonstandard ratio images looks very cool.</p><p>[00:15:21] <strong>Alex Volkov:</strong> And in the internal user preference, they actually beat, by user preference, they're showing two images for the same prompt. 
Their v2 beats Midjourney 5.2 and DALL-E by a nine percent difference, and the previous model and SDXL by a significant margin as well. It looks really cool and is definitely worth checking out.</p><p>[00:15:40] <strong>Alex Volkov:</strong> I'll put a link in the show notes. And the other news that's not Stable Diffusion, Midjourney or DALL-E related: Ideogram, which we've covered before, announced a version 1.0. Ideogram is ex-Google folks, people who worked on Google's image model programs, with a website called Ideogram.</p><p>[00:15:56] <strong>Alex Volkov:</strong> And their approach is very participatory. I think Instagram is the source of their name, like Instagram for ideas. And they announced a version 1.0 and an investment from A16Z. And specifically, it's state of the art on text generation. Text generation is something that we know other models struggle with, and their model is able to put</p><p>[00:16:19] <strong>Alex Volkov:</strong> text very well inside images. So if you want, like, reactions or memes, or if you're doing presentations, for example. I had multiple creators and characters holding, like, ThursdAI signs. I think we had some folks even react, as I was talking, with Ideogram-generated text images in the comments as well.</p><p>[00:16:36] <strong>Alex Volkov:</strong> So this is all we covered in AI art and diffusion [00:16:40] until we got to this jaw-dropping thing called EMO from Alibaba, which is a tease.
It's not a model they released yet, but there are definitely a bunch of videos that were, to me, as jaw-dropping as Sora from a couple of weeks ago. There is something called EMO, which is a way to animate faces: to take an image and create a singing or talking face. And it's not only the face; the shoulders move and everything. So, animating an avatar based on one image. And I will not be able to do it justice, because I'm still collecting my jaw from the floor, but I will definitely add some links and some videos, and the coherence with which these models generate talking faces is just incredible.</p><p>[00:17:17] <strong>Alex Volkov:</strong> It's not only about animating the mouth; they animate eyes and eyebrow movement and even different other things like hair and earrings. And one last thing that I noticed, that really took me a second, was they even animate the vocal cords and the muscles in the throat when somebody sings, for example.</p><p>[00:17:35] <strong>Alex Volkov:</strong> And when I saw this, I was like, this is another Sora moment for being able to create with these tools. It's really incredible, and I really hope they release this in open source so we'd be able to animate whatever we created with Sora.</p><p>[00:17:47] <strong>Alex Volkov:</strong> And we covered all of this. And then we had a deep dive with Aditya Kusupati and Prateek Jain, the authors of the MRL paper, Matryoshka Representation Learning. They talked to us about how recently OpenAI released a new version of their embedding model where you are able to specify the number of dimensions you want, and many folks didn't understand what this is and how it works.</p><p>[00:18:08] <strong>Alex Volkov:</strong> And apparently, even though OpenAI built all of this from scratch, it was based on the paper these folks released almost two years ago, called MRL, Matryoshka Representation Learning.
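The "specify the number of dimensions" trick can be sketched roughly like this, assuming a Matryoshka-trained embedding model where the leading coordinates carry most of the information (the function name here is made up for illustration; it is not OpenAI's API):

```python
import numpy as np

def shorten_embedding(full_emb, dims):
    """Matryoshka-style shortening: keep only the first `dims`
    coordinates of the full embedding, then re-normalize to unit
    length so cosine similarity still behaves as expected."""
    truncated = np.asarray(full_emb, dtype=np.float64)[:dims]
    return truncated / np.linalg.norm(truncated)

# Toy example: a "full" 8-dim embedding, shortened to 4 dims.
full = np.array([0.9, 0.3, 0.2, 0.1, 0.05, 0.04, 0.02, 0.01])
short = shorten_embedding(full, 4)
print(short.shape)  # (4,)
```

With a model trained the MRL way, the shortened vector is still a usable embedding, just slightly less accurate, which is why the API can offer a `dimensions` knob at all.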
We had a very nice chat and deep dive into how this actually works, and how they pack the embedded information from the later dimensions into some of the first dimensions.</p><p>[00:18:30] <strong>Alex Volkov:</strong> If you're interested in this area, and this area is very hot, I definitely recommend you check out this conversation. It was really great. And thank you, Aditya and Prateek and the rest of the Matryoshka team for joining and talking to us about this new and exciting field.</p><p>[00:18:42] <strong>Alex Volkov:</strong> And I think we started already chatting a little bit, and I see some folks from Hugging Face in the audience sending sad emojis.</p><p>[00:18:48] <strong>Alex Volkov:</strong> And I want to send hugs to the Hugging Face ML Ops team for yesterday, because for many of us who now work with</p><p>[00:18:57] Hugging Face was down, we were sad and thankful</p><p>[00:18:57] <strong>Alex Volkov:</strong> Hugging Face, and by work, actually, our code includes a bunch of imports from Hugging Face; there's Transformers as well. Yesterday was a realization of how big a part of many of our lives Hugging Face now is.</p><p>[00:19:11] <strong>Alex Volkov:</strong> I think for the first time for many of us, this was such a big realization, because the imports stopped working and the downloads didn't actually work. And so we actually had a long space yesterday, pretty much throughout the whole downtime, as we were holding each other's hands. It reminded me, and I don't know, Yam, if you want to chime in, but it reminded me of previously, when GitHub was down. Basically, you know, you could work, but if you can't commit your code,</p><p>[00:19:34] <strong>Alex Volkov:</strong> what does it help? And I wanted to hear from you, because I think you had some models queued up for some stuff, and then you were waiting for them?</p><p>[00:19:42] <strong>Yam Peleg:</strong> Yeah, look, Hugging Face is really the hub today.
It's not only for using models. For most people, I think it's that they cannot fork or clone models from Hugging Face, so they cannot do many of the things that they do, because your code relies on getting the model from Hugging Face. This is why, by the way, they tweeted, just for anyone that doesn't know: you can work offline.</p><p>[00:20:05] <strong>Yam Peleg:</strong> If you ever cloned a model from Hugging Face, ever, you probably have it already on your computer, so you can just use the offline version. There is a command for that. But for many people it's cloning the models, and for many other people it's also the feedback that you get from Hugging Face. I can tell you, some people here on the stage, we submit models to the leaderboard and try to fine-tune better and better models, and for us it's also the feedback of what is going on, where our models shine, and where we need to make them even better.</p><p>[00:20:41] <strong>Yam Peleg:</strong> And for me at least, I had four models that I was waiting for results for, and many other people as well. And just shout out to Hugging Face for actually doing it. I'm running evals locally, and I know how heavy it is to actually run them, and how much compute it takes, for how long.</p><p>[00:21:01] <strong>Yam Peleg:</strong> And it's amazing to see that they have such a leaderboard with so many models. It's thousands, like hundreds of thousands of dollars of compute to actually create such a leaderboard. So it's amazing to see. And they provide it literally for free, while the community is growing every day.
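The offline mode Yam mentions boils down to environment variables that huggingface_hub and transformers respect; a minimal sketch (the model-loading part is commented out since it assumes transformers is installed and the model is already cached):

```python
import os

# If a model was ever downloaded, it lives in the local cache
# (~/.cache/huggingface by default). Setting these BEFORE importing
# transformers tells it to use that cache and skip the network:
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# A previously-cached model then loads even while the Hub is down, e.g.:
#   from transformers import AutoModel
#   model = AutoModel.from_pretrained("gpt2")
print(os.environ["HF_HUB_OFFLINE"])  # 1
```

The same flags can be exported in the shell (`export HF_HUB_OFFLINE=1`) so every script in the session works from the local cache.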
It does cost, so huge shout out to them.</p><p>[00:21:22] <strong>Alex Volkov:</strong> I was trying to prepare</p><p>[00:21:23] <strong>Yam Peleg:</strong> We are all addicted, pretty much.</p><p>[00:21:25] <strong>Alex Volkov:</strong> Absolutely. I was trying to prepare yesterday for this space, and part of my preparation is reading X and Twitter, but definitely part of my preparation is going to Hugging Face, reading the model cards, reading the leaderboards, for example. I was trying to count in my head how much stuff we're getting for free from Hugging Face, and one such example is just their blog, which was also down, which I read specifically to prepare for the Matryoshka conversation today.</p><p>[00:21:50] <strong>Alex Volkov:</strong> And that's just a huge resource on its own. There's the whole conversation piece: there's the hub, but there are also the conversations. AK posts papers, for example, on Hugging Face, and then there are whole discussion threads about them as well. That wasn't accessible.</p><p>[00:22:04] <strong>Alex Volkov:</strong> The leaderboards themselves weren't accessible. And just the amount of compute, like you're saying, that they throw at us for free to support this open source is definitely worth a shout out, and definitely shout out to the engineers there that brought the hub back. Nisten, what are your thoughts on this?</p><p>[00:22:22] <strong>Nisten Tahiraj:</strong> Yeah, without Hugging Face, this place turned into a flea market for models. People were asking, does anyone have Qwen 72? And I was like, no, I have the fine-tune. And then the dev lead of Qwen pointed us to some Chinese site where they can download it. It was pretty...</p><p>[00:22:39] <strong>Alex Volkov:</strong> Wait. ModelScope is not just some Chinese site. ModelScope is where I think most of the Chinese folks are posting their models. It's, I think, modelscope.cn, the alternative in China.
So there is at least a backup for some of the Chinese models. Although I think you have to translate that website, right?</p><p>[00:22:59] <strong>Alex Volkov:</strong> But yeah, we had a conversation yesterday, and Far El was also talking about datasets, where many folks just upload the dataset, don't keep a local copy of it, and then, to be able to run evaluations or do different things like that, that also was prevented yesterday.</p><p>[00:23:14] <strong>Alex Volkov:</strong> Definitely, yesterday we discovered how big a part of many of our lives Hugging Face has become, and it was a sobering realization. But, I don't know, for me, I saw people complain online, and I get it, folks. I get it. Sometimes you complain. But honestly, as far as I understood, the downtime wasn't even entirely their fault.</p><p>[00:23:32] <strong>Alex Volkov:</strong> There was, like, a Mongo thing in AWS. I'm not sure, I didn't dive in deep. Just, when this happens, in my head, from when I dealt with downtimes before in my professional career, nothing but appreciation for the team working hard. And I think, Yam, Clem, the CEO, even responded to you when you said Hugging Face is down, right?</p><p>[00:23:55] <strong>Yam Peleg:</strong> To many people, not just to me, but yeah, they are responsive.</p><p>[00:23:59] <strong>Alex Volkov:</strong> Responsiveness, and being in the community and saying, hey folks, we understand, we're sorry about this. I think that's basically, besides having folks work on this actively, which we know they had, all we can ask for. So I'm just sending positive vibes and appreciation. I saw some people getting salty.</p><p>[00:24:17] <strong>Alex Volkov:</strong> I saw some people saying, oh, this sucks, and we need a backup. And I was like, yes, but also, this doesn't mean that you can ignore everything we've gotten for free so far from this incredible organization. So shout out.
And I don't work there, but I do have many friends who do.</p><p>[00:24:33] <strong>Alex Volkov:</strong> I think, yeah, Nisten, go ahead, and then we'll move on to the actual recap of everything we're going to talk about.</p><p>[00:24:39] <strong>Nisten Tahiraj:</strong> Yeah, and same for the leaderboard. We give Hugging Face so much crap when things don't work, and I really appreciated that it's actually the CEO that responds directly to your complaints and tickets, and it's not just some, like, support person. No, it's Clem. He's the actual CEO. They'll respond [00:25:00] They're the first ones to respond.</p><p>[00:25:01] <strong>Nisten Tahiraj:</strong> So that's pretty amazing. You don't really see it in other companies. Like, we don't expect the president of Microsoft, Brad Smith, to ever respond to a GitHub issue. Could you imagine that?</p><p>[00:25:12] <strong>Alex Volkov:</strong> He is not your favorite, huh? I would love Satya, though, to chime in on the discourse, but not Brad. Yeah, absolutely cannot imagine this, and kudos to them for the participation in the community.</p><p>[00:25:23] Open Source AI corner</p><p>[00:25:23] <strong>Alex Volkov:</strong> And I guess we should start with our usual thing, open source.
So let's start with open source. Alright folks, this is our regular update every week for the Open Source Corner. Interestingly, Mistral is not featured in the open source corner today, but we'll mention them anyway. Because from last week, if you guys remember, Gemma was released. It wasn't open source, it was open weights, but definitely Google stepped in and gave us two models to run. And since then, I just wanted to mention that many folks started using these models, and there are quite a few things that, yeah, I'm actually wanting to hear from you about, because we talked about this: the Gemma models are not necessarily seven billion parameters, right?</p><p>[00:26:24] Gemma from Google is hard to finetune and is not as amazing as we'd hoped</p><p>[00:26:24] <strong>Alex Volkov:</strong> This was a little bit of a thing. And also about fine-tuning. Could you give us, like, a brief rundown of how the last week was, in terms of Gemma's acceptance in the community?</p><p>[00:26:32] <strong>Yam Peleg:</strong> Oh, wow. Gemma is giving me a hard time, that's for sure. I've been fine-tuning Gemma, or at least struggling with fine-tuning Gemma, for a week at the moment. Okay, so starting from the beginning: Gemma is not exactly 7B. The way it is referred to in the paper is that the parameters in the model itself, apart from the embeddings, are exactly 7 billion.</p><p>[00:27:01] <strong>Yam Peleg:</strong> But then you add the embeddings and you're a little bit over 8.5, if I remember correctly. Which is fine, I don't think anyone has any problem with a bigger model. Just, I think it would be more genuine to just say it's an 8B parameter model. That's first.</p><p>[00:27:23] <strong>Yam Peleg:</strong> Second, it behaves differently than what we're used to with Mistral and Llama. I'm not sure why. Maybe someone can tell me, but I'm not sure why.
It behaves differently, and many people are currently working and struggling to fine-tune it better. This is where it is at the moment. I've seen Orca already.</p><p>[00:27:54] <strong>Yam Peleg:</strong> Someone fine-tuned on Orca and didn't get great results. I also heard that someone fine-tuned on Hermes, I think from Nous. I'm not sure, but I think so. Also, results are not great. I'm continuing pre-training, and the loss is doing whatever it wants. It goes down, and then out of the blue it starts to jump.</p><p>[00:28:16] <strong>Yam Peleg:</strong> I'm not sure exactly why. It might be because the architecture is slightly different; there are slight modifications. So maybe that, or maybe something else. But yeah, I think we're still exploring the model. We don't have an answer yet.</p><p>[00:28:35] <strong>Alex Volkov:</strong> Yeah, that's what I got as well. I pinned a few examples from Eric Hartford of Dolphin fame, I think he now works at Abacus, and Teknium as well, who tried to do some stuff, and all these losses look crazy, like they're jumping around up and down. I saw a tweet from Philipp Schmid from Hugging Face where they were able to fine-tune some stuff, and the conversation between Eric and Wing Lian from Axolotl.</p><p>[00:29:00] <strong>Alex Volkov:</strong> And there looks to be an effort to try and hone this thing and see if fine-tuning it actually works. The Hermes fine-tune was not really an official Nous Research thing. It looked like somebody just took the dataset, and folks weren't able to actually get it to run or perform well, as far as I saw. I haven't seen an update on this, but I'll definitely follow up with news.</p><p>[00:29:22] <strong>Alex Volkov:</strong> So I would just remind folks, last week we talked about how Gemma was well received.</p><p>[00:29:26] <strong>Alex Volkov:</strong> Everybody hopped on board, like, super quick and added support.
LM Studio and Ollama added support, like, super quick. Wing started adding support to Axolotl for fine-tuning. Hugging Face added support in, I think, Transformers. Tri Dao added support in Flash Attention. There's a whole community effort to receive Gemma as well as possible.</p><p>[00:29:47] <strong>Alex Volkov:</strong> And they also released some stuff in quantized versions from Google. So, very good effort from Google, and then a very big acceptance from the community. But since then, what I'm trying to highlight is, a lot of the way we judge models, whether they're good or not, is if they're fine-tunable, for example, that's one thing, but also if they're instruction-following, if it's easy to converse with them. I haven't seen any of this come across my timeline at all. I will be frank, I only interacted with the 2 billion parameter model, and wasn't impressed. It's great that they released it.</p><p>[00:30:20] <strong>Alex Volkov:</strong> I would not be using this for any of my workloads. Nisten, do you have any other feedback as well? Specifically around how Mistral 7B still seems to be a good alternative, even though it scores lower on evaluations.</p><p>[00:30:34] <strong>Nisten Tahiraj:</strong> Yeah, I feel like we have been spoiled by just how high a bar Mistral 7B has set for everyone, that it even made Mistral Large feel somewhat unimpressive, although it was answering everything perfectly well. But yeah, not only has it set a very high bar, it was also very easy to work with. So the amount of innovation that came upon the community, just building off of the initial weights, has made this class of models extremely competitive, such that even Google has a hard time cracking through.</p><p>[00:31:15] <strong>Nisten Tahiraj:</strong> Yeah, our expectations now for a 7B model are extremely high. It has to run on my phone. It has to do what I want.
It has to respond. It has to summarize stuff, has to carry forward the conversation. Oh, and it has to score high on the benchmarks too. And this pace of innovation that the community has set is just very hard to match, and also incredibly interesting to see that Google is having a very hard time matching or getting close.</p><p>[00:31:46] <strong>Alex Volkov:</strong> Specifically because, in the land of GPU-poor and GPU-rich, in the original article that defined the two categories, Google is the GPU slash TPU rich, right? They could have, and have, thrown a bunch of compute at these models, and still the folks from Mistral, a team of fewer than 30 people that started eight months ago, released a model.</p><p>[00:32:06] <strong>Alex Volkov:</strong> Six months ago? I think Mistral 7B is from around six months ago, right? September? That Google, six months after, with all the GPU richness, is barely able to match, not to mention beat significantly. Which is unlike any pace that we're used to. We're used to a 7B model beating a 70B model week after week.</p><p>[00:32:25] <strong>Alex Volkov:</strong> And here's a huge company coming out and saying, hey, here's our best attempt at a 7B model, one that Yam doesn't even consider a 7B model, and, at least in our attempts to play around with it, it's not winning significantly, which is strange. But it's also not able to get fine-tuned very easily.</p><p>[00:32:43] <strong>Alex Volkov:</strong> Very interesting, and a real highlight of how much quality the Mistral model had. I will also say that Arthur Mensch, and we'll cover this in the Mistral section afterwards, came out and basically said, we can only do so much with 1,500
H100s. 1,500 H100s! Just by contrast, Meta announced a few months ago, famously, Zuckerberg came out and said that by the end of this year they're going to have the equivalent of 600,000 H100s worth of compute, 600,000 H100s, to train, host, and probably do inference on Llama at Meta.</p><p>[00:33:19] <strong>Alex Volkov:</strong> And [00:33:20] this is, like, 1,500 H100s that Mistral was able to use to train and fine-tune a model that Google cannot wipe off the board completely.</p><p>[00:33:29] Llama 3 won't be released until June 2024</p><p>[00:33:29] <strong>Alex Volkov:</strong> It's very crazy. Moving on to basically another news update that's not a news update. We've been waiting for Llama 3, and every week I've been saying, hey, it could get released, et cetera.</p><p>[00:33:41] <strong>Alex Volkov:</strong> There was a leak from The Information. I actually don't know if it was a leak or not, but The Information came out, and then a bunch of other outlets followed, with news that Llama 3 will be released, I think, in June. That was the update: Llama 3 will not get released for us anytime soon.</p><p>[00:34:00] <strong>Alex Volkov:</strong> We were hoping for a one-year anniversary. Llama 1 was released in February 2023, and now we're not gonna see Llama 3, even though it has, like, finished training, as far as I understood, or as far as the updates said. And while Zuckerberg goes and eats at McDonald's, Llama 3 will not get released to us. I wanted to hear folks here on stage react to this, because it's surprising news, isn't it?</p><p>[00:34:23] <strong>Alex Volkov:</strong> Ha,</p><p>[00:34:24] <strong>Nisten Tahiraj:</strong> I was gonna say that I called it, just based on how censored and unwilling to answer anything Code Llama 2 was. So yeah, if Code Llama 70B wouldn't answer anything, I figured it would be pretty much the same around Llama 3. So now they either have to go way back in the training.
Back to when they started doing a lot of this, and retrain it to be a lot more obedient, but still not horrible or anything, because we see from Mistral's team that a model can obey you and respond to stuff, but still won't tell you, like, how to kill your cat and stuff. So yeah, the public backlash from it,</p><p>[00:35:12] <strong>Nisten Tahiraj:</strong> people giving it to Gemini and Google, has completely affected the Llama 3 release, which is just very interesting.</p><p>[00:35:19] <strong>Alex Volkov:</strong> Interesting. Because they didn't release Llama 1, and then nothing bad happened in the world. And then they released Llama 2 with a commercial license, so people could actually use it, which kickstarted a bunch of open source stuff. And now they're waiting with Llama 3. Potentially, I heard some stuff where it could be a GPT-4-matching model that we could run.</p><p>[00:35:40] <strong>Alex Volkov:</strong> But we don't know until it's released. It's just a very interesting update. And I gotta wonder, by the time they decide to release this, whether other open source models will have caught up or not. Usually, when Llama comes out with a big model, it's impressive. But, for example, Code Llama was already beaten by the time it came out, right?</p><p>[00:35:57] <strong>Alex Volkov:</strong> If I'm not mistaken, DeepSeek Coder and other models achieved the same coding scores that Code Llama was released with. Maybe they're waiting a little bit. I gotta wonder what goes into this decision. Which, on the topic of code,</p><p>[00:36:10] StarCoder 2 and Stack V2 open source from Hugging Face</p><p>[00:36:10] <strong>Alex Volkov:</strong> I think we're moving to the next thing.
And StarCoder 2 and The Stack v2 were released, in collaboration with Hugging Face.</p><p>[00:36:17] <strong>Alex Volkov:</strong> The Stack v2 is the second iteration of The Stack dataset, which is just an insane amount of collected code.</p><p>[00:36:25] <strong>Alex Volkov:</strong> I think The Stack v2 now includes opt-outs, so you can say, hey, I want my code to be opted out of The Stack v2. And this new dataset is, I want to believe, 10x more than the first Stack. And StarCoder, the 15 billion parameter model, beats Code Llama 13B on pretty much everything: HumanEval+, DS-1000, GSM8K.</p><p>[00:36:49] <strong>Alex Volkov:</strong> Very impressive. It beats, obviously, the previous StarCoder, which was a very significant model. Based on the evaluations, DeepSeek Coder, we know, was one of the best code models so far. And it looks like StarCoder competes with DeepSeek Coder on a few benchmarks, and beats it on everything else, at least for the 7B model.</p><p>[00:37:09] <strong>Alex Volkov:</strong> But it's a model twice DeepSeek's size as well. So they released three models: 3, 7, and 15 billion parameter versions. The 15 billion parameter one is in a very interesting place, where you could potentially still run this on your Mac, if your Mac is stacked up, and get a decent result back.</p><p>[00:37:26] <strong>Alex Volkov:</strong> It has a 16K context window, 16,384 tokens. It was trained on up to 4 trillion tokens, depending on the size of the model, and includes 600-plus programming languages, which is great. All we care about is probably Python and JavaScript, and maybe some folks care about Rust, but 600-plus programming languages, I honestly didn't even know there were that many.</p><p>[00:37:51] <strong>Alex Volkov:</strong> It gets 46 percent on HumanEval, which is okay; I've seen models that get way better than 46 percent, so that's interesting.
And what else is interesting? It's a commercially friendly license, so you can use this for commercial stuff.</p><p>[00:38:06] <strong>Alex Volkov:</strong> It can be used for local copilots, which is something we're waiting for, and the more of this, the better. And yeah, StarCoder 2. But I also want to shout out The Stack v2: the more data we get, the better it is for everybody else and other models as well, and The Stack v2 is definitely a great achievement.</p><p>[00:38:23] <strong>Nisten Tahiraj:</strong> Yeah, this is crazy. The full dataset is 67.5 terabytes for The Stack v2, and you can just have it for free. The amount of work! So it's 900 billion extra tokens that went on top of what was already an excellent coding model to begin with. So this is huge, not just because of the model itself, but also because you can just,</p><p>[00:38:47] <strong>Nisten Tahiraj:</strong> I don't know, fine-tune one for TypeScript, if you want.</p><p>[00:38:50] <strong>Alex Volkov:</strong> Yep. Yeah, go ahead.</p><p>[00:38:53] <strong>Yam Peleg:</strong> Yeah, I think it's worth mentioning that, as far as, I haven't looked at it in depth because Hugging Face was down, but as far as I understand, it's a base model. When we compare the HumanEval of a base model to a model that was specifically fine-tuned to obey instructions, and we see a result that is, okay, not the same, but somewhere in the ballpark,</p><p>[00:39:18] <strong>Yam Peleg:</strong> it's amazing, because it just means that as soon as it gets fine-tuned, it's going to be incredible.
Moreover, from what I've seen of the paper, I heard about it, and I was sure that I was going to open the paper and see something like, hey, we did the same thing, but huge, 4 trillion tokens, enjoy.</p><p>[00:39:38] <strong>Yam Peleg:</strong> But no, what you see over there is that they really went in depth into the benchmarks themselves and checked what each benchmark actually measures, and how it correlates to real-life usage.</p><p>[00:39:59] <strong>Yam Peleg:</strong> They went and benchmarked different packages, each and every one: how good is it with Matplotlib? How good is it with SciPy? It's very detailed, high-quality work. It's very hard to say which is better as a base model, DeepSeek or StarCoder, because there are so many benchmarks in the paper I've never seen before. DeepSeek has, I think, six benchmarks; StarCoder, I didn't even count, there are so many. And I think it's great work. I suppose the model is really good, at least on the level of DeepSeek, although I don't know, I need to check. But just the paper alone is such a huge contribution, the paper and the datasets, so yeah, it's amazing.</p><p>[00:40:40] <strong>Yam Peleg:</strong> And it went a little bit silent. People just release models that were trained on 4 trillion tokens, and it goes silent nowadays. It's amazing that we got numb to something that's insane.</p><p>[00:40:53] <strong>Yam Peleg:</strong> And on the same week, on the same week, NVIDIA released a model. I don't think they actually released the model, but they trained a model on 8 trillion tokens.</p><p>[00:41:03] <strong>Yam Peleg:</strong> And we don't even talk about it. It's just insane.</p><p>[00:41:06] <strong>Alex Volkov:</strong> Let's talk about it. I saw the NVIDIA stuff, but I don't see a release.
I saw an announcement, right?</p><p>[00:41:12] <strong>Yam Peleg:</strong> Yeah, it was a paper, and I think that's about it. NVIDIA is showing what they can do because they've got the best hardware. So they can train on a lot of tokens really fast, and the model is really good at the end because of, well, the tokens. But yeah, I'm just saying that it's increasing: the amount of data is increasing, the size of the models that we actually use is increasing, and worth noting, [00:41:40] there is a trend of things getting more and more powerful.</p><p>[00:41:45] <strong>Alex Volkov:</strong> Absolutely. And I would just say, this is partly what we're here for: to highlight things like this in open source, shout out the folks who worked hard on releasing this, and make sure it didn't go silent, because this effort is very much appreciated. If it's a base model, then we'll get local copilots performing way better.</p><p>[00:42:04] <strong>Alex Volkov:</strong> And this is great, especially the dataset they released, 10 times the size of the previous one, it's called The Stack, and folks will be able to use this to fine-tune other models. And that's obviously also great.</p><p>[00:42:15] Argilla releases OpenHermesPreferences</p><p>[00:42:15] <strong>Alex Volkov:</strong> And on the topic of datasets, if you guys remember, we've talked about Argilla multiple times at this point. Shout out to the Argilla folks, and if you want to come up and talk about Argilla, your place is here.</p><p>[00:42:27] <strong>Alex Volkov:</strong> They released a DPO conversion of Teknium's Hermes dataset, called OpenHermesPreferences.
And as we've talked about Nous Research and Hermes multiple times: this is one of those datasets, I think a million rows, compiled from different other datasets as well.</p><p>[00:42:45] <strong>Alex Volkov:</strong> And Argilla is an open source tool that allows you to make datasets better by converting them to preferences for DPO. So they released the DPO (direct preference optimization) version, where basically they take a regular RLHF dataset, with one response per instruction in a conversation, and turn it into a kind of preference dataset, where they show a few responses and actually have information about which would be the more preferable</p><p>[00:43:12] <strong>Alex Volkov:</strong> response. That's a very poor explanation of DPO. Yam, if you want to chime in here and clean this up, feel free. And Argilla released OpenHermesPreferences, which is a 1-million-preference dataset on top of Teknium's. And it's pretty remarkable, because we know that even at Nous Research, when there are DPO versions of their models, they perform better than the regular SFT fine-tuned models on pretty much every benchmark.</p><p>[00:43:40] <strong>Alex Volkov:</strong> And now they've converted all of that dataset into a preference dataset. They've created the responses with another Hermes model, which is pretty cool, right? So they're not using OpenAI, because scraping from OpenAI is, as we saw in the lawsuit with OpenAI, against the terms of service.</p><p>[00:44:02] <strong>Alex Volkov:</strong> But you can actually create these preferences with another model. So they're using Nous Research's Hermes 2 Yi, on top of Yi-34B, to do what's called distilabel, and make those preferences a little better. And this dataset is open.
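The shape of that conversion is roughly: a plain SFT row has one response per prompt, while a DPO preference row has a chosen and a rejected response. A hypothetical sketch (field names are illustrative, not the exact OpenHermesPreferences schema, and the "judge" decision is just passed in as a flag here):

```python
# A plain SFT-style row: one prompt, one response.
sft_row = {
    "prompt": "Explain what DPO is in one sentence.",
    "response": "DPO trains a model directly on pairs of preferred and rejected answers.",
}

def to_preference_row(sft_row, alternative, prefer_original=True):
    """Turn an SFT row into a DPO-style preference row with a
    'chosen' and a 'rejected' response. In OpenHermesPreferences the
    alternative comes from another model and a judge ranks the two;
    here the ranking is supplied by the caller."""
    chosen = sft_row["response"] if prefer_original else alternative
    rejected = alternative if prefer_original else sft_row["response"]
    return {"prompt": sft_row["prompt"], "chosen": chosen, "rejected": rejected}

pref = to_preference_row(sft_row, "DPO is a kind of database operation.")
print(sorted(pref))  # ['chosen', 'prompt', 'rejected']
```

DPO training then pushes the model's likelihood toward the `chosen` response and away from the `rejected` one for the same prompt, which is why the pair structure matters.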
So unlike some other preference datasets, this dataset is open for you to also go and fine tune your models, which is pretty cool.</p><p>[00:44:24] <strong>Alex Volkov:</strong> And shout out to the OpenHermesPreferences folks. I'm gonna pin this to the top of the space and I will also definitely add this to the show notes.</p><p>[00:44:32] Function calling leaderboard from Berkeley</p><p>[00:44:32] <strong>Alex Volkov:</strong> Okay, let's move on in our conversation. I want to talk about the function calling leaderboard because I think it's pretty cool. Let me just go and find this tweet real quick.</p><p>[00:44:44] <strong>Alex Volkov:</strong> There was an effort before called Gorilla, and now the same folks from Berkeley released a leaderboard called the Berkeley Function Calling Leaderboard. Essentially, function calling, for those who don't use any open source model but use something like OpenAI: during last summer, I think, OpenAI answered everybody's request to give us structured outputs in the form of JSON with, hey, we're going to introduce something called function calling, where you call our model and you provide one function or several functions from your code, and the model will respond and say, hey, you should call this function, and with these parameters.</p><p>[00:45:23] <strong>Alex Volkov:</strong> And basically, instead of getting JSON mode, we got function calling back then. Now we have both: we have a way to get just structured JSON, but also we get models to respond with which functions we should call. And this is great for agents, this is great for folks who are building with these models.</p><p>[00:45:38] <strong>Alex Volkov:</strong> And I think during the summer, because OpenAI came up with this concept, OpenAI was the only model that was supporting this. And then quickly, open source started catching up. 
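The flow described above, where you pass the model a JSON schema of your functions and it replies with a name plus arguments, can be sketched roughly like this. The tool definition follows the OpenAI-style convention, and the model reply is mocked up, since the exact wire format varies by provider:

```python
import json

# An OpenAI-style function/tool definition: a name, a description, and a
# JSON Schema for the parameters. The model never executes anything; it
# only tells you which function to call and with what arguments.
get_weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# A mocked-up model reply in the shape such APIs return (illustrative only):
model_reply = {
    "function_call": {
        "name": "get_weather",
        "arguments": json.dumps({"city": "Denver"}),
    }
}

def dispatch(reply: dict, registry: dict):
    """Look up the named function and call it with the model's arguments."""
    call = reply["function_call"]
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return registry[call["name"]](**args)

result = dispatch(model_reply, {"get_weather": lambda city: f"Sunny in {city}"})
```

Your code does the actual dispatch, which is exactly what the leaderboard measures: whether the model names the right function with well-formed arguments.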
And I think, Nisten, correct me if I'm wrong, but I think Jon Durbin's Airoboros has a bunch of function calling instructions in it.</p><p>[00:45:54] <strong>Alex Volkov:</strong> So this model, and then models that were trained on Airoboros, were also fairly okay with function calling. Mistral just released their update, so Mistral supports function calling.</p><p>[00:46:05] <strong>Nisten Tahiraj:</strong> They had about a thousand. About a thousand function calling entries in the Airoboros 2 dataset, or I forgot. Just look up Jon Durbin, J O N Durbin, and Airoboros, A I R O B O R O S, dataset. And, yeah, apparently there's about a thousand entries in there for function calling. That by accident helped a lot of the other models be better at function calling too.</p><p>[00:46:29] <strong>Alex Volkov:</strong> Yeah, so every other model that was trained on Airoboros, which is a lot, Hermes includes the Airoboros dataset. I don't know if this is by accident or if this is just how things work now in the merging world, and in the fine tuning on top of datasets that fine tune on top of other datasets, right?</p><p>[00:46:44] <strong>Alex Volkov:</strong> But definitely other open source models now support at least the notion of function calling, and eventually we get to the point where there's now a leaderboard, which we like. So, we're going to talk about embeddings later, there's the MTEB leaderboard for different embedding models, even though I see Bo in the audience and he's not very happy with how easy it is to game that leaderboard.</p><p>[00:47:07] <strong>Alex Volkov:</strong> We obviously look at the open source LLM leaderboards, and Yam was talking about submitting a few things there and seeing how they perform, and that's been exploding in popularity with merging. So it's great to have a function calling leaderboard as well. 
And the folks at Berkeley test models, I think API only, I don't know if they're supporting open source at this point, and look at</p><p>[00:47:28] <strong>Alex Volkov:</strong> how you could expect performance on different function calling tasks, and I think for folks who are building with this it's very cool. So some of the models that are leading this leaderboard: GPT-4, the latest preview from January, is leading. They have something called Open Functions V2, and I think the organization that put this up, Gorilla LLM, is the folks behind it, and it has an Apache 2 license, and they have an average score across simple function, multiple functions, and parallel functions, different scores for all of these tasks.</p><p>[00:48:08] <strong>Alex Volkov:</strong> And I just want to highlight this and add this to the show notes, because more and more we see Mistral Medium entering, there's Claude from Anthropic, and open source models. And I think for many folks building agents, building with these models, this type of interaction with the model is very important, where it's not only a textual prompt and you get something back, you actually need to do something with it. So a shout out to the folks building and maintaining this leaderboard.</p><p>[00:48:34] <strong>Alex Volkov:</strong> And I think they also released the Gorilla model as well. Let's move on, I think this is it, folks. I think this is everything we have to talk about in the open source LLMs.</p><p>[00:48:42] <strong>Alex Volkov:</strong> And given that, Connor, Storm is in the area of open source ish, let's cover Storm a little bit.</p><p>[00:48:49] <strong>Alex Volkov:</strong> I think this is a good time, because it also dances on the area of interest that we talked about last time. 
Do you want to present Storm and talk about this and see how cool this is?</p><p>[00:48:58] <strong>Connor Shorten:</strong> Yeah, cool. I guess maybe let me say one more thing on Gorilla. I think it's fascinating going through the functions that they have. If you go through the Open Functions blog post from Berkeley, you have calculate triangle area, and then you give it the base and the height. And I think that kind of super specific function, having a massive dataset of that.</p><p>[00:49:16] <strong>Connor Shorten:</strong> It's fascinating seeing this next evolution of that. But okay, so with Storm, yeah, there's definitely some intersection between DSPy and the function calling models. With DSPy, one of the built-in signatures is the ReAct one, where with ReAct you have thought, action.</p><p>[00:49:33] <strong>Connor Shorten:</strong> And so it's one way to interface with tools. Yeah, the tool thing is pretty interesting. I think it's also really super related to the structured output parsing and the please-output-JSON, and Jason, our favorite influencer of the function calling</p><p>[00:49:47] <strong>Alex Volkov:</strong> I just want to make sure that folks don't miss this. Jason Liu is the guy you're referring to, and he's our favorite influencer in forcing these models to output JSON. I find it really funny that a guy named [00:50:00] Jason is the guy who's leading the charge of getting these models to output JSON formatted code.</p><p>[00:50:04] <strong>Alex Volkov:</strong> I just find it really funny. Didn't want to skip this. I wanted to plug that joke somewhere, but please go ahead and let's talk about Storm. 
Oh, and a shout out to both Weights & Biases and Connor on Weaviate, Jason appeared in both places talking about the Instructor library and how to get these models to give a structured output.</p><p>[00:50:21] <strong>Alex Volkov:</strong> So definitely shout out to Jason for this, check out his content on both platforms.</p><p>[00:50:29] <strong>Connor Shorten:</strong> Yeah, awesome. Yeah, it's such a huge part of these LLM pipelines. Like, I know Bo is going to speak in a bit, who's someone I consider one of the experts in information retrieval. And one of these big things is you will retrieve and then you'll re-rank and then you'll generate. And if it doesn't follow the output exactly, you can't parse it in the database.</p><p>[00:50:47] <strong>Connor Shorten:</strong> So it's such a massive topic, but okay.</p><p>[00:50:50] Stanford introduces STORM - long form content grounded in web search</p><p>[00:50:50] <strong>Connor Shorten:</strong> So starting with Storm, I guess I can tell a funny story about this. Erica and I were hacking on this and we came up with the plan of: you start off with a question, and then you do retrieval, and so you're looking at the top five contexts, as well as the question, and you use that to produce an outline.</p><p>[00:51:06] <strong>Connor Shorten:</strong> And again, structured output parsing, that outline better follow the comma separated list, so that then you can parse it, and then you'll loop through the topics. And then we have a topic-to-paragraph prompt, where, you know, you're doing another retrieval now with the topics. And then we have the proofreader and then the blog-to-title.</p><p>[00:51:26] <strong>Connor Shorten:</strong> So that's the system that we got our hands on. And I could probably talk about that better than the STORM system, but it's very similar. With STORM, the difference, so the difference is we're retrieving from a Weaviate index with Weaviate blog posts. 
Let's make it as much Weaviate as we can. But they replaced the specific retriever with a web search retriever. So I was playing with that a bit on the weekend as well, using the You.com API as the web search, and it's pretty cool, web search as well as a private thing that you curate. I think that's definitely one of the big topics.</p><p>[00:51:56] <strong>Connor Shorten:</strong> Okay, so then the interesting thing is once you've got this, in our case, as a four layer system, now you use DSPy to compile it. So what compiling entails in DSPy is tweaking the task description as well as producing input output examples. So in the prompt, you slightly change it from, you'll take a topic and write it into a blog post.</p><p>[00:52:19] <strong>Connor Shorten:</strong> Typically, that ends up resulting in a blog post about software documentation, right? So that's what that ends up looking like. And then the input outputs end up being, like, an example of what are cross encoders, here's a blog about cross encoders. So you can use that input output to then reason about the new inference. So hopefully that's a good description of what it means to compile these programs, where you optimize the prompts for each layer as you decompose the task into its subtasks.</p><p>[00:52:45] <strong>Connor Shorten:</strong> Storm then introduces something that I think is pretty novel, which is how you do that research loop. So we naively just went question to outline and then just instantly fleshed out the outline, whereas they instead go from question to perspectives about the topic. 
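The four layer system Connor describes, question to outline, then topic to paragraph, then proofread, can be sketched as a plain pipeline of stages. In real DSPy each stage would be an LLM-backed module whose prompt gets compiled against traces from a stronger model; everything below is a hand-written stand-in, not actual DSPy code:

```python
# Each function stands in for an LLM-backed module. In the real system these
# would be prompted models whose task descriptions and few-shot examples are
# tuned ("compiled") from input/output traces of a high-capacity model.

def make_outline(question: str) -> list[str]:
    # Real version: retrieve top-k contexts, then prompt for a parseable list.
    return [f"Background on {question}", f"How {question} works", "Conclusion"]

def topic_to_paragraph(topic: str) -> str:
    # Real version: a second retrieval scoped to this topic, then generation.
    return f"{topic}: ..."

def proofread(text: str) -> str:
    # Real version: an editing pass by another prompted model.
    return text.strip()

def write_post(question: str) -> str:
    """Run the staged pipeline: outline, expand each topic, proofread, join."""
    outline = make_outline(question)
    paragraphs = [proofread(topic_to_paragraph(t)) for t in outline]
    return "\n\n".join(paragraphs)

post = write_post("cross encoders")
```

The structural point is that each stage has a typed input and output, which is what makes every layer independently optimizable, and what STORM's perspective-guided retrieval slots into between the question and the outline.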
And you retrieve from each of the perspectives about the topic, and then you'll write it, and then, I'm not sure how it all gets resolved, but it's almost like a multi-agent system in my view, this kind of perspective-guided approach, adding personas or background.</p><p>[00:53:18] <strong>Connor Shorten:</strong> So I think that's probably the key differentiator between Storm and that kind of blog post system that I described. But we have open source code on Weaviate Recipes, if you want to see what our four layer program looks like and compiling that with the Bootstrap Optimizer.</p><p>[00:53:35] <strong>Connor Shorten:</strong> With the Bootstrap Optimizer, you just run a forward pass through the program with a super high capacity model like GPT-4 to get the input output, and then you hope that Turbo or one of the cheaper or open source models can look at those input output examples and then copy the system behavior.</p><p>[00:53:51] <strong>Connor Shorten:</strong> There's a lot of other interesting things about this, like multi-model systems. Even in the Storm paper they compare GPT Turbo, and then they use Mistral 7B Instruct as the judge. Another thing is, earlier we were talking about re-ranking. You might want to have the long context models do re-ranking, because with re-ranking you typically try to give it a lot, because you're trying to put a band-aid on the search.</p><p>[00:54:13] <strong>Connor Shorten:</strong> So you probably want to have 20 to a hundred results that go into the re-ranker rather than five to 10. And it's probably also not really a task for LLMs anyways. 
And I think that's another opportunity for a task-specific model. But overall, to conclude this thing about Storm, I think for me the big exciting thing is, DSPy is making it super clear, I think, how to build more than chatbots or just simple question answering.</p><p>[00:54:40] <strong>Connor Shorten:</strong> I think we're probably within a few months from: anytime you have a pull request, the documentation will be written for you automatically. Probably you could even have an idea and have a pull request created by the model. I'm personally biased by coding applications, but yeah. This kind of long form content generation by breaking down each task and then optimizing each part of the task.</p><p>[00:55:05] <strong>Connor Shorten:</strong> It's all just really interesting.</p><p>[00:55:07] <strong>Alex Volkov:</strong> Very interesting. And I added Storm, from Yijia Shao, to the show notes as well, and folks, it's definitely worth checking out, because it writes Wikipedia-length articles and uses the You.com API or different search APIs to give perspectives and references. Very interesting. For the sake of time I want to move on, so just to reset the space: we've been at this for almost an hour. You guys are on ThursdAI.</p><p>[00:55:33] <strong>Alex Volkov:</strong> ThursdAI is the weekly podcast and newsletter that's recorded live on X Spaces. And I'm here with several friends and guests and experts in different fields. And we've been covering open source LLMs until now. And I think we're going to move into big companies, because we need to cover this. And soon we're going to have some folks do a deep dive about embeddings.</p><p>[00:55:51] <strong>Alex Volkov:</strong> And let me just make sure that the folks know that they can come up. 
The big companies' LLMs and APIs, this is the segment where we chat about OpenAI and Microsoft and Google and whatnot. Not the models that they released for us in open weights and open source that we can run ourselves, this is the segment where we talk about API developments and different updates.</p><p>[00:56:13] <strong>Alex Volkov:</strong> So let's run through them.</p><p>[00:56:14] Mistral releases Mistral Large & Le Chat interface</p><p>[00:56:14] <strong>Alex Volkov:</strong> The biggest one from this Monday was Mistral releasing Mistral Large, which we've been waiting for and getting excited about. And they also released a chat version of their models called Le Chat. And it's very impressive, folks. Mistral Large now, based on at least some metrics that they released, is second only to GPT-4, and beats Claude from Anthropic and Gemini Pro on the MMLU score.</p><p>[00:56:43] <strong>Alex Volkov:</strong> And Mistral Large is vastly superior to Mistral Medium, handles 32k tokens of context natively, and is fluent in English, French, Spanish, German, and Italian. It highlights how much Mistral is focusing on becoming the OpenAI alternative from Europe, because you can go to Le Chat and execute every chat that you have with their models.</p><p>[00:57:09] <strong>Alex Volkov:</strong> And basically, maybe you don't have to have an OpenAI subscription. I think that's what they want to do. But also, this model is available in the API, and its performance on the other languages is significant on top of everything else. And they're aiming for the five top languages in Europe, obviously, and I think it's a very important move of theirs, establishing themselves as this big company.</p><p>[00:57:32] <strong>Alex Volkov:</strong> This is why we moved them to the big company APIs as well. The announcement also includes something interesting. 
They said, we have also updated Mistral Small in our API to a model that's significantly better and faster than</p><p>[00:57:45] <strong>Alex Volkov:</strong> the Mixtral 8x7B. If you remember when we talked about Mistral releasing API access, we said that whatever Mistral Next is, it's probably going to be Medium. So now we have a large model that outperforms pretty much every model besides GPT-4 on different tasks, at least according to them, but also a small model that's faster and better.</p><p>[00:58:06] <strong>Alex Volkov:</strong> They upgraded this behind the scenes. They haven't released any of this in open weights. And the response from the community was partly: this is Mistral releasing a bunch of stuff, and none of the stuff we expected. No torrent links this [00:58:20] time, no open models that we can start fine tuning.</p><p>[00:58:22] <strong>Alex Volkov:</strong> And I think, so first of all, kudos on this release. I've used some of the stuff in the chat, and I'm very happy with the responses. They're fairly quick, and definitely giving good responses. Nisten, I think your perspective from before, from the open source segment, is very interesting, where they spoiled us so much with the open models, with the Mixtral models, and even the 7B, that even Large doesn't seem that significantly better.</p><p>[00:58:45] <strong>Alex Volkov:</strong> However, just on the metrics, it looks like we just got another competitor in the ring. Now there's Google's Gemini Pro, Anthropic's Claude keeps releasing models that are less performant, at least on LMSys, than the previous models, and now Mistral, not only doing fully open weights, open source, but also in the API.</p><p>[00:59:03] <strong>Alex Volkov:</strong> And if folks want to build on top, they can. An additional thing to this: they also announced a partnership with Microsoft, and that these models are also going to be distributed through Azure. 
And I think this is a big deal for companies who maybe don't want to trust a startup that's less than one year old from Europe, for example, and maybe their servers are in Europe, maybe the companies don't want to trust their ability to stay up because there's only like 30 people, or enterprises need more stuff like ISO and different things.</p><p>[00:59:34] <strong>Alex Volkov:</strong> And so I think it's a big deal that Microsoft is now also supporting and giving us access to these models through Azure, especially for companies that want stability. Not Stability the company, just stability in general. I want to just mention that, if you guys remember, after Dev Day OpenAI went down, not for a week, but there was a whole period where OpenAI had a lot of issues in production, and the Azure version of OpenAI stayed stable.</p><p>[01:00:00] <strong>Alex Volkov:</strong> Obviously Microsoft wants to sell their cloud, and I do believe this is a very big deal that Mistral is now supported through Azure as well. In addition, Microsoft also announced a small stake in Mistral, and Arthur, the CEO of Mistral, went and clarified. So first of all, their new website with these announcements, again, didn't include some stuff, or included a note that you shouldn't train on this, right?</p><p>[01:00:22] <strong>Alex Volkov:</strong> And then our friend Far El here, for the second time, called them out publicly, and for the second time Arthur Mensch, the CEO of Mistral, came and said, whoops, gone. And so it does seem like an omission rather than something they put in on purpose and then removed after Twitter called them out.</p><p>[01:00:38] <strong>Alex Volkov:</strong> Far El, thank you for noticing. But also, some other folks noted that their commitment to open source, which we discussed before, was gone from the website. And they put it back. 
And so now, even though this time they didn't release any open weights for us, their commitment to open source is prominently featured at the top of their website.</p><p>[01:00:59] <strong>Alex Volkov:</strong> And now there's two segments there. One of them is optimized models, they call them, and one of them is open weights models that they released for the community. As we talked about previously in the open source segment, their models from six months ago are still competing with something like the new and cool Gemma 7 billion parameters.</p><p>[01:01:15] <strong>Nisten Tahiraj:</strong> It's still a 32k context window, by the way, I measured, and after that it completely forgot. And also, it was okay. I was expecting, as a chat model, it to be way more chat optimized, but it does feel more like a base model. And yeah, again, as I said in the comments before, we're so spoiled by all the 7B and Mixtral finetunes and merges</p><p>[01:01:43] <strong>Nisten Tahiraj:</strong> that now, even though this is extremely good and very utilitarian, and if your business needs it you should use it because it provides reliable answers, we were just expecting more.</p><p>[01:01:56] <strong>Alex Volkov:</strong> So one thing definitely to note as well, and we mentioned this a little bit, but it's definitely worth mentioning: the smaller model is now upgraded and better. So if you play with this, note that they also updated the pricing for it. And I would also caution folks, the tokenizer they use is a different tokenizer.</p><p>[01:02:10] <strong>Alex Volkov:</strong> So sometimes when you measure tokens, they may look different. Our friend Xenova here in the audience has a tokenizer playground on Hugging Face, which, by the way, along with the rest of Hugging Face, went down yesterday. So I went to check just the length of a string. 
I wasn't able to, it was sad, but now it's back.</p><p>[01:02:24] <strong>Alex Volkov:</strong> So that playground, I think, measures OpenAI token lengths, and Mistral, I think, has a different tokenizer. So when you calculate pricing for your use, definitely make sure that you're calculating the right thing. You're welcome to come up and tell us about this. So one last thing on Mistral is that it supports function calling as well, which I think is a big deal.</p><p>[01:02:41] <strong>Alex Volkov:</strong> And we mentioned this before in the function calling leaderboard. And now Mistral models can also respond to your RAG applications or whatever with the actual functions that you should call, which I think is super cool. And the industry moves there, and it shows again that OpenAI can come up with something</p><p>[01:02:57] <strong>Alex Volkov:</strong> a year ago and basically set the standard for how things should look. I actually don't know if the Assistants API is going to be like this, but I do know that, for example, we talked about Groq, and Groq supports the OpenAI standard. And many of these, I don't know if Mistral does, but many, like the Together API and I think Perplexity as well, all of them have their own version of their API, but you can also just point whatever code you wrote for OpenAI at a different proxy URL.</p><p>[01:03:24] <strong>Alex Volkov:</strong> And then you basically use the same structure that OpenAI innovated on, so that I think is pretty cool. Moving</p><p>[01:03:32] <strong>Nisten Tahiraj:</strong> Yeah,</p><p>[01:03:33] <strong>Connor Shorten:</strong> Also just a note, the OpenAI pip package allows you to actually call any URL, doesn't matter if it's OpenAI or not, that uses that standard. It is very easy to drop in any replacement to the OpenAI</p><p>[01:03:49] <strong>Alex Volkov:</strong> Yeah, including local ones. 
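The drop-in trick works because these servers all speak the same chat completions wire format, so swapping providers is just swapping the base URL. A stdlib-only sketch, where the URLs and model names are placeholders for whatever your server (hosted, or a local one like LM Studio's) actually exposes:

```python
import json
import urllib.request

def build_payload(model: str, user_message: str) -> dict:
    """Build the OpenAI-style chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": user_message}]}

def chat(base_url: str, model: str, user_message: str, api_key: str = "not-needed") -> str:
    """POST to any OpenAI-compatible server; only base_url differs per backend."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(model, user_message)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # local servers often ignore the key
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Same code, different backends; only the base URL changes:
# chat("https://api.openai.com/v1", "gpt-4", "hi")       # hosted API
# chat("http://localhost:1234/v1", "local-model", "hi")  # e.g. a local server
```

The official client library does the same thing for you when you override its base URL, which is the note Connor makes about the pip package.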
If you use LM Studio, our friends at LM Studio, shout out Yags, or Ollama, I think both of them will expose a local server when you run the open source models. And then you can put in your code your local URL that runs the server with the local model, and then your code will also work, which is, yeah, thanks for that.</p><p>[01:04:08] <strong>Alex Volkov:</strong> This is a very cool thing that people may have missed. The same can be said, by the way, about Copilot. It's a little harder, but basically you can replace the Copilot infrastructure in VS Code with local models, if they support it, if you go through settings. But moving on to, I guess, Google teasing Genie, right?</p><p>[01:04:26] <strong>Alex Volkov:</strong> Google teases Genie, which is quite incredible. You take one image of something that your kid drew that has a character, and you provide it to this text-to-video or image-to-video type model. And in response, you get a full world that is interactive, where the character, in the same style, looking pretty much the same, is interacting in this world. Seeing this is unbelievable, because it just shows that we're very close to being able to take one picture and start animating it. It's very worth adding this to the top and adding a video for this.</p><p>[01:05:05] <strong>Alex Volkov:</strong> It's really hard to explain in words, and I haven't read the paper, but Genie was really mind blowing as well. From Google, and they only teased it, so we don't know actually if they're gonna release this. Far El, you wanted to comment? I saw your</p><p>[01:05:20] <strong>Far El:</strong> Sure. If any of you have watched sentdex's YouTube video from a few years ago about GameGAN from NVIDIA, it's basically GameGAN, but with generative AI. 
And it's pretty awesome, because it means that we're all headed towards the direction of basically interactive rendered worlds.</p><p>[01:05:43] <strong>Far El:</strong> And Sora is one extreme end of that, with really high quality text to video. But then what happens when you actually add actions into the loop? And that's basically what Genie does. So we're probably going to see the marriage of both methods, both architectures, very soon. Very exciting work for sure.</p><p>[01:06:04] OpenAI to buy WordPress & Tumblr data</p><p>[01:06:04] <strong>Alex Volkov:</strong> And so I think most of the open and big companies stuff we covered. One quick thing before we move on:</p><p>[01:06:10] <strong>Alex Volkov:</strong> OpenAI opens up fine tuning for 3.5, and OpenAI is also in the news again this week because WordPress and Tumblr, basically, I think both of them are the same company, Automattic, are preparing to sell user data. And it sounds scary, but honestly it's all public and probably was scraped anyway.</p><p>[01:06:30] <strong>Alex Volkov:</strong> And still, they're preparing to sell this, probably in a more structured and maybe more licensed way, to OpenAI and Midjourney. So that's very interesting, because Tumblr had a bunch [01:06:40] of images and probably was scraped to an extent. WordPress, definitely. So just to clarify, this is not WordPress the platform, the open source platform everybody can use to run their websites.</p><p>[01:06:51] <strong>Alex Volkov:</strong> That's not what they're selling. I don't think they can. But WordPress.com, I think, is where you can host a blog for free without knowing how to run a WordPress install yourself. So WordPress has the open source system that you can run your blogs and websites on, which runs like 30 percent of the internet or something crazy like that.</p><p>[01:07:06] <strong>Alex Volkov:</strong> But also, WordPress.com is the place where you can host your blog, and basically, when you signed up and created your blog there, you maybe didn't know the information is there to sell. Like Reddit supposedly selling Reddit's information to Google for 200 million, which we talked about last week.</p><p>[01:07:24] <strong>Alex Volkov:</strong> Automattic is now basically trying to extract money based on their data, where previously this data was scraped. What's the benefit for OpenAI? Obviously, now there's a lawsuit with the New York Times, whether or not this is considered fair use, and whether or not OpenAI's models can spit out full New York Times articles.</p><p>[01:07:44] <strong>Alex Volkov:</strong> So there's a whole debate about this, and there's going to be a lawsuit, because they didn't achieve a similar deal with the New York Times. Although it was reported the folks from OpenAI actually did talk with the New York Times to try and have more structured and licensed access. And WordPress is definitely a huge chunk of the internet, and now some of that information is going to go into these models in a more structured and licensed way.</p><p>[01:08:12] <strong>Alex Volkov:</strong> And moving on to diffusion models, because there's a bunch of updates there, and I think Genie takes us a little bit into diffusion models, so let's see if we have a thing for this, yeah.</p><p>[01:08:41] Playground open sources a new diffusion foundational model</p><p>[01:08:41] <strong>Alex Volkov:</strong> All right. As I said before, we don't cover this at length, I know there's a bunch of other spaces for AI art and diffusion specifically, but when we do, it's because something very big happened. And this week was a huge week as well. 
And so I just want to shout out that we had two foundational models, basically, and then another thing that just broke my jaw, and we're going to talk about this:</p><p>[01:09:01] <strong>Alex Volkov:</strong> Playground. Playground, from Suhail Doshi, I think is his last name. He previously was at Mixpanel, he started building a browser called Mighty, and then he switched fully into AI, and I think a year ago started working on Playground.</p><p>[01:09:17] <strong>Alex Volkov:</strong> Playground is an interface that's super fast and lets you generate a bunch of images, and it's just an interface on top of, or at least previously it was an interface on top of, DALL-E and Stable Diffusion. And they kept giving away all of these models and image generation for free. And basically they collected their styles, etc.</p><p>[01:09:36] <strong>Alex Volkov:</strong> And they've collected all this information about what people actually prefer. And now they released an open model, a new diffusion foundational model, which we haven't had for a while. If you guys remember, we talked about SDXL Lightning, which is based on SDXL. We've talked about Stable Cascade, which is also related to Stability.</p><p>[01:09:54] <strong>Alex Volkov:</strong> We haven't had an open model for generating images in the wild for a while. And Playground released their model called Playground v2.5. And the cool thing about this is, first of all, they say it looks great on realistic stuff. Secondly, they say that on user preference on internal prompts, they significantly beat the baseline for DALL-E, for Midjourney, for the previous version of Playground, and SDXL as well.</p><p>[01:10:23] <strong>Alex Volkov:</strong> And by significantly: on internal preference, SDXL 1.0 gets like 17 percent and their new model gets 82, which is a quite stark, big jump in capability and improvement as well. 
They also get an improvement on top of Midjourney, the latest 5.2 version, and Midjourney is really good at realistic images.</p><p>[01:10:44] <strong>Alex Volkov:</strong> And so what they excel at is realism, and I think they also mentioned different ratios. Like most of these image models, they've been trained at a certain resolution, 1024 for SDXL, for example, and when they generate something in a different ratio, it looks different.</p><p>[01:11:01] <strong>Alex Volkov:</strong> So they also claim that their model is actually significantly more performant in different ratios as well. Definitely shout out to the Playground folks for working on this awesomeness, because who's gonna say no to another model? And there's a demo, from Model Labs I think, that actually makes this work really fast.</p><p>[01:11:17] <strong>Alex Volkov:</strong> If you guys remember, last week I talked about a thing that I built with SDXL Turbo and Groq, and obviously SDXL Turbo, or SDXL Lightning, is super fast. Compared to those super fast examples, the Playground image generation is just night and day. It just looks so real, it's quite striking.</p><p>[01:11:37] <strong>Alex Volkov:</strong> So if you're looking for any updates in that area, definitely check out Playground. And I think, because it's a model they released, you can use it for free. The only thing that I don't know about is the support in the community kind of stuff, whether it supports ComfyUI or some stuff like this, but they just released it, so I'm sure support will come.</p><p>[01:11:56] <strong>Alex Volkov:</strong> And obviously, the LoRAs and everything else in this community will be very interesting to see. There's also a Hugging Face demo. And then, the second thing in image generation real quick, is Ideogram, which we've talked about before. 
It's a startup that came out of folks who worked on Imagen and stuff at Google, and apparently weren't very happy with the slowness of the release.</p><p>[01:12:17] <strong>Alex Volkov:</strong> And meanwhile, Google and its image generation is suffering from bad news and is in hot water because of the different prompt injection that they had. And even, we didn't mention this here, but we mentioned this in the beginning: Sundar Pichai sent an email to all of Google and said, hey, we made mistakes, we offended some of our customers, we need to do organizational changes.</p><p>[01:12:35] <strong>Alex Volkov:</strong> Which is not a small thing for the head of the company, to admit this bad of a release. Ideogram was created by folks from Google before, and they released it for free. And I think they just announced Ideogram 1.0. And the best thing about this, I think, is just text. Everybody is now focusing on different things.</p><p>[01:12:56] <strong>Alex Volkov:</strong> All of these models can generate text to some extent. DALL-E can do text, but it's not like perfect. Ideogram's text generation is super cool. So far I used it multiple times just to answer somebody on X, reply with just a text, for example for Hugging Face, I think I sent them like a thank you note with just text.</p><p>[01:13:13] <strong>Alex Volkov:</strong> And it's really cool to have a model that's very good at presenting and generating text with the imagery that you want. So Ideogram 1.0. They also announced an investment from a16z, and really, their text looks super cool. I was able to do something that other models could not do.
I was able to ask it to generate a hashtag ThursdAI. And if you think about this, that text is not in the training set, because, you know, we came up with the concept, and a hashtag confuses these models. And I think this was the first model that was able to actually not screw up hashtag ThursdAI fully.</p><p>[01:13:50] <strong>Alex Volkov:</strong> Cherry-picked still, so three out of four still wasn't perfect, but definitely this is the best text model that we have. Ideogram, check it</p><p>[01:13:57] <strong>Alex Volkov:</strong> out. Yeah, go ahead</p><p>[01:13:59] <strong>Aditya Kusupati:</strong> Yeah, just randomly, in the audience I noticed we have one of the creators of, I think, one of the top 10 Hugging Face datasets pretty recently, their data out of GPT 3, and they also have a, what's called a DALL-E 3 dataset, training dataset, but yeah, they released a new model recently too, I posted it up for you, so if we have some time after the interview, maybe we can bring them up and stuff.</p><p>[01:14:25] <strong>Alex Volkov:</strong> Yeah, let's see where our second guest is. Oh, he said he's going to join in 10 minutes or so, so we have a little bit more. And the last thing that I want to cover, I actually want to go to my profile and paste this, because you guys, you have to see this.
And if you haven't seen this, okay, so first of all, I'm going to post an image, I'm adding this onto the show notes now, it's the last pinned tweet, an image of a very happy sheep, and they all say we're doomed, hold up, yeah, this one, we're doomed, and the text there is really cool, and the cool thing about the text is it's style-transferred into the image itself, so it looks like part of the image. [01:15:00] But this is not what I wanted to post, I wanted to post the jaw-dropping video from Alibaba, from a model that they teased and hopefully will release soon, called EMO.</p><p>[01:15:13] Alibaba teases EMO - a way to animate and make avatars talk and sing from 1 image</p><p>[01:15:13] <strong>Alex Volkov:</strong> And folks, I don't have a button for this. I don't have a musical transition. I will just say that, if you remember, and if you were here on ThursdAI when Sora was announced and released, if you guys remember, this was live, I think two weeks ago, we had the Sora release and we were just freaking out live on stage here on ThursdAI.</p><p>[01:15:30] <strong>Alex Volkov:</strong> Because our jaws were collectively dropping from what we were seeing. Sora showed a significant jump in capability for image-to-video, sorry, text-to-video generation, and coherence throughout the scene, and longer generations. And since then, OpenAI has been Sora-posting. That's what I call it, Sora-posting on TikTok.</p><p>[01:15:50] <strong>Alex Volkov:</strong> So if you're on TikTok and you don't follow OpenAI, they literally opened a full account that just posts Sora videos, or Sora-posting, on TikTok. And since then, the amount of videos that they released there just shows the capabilities of that incredible model. It does look like the ChatGPT moment for video generation, based on what they released.</p><p>[01:16:07] <strong>Alex Volkov:</strong> I think that EMO from Alibaba is definitely one of those moments.
And actually, it's really funny, because the Alibaba folks took one of the Sora-generated videos, if you remember, one of the first ones is a woman walking through Hong Kong, wearing sunglasses, and it zooms into her face, and all of this video was generated, it's quite crazy.</p><p>[01:16:29] <strong>Alex Volkov:</strong> We're now like, oh yeah, of course it generated the woman walking through Hong Kong wearing glasses, but it's still mind-blowing. So the EMO folks from Alibaba, they took that video, took a still from that video, just a still, not the whole video, and made that exact woman sing a Dua Lipa song, and this is now pasted on top of the space, and, folks, my jaw dropped when I saw this, and then dropped again because I started looking at all the details.</p><p>[01:16:56] <strong>Alex Volkov:</strong> I did a little deep dive into image generation, avatar creation, basically taking an image and making it sing or lip sync. And usually those models, they move maybe the mouth a little bit, some of them move the eyes. This model makes this from one image, one image only. It makes the eyes move independently, it makes the eyebrows move independently, obviously the mouth.</p><p>[01:17:17] <strong>Alex Volkov:</strong> I saw earrings get animated, I saw vocal muscles in the throat get animated, where, if somebody talks, you can see their throat move differently. I'm noticing all these things. The woman in the video that I'm referring to wears sunglasses. So most of these models would move the sunglasses to an extent.</p><p>[01:17:35] <strong>Alex Volkov:</strong> These sunglasses stayed in exactly the same place. So the coherence of this model is way beyond anything that I've seen.
And I've researched this field, and I used D-ID, I used tulip, I used all these tools, and just the creations you'd be able to make with something like Sora plus something like this EMO thing.</p><p>[01:17:50] <strong>Alex Volkov:</strong> It just opens new horizons. And many of my friends in AI art are looking at this in disbelief, because it really feels like the Sora moment as well. So I just wanted to highlight how exciting this was for me, and how huge of a jump this was from everything we've seen before.</p><p>[01:18:07] <strong>Alex Volkov:</strong> Reactions from folks on stage, what did you think when you saw EMO? Same as me? Existential dread? Anything else? Yeah, same as me. All right. So it looks like our, yeah, Nisten, go ahead, and I'm going to take a look.</p><p>[01:18:22] <strong>Nisten Tahiraj:</strong> I just want something like this that's small, real time, and cartoonish, so I can just use it as an assistant. That would be great. I'm impressed, but I just want a small, tiny one. I want Clippy. I want the actual Clippy. Yeah,</p><p>[01:18:37] <strong>Alex Volkov:</strong> They didn't animate Clippy, but I found it very interesting that they animated the Sora-generated woman with the voice of Mira Murati, the CTO of OpenAI. They took her voice and embodied one of their creations with this voice, and I found this a very interesting choice on their part. I will say, while Aditya comes up, and Aditya, if you can hear me, I'm sending you a request, oh yeah, there he is.</p><p>[01:19:00] <strong>Alex Volkov:</strong> I found it very interesting that they haven't released the model yet, but they did say, we're committing to open source, we're going to release this. And their GitHub for EMO is open, but there are no commits there, it's just a readme. So hopefully they're going to release this.
And hopefully we'll get to a point where we can actually, Nisten, like you're saying, have an actual assistant in near real time, with a fake voice or generated voice, actually read out whatever LLMs tell us.</p><p>[01:19:25] <strong>Alex Volkov:</strong> And I think the last thing I'll say here before I move on to the interview is this adds to this notion that I think we saw from Logan from OpenAI, where chat is not the final interface for these things. I think embodiment like this is one thing that moves us forward.</p><p>[01:19:40] Deep dive into Matryoshka embeddings with Aditya Kusupati & Prateek Jain</p><p>[01:19:40] <strong>Alex Volkov:</strong> All right, folks, this has been the updates, and now we're moving to more of a deep dive interview, and I'm very happy to introduce two guests here, and, two guests, I'm getting a little winded, so forgive me.</p><p>[01:19:54] <strong>Alex Volkov:</strong> But I want to say hi to Aditya Kusupati and Prateek Jain, and thank you. And folks, feel free to unmute yourself and talk and call out. But basically, welcome Prateek, welcome Aditya. How are you guys?</p><p>[01:20:05] <strong>Aditya Kusupati:</strong> thank you,</p><p>[01:20:06] <strong>Prateek Jain:</strong> Alex. Thanks so much, Alex, and thanks everyone for listening.</p><p>[01:20:10] <strong>Alex Volkov:</strong> I'm gonna set this up. I noticed you on my timeline first, and then I saw Aditya's tweets as well, where we talked about OpenAI's new embedding models, and one of the things that was very interesting back when they released this, and this is how I got to talk with you guys, is they added some new parameter in their new models.</p><p>[01:20:30] <strong>Alex Volkov:</strong> So they had ada-002 before, and then they said, hey, we're releasing two new models, embeddings version 3, and they have a way to specify dimensions.
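Mechanically, a dimensions knob on a Matryoshka-style embedding amounts to taking a prefix of the vector and re-normalizing it. A minimal sketch, with a toy 8-dimensional vector standing in for a real 1024-dimensional embedding (the `shorten` helper is illustrative, not any library's API):

```python
import numpy as np

def shorten(v, dims):
    """Keep the first `dims` coordinates and re-normalize to unit length."""
    prefix = np.asarray(v, dtype=float)[:dims]
    return prefix / np.linalg.norm(prefix)

# Toy "embedding", unit-normalized the way embedding APIs usually return them.
full = np.array([0.9, 0.3, 0.2, 0.1, 0.05, 0.04, 0.02, 0.01])
full = full / np.linalg.norm(full)

short = shorten(full, 4)
print(short.shape)  # (4,)
```

The shortened vector still lives on the unit sphere, so cosine-similarity code downstream keeps working unchanged.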
And so previously on ThursdAI, we talked about embedding models, we talked about the MTEB leaderboard that Hugging Face has, we have folks from Jina, who also released top-of-the-line embedding models, in Friends of the Pod, and we definitely looked at open source models in comparison to something closed source, like for example OpenAI, and dimensions were a big thing in that whole area. Then OpenAI released something where you can specify the number of dimensions, and this raised an eyebrow, and I was like, oh, that's interesting, I don't even know what this is about. And then, Prateek, I think I saw your tweet saying, hey, congrats OpenAI. Unfortunately, you didn't mention us.</p><p>[01:21:19] <strong>Alex Volkov:</strong> And then somebody from OpenAI reacted and said, oh, actually, yeah, we do use something called MRL, and they added this to the blog post. Prateek, could you talk about that, before we dive in on what MRL actually is? Could you talk about what they added and why? And yeah, just talk about this phenomenon of them not adding you to the blog post.</p><p>[01:21:38] <strong>Prateek Jain:</strong> They had done the work on their own and everything. And they did release really strong embeddings, the results on the MTEB eval boards looked really good. Definitely many congratulations to them. The only thing was that they had released this new thing, as you mentioned, called shortening embeddings.</p><p>[01:21:54] <strong>Prateek Jain:</strong> And the output structure in some sense seems very similar to what Matryoshka representations, or these nested representations, do.
And we do know that they were at least aware of Matryoshka representations, because through some of our earlier conversations, at least some of the research scientists had reached out to us and had talked to us about some of the details of Matryoshka representations.</p><p>[01:22:13] <strong>Prateek Jain:</strong> It felt a little bit against the spirit of open science and pushing the scientific boundary, so that's the only reason we highlighted that it would be good if the initial work could either be cited or maybe use the same name. I think they were very gracious, in particular the person who had written the blog, I think they said that, yeah, there was a miss on their part, and they have updated the blog now, all good. I think when we do open research and publish and discuss ideas, it moves the field very fast and helps everyone. We are definitely all up for it.</p><p>[01:22:49] <strong>Alex Volkov:</strong> Yeah, absolutely. I want to talk about when you guys released MRL. This was way before the explosion of LLMs and ChatGPT came to the scene, right?</p><p>[01:22:56] <strong>Alex Volkov:</strong> You released MRL, Matryoshka representations, back in '22, right? Almost two years ago, like a year and a half ago?</p><p>[01:23:05] <strong>Prateek Jain:</strong> Yeah.</p><p>[01:23:06] <strong>Alex Volkov:</strong> And so talk to us, maybe give a brief explanation. I think folks are generally okay with embeddings in the audience here, but maybe dimensionality is still somewhat of an elusive concept.</p><p>[01:23:18] <strong>Alex Volkov:</strong> Would one of you tackle the [01:23:20] task of explaining what dimensionality means in a very popular-science way, so we can then dive into how adjusting dimensionality actually helps performance and different things?</p><p>[01:23:29] <strong>Prateek Jain:</strong> So generally, what happens is if you have, say, some text data, right?
So let's say you have a string of 1,024 tokens, or let's say you have an image, a 64 by 64 image. What a computer, in some sense, would want is to see them as a set of numbers,</p><p>[01:23:47] <strong>Prateek Jain:</strong> or a vector of numbers. Through this incredible line of work around embeddings, what we are able to do is embed these images or text or whatever data object you have into a fixed dimensional vector. So by that, what I mean is you might have a 64 by 64 image, but you can write that as a vector of, let's say, 128 numbers, right?</p><p>[01:24:11] <strong>Prateek Jain:</strong> So that is what we call dimensionality. That is, it is a 128 dimensional vector that we want to work with. Why is this interesting? Because if you have a 64 by 64 image and you just change some pixels, let's say only 1 percent of the pixels, those changes would not even be visible to you. But when you compute, let's say, the distance between these two images in pixel space, that is, if you're just subtracting the two images from each other pixel by pixel, the distance might seem very large, but in reality, semantically, both of them mean essentially the same thing.</p><p>[01:24:41] <strong>Prateek Jain:</strong> So what we ideally want is some of these embeddings which capture the underlying semantic structure of the data object, of, let's say, the image. Let's say there are two images, both of them contain a cat, in a very similar pose. We would want to have them represented within our machine as very similar objects, and that is what these embeddings, or semantic embeddings, are able to do.</p><p>[01:25:03] <strong>Prateek Jain:</strong> So generally there are multiple techniques to take, as I said, either the image or text or audio, whatever you have, and embed it into, say, a fixed dimensional representation, that is, a fixed number of floating point numbers or integers.
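Prateek's 1-percent-of-pixels point is easy to check numerically. A toy sketch, with a random array standing in for a real image:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))   # toy 64x64 grayscale image, values in [0, 1]
noisy = img.copy()

# Change about 1% of the pixels (41 of 4096) to new random values --
# a change that would be nearly invisible to a human viewer.
idx = rng.choice(img.size, size=41, replace=False)
noisy.flat[idx] = rng.random(41)

# In raw pixel space the two images are measurably far apart...
pixel_dist = np.linalg.norm(img - noisy)

# ...even though ~99% of the pixels are identical, so semantically
# the two images are "the same".
frac_changed = np.mean(img != noisy)
print(pixel_dist > 0, frac_changed < 0.02)
```

Which is exactly why a semantic embedding, rather than raw pixels, is the right space to compare in.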
Now, generally, these representations are rigid.</p><p>[01:25:21] <strong>Prateek Jain:</strong> They are fixed. That is, let's say a person, a designer, has to a priori say, okay, I can deal with 128 dimensional representations for my image, and on the basis of this, I can run some sort of classifier, or some sort of retrieval algorithm to retrieve similar images or classify the image into some particular class.</p><p>[01:25:39] <strong>Prateek Jain:</strong> So generally, that decision is made a priori, that I will be embedding into 128 dimensions, because 128 dimensions, let's say, are able to give me the accuracy I want, and I will be able to deploy them in my system, because that's another key part. Whenever you are deploying them, the dimensionality of the embedding can be a critical thing.</p><p>[01:26:00] <strong>Prateek Jain:</strong> Let's say you want to do retrieval: the cost of retrieval is almost directly proportional to the dimensionality of the data point. So the decision is made a priori. So for example, the earlier embeddings that came out from OpenAI, they made that decision that, okay, these embeddings should, let's say, be, I think, 1024 dimensional or something like that.</p><p>[01:26:19] <strong>Prateek Jain:</strong> So you just had those 1024 dimensions, and the not-so-good part about that is that now everybody who wants to use those embeddings has to change their system to suit those 1024 dimensional representations. So some people who might be running, say, some sort of retrieval engine on 64 dimensions, they will need to now scale up everything, change how they are doing retrieval, how their indexer works, how their serving works, to fit to those 1024.</p><p>[01:26:46] <strong>Prateek Jain:</strong> And that's not ideal, right? So the idea behind Matryoshka representations was: can we bring flexibility into these embeddings?
That is, while we are giving out 1024 dimensional embeddings, can somebody come and read off just 64 coordinates out of it, so that they don't need to change their entire serving stack?</p><p>[01:27:07] <strong>Alex Volkov:</strong> So I want to slide in here with a question before we get to your guys' solution for dimensionality flexibility, which is very cool. So you're saying the a priori decision basically means that I, as a developer, let's say, if I used whatever OpenAI has given me, or any other rigid structure, I had to basically abide by their rules of how in depth those embeddings represent my concepts, correct?</p><p>[01:27:31] <strong>Alex Volkov:</strong> And could you talk about, maybe before we dive into dimensionality, how this affects actual retrieval? Are more dimensions always better? There's a thing that I heard yesterday that somebody mentioned. It's called the curse of dimensionality. And I really wanted to dive in and hear about what this means.</p><p>[01:27:46] The curse of dimensionality</p><p>[01:27:46] <strong>Alex Volkov:</strong> Because we've talked before, and there are embedding models with like 8,000 dimensions or so. And I heard from Beau, who's in the audience here, who may join us as well, that's not always the best case, for many reasons, not only speed. Could you talk about the curse of dimensionality, and is more always better?</p><p>[01:28:03] <strong>Prateek Jain:</strong> So that's a great question, right? Definitely more dimensions intuitively should help you capture more and more information about the data that you are trying to embed. But obviously, beyond a certain point, it starts to become complete noise, right? So for example, if you go back to the image example that I was giving, you have a 64 by 64 image.</p><p>[01:28:24] <strong>Prateek Jain:</strong> You can think of that as about a 3600 dimensional vector, right?
And if you want a very precise embedding, then maybe that 3600 dimensional vector is what is capturing everything about that image, because that is precisely how we are seeing that data point, right?</p><p>[01:28:40] <strong>Prateek Jain:</strong> But the bad thing about that sort of representation is that it is not capturing the semantic information. It is also bringing in a lot of noise. There is some sort of sweet spot at which dimensionality you want to stop, right? That's one part of it.</p><p>[01:28:55] <strong>Prateek Jain:</strong> But when you come up with these representations, they are going to be used in some downstream task, right? As I mentioned earlier, some of the downstream tasks are: I have this representation of the image, now do classification for me. So I will run some sort of classifier on top of this representation of the image to say, okay, whether this image has a cat or a dog, right?</p><p>[01:29:17] <strong>Prateek Jain:</strong> Similarly, I can say, okay, I want to retrieve the most similar image to this given image in my database of all the images. So I might have an entire database of animals, I give you an image of a particular cat, and I want to retrieve a cat which is most similar looking, maybe in a similar pose, similar situation, right?</p><p>[01:29:35] <strong>Prateek Jain:</strong> So these models, or these embeddings, are used in these downstream tasks, and to use them in these downstream tasks, we are also then bound by the realities of those downstream tasks.
For example, if you want to do classification and you have only, let's say, 200 data points to train the classifier, then a very high dimensional embedding is not great, because that will give you very poor performance: your model will overfit, it will just mimic whatever it is seeing in the training data, and it will not generalize to new test points.</p><p>[01:30:07] <strong>Prateek Jain:</strong> So it can be catastrophic. A similar situation happens in your retrieval or nearest neighbor search kind of thing. That is, if you have a very high dimensional embedding, as you mentioned earlier, there's this curse of dimensionality that applies, which might mean that my nearest neighbor search is not working well, especially if I'm doing any kind of approximation, and I might get essentially garbage out of that situation.</p><p>[01:30:30] <strong>Prateek Jain:</strong> So that's why, based on the downstream task, the amount of training data I might have, the serving realities there, that is, how much latency I can spend or how much compute I can spend in serving, I might have a sweet spot: okay, this is the dimensionality that works best for me. And I ideally want to select that and work with it.</p><p>[01:30:50] <strong>Alex Volkov:</strong> I see. And Aditya, it looks like you can now join, and I also wanted to follow up with you, because Prateek, the examples you gave are image embeddings, and that's great, but I think one of the huge things that happened since you guys released the paper is how much LLMs are being used for different things as well, right?</p><p>[01:31:07] <strong>Alex Volkov:</strong> And I think this led to an explosion in vector databases, and they started embedding, and I think at least for many of the developers who use these LLMs, text embeddings, or at least they started with text, and now it's multimodal.
This is like the highest use currently, in RAG.</p><p>[01:31:23] <strong>Alex Volkov:</strong> Would you, maybe Aditya, would you want to expand on how much this whole field started heating up, with vector databases now storing every embedding? I definitely didn't hear about this up until a year ago. Would you want to chime in on this, and how your work is now super relevant to this whole new world?[01:31:40]</p><p>[01:31:40] <strong>Aditya Kusupati:</strong> Yeah, as Prateek said, I think the curse of dimensionality even applies in vector databases, because you have to search through things. And the major thing is you also need to think about storage, right? So let's say you want to index a billion documents. And if you want to do everything with, say, 1024 dimensions, you're going to have to use about a terabyte,</p><p>[01:32:00] <strong>Aditya Kusupati:</strong> or four terabytes, worth of storage. And a lot of people might not be willing to do that. So how people typically do that in vector databases is they store one copy, and when they're trying to do some processing on top of it, they do some sort of compression. It can be a lot of things.</p><p>[01:32:18] <strong>Aditya Kusupati:</strong> And it works great, but the thing is, it's a lot of post processing, and you also need to store the actual embeddings in your vector database store. With data which keeps growing and growing, and no way for you to control the total amount of data, you should probably figure out a way to make your representations much more compact, and still accurate.</p><p>[01:32:40] <strong>Aditya Kusupati:</strong> I think that is where a lot of oversight was there for the last few years. Again, vector databases existed even before last year, but they blew up because of RAG applications.
And I think in the Matryoshka case, as OpenAI said, it gives you the flexibility to just store 64 dimensions if you want, and that should just be it.</p><p>[01:33:00] <strong>Alex Volkov:</strong> And 64 is way smaller than the previous dimensionality that they had, I think 1536 or so. And also, I would be remiss not to mention that video is coming into play right now, large multimodal models. Now they're not only understanding text and images; now we're talking about video embeddings, for example, and being able to represent those.</p><p>[01:33:21] <strong>Alex Volkov:</strong> And when you talk about storage costs, et cetera, dimensions definitely affect that, and also speed of retrieval and comparison. So let's move on to talk about, because you guys wrote the paper before this explosion, but definitely the concepts existed, I want to hear about what Matryoshka representations are and how they affect dimensionality.</p><p>[01:33:38] What are Matryoshka Embeddings?</p><p>[01:33:38] <strong>Alex Volkov:</strong> Specifically, being able to choose the dimensionality, and during which process, and I would love to hear the brief explanation from you, then we can dive in and ask more questions.</p><p>[01:33:47] <strong>Aditya Kusupati:</strong> Sure.</p><p>[01:33:48] <strong>Prateek Jain:</strong> Let's take</p><p>[01:33:48] <strong>Aditya Kusupati:</strong> the running example. Let's say there is a 1024 dimensional representation of your image, let's keep it to 1024 for now. So you're trying to basically fit a bunch of learned attributes, it could be some version of color, some version of texture, et cetera, which is being fed into these things.</p><p>[01:34:08] <strong>Aditya Kusupati:</strong> So that is what these embeddings are learning. And they're extremely good at a lot of semantic tasks. If you want to find a similar looking dog, it's much easier for you to search in this space. So that's the goal, right?
Ideally, until now, when you wanted to do things faster, you took these embeddings and you did some sort of compression, most likely some notion of PCA, or a low dimensional projection, or some sort of quantization, okay?</p><p>[01:34:35] <strong>Aditya Kusupati:</strong> And that's how you used to do it. So there is an additional processing overhead on top of the existing embeddings for you to get this done. We wanted to fix this problem, because this additional overhead need not always give you the most accurate solutions. So the motivating goal for us was to figure out if we can pack the information in this 1024 such that we don't have to project it into a low dimensional space, or do any post processing, to get a 64 dimensional embedding,</p><p>[01:35:04] <strong>Aditya Kusupati:</strong> but rather just take the first 64 dimensions of this vector. So if there is a collection of 1024 numbers, I want you to be able to cut it off at the first 64 and say, this is a 64 dimensional embedding which is as good as any 64 dimensional embedding you can ever build. Does that make sense? This was the goal:</p><p>[01:35:24] <strong>Aditya Kusupati:</strong> the final embedding should look like this. And that is what we tried to do. And it turns out training these things is so simple that it's literally what you think. If you want the first 64 dimensions to be the most important thing, you optimize the same loss function you are optimizing for 1024 on the 64 dimensions.</p><p>[01:35:45] <strong>Aditya Kusupati:</strong> Let's say you are doing some text embedding training, where you are trying to pull two relevant text embeddings together and push two irrelevant text embeddings farther apart. And there is a loss, which is typically contrastive, which tries to do this in the 1024 dimensional space; you also do it for the 64 dimensional space.</p><p>[01:36:05] <strong>Aditya Kusupati:</strong> That's it.
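The training recipe Aditya describes, the same loss applied to nested prefixes of one embedding, can be sketched like this. This is a toy NumPy illustration of the idea, not the paper's actual implementation; the contrastive loss and the dimension schedule are illustrative:

```python
import numpy as np

def contrastive_loss(q, d, temp=0.07):
    """Toy InfoNCE-style loss: query i's positive is document i."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    logits = (q @ d.T) / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))   # -log p(positive)

def matryoshka_loss(q, d, dims=(64, 128, 256, 1024)):
    """Apply the SAME loss to nested prefixes and sum -- the whole trick."""
    return sum(contrastive_loss(q[:, :m], d[:, :m]) for m in dims)

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 1024))
docs = queries + 0.1 * rng.normal(size=(8, 1024))  # aligned positives
loss = matryoshka_loss(queries, docs)
print(loss > 0)
```

In real training this summed loss would be minimized with a gradient-based optimizer, which forces the most useful information toward the front coordinates, since the short prefixes must stand on their own.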
So you now have two losses instead of one, and at the end of the training, which again does not take any extra cost over training a plain 1024 dimensional embedding, you get the first 64 dimensions as an embedding which is as good as any 64 dimensional embedding you can ever train.</p><p>[01:36:22] <strong>Aditya Kusupati:</strong> And that's pretty much it. So you can repeat this for multiple dimensions, so not just 64: you can do 64, 128, 256, and so on. Now you have these chunks of representations inside this 1024, which can cater to a wide variety of audiences, depending on their use cases. And a lot of times people don't care about precision.</p><p>[01:36:44] <strong>Aditya Kusupati:</strong> If recall is all you care about in your retrieval applications, you can just use 64 dimensions. And if you want more precise information, as Prateek said, you can encode more information in the higher dimensional embeddings, go to 1024. If you have a smaller number of data points and you're not able to cluster things properly, go for smaller dimensions.</p><p>[01:37:02] <strong>Aditya Kusupati:</strong> So the flexibility just opens up so many things which were probably infeasible before, because you had to do some sort of post hoc compression or post processing on top of it, which led to slightly less accurate things. It just didn't allow you to do all of these things on the fly.</p><p>[01:37:21] <strong>Alex Volkov:</strong> Wow. Just to sum up, to see if I understand this: I'm looking at, and unfortunately this medium is audio only, but I think it's very helpful to see a visual representation of this, you're basically front-loading most of the important information into the first 64 dimensions, 128 dimensions.</p><p>[01:37:37] <strong>Alex Volkov:</strong> And you're saying that precision for specific use cases like RAG could still be as good as with 1024 dimensions.
And that sounds to me incredible.</p><p>[01:37:47] <strong>Aditya Kusupati:</strong> Let's take an example, right? In your RAG, all you care about is 10 blue links, which need to be in the top 10. That's it. You don't care if the first link is the first one or the last link is the last one. There is some evaluation saying that there is a utility for positionality, but in most cases, if you get 10 relevant documents in any order, that's all that matters.</p><p>[01:38:06] <strong>Aditya Kusupati:</strong> You don't care if the best document is at the top or at the 10th position. If you feed all of these things into your LLM, the LLM will figure it out. So this is the case of recall: you don't care about precision. Your ranking only cares about getting the most relevant 10 documents in the first 10, and not how relevant they are within themselves.</p><p>[01:38:27] <strong>Alex Volkov:</strong> I see. I want to</p><p>[01:38:29] <strong>Alex Volkov:</strong> also</p><p>[01:38:29] <strong>Prateek Jain:</strong> Sorry, just to add a little bit more nuance there: in many situations, what might happen is, in your RAG, rather than even getting, let's say, the top 10 links that Aditya said, suppose I get the top 100 links, right?</p><p>[01:38:42] <strong>Prateek Jain:</strong> And of those top 100 links, some of them might be completely useless, completely rubbish. But as long as those correct top 10 links are sitting somewhere in the top 100 links, I'll be fine. That is, after that I can do refinement. The rough structure here would be that you will take, let's say, only the first 64 dimensions or coordinates, or maybe only the first 32 coordinates, from MRL, and do the retrieval of the top hundred links.</p><p>[01:39:06] <strong>Prateek Jain:</strong> Once you have those top hundred links, to get the correct top 10 links, we can do further rescoring based on the full, let's say, 1,024 dimensions, and get those.
And now, because everything is nested, those embeddings are already computed and I have them with me, right? So I can first say that, okay, for the first phase of getting the top hundred,</p><p>[01:39:25] <strong>Prateek Jain:</strong> I can use 32 dimensions. And then in the second phase of doing that rescoring, I can use the full dimensionality. Sorry for cutting in.</p><p>[01:39:34] <strong>Alex Volkov:</strong> No, that was great. Great addition. And I want to ask: rescoring and re-ranking, are you referring to the same thing? Like, some folks take the initial results and then they try to rank which were the most appropriate ones. Does this represent the case that you guys talk about, where the initial ordering is not really necessary for the first responses?</p><p>[01:39:52] <strong>Alex Volkov:</strong> And then we're going to run another tool, like Cohere. Sometimes those folks do re-ranking with Cohere, and then you'll, like, judge the importance [01:40:00] of those and then sort them in a secondary process.</p><p>[01:40:02] <strong>Aditya Kusupali:</strong> Yeah, that's pretty much it, that's a relevant thing. But I think Jo Kristian Bergum is in the call, from Vespa. He's a proponent of late interaction, so you can do a lot of other re-ranking methods. But in this case, what Prateek specifically is saying is, let's say you recall with 64 dimensions, and you can rescore with 1024.</p><p>[01:40:23] <strong>Aditya Kusupali:</strong> You can use the precise 1024 to just rescore, in case you ever want to use it. And this is all from the same MRL embedding.</p><p>[01:40:33] <strong>Alex Volkov:</strong> Alright, so moving on. Aditya, I heard you say also that in the use case of LLMs, for example, and again, you guys built this before the explosion, in the use case of LLMs and RAG, some amount of this is offset to the LLM itself. 
After you retrieve and you provide this data to the LLM, it can do some of this work for you, which I guess is why your work from a year or a couple of years ago found newfound relevance.</p><p>[01:41:01] <strong>Alex Volkov:</strong> But then I think you followed up with another paper a year ago, AdANNS, right? Could you talk about how this applies to Matryoshka Embeddings as well? I would love to hear about additional work in this area that you guys did.</p><p>[01:41:15] <strong>Aditya Kusupali:</strong> Sure. When Prateek was talking about retrieval, he also mentioned that you typically do a nearest neighbor search. So the goal is, when a query comes in, you embed it into the same space. Documents, let's say a billion, are encoded in the same space, and your target is to find, say, the top 10 documents which are most relevant.</p><p>[01:41:32] <strong>Aditya Kusupali:</strong> And the way you do it is nearest neighbor search. So you just try to find which vectors in your database are the closest to the query. But the thing is, again, as Prateek said, the cost is directly proportional to the dimensionality, as well as the number of data points. So it's linear in terms of the number of data points and the dimensionality.</p><p>[01:41:50] <strong>Aditya Kusupali:</strong> So you want to reduce this cost at web scale; there is no way Google can serve things if every single data point has to be explicitly compared. So there's an idea called Approximate Nearest Neighbors, which has been there for the last 25 years or so. The goal of Approximate Nearest Neighbors is, instead of touching all the 1 billion points to get the top 10, I'm going to touch, say, something like 10,000.</p><p>[01:42:12] <strong>Aditya Kusupali:</strong> So I'm only going to search 10,000, by somehow partitioning the space and only cleverly looking at the places I want to look at, and get to the 10,000. And in those 10,000, I'll do a more exhaustive search and find the top 10. 
Okay, and this is Approximate Nearest Neighbors. And the simplest way of thinking about Approximate Nearest Neighbors is a tree structure.</p><p>[01:42:32] <strong>Aditya Kusupali:</strong> So you have a billion points. You are basically building a huge tree structure by using clustering. So a billion points can be clustered into 50,000 clusters, which can further be clustered into 50,000 each. And eventually your leaf nodes, like the final leaf nodes, will have 100 data points in each of the leaf nodes.</p><p>[01:42:48] <strong>Aditya Kusupali:</strong> And this is a typical tree-based data structure, which a lot of people use for Approximate Nearest Neighbors. In case anyone is interested, you can go check FAISS, the library from Facebook. It's a very good resource for all of these things. This is Approximate Nearest Neighbors, and it plays very well with web scale systems.</p><p>[01:43:05] <strong>Aditya Kusupali:</strong> You want any of your embeddings to play well with Approximate Nearest Neighbors if you want to scale to the web. While powerful... can you hear me?</p><p>[01:43:12] <strong>Alex Volkov:</strong> Yeah, we can hear you, you cut off for a second, now we're back.</p><p>[01:43:16] <strong>Aditya Kusupali:</strong> Okay, so Matryoshka representations, as Prateek said, again, you can use 64 dimensions to recall 100 documents and re-rank, say, with 1024 to get the top 10. While this is sound in principle, when you try to do this in systems-aware settings, it does not scale well, because these 100 documents need not be sitting on the same machine, they need not be co-located, all of these things; there are so many systems considerations which start blowing up, and Approximate Nearest Neighbors directly handles this.</p><p>[01:43:46] <strong>Aditya Kusupali:</strong> Approximate Nearest Neighbors ensures that similar documents are in the same chunk of your memory, for your systems to take care of a lot of these things. 
So we wanted Matryoshka representations to power better approximate nearest neighbors. That's why we came up with AdANNS, or Adaptive Approximate Nearest Neighbor Search.</p><p>[01:44:03] <strong>Aditya Kusupali:</strong> And the goal here is, again: when you're doing approximate nearest neighbors from 1 billion to 50,000 clusters, followed by 50,000, let's say you have a 1024 dimension embedding; you use the same 1024 embedding for every single one of these phases. But as we talked about earlier, if you only care about recall, which is what your clustering is basically doing, what your clustering is saying is: look, I just need to be in the right cluster, the right portion of your space, and that's pretty much all I care about.</p><p>[01:44:29] <strong>Aditya Kusupali:</strong> So that's just recall. And if I'm able to do this clustering with 64 dimensions instead of 1024, I can save a lot of compute when I'm searching the space. And this is the idea. So at every single level of this tree, I'm going to change the dimensionality I'm going to use. Let's say 64, 128. And then finally, when I come to the leaf node, when my query goes to the leaf, I'm going to precisely re-rank all these 100 data points or so</p><p>[01:44:53] <strong>Aditya Kusupali:</strong> with 1024. So there is going to be a precise re-ranking at the end, but all the intermediate steps, because they're already approximate and only care about recall, can be approximated with a lower dimension embedding. You could traditionally do this even without Matryoshka embeddings, but you'd again need post hoc compression, which is not great.</p><p>[01:45:12] <strong>Aditya Kusupali:</strong> So Matryoshka representations just give you this for free. If you want 64 dimensions for the first phase of clustering, take the first 64. If you want 128 for the second phase of clustering, take the first 128. And that's the reason it becomes seamless, and that's what AdANNS does.</p><p>[01:45:27] <strong>Alex Volkov:</strong> Awesome. 
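</p><p>[Editor's note: to make the adaptive-dimensionality idea concrete, here is a toy, hedged IVF-style sketch in NumPy, not the actual AdANNS codebase: cluster and route with a low-dimensional prefix of the same stored vectors, then do the final precise re-ranking with the full embedding. The cluster counts, dimensions, and function names are illustrative.]</p>

```python
import numpy as np

def build_coarse_index(docs, n_clusters=50, cluster_dim=64, iters=10, seed=0):
    """Toy IVF index: k-means over only the first `cluster_dim` coordinates."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), n_clusters, replace=False), :cluster_dim].copy()
    for _ in range(iters):
        # assign each doc to its nearest centroid in the low-dim prefix space
        dists = ((docs[:, None, :cluster_dim] - centroids[None]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = docs[assign == c, :cluster_dim]
            if len(members):
                centroids[c] = members.mean(axis=0)
    dists = ((docs[:, None, :cluster_dim] - centroids[None]) ** 2).sum(axis=2)
    return centroids, dists.argmin(axis=1)

def adaptive_search(query, docs, centroids, assign, cluster_dim=64, n_probe=4, top_k=10):
    # route the query with the same low-dimensional prefix...
    qd = ((centroids - query[:cluster_dim]) ** 2).sum(axis=1)
    probes = np.argsort(qd)[:n_probe]
    cand = np.flatnonzero(np.isin(assign, probes))
    # ...then precisely re-rank the leaf candidates with the full embedding
    return cand[np.argsort(-(docs[cand] @ query))[:top_k]]
```

<p>[A real system would use a proper ANN library for the clustering and storage; this only shows where the prefix dimensions plug in at each phase.]</p><p>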
And I want to take this to the practical level a little bit. As far as I saw, Sentence Transformers from Hugging Face supports this natively right now, right? You can import it and you can encode embeddings with different models. What other tools... since you guys started getting a lot of interest after this, both because of the LLM explosion, now everybody does RAG, and everybody understands that RAG is one way to get these models to behave as they want.</p><p>[01:45:51] <strong>Alex Volkov:</strong> What else? What other tools? You mentioned FAISS. What other tools are now supporting something like this? Because on the face of it, it sounds very helpful, very performant. In my head, this sounds, not necessarily directly, but similar to how quantization came and reduced, like, the precision of models.</p><p>[01:46:08] <strong>Alex Volkov:</strong> And basically they respond with the same precision, but they're significantly smaller. So in what other tools can folks find use for Matryoshka, from what you guys have heard?</p><p>[01:46:20] <strong>Aditya Kusupali:</strong> Yeah, two clarifications. FAISS does not use Matryoshka right now, but AdANNS was built off of FAISS, so yeah, that's a caveat: they don't use Matryoshka at this point. Yeah, the second thing, you asked about quantization, right? That's a very good point. Quantization is a complementary thing.</p><p>[01:46:36] <strong>Aditya Kusupali:</strong> So think of quantization as flexibility in your bit precision, while Matryoshka is flexibility in your dimensionality. So both of them can work hand in hand even after this. You can quantize any Matryoshka embedding, and it will still play well with quantization. So that's the beauty of this, right?</p><p>[01:46:54] <strong>Aditya Kusupali:</strong> Until now, we were only reducing the precision of the numbers, and now you can also reduce the vector itself. 
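</p><p>[Editor's note: the "both at once" point, fewer dimensions and fewer bits, can be sketched like this. A hedged NumPy illustration of truncating to a Matryoshka prefix and then int8-quantizing it; the symmetric per-vector scaling here is one simple choice among many, not a prescribed scheme.]</p>

```python
import numpy as np

def truncate_and_quantize(emb, dim=64):
    """Keep the first `dim` Matryoshka coordinates, then int8-quantize them."""
    prefix = emb[:, :dim]
    # renormalize so the prefix is a valid unit embedding on its own
    prefix = prefix / np.linalg.norm(prefix, axis=1, keepdims=True)
    # symmetric per-vector scale so the largest component maps to 127
    scale = np.abs(prefix).max(axis=1, keepdims=True) / 127.0
    q = np.round(prefix / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

<p>[Going from a 1024-dimensional float32 vector (4096 bytes) to a 64-dimensional int8 prefix (64 bytes plus a scale) is roughly a 64x storage reduction, which is the sense in which the two flexibilities compose.]</p><p>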
So that's very good. Coming to the repositories and other stuff which are using this: of course, Sentence Transformers, I think, is going to be the easiest way in. I went through the implementation the day before yesterday.</p><p>[01:47:14] <strong>Aditya Kusupali:</strong> It's pretty clean. It just works out of the box. Nomic released their v1.5. If anyone wants to go and look at the inside again, it's 10 lines of code. Beautifully written. And I think it's much more understandable in case someone wants to get into the weeds. So that is one thing. We have our own repository, which we released like a couple of years ago.</p><p>[01:47:33] <strong>Aditya Kusupali:</strong> But the nice thing about Matryoshka is, if you want to train something, it's literally a for loop. It's four lines of code. So the code is already in the paper. If someone wants to go and implement it, you just look at the paper; there will be code on, I think, page 12 or something, five lines, and you just go and implement it.</p><p>[01:47:48] <strong>Aditya Kusupali:</strong> Apart from that, I think Transformers.js was supporting a bunch of these re-ranking visualizations on Hugging Face. But yeah, for now, I think these are the things we know which are supporting it. AdANNS, I don't think anyone is supporting at this moment. It's just our code base, which is out there.</p><p>[01:48:05] <strong>Aditya Kusupali:</strong> It's also not highly optimized for low level things, so I wouldn't recommend you directly use it for your use cases, but it's a great thing for you to prototype with and see how well you could benefit from this flexibility in the retrieval space.</p><p>[01:48:18] <strong>Alex Volkov:</strong> So I just wanna make sure [01:48:20] that we shout out properly: Nomic AI, the folks that have the Atlas platform to visualize, and they downsample, I think you said, like they lower the dimensionality into 2D or 3D space to actually show the embeddings. They released Nomic Embed 1.5 recently, fully open source embedding models, end to end, and they're great, and now they're also supporting Matryoshka, which is great.</p><p>[01:48:41] <strong>Alex Volkov:</strong> I also heard you say that quantization directly applies here as well. So you can, like, I don't know the verbiage of this, you can Matryoshka something and quantize a model, and the whole result is significantly smaller. And, like, smaller weights, so that's great.</p><p>[01:48:54] <strong>Alex Volkov:</strong> You also mentioned Transformers.js, which is a Hugging Face library, the author of which, Joshua (Xenova), is here in the audience with us, friends of the pod, and it supports this as well. Folks, we're slowly running out of time a little bit. I wanted to thank you for coming up. It often happens that folks who build something come up and talk to us.</p><p>[01:49:13] <strong>Alex Volkov:</strong> It doesn't often happen with something that was released a few years ago and now resurfaces in popularity, and then we're able to highlight those folks' work. So Aditya and Prateek, I really want to thank you. Anything else that you want to mention? Before I recap the whole space, feel free to.</p><p>[01:49:28] <strong>Alex Volkov:</strong> Definitely not a full deep dive, but I really wanted to highlight the fact that your work is now represented also in, like, one of the big libraries in the world in terms of AI. 
And many folks can now understand what this parameter does when they adjust dimensionality in OpenAI's embedding models.</p><p>[01:49:44] <strong>Aditya Kusupali:</strong> I think, with Nomic, the reason why I say this is a straightforward implementation is: Nomic released their v1, and then Matryoshka became a thing, so they literally trained this entire thing in three days and with all of their data. So it's extremely simple, and they actually didn't have to change a single hyperparameter, so it's pretty good.</p><p>[01:50:02] <strong>Aditya Kusupali:</strong> I would like to see if Prateek wants to add anything, but otherwise, thank you for having me here.</p><p>[01:50:07] <strong>Alex Volkov:</strong> Thank you, Aditya.</p><p>[01:50:07] <strong>Prateek Jain:</strong> No, it's pretty accurate. Thanks for having us here.</p><p>[01:50:10] <strong>Aditya Kusupali:</strong> Yeah, and for anybody else in the audience, I've posted the links as to what you can do with this. So it's Xenova's demo, where you can use it in Transformers.js.</p><p>[01:50:21] <strong>Aditya Kusupali:</strong> And also, we look forward to people actually implementing the paper too, because again, this is not a very well known or well discussed subject in general.</p><p>[01:50:31] <strong>Alex Volkov:</strong> So I'm very happy to have been able to host you guys. You have a paper out, I think it was in NeurIPS, and I'm looking forward to seeing more from this space of embeddings, because there's more to come here. Many people are now probably using this in big production; it was used in RecSys before, but now in big LLM related production stuff. And the more folks understand retrieval and fine tuning retrieval, and also ways to cut costs, like Matryoshka, for example, the better. So shout out to you guys, definitely, thanks for working on this and coming and showing it, giving it light. I'm very happy that you did get the mention in the OpenAI announcement, and I'm also happy that I noticed it because of this, 
and was able to talk to you guys and figure out what Matryoshka embeddings are.</p><p>[01:51:11] <strong>Alex Volkov:</strong> And if folks want deeper deep dives, this was, like, very surface level. You guys did a Paper Club with Latent Space yesterday, and before that, you both talked about Matryoshka Embeddings on the Weaviate podcast. Connor was here before, you guys just missed him. And also, Nisten put this link up.</p><p>[01:51:28] <strong>Alex Volkov:</strong> Hugging Face has a very nice deep dive from Omar and Xenova about Matryoshka Embeddings and what they mean and how to use them in Sentence Transformers. All right, folks, this has been our ThursdAI for today. I will now take a deep breath and recap everything we've talked about. If you've been here for the past two hours and some, you've probably heard all of this, but if not, feel free to stick around; it's probably gonna take me like eight minutes or so, and then we're gonna let you go.</p><p>[01:51:53] <strong>Alex Volkov:</strong> With that, this is our ThursdAI for February 29th. Leap year, February 29th, like once in four years, I find it pretty funny. And I think it was a great space.</p><p>[01:52:01] <strong>Alex Volkov:</strong> We didn't have any... Nisten, no breaking news today, right? I wasn't monitoring well, but GPT-5 didn't release while I was talking, right?</p><p>[01:52:11] <strong>Nisten Tahiraj:</strong> Nope, not yet.</p><p>[01:52:12] <strong>Alex Volkov:</strong> Not yet.</p><p>[01:52:13] <strong>Alex Volkov:</strong> We did get one piece of breaking news that we didn't notice as we were recording the live stream, and that was from our friends at Modular. 
If you remember, we've talked about Modular and their new programming language Mojo, which is a superset of Python, and its creator Chris Lattner, who was previously the LLVM and MLIR compiler author and also the creator of Swift</p><p>[01:52:42] <strong>Alex Volkov:</strong> at Apple. And we've talked about Mojo being the right language for AI, and they just released their inference engine called Max to the world in beta, and this inference engine supposedly has Mojo built in, and supposedly is way faster even for existing models to run inference. So that's very interesting, and we're going to talk about it more as we play around with this.</p><p>[01:53:07] <strong>Alex Volkov:</strong> Alright, folks, I think this was all we talked about on ThursdAI on February 29th. And I want to just thank everybody who joined. Nisten, thank you, as always, co-host. Jan was here before, and we had Beau join for a while, even though we didn't say hi.</p><p>[01:53:22] <strong>Alex Volkov:</strong> We have a bunch of other folks. So thank you to all the guests. Thank you, all of you, for listening and tuning in from week to week. It's really a pleasure. And now with this, I'm just going to end here. Thanks, everybody. We'll see you next week. Cheers.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-29-leap-year-special</link><guid isPermaLink="false">substack:post:142190382</guid><dc:creator><![CDATA[Alex Volkov, Prateek Jain, Aditya Kusupati, and Nisten]]></dc:creator><pubDate>Fri, 01 Mar 2024 00:36:36 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/142190382/d52140aca568c710144f030a59ceabe9.mp3" length="81994633" type="audio/mpeg"/><itunes:author>Alex Volkov, Prateek Jain, Aditya Kusupati, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6833</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/142190382/dff4f8b97b74c2f5ed42b27db7a97305.jpg"/></item><item><title><![CDATA[📅 ThursdAI Feb 22nd - Groq near instant LLM calls, SDXL Lightning near instant SDXL, Google gives us GEMMA open weights and refuses to draw white people, Stability announces SD3 & more AI news]]></title><description><![CDATA[<p>Hey, this is Alex,</p><p>Ok let's start with the big news, holy crap this week was a breakthrough week for speed! </p><p>We had both Groq explode in popularity, and ByteDance release an updated SDXL model called Lightning, able to generate full blown SDXL 1024 images in 300ms. </p><p>I've been excited about seeing what real time LLM/Diffusion can bring, and with both of these news release the same week, I just had to go and test them out together: </p><p>Additionally, we had Google step into a big open weights role, and give us Gemma, 2 open weights models 2B and 7B (which is closer to 9B per Junyang) and it was great to see google committing to releasing at least some models in the open. 
</p><p>We also had breaking news, Emad from Stability announced SD3, which looks really great, Google to pay Reddit 200M for AI training on their data & a few more things. </p><p>TL;DR of all topics covered: </p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Groq custom LPU inference does 400T/s Llama/Mistral generation (<a target="_blank" href="https://x.com/JayScambler/status/1759372542530261154?s=20">X</a>, <a target="_blank" href="https://groq.com/">Demo</a>)</p><p>* Google image generation is in Hot Waters and was reportedly paused (<a target="_blank" href="https://x.com/altryne/status/1760358916624719938?s=20">refuses to generate white people</a>)</p><p>* Gemini 1.5 long context is very impressive to folks (<a target="_blank" href="https://twitter.com/mattshumer_/status/1759749194108043597">Matt Shumer</a>, <a target="_blank" href="https://x.com/emollick/status/1760476937938497946?s=20">Ethan Mollick</a>)</p><p>* <strong>Open Weights LLMs</strong> </p><p>* Google releases GEMMA, open weights 2B and 7B models (<a target="_blank" href="https://twitter.com/altryne/status/1760371315641397331">Announcement</a>, <a target="_blank" href="https://huggingface.co/google/gemma-7b">Models</a>)</p><p>* Teknium releases Nous Hermes DPO (<a target="_blank" href="https://twitter.com/Teknium1/status/1760085483571552612">Announcement</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF">HF</a>)</p><p>* <strong>Vision & Video</strong></p><p>* YoLo V9 - SOTA real time object detector is out (<a target="_blank" href="https://x.com/_akhaliq/status/1760668747449110643?s=20">Announcement</a>, <a target="_blank" href="https://github.com/WongKinYiu/yolov9">Code</a>)</p><p>* <strong>This weeks Buzz</strong> (What I learned in WandB this week)</p><p>* Went to SF to cohost an event with A16Z, Nous, Mistral (<a target="_blank" href="https://x.com/rajko_rad/status/1760015750599979245?s=20">Thread</a>, <a target="_blank" 
href="https://twitter.com/altryne/status/1760025913922568678">My Report</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* ByteDance presents SDXL-Lightning (<a target="_blank" href="https://fastsdxl.ai/">Try here</a>, <a target="_blank" href="https://huggingface.co/ByteDance/SDXL-Lightning">Model</a>)</p><p>* Stability announces Stable Diffusion 3 (<a target="_blank" href="https://x.com/StabilityAI/status/1760656767237656820?s=20">Announcement</a>)</p><p>* <strong>Tools</strong></p><p>* Replit releases a new experimental Figma plugin for UI → Code (<a target="_blank" href="https://twitter.com/Replit/status/1760711401323286795">Announcement</a>)</p><p>* Arc browser adds "AI pinch to understand" summarization (<a target="_blank" href="https://x.com/joshm/status/1760698068943724634?s=20">Announcement</a>)</p><p>Big CO LLMs + APIs</p><p>Groq's new LPU show extreme performance for LLMs - up to 400T/s (<a target="_blank" href="https://x.com/altryne/status/1760561501096575401?s=20">example</a>)</p><p>* Groq created a novel processing unit known as the Tensor Streaming Processor (TSP) which they categorize as a Linear Processor Unit (LPU). Unlike traditional GPUs that are parallel processors with hundreds of cores designed for graphics rendering, LPUs are architected to deliver deterministic performance for AI computations.</p><p>* Analogy: They know where all the cars are going when everyone wakes up for work (when they compile) and how fast they all drive (compute latency) so they can get rid of traffic lights (routers) and turn lanes (backpressure) by telling everyone when to leave the house.</p><p>* Why would we need something like this? 
Some folks are saying that average human reading is only 30T/s, I created an example that uses near instant Groq Mixtral + Lightning SDXL to just create images with Mixtral as my prompt manager</p><p>Open Source Weights LLMs </p><p>Google Gemma - 2B and 7B open weights models (<a target="_blank" href="https://labs.perplexity.ai/">demo</a>)</p><p>* 4 hours after release, Llama.cpp added support, Ollama and LM Studio added support, Tri dao added Flash attention support</p><p>* Vocab size is 256K</p><p>* 8K context window</p><p>* Tokenizer similar to LLama</p><p>* Folks are... not that impressed as far as I've seen</p><p>* <strong>Trained on 6 trillion tokens</strong></p><p>* Google also released Gemma.cpp (local CPU inference) - <a target="_blank" href="https://twitter.com/austinvhuang/status/1760375890448429459">Announcement</a></p><p>Nous/Teknium re-release Nous Hermes with DPO finetune (<a target="_blank" href="https://twitter.com/Teknium1/status/1760085483571552612">Announcement</a>)</p><p>* DPO RLHF is performing better than previous models</p><p>* Models are GGUF and can be found <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF">here</a></p><p>* DPO enables Improvements across the board</p><p>This weeks Buzz (What I learned with WandB this week)</p><p>* Alex was in SF last week</p><p>* A16Z + 20 something cohosts including Weights & Biases talked about importance of open source</p><p>* Huge Shoutout Rajko and Marco from A16Z, and tons of open source folks who joined</p><p>* Nous, Ollama, LLamaIndex, LMSys folks, Replicate, Perplexity, Mistral, Github, as well as Eric Hartford, Jon Durbin, Haotian Liu, HuggingFace,  tons of other great folks from Mozilla, linux foundation and Percy from Together/Stanford</p><p>Also had a chance to checkout one of the smol dinners in SF, they go really hard, had a great time showing folks the Vision Pro, chatting about AI, seeing incredible demos and chat about meditation and 
spirituality all at the same time! </p><p>AI Art & Diffusion</p><p>ByteDance presents SDXL-Lightning (<a target="_blank" href="https://fastsdxl.ai/">Try here</a>)</p><p>* Lightning fast SDXL with 2, 4 or 8 steps</p><p>* Results much closer to original SDXL than the turbo version from a few months ago</p><p>Stability announces Stable Diffusion 3 (waitlist)</p><p>Uses a Diffusion Transformer architecture (like SORA)</p><p>Impressive multi subject prompt following: "a painting of an astronaut riding a pig wearing a tutu holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, in the corner are the words 'stable diffusion'"</p><p>Tools</p><p>* Replit announces a new Figma design→ code plugin </p><p></p><p>That’s it for today, definitely check out the full conversation with Mark Heaps from Groq on the pod, and see you next week! 🫡 </p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p>Full Transcript: </p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, this is Alex. This week on ThursdAI, we had an hour conversation with Groq, a new and very exciting AI inference chip that exploded in popularity all over social media after showing a 5x, yes, 5x improvement in AI inference. 500 tokens per second for Llama 70B and Mistral.</p><p>[00:00:32] <strong>Alex Volkov:</strong> We also talked about Google's new OpenWeights GEMMA model, Google's image generation issues, which led them to take down the abilities of this image generation to generate people. We covered the new, incredibly fast SDXL Lightning, and we had breaking news for Stable Diffusion 3, which is a diffusion transformer that's coming out of Stability AI.</p><p>[00:01:03] <strong>Alex Volkov:</strong> And a bunch of other news. All that after this short intro into Weights & Biases.</p><p>[00:01:10] AI teams are all asking the same question. 
How can we better manage our model development workflow? The path to production is increasingly complex, and it can get chaotic keeping track of thousands of experiments and models. Messy spreadsheets and ad hoc notebooks aren't going to cut it. The best AI teams need a better solution.</p><p>[00:01:33] And better tools. They need Weights & Biases, the AI developer platform, to unlock their productivity and achieve production ML at scale. Replace messy spreadsheets with an automated system of record for experiments.</p><p>[00:01:52] Communicate about model evaluation, and collaboratively review results across the team. Clean up disorganized buckets of models with a unified registry. Automatically capture full model lineage, all the data and code used for training and testing. Seamlessly connect to compute to scale up training. And run large scale sweeps efficiently to optimize models.</p><p>[00:02:20] Analyze the performance of large language models. And monitor LLM usage and costs with live, customizable dashboards. Get your team on the same page to bridge the gaps from ideation to production. Use Weights & Biases to build, manage, and deploy better models, faster.</p><p>[00:02:41] <strong>Alex Volkov:</strong> Wasn't this cool? This is Kari. She is an original PM on the Weights & Biases team. She's been there for a long time, and recently we used her voice to narrate this new video that we have up on the website. And I figured I'd put it in here because it works even without the video. And I thought it was super cool.</p><p>[00:03:01] <strong>Alex Volkov:</strong> And people ask me, what does Weights & Biases do? And hopefully this answers some of those questions. Now I want to switch gears and say, basically, that the format for this week is a little different. 
We had the folks from Groq and Matt Shumer at the beginning of the pod, and then we kept talking about everything else, like Gemma and Gemini.</p><p>[00:03:24] <strong>Alex Volkov:</strong> So the first hour of this is going to be an interview with the Groq folks, specifically with Mark Heaps, and the next hour afterwards is going to be the deep dive into topics. If you're listening to this on Apple Podcasts, for example, you should be able to just view chapters and skip to a chapter that you'd prefer.</p><p>[00:03:51] <strong>Alex Volkov:</strong> I want to just do a quick recap of ThursdAI for February 22nd, everything we've talked about today. We started the space with, I guess, Matt Shumer and Mark Heaps from Groq, and that's Groq with a Q at the end, not Grok with a K at the end, so not like X.ai's Grok. Groq exploded on our timelines recently with just incredible viral videos of them performing LLM inference on Llama 2 70B and Mixtral with around 400 or 500 tokens a second, which is</p><p>[00:04:34] <strong>Alex Volkov:</strong> five times as much as the previous super fast API inference that we've seen from Perplexity and from Together. And they're serving, like, Llama 2 70B with 500 tokens a second. And so we've had Mark from Groq talk to us for almost an hour about how this is even possible. So we had a very nice deep dive with Mark, and definitely, if you missed this, please check it out in the recorded portion as well.</p><p>[00:04:58] <strong>Alex Volkov:</strong> And then we also had Matt, who works at HyperWrite, and he's been playing with these tools, and he told us about the demos that he was able to build, and how much of a difference this speed of inference makes. 
We've talked about their custom chip called the LPU, and we've talked about the fact that the company's been around for a while, and they did not expect this explosion in virality, but they're very happy that they chose this direction correctly.</p><p>[00:05:21] <strong>Alex Volkov:</strong> Very great interview, great conversation, and I invite you to listen to this as well. We covered that Google image generation is now in hot waters, and was reportedly paused, because it's injecting prompt stuff that's, let's say, not that great. And many people noticed that historical figures are being generated in different races, and different multicultural adjustments are happening to your prompts, which is not great.</p><p>[00:05:46] <strong>Alex Volkov:</strong> This blew up on Twitter, and even outside of Twitter, I think folks started writing about this in actual media, enough so that Google took down the image generation of people while trying to figure out what to do with this. But we also gave props to Google for releasing Gemma. 
Gemma is an open weights 2 billion and 7 billion parameter model, and we've talked about Gemma, we gave Google props for releasing open weights for us, and we had folks here on stage telling how it's still yet to be decided how good the base model actually is, but it's very fine tunable, and we're waiting for the open source community to come together and fine tune the open weights Gemma from Google. And then we also covered the Gemini 1.</p><p>[00:06:29] <strong>Alex Volkov:</strong> 5 long context again, they released the 1 million token context window support, and many folks got access to this, and we saw for the past week people playing and doing all kinds of stuff, including Matt Shumer, who I just mentioned. He also got access, so he gets all the cool toys, and he was able to put three Harry Potter books in one prompt and ask the model, with perfect recall, who said what. And this could have been part of whatever existing knowledge, but he was doing this more for a demo. We also saw demos of people putting an hour long video in the prompt, which is around five or six hundred thousand tokens, which sounds ridiculous that it supports it, and the model was able to understand this whole video and tell you which scene happened when, with almost near perfect precision.</p><p>[00:07:13] <strong>Alex Volkov:</strong> And we've talked about how this changes the game for multiple things, and we're gonna keep updating you about these long contexts. And we also brought this to Groq and said, Hey, are you gonna support long contexts with your insanely fast speed of inference? We also covered that Nous Research released a new DPO fine tune, which is better in every possible benchmark on top of their already strong flagship models, which is great.</p><p>[00:07:37] <strong>Alex Volkov:</strong> And I covered that I went to San Francisco to host an event with a16z and Nous Research and Mistral and Ollama and a bunch of other folks, and it was a great event. 
And I shout out to the a16z folks for hosting this and inviting me there as well. And then the last thing we've covered is two AI art and diffusion things, where ByteDance released SDXL Lightning, which generates 1024px super high quality images in just two or four steps, and they look incredible and are super fast to generate as well.</p><p>[00:08:08] <strong>Alex Volkov:</strong> I've talked about the demo that I built with them, and I've talked about this example that Fal.ai has, where you can go to fastsdxl.ai and just type, and as you type, the image generates on the fly with around 300 milliseconds of inference time, which feels real time and feels quite incredible. And following that, we have breaking news today from Stability announcing Stable Diffusion 3,</p><p>[00:08:30] <strong>Alex Volkov:</strong> which is a diffusion transformer, which we've covered before, a diffusion transformer based image generation model from Stability. They announced a waitlist that you can go and sign up for right now. And it looks like it's significantly better at following very complex prompts, like multiple objects and colors and everything in one prompt.</p><p>[00:08:47] <strong>Alex Volkov:</strong> This is everything we've talked about on ThursdAI</p><p>[00:08:49] Introduction and Welcoming Remarks</p><p>[00:08:49] <strong>Alex Volkov:</strong> Yes.</p><p>[00:08:55] <strong>Alex Volkov:</strong> All right, folks, you know the sound. Those of you who come back week after week, you know the sound. This is ThursdAI. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. And I'm joined here on stage, from week to week, by experts, friends of the pod, and new folks who actually were in charge of the news that we're going to talk about today. And today is February 22nd, only February 22nd, and already so much happened this year with AI. 
Last week was crazy, this week was less crazy than last week, but still, so much to talk about.</p><p>[00:09:36] Introducing the Co-Host and Guests</p><p>[00:09:36] <strong>Alex Volkov:</strong> And I'm delighted to have my co host Nisten here. Hey Nisten, what's up?</p><p>[00:09:43] <strong>Nisten Tahiraj:</strong> Hey everybody,</p><p>[00:09:44] <strong>Alex Volkov:</strong> How's your week?</p><p>[00:09:45] <strong>Nisten Tahiraj:</strong> It's been the usual, just up until 2 or 3 a.m. on random Twitter Spaces, because sometimes stuff gets pretty,</p><p>[00:09:57] <strong>Nisten Tahiraj:</strong> it's pretty exciting.</p><p>[00:09:58] <strong>Alex Volkov:</strong> Yep, stuff gets pretty exciting from week to week. I also want to say hi to Matt Shumer, joining us for a brief period. Matt, you've been all over my feed this week. How are you doing, buddy? You've been here before, but folks may not remember you. So please introduce yourself briefly, and then we'll chat.</p><p>[00:10:16] <strong>Matt Shumer:</strong> Hey man, thanks for having me.</p><p>[00:10:17] Introduction of Otherside AI</p><p>[00:10:17] <strong>Matt Shumer:</strong> Yeah, so I'm co founder and CEO of Otherside AI. We are the creators of HyperWrite, which is one of the largest AI writing platforms. And we've also been exploring the agent space for a couple of years now, about a year publicly, creating AIs that can actually operate your computer.</p><p>[00:10:31] <strong>Matt Shumer:</strong> As I mentioned, unfortunately, I do only have 10 minutes. I will potentially be able to join back up after, so I'm really sorry about that. It's been a crazy day, but excited to chat in the time that I have.</p><p>[00:10:42] <strong>Alex Volkov:</strong> Alright, awesome. Thanks for joining. And then I think we'll just jump into the conversation. 
And I want to say hi to a new guest.</p><p>[00:10:50] Introduction of Mark from Groq</p><p>[00:10:50] <strong>Alex Volkov:</strong> I haven't talked with Mark before. Mark, feel free to unmute and let us know some of your background and where you're joining from, and then we're going to talk about the stuff that we're here to talk about.</p><p>[00:11:05] <strong>Mark Heaps:</strong> Yeah, how's it going guys? Thanks for letting me join the space today, and glad to see some familiar names from all the craziness this week. Yeah, I'm the Chief Evangelist and Head of Design, Brand, and Creative over at Groq, which is probably a bit of a non normative title for folks that are so deep in the AI developer space, but we actually do a lot of the technical side too, so glad to be here.</p><p>[00:11:28] <strong>Alex Volkov:</strong> Awesome. And so folks who are listening, that's Groq with a Q at the end. So not X's Grok. And you guys have been around for a little longer than them. But just in case folks get confused, there's like a few confusion points here. And I think this is a good start for our conversation today. And I wanna turn this to Matt, because Matt, you're the first person who I saw post about Groq this week, and some of your stuff got a bunch of attention. So give us like a brief overview, like what you saw that made you post. And then we're going to talk about this insane speed, and then maybe turn to Mark into how it actually is done.</p><p>[00:12:02] <strong>Alex Volkov:</strong> So what is, where did Groq come from? Like, how'd you get to it? And how viral did you actually get?</p><p>[00:12:08] <strong>Matt Shumer:</strong> Yeah, it's a funny story. 
I actually found Groq I'd say more than a month ago, and immediately I was blown away. I think my co founder posted actually a text I sent to him, and I was like, you have to f*****g try this thing right now, it's incredible. And he did, and he was blown away too. I actually went and posted about it back then, but it got no traction, I think I deleted it or something, and I was just letting it marinate in my mind what was possible here. But I wasn't sure if this could scale. Obviously this week proved that wrong, clearly it can, but I was still just thinking about it. And then I was just on the train, my girlfriend and I were just sitting there on Sunday, and she just fell asleep, so I was like, what am I going to do right now? And for some reason, I thought of Groq, and I was like, okay, let's just post about it again, see what happens. And for some reason, this time, by the time I got off the train, it was going crazy viral.</p><p>[00:12:55] <strong>Matt Shumer:</strong> Sunday night was fun, I was up pretty much all night just managing the blowup from this. Finally fell asleep by the morning, I woke up to a timeline filled with tweets about Groq, and for good reason, right? This thing is incredible, and it's going to change how we think about how we work with LLMs, what they're capable of, the ability to do tons of reasoning, right?</p><p>[00:13:16] <strong>Matt Shumer:</strong> All of that is now going to change, and a lot more is now possible. The one thing I wasn't sure about was, would this thing go down, right? With all this usage, would this thing go down? And it hasn't, right? There was a brief time where there was a little bit of delay, but more or less it pretty much stayed up the entire time, which is crazy, through all of this, and they weren't prepared for that, which was incredibly impressive, and I think it's a testament to how good the hardware is.</p><p>[00:13:41] <strong>Matt Shumer:</strong> It's just exciting to see. 
I actually spoke with Jonathan, the CEO of Groq, yesterday, and he said that something like 300 developer API requests were submitted prior to the tweet. Now they're getting like 3,000 a day or something, which is insane. Using that as a proxy for how many people must be trying the actual API, and then combine that with the demos I built that are getting thousands of hits every day, their servers are still clearly standing, which is,</p><p>[00:14:06] Exploring the Impact of Groq's Speed</p><p>[00:14:06] <strong>Alex Volkov:</strong> So what was impressive to you? I think we're dancing around the issue, but for folks who didn't see your viral tweets, what was the head explosion moment?</p><p>[00:14:14] <strong>Matt Shumer:</strong> You have Together AI, you have Hugging Face Inference, you have vLLM, all this stuff, right? You're getting, on, let's say, Mixtral, if you're doing incredibly well, 100 tokens per second or something, right? Most people aren't reaching that, and that number may be off by a little bit, but at a high level, you're getting around there with any pretty standard model today, if you're doing well.</p><p>[00:14:34] <strong>Matt Shumer:</strong> Now, going above 200? Unheard of. 500? Ridiculous, right? And that's where Groq sits, right? They've essentially developed a chip that enables these language models to be far faster. And when you see 500 tokens per second versus, let's say, 50 or 100, it is not just a small difference, right?</p><p>[00:14:52] <strong>Matt Shumer:</strong> This is like a step change in what is possible with what you can build with them. And that's what turned my head, right? It's not just faster inference, it's a whole other paradigm that you could build on top of right now, right? When you have inference that's fast, you can then do 5 to 10x the reasoning in the same amount of time.</p><p>[00:15:10] <strong>Matt Shumer:</strong> How much better does the answer get with the same LLM if you do that? 
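</p><p>Matt's point about fitting more sequential reasoning into the same wall-clock time can be sketched with simple arithmetic. The token rates and latency budget below are illustrative assumptions for this sketch, not measured Groq benchmarks:</p>

```python
# Back-of-the-envelope: how many sequential reasoning passes fit in a
# fixed latency budget at different inference speeds. All numbers are
# illustrative assumptions, not measurements.

def passes_in_budget(tokens_per_pass: int, tokens_per_second: float,
                     budget_seconds: float) -> int:
    """Sequential LLM calls that fit in the budget (ignoring network overhead)."""
    seconds_per_pass = tokens_per_pass / tokens_per_second
    return int(budget_seconds // seconds_per_pass)

# A 500-token draft/critique/revise step, inside a 5-second interactive budget:
print(passes_in_budget(500, 50, 5.0))    # slower serving -> 0 full passes
print(passes_in_budget(500, 100, 5.0))   # fast GPU serving -> 1 pass
print(passes_in_budget(500, 500, 5.0))   # Groq-class speed -> 5 passes
```

<p>Under these assumed numbers, 500 tokens per second turns a one-shot answer into five chained draft-and-revise passes in the same user-facing wait, which is the "5 to 10x the reasoning" point above.</p><p>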
You could do interfaces that are created for you in real time. You don't have to wait. For example, right now the HyperWrite platform is probably one of the best sort of conversational platforms with web search built in, but you still have to wait for it to go and execute the web search, come back, write the response, think through what it needs to do.</p><p>[00:15:28] <strong>Matt Shumer:</strong> What happens if that's instant? That changes everything. That's what got me super interested. Curious what others think about it though.</p><p>[00:15:35] <strong>Alex Volkov:</strong> Yeah, I wanna chime in here. Thank you, Matt. I saw your tweet and immediately went, what? And also, a day before I saw your tweet, and we're going to talk about long context, and maybe after you're gone, maybe you'll come back as well. But a day before I saw your tweet, I posted something where folks were complaining about kind of the long context with Gemini 1.</p><p>[00:15:53] <strong>Alex Volkov:</strong> 5 Pro with the million tokens, saying, oh, it's going to take too long, it's going to cost too much, et cetera. And I posted something like, that's not going to be the truth forever. These things are coming down faster than people realize, and I think those things together, just one after another, showed me how fast we're moving, how incredible this is, because, and we're gonna talk about long context here in a second as well, but immediately, a day after, I saw your tweet, and I was like, oh, there's an example.</p><p>[00:16:18] <strong>Alex Volkov:</strong> This is exactly what we're talking about. I just didn't expect it to take a day. So I want to turn the conversation to Mark real quick.</p><p>[00:16:24] Discussion on Groq's Custom Chip</p><p>[00:16:24] <strong>Alex Volkov:</strong> Mark, you work at Groq. How long have you been there? Tell us about this custom chip you guys have. What's going on? How are you achieving this insanity? 
500 tokens a second for Llama 70B, which is quite big.</p><p>[00:16:38] <strong>Mark Heaps:</strong> Yeah, happy to. And Jonathan actually called me and told me that he spoke to Matt yesterday, and I said, I think we owe Matt a very nice steak dinner and maybe a little bit more than that. I also didn't sleep at all that night because there were so many requests coming in, and Matt's right, we weren't really ready for it.</p><p>[00:16:55] <strong>Mark Heaps:</strong> We were literally just discussing, the day before, what are some other demos we can do? What are some things we can show people with the speed? And then all of a sudden, Matt did a post, and then a number of other people that follow him started doing posts. And next thing I know, people are making their own video demos, and it blew us all away.</p><p>[00:17:11] <strong>Mark Heaps:</strong> We're like, wow, this is amazing. I owe a big thanks to the community that have jumped on this. This is the magical moment. I think anybody that's worked in tech has seen this before. I've been working in tech for about 30 years. And there's this rubber band effect where one end pulls forward and then you have the whiplash from the other side.</p><p>[00:17:30] <strong>Mark Heaps:</strong> And software developers have been doing an amazing job in AI for the last couple of years, trying to find more efficiencies, eke out better inference, trying to get that optimization anywhere they can. But classically what happens is you push that to a point where you start seeing a ceiling.</p><p>[00:17:46] <strong>Mark Heaps:</strong> And then hardware comes along and says, Oh, you're driving the car at max speed? Let me just give you a new engine. Let me give you something that speeds that up. And we've seen people saying that they have an inference engine. 
But ultimately they're really just brokers of other forms of cloud compute.</p><p>[00:18:01] <strong>Mark Heaps:</strong> And then, again, eking more capability out of it through software. And Groq was completely different. I've been there now for about four years. And I remember when I originally met the CEO, Jonathan, I said, why does anybody need to do this? And he told us the story about him creating the TPU over at Google.</p><p>[00:18:19] Exploring the Future of AI with Groq</p><p>[00:18:19] <strong>Mark Heaps:</strong> And it was a pretty interesting moment. Jeff Dean had told the team at Google, Hey, we've got really good news. We figured out how to get AI working and get these certain services working, like image and speech, etc. But the problem is it's going to cost a fortune to expand our data centers to be able to handle this capacity.</p><p>[00:18:36] <strong>Mark Heaps:</strong> And then they realized they needed to invent a new chip to do that. We're seeing that repeat itself right now, where there was this low latency ceiling for everybody in regards to incumbent or legacy solutions. And he knew from day one that everybody was out there training models for years.</p><p>[00:18:53] <strong>Mark Heaps:</strong> And he said, one day, this is all going to turn around, and everybody's going to want the world's fastest inference latency. And he didn't know exactly where that was going to be a product fit, but he did know that was going to be the problem statement. So that's what they started with.</p><p>[00:19:06] <strong>Mark Heaps:</strong> And it's a radically different architecture, totally different methodology and approach. It's been a really fun journey learning about that architecture.</p><p>[00:19:14] <strong>Alex Volkov:</strong> And the public demo that you have, that's very easy for folks to just go and test this out on their own. 
To be honest, it's awesome that you have this, and I think it undersells the insanity of what this is. And I think when I hear about what Matt is building in the demos, and I had to play with this yesterday, I had to play with this myself to figure out what to do with this, because I saw many people react and say, hey, what's the point of 500 tokens per second when the reading speed of humans is, I don't know, 50 tokens per second, whatever. And I'm looking at this tweet and I'm just facepalming, I was like, what, do you not</p><p>[00:19:51] <strong>Mark Heaps:</strong> Oh, thank you.</p><p>[00:19:52] <strong>Alex Volkov:</strong> Do you not get what's going on? So I had to go and build something. I'll tell the folks in the audience, I used actually two technologies, and we're gonna talk about the second one today. I used two kind of super fast advancements that we had this week, one of which was Stable Diffusion SDXL Lightning from, I think, TikTok, I think released this.</p><p>[00:20:09] <strong>Alex Volkov:</strong> And I decided to just combine both of them, and I have a video, I'm gonna post it up on the show notes and on the stage right now. But folks, don't go looking at this right now, go look at this afterwards. And I basically figured, hey, if this is as lightning fast as this is, like 400 tokens a second, 500 tokens a second, basically instant, I can use whatever, Mixtral, or they have Llama 2 70B there, you have Mixtral, and hopefully we're going to talk about more models soon.</p><p>[00:20:37] <strong>Alex Volkov:</strong> And I can use this SDXL Lightning to just immediately deliver to me results. So I used Llama as my kind of prompt writer via Groq, and then I used this SDXL Lightning as my image generator, and I have a demo there where everything appears in real time. 
And it's quite powerful. And to the person who said, hey, the reading speed of people is 50 tokens a second,</p><p>[00:20:59] <strong>Alex Volkov:</strong> that person doesn't understand the impact of this. They will have an agent, for example, Matt was talking about agents and agentic stuff. The impact of this is just being able to build LLMs into every possible nook and cranny of software. I just wanted to highlight that I had to play with this to understand, really.</p><p>[00:21:14] <strong>Alex Volkov:</strong> And, Mark, maybe let me ask you, what kind of inventive demos and stuff did you see coming up from folks, specifically around the fact that some of this stuff would not be very helpful with slower inference speed? Did you see any cool examples of your own? Did you guys, in Slack, send examples to each other?</p><p>[00:21:32] <strong>Mark Heaps:</strong> Yeah, there's been a lot of chatter at Groq, and I think Matt's was the first one that kind of blew me away. He built a demo, and then I think his second demo was this one that wrote a novel, and it wrote it in like under a minute or something.</p><p>[00:21:45] <strong>Alex Volkov:</strong> You want to tell us about this before you drop off? Because while I got you here, I would love to hear.</p><p>[00:21:56] <strong>Matt Shumer:</strong> Yes, because I wanted to go to sleep. And I knew I had to get it done, and I wouldn't have slept if I didn't. So that was this answers engine, very similar to Perplexity. 
The idea there was, Perplexity's got this incredible embeddings based system, likely, and it's really fast and allows you to answer questions really quickly. So anybody going up against them, they can't exactly do that, because without that engine it's going to be way slower. But with an LLM that's as fast as Groq's hosting of it, you can essentially do it in the same exact time or even faster, while waiting for a pre built search API to come back with results.</p><p>[00:22:28] <strong>Matt Shumer:</strong> And it worked. So basically, obviously after time it got a little slower because a ton of people were using it, but at the beginning it was like a second to answer for a very complex question. You could have it write a long thing based on something. So basically a really good answers engine. That was the first one.</p><p>[00:22:42] <strong>Matt Shumer:</strong> The second one was writing a novel in a minute or something. That came from a repo that I open sourced, I want to say, almost a year ago now. And that was called GPT Author. Originally the idea was to use GPT 4 to write a novel for you. The quality is obviously okay, it was just an experiment to see where it went, but people really took to it, so I decided to rebuild it.</p><p>[00:23:02] <strong>Matt Shumer:</strong> With GPT Author originally, with GPT 4, it would take like 20 minutes to write, let's say, five chapters. The crazy thing is, with Groq, I added like three more layers of reasoning for each chapter, and yet it still completed in under like a minute or two. So that was pretty crazy. And then the third demo I released, which kind of went more viral than the rest.</p><p>[00:23:24] <strong>Matt Shumer:</strong> That was basically a code tool that refactors code and documents it. So basically, it's a very simple design. You paste in some code. We have one Mixtral prompt essentially suggest improvements. 
Based on those improvement suggestions and the original code, we have another Mixtral go and make those improvements.</p><p>[00:23:45] <strong>Matt Shumer:</strong> We display the diffs, and then, based on that, we have another Mixtral explain what happened and give the user an understanding of what happened, and then we have a fourth one go in and document it. And this all happens, if I were building this for production with today's models and today's systems, I would probably go and try to make some of it async so that it's faster for the user.</p><p>[00:24:05] <strong>Matt Shumer:</strong> But with this, I built it sequentially because I didn't even have to go and do that. It all still computed in a second. By the time I was done reading the code changes, or the suggestion of what it was going to do in the first place, it was already done refactoring the code, already done documenting the code, which is crazy.</p><p>[00:24:20] <strong>Matt Shumer:</strong> So that one did pretty well. Those are the three demos I made. Maybe I'll do some more in the coming days.</p><p>[00:24:24] <strong>Alex Volkov:</strong> That's incredible, dude, and I keep thinking about more use cases for this. Yesterday I used Cursor. Cursor is the editor, if you guys don't know, like an AI native editor, uses I think GPT 4 behind the scenes, embeds a bunch of stuff. And I haven't been able to play with Cursor fully until yesterday, and I played with this, and it has GPT 4.</p><p>[00:24:42] <strong>Alex Volkov:</strong> And I think they have specific faster access to GPT 4 if you pay, and we do pay. And I was playing with this, and I was getting support from my editor on my code, and it was slow, and I was like, I want it immediate. I want it instant. And I think that's what Groq promises.</p><p>[00:24:59] <strong>Alex Volkov:</strong> Mark, so let's talk about how you guys actually do this. You said something about the custom chip. 
Go into the secrets as much as you can, and also keep in mind that this is like a high level space here on Twitter. What's going on? Like, how are you able to achieve this? NVIDIA's earnings came out,</p><p>[00:25:15] <strong>Alex Volkov:</strong> they did like whatever insane numbers for the past year. Everybody's looking at A100s, H200s, whatever. What are you doing over there with new hardware?</p><p>[00:25:23] <strong>Mark Heaps:</strong> Yeah. The chip has actually been something we've been working on. The company was formed in 2016, and I think we taped out that chip, the first generation design, maybe two years after that. And it is totally different. And it's funny, people actually keep getting the category of the processor wrong online.</p><p>[00:25:41] <strong>Mark Heaps:</strong> It's a language processing unit, but people keep calling it a linear processing unit. And a lot of the engineers at Groq think that's fun, because they're like, technically, it is. It is a linear, sequential processing unit, right? And some of the key differences, at a high level, right? So it's not multi core like a GPU.</p><p>[00:25:56] <strong>Mark Heaps:</strong> It is single core. It was actually the world's first single core peta op processor, which, four or five years ago, that was a big deal. And it's still 14 nanometer silicon, which is a 10 year old version of a silicon die, whereas we're being compared to people that have silicon that's four and five nanometer.</p><p>[00:26:14] <strong>Mark Heaps:</strong> And we're completely fabbed in the U.S. It's readily available supply, so we don't have the challenges other folks have trying to get GPUs. But the part that's really cool, this is the thing that I geek out on, right? Is when you think about getting deeper into the development and stack.</p><p>[00:26:30] <strong>Mark Heaps:</strong> And you're trying to set up GPUs as a system. And I'm talking large, data center scale systems. 
You've got all of these schedulers and things that you have to manage with the GPU, and the data bouncing around in the way that it does, being multi core and using all these schedulers, is really what slows it down.</p><p>[00:26:46] <strong>Mark Heaps:</strong> It's really what gives it a latency ceiling. And with the design of the Groq chip, and if anyone's seen a picture side by side, it's beautifully elegant. It works in a way that when you connect all of these chips together, you could put thousands of them together, actually, and it will see it as one brain.</p><p>[00:27:06] <strong>Mark Heaps:</strong> So let's say that you realize for your workload you need 512 chips. You can tell it, hey, I need you to be one chip, and load your models that way. Or, if you wanted to run some things in parallel, like we've done with an application we have called Groq Jams that writes music in independent tracks, linear and parallel to each other</p><p>[00:27:26] <strong>Mark Heaps:</strong> so that they're perfectly synced, we can say, no, make those chips eight chips, because I want eight instruments, so I'm gonna use eight instrument models to do that. You can literally do that with one line of code in PyTorch, and you can refactor that way. And so this is the advantage that they've had with the way that they approach the chip design, and that in itself was probably the most radical thing that Jonathan and the team had at the inception.</p><p>[00:27:50] <strong>Mark Heaps:</strong> They decided, instead of designing hardware and figuring out how to improve hardware in a traditional methodology, they said, no, we're going to start with the software. 
We're going to actually design our compiler first, and then we're going to design the silicon architecture to map to that, so that it's completely synchronous, so that it's completely deterministic.</p><p>[00:28:10] <strong>Mark Heaps:</strong> We're going to build the compiler first, and we're going to make it so that no CUDA libraries ever need to be used, that you don't need to use any kernels. We're just gonna bake it all right in. And so this is where we've seen a lot of that efficiency gain and where we get all that extra power for low latency.</p><p>[00:28:28] <strong>Mark Heaps:</strong> And that's really been the fun thing. For anyone that isn't familiar with us, our early demos weren't AI related. In fact, during COVID we worked with one of the national labs, and they had a model that they were using to test drug compounds against proteins, seeing what drug would stick to a protein.</p><p>[00:28:48] <strong>Mark Heaps:</strong> And this was in an effort to try to find a vaccine, etc., during COVID. And their model at that time, from what the team told us there, would take three and a half days to get a result. Every time they put a new drug in, see if it sticks to the protein, okay, did it work? If not, move to the next one in the queue, and let's keep going.</p><p>[00:29:06] <strong>Mark Heaps:</strong> And that was this effort of trying to figure out what would work. It took us maybe six months back then, because we weren't as mature with the compiler. It took us about six months to get them actually having their simulation running on Groq. When they finally did it, they could do that same simulation in 17 minutes.</p><p>[00:29:23] <strong>Mark Heaps:</strong> So imagine the rate of acceleration to try to find a drug that could actually change the world at that time of crisis. They could do that on Groq in 17 minutes. So the orders of magnitude that we've been able to help people 
has just blown us away. We've done some things in cybersecurity with one of our customers in the</p><p>[00:29:39] <strong>Mark Heaps:</strong> U.S. Army. But now what we really realize is it's going to change the world for anybody that can take advantage of linear processing. And language is the ultimate linear application, right? You don't want to generate the hundredth word until you've generated the ninety ninth word. And Matt's example is amazing.</p><p>[00:29:56] <strong>Mark Heaps:</strong> Imagine that you can generate a story. You did it with generating a video after having the prompt being generated. My kids, I have a 12 year old son, he's a major gamer, and I showed him using Thappy, which is a voice tool online for generating voicebots. I showed him how to make NPCs with that, putting in character personas with no code, and it's running on Groq.</p><p>[00:30:18] <strong>Mark Heaps:</strong> And the low latency, he was having a really natural conversation, and he told me, he goes, Dad, I can't ever talk to Alexa or Siri or any of these again, he goes, it's so bad compared to this. So it's just a really exciting time, and the secret sauce of it is the chip.</p><p>[00:30:32] <strong>Alex Volkov:</strong> That's incredible. And I think you touched upon several things that I want to dive deeper into, but one specific thing is the voice conversations, the embodiment of these AIs, where it's still uncanny when you have to wait 800 milliseconds for a response. And I've seen like a YC demo of a company, and somebody said, oh, this is like the best thing ever.</p><p>[00:30:55] <strong>Alex Volkov:</strong> And it was like 100 milliseconds to an answer. And I'm looking at these 500 tokens per second. I'm thinking, this is like a near instant answer from a person, and probably a super, very smart person, probably faster than a person would actually answer. 
And it triggers something in my mind, where we're about to slow these down on the UI level, because the back end is going to be faster than people can actually talk to these things.</p><p>[00:31:19] <strong>Alex Volkov:</strong> Nisten, I see you're unmuting. Do you want to follow up? Because I bet you have a bunch of questions as well. And we should probably talk about open source and models and different things.</p><p>[00:31:29] <strong>Nisten Tahiraj:</strong> Yeah, so the one amazing thing here that we don't know the number of, so if the engineers could find out, there's something called the prompt eval time, or there's different terms for it. For example, on CPUs, that tends to be pretty slow, almost as slow as the speed of generation. On GPUs, it tends to be ten times higher or so.</p><p>[00:31:53] <strong>Nisten Tahiraj:</strong> For example, if you get an NVIDIA 4090 to generate stuff at 100 tokens per second, or about 100 words per second for the audience, the speed at which it reads the prompt and adds it into memory is often about a thousand or a few thousand. What I'm wondering here is that evaluation speed has to be completely nuts, because that's not going through some kind of memory, it just goes in the chip.</p><p>[00:32:21] <strong>Nisten Tahiraj:</strong> It stays in the chip. It doesn't spend extra cycles to go outside into memory. So the prompt eval time here has to be completely insane, and that enables completely different applications, especially when it comes to code evaluations, because now it can evaluate the code a hundred times against itself and so on.</p><p>[00:32:45] <strong>Nisten Tahiraj:</strong> So that's the amazing part I'm wondering about here, because you can dump in a book and it'll probably eat it in less than half a second, which is pretty nice. So yeah, one thing I'm wondering is how does this change the prompt evaluation time? 
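</p><p>Nisten's split between prompt evaluation and generation can be sketched as simple arithmetic. All rates here are illustrative assumptions for the sketch (the 200,000 tokens-per-second eval rate is just what "a book in half a second" would imply), not measured numbers from any vendor:</p>

```python
# Total LLM latency splits into prompt evaluation (reading the input)
# and generation (writing the output). Rates are illustrative assumptions.

def total_latency(prompt_tokens: int, output_tokens: int,
                  eval_tps: float, gen_tps: float) -> float:
    """Seconds to evaluate the prompt plus generate the response."""
    return prompt_tokens / eval_tps + output_tokens / gen_tps

# A book-sized 100,000-token prompt with a 500-token answer:
gpu = total_latency(100_000, 500, eval_tps=1_000, gen_tps=100)     # 100 + 5 = 105 s
fast = total_latency(100_000, 500, eval_tps=200_000, gen_tps=500)  # 0.5 + 1 = 1.5 s
print(gpu, fast)
```

<p>The point of the sketch: with a long prompt, total latency is dominated by the eval term, which is why Nisten is asking about that number rather than generation speed.</p><p>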
And what kind of other demos or actual daily uses are you hoping to see?</p><p>[00:33:08] <strong>Nisten Tahiraj:</strong> And can you tell us a bit more as to what your availability is in terms of production and</p><p>[00:33:15] <strong>Mark Heaps:</strong> server load. Yeah, absolutely. I think for the first one, I want to be a little [00:33:20] transparent about where Groq was at in regards to the input. When we first started building out the system and optimizing it, we really focused on token generation and not input, right?</p><p>[00:33:32] <strong>Mark Heaps:</strong> So that's where we thought everybody was focused. It's like Gen AI was blowing up everywhere. What can you make, what can you generate? And so we said, okay, the compiler team is working on things. Let's focus on optimization of the system, the LPU Inference Engine, at generation. And so we got this wildly fast speed, right?</p><p>[00:33:51] <strong>Mark Heaps:</strong> And I remember some people saying, oh, you'll never hit 100 tokens per second. We hit it, we did a press release. The team literally came back to us two weeks later and said, hey guys, we just hit 200. And I was like, what? And then all of a sudden we hit 300 and we're like, wow, we're generating really fast.</p><p>[00:34:04] <strong>Mark Heaps:</strong> And then we started meeting with some of these benchmark groups, like Artificial Analysis and others. And they were saying, no, the industry standard benchmarking ratio right now is 3 to 1 input to output. And we went, oh, we need to start optimizing for input. And so we've started working on that.</p><p>[00:34:21] <strong>Mark Heaps:</strong> And even that right now isn't at the exact same speed optimization as our output, and the teams are working on that at this time, but it's more than capable and it's on the roadmap; it's just a different focus for the group. 
So we're probably going to see, over the next few months, about another 10x on the input speed, which is going to be wild, right?</p><p>[00:34:42] <strong>Mark Heaps:</strong> Because now when you talk about conversation, a lot of the time humans blabber on, but you tell an agent to respond in a terse and succinct way. Now you completely flip and invert the ratio of what you're going to be able to have. So that's really exciting. And, from a use case standpoint, I actually had a really interesting use case that happened to me personally when I was on vacation with my family late last year.</p><p>[00:35:08] <strong>Mark Heaps:</strong> We were actually traveling in Puerto Rico when my son got stung by a lionfish. And it was really bad. We were like a hundred yards offshore, in like 60 feet deep water, and I'm trying to help him get to shore and he's like screaming, and I get on shore and the first thought in my head was of course call 911.</p><p>[00:35:25] <strong>Mark Heaps:</strong> And I went, oh my God, if I call 911, I'm going to get an operator. We're in this place that nobody can drive to. They'd have to helicopter us out. I was totally freaked out. And I ended up just going into the bot and saying, what do I do if someone gets stung by a lionfish? And in less than a second, I had a 10 step guide of what I should do.</p><p>[00:35:41] <strong>Mark Heaps:</strong> Things that I didn't know, right? Oh, keep his foot in the water. Don't rinse it with fresh water. That happened instantly. Now imagine the world that goes from having an emergency Band-Aid or burn kit in your house to having an emergency bot in your house who can help you in those situations.</p><p>[00:35:57] <strong>Mark Heaps:</strong> And so the speed at which it can read the input message and then give you advice back in the output is a complete game changer. And I think Alex nailed it; we've seen all these comments where people say, why do you need to generate this fast? 
They think of it as like a chatbot only, or like a reading only situation, but the reality is, and what we've known for a long time, is there's going to be a ubiquity of digital assistants.</p><p>[00:36:23] <strong>Mark Heaps:</strong> And I don't mean like an individual bot per se, but just AI being everywhere to help you. And so that's going to require a massive amount of speed for you to be able to slice that up across all these services. Like we hear people building with their demos, like Alex said earlier. So that's our goal, to serve that.</p><p>[00:36:44] <strong>Mark Heaps:</strong> And, Nisten, you asked about what's the goal. Right now, again, just being candid with everybody, we didn't expect this thing to go viral. This was not a marketing strategy. This wasn't us going out and paying a bunch of influencers. It just happened, and so the system has been really tested, and the amazing thing is it's held up, like Matt said.</p><p>[00:37:04] <strong>Mark Heaps:</strong> And so kudos to the engineering team for that. Where we're headed, our goal is that by the end of the year we want a token factory to be able to generate millions and millions of tokens per second as a capacity. And so that's the plan right now. In roughly 10 months, we want to be where OpenAI was at the end of last year.</p><p>[00:37:27] <strong>Mark Heaps:</strong> That's our goal right now. So we have those orders placed, that hardware is ordered, and we're building and increasing the capacity every week.</p><p>[00:37:33] <strong>Alex Volkov:</strong> That's awesome. And so let's talk about models. You guys are serving LLAMA 2 70B. And we hear rumors about the next LLAMAs at some point soon. And I think Mark Zuckerberg even actually said that they finished training LLAMA 3 or something. We don't have insider knowledge here.</p><p>[00:37:48] <strong>Alex Volkov:</strong> We're just speculating. And then also, obviously, Mistral is releasing incredible models. 
You guys have Mixtral in there. There's speculation that the Mistral Next that LMSys has access to is this incredible, GPT 4 level model. So you guys are relying on open source models, and those models are trained on other hardware.</p><p>[00:38:07] <strong>Alex Volkov:</strong> Do you guys also have training built in, or is this only for inference? And what are the plans for also training models? Because speeding up training would help the world at least as much as speeding up inference.</p><p>[00:38:18] <strong>Mark Heaps:</strong> Yeah. So let's tap into a few of those. So first, we love the open source community. It was a big inspiration for why Jonathan left Google, where he was wildly successful, and said, we need to go start another company. And he wanted to make sure that the world and the developer community had access to AI technologies to accelerate development.</p><p>[00:38:38] <strong>Mark Heaps:</strong> He literally calls this the haves and the have nots. And at that time, he said, look, it looks like Google, Amazon, Microsoft, and a couple of governments are going to swallow up all of the AI technology in the world. He's like, that's not going to be fair. He's like, we need to democratize AI and access for all.</p><p>[00:38:55] <strong>Mark Heaps:</strong> And so, let's make a chip. And I remember him telling me this four years ago, he goes, I'm going to create a company where people can literally have access to the most advanced AI in the world, and do it with a credit card from their home. He goes, that's what I want to see happen. And so that's always been his vision.</p><p>[00:39:11] <strong>Mark Heaps:</strong> And we're on that path right now. Now there's the explosion of the open source community, and I think Meta deserves a lot of credit here. 
ChatGPT was blowing up, OpenAI was doing their thing.</p><p>[00:39:22] The Unexpected Success of Llama 1</p><p>[00:39:22] <strong>Mark Heaps:</strong> And Meta, which is obviously a massive corporation, private, and in it to make money.</p><p>[00:39:28] <strong>Mark Heaps:</strong> They said, no, we're going to make Llama available to everybody. And we didn't have a relationship with them. I think everybody knows Llama 1 got leaked, and one of our engineers got ahold of it and said, hey, I'm going to see if I can fit this to the chip. It wasn't even on our roadmap. And then they got it running in less than 48 hours.</p><p>[00:39:45] <strong>Mark Heaps:</strong> And then from there we advanced on it. And so that was an amazing, lightning bolt moment where we said, hey, what else can we do with this?</p><p>[00:39:52] The Evolution of Model Compilation</p><p>[00:39:52] <strong>Mark Heaps:</strong> And at that time, I think we had maybe 200 models from Hugging Face compiled for our system. And today, I think we're well over 800.</p><p>[00:40:02] <strong>Mark Heaps:</strong> And we just keep pulling from the repos there and building them into the compiler. But we're watching very closely now what are the models that people want. We had Vicuna up for a little while, and we saw that on the LMSys leaderboard. We've played with Mistral 7B.</p><p>[00:40:16] Exploring the Power of Mistral 7b</p><p>[00:40:16] <strong>Mark Heaps:</strong> If anybody wants to see real speed, go watch my video on YouTube on the Groq channel about Mistral 7B. It gets over a thousand tokens per</p><p>[00:40:24] <strong>Alex Volkov:</strong> Are you serious? 
Wow.</p><p>[00:40:26] <strong>Mark Heaps:</strong> Yeah. The max I've hit with it, I was just doing a conversational bot with it, and I hit 1140, and it was insane.</p><p>[00:40:34] The Excitement Around Google's Gemma</p><p>[00:40:34] <strong>Mark Heaps:</strong> And now there's this announcement from Google about Gemma, which I think is like 8 billion.</p><p>[00:40:38] <strong>Mark Heaps:</strong> And the team is already like, oh my God, what could we do with Gemma at that size? The speed is going to be through the roof. And then Jonathan, our CEO, is traveling right now, and he was actually at the Mistral headquarters in France a few days ago. And they were talking to him about the next model and kind of what that looks like.</p><p>[00:40:58] <strong>Mark Heaps:</strong> And he very much wants that to be running on the LPU inference engine at Groq.</p><p>[00:41:02] The Future of Groq's LPU Inference Engine</p><p>[00:41:02] <strong>Mark Heaps:</strong> So it's an exciting time to get into these open source models. And we're just happy that we can sit back and say, hey, how do we help you guys? Because ultimately, the people building the models, doing the training,</p><p>[00:41:13] <strong>Mark Heaps:</strong> we want to enable them with this speed.</p><p>[00:41:16] Groq's Stance on Training</p><p>[00:41:16] <strong>Mark Heaps:</strong> You asked a question about whether we do training. We don't. We don't offer training. We don't do training. We have had one customer actually do it. That was related to that U.S. Army cybersecurity project. They actually trained their quantum algorithms using Groq hardware.</p><p>[00:41:30] <strong>Mark Heaps:</strong> But it's not something we do, and it's not our business model. And Jonathan has always had this vision. He said, look, the world already has a bazillion training providers, and [00:41:40] most people are quite comfortable with the pace of training, and this is going back to 2016, 2017. 
He said, let's recognize that if all these companies are training models, and yet there's no real clear winner in the inference solution, let's just focus our business efforts there.</p><p>[00:41:55] <strong>Mark Heaps:</strong> He does have a vision. It's not on our roadmap right now, but he does have a vision</p><p>[00:41:59] The Potential of Live Training Through Inference</p><p>[00:41:59] <strong>Mark Heaps:</strong> of what you could do with this sort of cyclical live training through inference, where it's actually being trained live in the moment and feeding back to itself, right? And this gets you into a multitude of layering techniques that we've been considering and testing at Groq.</p><p>[00:42:14] <strong>Mark Heaps:</strong> I could see us getting into training in the future, but only when it is advantaged by that real time insight of training.</p><p>[00:42:22] <strong>Alex Volkov:</strong> And Nisten, just before, let me jump in super quick. I want to follow up with something that you said, that 7B Mistral is flying at over a thousand tokens a second. And that's obviously incredible. Just mind blowing incredible. And in my head, what I'm super excited by is not the smaller models, because I can run the smaller model on my Mac at 20, 30 tokens a second and get like a full whatever.</p><p>[00:42:45] <strong>Alex Volkov:</strong> I'm excited about the incredibly intense long context requirements that we've seen. We talk about open source, and we often have the folks from Nous Research here on stage, the authors of the YaRN paper, who have been able to take LLAMA's 4,000-token context window and extend it to 128k.</p><p>[00:43:03] <strong>Alex Volkov:</strong> And we never used it. 
We were never able to use LLAMA at 128k tokens because it was extremely slow.</p><p>[00:43:09] The Power of Groq's Speed in Long Context</p><p>[00:43:09] <strong>Alex Volkov:</strong> And I'm thinking: are you guys bringing us long context, like for real, for open source models? Because we haven't yet been able to actually use them as much. Because however big the model is, if you can run it faster, it will average out, and we'll be able to get open source models.</p><p>[00:43:22] <strong>Alex Volkov:</strong> Have you guys played with long context yet? Have you seen the incredible stuff from Gemini 1.5 releasing 1 million tokens, for example? Something that probably only Google can pull off with their TPU farms. How are you thinking about that as an advancement, as a competitive edge, for something that only you could do?</p><p>[00:43:37] <strong>Mark Heaps:</strong> Yeah, the team is actually looking at that right now, and I think, again, it's early stages. Our first foray into a larger length was actually Mixtral with a 32k sequence length. And so far we haven't seen any use cases where people are actually taking advantage of that full length, but we know that it's coming.</p><p>[00:43:54] <strong>Mark Heaps:</strong> And the moment that Gemini 1.5 got announced with the million token length, the team immediately got together and said, okay, how would we do this? And they've started architecting what scale of system we would need for that. So that's part of the plan, in parallel with what I was saying earlier, that we really want to get to a place where we're this massive token factory by the end of the year.</p><p>[00:44:14] <strong>Mark Heaps:</strong> And that's getting us into the more than 10 million to 20 million tokens per second from the system in that capacity. So we're definitely looking at that. I think what's going to really dictate it for us, because we're again sitting back and saying, how do we help? 
And what we're watching is: what are the business use cases?</p><p>[00:44:33] <strong>Mark Heaps:</strong> So say someone says, hey, we want to use a model that has a million-token context sequence length, but you find out they're really, on average, only using 50k for their application. This is that advantage I was talking about earlier, where we can dial the system forward or backward using a single line of code.</p><p>[00:44:50] <strong>Mark Heaps:</strong> We can figure out what is the length that they need, and then dial that in for that customer account. We're actually doing a little bit of that right now with Mixtral. You guys mentioned we have the free version on our website that people can play with through GroqChat, and then there's the API access. Right now, as everyone's playing with it and just treating it as a chat agent, we're recognizing that we've got this thing loaded for 32k Mixtral.</p><p>[00:45:12] <strong>Mark Heaps:</strong> And yet the average we see being generated in GroqChat is around 900. At that scale, we're like, hey, why don't we increase the capacity of the system and speed this thing up a little bit? Let's drop the sequence length for the free GroqChat service, but leave it at the longer sequence length for the API users. And that's really easy for us to do.</p><p>[00:45:32] <strong>Mark Heaps:</strong> That's flipping a switch, in some ways.</p><p>[00:45:36] The Importance of Community Feedback</p><p>[00:45:36] <strong>Mark Heaps:</strong> So we're just waiting for the open source model community to really tell us, oh, this is the size that we could really take advantage of.</p><p>[00:45:43] <strong>Alex Volkov:</strong> Awesome. So you guys found the right place. The open source model community often ends up on ThursdAI to talk about their advancements. So I'd be more than happy to introduce you to the guys who are doing open source papers on long context as well. 
They often join here, and they would be very happy to help and figure out what's possible, especially because training those models is hard, but then running inference is even harder.</p><p>[00:46:07] <strong>Alex Volkov:</strong> Nisten.</p><p>[00:46:08] <strong>Mark Heaps:</strong> Way harder.</p><p>[00:46:08] <strong>Alex Volkov:</strong> Yeah, Nisten, go ahead.</p><p>[00:46:11] <strong>Nisten Tahiraj:</strong> Yeah, so one thing I'm wondering about is, so first of all, it's extremely impressive that these models are running at full precision, and they're not even starting to take advantage of some of the handmade stuff that people made to get them down to phone size and still perform well. So that hasn't even been explored yet, and that can reduce the size by a factor of four and have exponential improvements.</p><p>[00:46:36] <strong>Nisten Tahiraj:</strong> So what I'm wondering is, as you guys expand and as you adopt, whether you adopt our models or not, how much work is it to take something like LLAMA or Mixtral and then adapt it to more of that JAX-like stack that you guys have? So yeah, that's the part that I'm wondering about: how much work is it for companies to adapt their own models, or something custom that they've made, to this? Because I see some incredibly interesting stuff. And, sorry, I'm rambling on a little bit, but I think even for training you can make models, or model parts, that fit under 220 megabytes, and then you can train those individually.</p><p>[00:47:22] <strong>Nisten Tahiraj:</strong> So there is stuff to explore there. I just think it's still pretty new, so there haven't been enough people taking a crack at it. 
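</p><p>The size reduction Nisten alludes to is simple back-of-the-envelope arithmetic: weight memory is parameter count times bits per weight. A minimal illustrative sketch (weights only, ignoring activations and KV cache; the 7B figure is just an example, not a claim about Groq's setup):</p>

```python
def model_bytes(n_params, bits_per_weight):
    """Approximate weight memory in bytes, ignoring runtime overhead."""
    return n_params * bits_per_weight / 8

# Illustrative: a 7B-parameter model at different precisions.
n = 7_000_000_000
fp16 = model_bytes(n, 16)  # 16-bit weights: 14 GB
int4 = model_bytes(n, 4)   # 4-bit quantized: 3.5 GB, the ~4x reduction mentioned
print(fp16 / 1e9, int4 / 1e9)
```

<p>Going from 16-bit to 4-bit weights is where the "reduce the size by a factor of four" comes from.</p><p>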
But yeah, how much work is it to take an open source model, or something custom that people made, and adapt it to work on Groq's hardware?</p><p>[00:47:40] <strong>Nisten Tahiraj:</strong> That's my question.</p><p>[00:47:41] <strong>Mark Heaps:</strong> Yeah, it's a great question. Thanks, Nisten. So I think there's a really good paper everyone should check out if you're interested in this, if you go to Groq.com/docs. We've got a huge doc repo there. And one of the earlier articles that we produced from the compiler team is called Developer Velocity, and it's been a focus from day one.</p><p>[00:48:00] <strong>Mark Heaps:</strong> We did some research when we were building out the product, building out the service, and we found out how long it takes for a lot of companies to get a model up and running, especially if it was their own model. If you were a smaller company, let's call you an SMB, sub-5,000 employees,</p><p>[00:48:15] <strong>Mark Heaps:</strong> they were typically spending six to nine months to get a model into production where they were using it. The larger companies, Microsoft, those guys, they're doing it in 30 to 45 days. And so we set this goal saying we don't want any customer ever to need more than a week to get their model up and running on Groq.</p><p>[00:48:34] <strong>Mark Heaps:</strong> And ideally we'd like it to be in 24 hours. We're actually going to test the team on that when LLAMA 3 gets released. We're going to see, from the day everybody has access to it, how fast we can get it up and running. And I'm hopeful we're going to see a demo with it literally that day or the next day.</p><p>[00:48:49] <strong>Mark Heaps:</strong> It's not a lot. We're using standard frameworks, right? So we're PyTorch, ONNX, TensorFlow, everything is pretty standard. 
The thing that we spend a lot of time doing, and this is what slowed us down a little bit when Llama 2 came out: I did a video with Bill Ching, a member of our compiler team.</p><p>[00:49:06] <strong>Mark Heaps:</strong> He's a brilliant guy, super funny. He'll tell you in the video, I didn't spend time getting it to fit to Groq. I spent time removing all of the code and components that were built in for GPUs. Basically, he spent time scrubbing, not building. And what happens is, because the community is already so weighted towards building for GPUs, that's what takes us the most time.</p><p>[00:49:30] <strong>Mark Heaps:</strong> We've got to strip all that stuff out because it slows it down. Again, we don't have those schedulers. We don't have those components. That's the biggest thing for us in the way that we get things running. But even custom models that we've had from the national labs and the research groups, we had one that was for a tokamak nuclear fusion reactor.</p><p>[00:49:48] <strong>Mark Heaps:</strong> It was a control system. And even that we got running in, I think it was less than 10 days. And it was a completely custom build, and our compiler was not as mature at that time. Again, it's one of those [00:50:00] things where our goal is to get it down to where it's same day applicable.</p><p>[00:50:03] <strong>Mark Heaps:</strong> We're a ways off from there, but right now we're trending less than a week for everybody.</p><p>[00:50:09] <strong>Alex Volkov:</strong> Mark, I want to follow up with a use case, as you guys were talking about converting models, and we see models getting released from all these finetuners. 
We have a bunch of folks here who finetune models after an open source release, and many of them have switched to releasing their models in the safetensors format, the standard one, but also in quantized formats, so that people can actually download the smaller quantized versions and run them on their Macs.</p><p>[00:50:33] <strong>Alex Volkov:</strong> And if you guys support this, I can absolutely see a day where folks are releasing them also on Groq or GroqChat or whatever, just for folks to be able to experiment with longer context. As a follow up on the longer context question: you said what we see in the chat.</p><p>[00:50:49] <strong>Alex Volkov:</strong> Yeah, the chat is not optimized for pasting in a bunch of stuff. I would be cautious about judging by that, because personally, I guess I got access to the API, but when I get access to longer context, for example, I would absolutely think about, hey, what is possible now?</p><p>[00:51:08] <strong>Alex Volkov:</strong> And somebody commented in the comments that coding is the main use case where long context really matters. Because what happens right now is everybody's focusing on RAG. And we've had this conversation, RAG versus long context, I think since a year ago, since the context lengths were 4,000 tokens, then 5,000, then 12, then whatever.</p><p>[00:51:25] <strong>Alex Volkov:</strong> And then Mosaic came out with 60 and we were very excited. And we've had this conversation since then of what performs better. 
And I think that's one of the two main reasons. I don't know about cost, and we probably should talk about cost, but on infra speed, you guys are making some incredible advancements.</p><p>[00:51:46] <strong>Alex Volkov:</strong> In my head, as somebody who builds systems with this, as somebody who plays around with this, if I can shove my whole codebase in the context, I will get a better answer than if I have to embed the codebase and then try to do retrieval on specific chunks, whatever. I'm even thinking about the Cursor interface that I used yesterday.</p><p>[00:52:03] <strong>Alex Volkov:</strong> I had to mention, hey, these docs that you already vectorized, add them to the context, so GPT 4 will be able to help me solve my specific issue. If my whole repo is getting sent in each prompt, I don't know if this is the best use case of your hardware, but it's probably the fastest way to get the model to actually know exactly what I want.</p><p>[00:52:23] <strong>Alex Volkov:</strong> That's one example. Another example is all these models, all these agents, are going towards personalization. I definitely think that this year is the year of personalization, especially with longer context and models like Gemini 1.5; for example, they have full retrieval precision, almost like a 95% needle-in-a-haystack recall ability.</p><p>[00:52:42] <strong>Alex Volkov:</strong> And that, for use cases like a personal assistant that remembers everything about you, removes the possibility of, hey, I didn't chunk correctly, I didn't do RAG correctly, I did vector similarity incorrectly, etc. 
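</p><p>Alex's whole-repo-in-context idea comes down to a token estimate. A minimal sketch of that decision, using the common rough heuristic of about 4 characters per token; the file extensions, window size, and heuristic are illustrative assumptions, not anyone's actual product behavior:</p>

```python
import os

CHARS_PER_TOKEN = 4  # rough rule of thumb; varies by tokenizer and language

def estimate_repo_tokens(root, exts=(".py", ".md", ".ts")):
    """Walk a repo and roughly estimate how many tokens its source files occupy."""
    total_chars = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                try:
                    with open(os.path.join(dirpath, name), encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    pass  # skip unreadable files
    return total_chars // CHARS_PER_TOKEN

def fits_in_context(root, window=128_000, reserve=8_000):
    """Send the whole repo if it fits the window (minus room for the answer); else fall back to retrieval."""
    return estimate_repo_tokens(root) <= window - reserve
```

<p>The tradeoff in the conversation is exactly this branch: when the estimate fits, context stuffing avoids chunking and vector-similarity mistakes; when it doesn't, you're back to RAG.</p><p>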
For developers just getting up and running and building tools like this, I think long context is still yet to be discovered, because it's still expensive and it's still slow.</p><p>[00:53:02] <strong>Alex Volkov:</strong> And I think speed with a lot of context is what's going to unlock the next iteration. So that's just some feedback from the community stuff. Would love to hear what you think.</p><p>[00:53:10] <strong>Mark Heaps:</strong> Yeah. So first, I love these ideas, and I want to invite everybody who's listening to go join our Discord server, because we want this feedback. The product team is super hungry for it. We want to know what you guys want. So definitely go do that. It's Groq.link/discord. Please bring all these ideas to us.</p><p>[00:53:26] <strong>Mark Heaps:</strong> It's an interesting thing, Alex, because we've heard this from a number of customers: do you do RAG? Do you do some form of vector database? We get asked about LangChain. We get asked about all these things. And I think for us, there's this question of where is the infrastructure, that part of the stack with RAG, where is it?</p><p>[00:53:44] <strong>Mark Heaps:</strong> Where does that exist, right? So if you're operating in these two vastly separated areas, you run the risk of losing your latency just because of the network and what happens between them. So from a lot of folks, we hear no, we want the longer sequence length, because we want to embed a lot of this in the sys prompt.</p><p>[00:54:03] <strong>Mark Heaps:</strong> And we know that Groq has such fast inference that if it's embedded there, it's all living with you, and we're going to be able to maintain that speed. If you start calling out to a bunch of different RAG services, where am I going to lose? 
Now, I think that's thinking based on the experience they've had with GPUs, OpenAI, ChatGPT, etc.</p><p>[00:54:23] <strong>Mark Heaps:</strong> But for us, if we have such a margin of inference speed, we haven't seen anyone really lose on the overall experience performance because of the network topology. Jonathan was doing a demo for somebody, literally using Wi-Fi on a United Airlines flight, where we had information in a RAG and he was calling it over the plane's Wi-Fi.</p><p>[00:54:48] <strong>Mark Heaps:</strong> And it was a very normal speed experience. He was disappointed because it felt like he was using ChatGPT,</p><p>[00:54:53] <strong>Mark Heaps:</strong> for the person there,</p><p>[00:54:54] <strong>Alex Volkov:</strong> It's hard to go back after you experience immediacy. Waiting is definitely annoying. I'm waiting for that hedonic adaptation of ours to kick in, where we expect immediacy. Yeah, sorry, please go ahead. I had to chime in.</p><p>[00:55:06] <strong>Mark Heaps:</strong> No, I think you're spot on. So again, we don't want to dictate to anybody, you know, what is the best method. We want to listen to you guys and figure out how we continue to serve in that way. And the other reality is there are going to be new techniques invented, in the next couple of months probably, that give you a whole other option around rapid fine tuning.</p><p>[00:55:31] <strong>Mark Heaps:</strong> And we're just watching and listening to you guys, but we recognize we need to enable both. So we're working with some partnerships for RAG right now to be able to connect into Groq. And there's going to be some announcements actually tomorrow about some things happening at Groq that I think people will be excited about.</p><p>[00:55:47] <strong>Alex Volkov:</strong> Ooh, do you want to give us a little teaser, a little leak, or are folks going to tune in for tomorrow? 
We've gotta tune in for tomorrow.</p><p>[00:55:54] <strong>Mark Heaps:</strong> I think the only thing that I'm allowed to say is there's really going to be a very strong representation of the developer community within Groq, and the tools that we're gonna start rolling out over the next couple of weeks are really gonna feel familiar and hyper supportive of the work that y'all do.</p><p>[00:56:11] <strong>Mark Heaps:</strong> So it's gonna be really fun.</p><p>[00:56:13] <strong>Alex Volkov:</strong> Alright, so folks, stay tuned. We pinned the Discord link to the top of the space; check it out and give folks comments, because you guys have a bunch of headroom and we need to use this, but we need to tell you in which way we're gonna use it. You also have a roadmap, you have prioritization issues like every company, you have to focus on something.</p><p>[00:56:30] <strong>Alex Volkov:</strong> So the more feedback folks give you, the better. Maybe one last question, Mark, before I let you go, and then we'll continue with the regular thing, which you're more than welcome to stay and chime in on as well, because I did see your thread.</p><p>[00:56:41] The Potential of Multimodality in AI</p><p>[00:56:41] <strong>Alex Volkov:</strong> I think you're also interested in the broader AI community.</p><p>[00:56:44] <strong>Alex Volkov:</strong> It's multimodality for 2024. I think it's clear to everyone that multimodality is built in. All the major labs are now multimodal. I think multimodal AI in open source is coming as well. We have folks here who've trained multimodal models. What are we to expect from Groq on that front?</p><p>[00:57:01] <strong>Alex Volkov:</strong> Do you guys already have support for something like vision plus text? Are you looking at different things like video as well, which by definition takes more tokens and is then slower by definition everywhere else? 
How is the team thinking about this kind of next evolution of Gen AI?</p><p>[00:57:19] <strong>Mark Heaps:</strong> Yeah, good question. Obviously, multimodal is where everyone's interested. And I think ever since OpenAI gave ChatGPT the capability to generate images in the middle of the conversation and then add audio into the middle of the experience, everyone's been excited about this idea. And certainly that's where we've started.</p><p>[00:57:37] <strong>Mark Heaps:</strong> We have a plan; we call them the three pillars, right? And it's: where does Groq add this speed value? In language, in audio, and in visual. And what we're looking at right now is what are the models that we can bridge together so that we can provide that multimodal experience. The systems teams are already preparing the LPU inference engines that we're expanding on to be able to handle that.</p><p>[00:58:03] <strong>Mark Heaps:</strong> The compiler teams have actually already begun building out some of the advancements we need to be able to support that. We know where it's going, and we know that's what people are going to be asking for. So I've only shown one other thing on our YouTube channel, which was a model that [00:58:20] Adobe gave us, which was a StyleGAN, and that was 8 models that run in parallel, and I think it generates in like 0.186 of a second</p><p>[00:58:28] <strong>Mark Heaps:</strong> at 1024 pixel resolution. We can literally say, here's an image, give me 8 completely different styled results based on that diffusion model or that StyleGAN model. And that's where we've started playing with image generation. We do have some people that are looking at tiny diffusion and a few of these other rapid generators that are small.</p><p>[00:58:47] <strong>Mark Heaps:</strong> But certainly that's something that we intend to support. The problem now, with the speed of all these things happening, is what do you prioritize? 
We are a company of less than 200 people, and we're trying to figure out every day, like, where do we commit our resources?</p><p>[00:59:02] <strong>Mark Heaps:</strong> So again, it sounds like I'm trying to be like a marketing guy, and I'm not. Go to the Discord and tell us what you guys want. What are your use cases? What are you predicting with your businesses? That would really help us to be a part of the conversation.</p><p>[00:59:16] <strong>Mark Heaps:</strong> But at the high level, yeah, we already have people working on it.</p><p>[00:59:19] <strong>Alex Volkov:</strong> Awesome, and I definitely invite your folks to also join the ThursdAI community, because we talk about these advances as they happen. We've been talking about multimodality since almost a year ago now. Folks, everybody in the audience, we're going to celebrate ThursdAI's birthday, I think, in a couple of weeks, and</p><p>[00:59:36] <strong>Mark Heaps:</strong> Nice, that's cool.</p><p>[00:59:37] <strong>Alex Volkov:</strong> when GPT-4 came out, they had the infamous demo where Greg Brockman jotted down a UI sketch on a napkin and uploaded it to GPT-4 with vision, and we've been waiting for this to become a reality ever since, and I think it's now becoming a reality.</p><p>[00:59:51] <strong>Alex Volkov:</strong> We also chatted with the folks from Reka AI, which had the multimodal model out there a couple of weeks ago that I was blown away by. I was uploading videos of mine and it understood tonality in there, understood what happened in the video. We obviously see video being a big part of Gemini</p><p>[01:00:08] <strong>Alex Volkov:</strong> 1.5, we're going to talk about this soon, where people just upload video and that video takes up so much context, like 600,000 tokens. But then the model understands every little frame and can pull individual scenes out. 
And once we get to real-time video understanding, that's when the actual real-world embodiment of these bots will happen, when they can actually see what's happening and react in real time.</p><p>[01:00:29] <strong>Alex Volkov:</strong> So definitely exciting stuff from there. And Mark, I just wanted to say what an incredible week you guys had, and it's been great to just see how this explodes and play around with the possibilities. I'll remind folks in the audience, it's in the show notes in the Jumbotron, I played with Groq yesterday, and I was able to build something that I wasn't thinking was possible a few months ago, even.</p><p>[01:00:54] <strong>Alex Volkov:</strong> It's so fast. And you already mentioned the Discord. How do people get access? Is the wait list long? Tell the people in the audience about the API access.</p><p>[01:01:03] <strong>Mark Heaps:</strong> The waitlist is really long right now, and it blew up this week. Again, thanks Matt, and others, for promoting. Yeah, so right now they can go to groq.com. They'll see a link on the left that says API access. You fill out a brief form right now. We are trying to get through that list as quickly as possible.</p><p>[01:01:20] <strong>Mark Heaps:</strong> There's a timed trial, the usual sort of terms. But in a week, it wasn't even a week, it was literally within 37 hours, we had over 3,000 API access key requests. And so that was more than we had expected. And so we're trying to get through that list right now and figure out the tier levels. Some people are telling us we need a billion-token-per-day access.</p><p>[01:01:42] <strong>Mark Heaps:</strong> And we're saying, okay, this is this tier level. And other people are like, hey, we're part of Y Combinator's startup accelerator group, we're just testing our bot ideas out, can I get free access? So we're working through that list right now. The good thing is, 
We are increasing capacity every week, and one of the announcements that we'll have tomorrow and rolling into next week will be moving more towards self-serve versus us going through and manually approving everybody, so that should accelerate approvals greatly.</p><p>[01:02:10] <strong>Mark Heaps:</strong> I just ask everybody to be patient. If you've applied, stick with us. We promise we're going to get to you. We really want you to have access to this level of inference speed, but this whole virality moment came out of</p><p>[01:02:21] <strong>Nisten Tahiraj:</strong> nowhere and we,</p><p>[01:02:23] <strong>Mark Heaps:</strong> We're trying to meet the needs now.</p><p>[01:02:25] <strong>Mark Heaps:</strong> So just stick with us. It's going to keep getting faster and faster.</p><p>[01:02:28] <strong>Alex Volkov:</strong> Incredible. So folks, definitely check out GroqChat. If you haven't yet, it's quite something. It's quite incredible. Check out all the demos as well. And with that, I want to say, Mark, thank you. This is the end of our conversation. It's been an hour, folks, on ThursdAI, and I'm going to reset the space a little bit, and then we're going to talk about everything else that was new this week, and there was a bunch of stuff in open source and in different places.</p><p>[01:02:49] <strong>Alex Volkov:</strong> But what you heard so far is a deep conversation with Mark Heaps from Groq, a company which came to many of us as new but has been around for a while. And we also had some folks from Groq in the audience listening to this as well. So that was great. Thank you, Mark. And then let's reset the space and start talking about what's new in AI this week.</p><p>[01:03:07] <strong>Nisten Tahiraj:</strong> Thanks so much, guys. Really appreciate</p><p>[01:03:09] <strong>NA:</strong> you.</p><p>[01:03:31] Google releases Open Weights for Gemma 2B and 7B</p><p>[01:03:31] <strong>Alex Volkov:</strong> All right, how's it going, everyone? 
You're on ThursdAI, February 22nd. My name is Alex Volkov. I'm an AI Evangelist with Weights & Biases. Yet another incredible week in AI with a bunch of other stuff, and I want to move our conversation towards the explosive open weights news this week. So we have some more folks on stage here, and LDJ, we've talked about this when it came out, but Google gives us open weights models, this is new to us folks. We've been waiting for Google for a long time, and finally they come out, and Google releases Gemma, a new open weights model, not open source, and they've been very clear about that, for which I really applaud the team.</p><p>[01:04:12] <strong>Alex Volkov:</strong> We're going to talk about some stuff that Google did not exactly do correctly this week, but we're also going to highlight, we're going to give props where props are due. Google is clearly talking about an open weights, open access model, not open source, because they didn't open source a bunch of stuff.</p><p>[01:04:26] <strong>Alex Volkov:</strong> Definitely not datasets. It's called Gemma. They released two sizes, 2 billion and almost 8 billion, so a 7-billion-parameter model. Let's see what's interesting there. Trained on 6 trillion tokens, an 8,000-token context window, and interestingly, the vocab size is way bigger than Llama's. If you guys have been following Andrej Karpathy this week, as you should, he just released a whole video about tokenizers, and he analyzed the vocab size of Gemma's tokenizer and said it's way bigger than Llama's.</p><p>[01:04:59] <strong>Alex Volkov:</strong> It's basically the same one, a similar one, just way bigger. And yeah, this is incredible. This is great news that Google is stepping into open source. I think they see what Mark Zuckerberg saw, where once you release something like this, the community provides. 
And I want to just highlight, I had a tweet go fairly viral, because four hours after release, LDJ, we were spending the first hour in the space together that you opened.</p><p>[01:05:22] <strong>Alex Volkov:</strong> Four hours after release, we had llama.cpp support, Ollama support, we had LM Studio support. Many people, like Maxime Labonne, one of our friends of the pod, quantized and uploaded it because the original quantizations weren't correct. Then after half a day, Tri Dao from Together added support for Flash Attention. I think a bunch of other projects added support as well.</p><p>[01:05:40] <strong>Alex Volkov:</strong> And we just had the folks from Groq talk about how they've been looking at this as well. So it feels like Google understands the benefit of an open weights access model. So I just want to shout out Google. Let me actually, I have a thing for this. Yeah. Good job.</p><p>[01:05:56] <strong>Alex Volkov:</strong> The big G provides, and this is great, and I was really surprised and happy to see this in the morning, and I wanted to hear from folks here on stage what are your thoughts so far on Gemma in terms of performance compared to, let's say, Mistral or any of the fine-tunes we have.</p><p>[01:06:10] <strong>Alex Volkov:</strong> Whoever wants to go next, but LDJ, you and I had the space, so feel free to comment on what we learned from the space and since then, and then let's go around the table, and then we'll go forward with some news.</p><p>[01:06:21] <strong>LDJ:</strong> Yeah, so I think what we learned on the release, and also after a little bit of time of people using it, is that pretty much it has around the same abilities as Mistral. You could say maybe a little bit better than Mistral in certain ways. 
Some people say it's at least a little bit worse than Mistral in certain other [01:06:40] ways.</p><p>[01:06:40] <strong>LDJ:</strong> But overall, there are definitely certain use cases where you might prefer the Gemma model. It is interesting though, I believe Gemma is actually, from what I remember seeing, 8.5 billion parameters, whereas I want to say Mistral is a total of 6.7, so there is actually somewhere around 25 percent more parameters, and theoretically it should be maybe a little bit better than Mistral. But yeah, it just really shows how impressive Mistral is, the fact that Google's making this model and it's still not really significantly beating it.</p><p>[01:07:17] <strong>Alex Volkov:</strong> It's quite impressive. I saw, I think Marco from a16z, Marco Mascorro, posted comparisons of Gemma, Mistral, Llama, and I think something else. It's quite incredible that a company of less than 30 people released the 7B model less than six months ago, September I think, or October, and it still performs well against a company with billions or whatever. It's quite stunning that they're not able to beat Mistral 7B</p><p>[01:07:49] <strong>Alex Volkov:</strong> by a significant amount. I wanted to highlight how, first of all, impressive this is, that they even released something. But also, how impressive this is for Mistral, that they come out so strong, and their model is basically the one people compare to. Definitely agree with that.</p><p>[01:08:05] <strong>Nisten Tahiraj:</strong> Yeah, I used it quite a bit. In my opinion, I don't like it. It's just not that reliable. 
So yeah, it can code, but sometimes it's not a very obedient model, and the thing about Mixtral and Mistral and stuff is that they're used like tools a lot. But again, we have yet to see good fine-tunes.</p><p>[01:08:32] <strong>Nisten Tahiraj:</strong> We saw how far people took alignment with OpenChat</p><p>[01:08:39] <strong>Alex Volkov:</strong> Yeah, speaking of OpenChat</p><p>[01:08:41] <strong>NA:</strong> was like how far they've taken these. Yeah, so we'll see. I'll hold off a bit on judgment for now.</p><p>[01:08:49] <strong>Alex Volkov:</strong> Yeah, speaking of OpenChat and speaking about fine-tuning and being able to fine-tune this alignment, what are your initial thoughts? I saw Alpay post something that a new OpenChat is coming. Are you guys cooking a fine-tune? What's going on?</p><p>[01:09:03] <strong>Alignment Lab:</strong> There's probably an OpenChat fine-tune of Gemma that's going to come out. I'm not clued in to that right now. I haven't had a chance to really get my head above water for a couple of days because I've been buried in several things. If there is, it's probably going to be good. The model seems smart and it's got a lot of parameters, so it's hard to say that fine-tuning won't make it very strong.</p><p>[01:09:31] <strong>Alignment Lab:</strong> I think with that giant tokenizer, it's worth knowing that the model's going to be able to do a lot more during the training run, because it's going to see more granular patterns and have a more expressive vocabulary to exploit the way that training runs make a model perform better.</p><p>[01:09:50] <strong>Alignment Lab:</strong> This is the best way I can put it. 
It also, it's not getting mentioned very much, and I think it's because this is past the event horizon of AI stuff for a lot of people, but if you open up the model's architecture, the implementation of it on the Google GitHub repo, they actually have a few different versions, and they're all for running the model in various contexts, with or without TPUs. And all of them, even the one that's not made to be parallelized, the model actually does have a baked-in architecture designed for quantization and parallelization.</p><p>[01:10:20] <strong>Alignment Lab:</strong> And it looks like it can be quantized, or it can be parallelized, horizontally, vertically, and whatever the word is for the third dimension. It looks like it breaks pretty evenly into eight pieces, and if you can break it into eight pieces, and quantize each piece, and dequantize each piece, you can maybe parallelize it across asymmetrical compute, which is the big holdup for why we can't distribute models over just a bunch of random servers.</p><p>[01:10:48] <strong>Alignment Lab:</strong> Because usually, if they're not the exact same GPU with the exact same throughput and interconnect, the model's unable to perform inference. But they may be able to solve for that baked in there, and it might be that they intend on maybe having some service by which you can use the model locally with X amount of context and then just back onto their TPUs.</p><p>[01:11:08] <strong>Alignment Lab:</strong> I'm not sure, but it's interesting that it has a lot of custom tooling baked into it designed for quantization and parallelizing.</p><p>[01:11:15] <strong>Alex Volkov:</strong> Yeah, I want to say custom tooling, and thanks, Alignment. Also, the amount of community-supporting stuff they released is quite impressive. They released GGUF quantizations, I think. They released support. 
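The quantized releases being discussed all rest on the same basic idea: store weights as small integers plus a scale factor. A toy absmax int8 round-trip, as a conceptual sketch only (real formats like GGUF quantize block-wise with fancier schemes):

```python
# Conceptual sketch of 8-bit weight quantization -- illustrative only,
# not the actual block-wise scheme used by llama.cpp / GGUF files.

def quantize_int8(weights):
    """Map floats to int8 values in [-127, 127] using one absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 1.01, -0.97]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(f"max round-trip error: {max_err:.4f} (at most half a scale step)")
```

The round-trip error is bounded by half a quantization step, which is why 8-bit weights usually cost very little quality while cutting memory 4x versus float32.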
They even released, folks, I don't know if folks missed this, they released something called</p><p>[01:11:32] <strong>Alex Volkov:</strong> gemma.cpp, which is a local CPU inference engine written completely in C++ with no dependencies. So in addition to llama.cpp adding support for this, there is gemma.cpp, their own counterpart to llama.cpp. And that was pretty cool of them to release.</p><p>[01:11:49] <strong>Alex Volkov:</strong> And it looks like they've geared up to have this model be accepted. It's on Hugging Face. Hugging Face and Google recently announced a partnership, and now it's on Hugging Face as well. So you can actually go to huggingface.co slash Google slash Gemma. And it's pretty cool.</p><p>[01:12:04] <strong>Alex Volkov:</strong> I remember they mentioned Gemini Lite or Gemini Tiny or whatever for local inference. Very interesting, that's not what we got. We got a new model called Gemma out of the gate. Yam, what are your thoughts on this whole thing from Google? Did you have a chance to play with it?</p><p>[01:12:19] <strong>Alex Volkov:</strong> Give us a little breakdown.</p><p>[01:12:20] <strong>Yam Peleg:</strong> Actually, yeah, fine-tuning is on the way. Already got the GPUs warming up</p><p>[01:12:27] <strong>Alex Volkov:</strong> let's</p><p>[01:12:28] <strong>Yam Peleg:</strong> the data as we speak. Yeah, before fine-tuning, I'm going to do a little bit of continued pretraining just to see if we can squeeze a little bit more out of the base model.</p><p>[01:12:40] <strong>Yam Peleg:</strong> It's just important to distinguish between the base model and the instruct-tuned model.</p><p>[01:12:47] <strong>Alex Volkov:</strong> That's the slash IT thing they released, right? 
There is like a Gemma and Gemma slash</p><p>[01:12:51] <strong>Yam Peleg:</strong> When we talk about ChatGPT-like models, we talk about the instruct-tuned models. And there, yeah, for sure, Mistral is just better at the moment. But in terms of the base model, we can know this only after people start to play with it and try to tune it themselves.</p><p>[01:13:11] <strong>Yam Peleg:</strong> Then we can see how far we can push it, because maybe it's just the actual fine-tuning that Google did to their version of the model, and with the methods from the open source community, which is very well versed in fine-tuning models for instruction following, maybe this model will turn out to be really great. Because at the end of the day,</p><p>[01:13:36] <strong>Yam Peleg:</strong> the amount of compute that Google put into the model is insane, it's unparalleled. I'll be surprised if the model doesn't turn out to be really good, the base model, after fine-tuning. But yeah, there is absolutely no doubt that Mistral is hiding something, they do have a moat. All their models that they fine-tune for instruction following are on a different level,</p><p>[01:14:03] <strong>Yam Peleg:</strong> you can say. And you can see this even with miqu, the one that shouldn't have been leaked. It is also really good.</p><p>[01:14:13] <strong>Yam Peleg:</strong> But yeah, it's amazing. It's amazing that there is another player, a major player, Google, releasing a really good base model, open source.</p><p>[01:14:24] <strong>Yam Peleg:</strong> It's great. It's great to have more players in this field, more corporations joining this game, supporting open source. It's always great. 
Yeah.</p><p>[01:14:33] <strong>Nisten Tahiraj:</strong> And the funny part is that they are struggling to compete in this section just because, the beauty of open source is that it enables so much competition, especially at these lower sizes where people can iterate very quickly.</p><p>[01:14:48] <strong>Nisten Tahiraj:</strong> And now this is extremely obvious in this case. But yeah, I also think that the base model, I only tried the instruction-tuned ones, and I've posted it above, I even have a link if you want to try it. But [01:15:00] there is a lot more to be squeezed out of that, just because again of the quality of the data that went into the pretraining, and Google might just not be that good at making chatbots.</p><p>[01:15:13] <strong>Nisten Tahiraj:</strong> Yeah, they'll probably get better, but it's</p><p>[01:15:16] <strong>Alex Volkov:</strong> Nisten, is it mergeable? It's mergeable, right? Like it's Frankensteinable.</p><p>[01:15:21] <strong>Nisten Tahiraj:</strong> Yeah, I think you can, I'll</p><p>[01:15:24] <strong>Yam Peleg:</strong> do it for fun. You can merge it with itself, but we don't have models to merge it with at the moment,</p><p>[01:15:32] <strong>NA:</strong> because you can't talk about it here, yeah. You can merge the instruction tune with the non-instruction tune, with itself, and train on top.</p><p>[01:15:39] <strong>Yam Peleg:</strong> I tried to extend it with a Frankenmerge and it didn't behave nicely. Mistral, for example, behaved really well. You can stretch it three times, just copy the layers three times, and it works really well. At the fourth time, it starts to disintegrate and just breaks. But you can do it for 3x and it works really well. This model didn't, so it was a little bit strange to see.</p><p>[01:16:08] <strong>Yam Peleg:</strong> But yeah, I'll know in a couple of hours when my training starts, so I'll be smarter then and able to tell you. 
If anyone saw my experiment, I tried to play a little bit with reinforcement learning, with DPO. I stopped the experiment mid-run because someone pointed out that the terms forbid this type of experiment, but I just want to say that I tried to make the model refuse less. It was refusing nearly anything that you asked it, so I just tried to make it more amenable to actually doing what you ask, nothing really fishy. But yeah, the terms forbid that, so I stopped the experiment.</p><p>[01:16:51] <strong>Yam Peleg:</strong> I just wanted to say that it really resisted. I trained and trained, and the model still resisted. They really went hard on the alignment part of this model.</p><p>[01:17:02] <strong>Alex Volkov:</strong> Interesting. We're going to talk about this next, I think, from Google, but it's interesting that even in their open weights, open access models, they're baking in the alignment super strong. Anything else, folks, on Gemma before we move on? Generally, kudos to Google for coming out this strong.</p><p>[01:17:21] <strong>Alex Volkov:</strong> Gemini Ultra got announced, then we saw Gemini Ultra access, then Gemini Pro 1.5, which we covered a little bit, and we probably should talk about it a little bit more, and now we're getting open weights models that are fine-tunable, and I think even commercially licensed, right?</p><p>[01:17:35] <strong>Alex Volkov:</strong> You could use this in production, if I'm not mistaken.</p><p>[01:17:42] <strong>Alex Volkov:</strong> I guess I'm not</p><p>[01:17:42] <strong>Alignment Lab:</strong> Yeah, I think so. I think so.</p><p>[01:17:45] <strong>Alex Volkov:</strong> Yeah, which is quite impressive. It took Meta a while to give us a commercial license. Microsoft released Phi without commercial licensing. 
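Yam's aborted anti-refusal run used DPO, which optimizes a simple preference loss over pairs of answers rather than full RLHF with a reward model. A minimal sketch of the standard per-pair DPO objective; the log-probabilities below are made-up illustrative values, where "chosen" would be the non-refusing answer:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference
    margin)). Shrinks as the policy prefers the chosen answer more
    strongly than the frozen reference model does."""
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already favors the helpful (chosen) answer a bit more than the
# reference does, so the loss is moderately low.
loss = dpo_loss(-5.0, -9.0, -6.0, -8.0)
print(f"DPO loss: {loss:.4f}")
```

Training pushes this loss down, which in Yam's setup would directly push probability mass from refusals toward compliant answers; his observation was that Gemma's instruction tuning resisted that pressure unusually hard.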
And then after six months gave in to the pressure. And Google waited, and now they're like, ta-da, here's this.</p><p>[01:17:58] <strong>Alex Volkov:</strong> So very impressive from Google, and kudos to whoever there worked on this release. It's probably not easy to do, not open source, but open weights, it's not easy to do that stuff from within a big organization like this. So whoever listens to this, whoever worked on this: thank you. Give us more.</p><p>[01:18:14] <strong>Alex Volkov:</strong> We would like to see bigger models, 35, etc. Junyang, you wanted to comment as well? I saw you step in here.</p><p>[01:18:20] <strong>Alex Volkov:</strong> Yeah,</p><p>[01:18:21] <strong>Junyang Lin:</strong> I am definitely very excited about Google's open release of the Gemma model, because yeah, it's actually a great model. Yesterday, we were just trying to compare Qwen 1.5 with Gemma 7B, and we found Gemma 7B is actually better when we try the base model.</p><p>[01:18:40] <strong>Junyang Lin:</strong> We think the base model should be a good model, but the instruction-tuned model is a bit strange. Actually, its behavior is quite strange. It's always refusing, and it's too safe, and there are a lot of answers it won't give. So I'm very surprised by how they did their chat model. But generally, the base model is good.</p><p>[01:19:04] <strong>Junyang Lin:</strong> But I'm very interested in their choices of architecture, because its size is actually not 8 billion. It's actually 9 billion, because the input embedding and the output embedding layers are not shared parameters. So you find that the size is actually very large.</p><p>[01:19:23] <strong>Junyang Lin:</strong> And for 2B, it is similar. It is essentially three billion parameters if you count it correctly. So it's actually a very large model. 
And it is quite strange that for the 2B model, it is using MQA, multi-query attention, but for the 7B model, it is actually using multi-head attention.</p><p>[01:19:43] <strong>Junyang Lin:</strong> I don't know why they chose that. And if you carefully look at the hidden size as well as the head dimension for the attention, you'll find that for the attention layer the head dimension is 256 with 16 heads, which means that the hidden dimension for the attention is actually 4096, but the hidden dimension for the FFN is 3072.</p><p>[01:20:11] <strong>Junyang Lin:</strong> It is very strange to me to choose something like this. I don't know if we should follow it for our following models. I don't know why Google did this. If they could tell us about it, that would be much better. But something that is very interesting, and we also have experiments showing it is quite effective, is the large intermediate size.</p><p>[01:20:34] <strong>Junyang Lin:</strong> You will find that the intermediate size, in comparison with Llama models or Mistral models, is actually larger. We have some experiments finding that a larger intermediate size can improve the performance. But there are still a lot of things we don't know why Google did, and we're not sure Gemma is really a good model, much better than Mistral, because I have seen some evaluations from Anton, and it seems that Mistral is still the better one.</p><p>[01:21:05] <strong>Junyang Lin:</strong> I'm not sure it's actually much better than Mistral, so let's wait for more tests.</p><p>[01:21:11] <strong>Alex Volkov:</strong> We'll wait for more tests. Thank you, Junyang. For folks who are not familiar with Junyang, he's on the technical team at Qwen, and we've talked multiple times at this point. Thank you, Junyang, and it's great to have you here. 
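Junyang's figures are easy to sanity-check with back-of-the-envelope arithmetic using Gemma 7B's published config values (256,000-token vocabulary, hidden size 3072, 16 attention heads of dimension 256): the huge vocabulary makes the embedding table alone roughly 0.79B parameters, and the total attention width exceeds the model's hidden size.

```python
# Back-of-the-envelope check of the Gemma 7B config numbers discussed
# above (values as published in the model config).
vocab_size = 256_000
hidden_size = 3_072
num_heads = 16
head_dim = 256

embedding_params = vocab_size * hidden_size  # one embedding table
attention_width = num_heads * head_dim       # concatenated head width

print(f"embedding table: {embedding_params / 1e9:.2f}B parameters")
print(f"attention width {attention_width} vs hidden size {hidden_size}")
```

This is why the nominal "7B" grows toward 8.5B-9B once embeddings are counted, and why the 16 x 256 head layout is unusual: the attention projections are wider (4096) than the residual stream they project from (3072).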
And definitely we'll see more fine-tuning. The base model seems to be fine-tunable, Yam said he's already cooking something, and probably other labs are already rubbing their hands in anticipation of how to use the open source stuff, the DPO stuff,</p><p>[01:21:33] <strong>Alex Volkov:</strong> to see if it works to actually make this model behave better with instruction fine-tuning than Google did. And I'm sure that it's possible, because we've seen a lot of advancements in the open source community. And now it looks like Google is catching up to the open source community and not the other way around, which is incredible.</p><p>[01:21:47] <strong>Alex Volkov:</strong> And I will move on from this, because folks have been here for an hour and a half, and there's a bunch of other stuff to also talk about. Specifically because Google is in our good graces from one perspective, but from another perspective, since they released Gemini, and Gemini can generate images, they have shown us why they've potentially been hesitant to release anything at all.</p><p>[01:22:11] <strong>Alex Volkov:</strong> And I think OpenAI and DALL-E have this to some extent as well. But if you missed the storm and conversation this week, you'll definitely hear about this, because Gemini, both Pro and Ultra I think, on the interface, not the API models, are able to generate images. I think it was with Imagen or some other model from Google, like DALL-E in ChatGPT, right?</p><p>[01:22:31] <strong>Alex Volkov:</strong> And folks quickly found out that those models do not like the word white. And literally, I think I had to tweet about this, I'll pin it, and I'll add it to the show notes as well. I went and tested something like, hey, generate a glamour shot of two Jewish couples, two Indian couples, two African couples, and that was fine.</p><p>[01:22:50] <strong>Alex Volkov:</strong> And then I asked it to generate a glamour shot of two white people. 
And then it said, no, I cannot do generation based on race or gender, or something like this, even though it had just done exactly that five times. And then many folks tested this with historical figures, when they asked, hey, generate an image of, whatever, the United States founding fathers, or some Nazis, or whatever it is.</p><p>[01:23:11] <strong>Alex Volkov:</strong> And there was a significant injection into the prompting, where it created stuff that is not remotely historically [01:23:20] accurate. And when I tested my stuff, it was in response to the historically accurate stuff. And still, it seems like there's a problem with how these models are replying to us.</p><p>[01:23:29] <strong>Alex Volkov:</strong> And a lot of folks at Google probably made it hard for these models to actually give me the image that I asked for. It refuses so much, though, and the conversation went so hard into, hey, Google, what did you give us, why is this thing so refusing, that Google took down the ability to generate people. So right now, and I think it's been like this for the past 24 hours or so, if you go and try to generate an image of an elephant, you'll get it.</p><p>[01:23:54] <strong>Alex Volkov:</strong> But if you try to generate an image of an elephant with, I don't know, two white folks holding its trunk or whatever, it will refuse. And they completely nerfed the ability to generate people altogether, quote unquote, while they solve for this, which is quite remarkable to think about for a big company like this, that has already been in hot water before.</p><p>[01:24:17] <strong>Alex Volkov:</strong> And obviously this is Google, everybody's gonna dunk and go on Twitter and say bad things, because punching up is easy. And also, this gets you internet points if you're the first person that says, hey, Google is reverse racist. 
But Google has been in this hot water before with some image identification.</p><p>[01:24:34] <strong>Alex Volkov:</strong> I think there was a famous incident almost a decade ago, if you guys remember, with an image model that was identifying black people and labeling them gorillas or something. So Google has been burned on the other side of this before, and now it looks like the pendulum swung way back to the other side, enough so that in the first week or so of the release,</p><p>[01:24:53] <strong>Alex Volkov:</strong> they are now taking back the ability to generate people completely. And it's quite incredible how much of an intervention into multiculturalism, let's say, they have in the prompt layer. So it does look like the model can generate this stuff. I saw one hacky attempt. Somebody said, hey, generate a glamour shot of a couple with fair skin.</p><p>[01:25:14] <strong>Alex Volkov:</strong> And then most of them are white, but if you actually say white couple, it's not able to, which is quite interesting. And I think it adds to the point Yam made, that even the open weights model that they've released has some alignment built in strongly through the fine-tuning.</p><p>[01:25:30] <strong>Alex Volkov:</strong> So probably it's a feature of some of the datasets, but also some of the alignment stuff. It's really interesting to see that the internet showed Google that going all the way to the other side is also not great. And so at least some of the teams at Google are struggling right now to figure out what's the right balance there.</p><p>[01:25:49] <strong>Alex Volkov:</strong> Separately from... Yeah, go ahead.</p><p>[01:25:51] <strong>Nisten Tahiraj:</strong> Sorry</p><p>[01:25:52] <strong>Nisten Tahiraj:</strong> I really want to highlight this, because it's gotten to the point where the open source models and even GPT-3.5 will do some tasks fine. 
And in this case, a task that I tested with is the Universal Declaration of Human Rights, which is the most translated document</p><p>[01:26:10] <strong>Nisten Tahiraj:</strong> in human history, and it's part of every dataset.</p><p>[01:26:13] <strong>Nisten Tahiraj:</strong> And now you have Gemini, and you have Copilot, which is GPT-4, saying that it's too unsafe to translate, to give you a translation of the Declaration of Human Rights, which is, this has just gotten completely ridiculous. You can use a model that's made anywhere else, any open source model, and it will tell you that, whereas now we have all the safety people and all the people that they hired, and it's gotten to the point that it's completely backfired, and this is ridiculous.</p><p>[01:26:54] <strong>Nisten Tahiraj:</strong> They should be held</p><p>[01:26:56] <strong>Alex Volkov:</strong> Yeah, into unusefulness. Some things in history happened, and we would like to be able to ask about those things. And yeah, I definitely want to hear how this gets solved. I will say there were some folks mentioning that, hey, if you ask the same exact thing from OpenAI's DALL-E, it may give you some similar answers.</p><p>[01:27:14] <strong>Alex Volkov:</strong> So why is Google getting attacked? First of all, they just released it. Second of all, this is Google after all. They're still the big 600 pound gorilla in the room, as I think Microsoft called them. And thirdly, we have short memory. We play with the toys, we play with the tools as we get them.</p><p>[01:27:30] <strong>Alex Volkov:</strong> And then when we discover something, we go viral.
Back to the good side of Google also: we had breaking news last Thursday, just as ThursdAI started, which was crazy, and we talked about Gemini getting a million tokens. Google released an update that said, hey, some developers can now get access to up to a whopping 1 million tokens in the context window for Gemini 1.5,</p><p>[01:27:53] <strong>Alex Volkov:</strong> and technically, in research, they have up to 10 million tokens of context window support, which is incredible. And I just want to come back and say that after this week, we've seen many folks, including Matt Shumer, who's here on stage, including a bunch of other folks, getting access to this 1 million tokens.</p><p>[01:28:08] <strong>Alex Volkov:</strong> I didn't get access yet. So wink at Google, if somebody hears me, please give me access. And folks are trying books, like three full Harry Potter books, on it and getting incredible stuff. Many folks are using it for video, which is also quite remarkable. Uploading an hour of video and getting retrieval from within Gemini 1.5's</p><p>[01:28:29] <strong>Alex Volkov:</strong> 1 million token context window. I wanted to follow up and say, you know, the safety folks at Google need to take a little break, but the tech folks at Google, holy crap. The 1 million context window was severely underhyped after Sora released from OpenAI, like two hours after, when we also had breaking news, and Sora is still blowing minds, and we're going to talk about Sora just briefly, but the 1 million context window gets more folks playing with it, and it's incredible for code generation.</p><p>[01:28:59] <strong>Alex Volkov:</strong> People threw the whole code base of Three.js in there. People threw just, like, whole code bases in one prompt.
And we were talking about this a little bit with the Groq guys as well, where this unlocks significant new possibilities that weren't imagined before, and we don't have time for this debate today.</p><p>[01:29:20] <strong>Alex Volkov:</strong> And maybe we'll have to close the space a little early, and I'll tell you why in a second. But I just wanted to highlight some stuff that Google did. Google is this huge company, full of multiple teams. The safety stuff, meh, we're gonna rally against this, we're gonna tell them that they're wrong, and hopefully we'll get less restricted models.</p><p>[01:29:39] <strong>Alex Volkov:</strong> But the context stuff, oh my god, this is incredible. It definitely set the new bar for how models should behave and what the possible things are. 10 hours of audio, you can send 10 hours of audio in one context, and it will be able to tell you exactly when somebody said what, and summarize everything with, like, perfect recall.</p><p>[01:29:58] <strong>Alex Volkov:</strong> We had Greg Kamradt, who we've talked about on the pod as well. He did this needle in a haystack analysis on a bunch of context windows, if you remember, on Claude, etc. And they used his needle in a haystack analysis to show that the model also has very high recall precision, like almost perfect recall precision throughout this whole context, throughout the whole, like, 600,000 tokens or so.</p><p>[01:30:21] <strong>Alex Volkov:</strong> And we had folks test this this week. Quite an incredible advancement there. And Anthropic, who did Claude with 100,000 tokens for a long time, this was their moat, then there was 200,000 tokens, and it seems it's paling in comparison.
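For anyone curious what the needle-in-a-haystack evaluation mentioned above actually does mechanically, here is a minimal sketch. This is an illustrative harness, not Greg Kamradt's actual code: the needle text, the helper names, and the stand-in "model" are all assumptions for the example, and in a real run you would swap in a call to the long-context model you are testing.

```python
# Illustrative needle-in-a-haystack harness, in the spirit of the analysis
# discussed above. NEEDLE, build_haystack, and the fake model are made up
# for this sketch; plug in a real long-context model call instead.

NEEDLE = "The secret passphrase is thursdai-rocks."
FILLER = "The quick brown fox jumps over the lazy dog."

def build_haystack(num_filler_sentences: int, depth: float) -> str:
    """Bury the needle at `depth` (0.0 = start, 1.0 = end) of the filler text."""
    sentences = [FILLER] * num_filler_sentences
    insert_at = int(num_filler_sentences * depth)
    return " ".join(sentences[:insert_at] + [NEEDLE] + sentences[insert_at:])

def recall_score(ask_model, num_filler_sentences: int, depths) -> float:
    """Fraction of insertion depths at which the model's answer recovers the needle."""
    question = "What is the secret passphrase?"
    hits = 0
    for depth in depths:
        haystack = build_haystack(num_filler_sentences, depth)
        answer = ask_model(haystack + "\n\n" + question)
        hits += "thursdai-rocks" in answer
    return hits / len(depths)

# Stand-in "model" with perfect recall, so the harness runs offline:
perfect_model = lambda prompt: "thursdai-rocks" if "thursdai-rocks" in prompt else "no idea"
score = recall_score(perfect_model, 1000, [0.0, 0.25, 0.5, 0.75, 1.0])
```

A real evaluation sweeps both the haystack length and the needle depth and plots recall as a heatmap, which is where the "near perfect recall throughout the whole context" claim comes from.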
I did my comparisons from last year, if you guys remember: during May, Mosaic released the jump to 70,000 tokens or so, and back then that looked incredible, they put an actual book in there. And comparing that, in less than a year we've gotten more than a 10x jump in what we consider normal or possible context windows, because less than a year ago, the big jump was to 60,000 or 70,000.</p><p>[01:31:03] <strong>Alex Volkov:</strong> And now we're jumping to a million. And it's actually possible to use a million. Incredibly important for multimodality as well, because videos take just so much context. I think one hour of video, this Buster Keaton video, I think, is the one they've used in the example, takes around 600,000 tokens.</p><p>[01:31:20] <strong>Alex Volkov:</strong> Just think about this. One hour of video takes around 600,000 tokens, and it's able to tell you with exact precision where something happened in this video, what happened, who spoke about what. Very incredible. Definitely underhyped. I think collectively on X we're able to talk about one important thing at a time, and Sora [01:31:40] definitely took that one important thing, but coming back, Gemini 1.5</p><p>[01:31:43] <strong>Alex Volkov:</strong> with a huge context is very impressive from Google as well. Anybody here on stage got access to 1.5 and actually played with this? I haven't yet, I'm just recapping from the feed. Nope, everybody's sad. Google, if you hear us, give us access. Nisten?</p><p>[01:31:59] <strong>Alignment Lab:</strong> I will bite my finger off like a graham cracker</p><p>[01:32:02] <strong>Alignment Lab:</strong> to get access to that model.</p><p>[01:32:03] <strong>Alex Volkov:</strong> Yes. Exactly. All right, so moving on.
Yeah, Nisten, go ahead and then we'll move on.</p><p>[01:32:08] <strong>Nisten Tahiraj:</strong> No, I just wanted to mention some other news: Roboflow and Skalski just released the YOLOv9 model. I made some demos with it, with the sailboats and the boxing and stuff. And this is pretty nuts. It's like the next gen stuff.</p><p>[01:32:24] <strong>Nisten Tahiraj:</strong> But they've also released a paper, I think, for some research, which I haven't read yet, and I'm incredibly excited. But yeah, this is not as much LLM related, but it is open source vision AI stuff, and I really recommend people look at it, because it's straight up from the future. I tried it, and you can all see the results on video, the stuff you can do.</p><p>[01:32:51] <strong>Nisten Tahiraj:</strong> It's pretty cool.</p><p>[01:32:53] <strong>Alex Volkov:</strong> Could you add this to the space, and we'll add it to the show notes as well. I will just highlight that Peter Skalski, SkalskiP, is a friend of the pod, a dear co-host, and Roboflow are doing incredible vision stuff, definitely worth a shoutout every time they release something new, and some of his tutorials on Twitter are amazing.</p><p>[01:33:09] <strong>Alex Volkov:</strong> If you're into vision understanding, Peter is the guy to follow, and a shoutout for the stuff that they're building there. I think we're gonna move on from the big companies and LLMs, we've talked about pretty much everything. Open source.
The last thing that I want to mention is that Nous Research released Nous Hermes with DPO.</p><p>[01:33:27] <strong>Alex Volkov:</strong> And basically it's the same model, just trained with a DPO dataset, and that beats the previous Nous Research OpenHermes 2.5, I think, pretty much in every benchmark. And it's been great to see DPO putting itself in the right position of improving models.</p><p>[01:33:44] <strong>Alex Volkov:</strong> I think we've seen this from the Argilla folks, who cleaned datasets and actually retrained Hermes models. And now we're getting a DPO version from the Nous folks themselves, which is great to see. And Yam, I think you had some comments about how to actually do this DPO thing in comments to Teknium.</p><p>[01:34:00] <strong>Alex Volkov:</strong> So more of that goodness is coming, and open source does not wait, and I can't wait to see all these techniques also applied to the different Gemma stuff that we got, and different other, let's say, rumored, wink, from Meta stuff that at some point is gonna come, and we're gonna get, hopefully, the number three, which, if they release today, I'm not gonna be mad, honestly, Mark, if you're listening.</p><p>[01:34:23] <strong>Nisten Tahiraj:</strong> Yeah, let's close it early, otherwise we'll be here until tomorrow.</p><p>[01:34:27] <strong>Alex Volkov:</strong> that's true.</p><p>[01:34:28] <strong>Alex Volkov:</strong> We're going to close it early because of this next thing that I want to talk about, and I actually want to cover this a little bit. So I'm going to put some music and then we're going to talk about this. Oh my God, I got lost in my music stuff. And we're going to talk about this week's buzz. I see that folks are enjoying me mistakenly hitting different musical buttons. Folks, welcome to this week's buzz.
This is a corner here, a section here, where I talk about everything that I've learned working for Weights & Biases. Some of this is technical, some of this is just the stuff that we release, like courses.</p><p>[01:35:00] <strong>Alex Volkov:</strong> And we released a course with Hamel Husain about enterprise model management. So if you're into this, that course is great. It's going so well, so many people are registering. I haven't actually had time to watch it. I should probably watch it soon, maybe tomorrow, because I'm preparing ThursdAI and working on demos with Groq and everything.</p><p>[01:35:17] <strong>Alex Volkov:</strong> But I definitely wanted to chat about the reason I was in San Francisco this last weekend. So as we were finishing up ThursdAI last week, I think I said Swyx was here. I was recording it live from San Francisco. And that day, on Thursday, we had a meetup that I helped co-host, and I wasn't the only one there.</p><p>[01:35:36] <strong>Alex Volkov:</strong> A16Z, Andreessen Horowitz, the biggest VC firm in the world. If you don't follow Marc Andreessen on X, you definitely should. He's a big proponent of open source, he's been talking about all these very interesting things. Shout out Marc Andreessen. He wasn't there, I definitely expect to see him next time.</p><p>[01:35:52] <strong>Alex Volkov:</strong> But folks, Rajko and Marco Mascorro from A16Z, the guys who give out grants to open source. And you know that many of our friends of the pod are grant receivers from A16Z. TheBloke received the grant, Nous Research are grant receivers. I think Axolotl, Wing from Axolotl, is also a grant receiver.</p><p>[01:36:09] <strong>Alex Volkov:</strong> Like, a bunch of folks are getting supported by A16Z. And they had a meetup for open source AI, and I was very proud to be invited and to be a co-host, and gave out a bunch of Weights & Biases swag. And just in terms of names who went, it was mind blowing.
We had Nous Research folks, so Teknium was there, and Emozilla was there, Karan was there, like all the Nous folks definitely helped organize.</p><p>[01:36:33] <strong>Alex Volkov:</strong> Ollama folks were there, and announced that they're now supporting Windows. LlamaIndex, we met with Jerry. LMSys folks, who I really wanted to meet and talk to, and maybe bring on ThursdAI, but I didn't get a chance to, so if anybody knows the LMSys folks, please shoot me a DM with them as well.</p><p>[01:36:50] <strong>Alex Volkov:</strong> Replicate, who are doing great stuff, Perplexity, Mistral, Devendra, I think, from Mistral was there as well. And there were also a bunch of friends of the pod who also receive grants. If you guys remember, we had a deep dive with Jon Durbin, of Bagel model fame, and he just recently started releasing a bunch of other stuff.</p><p>[01:37:06] <strong>Alex Volkov:</strong> Eric Hartford, who released, I think, Laser, and now works at Abacus. Haotian Liu from LLaVA. Just a bunch of great folks in the open source community got together in San Francisco and talked to each other about techniques, about how important open source is, and they had a panel with folks from Mozilla and the Linux Foundation, and Percy from Together AI as well.</p><p>[01:37:27] <strong>Alex Volkov:</strong> That panel talked about the importance of open source, and what open source actually is. How do we treat open source in AI? Is weights-only enough? Or is something like OLMo, which we've talked about from the Allen Institute for AI, where they released the training code and datasets and Weights & Biases logs and all these things, full open source?</p><p>[01:37:46] <strong>Alex Volkov:</strong> And so there was a great discussion about what open source actually means in this fully new AI world. Incredible to meet all these folks. Just a shout out to Rajko and Marco for organizing this and inviting us.
And I promised a report, and this is the report. I will definitely add to the show notes the summary that Rajko did, because they also did a report on open source stuff.</p><p>[01:38:07] <strong>Alex Volkov:</strong> It's worth looking into, how many folks download TheBloke's models. So many folks download them, maybe you saw this, LDJ, as well. Then when TheBloke, like, disappeared for three days or something, people were like, is he okay? There are no new GGUFs on Hugging Face.</p><p>[01:38:24] <strong>Alex Volkov:</strong> What happened? Is he all right? So many people got used to this. TheBloke is also a receiver of the A16Z grant. And so that's what I learned at Weights & Biases this week. I also visited the office. Those of you who follow me have probably seen my ridiculous video where I showed around the office, showing the Weights & Biases dashboards in virtual space.</p><p>[01:38:44] <strong>Alex Volkov:</strong> And I really had a great time there. We also met with Swyx and some of his folks at the Swyx house, so shout out Swyx and Alessio from the Latent Space pod for, first of all, hosting me, and second of all, being great friends of the pod. Honestly, ThursdAI would not exist as a podcast and newsletter without Swyx and Alessio.</p><p>[01:39:03] <strong>Alex Volkov:</strong> And also, they're coming up on their one year anniversary for Latent Space. So if I can send them love and subscribers, please go check out Latent Space as well. Happy birthday, folks. And I think we're going to move on to two new things, and then we're just going to do a recap in the AI art and diffusion area.</p><p>[01:39:20] <strong>Alex Volkov:</strong> And I think for this, I do have a transition. Let's see. No, I have a transition for this. Yes.</p><p>[01:39:47] <strong>Alex Volkov:</strong> and Alignment Lab just dropped, and I wanted to hear what he was actually saying, but he had issues with the space even before.
But we did have a transition, and folks, this week is big. This week is big. You guys know that we only talk about [01:40:00] AI when it's huge, and this week was huge. Starting off this week, ByteDance released SDXL Lightning, which takes SDXL, which we've talked about, one of the best open source diffusion models, and makes it incredible in just one step.</p><p>[01:40:15] <strong>Alex Volkov:</strong> So if you ever use Stable Diffusion, if you ever run it yourself, the sweet spot is somewhere between 35 and 50 steps, depending on which, I forgot what it's called, the sampler, the scheduler, which of those you use there, between 35 and 50 steps.</p><p>[01:40:33] <strong>Alex Volkov:</strong> We obviously had some advancements before, we've seen SDXL Turbo, and SDXL Lightning generates incredible images in just one or two steps. It's unbelievable how fast this is. And of course, our friends of the pod from Fal.ai are putting this in production, and you can play with their demo.</p><p>[01:40:52] <strong>Alex Volkov:</strong> The demo is called, I'm going to put this in the show notes, FastSDXL.ai. And the demo is near real time. You type, and you generate images. You type, and they generate images. And it's not the LCM stuff that we've talked about, if you guys remember, the latent consistency models, that's something else. This is a full SDXL generation running in two or four steps,</p><p>[01:41:12] <strong>Alex Volkov:</strong> and it looks incredible, like 1024 resolution text-to-image generation. ByteDance optimized the crap out of SDXL, and it's really mind blowing. I really suggest you go and play with it at FastSDXL.ai. I played with this yesterday, and it's added to the show notes as well.</p><p>[01:41:34] <strong>Alex Volkov:</strong> What I wanted to do with this is, I wanted to see what's possible when we have an LLM that's near instant.
So we've had the chat today with the Groq folks, and if you're just joining us now, you can hear the chat after I publish the episode. Their LLM runs at like 500 tokens a second, so basically answers appear in near instant time, but also SDXL Lightning, SDXL diffusion, appears in near instant time.</p><p>[01:41:56] <strong>Alex Volkov:</strong> And I played with a demo of this, and I'm gonna add the video to the show notes as well. I was just blown away by how responsive things feel. And so the demo that I built was using Neal Agarwal's game, I think it's called Infinite Craft, where you just drag concepts on top of each other, and he uses AI to generate what those two concepts mean, basically.</p><p>[01:42:17] <strong>Alex Volkov:</strong> Neal, in this Infinite Craft thing, he used emojis. So if you combine Earth and, I don't know, fire or something, you get volcano, so he has the emoji of a volcano, right? So he has an AI that picks out the best emoji for this one thing. And I said, hey, emoji is fun, but what if we generate a full-on SDXL image on every turn of this game?</p><p>[01:42:37] <strong>Alex Volkov:</strong> And I did this with Groq. I used Mixtral behind the scenes to be the prompt engineer, to take these concepts and actually write a nice prompt for SDXL. And with two steps or four steps, overall, from dragging this to getting Mixtral to be my prompt engineer, and my initial system message is around a thousand tokens, right?</p><p>[01:42:57] <strong>Alex Volkov:</strong> So I'm sending a thousand tokens or so. Probably maybe less than a thousand, maybe five hundred. And I get an instant answer from Groq, because their speed is ridiculous. I then send this to Fal, to their API, to do SDXL Lightning.
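The two-stage demo just described, a fast LLM acting as prompt engineer feeding a few-step image model, can be sketched roughly like this. To be clear, the function names, the stub backends, and the wiring below are assumptions for illustration, not the actual Groq or fal.ai SDK calls; a real version would swap in those SDKs with your own API keys.

```python
# Hedged sketch of the demo described above: stage 1 is a "prompt engineer"
# LLM, stage 2 is a fast few-step image model. call_llm / call_image are
# injected hooks (assumptions, not real SDK signatures), so the pipeline
# logic runs and is testable offline.

SYSTEM_MESSAGE = (
    "You are a prompt engineer for a text-to-image model. Given two game "
    "concepts, reply with a single vivid SDXL prompt depicting their "
    "combination. Reply with the prompt only."
)

def make_llm_messages(concept_a: str, concept_b: str) -> list[dict]:
    """Build the chat payload for the prompt-engineer LLM (stage 1)."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": f"Combine: {concept_a} + {concept_b}"},
    ]

def combine_concepts(concept_a, concept_b, call_llm, call_image):
    """Run both stages: LLM writes the image prompt, image model renders it."""
    image_prompt = call_llm(make_llm_messages(concept_a, concept_b))
    # SDXL Lightning is distilled for very few denoising steps (1-4).
    return call_image(image_prompt, num_inference_steps=2)

# Offline demo with stub backends standing in for the LLM and image APIs:
fake_llm = lambda messages: f"A volcano rising from the sea, from: {messages[-1]['content']}"
fake_image = lambda prompt, num_inference_steps: {"prompt": prompt, "steps": num_inference_steps}
result = combine_concepts("earth", "fire", fake_llm, fake_image)
```

Because both stages are a single short request each, the end-to-end latency is dominated by the two model calls, which is why pairing a ~500 tokens/second LLM with a 2-step diffusion model can land under a few hundred milliseconds.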
And I get an image super fast, like, it's also ridiculous. I think overall, for some incredible examples, I got less than 300 milliseconds of response from going to an LLM, generating a prompt for me, taking this prompt, sending it to an image generation thing, and getting an image back.</p><p>[01:43:24] <strong>Alex Volkov:</strong> Under 300 milliseconds. I will remind you that folks from Google, a long time ago, did a research study showing that everything under 250 milliseconds is almost real time for humans, imperceptible in clicks and reactions. And now we're getting multiple models in kind of a pipeline together, reacting in under 300 milliseconds.</p><p>[01:43:43] <strong>Alex Volkov:</strong> And it's incredible. And honestly, I cannot release this demo, because I didn't build the UI, so I cannot give you the UI. However, I can probably send you the extension code if you want, if you have your own API keys for Groq. I was blown away by how easy and fast this was.</p><p>[01:43:59] <strong>Alex Volkov:</strong> And just, two of these speed advancements in the same week. So SDXL Lightning, two steps for incredible image generation, and then Groq as well. So this is an answer to the folks who are saying, why do we even need this speed? I saw somebody say, hey, why do you even need 400 tokens a second?</p><p>[01:44:17] <strong>Alex Volkov:</strong> People cannot read fast enough. And this is the answer, because interfaces can happen in near real time. And it's incredible. And the second big thing in AI art and diffusion happened as breaking news. So we're gonna, we're gonna do this.</p><p>[01:44:41] <strong>Alex Volkov:</strong> Folks, we have breaking news. And LDJ, you've been telling us about this today, or I guess for a while now. Emad Mostaque from Stability AI announces Stable Diffusion 3.
Stable Diffusion 3 uses a new architecture that we've talked about, first with Tanishq and folks from the HDiT, Hourglass Diffusion Transformers, paper, and also from Sora: a Diffusion Transformer architecture, where they take both worlds from this last gen of gen AI and combine them together.</p><p>[01:45:11] <strong>Alex Volkov:</strong> So Stable Diffusion 3 is going to be a Diffusion Transformer, and it's impressive. So far we only got a waitlist, so unlike previously, where Stable Diffusion just dropped, now there's a waitlist you have to sign up for, but shout out to the folks at Stability, because it looks incredible. Some examples you can check out in the newsletter that I'm gonna send; some examples are under the hashtag SD3</p><p>[01:45:36] <strong>Alex Volkov:</strong> on X. It has very impressive multi-subject prompt following, so I can show you an example of this later in the show notes, but take a prompt like: painting of an astronaut riding a pig wearing a tutu, holding a pink umbrella, on the ground next to the pig is a robin bird wearing a top hat, and in the corner are the words "stable diffusion". And this image is perfect.</p><p>[01:45:56] <strong>Alex Volkov:</strong> All of the subjects and the different things that I told you exist in this picture: the robin bird is on the ground and has a top hat, the astronaut is holding an umbrella, and the pig is wearing a tutu. The spelling of the text is perfect. And understanding of multiple subjects, I think, is something that we've seen be great in DALL-E, for example, but previous versions of Stable Diffusion were not nearly as good at multiple subjects and multiple colors, for example, and this nails all of them.</p><p>[01:46:22] <strong>Alex Volkov:</strong> The umbrella is the right color, the tutu is the right color, the bird, everything. And it looks just really awesome.
And I gotta wonder what something like this, combined with the speed of the previous announcement, SDXL Lightning, could mean. So they're advancing very fast as well, and it's great to see.</p><p>[01:46:39] <strong>Alex Volkov:</strong> Breaking news, shout out to Stability for announcing this. They didn't release it yet, they announced: Stable Diffusion 3 is coming to us very soon, and it looks awesome. And I think, unless folks here on stage want to chat about some other stuff that we haven't covered yet,</p><p>[01:46:56] <strong>Alex Volkov:</strong> this is everything we've talked about on ThursdAI. Outside of that, we had our returning hosts, co-hosts, and speakers on the panel. So I want to thank Nisten, I want to thank Yam. LDJ was here, Junyang from Qwen, and a bunch of other folks. I want to shout out Matt Shumer again, and Mark Heaps from Groq, for joining and telling us all about this.</p><p>[01:47:14] <strong>Alex Volkov:</strong> And if you missed any part of this conversation, definitely feel free to check us out. With that, I want to say thank you for joining ThursdAI as always. I think we're coming up on almost exactly two hours, and I'm gonna let you go, and then we'll see what else gets released on this crazy AI Thursday.</p><p>[01:47:31] <strong>Alex Volkov:</strong> Thank you everyone.</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-22nd-groq-near-instant</link><guid isPermaLink="false">substack:post:141944289</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 23 Feb 2024 01:40:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141944289/58ce6e6a61da45793ad4beffc24f1616.mp3" length="77806486" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6484</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141944289/b9b9d15279b4aed631c4d929b1081f78.jpg"/></item><item><title><![CDATA[🔥 ThursdAI - Feb 15, 2024 - OpenAI changes the Video Game, Google changes the Context game, and other AI news from past week]]></title><description><![CDATA[<p>Holy SH*T, </p><p>These two words have been said on this episode multiple times, way more than ever before, I want to say, and it's because we got 2 incredibly exciting breaking news announcements in a very, very short amount of time (in the span of 3 hours), and the OpenAI announcement came as we were recording the space, so you'll get to hear our live reaction to this insanity.
</p><p>We also had 3 deep-dives, which I am posting in this week's episode. We chatted with <a target="_blank" href="https://twitter.com/YiTayML/status/1757115386829619534">Yi Tay</a> and <a target="_blank" href="https://twitter.com/maxhbain">Max Bain</a> from Reka, which trained and released a few new foundational multimodal models this week, and with <a target="_blank" href="https://twitter.com/dome_271/status/1757427068520796632/photo/2">Dome</a> and <a target="_blank" href="https://twitter.com/pabloppp">Pablo</a> from Stability, who released a new diffusion model called Stable Cascade. And finally, I had a great time hanging with Swyx (from Latent Space), got a chance to turn the microphone back at him, and had a conversation about Swyx's background, Latent Space, and AI Engineer.</p><p>I was also very happy to be in SF today of all days, as my day is not over yet; there's still an event which we co-host together with A16Z, folks from Nous Research, Ollama and a bunch of other great folks, just look at all these logos!
Open Source FTW 👏 </p><p>TL;DR of all topics covered: </p><p>* <strong>Breaking AI News</strong></p><p>* 🔥 OpenAI releases SORA - text to video generation (<a target="_blank" href="https://openai.com/sora">Sora Blogpost with examples</a>)</p><p>* 🔥 Google teases Gemini 1.5 with a whopping 1 MILLION token context window (<a target="_blank" href="https://x.com/OriolVinyalsML/status/1758148444588319020?s=20">X</a>, <a target="_blank" href="https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/">Blog</a>)</p><p>* <strong>Open Source LLMs</strong> </p><p>* Nvidia releases Chat With RTX local models (<a target="_blank" href="https://x.com/BrianRoemmele/status/1757446979360370788?s=20">Blog</a>)</p><p>* Cohere open sources Aya 101 - a 12.8B model supporting 101 languages (<a target="_blank" href="https://twitter.com/CohereForAI/status/1757359611399532921">X</a>, <a target="_blank" href="https://huggingface.co/CohereForAI/aya-101">HuggingFace</a>)</p><p>* Nomic releases Nomic Embed v1.5 with Matryoshka embeddings (<a target="_blank" href="https://x.com/nomic_ai/status/1757782157374734665?s=20">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Andrej Karpathy leaves OpenAI (<a target="_blank" href="https://twitter.com/karpathy/status/1757600075281547344">Announcement</a>)</p><p>* OpenAI adds memory to ChatGPT (<a target="_blank" href="https://twitter.com/joannejang/status/1757470618264429008">X</a>)</p><p>* <strong>This week's Buzz (What I learned at WandB this week)</strong></p><p>* We launched a new course with <a target="_blank" href="https://twitter.com/HamelHusain">Hamel Husain</a> on enterprise model management (<a target="_blank" href="https://www.wandb.courses/courses/enterprise-model-management?utm_source=thursdai&#38;utm_medium=referal&#38;utm_campaign=feb-15">Course</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Reka releases Reka-Flash, 21B & Reka Edge MM models (Blog, <a target="_blank"
href="https://chat.reka.ai/chat">Demo</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* WhisperKit runs on watchOS now! (<a target="_blank" href="https://x.com/argmaxinc/status/1757803686124990770?s=20">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Stability releases Stable Cascade - a new AI model based on Würstchen v3 (<a target="_blank" href="https://stability.ai/news/introducing-stable-cascade">Blog</a>, <a target="_blank" href="https://huggingface.co/spaces/multimodalart/stable-cascade">Demo</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Goody2.ai - A very good and aligned AI that does NOT want to break the rules (<a target="_blank" href="https://www.goody2.ai/chat">try it</a>)</p><p>🔥 Let's start with Breaking News (in the order of how they happened) </p><p>Google teases Gemini 1.5 with a whopping 1M context window</p><p>This morning, <a target="_blank" href="https://twitter.com/JeffDean/status/1758146022726041615">Jeff Dean</a> released a thread full of crazy multimodal examples of their new Gemini 1.5 model, which can handle up to 1M tokens in the context window. The closest model to that so far was Claude 2.1, and that was not multimodal. They also claim they are researching up to 10M tokens in the context window. </p><p>The thread was chock full of great examples, some of which highlighted the multimodality of this incredible model, like being able to pinpoint and give a timestamp of an exact moment in an hour-long movie, just by getting a sketch as input. This, honestly, blew me away. They were able to use the incredibly large context window, break down the WHOLE 1 hour movie into frames, provide additional text tokens on top of it, and the model had near perfect recall.
</p><p>They used <a target="_blank" href="https://twitter.com/GregKamradt/status/1755660594710278356">Greg Kamradt</a>'s needle in the haystack analysis on text, video and audio and showed incredible, near perfect recall, which highlights how much advancement we got in the area of context windows. Just for reference, less than a year ago, we had this chart from Mosaic when they released MPT. That chart's Y axis tops out at 60K; the graph above goes to 1 MILLION, and we're less than a year apart. Not only that, Gemini Pro 1.5 is also multimodal. </p><p>I've got to give props to the Gemini team, this is quite a huge leap for them, and for the rest of the industry, this is a significant jump in what users will expect going forward! No longer will we be told "hey, your context is too long" 🤞 </p><p>A friend of the pod <a target="_blank" href="https://sub.thursdai.news/p/thursdai-special-episode-interview">Enrico Shippole</a> joined the stage, you may remember him from our deep dive into extending Llama's context window to 128K, and he showed that <a target="_blank" href="https://arxiv.org/abs/2309.12307">a bunch</a> <a target="_blank" href="https://arxiv.org/abs/2309.14509">of</a> <a target="_blank" href="https://arxiv.org/abs/2310.01889">new</a> <a target="_blank" href="https://arxiv.org/abs/2401.01325">research</a> <a target="_blank" href="https://arxiv.org/abs/2401.03462">makes</a> all this possible also for open source, so we're waiting for OSS to catch up to the big G. </p><p>I will sum up with this: Google is the big dog here, they invented transformers, they worked on this for a long time, and it's amazing to see them show up like this, like they used to do, and blow us away! Kudos 👏 </p><p>OpenAI teases SORA - a new giant leap in text to video generation</p><p>You know what?
I will not write any analysis; I will just post a link to the blog post and upload some videos that the fine folks at OpenAI just started releasing out of the blue.</p><p>You can see a ton more videos on Sam's Twitter and on the official <a target="_blank" href="https://openai.com/sora">SORA website</a></p><p>Honestly, I was so impressed with all of them that I downloaded a bunch and edited them all into the trailer for the show!  </p><p>Open Source LLMs </p><p>Nvidia releases Chat With RTX </p><p>Chat With Notes, Documents, and Video</p><p>Using a Gradio interface and packing 2 local models, Nvidia released a bundle of packaged open-source AI, including RAG and even YouTube transcription chat! </p><p>Chat with RTX supports various file formats, including text, pdf, doc/docx, and xml. Simply point the application at the folder containing your files and it'll load them into the library in a matter of seconds. Additionally, you can provide the URL of a YouTube playlist and the app will load the transcriptions of the videos in the playlist, enabling you to query the content they cover.</p><p><strong>Chat for Developers</strong></p><p>The Chat with RTX tech demo is built from the TensorRT-LLM RAG developer reference project available on <a target="_blank" href="https://github.com/NVIDIA/trt-llm-rag-windows">GitHub</a>. Developers can use that reference to develop and deploy their own RAG-based applications for RTX, accelerated by TensorRT-LLM.</p><p>This week's Buzz (What I learned with WandB this week)</p><p>We just released a new course! Hamel Hussein released a course on enterprise model management! 
</p><p><strong>Course name: Enterprise Model Management</strong><strong>Course Link: </strong><a target="_blank" href="https://www.wandb.courses/courses/enterprise-model-management?utm_source=thursdai&#38;utm_medium=referal&#38;utm_campaign=feb-15">wandb.me/emm-course</a><strong>Who is this for: </strong>The course is targeted at enterprise ML practitioners working with models: MLOps engineers, ML team leaders, ML engineers. It shows, at both a conceptual and a technical level, how to get the most value out of the W&B Model Registry and automations. Attached is also a screenshot of a slide from the course on what different personas (MLOps, ML exec, etc.) get from the Model Registry.<strong>What can they expect: </strong>Learn how to store, version, and evaluate models like top enterprise companies do today, using an LLM training & evaluation example. Big value props: improved compliance, collaboration, and disciplined model development.</p><p>Vision & Video</p><p>Reka releases Reka Flash and Reka Edge multimodal models</p><p>Reka, co-founded by Yi Tay, previously of DeepMind, trained and released 2 foundational multimodal models. I tried them and was blown away by their ability to not only understand text and perform VERY well on benchmarks (73.5 MMLU / 65.2 on HumanEval) but also boast incredible (honestly, never before seen by me) multimodal capabilities, including understanding video! 
</p><p><a target="_blank" href="https://x.com/altryne/status/1757436949730676966?s=20">Here's a thread</a> of me getting my head continuously blown away by the quality of the tonality of this multimodality (sorry...😅)</p><p>I uploaded a bunch of video examples and was blown away: it understands tonality (the "dive dive Diiiiive" example), understands scene boundaries, and does incredible OCR across scenes (the Jason/Alex example from the speakers). </p><p>AI Art & Diffusion</p><p>Stable Cascade (<a target="_blank" href="https://stability.ai/news/introducing-stable-cascade">link</a>)</p><p>Stability AI introduced a new text-to-image generation model called Stable Cascade that uses a three-stage approach to produce high-quality images with a compressed latent space, making it more efficient to train and use than previous models. It achieved better results than other models in evaluations while having faster inference speeds. The company released code to train, fine-tune, and use control models like inpainting with Stable Cascade to enable further customization and experimentation. Stability AI aims to lower barriers to AI development through models like this one.</p><p>Nate did a comparison between a much slower SDXL and Stable Cascade <a target="_blank" href="https://t.co/kns3Us6Mqo">here</a>: </p><p></p><p>Here’s the transcript for the whole episode; you should definitely check it out! It was really one of the coolest shows we've had, and we had over 2K folks listening in! </p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, this is Alex Volkov, you're on ThursdAI, and I just gotta record this intro real quick, because today marks one of the more singular days in AI that I remember since I started recording ThursdAIs, which was itself a singular day, March 14th, 11 months ago, when GPT-4 was released and announced. 
We've since had a few days like this; OpenAI's Dev Day was one such day, and today marks another.</p><p>[00:00:38] <strong>Alex Volkov:</strong> Google has released an update to their model, talking about 1 million tokens in the context window, basically unlimited. And then, just an hour or two later, OpenAI said, you know what, we also have something in store, and released the most incredible jump in capability of video generation, text-to-video generation.</p><p>[00:01:02] <strong>Alex Volkov:</strong> It's called SORA, and what you hear is us recording live, knowing only about Google, which came out an hour and a half before we started recording, and then somewhere in the middle, I think minute 35 or something, you'll hear our live reaction to the incredibly mind-blowing advancement in text-to-video that OpenAI just released.</p><p>[00:01:31] <strong>Alex Volkov:</strong> And I just wanted to record this as I'm finishing up the editing and about to start writing the newsletter, to say, days like this really are the reason why I'm all in on AI and I'm very excited about the changes and advancements.</p><p>[00:01:49] <strong>Alex Volkov:</strong> And I'm sure there will be more days like this going forward. We've yet to see what Apple comes up with, we've yet to really see what Meta comes up with in Llama 3, etc. And, yeah, I just hope you enjoy this. I don't have a lot of words here besides letting you listen to the rest of the episode, and I'll say that I was very happy to be in San Francisco for this, the place where most of this happens, and very happy to be in the company of good friends, both in the virtual world, those on stage in our Twitter live recording, and in person, sitting across from Swyx, a friend of mine with whom I recorded an interview that you can hear at the end of this.</p><p>[00:02:30] <strong>Alex Volkov:</strong> I just couldn't let go of this chance. 
We also had conversations, besides the updates and the breaking news, with the folks who worked on some of the stuff we talked about. I interviewed Yi Tay and Max Bain from Reka, which you'll hear later, in a deep dive into Reka's multimodal models, which blew me away just yesterday.</p><p>[00:02:52] <strong>Alex Volkov:</strong> And so my head kept getting blown away this week. And I also interviewed the folks who built Stable Cascade, a new Stability model that outperforms the existing Stability models: Dom and Pablo. And all of those were great conversations, in addition to just generally the folks who join me from week to week, Nisten and Far El and Alignment Lab, and we had Robert Scoble join us, with whom I've been buddying up since Vision Pro was released, as he was expecting, and that blew me away just a week ago.</p><p>[00:03:23] <strong>Alex Volkov:</strong> And I'm very excited to share with you this whole thing, and, yeah, I hope you enjoy listening to these as much as I enjoy making them. And if you do, just share them with a friend; it would really help. And give us a 5-star review on Apple.</p><p>[00:03:38] <strong>Alex Volkov:</strong> That would greatly help. With that, I'll give you the ThursdAI thing.</p><p>[00:03:43] <strong>Alex Volkov:</strong> All right, let's go. How's it going, everyone? Welcome to ThursdAI. Today is February 15th, and it's quite a day in the AI updates that we've had so far. Quite a day. Even today, this morning, we had a bunch of updates. But besides those, we had quite a crazy week as well. Very interesting show today, very interesting show today.</p><p>[00:04:13] <strong>Alex Volkov:</strong> My name is Alex Volkov, I'm an AI evangelist with Weights & Biases, and right now I'm getting my selfie taken by my co-host for today, Swyx. Welcome,</p><p>[00:04:23] <strong>Swyx:</strong> Hey, hey, hey. 
Good morning, everyone.</p><p>[00:04:25] <strong>Alex Volkov:</strong> And we're in the Latent Space Studio in San Francisco. I flew in just last night. And as I was flying in, there was more news happening. So we're going to cover all of this.</p><p>[00:04:34] <strong>Alex Volkov:</strong> We have a very exciting show today. We have a bunch of guests, special guests, coming in the second hour of this. So hopefully we'll see folks from Reka, and hopefully we'll see some folks from Stability. We're going to get to talk about Google and everything in between. So meanwhile, settle in.</p><p>[00:04:50] <strong>Alex Volkov:</strong> This is going to be a great show today in San Francisco. And maybe I'll also share with you why I flew in here today. That's gonna come up next. So welcome to ThursdAI, and we're gonna get started. All right there. Let's get started. Let me smoothly fade out the music and say hi to everyone here on stage. Hey, Nisten, welcome. We have Robert Scoble over here, folks. We've been more friendly lately than usual, because Robert and I are both members of the Vision Pro cult. I think that's what you call it, Robert.</p><p>[00:05:37] <strong>Alex Volkov:</strong> But today's the space for AI. And Robert, you've been covering AI on your feed as well for a long time. We have, obviously, Swyx on stage, but also in front of me, which is super cool. And it's been a while, brother. It's great, you just flew back from</p><p>[00:05:51] <strong>Swyx:</strong> Singapore.</p><p>[00:05:52] <strong>Swyx:</strong> Yeah, Chinese New Year.</p><p>[00:05:53] <strong>Alex Volkov:</strong> Are you jet lagged at all or are you good?</p><p>[00:05:55] <strong>Swyx:</strong> I'm good actually. I have had very little sleep, but for some reason that always helps with the jet lag.</p><p>[00:06:00] <strong>Alex Volkov:</strong> Yes, awesome. 
And I also want to say hi to Alignment Lab's Austin and Far El as well, folks who are working on open-source models; we usually cover a bunch of the stuff they're doing, and they're usual co-hosts and experts here on ThursdAI.</p><p>[00:06:11] <strong>Alex Volkov:</strong> So if you've never joined ThursdAI before, just a brief recap of what we're doing. As I said before, my name is Alex Volkov. I'm an AI evangelist with Weights & Biases. It's always so fun to say. And Weights & Biases is a company that is basically helping all these companies build their AI models, and it's super cool.</p><p>[00:06:26] <strong>Alex Volkov:</strong> And I flew in, I went to the office last night, and I have some cool videos to share with you from the office as well.</p><p>[00:06:32] <strong>Alex Volkov:</strong> and this</p><p>[00:06:33] <strong>Alex Volkov:</strong> is ThursdAI. ThursdAI is a Twitter space and newsletter and podcast that I started a year ago. And then slowly this built a community of fine folks who show up to talk about everything that happened in the world of AI for the past week.</p><p>[00:06:46] <strong>Alex Volkov:</strong> And there haven't been many weeks like this last one that highlight how important and how cool ThursdAI actually is. Because we just had so much, so much to cover today. Usually I start the space with a roundup of the stuff we're going to run through, just for folks who are not patient or don't have a lot of time; we run through everything we're going to talk about, and then we dive deep. Because we have some breaking news, and I even have, hopefully, I have my breaking news button.</p><p>[00:07:16] <strong>Alex Volkov:</strong> Oh, I don't. Oh my God. Okay.</p><p>[00:07:17] <strong>Swyx:</strong> Oh no.</p><p>[00:07:17] <strong>Alex Volkov:</strong> I'm not set up for a breaking news button, but it's fine.</p><p>[00:07:20] <strong>Alex Volkov:</strong> We'll imagine this. 
I'm going to put this in the post edit. With that said, are you guys ready for a brief recap? Let's go to a brief recap.</p><p>[00:07:27] Recap and TL;DR</p><p>[00:07:27] <strong>Alex Volkov:</strong> Alright, folks, back for the recap. Today is ThursdAI, February 15th. This is a recap of everything we talked about. And, ooh boy, this was one of the worst days to be caught outside of my own personal production studio, because my breaking news button didn't make it all the way here. And there was so much breaking news.</p><p>[00:07:57] <strong>Alex Volkov:</strong> So obviously, as I woke up, the biggest breaking news of today was... actually, I cannot decide what was the biggest breaking news. So the first piece of breaking news from today was Google releasing a teaser of Gemini 1.5. And 1.5 was not only a continuation of the Gemini Pro that we got last week; 1.5 was actually teased with up to 1 million, a whopping 1 [00:08:20] million tokens in the context window, which is incredible.</p><p>[00:08:23] <strong>Alex Volkov:</strong> Just for comparison, ChatGPT is currently at 128K, and the best, highest offering up until Gemini was 200K with Anthropic's Claude. Google teased this out of the gate with 1 million tokens, and they claim they have up to 10 million tokens of context window in the demos, which is incredible.</p><p>[00:08:44] <strong>Alex Volkov:</strong> And they've shown a bunch of demos. They did the needle-in-the-haystack analysis that we've talked about from Greg Kamradt, and it's just quite an incredible release from them. They talked about how you can put in a whole hour of a movie, a Buster Keaton one, I think. And then you can actually ask questions about the movie, and it'll give you the exact timestamp of when something happens. They talked about it being multimodal, where you can provide a sketch and say, hey, when did this scene happen, and it will pull it out, just like, incredibly, like magic; mind-blowing, mind-blowing stuff. And all of this needs a lot of context, because you take this video, you turn it into images, you send this into context.</p><p>[00:09:22] <strong>Alex Volkov:</strong> They also talked about how you can send 10 hours of audio within one prompt, and the quality of retrieval is very, very high. You're talking about like 90-plus percent, 95-plus percent in the haystack, which is incredible. Again, we had Enrico Shipolle, a friend of the pod who worked on the YaRN paper and the RoPE methods for extending the Llama context.</p><p>[00:09:46] <strong>Alex Volkov:</strong> And he brought like four papers or something that show that open source is actually unlocking this ability as well. And not only was today an incredible day just generally; not only did Google talk about a large context window, we also saw that Nat Friedman and Daniel Gross just invested 100 million in a company called Magic, who also talk about multimodality and a large context window up to 1 million as well.</p><p>[00:10:08] <strong>Alex Volkov:</strong> So it was very interesting to see both of them release on the same day as well. We then geeked out about Gemini. We talked about Andrej Karpathy leaving OpenAI, and invited him to come to ThursdAI and Latent Space as well. And then we also mentioned that OpenAI adds memory and personalization to ChatGPT, which is super cool.</p><p>[00:10:25] <strong>Alex Volkov:</strong> They didn't release it to many people yet, but personalization is my personal thread of 2024, because these models, especially with the larger context windows with perfect recall, these models will become our buddies that remember everything about us, especially tied into different devices.</p><p>[00:10:43] <strong>Alex Volkov:</strong> Like the Tab that's somewhere here behind me, getting built in San Francisco. We briefly mentioned that NVIDIA released Chat with RTX, local models that you can download and run on your NVIDIA GPUs. It has RAG built in. It has chat with YouTube videos, and it's super cool. We talked about Cohere's release of Aya 101, a big multilingual model.</p><p>[00:11:01] <strong>Alex Volkov:</strong> And our friend of the pod Far El was talking about how he didn't find it super impressive. Unfortunately, he dropped in the middle of this. Apologies, Far El, but Cohere released a big multilingual model, which is also pretty cool. We mentioned that Nomic, our friends at Nomic, which we mentioned last week, released open-source embeddings.</p><p>[00:11:17] <strong>Alex Volkov:</strong> If you guys remember, they released an update to those embeddings, Nomic Embed 1.5 with Matryoshka embeddings. Matryoshka is obviously the name of the Russian dolls that sit one inside the other. And we're going to actually talk with the authors of the Matryoshka paper, not next Thursday, but the one after that.</p><p>[00:11:34] <strong>Alex Volkov:</strong> So we're going to cover Matryoshka, but it's what OpenAI apparently, no, not apparently, confirmed they used to reduce dimensions in the API for embeddings. Super cool. We're going to dive deep into this. I'm going to learn, you're going to learn. It's going to be super cool.</p><p>[00:11:48] <strong>Alex Volkov:</strong> And as we were talking about OpenAI, I got a ping on my phone, because I'm subscribed to all updates from their main account, and we had a collective holy s**t moment. 
Everybody's jaw was on the floor, because OpenAI just released Sora, which is a foundational video model, a text-to-video model, that just blew us the F away, pardon my French, because of the consistency.</p><p>[00:12:08] <strong>Alex Volkov:</strong> So if you've seen, how should I say, the area of video generation, it has been evolving fairly quickly, but not as quick as what we just saw. First we saw attempts at taking Stable Diffusion and rendering frame by frame, and the consistency wasn't there; moving from one frame to another, the face would change and everything.</p><p>[00:12:30] <strong>Alex Volkov:</strong> You guys saw this, right? So we moved from those hallucinatory kinds of videos towards consistent videos, where Stability recently released and gave us SVD, which was like one-to-three-second videos. Runway ML gives you the option to choose where the video is going to go, if it's going to be a zoom-in, like brushes, all these things.</p><p>[00:12:49] <strong>Alex Volkov:</strong> And now all of them seem just so futile, because OpenAI's Sora can generate up to 60 seconds of a video. And honestly, we were sitting here just watching; all of us just opened the Sora website, and we were just blown away by the consistency and the complexity of the scenes that you can generate, the reflections.</p><p>[00:13:06] <strong>Alex Volkov:</strong> There was one scene where a woman was walking through a very busy street in Japan, and her coat stays the same, her face stays the same. There's another where a Dalmatian dog climbs out of one window and jumps into another. All the spots on the Dalmatian are perfectly in balance, the legs are... it's really unbelievable how high-quality a thing OpenAI released. And what's unbelievable to me also is that the jump from what we saw in video to the open-source stuff, or even the Runway stuff and Pika stuff, the jump in fidelity, in quality, in consistency, is so much higher than the jump from like 200,000 tokens to 1 million tokens that Google did.</p><p>[00:13:44] <strong>Alex Volkov:</strong> So it does feel like some folks at OpenAI sat there and said, hey, Google just released something, it's super cool, it's picking up attention on Twitter, let's release something else that we have behind the scenes. It looked super polished. So shout out to the folks who worked on Sora. Really, if you haven't seen the videos, you'll see them in the show notes, and you'll definitely see them everywhere, because Hollywood is about to get seriously, seriously disrupted; the level of quality is just amazing.</p><p>[00:14:08] <strong>Alex Volkov:</strong> Compare this with all the vision and sound stuff. Moving back to the recap, I'm getting excited again. We then talked about Reka Flash and Reka Edge from a company called Reka AI. And, as I love bringing the people who actually built the thing to talk about the thing,</p><p>[00:14:23] <strong>Alex Volkov:</strong> we had Yi Tay and we had Max as well from Reka, to talk to us about their multimodal models. I was very, very impressed with Reka's multimodal understanding. And compared to Gemini Pro, which is probably huge and runs on all the GPUs and TPUs, this model is 21 billion parameters, and Reka Edge is even smaller.</p><p>[00:14:41] <strong>Alex Volkov:</strong> And yet it was able to understand my videos to an extent that even surprised the guys who are the co-founders of the company. It understood tonality, text, and audio in a very specific and interesting way. 
So we had a conversation with the Reka folks. And continuing on this thread, we also had a new model from Stability called Stable Cascade that is significantly faster than SDXL and generates hands and text out of the blue.</p><p>[00:15:07] <strong>Alex Volkov:</strong> It's based on something called Würstchen, which we learned today is a hot dog. And we had the folks behind this: Dom, and I'm blanking on the name of the other author who joined, I apologize. It was a very exciting day. So we had a conversation with the guys behind Würstchen and Stable Cascade as well.</p><p>[00:15:24] <strong>Alex Volkov:</strong> So definitely check this out. We mentioned that WhisperKit now runs on watchOS, which is quite incredible, because Siri's voice-to-text is still not that great. And I think that's most of what we discussed. And then I flipped the mic on my friend here who sits in front of me, and I had a deep-dive interview with Swyx.</p><p>[00:15:41] <strong>Alex Volkov:</strong> In the Latent Space studio. He just posted a few images as well, and it was a great conversation, so definitely worth a follow and a listen if you haven't listened to it. With that, I think we've recapped ThursdAI on one of the more seminal days in AI that I remember, one thing after another, and we all hope that Meta will just release Llama 3.</p><p>[00:16:01] Investments updates from Swyx</p><p>[00:16:01] <strong>Alex Volkov:</strong> Unless I missed some stuff that's very important. I'll just double check. Nisten, out of the stuff that we've sent, did I miss anything else? Swyx, did I miss anything else?</p><p>[00:16:10] <strong>Swyx:</strong> Today there was also a LangChain Series A. True. With LangSmith.</p><p>[00:16:13] <strong>Swyx:</strong> Yes. There was Magic.dev, a Series A with Nat Friedman.</p><p>[00:16:16] <strong>Alex Volkov:</strong> I was thinking of covering this around the Google stuff, because they also announced longer-context craziness.</p><p>[00:16:21] <strong>Alex Volkov:</strong> But definitely, definitely both of those.</p><p>[00:16:23] <strong>Swyx:</strong> Lambda Labs also, 300 million, Series C.</p><p>[00:16:26] <strong>Alex Volkov:</strong> Oh, wow, yeah, I even commented. I said, hey, Mitesh good. So we love Lambda, definitely. Most of the stuff that we play around with is happening on Lambda. And</p><p>[00:16:34] <strong>Swyx:</strong> Lindy also had their GA launch today.</p><p>[00:16:37] <strong>Alex Volkov:</strong> Nice. Okay.</p><p>[00:16:38] <strong>Swyx:</strong> Today was a very bad day to launch [00:16:40] things, because everyone else launched things.</p><p>[00:16:41] <strong>Swyx:</strong> Yes. If you're not Gemini, it's going to be a struggle.</p><p>[00:16:44] <strong>Alex Volkov:</strong> I was just thinking, magic.dev. And I guess let's move to discussing the breaking news of the hour, as we already are. Let's talk about Google, and Gemini 1.5.</p><p>[00:16:55] Google teases Gemini Pro 1.5 with 1M context windows</p><p>[00:16:55] <strong>Alex Volkov:</strong> Do we do a musical transition? Sure, let's do a musical news transition. This is not the Breaking News music, not by even a stretch. But imagine that we have Breaking News right now, because we do. Just an hour or so ago, we had an update from Jeff Dean and then Sundar Pichai, and then a blog post, and then a whole thread and a bunch of videos from Google.</p><p>[00:17:27] <strong>Alex Volkov:</strong> And if you guys remember some Google videos from before, these seem more authentic than the kind of quote-unquote fake video that we got previously with Gemini Ultra. 
So just a week after Google released Gemini Ultra, which is now available as, aka, Gemini Advanced, and just a week after they killed Bard almost entirely as a concept, they're now teasing.</p><p>[00:17:48] <strong>Alex Volkov:</strong> Teasing, did not release, teasing, Gemini 1.5. They're teasing it, and they're coming out with a bang. Honestly, folks, at least for me, that's how I expect Google to show up. Unlike before, where they were lagging behind GPT-4 by eight or nine months, what they're doing now is leading a category, or at least they're claiming they are.</p><p>[00:18:07] <strong>Alex Volkov:</strong> And so they released Gemini 1.5, and they're teasing this with a whopping 1 million tokens of context window in production and up to 10 million tokens of context window in research. And just to give context, they put up this nice animated video where they show Gemini Pro, which they have currently, not 1.5, the Pro version.</p><p>[00:18:26] <strong>Alex Volkov:</strong> That is around 32K, I think, and then they have GPT-4 with 128K, and then they show Claude 2 at 200K, and then Gemini 1.5 is a whopping 1 million tokens, which is ridiculous. Not only that, they also went a little bit further and released it with the needle-in-a-haystack analysis from our friend Greg Kamradt, who usually does this.</p><p>[00:18:50] <strong>Alex Volkov:</strong> I will not be able to pronounce his name. I asked Greg to join us; maybe he will. A needle-in-a-haystack analysis analyzes the ability of the model to recall: whether or not it's able to actually process all these tokens and actually get them and understand what happens there. And quite surprisingly, they show like 99 percent recall, which is incredible.</p><p>[00:19:10] <strong>Alex Volkov:</strong> And we all know, previously with long context windows, we had this dip in the middle. 
We've talked about the butter-on-toast analogy, where attention is like the butter and the context window is the toast: you spread, and you don't have enough butter to spread evenly across the whole toast.</p><p>[00:19:27] <strong>Alex Volkov:</strong> We've talked about this. It doesn't seem, at least on the face of it, that they are suffering from this problem. And that's quite exciting. It is exciting also because this model is multimodal, which is very important to talk about. They definitely show audio, and they are able to scrub through, they said, I think, 10 hours of audio or so.</p><p>[00:19:47] <strong>Alex Volkov:</strong> Which is quite incredible. Imagine going through 10 hours of audio and saying, hey, when did Alex talk about Gemini on ThursdAI? That would be super dope and quite incredible. They also did video. They showed an hour of video, of Buster Keaton's something, and because the model is multimodal, the cool thing they did is provide this model with a reference, with a sketch.</p><p>[00:20:11] <strong>Alex Volkov:</strong> So they drew a sketch of something that happened during this video, not even talking about it, just a sketch. And they provided this multimodal model with an image of it and said, when did this happen in the video? And it found the right timestamp. And so I'm very, very excited about this. If you can't hear it in my voice, Swyx can probably tell you that I look excited as well, because it's quite,</p><p>[00:20:31] <strong>Alex Volkov:</strong> as far as I'm concerned, a breakthrough for multiple reasons. And now we're gonna have a short discussion.</p><p>[00:20:35] Enrico talking about open source alternatives to long context</p><p>[00:20:35] <strong>Alex Volkov:</strong> I want to say hi to Enrico here. Enrico, welcome up on stage. 
Enrico Shipolle, one of the authors of the YaRN paper. We've had Enrico on before to talk to us about long context. Enrico, as we sent this news in DMs, you replied that there have been some breakthroughs lately that kind of point to this.</p><p>[00:20:51] <strong>Alex Volkov:</strong> And you want to come up, say hi, and introduce it briefly. And let's chat about long context.</p><p>[00:20:57] <strong>Enrico Shipolle:</strong> Hi, Alex. Yeah, so there actually have been a lot of research improvements within the last couple of months, even from before we submitted YaRN. You could already scale even transformers to millions of tokens of context length back then, essentially. We previously, in YaRN, worked on scaling the rotary embeddings, which was a traditional issue in long context.</p><p>[00:21:19] <strong>Enrico Shipolle:</strong> So if you don't mind, I'll probably go through some of the research really quickly, because unfortunately,</p><p>[00:21:25] <strong>NA:</strong> so on January 2nd, there was one paper called LLM Maybe LongLM. That's a mouthful, essentially, but they were showing that you can process these long input sequences during inference using something called Self-Extend, which allows you to basically extend the context window without even fine-tuning these models.</p><p>[00:21:48] <strong>NA:</strong> And then on January 7th, 2024, there was another paper released, called Soaring from 4K to 400K, which allows you to extend the LLM's context with something called an activation beacon. 
With these activation beacons, they essentially condense the raw activations in these models into a very compact form, so that the large language model can perceive this longer context</p><p>[00:22:14] <strong>NA:</strong> even in a smaller context window. The great thing about these activation beacons, or LLM Maybe LongLM, is essentially they only take a few lines of code to modify the transformer architecture and get all these massive performance benefits for long-context inference.</p><p>[00:22:33] <strong>Alex Volkov:</strong> Are you serious? Are we getting one of those breakthroughs that take two lines of code, kind of?</p><p>[00:22:37] <strong>NA:</strong> No, so basically all of these require minimal code changes to even be able to scale to long token counts, whether it's audio, video, image, or text. Text is generally the shortest token count; if you look at something like RefinedWeb or SlimPajama, the average token count of a piece of text in there is only anywhere from 300 to 500 tokens.</p><p>[00:23:02] <strong>NA:</strong> So this is actually generally a data-centric issue too, when you're talking about long context with even training a standard natural language processing model. The thing about audio and video is, these have a ton of tokens in them. And then the final note I'm going to put in, unfortunately, before I have to head out. I know this was a lot of information.</p><p>[00:23:22] <strong>NA:</strong> I can link these.</p><p>[00:23:24] <strong>Alex Volkov:</strong> Yeah, we're gonna add some of this, we're gonna add some links, the links that I'll be able to find. Enrico, if you can send</p><p>[00:23:29] <strong>NA:</strong> Yeah, I'll send you all the research papers.</p><p>[00:23:32] <strong>Alex Volkov:</strong> Yeah, you want to land one last thing before we move on? 
Yeah, go ahead.</p><p>[00:23:36] <strong>Enrico Shippole:</strong> Yeah, so just the last thing: on January 13th there was a paper called Extending LLMs' Context Window with 100 Samples, and they were able to show that even with a very limited number of long-context samples, you can massively improve the context lengths of these models. I should mention these are the papers that I found did pretty rigorous evaluation overall, because there's a huge problem with long-context evaluation in a lot of this work. But I feel these authors applied their knowledge pretty well, and these results are really impactful, even for the open source community, because you don't need a lot of computational power to scale these context windows massively now.</p><p>[00:24:24] <strong>Enrico Shippole:</strong> And that's basically everything I wanted to say.</p><p>[00:24:27] <strong>Alex Volkov:</strong> Thank you, Enrico. Folks, definitely give Enrico a follow. We've had quite a few conversations with Enrico; if somebody in the open source community knows about long context, Enrico is that guy. And we're definitely going to follow up with links in the show notes for a bunch of this research.</p><p>[00:24:41] <strong>Alex Volkov:</strong> And just to sum up: there have been breakthroughs, and it doesn't look like Google is the only folks making news today. Nat Friedman and Daniel Gross, the guys behind AI Grant, who ran the Vesuvius Challenge recently and invest in everything AI, just announced a hundred-million-dollar investment in Magic. As they put it, quote unquote, [00:25:00]</p><p>[00:25:00] <strong>Alex Volkov:</strong> "We were so impressed with these guys that we decided to give them a hundred million dollars." And they also talk about the model that does 
Something like a 10-million-token context window. Swyx, you wanna talk about the Magic thing?</p><p>[00:25:12] <strong>Swyx:</strong> They first talked about this last year, like six months ago, and then went completely silent. So we didn't really know what was going on with them. It's good to see that this is at least real, because six months ago they were talking about a 5-million-token context model.</p><p>[00:25:28] <strong>Swyx:</strong> But nothing was demoed, not even a little teaser graphic or anything like that. For Nat to have invested this amount, I think, is a huge vote of confidence. And it basically promises that you can do proper codebase embedding and reasoning over an entire codebase. It's funny to have a code model that specifically does this, because Gemini could also potentially do this.</p><p>[00:25:58] <strong>Alex Volkov:</strong> They showed Three.js in their examples. Did you see this?</p><p>[00:26:01] <strong>Swyx:</strong> No, I didn't see the Three.js one, but okay, yeah. And we have a pretty consistent result from what we've seen so far that GPT-4 is simultaneously the best LLM but also the best code model. There are a lot of open source code models, CodeLlama, DeepSeek Coder, all these things.</p><p>[00:26:18] <strong>Swyx:</strong> They're not as good as GPT-4. So I think there's a general intelligence lesson to be learned here. It remains to be seen, because Magic did not release any other details today, whether or not it can actually do better than just a general-purpose Gemini.</p><p>[00:26:34] <strong>Alex Volkov:</strong> Yeah, and so the example they showed is that they took Three.js, if you folks know the Three.js library from Mr.</p><p>[00:26:40] <strong>Alex Volkov:</strong> Doob, and embedded all of it in the context window, then asked questions, and it was able to understand all of it, including finding things in an incredibly huge codebase. 
And I think I want to just move this conversation along.</p><p>[00:26:52] <strong>Alex Volkov:</strong> Yeah, Nisten, go ahead. I see you unmuting. And folks on the stage, feel free to raise your hands if you want to chime in. We'll hopefully get to some of you, but we have a bunch of stuff to chat about as well.</p><p>[00:27:01] <strong>Nisten Tahiraj:</strong> I'll just quickly say that there are still some drawbacks to these systems, and by systems I mean the long-context models where you dump in a whole codebase or entire components. The drawback, even from the demos, still seems to be that yes, they now look much better at reading and taking in information, but they're not yet much better at producing similarly long output. They're still only going to output, I think, up to 8,000 tokens or so, and I don't know if that's a byproduct of the training, or whether they could be trained to output much longer sequences.</p><p>[00:27:43] <strong>Nisten Tahiraj:</strong> However, the benefit now, unlike a retrieval-augmented generation system, a RAG, is this: the drawback with RAG was that yes, it could search over the document, but it would only find maybe two or three points and bring them up, whereas this is a more holistic understanding of the entire input you've dumped in.</p><p>[00:28:03] <strong>Nisten Tahiraj:</strong> But again, we're not quite there yet where they can just output a whole textbook. That's what I mean. That's the next challenge</p><p>[00:28:11] <strong>Far El:</strong> to solve.</p><p>[00:28:12] <strong>Alex Volkov:</strong> The immediate reaction I had is very similar to yours, Nisten. RAG is something everybody uses right now. We've talked about long context versus something like RAG before, and the usual conversation is about cost. 
How much does it cost you per token, right?</p><p>[00:28:30] <strong>Alex Volkov:</strong> If you send 10 million tokens and each token is like a cent, you're basically paying 10 million cents for every back and forth. Also speed and user experience: if your users are sitting there waiting 45 to 60 seconds because they sent a bunch of context, and you can solve this with RAG, then RAG is probably a better approach for you.</p><p>[00:28:48] <strong>Alex Volkov:</strong> However, look at how this performs. At least in Google's examples, they showed the video transparently, they sped up the inference, but on at least the video question it took them around 40 seconds to extract a frame from an hour-long video. They sent an hour's worth of video as context, and the inference took 40 seconds.</p><p>[00:29:13] <strong>Alex Volkov:</strong> Folks, like I said before and I'll say again, regular ChatGPT queries, without any crazy context, have sometimes taken me 40 seconds. Now, you may say, okay, Alex, they showed a demo in their own environment, and ChatGPT is a production environment. Yes, but the possibility is that I can send, I don't know, 500,000 tokens in the context window and within 40 seconds get a response equivalent to what I get from GPT-4.</p><p>[00:29:38] <strong>Alex Volkov:</strong> Then I think a bunch of the arguments that RAG is better purely from an inference-speed perspective start slowing down. An additional thing I want to say before I get to you, Yam: the immediate response in my head was, okay, RAG is done for, or at least, not done for, but the crown on RAG's head is slipping.</p><p>[00:29:56] <strong>Alex Volkov:</strong> Everybody's talking about RAG. There are vector databases everywhere. We just had folks talk about ColBERT and different things. Okay, RAG is now shaky. 
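The cost argument here is easy to put in numbers. A back-of-envelope sketch, with hypothetical prices that are placeholders rather than any provider's real rates:

```python
# Back-of-envelope cost comparison between stuffing a long context and
# retrieving a few chunks with RAG. The price is a made-up placeholder.

PRICE_PER_1K_INPUT_TOKENS = 0.01  # hypothetical: $0.01 per 1K input tokens

def prompt_cost(num_tokens: int,
                price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Dollar cost of sending num_tokens of input at a flat per-1K rate."""
    return num_tokens / 1000 * price_per_1k

full_context = prompt_cost(10_000_000)  # dump the whole corpus every call
rag_context = prompt_cost(4_000)        # ~3 retrieved chunks + question

print(f"long context per call: ${full_context:.2f}")  # $100.00
print(f"RAG per call:          ${rag_context:.2f}")   # $0.04
```

The three-orders-of-magnitude gap per call is the whole argument: unless providers cache or discount repeated context, paying to re-ingest the corpus on every request dominates.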
But the other thing I started to think is: is fine-tuning also at risk? And Swyx, this goes back to what you just said about general models versus fine-tuned or very specific models. A general model can take a whole book; they had an example of this with a very low-resource language, Kalamang, where there's only one book that's a grammar and dictionary for the language. They literally threw the book into the context window, and the model was able, from in-context learning, to generalize, understand it, and perform better than fine-tuned models.</p><p>[00:30:37] <strong>Alex Volkov:</strong> And I'm thinking here, okay, RAG is the first thing to go. Is fine-tuning second? Are we going to stop fine-tuning and just send context? So Swyx, I want to hear your reaction to the language thing, and then we're going to get to Yam and some more folks.</p><p>[00:30:48] Discussion about effects of longer context windows</p><p>[00:30:48] <strong>Swyx:</strong> Yeah, I think there are generalizable insights about learning language. It's not surprising that throwing that into the context window works, especially if it's a cognate language of something the model already knows. Then you're just learning substitutions, and don't forget that transformers were initially trained to do language translation; this is bread-and-butter stuff for transformers.</p><p>[00:31:12] <strong>Swyx:</strong> The second thing I'd respond to, and I have to keep banging this drum: long context does not kill RAG, because of cost. Imagine if every time you throw 10 million tokens of context in there, you have to pay like a thousand dollars. 
Because unless something is fundamentally very different about this paradigm, you still pay to ingest those tokens.</p><p>[00:31:39] <strong>Swyx:</strong> So ultimately, people will still want RAG for cost, and then for attribution reasons, like debuggability and attribution, which is still valuable. So long context is something I have historically quite underweighted for these reasons. I'm looking to change those assumptions, of course, because obviously these are magical capabilities if you can use them.</p><p>[00:32:03] <strong>Alex Volkov:</strong> Magical capabilities indeed.</p><p>[00:32:10] <strong>Far El:</strong> Yeah, I just want to say, on the topic of latency and ingesting a lot of context, I think there's a solution we didn't talk about here that is going to be incorporated into all the flagship models, which is embedding knowledge into the KV cache, something many of the inference engines today can do.</p><p>[00:32:34] <strong>Far El:</strong> You simply prefill the context beforehand, and then you don't need to process it through your model again. You're not sending the whole database each time you call your model; it's just saved. Imagine that OpenAI had some sort of API where you embed the KV cache beforehand, at a reduced price, of course, and then it uses that as your context.</p><p>[00:32:59] <strong>Far El:</strong> Basically, somewhere in the middle between the two. The reason it's not supported in flagship models now is that the first flagship model supporting a million tokens came out today. But if we go there, I think this is something we're going to see in all of the APIs.</p><p>[00:33:18] <strong>Far El:</strong> Moreover, I also don't [00:33:20] think that RAG is done for, because RAG shows you very, very clearly and very simply: 
Where the information comes from, what the model is basing itself on. You can claim that with attention the model can do this as well, but it's not like RAG. With RAG, you're showing the clients, the people, exactly where it comes from.</p><p>[00:33:40] <strong>Far El:</strong> And there are use cases where this is absolutely a must. So I think there will always be room for RAG for these specific use</p><p>[00:33:49] <strong>NA:</strong> cases, and long</p><p>[00:33:50] <strong>Far El:</strong> context with KV caching is going to be, I think, the method for embedding, for example, a full database, or a book, or something big, and using it multiple times with many different</p><p>[00:34:05] <strong>Far El:</strong> prompts.</p><p>[00:34:06] <strong>Alex Volkov:</strong> Or also multimodality, right? So thank you for this, it definitely makes sense, and I think somebody also left a similar comment. We'll dive into the KV cache stuff maybe in the next one. But I want to talk about the multimodality part of this, because we've mentioned it multiple times.</p><p>[00:34:25] <strong>Alex Volkov:</strong> I think we did this every ThursdAI since GPT-4 launched, because we were waiting for the vision part of GPT-4. And we've talked about 2024 being the year of multimodal. We're going to talk about a bunch of multimodal stuff today, specifically with the Reka folks and Reka Flash, which understands videos.</p><p>[00:34:40] <strong>Alex Volkov:</strong> So I'm going to have to see whether Reka understands videos better than Gemini, but the Gemini folks specifically talked about a bunch of multimodal effects on the context window: if you send videos, at least the way they did this, it was just frames. 
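Far El's KV-cache prefixing idea boils down to "prefill once, reuse many times." A toy sketch of the pattern, where the expensive prefill pass is a stand-in function rather than a real inference-engine API:

```python
# Sketch of prefix caching: compute the KV cache for a long, fixed prefix
# (a codebase, a book) once, then reuse it across many prompts. The
# "model" here is a stand-in, not a real inference engine.

from functools import lru_cache

@lru_cache(maxsize=8)
def encode_prefix(prefix: str) -> tuple:
    """Pretend to run the expensive prefill pass and return a KV cache.

    In a real engine this is where most long-context latency and cost
    lives; inference servers expose variants of this kind of caching.
    """
    print(f"prefilling {len(prefix)} chars...")  # runs once per unique prefix
    return ("kv-cache-for", hash(prefix))

def answer(prefix: str, question: str) -> str:
    kv = encode_prefix(prefix)  # cache hit after the first call
    return f"answer to {question!r} using {kv[0]}"

book = "imagine an entire codebase or dictionary here " * 1000
answer(book, "how does the renderer work?")  # pays the prefill cost
answer(book, "where is the event loop?")     # reuses the cached prefix
```

This is exactly the "middle ground" Far El describes: per-call cost drops toward RAG levels while the model still attends over the whole corpus.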
They broke this movie down into something like 500,000 frames and just sent it in the context window.</p><p>[00:35:04] <strong>Alex Volkov:</strong> And they basically said: we have all this video in the context window, plus a little bit of text. I think context window expansions like this will allow for incredible multimodal use cases, not only video but audio too. We've talked previously with the folks from</p><p>[00:35:20] <strong>Alex Volkov:</strong> Prophetic about fMRI and EEG signals getting multimodal applications as well, and Google specifically highlighted context window enlargement for these things.</p><p>[00:35:32] <strong>Alex Volkov:</strong> I want to highlight this too because it's definitely coming. I'm waiting to be able to live-stream video, for example, and I know some folks from Twelve Labs are talking about almost-live stream embedding. So, definitely multimodal from Google. I think, folks, we've been at this for 30 minutes.</p><p>[00:35:48] Andrej Karpathy leaves OpenAI</p><p>[00:35:48] <strong>Alex Volkov:</strong> Alright folks, I think we're going to move on and talk about a couple of things we've already covered to an extent, but there's news from OpenAI, specifically Andrej Karpathy leaving. I think this broke in The Information. Karpathy, some folks here call him senpai, is a very legit, I don't know, top-ten, top-five, whatever, researcher, and he could potentially have been listening to the space we had with LDJ after he left. The Information's piece didn't have a lot of detail, but then Andrej, transparent dude that he is, came out and said, hey, this wasn't a reaction to anything specific that happened, because speculations were flying.</p><p>[00:36:33] <strong>Alex Volkov:</strong> And I 
think, at least to some extent, we were responsible for some of these speculations, because we did a whole space about this that he could have just listened to. But speculation was flying: maybe this was Ilya-related, maybe this was open-source-related, all of these things.</p><p>[00:36:46] <strong>Alex Volkov:</strong> Andrej basically helped start OpenAI, then left and helped kickstart the Tesla Autopilot program, scaled that to 1,500, then left. In his chat with Lex Fridman, Andrej said that basically he wanted to go back to hands-on coding, and at OpenAI his bio at least said he was working on a kind of Jarvis. Andrej has also been talking about AI as an OS. Swyx, you wanna cover his OS approach?</p><p>[00:37:14] <strong>Alex Volkov:</strong> I think you talked about this. He had a whole outline, I think you</p><p>[00:37:17] <strong>Swyx:</strong> also talked about this. LLM OS. Yeah. He wasn't working on it so much as thinking about it.</p><p>[00:37:21] <strong>Swyx:</strong> Thinking about it, yeah. And maybe now that he's independent, he might think about it. 
The main thing I will offer as actual alpha rather than speculation: I did speak to friends at OpenAI who reassured us that it really was nothing negative at OpenAI when he left.</p><p>[00:37:40] <strong>Swyx:</strong> Apparently because they spoke to him before he left.</p><p>[00:37:43] <strong>Swyx:</strong> So yeah, the way I described it is he's following his own internal North Star, and every time he does that, the rest of us</p><p>[00:37:51] <strong>Alex Volkov:</strong> And definitely the rest of us win.</p><p>[00:37:53] <strong>Alex Volkov:</strong> The open source community is hoping, I've seen many, many posts saying, hey, Andrej will unite the different bands of open source.</p><p>[00:38:02] <strong>Alex Volkov:</strong> Andrej posted this thing on his X where his calendar was just free, which maybe shows part of the rationale for why he left: meetings and meetings and meetings, and now he can actually work. So shout out to Andrej Karpathy for all he did at OpenAI and for all he's going to continue to do.</p><p>[00:38:16] <strong>Alex Volkov:</strong> We're definitely going to keep up to date with the stuff he releases. Andrej, if you're listening to this, you're more than welcome to join. We're here every Thursday. You don't have to have a calendar meeting for this. You can hop on the space and just join. Also on the topic of OpenAI, they've added memory to ChatGPT, which is super cool.</p><p>[00:38:31] <strong>Alex Volkov:</strong> They released a teaser; I didn't get into the beta, so they released it to a limited number of people. They added memory to ChatGPT, and the way they added it is very, very cool as well. 
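OpenAI hasn't published how ChatGPT memory works under the hood, but the general pattern, a store the model writes to and a prefix it reads from, can be sketched as follows. The keyword rule here is a crude stand-in for the model deciding what's worth remembering; everything about it is an assumption for illustration.

```python
# Minimal sketch of model-managed memory. In a real system the LLM itself
# would decide which statements to persist; here a trivial keyword rule
# stands in for that judgment.

class MemoryStore:
    def __init__(self):
        self.memories: list[str] = []

    def maybe_remember(self, user_message: str) -> None:
        # stand-in for "the model manages memory for you"
        if user_message.lower().startswith(("i am", "i prefer", "my ")):
            self.memories.append(user_message)

    def context_prefix(self) -> str:
        """Saved memories get prepended to later conversations."""
        return "\n".join(f"[memory] {m}" for m in self.memories)

store = MemoryStore()
store.maybe_remember("I prefer answers with code examples")
store.maybe_remember("What's the weather?")  # transient, not persisted
print(store.context_prefix())  # [memory] I prefer answers with code examples
```

The point of the pattern is that personalization becomes a small, cheap prefix instead of the user re-explaining themselves in every chat.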
So I've said for a long time that 2024 is not only about multimodality, which is obviously going to come, but also that it's about time we got personalization.</p><p>[00:38:51] <strong>Alex Volkov:</strong> I'm getting tired of opening a ChatGPT chat and having to remember to say the same things; it doesn't remember what I previously said. The folks at OpenAI are working on the differentiator, the moat, and other things, especially now that Google is coming after them with a 10-million-token context window.</p><p>[00:39:08] <strong>Alex Volkov:</strong> And they're now adding memory, where ChatGPT itself, the model, will manage memory for you and will try to figure out, oh, OpenAI, oh my god, breaking news. OpenAI just shared something. As I'm talking about them, you guys want to see this? Literally, I got a</p><p>[00:39:28] <strong>Alex Volkov:</strong> notification from OpenAI as I'm talking about this.</p><p>[00:39:30] <strong>Swyx:</strong> What?</p><p>[00:39:32] <strong>Alex Volkov:</strong> Let's look at this. Dude, I needed my breaking news button today. OpenAI just said: introducing Sora, our text-to-video model. Sora can create videos of up to 60 seconds.</p><p>[00:39:44] <strong>Alex Volkov:</strong> Holy s**t, this looks incredible. Oh my god, somebody please pin this to the, Nisten, you have to see, there's a video, a 60-second video, folks.</p><p>[00:39:54] <strong>Alex Volkov:</strong> Oh my god, breaking, I have to put the breaking news button here, holy s**t. 
So folks, just to describe what I'm seeing, and somebody please pin this to the top of the space: every video model we've had so far does 3 to 4 seconds. Pika, the other labs, I forgot their name now, Runway, all of these models,</p><p>[00:40:16] <strong>Swyx:</strong> they do</p><p>[00:40:16] <strong>Swyx:</strong> Oh my god, Runway.</p><p>[00:40:18] <strong>Alex Volkov:</strong> They do three to five seconds, and it looks wonky. This thing they're showing generates 60 seconds featuring highly detailed scenes, and the video they shared, which I'm going to repost and somebody already put up on the space, has a couple walking hand in hand, with a zoomed-in, behind-the-scenes-style camera zooming in.</p><p>[00:40:39] <strong>Alex Volkov:</strong> It's consistent. I cannot believe this. Holy s**t, the consistency is crazy. Nothing changes. You know how previously video would jump frames, and faces and things would shift?</p><p>[00:40:52] <strong>Alex Volkov:</strong> Wow, okay, so I guess we should probably talk about this. Reactions from folks. I saw LDJ wanted to come up. The reaction I'm seeing is</p><p>[00:41:00] <strong>Far El:</strong> just wild. Honestly, it looks crazy. It looks really good quality, better than most text-to-video models I've seen.</p><p>[00:41:08] <strong>Alex Volkov:</strong> Holy s**t, okay, so I'm scrolling through the page, folks,</p><p>[00:41:13] <strong>Alex Volkov:</strong> for those who are listening, openai. 
com/sora. Sora is their text-to-video model. I'm seeing a video of a model walking through a Japanese street; the prompt is, a stylish woman walks down a Tokyo street filled with warm glowing neon animated city signage, she wears a black leather jacket, long red dress, and black boots, and the consistency here is insane.</p><p>[00:41:35] <strong>Alex Volkov:</strong> I do</p><p>[00:41:35] <strong>Far El:</strong> Check out the mammoths. Or actually go on their website, the Sora page [00:41:40] on OpenAI's website. They've got a</p><p>[00:41:42] <strong>Far El:</strong> few examples. It's crazy. It's crazy. I've</p><p>[00:41:45] <strong>Far El:</strong> never seen a</p><p>[00:41:48] <strong>Alex Volkov:</strong> If you showed me this yesterday, Far El, and said this is generated, I would not believe you. So in the same video of this woman walking, the camera zooms in on her eyeglasses, her face stays the same, the same consistency, and you can see the reflection in the sunglasses.</p><p>[00:42:08] <strong>Far El:</strong> Alex, you have to go on the website. There's this video where literally the prompt is reflections in the window of a train traveling through the Tokyo suburbs. And</p><p>[00:42:19] <strong>Far El:</strong> honestly, it looks like someone captured this. No way this is AI</p><p>[00:42:23] <strong>Far El:</strong> generated. It's, it's crazy.</p><p>[00:42:27] <strong>Alex Volkov:</strong> Wow,</p><p>[00:42:27] <strong>Alex Volkov:</strong> folks. What's the availability of this? Let's see, what do we know? On safety: they'll be taking several important safety steps ahead of making Sora available in OpenAI's products, so it's not available yet. 
They're working with red teamers; they don't want this used for deepfake porn, obviously.</p><p>[00:42:43] <strong>Alex Volkov:</strong> That's like the first thing the waifus are going to use it for. The C2PA metadata that, if you guys remember, we've talked about them starting to include in DALL-E, they're probably going to include here as well. And new techniques prepared for deployment, leveraging the existing safety methods.</p><p>[00:42:56] <strong>Alex Volkov:</strong> Okay, research techniques.</p><p>[00:42:58] <strong>Far El:</strong> Crazy.</p><p>[00:43:00] <strong>Alex Volkov:</strong> Consistency is crazy, right folks?</p><p>[00:43:02] <strong>Swyx:</strong> Yeah, it's not available, it looks like.</p><p>[00:43:03] <strong>Swyx:</strong> Not available yet, to answer your question. They released some details about it being a diffusion model. They also talked about it having links to DALL-E 3. Honestly, I don't know if people know that there was a DALL-E 3 paper, which is very, very rare in this age of not-open</p><p>[00:43:22] <strong>Swyx:</strong> AI.</p><p>[00:43:23] <strong>Alex Volkov:</strong> Yeah, not</p><p>[00:43:24] <strong>Swyx:</strong> open AI.</p><p>[00:43:24] <strong>Swyx:</strong> They're doing this synthetic data captioning thing for the DALL-E 3 model, and they reference the same method for Sora. I would just go read the DALL-E 3 paper.</p><p>[00:43:37] <strong>Alex Volkov:</strong> Wow. The consistency has been the biggest problem with these models, LDJ.</p><p>[00:43:41] <strong>Alex Volkov:</strong> Go ahead, please. As I'm reading this and reacting, my mind is literally blown by the demo of the doggy. Hold on, LDJ, one second. There's a demo. 
There's a video of a dog walking along one window ledge and jumping to another, and the paws look real. Folks, it literally does not look generated, unlike anything we've seen before.</p><p>[00:44:02] <strong>Far El:</strong> This is going to disrupt Hollywood immediately. We're talking about text-to-video disrupting media content creation and so on. This is it; this is the Midjourney moment of text-to-video, that same feeling we had when we were able to prompt Midjourney and get some really high-quality images. This is the same, but for video, essentially.</p><p>[00:44:23] <strong>Alex Volkov:</strong> This breaks reality for me right now. I'm literally watching this video multiple times. I cannot believe the dog's paws aren't morphing into different shapes. The spots on this Dalmatian stay in the same place throughout the video. It doesn't make sense. Alright, LDJ, go.</p><p>[00:44:37] <strong>Far El:</strong> Yeah, so</p><p>[00:44:38] <strong>Far El:</strong> Sam, here, I'll post it on the board. Sam said that certain select creators have access now. And, oh, I just lost the tweet. I'll get it. But yeah, he says some creators already have access, and I guess they're going to slowly expand it out to beta users or whatever.</p><p>[00:44:59] <strong>Alex Volkov:</strong> Wow, so Sam posted: we can show you what Sora can do. Please reply with captions for videos you'd like to see and we'll start making some.</p><p>[00:45:06] <strong>Alex Volkov:</strong> So</p><p>[00:45:06] <strong>Swyx:</strong> Oh yeah, basically give him some really complicated prompt, and let's go, let's go.</p><p>[00:45:12] <strong>Alex Volkov:</strong> A bunch of podcasters sitting, watching Sora, reacting in real time, and their heads are blown.</p><p>[00:45:17] <strong>Alex Volkov:</strong> Not literally, because this is insane. How's that for a prompt? 
I'm gonna post it. Hopefully someone will get it.</p><p>[00:45:25] <strong>NA:</strong> Just opening a portal through Twitter, through OpenAI, to the Munich and then string</p><p>[00:45:31] <strong>Alex Volkov:</strong> Oh, there's also, I don't wanna spend the rest of ThursdAI on this, 'cause we still have a bunch to talk about, folks.</p><p>[00:45:38] <strong>Alex Volkov:</strong> Is anybody not scrolling through examples right now? You definitely should. There's an example of a</p><p>[00:45:43] <strong>Swyx:</strong> there's only nine examples.</p><p>[00:45:45] <strong>Alex Volkov:</strong> What, what</p><p>[00:45:45] <strong>Far El:</strong> This is insane.</p><p>[00:45:46] <strong>Alex Volkov:</strong> No, the whole website has a bunch, scroll down.</p><p>[00:45:48] <strong>Alex Volkov:</strong> Every kind of example has</p><p>[00:45:51] <strong>Alex Volkov:</strong> more to scroll through. So I'm looking at an example of a chameleon, which has a bunch of spots, and guys, the spots stay in the same place. What the f**k? It doesn't move. Honestly, let's do this: everybody send this to your mom and say, hey mom, is this AI generated?</p><p>[00:46:07] <strong>Alex Volkov:</strong> Or not? Older folks will not believe this s**t.</p><p>[00:46:10] <strong>Swyx:</strong> I, I will</p><p>[00:46:13] <strong>Far El:</strong> What's the most impressive</p><p>[00:46:14] <strong>Swyx:</strong> compare this to Google</p><p>[00:46:15] <strong>Far El:</strong> right? Like humans,</p><p>[00:46:17] <strong>Swyx:</strong> don't know, I think you guys</p><p>[00:46:18] <strong>Alex Volkov:</strong> Hold on, Far El, I think we're talking over each other. Give us one sec. Swyx and then Far El.</p><p>[00:46:22] <strong>Swyx:</strong> Oh, sorry, yeah, there's a bit of a lag. Oh, no, nothing. 
Just compare this to Google Lumiere, where they released a bunch of sample videos as well.</p><p>[00:46:29] <strong>Swyx:</strong> I was impressed by the consistency of the Lumiere demo videos. They demoed pouring syrup onto a pancake and then infilling the syrup, showing it would be pretty realistic in pouring all that syrup. I didn't really see that kind of very technical test here.</p><p>[00:46:49] <strong>Swyx:</strong> But the resolution of these videos, the consistency of some of these movements between frames, and the ability to cut from scene to scene are way better. Instantly way better. I thought Lumiere was state of the art a few weeks ago, and now it's completely replaced by Sora.</p><p>[00:47:08] <strong>Swyx:</strong> This is a way better demo. I think OpenAI is showing Google how to ship.</p><p>[00:47:12] <strong>Alex Volkov:</strong> OpenAI decided to say, you know what, Google, you think you can one-up us with the context window?</p><p>[00:47:18] <strong>Alex Volkov:</strong> We've got another thing coming, because I've</p><p>[00:47:20] <strong>Swyx:</strong> just pull up the Lumiere page, then pull up the Sora page, look at them side by side, and you can see how much better they are.</p><p>[00:47:26] <strong>Alex Volkov:</strong> Lumiere</p><p>[00:47:26] <strong>Alex Volkov:</strong> was mind-blowing as well. Go ahead, Far El, because we're still reacting in real time to this whole ridiculously impressive thing.</p><p>[00:47:32] <strong>Far El:</strong> Yeah, I was just saying that the most impressive thing is how alive these video shots feel, right? Humans talking, action scenes. All the text-to-video models I've seen and used so far were very, very simplistic, right? 
It felt more like you're animating an image to do very minor movements.</p><p>[00:47:55] <strong>Far El:</strong> It wasn't actually alive in any way, but Sora's text-to-video is nuts: the quality, the consistency, the action, the actual action of the characters. I wonder how much granular control you have on a scene-to-scene basis. I know Google released a paper, I think a few months back, where they basically had a script that allowed for much more long-form</p><p>[00:48:27] <strong>Far El:</strong> video content, but I'm not sure if that's the case here. It's just really impressive. It's really impressive.</p><p>[00:48:35] <strong>Alex Volkov:</strong> I want to say, one of our friends, LaChanze, just sent this: at the bottom of the page it says, Sora serves as a foundation for models that can understand and simulate the real world. It's really hard for me to even internalize what I'm reading right now, because simulating the real world triggers something in me, tingles the simulation-hypothesis type of thing: this can regenerate a map of the world, then zoom in and generate all the videos.</p><p>[00:48:58] <strong>Alex Volkov:</strong> And I'm wearing this mixed slash augmented slash spatial reality headset that just generates, and this happens on the fly, and what am I actually watching here? So this says Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.</p><p>[00:49:15] <strong>Alex Volkov:</strong> Yeah. Alright, folks. 
I will say, let's do two more minutes, cause this is I can't believe we got both of them the same day today, holy s**t, we got 10 million context window from Google announcement, which is incredible, multi modal as well, I like, my whole thing itches right now to take the videos that OpenAI generated and shove them into, into a Gemini to understand what it sees and see if if it understands, it probably will.</p><p>[00:49:40] <strong>Alex Volkov:</strong> Wow.</p><p>[00:49:40] <strong>Far El:</strong> Thing that would make this Thursday a tiny bit even more awesome is if Meta comes out with telemetry. Too much, too much, too much.</p><p>[00:49:51] <strong>Alex Volkov:</strong> It's</p><p>[00:49:51] <strong>Alex Volkov:</strong> gonna be too much. We need, we need a second to like breathe. Yeah, definitely folks. This is literally like a singular day. Again, we've [00:50:00] had a few of those. We had one on March 14th when ThursdAI started, OpenAI released GPT 4, Anthropic released Claude, I think on the same day. We had another one when OpenAI Dev Day came about, and I think there's a bunch of other stuff.</p><p>[00:50:12] <strong>Alex Volkov:</strong> I consider this to be another monumental day. We got Gemini 1.5 with a potential 10 million context window, including incredible results in understanding multimodality in video, up to an hour of video.
And then we also have some folks from Reka that's gonna come up soon and talk about their stuff, which is, they just with all due respect to the Reka folks this news seems bigger, but they still launched something super, super cool we're gonna chat about, and now we're getting, it's just, the distance, we're used to jumps, we're used to state of the art every week, we're used to this, we're used to this model beats this model by Finetune, whatever, we're used to the OpenAI leaderboard, this is</p><p>[00:50:53] <strong>Alex Volkov:</strong> such a</p><p>[00:50:53] <strong>Alex Volkov:</strong> big jump on top of everything we saw.</p><p>[00:50:55] <strong>Alex Volkov:</strong> From Stable Video Diffusion. From what are they called again? I just said their name, Runway. I forgot, I always forget their name.</p><p>[00:51:02] <strong>Swyx:</strong> Poor guys.</p><p>[00:51:04] <strong>Alex Volkov:</strong> Poor Runway. From Pica Labs. From folks who are generating videos. This is just such a huge jump in capability. They're talking about 60 seconds.</p><p>[00:51:14] <strong>Alex Volkov:</strong> Oh, Meta just announced V-JEPA. Yeah, I don't know if V-JEPA is enough. People are commenting about V-JEPA, and I'm like, okay wait, hold</p><p>[00:51:21] <strong>Swyx:</strong> You, you spiked my heart rate when you said Meta just announced. I was like, what the f**k?</p><p>[00:51:25] <strong>Alex Volkov:</strong> the f**k? Meta literally just came out with an announcement, V-JEPA, self-supervised learning for videos.</p><p>[00:51:29] <strong>Alex Volkov:</strong> But, folks unless they come out with Llama 3 and it's multimodal and it's available right now, not Meta is not participating in the</p><p>[00:51:35] <strong>Swyx:</strong> thing</p><p>[00:51:36] <strong>Alex Volkov:</strong> day</p><p>[00:51:36] <strong>Far El:</strong> Oh wait, this is actually cool.
So this is this is something,</p><p>[00:51:39] <strong>Far El:</strong> actually a paper they came out with like about a month ago, but this is for video understanding. So this is pretty much like for input of video, while OpenAI's model is for output of video.</p><p>[00:51:51] <strong>Alex Volkov:</strong> It just, I will say it's a research thing, right? So they're not showing anything there, unless I'm mistaken. Um So, I kinda, so I still have a bunch of stuff to give you updates for, and I still have a bunch of interviews as well, there's a new stability model, but I'm still like, blown away, and I just wanna sit here and watch the videos,</p><p>[00:52:07] <strong>Alex Volkov:</strong> Is this what Ilya saw? Yeah, somebody reacted like, what did Ilya see? Did Ilya see a generated video and the model understanding this and that's why, that's why?</p><p>[00:52:16] <strong>Far El:</strong> No, I think, I think, I think AGI has been achieved internally at</p><p>[00:52:21] <strong>Far El:</strong> this rate.</p><p>[00:52:22] <strong>Alex Volkov:</strong> Wow. I, I'm, I'm still blown away. Like I, if a model can generate this level of detail in very soon, I just wanna play with this. I wish, I wish we had some time to, to, to, I wish I was one of the artists and I hope that somebody in the audience here is, and that they will come to talk about this on ThursdAI.</p><p>[00:52:43] <strong>Alex Volkov:</strong> I and because I'm, yeah. I'm still mind blown. So I see. Quite a few folks that I invited that I wanna, I wanna welcome to the stage. V-JEPA understands the world while Sora generates one. That's the comment that some folks left. And okay, okay. V-JEPA is going to be something we definitely cover because Meta released this and Meta are the GOATs, even though yeah, no, Meta's definitely GOATs.
I'm just a little bit lost for words right now.</p><p>[00:53:06] <strong>Nisten Tahiraj:</strong> Yeah, so if people have watched a lot of speeches from Yann LeCun, the main idea is that these AI models are not very good at understanding the world around them or thinking in 3D. So in some ways, you could reason out that a cat is a lot more intelligent, even if it was blind and it couldn't smell, it could still figure out where to go and find its litter box, stuff like that.</p><p>[00:53:30] <strong>Nisten Tahiraj:</strong> This is one part that's missing from the world model that they get purely just from word relationships or word vectors. And so this is a step in that direction, it seems. Again, I haven't read the paper, so I'm Half making stuff up here but it feels like this is a step in, in that direction towards AI models that understand what's going on like us and animals do.</p><p>[00:53:56] <strong>Nisten Tahiraj:</strong> So that, that's the main, the gist of it for, the audience.</p><p>[00:54:04] <strong>Alex Volkov:</strong> Oh, what, what a what A Thursday. What A Thursday. I gotta wonder how am I'm gonna summarize this, all of this. And I just wanna invite, we have here in the audience and I sent you a request to join. If you didn't get it. Make sure that you're looking at requests and then accept. And then we should have, we should have Max as well at some point.</p><p>[00:54:20] <strong>Alex Volkov:</strong> Lemme text Max. 'cause we have guest speakers here from, from Reka that we wanna chat with. Meanwhile I'm gonna continue and, and move forward in some of the conversations. Let's roll back. Okay, while we're still super excited and I can't wait for this to come out, this is an announcement that they did.</p><p>[00:54:35] <strong>Alex Volkov:</strong> It's very polished. We haven't seen we didn't see any access or anything about when it's going to come out. I do feel that this is a breakthrough moment. from Google and from OpenAI.
And it does look like it's reactionary to an extent. The folks in OpenAI were sitting on this and saying, Hey, what's a good time to release this?</p><p>[00:54:52] <strong>Alex Volkov:</strong> And, actually now, let's steal some thunder from Google and their like 10 million thing that also not many people can use. And let's show whatever we have that not many people can use which, which is an interesting. Think, to think about, because, again, the pressure is on a bunch of other labs, on Meta, to release something, we know Llama 3 is coming at some point, will it be multi modal, will it be able to generate some stuff every</p><p>[00:55:16] <strong>NA:</strong> Really, really quick, sorry to interrupt</p><p>[00:55:18] <strong>Alex Volkov:</strong> Go</p><p>[00:55:19] <strong>NA:</strong> the thing about V-JEPA seems to be good at is understanding video instructions I guess you could point the camera to something you're doing with your hands and arts and crafts things, or repairing something, and it understands what you're doing, so that, that's actually very</p><p>[00:55:36] <strong>NA:</strong> powerful for the datasets, datasets of skills that will come, because then you can generate actions. I, I think that, that will apply a lot to robotics, what they're doing.</p><p>[00:55:48] <strong>Alex Volkov:</strong> Oh, alright, yeah. And they also have the Ego4D datasets of robotics as well, and they've talked about this.</p><p>[00:55:55] NVIDIA releases Chat with RTX</p><p>[00:55:55] <strong>Alex Volkov:</strong> so let's go to open source like super quick. NVIDIA released Chat with RTX for local models. And it's actually like very, very cool. So a few things about Chat with RTX. First of all, NVIDIA packed a few, a few models for you. It's 38 gigabytes or something download. And they, they have they have quite a few I think they have two models packed in there.</p><p>[00:56:16] <strong>Alex Volkov:</strong> I wasn't sure which ones.
And this, this is basically a, a package you download. I don't know if a doc or not. That runs on any desktop PC with RTX 30 or 40 series with at least 8 gigabytes of VRAM. And it gives you a chatbot that's fully local. And we love talking about open source and local stuff as well.</p><p>[00:56:33] <strong>Alex Volkov:</strong> And it Not only that, they give you a RAG built in. So you can actually run this on some of the documents that you have. They also have something that runs through a YouTube. You can give it like a YouTube playlist or a video link, and it will it will let you talk to a YouTube video. So it has built in RAG, built in TensorRT-LLM, which runs on their, on their stuff RTX acceleration and.</p><p>[00:56:56] <strong>Alex Volkov:</strong> I think it's pretty cool, like it works only on the very specific types of devices, only for like gamers or folks who run these things but I think it's pretty cool that that folks are, that NVIDIA is releasing this. They also have something for developers as well to be able to build on top of this.</p><p>[00:57:11] <strong>Alex Volkov:</strong> And I think the last thing I'll say about this is that it's a Gradio interface, which is really funny to me that people are shipping Gradio interfaces on production. It's super cool.</p><p>[00:57:18] Cohere releases Aya 101 12.8B LLM with 101 language understanding</p><p>[00:57:18] <strong>Alex Volkov:</strong> Cohere releases an open source model called Aya 101, a model that's like 12.8 billion parameters with understanding of 101 languages from Cohere. It's, it's honestly pretty cool because Cohere has been doing a bunch of stuff. Aya outperforms the Bloom model and mT0 on a wide variety of automatic evaluations despite covering double the number of languages.</p><p>[00:57:41] <strong>Alex Volkov:</strong> And what's interesting as well, they released a dataset together with Aya and then what is interesting here?
Yeah, just, oh, Apache 2 license, which is super cool as well. Apache 2 license for, for this model. Let me invite Yi as a co host, maybe he can join. Far El, go ahead.</p><p>[00:57:58] <strong>Alex Volkov:</strong> Did you see, do you want to talk about Aya?</p><p>[00:58:00] <strong>Far El:</strong> Yeah first off, I I appreciate and commend Cohere for building a multilingual open source data set and so on. That's awesome. We need more of that. But unfortunately, With the first few questions that I asked in Arabic specifically most of the answers were complete. [00:58:20] nonsense on their trained model.</p><p>[00:58:23] <strong>Far El:</strong> Yeah. And to, to the point that it's it's laughable, right? For instance in Arabic, I asked who was the who was the first nation that</p><p>[00:58:32] <strong>NA:</strong> had astronauts on the moon. I</p><p>[00:58:38] <strong>Alex Volkov:</strong> Yes.</p><p>[00:58:39] <strong>NA:</strong> think, I think you cut out for a sec.</p><p>[00:58:43] <strong>Alex Volkov:</strong> I think he dropped. I don't see him anymore.</p><p>[00:58:45] <strong>NA:</strong> He might have</p><p>[00:58:46] <strong>NA:</strong> His phone might have</p><p>[00:58:47] <strong>Alex Volkov:</strong> yeah, we're gonna have to</p><p>[00:58:48] <strong>NA:</strong> I can briefly</p><p>[00:58:50] <strong>NA:</strong> comment on it. Yeah, we're pretty happy now that also Cohere has started contributing,</p><p>[00:58:56] <strong>NA:</strong> To open source because datasets are very important. And yeah, I think the reason it wasn't performing so well In other languages, it's just because some languages just didn't have enough data for it to be, to be trained.</p><p>[00:59:12] <strong>NA:</strong> But the beautiful thing is that it is Apache 2.0. You can just add your own language's dataset and it will literally make the whole thing better.
And yeah, that's, those are my comments on it.</p><p>[00:59:22] Interview with Yi Tay and Max Bain from Reka AI</p><p>[00:59:22] <strong>Alex Volkov:</strong> Awesome. All right, folks. So now we're moving into the interview stage, and we have quite a few folks. As one of the most favorite things that I want to do in ThursdAI, and it's been an hour since we've been here, is to actually talk with the folks who released the stuff that we're talking about.</p><p>[00:59:35] <strong>Alex Volkov:</strong> So the next thing I'm going to announce, and then we're going to talk with Yi Tay and Max, and then after that, we're going to talk with Dom as well. Earlier this week, a company named Reka AI released two models, or at least released a demo of two models, right? I don't think the API is available yet.</p><p>[00:59:51] <strong>Alex Volkov:</strong> We're going to talk about this as well. Called Reka Flash and Reka Edge. And Reka Flash and Reka Edge are both multimodal models that understand text, understand video, understand audio as well, which is like very surprising to me as well. And I had a thread where I just geeked out and my head was blown to the level of understanding of multimodality.</p><p>[01:00:09] <strong>Alex Volkov:</strong> And I think some of the folks here had, had had talked about Sorry, let me reset. Some of the folks here on stage have worked on these multimodal models. And so with this I want to introduce Yi Tay and Max Bain. Please feel free to unmute and introduce yourself briefly and then we're going to talk about some Reka stuff.</p><p>[01:00:25] <strong>Alex Volkov:</strong> Yi first maybe and then Max.</p><p>[01:00:27] <strong>Yi Tay:</strong> Yeah, thanks thanks Alex for inviting me here. Can people hear me actually?</p><p>[01:00:31] <strong>Alex Volkov:</strong> Yeah, we can hear you</p><p>[01:00:32] <strong>Yi Tay:</strong> okay, great, great.
Because this is the first, hey this is the first time using Spaces, so yeah, try to figure out how to use it. But thanks for the invite, Alex, and so I'll just introduce myself. I'm Yi Tay, and I'm one of the co founders of Reka AI.</p><p>[01:00:45] <strong>Yi Tay:</strong> We're like a new startup in the LLM space. We train multimodal models. Previously I worked at Google Brain working on Flan, stuff like that. So yeah, that's just a short introduction about myself. And maybe Max, do you want to introduce yourself? Yeah,</p><p>[01:00:59] <strong>Alex Volkov:</strong> Yeah, Max, go ahead, please.</p><p>[01:01:00] <strong>Max Bain:</strong> thanks Yi. Yeah.</p><p>[01:01:01] <strong>Max Bain:</strong> Thanks Alex for having me. So yeah, as you said yeah, I'm part of Reka. So I joined more recently, like six months ago. I just finished my PhD and that was all in video, audio, speech understanding. I've done a bit of work in open source. So if you use WhisperX that was like something I worked on and yeah, now working more as part of Reka and really enjoying it.</p><p>[01:01:22] <strong>Max Bain:</strong> yeah, that's pretty much</p><p>[01:01:23] <strong>Alex Volkov:</strong> First of all, let me just say, thank you for WhisperX, I did use this, and it was awesome, and I think this is how we connected before or at least, to some extent, I think this is the reason maybe I follow you, I was really surprised that you were at Reka. Let's talk about the models that you guys just released, because it's very impressive on the multimodality part, but also very impressive on just the regular comparative benchmark, and I think you guys released the comparisons to just regular MMLU scores, so Reka Flash gets 73.5 on MMLU and 65 on HumanEval, and GPT 4 is at 67, at least, and Gemini Ultra, they claim is 74, but your guys' model is like significantly smaller.
What can you tell us about, and I know you said before there's like a bunch of stuff that you won't be able to talk about what can you tell us about the performance just on the textual kind of comparison, even though this is a multimodal model and there's a bunch more that we will talk about?</p><p>[01:02:17] <strong>Yi Tay:</strong> Yeah, thanks so I'll just I can't really say that much, but I can say that there's quite a lot of headroom in pre training just for language alone, and I think that we're still not near the headroom yet for pre training, and I think even for us, actually, we have a better version of Reka Flash internally right now, but we've not even published metrics for that because while we were preparing for the launch we actually have even a better model now.</p><p>[01:02:39] <strong>Yi Tay:</strong> So I think actually there's still quite a lot of headroom for pushing that and there's quite a lot of things to do in pre training but I can't really, wouldn't be able to say much about, about like more details, yeah.</p><p>[01:02:48] <strong>Alex Volkov:</strong> About specifics. I did see the comments that you left in your thread, that you talked about the folks who do foundational models from scratch, they, there's a lot of banging, a lot of creation they have to do in the process as well, and it looks like at least some of this amount, some of this amount of hard work you guys had to go through in order to train these foundational models.</p><p>[01:03:09] <strong>Alex Volkov:</strong> So let's talk about the multimodality, what what can this model do? And I think I have a</p><p>[01:03:15] <strong>Alex Volkov:</strong> good idea, but can you talk to us on the multimodal part? What can those models do in terms of multimodality?</p><p>[01:03:23] <strong>Max Bain:</strong> Yeah, so in terms of multimodal yeah, if you just, you can use it actually on chat.reka.ai, and I would say the image understanding's pretty good, so people have noticed, you can recognize text pretty well. Yeah, more nuanced details, which tended to be a big issue with VLMs, like they used to be quite biased or it'd hallucinate a lot.</p><p>[01:03:41] <strong>Max Bain:</strong> I think in Reka Flash we noticed that dropped a lot. So I think kind of image understanding is, I'd say, pretty on par with Gemini Pro or a bit better. But yeah, that's up to the jury. The video understanding's also pretty good. We limit it to a one minute input. We do have internally like better things, but we're bounded by how much we can run like for free. So, yeah, I'd say yeah, overall pretty good video understanding and image. We haven't focused too much on audio right now, but that's like definitely on the, on the roadmap.</p><p>[01:04:14] <strong>Alex Volkov:</strong> I did run into the audio stuff, and I ran a few videos through the demo, and folks definitely should check out the demo. I'll add this in the show notes, and hopefully some folks will add this to the space as well. I just started uploading like short clips, and it's great to hear that you're saying, you guys are limited, you're limiting on the demo, but you can, if I'm hearing correctly, the model can understand longer videos as well.</p><p>[01:04:39] <strong>Alex Volkov:</strong> So I uploaded a video of a trip that I took to Hawaii and there's a submarine there and somebody was narrating in the submarine and he yelled something like, there, there, there's the submarine goes, dive, dive, dive, something like this. Very excitedly. And the model really understood this, and actually it said, the commenter said, Dive, dive, dive, like this, with a bunch of I's in it.</p><p>[01:05:00] <strong>Alex Volkov:</strong> And to me, this was like the, the holy s**t moment. I uploaded this video. The narrator for this video was very excited. I did not expect the model to actually pick up on the excitement.
And, it was very surprising to me because if you use something like Whisper and you just extract the audio from the, from the video, you would not get this result.</p><p>[01:05:20] <strong>Alex Volkov:</strong> You would not get like the, the excitement in this person's voice. And while we try to get Max back in, could you, so could you mention stuff about audio? Did you train this specifically for audio, as much as you can share, obviously? Or is it like a, a, a byproduct of, of just this model being multimodal and understanding and can listen as well?</p><p>[01:05:39] <strong>Yi Tay:</strong> Wait, so let me take a step back. Actually, thanks for sharing that example because I</p><p>[01:05:43] <strong>Yi Tay:</strong> actually had to watch your example to find that, that dive, dive, dive. I actually watched the entire video to find that, that clip. So I think it was a pretty Good clip. To be honest, it also surprised me that you found this example.</p><p>[01:05:56] <strong>Yi Tay:</strong> I, I think I was not also expecting this but I, we, we, we co trained this with many modalities. We are not sure, like, why this this specific case is like this. I think that's all I can say, but probably</p><p>[01:06:09] <strong>Yi Tay:</strong> yeah, next one</p><p>[01:06:09] <strong>Alex Volkov:</strong> I can definitely, definitely add one thing that this video was for sure not in your training data set because it was a private video of mine that didn't exist on the internet before. So it wasn't like a result of this video being in a training set. Max, you rejoined. I hope you heard some of this question as well, attributed to you.</p><p>[01:06:26] <strong>Alex Volkov:</strong> Did you see this example? Did it catch you off guard as well? Do you see other examples like this that were like very, very surprising in how this model performs?</p><p>[01:06:33] <strong>Max Bain:</strong> Yeah, I saw that. I was surprised.
To be honest, one thing I've noticed is that video benchmarks are quite poor. So [01:06:40] we, in the question answering datasets, we don't really get a chance to see this, especially ones that use like the speech information and things like that. So I guess really, I'm glad you like tested it a lot.</p><p>[01:06:50] <strong>Max Bain:</strong> Cause yeah, like internally we maybe haven't had a chance to I think but it's the benefit of kind of, yeah, training everything from scratch and adding all the modalities</p><p>[01:06:58] <strong>Yi Tay:</strong> and yeah</p><p>[01:06:58] <strong>Alex Volkov:</strong> That's awesome. So I also want to talk about the fact that you guys released two models and you talked about there's a bigger one. Let's talk about the Edge model. Can you talk about Are we going to be able to use this on device? I assume what's the play here? At least from what you can say, what's the play in terms of using the smaller models?</p><p>[01:07:14] <strong>Alex Volkov:</strong> Obviously, smaller models, the benefit of them is using them closer on the edge and device, and that's how you named it. What's the, what's the thinking about releasing, these two models in different sizes? And and what's your plans for those?</p><p>[01:07:26] <strong>Yi Tay:</strong> Oh yeah, sounds good. Yeah, that's a great question. So for the Edge model, the 7B model, it's I think it's it's at a size that it's possible to run it locally, but we are thinking also along the lines of okay, it's actually faster, like, for latency sensitive applications sometimes you just need certain things slightly faster than the 21B model, and it's also cheaper to, to host for, for a lot of applications. So I think that's mainly like one of the reasons why 7B.</p><p>[01:07:55] <strong>Yi Tay:</strong> We also ran lots of ablations at smaller scale. So this, this turns out to be just the size that we have.
And I, I think it's mostly, mainly for latency sensitive stuff. And then like for people who are like for businesses and stuff, like they might just choose to deploy the smaller model if they don't like, need a larger models like the.</p><p>[01:08:13] <strong>Yi Tay:</strong> Flash or the, the core model. So I think that's really like the idea behind it. And then from the research point of view, or at least from the playground point of view, right? Like the, the demo point of view is that people get to, to, to, to get a sense of the view of the model at the seven B scale and the 21 B scale, right?</p><p>[01:08:28] <strong>Yi Tay:</strong> So there's kind some kind of you might be able to, to get a sense of like how this setup looks at the different scale. I think that's mainly like why we deployed two models in the background just so that people can play with. Two variants and the stuff. Actually not much thought here.</p><p>[01:08:42] <strong>Yi Tay:</strong> I mean it's not like super complicated, it just happened this way, but yeah, that's all I can say, yeah.</p><p>[01:08:48] <strong>Alex Volkov:</strong> Awesome. And so folks can go check out the demo. It looks like you guys are set up for API keys as far as I understood. So will developers be able, be, be able to build with this? What stage are you in? I think you, you invited to a disco or something. Could you talk about how we can play with these models, what we can do, and if there's any expected open source, because we'll have open source here on ThursdAI.</p><p>[01:09:08] <strong>Alex Volkov:</strong> If there's anything to talk about there as well, please, please feel free to, to tell us how to actually try these models beyond the demo. Build with them.</p><p>[01:09:16] <strong>Yi Tay:</strong> Yeah, sounds, sounds good. So for API, actually, we, we, we have our API as a system already like working and then some people are already using it. 
We are like rolling out access coupling without the billing and everything, like we're just making sure everything is running very well.</p><p>[01:09:29] <strong>Yi Tay:</strong> And then we will roll it out soon. So I think that's mainly like the, the idea behind the slightly staged API release yeah, so that's for APIs. And then for open source, we I'll just be candid here, we are constantly, we're not sure yet about whether we want to do it or we don't want to do it.</p><p>[01:09:44] <strong>Yi Tay:</strong> It's always a question we have but we're not promising anything, but we're also not saying no yet. So it's a, it's a conversation we have very regularly about, about this kind of thing. So I, I, so yeah, that's currently the stance we have right now. But we are, we are</p><p>[01:09:55] <strong>Yi Tay:</strong> writing a we are writing a tech report it's not like a paper paper, but it's also not going to be that detailed, there'll, there'll be some details in the tech report, but not complete details, but some details.</p><p>[01:10:04] <strong>Yi Tay:</strong> But yeah, so I think that's mainly like the extent of like how we're thinking about things right now, yeah.</p><p>[01:10:09] <strong>Alex Volkov:</strong> Awesome. So first of all, I want to consider you guys friends of ThursdAI. Thanks for coming on the pod. And here, we definitely love open source. We talk about it all the time. And we're just like Champions of Open Source, so if you do release anything Open Source, you're welcome to come back as well. Yi and Max, we have Swyx here, I'm actually in Swyx's studio, so you can hear him from my microphone.</p><p>[01:10:29] <strong>Alex Volkov:</strong> And Swyx has a few follow up questions for Yi and Max as well, so Swyx, go ahead.</p><p>[01:10:32] <strong>Swyx:</strong> Oh, sure. Yeah. Hey I actually tried to set up a chat with you when I was in Singapore, but it didn't happen.</p><p>[01:10:39] <strong>Swyx:</strong> So sorry about that.
But I actually wanted to just chat with you more about something that you hinted at in your announcement post. You talked about how much of the infra you had to rebuild at Reka. Everything, you said everything from robust training infra. Proper Human Evaluation Pipelines and Proper RLHF Setups.</p><p>[01:11:00] <strong>Swyx:</strong> I was wondering if you can just give us like a preview of What did you miss? What does Google have? And then what do you think like the industry could innovate on?</p><p>[01:11:09] <strong>Yi Tay:</strong> Okay. That's a very interesting question. I need to be, need to think about what I can say and what I cannot say. But so definitely, definitely I miss TPUs, compared to GPUs, and being like a, a Googler for all my. Professional life, definitely the infra was completely new to me, and then at Reka, we have a lot of people from GTM and, and Google in Alphabet in general I think a lot of us could, I feel the same way and then, I think in terms of infra, I think GPU tooling is not as robust as at least what I experienced for TPU Infra back at, at, at Google. So I think that's mainly the first thing is the robustness of the, the training, the, the accelerators itself, right? And then also even things like file I/O is something that people take for granted. At Google, the file systems, XManager and stuff, orchestrators and stuff like that are, like, just so well designed at Google.</p><p>[01:12:02] <strong>Yi Tay:</strong> And then externally, it's a lot of them are just missing.
So I think yeah, I, I, yeah, I think that's basically on the training infra side and yeah, so I think, I think the tooling for like training like large models is not really super like robust externally, like you're, you're, it's not easy to like just pick up something and then like train like.</p><p>[01:12:26] <strong>Yi Tay:</strong> Like a 100B model easily without actually making sure your checkpointing is you're, you're, you're resuming your checkpointing, your, your nodes failing and stuff like that. I think those are, like, hard, hard stuff things that, that need to be taken care of but at, at, at Google some, some team does that for you.</p><p>[01:12:43] <strong>Yi Tay:</strong> Yeah, TLDR of the training infrastructure, yeah.</p><p>[01:12:48] <strong>Swyx:</strong> Does Google have the equivalent of Weights and Biases?</p><p>[01:12:51] <strong>Yi Tay:</strong> TensorBoard, I think, yeah.</p><p>[01:12:53] <strong>Swyx:</strong> Oh yeah, yeah, yeah, of course.</p><p>[01:12:55] <strong>Yi Tay:</strong> Yeah yeah, yeah, yeah yeah.</p><p>[01:12:58] <strong>Alex Volkov:</strong> So</p><p>[01:12:58] <strong>Alex Volkov:</strong> we don't work with Google yet, but hopefully if if folks at Google are listening to us and you want to use kind of Weights and Biases, definitely reach out. But at least you guys, now that you're out of Google, you definitely can. You want to follow up with Swyx, or are you,</p><p>[01:13:10] <strong>Swyx:</strong> are you Oh,</p><p>[01:13:10] <strong>Swyx:</strong> I don't know.
Did you guys talk about Reka Core already?</p><p>[01:13:13] <strong>Alex Volkov:</strong> Yeah, so I think, Yi, there's not a lot of stuff that you can say about the bigger model that you guys have, but give us a little teaser live for a few folks here on stage, like what can we expect from the bigger model, maybe when, what can you tell us?</p><p>[01:13:28] <strong>Yi Tay:</strong> So the bigger model, okay, so I can just say that we, we ourselves are quite impressed by the results and it's if, if if you try to extrapolate from our 7B and 21B relative to other models of the scale, you can try to imagine like what the type of metrics look like, right? But I think we are, we ourselves are, ourselves, we are quite impressed by, by the, the metrics.</p><p>[01:13:49] <strong>Yi Tay:</strong> So like we are I think that's all we can say. I think in the post, we say that coming out in coming weeks is around that ballpark. It's not like next week, the kind of thing. It's also not like one, two weeks. It's probably like a couple of weeks. But we still, we also kind of like a bit tired after the release.</p><p>[01:14:05] <strong>Yi Tay:</strong> Take</p><p>[01:14:05] <strong>Yi Tay:</strong> a few days light break and then start working again, that kind of thing. So Yeah. I think that that's, that's basically what I can say, but it's, I, we are, we are very happy with the model as well, yeah.</p><p>[01:14:17] <strong>Alex Volkov:</strong> All right, so we're excited to see this. I want to flip back to Max just for a second. Max, as we just covered, there's some stuff that I used that you guys were watching. Oh, I'm glad somebody tested this out. When folks interact with your demo, first of all, I'll just say, definitely folks should do the thumbs up, thumbs down, and reply, so you guys will get some nice RLHF.</p><p>[01:14:35] <strong>Alex Volkov:</strong> What other venues for giving you guys feedback can folks go to?
Is there a Discord you want to call out, or anything else you want to add as we move on?</p><p>[01:14:44] <strong>Max Bain:</strong> Yeah, thanks guys. We actually have a Discord channel, and people can post use cases where maybe our model is doing well or could do better, or maybe something they're not happy with in current models, like GPT-4V. And because we're [01:15:00] such a small team at an early stage, we're</p><p>[01:15:02] <strong>Max Bain:</strong> taking a lot of that on board. So if you can point out any of that stuff, in as much detail as you have, put it on the Discord; we're really happy for any feedback.</p><p>[01:15:10] <strong>Alex Volkov:</strong> Awesome. Are you guys distributed, by the way, or working co-located? Where's Reka located?</p><p>[01:15:16] <strong>Max Bain:</strong> All over the globe. Yi's in Singapore, I'm in London, sometimes the West Coast, but yeah, it's a remote-first</p><p>[01:15:23] <strong>Max Bain:</strong> company.</p><p>[01:15:25] <strong>Max Bain:</strong> And also, yeah, sorry, another thing: we do have job postings. So if you like the sound of Reka, you can also apply to join. We have quite a few</p><p>[01:15:35] <strong>Max Bain:</strong> positions open.</p><p>[01:15:42] <strong>Alex Volkov:</strong> Friends of the pod from now on. Yi, anything else you want to add as we finish up and move to the next topic?</p><p>[01:15:49] <strong>Yi Tay:</strong> No, thanks. Really, thanks for inviting us. It was really nice chatting with you; it's been great.</p><p>[01:15:56] <strong>Alex Volkov:</strong> Like I said, I was blown away by the performance of the multimodality. I was blown away by the tonality understanding, which I've never experienced in any model so far. 
I heard it was possible and I saw some technical stuff, but I'd never experienced it on something like my own videos.</p><p>[01:16:11] <strong>Alex Volkov:</strong> Definitely play around with the demo, folks; I'll add it in the show notes, and follow Yi and Reka. Oh, one last thing, Yi, before you go. What's the meaning of Reka? It's a word I know in Hebrew, but where did the name come from?</p><p>[01:16:24] <strong>Alex Volkov:</strong> I was really curious.</p><p>[01:16:26] <strong>Yi Tay:</strong> One of the meanings, and it's not official, it's not canon, is that it comes from the Reka</p><p>[01:16:35] <strong>Yi Tay:</strong> in Eureka. But it's a bit reverse-engineered: when people ask us, this is what we say. It's just one of the interpretations, not really canon.</p><p>[01:16:49] <strong>Alex Volkov:</strong> Awesome. Thank you guys for joining, and folks, definitely go check out the demo. And I think the tradition continues, because now we're moving on to the diffusion area and we have the awesome chance to have Dome here. We just saw this week a new release from Stability called Stable Cascade.</p><p>[01:17:09] <strong>Alex Volkov:</strong> And Dom, I reacted to Emad's tweet about this: hey Emad, do you want to come to ThursdAI? And he said that Dom (and was it Rodrigo, the other guy?) are the real heroes. So I want to welcome Dom to the stage. Dom, welcome. Feel free to unmute yourself and give a brief introduction. Let's talk about Stable Cascade.</p><p>[01:17:25] <strong>Dome:</strong> So yeah, my name's Dom. I joined Stability actually only a couple of months ago. 
And I'm currently enrolled in a degree program in Germany, which I'm finishing up. I met Pablo more than a year ago, and ever since then we've been working on generative models, mostly in vision, so the image modality, and slowly moving into video as well. Pretty early on we connected to Stability via LAION, and at some point they liked what we were doing and liked the progress of the paper we called Würstchen, which is German and means sausage.</p><p>[01:18:09] <strong>Dome:</strong> I can tell you more about that.</p><p>[01:18:10] <strong>Alex Volkov:</strong> Oh, that's what it means! Okay.</p><p>[01:18:13] <strong>Dome:</strong> Yeah, yeah. So then we joined the applied team, and we were able to work on the third version of it, which in the end was called Stable Cascade, just to make the name fit in more and not confuse people about where it comes from and what this third version is about.</p><p>[01:18:31] <strong>Dome:</strong> And yeah.</p><p>[01:18:34] <strong>Dome:</strong> That's bad.</p><p>[01:18:34] <strong>Alex Volkov:</strong> Awesome. So let's say hi to Pablo as well. Welcome, Pablo. Feel free to unmute yourself; a brief intro from you as well. And let's talk about what makes Cascade different from SDXL, or even the V2.</p><p>[01:18:45] <strong>Pablo:</strong> Hey, hi, Alex. A bit about myself: I'm a machine learning researcher. Before Stability I used to work at Disney, so I was able to bring a lot of interesting ideas from there. Then I joined Dom, and we have been working on very cool things since I met him.</p><p>[01:19:03] <strong>Pablo:</strong> And the latest is our new Stable Cascade.</p><p>[01:19:08] <strong>Alex Volkov:</strong> That's awesome. Let's talk about Stable Cascade. 
I've been able to test this out, and the things that blew me away were the speed, the inference speed, but also that the base model already has hands built in, and they're fine. You guys said you were working on Würstchen for a couple of iterations, and this became Stable Cascade?</p><p>[01:19:26] <strong>Alex Volkov:</strong> Talk to me about the history, and why it's so good and so fast.</p><p>[01:19:30] <strong>Dome:</strong> Okay. Yeah. So basically the biggest difference, and I think that's what it boils down to eventually, is the space, the dimension, where stuff is generated for the text-conditional part. Stable Diffusion XL has this thing called the VAE, which takes images and compresses them down to a smaller space.</p><p>[01:19:53] <strong>Dome:</strong> And the only reason to do that is that you work at a smaller resolution, which gives you faster training and faster inference. Imagine training or generating at a pixel resolution of 1024, so one megapixel. That will be a lot slower than running the same model at, what, 32 by 32, for example.</p><p>[01:20:15] <strong>Dome:</strong> So the idea is you still want high-quality, high-resolution images, but you don't want to generate in that very high-dimensional pixel space. So you try to find a way to compress even further. Up until now, people always used VAEs, VQGANs, normal autoencoders and so on, but those reach their limits very early on: you can get to a spatial compression of about eight.</p><p>[01:20:34] <strong>Dome:</strong> So Pablo had this incredible idea of using a 
diffusion model to increase that compression. Long story short, by using a diffusion model on top of a normal VAE, or you could also leave the VAE away and just start in pixel space, you can achieve much, much higher compression, because the diffusion model can reconstruct iteratively: first the low frequencies, the rough details, and later the high frequencies,</p><p>[01:21:04] <strong>Dome:</strong> all the fine details. So it has a lot more capacity to reconstruct an image, and with that it's possible to compress images a lot further. The version we have now achieves a compression of 42, and that makes a huge difference in training and inference time. That's probably what you saw, because then</p><p>[01:21:24] <strong>Dome:</strong> the big model, the 3.6 billion one, which is</p><p>[01:21:26] <strong>Dome:</strong> quite big for images. Stable Diffusion XL is 2.2 billion; we're not in large-language-model territory. So yeah, this makes it a lot faster. And then you have this diffusion decoder, which works at a higher resolution but needs a lot fewer steps, and combining these results in a very fast model.</p><p>[01:21:49] <strong>Alex Volkov:</strong> That's super cool. I want to switch back to Pablo real quick. So I'm looking at this graph of inference speed, but I also checked out some of the examples. One thing I noticed is the real-time rendering, basically, of how the model searches through the diffusion space, and the last step just kicks into super high resolution.</p><p>[01:22:09] <strong>Alex Volkov:</strong> Pablo, what can you tell us about exciting or maybe surprising results that you've seen, or people using it? Feel free to talk about your cool model a little bit more.</p><p>[01:22:18] <strong>Pablo:</strong> Yeah, I actually have been really surprised at how good this model could be. 
We weren't expecting it to be as good as it is. We started this more as an experimental idea: trying to achieve the same quality as existing models, but focusing on speed and performance.</p><p>[01:22:39] <strong>Pablo:</strong> But then somehow we ended up with a model that was very competitive. And I think this last step, as you mentioned, is the upsampling stage, which is the diffusion model that Dominic mentioned that can bring the image from a 24-by-24 latent to one megapixel.</p><p>[01:23:00] <strong>Pablo:</strong> And that's why you see this very big difference between the second-to-last and the last step.</p><p>[01:23:06] <strong>Alex Volkov:</strong> Yeah, the last step is poof, high quality. I love it.</p><p>[01:23:11] <strong>Dome:</strong> Yeah, we actually provide a previewer. When we work in this very highly compressed latent space, in order to be able [01:23:20] to see what the model is doing, we have a very tiny convolutional model that can preview what's going on. That's what you're seeing, which looks pretty blurry. And then yeah, the final step does that.</p><p>[01:23:33] <strong>Dome:</strong> And yeah, as for why the model can do that, we're also pretty surprised. The big</p><p>[01:23:41] <strong>Alex Volkov:</strong> Text is also very impressive, I think; let's not skip over that. The out-of-the-box text is so good compared to, let's say, Stable Diffusion 1.4, which when it released was bigger, right? I think it was like five gigabytes or something. This is just miles, miles better, and the text out of the box and hands out of the box are very impressive.</p><p>[01:23:59] <strong>Alex Volkov:</strong> Text is super cool. Very surprising. 
Yeah, go ahead, please.</p><p>[01:24:02] <strong>Pablo:</strong> The biggest difference compared to V2, which was our previous iteration of the model, was the size of the architecture and the quality of the data, which I think shows how important that is. And since our model is able to work in this very highly compressed space, it can learn much more efficiently if it has good data; it can learn these kinds of things much more efficiently.</p><p>[01:24:30] <strong>Pablo:</strong> Maybe it learns them faster than other models, which is why we're able to get this kind of result.</p><p>[01:24:39] <strong>Alex Volkov:</strong> Awesome. Thank you guys for coming up. I really wanted to make sure you get the recognition, because this is really, really cool. This is under the Stability membership, right? It's not fully open source, but folks are going to be able to use this model for their stuff and maybe keep training it.</p><p>[01:24:55] <strong>Alex Volkov:</strong> Does it support all of the fine-tuning and the LoRA ecosystem as well?</p><p>[01:24:59] <strong>Pablo:</strong> Yeah, one detail: it's not yet on the subscription. It's still research-only, but that will probably change in the following weeks. You asked about LoRAs and ControlNets: yeah, we</p><p>[01:25:13] <strong>Pablo:</strong> made sure to provide some example code for training LoRAs, ControlNets, and full fine-tunes in our repository. 
We also provide some pre-trained ControlNets for inpainting, for Canny edges, and for super-resolution, which is not the best super-resolution model out there, but it's interesting enough to share with the community. And we provided a tiny LoRA of Dom's dog, which is pretty, and</p><p>[01:25:44] <strong>Alex Volkov:</strong> Nice.</p><p>[01:25:45] <strong>Dome:</strong> yeah, I think that's it for now, that's</p><p>[01:25:48] <strong>Yi Tay:</strong> all the</p><p>[01:25:49] <strong>Alex Volkov:</strong> Awesome. Thank you for joining, and folks, definitely give Dom and Pablo a follow. Really great shout-out for building this and releasing it from Stability. It looks really good, and I'm sure the community will adopt it. I've already seen a bunch of AI artists in my kind of field</p><p>[01:26:02] <strong>Alex Volkov:</strong> getting very excited about the possibilities here. Thank you for your work, and thank you for coming to ThursdAI. Please feel free to stay, because we're going to cover a bunch of other stuff as well, super quick. Meanwhile, I just want to do a quick reset. It's been an hour and thirty-five minutes or so since we started.</p><p>[01:26:20] <strong>Alex Volkov:</strong> If you're just joining us, you're on the ThursdAI X space, which is a live recording for the ThursdAI podcast and newsletter. 
I'm your host,</p><p>[01:26:28] <strong>Alex Volkov:</strong> Alex Volkov. I'm joined by my co-hosts: Nisten is here on stage, Yamil Spokin, and we have Swyx here, who dropped off the stage but is on the microphone. And I will move towards a corner that I have, and then</p><p>[01:26:40] This week's Buzz</p><p>[01:26:40] <strong>Alex Volkov:</strong> I have a surprise for Swyx. I'm moving to my usual corner, called This Week's Buzz, where I talk about the stuff we have, or that I learn at Weights & Biases every week. If you're subscribed to the newsletter, you definitely already know this; I just learn as I go and talk about it.</p><p>[01:26:55] <strong>Alex Volkov:</strong> If you're not subscribed to the newsletter, why not? You'll be up to date with everything that happens in the world of AI. So definitely check out thursdai.news; that's the URL, https, thursdai.news. And this week's buzz is all about this new course that we released with Hamel Husain about putting models in production.</p><p>[01:27:13] <strong>Alex Volkov:</strong> I think I've spoken about this before. Weights & Biases has an academy; we release courses, and the courses are free for you, with a bunch of knowledge. The last one we talked about was with Jason Liu about Instructor. And now Hamel Husain has released a course about model management in production as well.</p><p>[01:27:29] <strong>Alex Volkov:</strong> And this is definitely a very illuminating one, including how to use Weights & Biases the way the best companies do: OpenAI does, and Microsoft and Meta, and hopefully we'll get Google at some point. It's definitely worth checking out and signing up for. 
This will be in the show notes as well, and I'll post the link here too.</p><p>[01:27:47] Interview with Swyx from Latent Space</p><p>[01:27:47] <strong>Alex Volkov:</strong> And now, actually, yeah, Swyx is back on stage, and here's my surprise. If you know Swyx's voice, you know he's a co-host of Latent Space together with Alessio, and we're now sitting in the Latent Space pod studio, which looks incredible. The surprise is, I don't remember you being on the other side of the mic, so this is a surprise interview with Alex and Swyx, but you're going to be a guest and not a host. I just wanted to hear about some of the stuff you guys are doing, how Latent Space is going, all these things.</p><p>[01:28:14] <strong>Alex Volkov:</strong> So this turns from ThursdAI into a ThursdAI deep-dive interview, just a brief</p><p>[01:28:18] <strong>Alex Volkov:</strong> one.</p><p>[01:28:19] <strong>Alex Volkov:</strong> I figured I'd use the opportunity to give you a surprise. This was not staged. Swyx told me he might not even be able to join, 'cause you just flew back from</p><p>[01:28:26] <strong>Swyx:</strong> Singapore. Singapore, yeah.</p><p>[01:28:27] <strong>Swyx:</strong> Yeah.</p><p>[01:28:28] <strong>Swyx:</strong> Cool, okay,</p><p>[01:28:29] <strong>Alex Volkov:</strong> So as,</p><p>[01:28:30] <strong>Swyx:</strong> I feel like we talk so much, and you've been a guest on our pod like five times, so</p><p>[01:28:35] <strong>Alex Volkov:</strong> and</p><p>[01:28:36] <strong>Alex Volkov:</strong> I would want to start with how you would introduce yourself to an audience that doesn't know you.</p><p>[01:28:41] <strong>Swyx:</strong> So I'm Swyx. I mostly work on developer tooling, and I'm mostly known as the editor and podcaster of Latent Space, which has done pretty well.</p><p>[01:28:51] <strong>Swyx:</strong> I think we're celebrating our first-year anniversary pretty soon. 
And the other half of my life, I'm working on Smol AI and the AI Engineer Conference, which we just announced for June 25th to 27th. Yeah.</p><p>[01:29:05] <strong>Alex Volkov:</strong> Yeah. You've had quite a long career in DX as well. I think you had a stint at</p><p>[01:29:09] <strong>Swyx:</strong> Netlify</p><p>[01:29:09] <strong>Swyx:</strong> Yeah, I was one of the earliest employees slash dev rels at Netlify. That's where a lot of people know me from; that's where I became quote-unquote famous in developer tooling, and in React specifically, because I did a lot of content on React and serverless, speaking and writing. And then I've been head of developer experience for Temporal and Airbyte, and also spent a year at AWS working on the same thing.</p><p>[01:29:34] <strong>Alex Volkov:</strong> Hmm. Awesome. Also from that side of your career, you worked with the Chroma guys as well.</p><p>[01:29:40] <strong>Alex Volkov:</strong> And Chroma</p><p>[01:29:41] <strong>Alex Volkov:</strong> just announced that they've been around for a year, and it looks like a huge number of companies use it; you probably had</p><p>[01:29:48] <strong>Alex Volkov:</strong> something to do with that. So shout out Jeff, and, I'm blanking on the</p><p>[01:29:53] <strong>Swyx:</strong> name, Anton. Yeah. So I consulted for them on their DevRel when they were doing their first hackathon, a year ago actually. And yeah, I</p><p>[01:30:03] <strong>Alex Volkov:</strong> think</p><p>[01:30:04] <strong>Swyx:</strong> it seems like they are the leaders in open-source vector databases. We did an interview with David Hsu, the founder of Retool, and Retool did a State of AI survey among their customers about what they're using.</p><p>[01:30:18] <strong>Swyx:</strong> And Chroma was up and to the right in terms of adoption and NPS score, and I think NPS is actually a very important metric to keep tracking. Yeah. 
Really, really cool. Glad to be involved with Chroma.</p><p>[01:30:30] <strong>Alex Volkov:</strong> You've also been prolific in writing; I know many people go to your blogs and the stuff you put out. How many publications in total are you publishing your content in right now?</p><p>[01:30:46] <strong>Alex Volkov:</strong> You have your own personal</p><p>[01:30:47] <strong>Swyx:</strong> one, Yeah, I have three blogs. Latent Space is currently the primary active blog; I have a personal one, and then I have a developer-tools advising one, because I do a bunch of angel investing and advising for people.</p><p>[01:31:01] <strong>Swyx:</strong> And I don't know, I think more people should blog! It helps you think through what you actually think, and share your knowledge with other people.</p><p>[01:31:10] <strong>Swyx:</strong> And actually the most valuable thing is the most embarrassing thing, which is when you get things wrong. People will come out and correct you, and you'll be embarrassed for a second, but then you'll remember the lesson forever.</p><p>[01:31:21] <strong>Alex Volkov:</strong> Can you give me an example of something you got wrong and people corrected you, and it improved your thinking?</p><p>[01:31:28] <strong>Swyx:</strong> improved thinking?</p><p>[01:31:31] <strong>Swyx:</strong> Yesterday, or coming into today, right? Because I do a monthly recap, where I think what ThursdAI does is [01:31:40] recap news every week, and other people, like NLW from The Breakdown, recap news every day. And I think the lower frequency granularity of a month means that I only get to do 12 of these a</p><p>[01:31:53] <strong>Alex Volkov:</strong> year.</p><p>[01:31:54] <strong>Swyx:</strong> And that forces me to think through, okay, what is really actually important when you step back and think about it. And for my January recap, January was a slow month, to be honest. 
Today was more news than January. So I was trying to recap January, and I was like, okay, nothing super interesting this month.</p><p>[01:32:11] <strong>Swyx:</strong> What, if we step back, is important for AI progress? And I listed a bunch of things, long inference and all that. One thing I specifically said was not interesting for state-of-the-art models was long context.</p><p>[01:32:26] <strong>Alex Volkov:</strong> Long context.</p><p>[01:32:28] <strong>Swyx:</strong> I said that yesterday. It's published; I sent it out to 35,000 people, including Satya Nadella, Drew Houston, and all the people who read the newsletter.</p><p>[01:32:36] <strong>Alex Volkov:</strong> Satya doesn't just read, he also participates, like he clicks on</p><p>[01:32:39] <strong>Swyx:</strong> links,</p><p>[01:32:39] <strong>Swyx:</strong> Yeah.</p><p>[01:32:40] <strong>Alex Volkov:</strong> There's active engagement from Satya with Latent Space.</p><p>[01:32:43] <strong>Swyx:</strong> So it's embarrassing, but it also forces me to think about, okay, how much do I really believe in million-token and ten-million-token context? And I now know, today I learned, that Nat Friedman strongly disagrees.</p><p>[01:32:58] <strong>Swyx:</strong> And that's good. That's useful to update on. And Google, of course. Basically, it's not about that specific point, because we can always debate the pros and cons of that, but the act of writing down what you believe and taking strong opinions, instead of saying that everything is awesome, instead of celebrating every little bit of progress as equally important: you have to rank them, and being wrong in your rankings gives you information to update your rankings. If you don't give yourself the chance to be wrong, then you don't really learn.</p><p>[01:33:36] <strong>Alex Volkov:</strong> You</p><p>[01:33:37] <strong>Alex Volkov:</strong> publish a bunch of stuff. 
Some of the stuff that you publish turns into more than just an article. You have essays, and the one essay I remember specifically, obviously, is the AI Engineer essay. Talk to me about how you approach writing something like that. Is it stuff that you saw?</p><p>[01:33:51] <strong>Alex Volkov:</strong> And as background for folks who aren't familiar with you and where you are: you're sitting in the middle of the scene you helped coin, in San Francisco, right? We're around SoMa, Mission, Hayes Valley, somewhere there, if I'm not confusing things. We're in this space called Newton, which I think you also plug on Latent Space, where tons of companies that we know from the Twittersphere are literally behind us here.</p><p>[01:34:15] <strong>Alex Volkov:</strong> There's Tab with Avi, and Julius with Rahul, and a bunch of other companies sitting right here building very cool things. And this is an example of one of those; I think it was very natural to put these kinds of hubs within the bigger bubble of San Francisco. And you, as far as I'm concerned, were very plugged into this even before AI Engineer, right?</p><p>[01:34:34] <strong>Alex Volkov:</strong> And potentially this is the reason why the AI Engineer conference had so many amazing speakers on stage; I think you told me back then that a lot of personal favors were pulled to get some folks to show up.</p><p>[01:34:48] <strong>Alex Volkov:</strong> As somebody who's an outsider from Denver, like I said, this is incredible to see, but it's also very hard to penetrate and understand what's going on and where the trends are; that's part of the reason for ThursdAI. So you're sitting in the middle of this, you have all these connections, and you said you're an angel investor as well. 
How does this shape your thinking about the AI engineer?</p><p>[01:35:02] <strong>Alex Volkov:</strong> Do all these people talk at the hackathons? How did you draw on that to create something so seminal that people now consider themselves AI</p><p>[01:35:11] <strong>Swyx:</strong> engineers. Okay. So there's two questions here.</p><p>[01:35:15] <strong>Swyx:</strong> If I can do RAG on your questions. Yeah, please. One: how do you write impactful perspectives, or come up with interesting ideas that will stick around? And two: how do you make sense of San Francisco, especially as an outsider? And I think people can hear in my voice that I'm not American.</p><p>[01:35:34] <strong>Swyx:</strong> I'm Singaporean, and the last seven years of my developer career I did not spend in San Francisco; I only moved here in April of last year. You don't have to be in SF to have a background in tech. Oh, the other thing I should offer as context is that I have been blogging for quite a bit.</p><p>[01:35:57] <strong>Swyx:</strong> I often say that you have to blog 50 times a year in order to get the one post a year that makes up the entire year, the one that people know you for. So this is my sort of fourth or fifth, quote-unquote, industry-defining blog post. I've done this for serverless, runtimes, cloud orchestration, and AWS, so I've done this before, and I knew the work that goes into writing something like this. Rise of the AI Engineer took two months. 
I had a few potential collaborators</p><p>[01:36:35] <strong>Swyx:</strong> who ultimately did not co-author but were heavily involved.</p><p>[01:36:43] <strong>Swyx:</strong> And I can talk about the writing of the post, but the main inspiration is trying to figure out which directions are important.</p><p>[01:36:48] <strong>Swyx:</strong> And it is not purely about coining a term, which I think is a very vanity metric, but about picking directions, identifying what is wrong about the zeitgeist. If you rewind to this time one year ago, people were very much focused on prompt engineering. People were worried about AI ending jobs for engineers, for software engineers.</p><p>[01:37:13] <strong>Swyx:</strong> And I think both have been proven wrong in terms of the scope of the prompt engineer. Now you no longer really hear about professional prompt engineers, because that's been replaced by the AI engineer, who can code. And I think the ability to code to wield AI makes you a thousand times more effective than people who use AI without the ability to code.</p><p>[01:37:37] <strong>Swyx:</strong> And identifying this core difference in ability, understanding that this stack is starting out pretty thin and small but is going to grow over time, and understanding that it is fundamentally very different from the ML engineer stack, is part of the mix that convinced me AI engineer would be a category to invest in, which is why I started the conference and then pivoted the newsletter and podcast.</p><p>[01:38:04] <strong>Alex Volkov:</strong> Yeah, so let's talk about that as well. Definitely the audience that ThursdAI draws is, at least in part, AI engineers, but also in part folks who train and finetune models. 
And I've noticed that AI engineering is almost like the gateway drug into the larger AI stuff, because at least the folks that I'm familiar with, the folks who are JS/TS devs, who did the Netlify stint, who did React, etc.,</p><p>[01:38:27] <strong>Alex Volkov:</strong> they started to build with these tools. The tools are significantly easier to get into than traditional ML. You just do some API calls, OpenAI exposes a bunch of stuff, and suddenly you're like, oh, okay, I've tapped into all this incredible power. I'm building intuitions about how to use this power.</p><p>[01:38:42] <strong>Alex Volkov:</strong> I'm building intuitions about how to put this power into production for my users. They give me some feedback; how do I do more of this? Am I limited to OpenAI, or can I go to open source and try some stuff like this? Maybe I can use Ollama, which, by the way, shout out to our friends at Ollama, who just released the Windows version.</p><p>[01:38:56] <strong>Alex Volkov:</strong> Maybe I can do this locally on device. Maybe I can do this on the edge, on Cloudflare, for example. All these new tools are popping up, and these people are growing from a very limited scope as API users into API users who also have intuitions: prompting is just one of those things, then embeddings and RAG and better RAG systems; we've seen some folks going there.</p><p>[01:39:14] <strong>Alex Volkov:</strong> Definitely the scope grows, as in every category: frontend was a very tiny scope, JavaScript, HTML, and the client, and suddenly it became full stack; you have frontend, ops, and all of these things. So the scope grows.</p><p>[01:39:30] <strong>Alex Volkov:</strong> Where do people learn about this new and upcoming thing?</p><p>[01:39:32] <strong>Alex Volkov:</strong> I think the conference is one such way. So we've talked about the conference. 
This is actually not your first time; I just remembered I interviewed you after the conference for a full hour, and we had a full conversation, though it wasn't all about Swyx. So how was the conference received, after the fact?</p><p>[01:39:46] <strong>Alex Volkov:</strong> How did it shape your direction in thinking about Latent Space and exposing San Francisco AI to the world? And let's take this to the next conference, where you want to take us. What happened with AI Engineer?</p><p>[01:39:59] <strong>Alex Volkov:</strong> I think I asked</p><p>[01:39:59] <strong>Swyx:</strong> three</p><p>[01:39:59] <strong>Swyx:</strong> or</p><p>[01:39:59] <strong>Swyx:</strong> four. [01:40:00] Yeah, I know.</p><p>[01:40:00] <strong>Alex Volkov:</strong> Break them down however you want.</p><p>[01:40:02] <strong>Swyx:</strong> So the conference was really good, but I would actually classify that as the end of a process rather than the start of one. It basically recaps</p><p>[01:40:10] <strong>Swyx:</strong> the work</p><p>[01:40:11] <strong>Swyx:</strong> that people have been doing in the industry over the past year.</p><p>[01:40:14] <strong>Swyx:</strong> And then I get to curate, pick, and invite people to present the best of their work and their thought, and I think that's a very privileged position. For me, the work begins after the conference, for the next thing: picking directions. So last year was a single-track conference; this year, for the World's Fair, we're doing nine</p><p>[01:40:36] <strong>Alex Volkov:</strong> When is that, just for the</p><p>[01:40:38] <strong>Swyx:</strong> June 25th to 27th. Yeah.</p><p>[01:40:40] <strong>Alex Volkov:</strong> Make sure you sign up.</p><p>[01:40:41] <strong>Alex Volkov:</strong> It's gonna</p><p>[01:40:42] <strong>Swyx:</strong> yeah, yeah. 
We're going four times bigger this year, 2,000 people, and last year 17,000 people tuned in on the livestream, and hopefully we'll have more impact this year. But yeah, I think for me, actually, it's a really good way to think about, okay, who do people want to hear from, who actually did impactful work that I will be proud to showcase 10 years from now.</p><p>[01:41:04] <strong>Swyx:</strong> I'm always thinking about the test of time. And I was very inspired by NeurIPS, where they actually had a Test of Time award. And I was like,</p><p>[01:41:10] <strong>Alex Volkov:</strong> man, that's... Did Jeremy Howard get it or something, if I remember</p><p>[01:41:13] <strong>Alex Volkov:</strong> correctly?</p><p>[01:41:13] <strong>Alex Volkov:</strong> No, Jeff Dean. Jeff Dean.</p><p>[01:41:14] <strong>Swyx:</strong> Jeff Dean. Yeah.</p><p>[01:41:16] <strong>Alex Volkov:</strong> Shoutout Jeff Dean for today, by the way.</p><p>[01:41:17] <strong>Swyx:</strong> Yeah, yeah, for Word2Vec. I always say, some people are speculating what the Test of Time is for next year, and it was like, Ilya Sutskever, if he ever shows his face</p><p>[01:41:25] <strong>Swyx:</strong> again.</p><p>[01:41:26] <strong>Swyx:</strong> And then I was like, but I know what's gonna win the Test of Time for 2027. Which is Attention Is All You Need.</p><p>[01:41:32] <strong>Swyx:</strong> Yeah, yeah. But basically it's a flex for any conference to say, okay, the Test of Time award goes to something that was presented here 10 years ago. And NeurIPS has been going on for 37 years.</p><p>[01:41:46] <strong>Alex Volkov:</strong> what of the AI engineer presentations would stand the test of</p><p>[01:41:50] <strong>Swyx:</strong> question. I think the audience has voted. It looks like Pydantic and Jason Liu's Instructor are very, very popular. And I think he's just fundamentally correct. For every model, there are like six versions. 
You have the base model when you train it, then you have the chat-tuned model.</p><p>[01:42:07] <strong>Swyx:</strong> And now I think it's going to be table stakes that every model should have structured output, or function calling, as they call it. And it's even useful if you're not actually using it to generate code or call code, because it's very good for chain of thought. And so Max Woolf, minimaxir on Twitter and on Hacker News, actually wrote a really influential post that I'm going to try to showcase.</p><p>[01:42:27] <strong>Swyx:</strong> Yeah, for me as a conference curator, that's what I do. Read a lot of stuff, and then I try to feature the best of things and also try to make bets that are important. I do think as content creators, we're at the end of the food chain and not the value chain.</p><p>[01:42:45] <strong>Swyx:</strong> And it's always important to understand that even stuff that we don't pick is very important and substantial.</p><p>[01:42:53] <strong>Swyx:</strong> You're picking for an audience to use at work, which is a small subset of the total progress that humanity can make.</p><p>[01:43:01] <strong>Alex Volkov:</strong> Interesting, interesting. Tell</p><p>[01:43:02] <strong>Alex Volkov:</strong> me</p><p>[01:43:03] <strong>Swyx:</strong> I just tell people, if you want to engage in philosophical conversation, you go to Lex Fridman or Dwarkesh Patel.</p><p>[01:43:11] <strong>Swyx:</strong> And then if you want to talk about things that you can use in open source, you go to ThursdAI. And then we have less of an open source focus. 
We are, we're very much focused on enterprise and things you can use at work, to code and to build products and startups with.</p><p>[01:43:26] <strong>Swyx:</strong> And so, whatever you do, as long as you have a clear focus for the audience that you serve and you know how to reach them, then they will love you, because you're making literally the thing for them. And you don't have to appeal to everyone. And I think that's fine.</p><p>[01:43:40] <strong>Alex Volkov:</strong> Switching gears from the conference.</p><p>[01:43:43] <strong>Alex Volkov:</strong> How did the podcast come about? You said you're coming up on the one-year anniversary</p><p>[01:43:46] <strong>Alex Volkov:</strong> of</p><p>[01:43:46] <strong>Alex Volkov:</strong> the</p><p>[01:43:46] <strong>Alex Volkov:</strong> podcast. And you also said you moved here in April. I did not know this.</p><p>[01:43:49] <strong>Alex Volkov:</strong> I</p><p>[01:43:49] <strong>Alex Volkov:</strong> thought you were an SF native. So how did the podcast come about? How did you and Alessio meet? Let's talk about</p><p>[01:43:54] <strong>Swyx:</strong> later. Yeah. And we should talk about doing well in San Francisco, and the tax and immigration side, I think, which I think is important and something I'm</p><p>[01:44:01] <strong>Swyx:</strong> going through but have also done well at. So the podcast, specifically, was because I started the newsletter writing opinion pieces on just AI stuff. 
It was actually inspired by Stable Diffusion at the time, which was around August 2022.</p><p>[01:44:16] <strong>Alex Volkov:</strong> My life changed after that open sourcing.</p><p>[01:44:19] <strong>Swyx:</strong> Yeah, and then you really run out of opinions very quickly,</p><p>[01:44:22] <strong>Alex Volkov:</strong> and</p><p>[01:44:24] <strong>Swyx:</strong> and then you're like, oh, I need to generate unique or new tokens.</p><p>[01:44:29] <strong>Swyx:</strong> The only way to do that is to get source material by interviewing people and putting a microphone in front of them. When you put microphones in front of people, they get more chatty. And sometimes they break news. For us, the big breakthrough was George Hotz, when he talked about GPT-4 being a mixture of experts.</p><p>[01:44:44] <strong>Swyx:</strong> Yeah, that was a surprise, but he likes to do that sort of thing, just drop random alpha.</p><p>[01:44:49] <strong>Alex Volkov:</strong> He dropped it, and then you guys posted it, and then I had no idea what Mixture of Experts was, like most of us, and then it turned out to be true, and now we</p><p>[01:44:59] <strong>Swyx:</strong> saw it. 
Now Gemini is</p><p>[01:44:59] <strong>Alex Volkov:</strong> Gemini 1.5 is a Mixture of Experts, which is quite incredible. So that was a big thing. Was it natural for you to start turning on the microphone? Did you have to go through an</p><p>[01:45:08] <strong>Alex Volkov:</strong> adjustment period?</p><p>[01:45:09] <strong>Swyx:</strong> Another thing that people don't know is that I started four podcasts before.</p><p>[01:45:13] <strong>Swyx:</strong> So I'm not new to the conversation game, and I'm not new to Audacity and editing and publishing, but I think having taken a few runs at it helps to prep you for when something actually has audience fit.</p><p>[01:45:26] <strong>Swyx:</strong> Because all the others were very small. There were maybe a few hundred listeners each time. This one went to number 10 on the U.S. tech charts.</p><p>[01:45:33] <strong>Alex Volkov:</strong> Yes, I saw that. That was incredible. Is that the top, top,</p><p>[01:45:36] <strong>Swyx:</strong> I think that's the highest it's been. Recently it was as high as 16 over the holidays, and then now it's dropped back down again. It's very, very volatile.</p><p>[01:45:44] <strong>Alex Volkov:</strong> But it's very clear that you're in the top 50 tech podcasts in the world, even though AI is fairly niche. And the topics you discuss are fairly technical.</p><p>[01:45:52] <strong>Alex Volkov:</strong> When you talk with folks, it's not a general-appeal audience, like Swisher has, or the All-In guys, the VCs, right? It's very technical. So very impressive that you broke the top 50 charts, and it wasn't by chance, you bring great guests. 
Like, how do you... is it the same approach that you have for the conference that you use for guests as well?</p><p>[01:46:13] <strong>Alex Volkov:</strong> Or are you now getting requests to come on the podcast from some other</p><p>[01:46:15] <strong>Swyx:</strong> We get requests, but usually, for the people that draw the audiences, you have to go reach out to them. Obviously, that's how it is.</p><p>[01:46:24] <strong>Alex Volkov:</strong> I heard one such person now does not work at OpenAI, so he can</p><p>[01:46:28] <strong>Alex Volkov:</strong> potentially, potentially join</p><p>[01:46:29] <strong>Alex Volkov:</strong> podcasts as</p><p>[01:46:30] <strong>Swyx:</strong> yeah, he's a listener and he has said that he'll come on at some point.</p><p>[01:46:35] <strong>Alex Volkov:</strong> We're talking about Badmephisto, for folks in the</p><p>[01:46:37] <strong>Swyx:</strong> Badmephisto. So yeah,</p><p>[01:46:41] <strong>Swyx:</strong> I don't think it's actually just guests. I think it's also about focusing on topics, and then being engaged enough with the material that you get to ask questions that no one else asks.</p><p>[01:46:51] <strong>Swyx:</strong> Because, for example, if you have a VC asking questions, they often ask about market and business. But if you're an engineer, you're really asking about APIs and limitations and trade-offs, stuff like that. Things that you don't really get into unless you're actually evaluating whether to use something at work.</p><p>[01:47:09] <strong>Swyx:</strong> And I think that's important. And also, for a lot of guests, we try to be the first podcast that somebody has done. We were the first podcast for Phind, for Cursor, for a bunch of these guys. So they're not experienced speakers. Some of them are good speakers.</p><p>[01:47:25] <strong>Swyx:</strong> But they're not experienced at the whole telling-their-story thing and all that. 
So you have to help them. But it doesn't matter, because I think you just try to serve your audience at the end of the day, right? What do people want to know? Ask those questions, and then get out of their way and let them talk.</p><p>[01:47:38] <strong>Swyx:</strong> I think the other thing that we do, the reason I say it's not just guests, is because we do special episodes where we have breaking news. We haven't done one in a while because, I don't know, I think you have taken that spot of the breaking news guy. We</p><p>[01:47:50] <strong>Alex Volkov:</strong> got</p><p>[01:47:51] <strong>Alex Volkov:</strong> the, we got three breaking news while you were here. This is kind of like that as</p><p>[01:47:54] <strong>Swyx:</strong> that as well. And then we also do event recaps. We did Dev Day, we did NeurIPS, and that is a really big editing-process work that I really like to do, where you're basically performing the work of summarization and curation, instead of doing long-form interviews, and people really like that.</p><p>[01:48:13] <strong>Alex Volkov:</strong> The summarization part, with multiple folks. I think I participated in one, and you did one for Dev Day and NeurIPS as well. So, [01:48:20] now that we're coming up on an annual kind of thing for Latent Space, what's next for Latent Space?</p><p>[01:48:24] <strong>Swyx:</strong> More conversations? That's the weird thing, we think that we've done as well as a technical podcast can do in the general podcasting space.</p><p>[01:48:36] <strong>Swyx:</strong> The ultimate number of people who listen to podcasts is still very low compared to the general audience that might be interested in the same kind of content. That's why I branch out into a conference, where you produce talks, very highly polished, and all that. 
The way to grow a podcast is to not just podcast, it's to actually write. My essays still get a lot more readers than the podcast gets listeners, more than growing on YouTube or whatever, and that's fine.</p><p>[01:49:05] <strong>Swyx:</strong> I think ultimately, podcasting is a mix of entertainment and education, right? You have to be attached to some kind of story, some kind of personality, and then learn something along the way that might be useful at work. So I think, personally, growing as a podcaster is about just growing your influence or understanding of an industry in general, and the ability to serve an audience.</p><p>[01:49:29] <strong>Swyx:</strong> And then maybe opening up as hosts and as industry experts as we gain knowledge and understanding. So that people come to us not just for access to guests, but for access to us as well. When we did the end-of-year listener survey, people actually requested for us to have more mic time.</p><p>[01:49:47] <strong>Swyx:</strong> Alessio and I did our first just-the-two-of-us conversation in a year, and that was really good.</p><p>[01:49:52] <strong>Alex Volkov:</strong> Wow. So are you planning more of those?</p><p>[01:49:54] <strong>Swyx:</strong> Yeah, so we used to do these one-on-one episodes where we do introductions to a topic, like we did Datasets 101, Benchmarks 101, and Transformer Math 101, and then we also did RLHF 201.</p><p>[01:50:07] <strong>Swyx:</strong> And so we want to do more of those, where it's inspired by Acquired FM. 
And the work for this kind of episode is so different from a normal chat, because in a normal chat you just sit down, maybe you prep a bit of questions, you research the other person's background, and then you just have a nice conversation, and that's it.</p><p>[01:50:23] <strong>Swyx:</strong> Whereas for a content-heavy episode like that one, you do</p><p>[01:50:27] <strong>Swyx:</strong> a</p><p>[01:50:27] <strong>Swyx:</strong> week of research. And you compile a whole bunch of stuff, and you simmer it in your mind, and then you try to rehash it and introduce it for an audience who hasn't done that amount of work. Yeah, that is a lot more work up front, but obviously it's very high value, and also, I like to call it evergreen.</p><p>[01:50:43] <strong>Swyx:</strong> Evergreen content, meaning you want to build up something that will still be useful and relevant in a year.</p><p>[01:50:48] <strong>Alex Volkov:</strong> Yeah. So let me just take a personal position here with Latent Space.</p><p>[01:50:53] <strong>Alex Volkov:</strong> I've been a guest host on Latent Space a couple of times, in special episodes as well. And now this studio is super cool, like a home away from home. Being able to come here, to the spaces, to Alessio, to tap into the AI scene in San Francisco. And I've learned a bunch from just the way you run things.</p><p>[01:51:11] <strong>Alex Volkov:</strong> Latent Space, for folks who are listening, is not only just a podcast. If you're subscribing on just Spotify or Apple Podcasts, you're missing a big part of it, which is the newsletter, which has a bunch of links and show notes and the folks that you talk</p><p>[01:51:23] <strong>Swyx:</strong> about.</p><p>[01:51:23] <strong>Swyx:</strong> There's one more part. 
Discord.</p><p>[01:51:26] <strong>Alex Volkov:</strong> Oh, there's also Discord.</p><p>[01:51:27] <strong>Alex Volkov:</strong> You do paper readings as well, right? There's a whole community that you're building.</p><p>[01:51:30] <strong>Swyx:</strong> The Discord is surprisingly good. For the zero effort that I put into it, people just show up, and then they ask really very good questions, they drop things that I don't know, and then I learn from the Discord, and then I talk about it later. But yeah, the Discord has a lot of alpha.</p><p>[01:51:47] <strong>Swyx:</strong> And it's surprising, because I have this newsletter, this bot, that summarizes all the top AI Discords, right? Obviously the top ones are, like, Eleuther, TheBloke, what else?</p><p>[01:51:55] <strong>Swyx:</strong> Yeah, Midjourney, yeah, but that's not very technical. That's mostly just prompting.</p><p>[01:52:00] <strong>Swyx:</strong> Midjourney is 8 million members. That's something like 13 percent of total Discord membership. Ha ha ha ha ha. That's freaking crazy. But anyway, the Discord is the community attachment to the podcast and the newsletter. And then it's people interacting with each other, some people getting jobs, some people getting investments, I have founders coming in and VCs there also funding them.</p><p>[01:52:22] <strong>Swyx:</strong> And I really think that every piece of content is a minimum viable community, right? People gather, they're chatting in the Twitter space comments right now. They're chatting in your newsletter comment section. But if you let people gather together live, whether it's online or in person... we also have in-person meetups.</p><p>[01:52:40] <strong>Swyx:</strong> I just had one in Singapore. We have one in San Francisco, I think, monthly.</p><p>[01:52:45] <strong>Swyx:</strong> I hope to have it monthly. 
And then obviously, once a year, you get people together for a really big conference where they put out their best work. So I call this community annealing, right? You have cold community, like podcasts are cold.</p><p>[01:52:58] <strong>Swyx:</strong> Newsletters are cold, because they're asynchronous. There's not somebody there, you don't expect to respond to the other person. Twitter spaces are warm, because they're live and there's some chance of live feedback. Discords are live too, but they're hot when everyone is on the same call and you're looking into each other's eyes,</p><p>[01:53:16] <strong>Swyx:</strong> and you're conversing, and you're having a real bond and relationship there. And so communities need this whole range of warm and hot and cold. And I try to build that for Latent Space.</p><p>[01:53:28] <strong>Alex Volkov:</strong> So for folks who are just listening on podcasts, you're missing several parts of the space. The newsletter is definitely worth checking out. Latent.space is actually a URL.</p><p>[01:53:38] <strong>Swyx:</strong> And that was donated by a reader. Not donated. Sold to us for cheap.</p><p>[01:53:42] <strong>Alex Volkov:</strong> You can consider this a donation. But also the Discord part. Speaking of work, I think we need to wrap up, because we're past two hours and I want to let you go back to work. I also need to edit this and send it out, and I also want to check out the stuff that we did. Any last parting thoughts here?</p><p>[01:53:56] <strong>Alex Volkov:</strong> Maybe let's touch briefly on how to succeed in SF, or is that a bigger conversation for a later</p><p>[01:54:02] <strong>Swyx:</strong> Oh yeah, yeah, yeah. Oh man. This is such an interesting topic, especially for people who are not in SF, right?</p><p>[01:54:06] <strong>Swyx:</strong> Yeah. I think SF is a group of humans and not a place, and they are mostly available on Twitter. Yeah. 
But then they often gather in San Francisco, and yes, when you meet them in person... There are some people that are not famous online, or not fully, consistently candid online, that when you talk to them in person, you're like, oh, okay, I fully understand you now, and everything that you've done and everything that you're going to do, I understand where you're coming</p><p>[01:54:33] <strong>Swyx:</strong> from.</p><p>[01:54:34] <strong>Swyx:</strong> And to me, that is obviously very high value, and that's why I moved here. But you don't have to go there directly, right? One of my mentors, and the last one that I want to talk about career-wise, is Andrew Chen, who basically blogged his way into being a general partner at Andreessen Horowitz.</p><p>[01:54:49] <strong>Swyx:</strong> He runs one of their top three funds, the consumer fund. And he consistently says, hey, just put out your best work, learn in public, tweet a lot, and instead of going to all these parties... there's always a party every week in San Francisco.</p><p>[01:55:03] <strong>Alex Volkov:</strong> Every day, multiple ones a day sometimes, yeah.</p><p>[01:55:06] <strong>Swyx:</strong> There was one Thursday last year with 10 AI meetups in San Francisco.</p><p>[01:55:09] <strong>Alex Volkov:</strong> So</p><p>[01:55:10] <strong>Swyx:</strong> You can go through the motions of networking, but you still end up with a smaller network than you would if you stayed at home and you just wrote a lot, or you thought a lot, or you did quality work. And so you don't have to be in San Francisco to do that. You can just keep doing that online.</p><p>[01:55:27] <strong>Swyx:</strong> And then take advantage of a big conference or something to come into San Francisco and actually meet people in person. And that's totally fine. I don't intend to stay in San Francisco forever, right? 
Once I know enough people, I can just come here once a quarter and people will still think that I'm in San Francisco.</p><p>[01:55:41] <strong>Swyx:</strong> And that's fine.</p><p>[01:55:41] <strong>Alex Volkov:</strong> I get this question quite a lot. I've been here maybe four or five times in the past six months, and I get this question, do you live here?</p><p>[01:55:48] <strong>Swyx:</strong> Yeah. I think people are just, like, borders... I'm a border disrespecter, and I hope more people do that. But do come into San Francisco every now and then, maybe for a big conference that's happening June 25th to 27th.</p><p>[01:56:02] <strong>Swyx:</strong> But otherwise, do great work online and people will notice it and find you and chat with you. And the in-person component doesn't matter so much as plugging into the mentality and the community online.</p><p>[01:56:12] <strong>Alex Volkov:</strong> Yeah. Swyx, it's been a surprising interview. I didn't plan on this.</p><p>[01:56:15] <strong>Alex Volkov:</strong> I just thought, we're here, I haven't heard you in a while, the anniversary of Latent Space is coming up. A huge kudos for this effort. Like, huge kudos. A big, big thank you from me, because a lot of the stuff that you did, you and Alessio, pulled me through. I still get a bunch of listeners for ThursdAI</p><p>[01:56:30] <strong>Alex Volkov:</strong> from the Latent Space work on Substack. And so a huge thanks from me, because you kind of shaped what I'm doing as well. The newsletter and podcast combo that I force myself to do every [01:56:40] week, this was based on the Substack stuff from you as well. And I really appreciate your friendship as well.</p><p>[01:56:45] <strong>Alex Volkov:</strong> So thank you for coming on ThursdAI. Thank you for hosting us in Latent Space. 
And with that, I think I'll move on to the last piece of what we have on ThursdAI, folks, which is a recap of everything we've talked about. I'll just briefly run through the recap, and then I'll let you go on with your day. Let me just start with the music, obviously, because how else would this work?</p><p>[01:57:02] <strong>Alex Volkov:</strong> With that, I just want to wish you a great Thursday. Thank you for joining us from week to week. I want to thank the co-hosts that I had on stage. Thank you, Nisten. Thank you, Yam. Thank you, LDJ. Far El was here. Alignment was here. Thank you. A huge thank you to Swyx, Alessio, and the Latent Space folks for hosting me here.</p><p>[01:57:19] <strong>Alex Volkov:</strong> A shout out to a bunch of friends in Silicon Valley who I'm going to meet. And with that, we'll see you next week. I'm going to go and try to somehow summarize this all in the newsletter and podcast for you. And we'll see you folks next week. From San Francisco, this has been Alex Volkov. Cheers, everyone.</p><p>[01:57:34] <strong>Alex Volkov:</strong> Not this one. Bye bye.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-15-2024-openai-changes</link><guid isPermaLink="false">substack:post:141714989</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 16 Feb 2024 02:09:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141714989/14636144616b8a79db3ce03f1f9e6f2b.mp3" length="84682673" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>7057</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141714989/8ffa3ede48aed247ae29e35e98c601aa.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Feb 8 - Google Gemini Ultra is here, Qwen 1.5 with Junyang and deep dive into ColBERT, RAGatouille and DSPy with Connor Shorten and Benjamin Clavie]]></title><description><![CDATA[<p>Hihi, this is Alex, from Weights & Biases, coming to you live, from Yosemite! Well, actually I’m writing these words from a fake virtual yosemite that appears above my kitchen counter as I’m not a Vision Pro user and I will force myself to work inside this thing and tell you if it’s worth it. I will also be on the lookout on anything AI related in this new spatial computing paradigm, like <a target="_blank" href="https://x.com/blizaine/status/1755740068017234114?s=20">THIS</a> for example! </p><p>But back to rfeality for a second, we had quite the show today! 
We had the pleasure of having <a target="_blank" href="https://twitter.com/JustinLin610/status/1754538215959335100">Junyang Justin Lin</a>, a dev lead at Alibaba, join us to talk about Qwen 1.5 and QwenVL, and then we did a deep dive into quite a few acronyms I’ve been seeing on my timeline lately, namely DSPy, ColBERT and (the funniest one) RAGatouille, and we had a chat with Connor from Weaviate and Benjamin, the author of RAGatouille, about what it all means! Really, really cool show today; hope you don’t only read the newsletter but also listen on <a target="_blank" href="https://open.spotify.com/show/2J3lqMPD0BUI0bF9KJYKc1?si=33f4ab5556204b85&#38;nd=1&#38;dlsi=a12aeec533df45da">Spotify</a>,<a target="_blank" href="https://podcasts.apple.com/us/podcast/thursdai-the-top-ai-news-from-the-past-week/id1698613329?i=1000643008526"> Apple</a> or right here on Substack. </p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Alibaba releases a BUNCH of new Qwen 1.5 models, including a tiny .5B one (<a target="_blank" href="https://twitter.com/JustinLin610/status/1754538215959335100">X announcement</a>)</p><p>* Abacus fine-tunes Smaug, top of the HF leaderboard, based on Qwen 72B (<a target="_blank" href="https://x.com/bindureddy/status/1754665925834690907?s=20">X</a>)</p><p>* LMSYS adds more open source models, sponsored by Together (<a target="_blank" href="https://twitter.com/lmsysorg/status/1755361624012333239">X</a>)</p><p>* Jina Embeddings fine-tuned for code</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google rebrands Bard to Gemini and launches Gemini Ultra (<a target="_blank" href="https://gemini.google.com/chat">Gemini</a>)</p><p>* OpenAI adds image metadata (<a target="_blank" href="https://twitter.com/OpenAI/status/1754930271970005161">Announcement</a>)</p><p>* OpenAI keys are now restricted per key (<a target="_blank" href="https://twitter.com/OpenAIDevs/status/1755275367500386753">Announcement</a>)</p><p>* <strong>Vision & 
Video</strong></p><p>* Bria - RMBG 1.4 - Open source background removal that runs in your browser (<a target="_blank" href="https://twitter.com/bria_ai_/status/1754846894675673097">X</a>, <a target="_blank" href="https://huggingface.co/spaces/briaai/BRIA-RMBG-1.4">DEMO</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* MetaVoice, a new Apache 2.0 licensed TTS (<a target="_blank" href="https://x.com/metavoiceio/status/1754983953193218193?s=20">Announcement</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Microsoft added DALL-E editing with "Designer" (<a target="_blank" href="https://twitter.com/itsPaulAi/status/1755273709211025539">X thread</a>)</p><p>* Stability AI releases an update to SVD: video 1.1 launches with a web UI, much nicer videos</p><p>* <strong>Deep Dive with </strong><a target="_blank" href="https://twitter.com/bclavie/status/1745219720540831856"><strong>Benjamin Clavie</strong></a><strong> and </strong><a target="_blank" href="https://twitter.com/CShorten30"><strong>Connor Shorten</strong></a><strong> show notes:</strong></p><p>* Benjamin's announcement of RAGatouille (<a target="_blank" href="https://twitter.com/bclavie/status/1742950315278672040">X</a>)</p><p>* Connor's chat with <a target="_blank" href="https://twitter.com/lateinteraction/status/1752059559852871765">Omar Khattab</a> (author of DSPy and ColBERT) - <a target="_blank" href="https://www.youtube.com/watch?v=CDung1LnLbY&#38;list=PLTL2JUbrY6tW-KOQfOek8dtUmPgGQj3F0&#38;index=3">Weaviate Podcast</a></p><p>* Very helpful intro to ColBERT + RAGatouille - <a target="_blank" href="https://www.youtube.com/watch?v=CDung1LnLbY&#38;list=PLTL2JUbrY6tW-KOQfOek8dtUmPgGQj3F0&#38;index=3">Notion</a></p><p>Open Source LLMs </p><p>Alibaba releases Qwen 1.5 - ranges from .5B to 72B (<a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen1.5-72B-Chat">DEMO</a>)</p><p>With 6 sizes, including 2 novel ones, from as little as a .5B parameter model, to an interesting 4B, all the way to a 
whopping 72B, Alibaba open sources additional Qwen checkpoints. We had the honor to have friend of the pod <a target="_blank" href="https://twitter.com/JustinLin610">Junyang Justin Lin</a> again, and he talked to us about how these sizes were selected, that even though this model beats Mistral Medium on some benchmarks, it remains to be seen how well it performs on human evaluations, and shared a bunch of details about open sourcing this.</p><p>The models were released with all the latest and greatest quantizations, significantly improved context length (32K) and support for both Ollama and LM Studio (which I helped make happen, and I'm very happy with the way the ThursdAI community is growing and connecting!)</p><p> We also had a chat about QwenVL Plus and QwenVL Max, their API-only versions of their best open source vision-enabled models, and had the awesome Piotr Skalski from Roboflow on stage to chat with Junyang about those models! </p><p>To me, a success of ThursdAI is when the authors of things we talk about come on the show, and this is Junyang's second appearance, which he joined at midnight at the start of the Chinese New Year, so it's greatly appreciated, and definitely give him a listen! </p><p>Abacus Smaug climbs to the top of the Hugging Face leaderboard </p><p>Junyang also mentioned that Smaug is now at the top of the leaderboards. Coming from Abacus, this is a finetune of the previous Qwen-72B, not even this new one. The first model to achieve an average score of 80, this is an impressive appearance from Abacus; though they haven't released any new data, they said they are planning to! </p><p>They also said that they are planning to finetune Miqu, which we covered last time, the leak from Mistral that was <a target="_blank" href="https://twitter.com/arthurmensch/status/1752737462663684344">acknowledged by Arthur Mensch</a>, the CEO of Mistral.</p><p>The techniques that Abacus used to finetune Smaug will be released in an upcoming paper! 
</p><p>Big CO LLMs + APIs</p><p>Welcome Gemini Ultra (bye bye Bard) </p><p>Bard is no longer, get ready to meet Gemini. It's really funny, because we keep getting confusing naming from huge companies like Google and Microsoft. Just a week ago, Bard with Gemini Pro shot up the LMSYS charts, after the regular Gemini Pro API results were not as close, and now we are supposed to forget that Bard even existed? 🤔 </p><p>Anyhow, here we are, big G's answer to GPT-4, exactly 10 months, 3 weeks, 4 days and 8 hours later, but who's counting? </p><p>So what do we actually get? A $20/mo advanced tier for Gemini Advanced (which will have Ultra 1.0), the naming confusion continues. We get a longer context (how much?) plus iOS and Android apps (though I couldn't find it on iOS, maybe it wasn't yet rolled out).</p><p>Gemini now also replaces Google Assistant for those with Androids who opt in (MKBHD was somewhat impressed but not super impressed), but Google is leaning into their advantage, including home support! </p><p>* Looks like Gemini is ONLY optimized for English as well </p><p>We had quite the conversation on stage from folks who upgraded and started using it, including noticing that Gemini is a better role player, and less bland, but also that it doesn't yet support uploading documents besides images, and that the context window is very limited, some said 8K and some 32K, but definitely on the lower side. </p><p>Also from Google: a llama.cpp wrapper called localllm (<a target="_blank" href="https://cloud.google.com/blog/products/application-development/new-localllm-lets-you-develop-gen-ai-apps-locally-without-gpus">Blog</a>)</p><p>OpenAI watermarks DALL-E images and adds per-key API limits (finally) (<a target="_blank" href="https://help.openai.com/en/articles/8912793-c2pa-in-dall-e-3">Blog</a>)</p><p>OpenAI is using something called C2PA for pictures made by DALL-E 3, whether you're chatting with ChatGPT or using their API. It's a way to show that DALL-E 3 actually created those images. 
But it's just for images right now, not for text or voice. Adding this info can make the files up to 32% bigger, but it doesn't affect the quality. The tags tell you whether the source was DALL-E 3, ChatGPT, or the API by including special signatures. Just a heads up, though: this C2PA thing isn't perfect. The metadata can get wiped, either on purpose or by mistake.</p><p>They also released an update to the developer experience that allows you to track usage and also restrict usage per API key! Very needed and helpful! </p><p>This week's Buzz (What I learned with WandB this week)</p><p>First part of the live series with the Growth ML team was live and AWESOME! </p><p>Vision</p><p>BRIA - Open-Source background removal (non-commercial)</p><p><a target="_blank" href="https://twitter.com/bria_ai_">BRIA AI@bria_ai_</a><a target="_blank" href="https://twitter.com/bria_ai_/status/1754846894675673097">Feb 6, 2024</a></p><p>📷 Introducing Open-Source Background Removal by @BriaAI 📷 Now live on @huggingface, RMBG v1.4 excels in separating foreground from background across diverse categories, surpassing current open models.  See demo [https://t.co/DDwncjkYqi] #BriaAI #OpenSource #AI @briaai https://t.co/BlhjMMNWxa</p><p>Voice</p><p>MetaVoice (<a target="_blank" href="https://huggingface.co/metavoiceio/metavoice-1B-v0.1">hub</a>)</p><p>1.2B parameter model. Trained on 100K hours of data. Supports zero-shot voice cloning. Short & long-form synthesis. Emotional speech. Best part: Apache 2.0 licensed. 🔥</p><p>Powered by a simple yet robust architecture: Encodec (Multi-Band Diffusion) and GPT + Encoder Transformer LM, with DeepFilterNet to clear up MBD artefacts.</p><p>That's it for us this week. This time I bring you both the news segment AND the deep dive in one conversation; hope it's not super long. See you here next ThursdAI! 
👏</p><p>Full Transcript: </p><p>[00:00:00] Intro and housekeeping</p><p>[00:00:00] ​</p><p>[00:00:00] <strong>Alex Volkov:</strong> You're on ThursdAI, and I think it's time for us to get started with the recording and the introduction.</p><p>[00:00:26] <strong>Alex Volkov:</strong> Happy, happy Thursday everyone! Today is February 8th, 2024. This is the second calendar year that ThursdAI is happening in, so I don't know if I need to mention the year or not, but we're well on our way into 2024 and you're here on ThursdAI. ThursdAI is the space, the newsletter, and the podcast to keep you up to date with all of the very interesting things that are happening in the very fast moving world of AI.</p><p>[00:00:58] <strong>Alex Volkov:</strong> Hopefully by now all of you already have ThursdAI in your podcast app, wherever you get your podcasts: Spotify, recently YouTube as well, which is weird. But with this introduction, I will just say hello myself, basically. Hey everyone, my name is Alex Volkov. I'm an AI evangelist with Weights & Biases.</p><p>[00:01:15] <strong>Alex Volkov:</strong> Weights & Biases is the reason why this comes to life for you, and there's going to be a little segment about Weights & Biases in the middle here as well. I'm joined on stage, often and pretty much every week, by great friends, experts in their fields, as we talk about everything AI related this week; especially today we're going to have some interesting things.</p><p>[00:01:34] <strong>Alex Volkov:</strong> Those of you who come back week after week, thank you, and we love that you're part of the community; it's great to see how many people just return. Those of you who are new: we're here every week, and the community doesn't stop after we finish the space. There's a bunch of spaces. I think our friend AlignmentLab had a space that went on for the full week, I think.</p><p>[00:01:55] <strong>Alex Volkov:</strong> I don't know if he ever slept. 
That's maybe why he's not here on stage. But we're here every week for the two hours, to give you updates in the first hour and definitely some very interesting deep dives that have been happening for the past few weeks, I want to say. So I just want to shout out some friends of ours that were recently featured in the deep dives.</p><p>[00:02:16] <strong>Alex Volkov:</strong> We've talked with Maxime Labonne, who trained the Beagle series and then also gave a deep dive with us about model merging. That was really fun. And on the last deep dive, we talked with the Lilac folks, who are building an open source tool that lets you peer into huge datasets, imagine millions of rows, and they chunk and cluster them. And we've talked about the importance of datasets in the creation of LLMs, or large language models.</p><p>[00:02:46] <strong>Alex Volkov:</strong> And they've taken the huge datasets of the folks who usually come up on ThursdAI. Teknium from Nous Research just released their Hermes dataset, for example, and the folks at Lilac talked to us about how that can be visualized and how you can see which parts it's comprised of.</p><p>[00:03:03] <strong>Alex Volkov:</strong> It was quite an interesting conversation about how to approach the training and fine tuning area. And we haven't often talked about dataset curation and creation, so that conversation was a very nice one. So we have deep dives. I will say that last weekend I also did an interview that's probably going to come up as a separate episode.</p><p>[00:03:24] <strong>Alex Volkov:</strong> I interviewed Sasha Zhadan from Moscow, and this was a first for me. And I just want to highlight where this weird thing takes me, because that's not ThursdAI, and that's not about the news. That was just literally about AI stuff. 
So this guy from Moscow, and this will be dropping on the ThursdAI podcast soon.</p><p>[00:03:42] <strong>Alex Volkov:</strong> This guy from Moscow built a bot that auto-swipes for him on Tinder. And that bot started with GPT instruct models, then moved to ChatGPT, etc., and then moved to GPT-4. And he talks about how this bot kept improving with the improvement of AI. And then he auto-swiped a wife, basically. And then this took over the Russian side of X.</p><p>[00:04:08] <strong>Alex Volkov:</strong> I don't know if you guys are on the Russian side of X, but I definitely noticed that's all everybody could talk about. This guy previously also did some shenanigans with OpenAI stuff. And so it was a very interesting conversation, unlike anything that I did previously on ThursdAI.</p><p>[00:04:21] <strong>Alex Volkov:</strong> And definitely that's coming more as a human interest story than anything else. But it's very interesting. And his fiancée also joined, and we talked about the morality of this as well, and it was really fun. So if that kind of new type of content also interests you, definitely check it out.</p><p>[00:04:37] <strong>Alex Volkov:</strong> That's probably not going to end up on X.</p><p>[00:04:40] <strong>Alex Volkov:</strong> And I think with this, it's time to get started. The usual way we get started here is I just run through everything that we have, just so you know what we're going to talk about.</p><p>[00:04:52] <strong>Alex Volkov:</strong> And then we're going to start segment by segment. So that's</p><p>[00:04:54] TL;DR and recap of the conversation</p><p>[00:04:54] <strong>Alex Volkov:</strong> Hey everyone, this is a recap of everything we talked about on ThursdAI for February 8th, 2024, and we had a bunch of breaking new stuff today, specifically around the fact that Google finally gave us something. But I'm gonna do this recap properly, based on the categories. So let's go. 
So in the category of open source LLMs, we've talked about Alibaba releasing a bunch of new Qwen models, specifically under the numbering 1.5.</p><p>[00:05:33] <strong>Alex Volkov:</strong> And we had the great pleasure again to talk with Justin, Junyang Lin, from the Qwen team, the guy who's a tech lead there and pushes for open source. And he came up and talked about why this is a 1.5 model, not a 2 model. He also talked about the fact that they released a tiny</p><p>[00:05:51] <strong>Alex Volkov:</strong> 0.5 billion one. This is like a very tiny large language model. I think it's really funny to say a tiny large language model, but this is the case. And he talked about multiple releases for Qwen. We also had friend of the pod Piotr Skalski from Roboflow, who's like a vision expert who comes up from time to time, and the author of, I forget the name of the library, I will remember this and put this in the show notes as well.</p><p>[00:06:12] <strong>Alex Volkov:</strong> He came up and he had a bunch of plays with the vision part of the Qwen ecosystem, and we've talked about QwenVL Plus and QwenVL Max with Justin as well, and we've talked about their potential for open sourcing these models. They also released a 72 billion parameter model that's now at the top of the Hugging Face leaderboard, which is super cool.</p><p>[00:06:34] <strong>Alex Volkov:</strong> So definitely a great conversation, and I love it when the authors of the things that we talk about come out and talk about them on ThursdAI. We then smoothly moved to the next topic, where Abacus, the company Abacus AI, has a finetune that's now top of the Hugging Face leaderboard, and that's based on Qwen-72B, and not even the new one, the previous one, so</p><p>[00:06:54] <strong>Alex Volkov:</strong> 1.0, and that's now the top model on the Hugging Face leaderboard, and it has an average score of over 80. 
And I think it's the first open source model to do so. They haven't fully released the process of what they used in order to make this much better across different leaderboards, but they have mentioned that they're going to train this on top of Miqu over Mixtral.</p><p>[00:07:17] <strong>Alex Volkov:</strong> And it's very interesting. They're also building some other stuff at Abacus as well. Very interesting. And then we moved to talk about LMSYS Arena. LMSYS Arena is the place that we send you to see which models users prefer, versus just the benchmarks and evaluations on Hugging Face.</p><p>[00:07:35] <strong>Alex Volkov:</strong> LMSYS Arena added a bunch of open source models, so shout out OpenChat again. They added another Hermes, the finetune that Teknium did for Hermes on top of Mixtral, and they also added a bunch of Qwen versions as well. LMSYS adds open source models, so you can continuously see which models are better and don't have to judge for yourself, because sometimes it's not very easy.</p><p>[00:07:55] <strong>Alex Volkov:</strong> We also covered Jina embeddings that are fine tuned for code, from the company Jina AI, whose representative Bo Wang is a friend of the pod; we talked about their embeddings for code. Bo didn't show up this time, but maybe next time as well. Then we moved to big companies' LLMs and APIs, and definitely the conversation turned interesting, where multiple folks here on stage paid the new $20 AI tax, let's say, [00:08:20] for the rebranded Bard, now called Gemini, and the launch of Gemini Ultra.</p><p>[00:08:25] <strong>Alex Volkov:</strong> And we've talked about how long we've waited for Google to actually give us something like this. And now we're getting Gemini Ultra, and Bard is no more; Bard is essentially dead as a brand, and now we're getting the Gemini brand. 
So if you used to go to Bard, now you go to Gemini, but the brain behind it also improved.</p><p>[00:08:41] <strong>Alex Volkov:</strong> So you get Gemini Pro by default for free, I think, and Gemini Ultra is going to cost you 20 bucks a month. It's free for the next two months, so you can sign up for a trial, and then you'll get Gemini Ultra. And you'll get it not only in the web interface; you also get it in the iOS and Android apps. And if you're on Android, it also integrates with the Android Assistant.</p><p>[00:09:00] <strong>Alex Volkov:</strong> That's pretty cool. It has a context length of not very much, I think we said 8K or 16K or so, and some folks contested this in the comments, so we're still figuring out the context length; it looks like the context length is restricted in the UI, less so on the API side, and Gemini Ultra did not release an API yet.</p><p>[00:09:17] <strong>Alex Volkov:</strong> So we've talked about Gemini Ultra and different things there. We also covered that OpenAI adds image metadata to all DALL-E generations, whether through the UI or through the API. This image metadata can be stripped, so it's not a watermark per se, but it's definitely helpful. And OpenAI also gives us a little bit of a developer experience thing where you can restrict</p><p>[00:09:36] <strong>Alex Volkov:</strong> different possibilities per API key. So if one key gets stolen, you can lock only that one, or you can restrict it to only a specific use as well. In the vision and video category, we've talked about the new model for background removal called RMBG from Bria AI. It's not a fully commercial license, but you can play with it now.</p><p>[00:09:57] <strong>Alex Volkov:</strong> There's a demo I'm going to add to the show notes. And it also runs fully on your client via the efforts of friend of the pod Xenova from Transformers.js. 
And it's pretty cool to have a model that removes backgrounds with like two clicks, with no servers. And in the voice and audio category, we talked about MetaVoice, a new</p><p>[00:10:14] <strong>Alex Volkov:</strong> Apache 2 licensed text-to-speech model, not from Meta, even though it's called MetaVoice, which is funny. It's pretty decent and has zero-shot voice cloning, which means that you can provide a piece of your voice and fairly quickly get your voice speaking back to you, generated. And we also talked about breaking news from NVIDIA AI, something called Nemo Canary 1B, which is an ASR model, an Automatic Speech Recognition model, that's now top of the leaderboards on Hugging Face, and it beats Whisper on everything, specifically for four languages.</p><p>[00:10:48] <strong>Alex Volkov:</strong> It's trained on 85,000 hours of annotated audio, and it's a very fast Conformer encoder as well. We barely covered this, but Microsoft added DALL-E editing with Designer. So if you remember, Microsoft also did a rebrand; it used to be called Bing Chat, and now it's called Copilot.</p><p>[00:11:07] <strong>Alex Volkov:</strong> And that Copilot now adds capabilities that don't exist in other places, like ChatGPT with DALL-E. So Microsoft's DALL-E now is involving the Designer thing, and they have cool things where you can edit images on the fly: you can click on different segmented objects from your generated image and say, hey, redo this in a different style.</p><p>[00:11:27] <strong>Alex Volkov:</strong> The video for this is super cool. I'm going to add this in the show notes. And it's very interesting to see that Microsoft with their Copilots is moving away from where ChatGPT's capabilities exist. 
We also briefly mentioned and glanced through this, but Stability AI released an update to Stable Video Diffusion, including a web UI that you can use now; it's not only a model, it's a web UI as well, and that web UI is pretty cool. If you didn't get access to it, I'll link it in the show notes; I think it's now possible to register. Much nicer videos, and obviously it's in the open source</p><p>[00:11:59] <strong>Alex Volkov:</strong> as much as possible. So super cool. But the web UI shows you other people's video attempts, and you can actually use their prompts to create videos of your own. They have some controls. It's very nice. Then I think we talked a little bit at the end there about Vision Pro and my experience with it as it comes to AI.</p><p>[00:12:15] <strong>Alex Volkov:</strong> We didn't dive into Vision Pro, even though this is my new toy in life, and I'm very happy to participate in the renaissance of spatial computing; we covered the intersection of AI and spatial computing. And I think the very interesting part of today's ThursdAI was thanks to two new guests, Benjamin Clavié and Connor from Weaviate, and we've talked about DSPy and ColBERT, and RAGatouille, which is a library to use ColBERT embeddings.</p><p>[00:12:43] <strong>Alex Volkov:</strong> And we talked about what they mean, and this was a great learning kind of experience for me. And if you see these concepts on your timeline and you have no idea what we talked about: I basically played the role of, hey, I'm the village dummy, let's say, I'm gonna re-ask the question about what this means and why we should use it.</p><p>[00:13:01] <strong>Alex Volkov:</strong> And I think this is our show today, folks. This is the quick summary. 
If I missed anything super big and important, please let me know.</p><p>[00:13:08] Open source LLMs and AI news</p><p>[00:13:08] <strong>Alex Volkov:</strong> But otherwise, I think we'll start with open source. All right, welcome to the open source corner. The tradition of ThursdAI is: something releases, I go in the comments and say, hey, I'm going to talk about this on ThursdAI, do you want to join? And sometimes people say yes. And this is how we met Justin, or Junyang, here on stage. Junyang is the dev lead for the Qwen team, so welcome, Junyang.</p><p>[00:13:50] <strong>Alex Volkov:</strong> It's very late where you are, so I really appreciate your time here. Please feel free to unmute and introduce yourself again. Some folks already know you, but in case some new folks are listening to us, feel free to introduce yourself, and then let's talk about the stuff that you released.</p><p>[00:14:06] New Qwen 1.5 models from Alibaba</p><p>[00:14:06] <strong>Junyang Lin:</strong> Yeah, thanks Alex. Nice to be at ThursdAI, it's a very great program for us to talk about AI. I am Junyang and you can call me Justin. I'm working in the team for the LLM and LMM, and we are now working on the new LLM, Qwen 1.5, and we are also upgrading our vision language model, QwenVL, to QwenVL Plus and Max.</p><p>[00:14:33] <strong>Junyang Lin:</strong> Plus and Max are not open sourced yet, but we have demos, so you can try them in our Hugging Face organization; you can find our demos and try Plus and Max. And Max is the best one, and I am very confident with the Max demo. And about our language model: actually this week we are open sourcing Qwen</p><p>[00:14:58] <strong>Junyang Lin:</strong> 1.5. Maybe previously you have noticed the Qwen2 code inside Hugging Face Transformers. 
Yeah, we are moving to new code for you to use our Qwen models, because in the past few months I have been interviewing our users and they found some problems with using our code, the original Qwen code, so we're moving a step forward.</p><p>[00:15:23] <strong>Junyang Lin:</strong> So this is why we had the Qwen2 code, but for the models themselves, in our judgment we are still at 1.5, not 2 yet. We're still training the real Qwen 2, so this time we have Qwen 1.5. For Qwen 1.5 we are actually fixing a lot of problems, because there are some models, like 7 billion and 14 billion, that a lot of people are using, but they are actually quite old.</p><p>[00:15:50] <strong>Junyang Lin:</strong> They were released months ago and they have some problems; Qwen 14 billion actually only supports around 2 to 4K context length, which is far from enough for a lot of users. So this time we have upgraded all models to support 32,000 tokens. And for the sizes, we have released more sizes.</p><p>[00:16:15] <strong>Junyang Lin:</strong> Previously, we had 1.8, which was the smallest one. But this time we have 0.5, only 0.5. I used to think this one is just for experimental usage, but there are some users on Twitter who found 0.5 can still be used to do something, so if you have any comments on [00:16:40] 0.5 you can share the comments with me. And we also have 4 billion, which is between</p><p>[00:16:46] <strong>Junyang Lin:</strong> 1.8 and 7 billion. The reason why we have 4 billion is that when we first released 1.8 billion, it was actually popular because people would like to deploy the small model to some devices like cell phones, but they found just 1.8 is not good enough for the applications.</p><p>[00:17:07] <strong>Junyang Lin:</strong> So they want something just smaller than 7 billion, but much better than 1.8. So we have 4 billion. Yeah, we have a wide range of sizes. 
These are for you to choose. And,</p><p>[00:17:19] <strong>Alex Volkov:</strong> six, six models overall, Junyang?</p><p>[00:17:22] <strong>Junyang Lin:</strong> Yeah. Six</p><p>[00:17:23] <strong>Alex Volkov:</strong> Six sizes overall, but definitely more models than this, because you also released, I think for the first time, quantized versions as well, correct?</p><p>[00:17:32] <strong>Junyang Lin:</strong> No, previously we have released GPTQ,</p><p>[00:17:35] <strong>Alex Volkov:</strong> Oh yeah.</p><p>[00:17:35] <strong>Junyang Lin:</strong> our convention, but this time I also have AWQ and also GGUF. Maybe GGUF is the new one; admittedly, previously I didn't know too much about AWQ and GGUF. This time I tried and everything is okay, so I just released the AWQ and GGUF.</p><p>[00:17:52] <strong>Junyang Lin:</strong> And GGUF is the new thing for me, but it is quite popular in the community, like LM Studio, like you introduced to me. And I found a lot of people using GGUF use it in Ollama, so I collaborated with Ollama. You can now just run one line of code, like ollama run qwen, so you can use the Qwen models with Ollama, and you can also use them in LM Studio.</p><p>[00:18:15] <strong>Alex Volkov:</strong> I just wanna</p><p>[00:18:16] <strong>Junyang Lin:</strong> No</p><p>[00:18:16] <strong>Alex Volkov:</strong> just a tiny pause here, because first of all, to highlight the importance of this community: you guys are releasing a bunch of great models in open source, and it's a great testament to the community, because you're listening to what folks have been saying and how they're reacting to your models, and as part of ThursdAI I was able to introduce you to LM Studio and you guys worked together.</p><p>[00:18:37] <strong>Alex Volkov:</strong> And now, the second the model drops, not only are you guys already providing us quantized versions in 4-bit and GGUF. 
It's also very easy to start using, and just a shout out to you guys for thinking about this, because a lot of models, when they release, they just release a weights file and then it's up to the community to figure out how to run them, when to run them, what the problems are.</p><p>[00:18:57] <strong>Alex Volkov:</strong> And this was the issue with Qwen before: it was harder to use, and maybe only in Hugging Face demos. And now you guys released it with support for the most popular open source runners out there. So Ollama, if folks haven't used Ollama by now, definitely: there's a CLI, just like, Ollama, install this.</p><p>[00:19:14] <strong>Alex Volkov:</strong> And LM Studio, which we've talked about a bunch, so shout out LM Studio, shout out JAGS. I was very happy to introduce both of you, so it's been great. And I've used the small model, the baby model, as well. How was the reception from the community? What have you seen people do? Have there been any fine tunes already that you're excited about?</p><p>[00:19:33] <strong>Junyang Lin:</strong> Yeah, this is a very great comment for helping us to improve. Previously, like us, a lot of people just drop open source models and just let the community use them. But this may be not right, because we can do more for the community; maybe we can do things more easily than the community users.</p><p>[00:19:56] <strong>Junyang Lin:</strong> So this is why we are changing our style. We try to modify our code, try to adapt to the usages to make our models more popular. And recently I found people gradually fine tuning our models. Previously the fine tune users were inside mainland China, because they have chances to talk to us, so they know more about our models and they can finally fine tune them.</p><p>[00:20:24] <strong>Junyang Lin:</strong> But with the support of LLaMA-Factory and especially Axolotl, Wing Lian helped me a lot. 
Teknium just introduced Wing Lian to me, and I found some people are using Axolotl to do it. I don't know if I pronounce his name right; he's one of the users of Qwen, and he previously got the usage of our models and then quickly fine tuned a lot of models; its name is Q-U-Y</p><p>[00:20:54] <strong>Alex Volkov:</strong> Oh, StableQuan. Yeah, I think I know who you're talking about. StableQuan, also from Nous Research.</p><p>[00:20:59] <strong>Junyang Lin:</strong> Yeah, StableQuan. I'm quite familiar with him, I've talked to him a lot, and he just directly used our models, very quickly fine tuning a series of models, and I find the quality quite good.</p><p>[00:21:12] <strong>Junyang Lin:</strong> So this is quite encouraging for me, because you find people are interested in your models, they can find their way in at very fast speed. And I recently found Smaug by Abacus AI, but I got no chance to talk to them, because I don't know who actually built the model, but I found Smaug 72 billion is built on Qwen 72 billion</p><p>[00:21:37] <strong>Alex Volkov:</strong> Oh, really?</p><p>[00:21:39] <strong>Junyang Lin:</strong> On the open leaderboard.</p><p>[00:21:40] <strong>Alex Volkov:</strong> Smaug is the next thing we're going to talk about, so you're taking us exactly there. I think, Nisten, you have a question just before, and then we're going to move to talk about Smaug. Just on the community part, the names you mentioned: you mentioned StableQuan, definitely friend of the pod.</p><p>[00:21:52] <strong>Alex Volkov:</strong> You mentioned Teknium introduced you to Wing Lian, the guy from Axolotl. All of this happens in the ThursdAI community, and I love it. I'll just say that I see Robert in the audience here. 
Smaug is from Abacus AI, and I think Robert has some connection to Bindu, so Robert, if you can introduce Junyang to Bindu, that would be great, and then we'll figure out how they use the 72B model.</p><p>[00:22:12] <strong>Alex Volkov:</strong> The 72B model that you guys released is one of the more performant ones. I think it's even outperforming Mistral Medium, is that correct?</p><p>[00:22:21] <strong>Junyang Lin:</strong> Yeah, this version, Qwen 1.5 72 billion, that's for the chat model. For the base model, it is actually quite similar; some users have found that, I admit that. But for the chat models, we have some improvements, because this time we not only SFT the model, but we also use DPO.</p><p>[00:22:40] <strong>Junyang Lin:</strong> We have some progress in DPO, so we've reached like 8.67 on MT-Bench. This is a relatively high score, and we just did simple DPO and just improved the model. And we also sent our model to Chatbot Arena at LMSYS, supported by Together AI, because we have some friends at Together AI; they just built the API for us, and we are now in Chatbot Arena, so you can try it there to see how it really performs.</p><p>[00:23:18] <strong>Junyang Lin:</strong> Does it really perform like the score on MT-Bench? I'm not quite sure, because I'm also dependent on the users' feedback.</p><p>[00:23:27] <strong>Alex Volkov:</strong> It depends on human preference. So first of all, Justin, you're taking over my job now, because you're also reporting on the stuff that I wanted to mention, but definitely a shout out for getting added to LMSYS. That's not super easy; not every model out there on the Hugging Face leaderboard gets added there.</p><p>[00:23:41] <strong>Alex Volkov:</strong> So definitely super cool. Yeah, please go ahead. 
If you have anything else to</p><p>[00:23:46] <strong>Junyang Lin:</strong> As you have mentioned Mistral Medium: I'm not sure which one is better, Mistral Medium or Qwen 72 billion. From some reviews they might be similar, with Qwen 1.5 72 billion similar to Miqu. Some of my friends, like Blade, just tested in EQ-Bench, and the scores are very similar, but I need some more reviews to let me really know how the 72 billion model really performs, whether it is better or worse than Miqu.</p><p>[00:24:20] <strong>Junyang Lin:</strong> They're all okay for me. I just want real reviews. Yeah,</p><p>[00:24:23] <strong>Alex Volkov:</strong> Yeah,</p><p>[00:24:24] <strong>Junyang Lin:</strong> it.</p><p>[00:24:25] Discussion about Qwen VL with Nisten and Piotr</p><p>[00:24:25] <strong>Alex Volkov:</strong> Awesome. Junyang, thank you for joining us. And Nisten, go ahead; you have a few questions, I think, about the interesting things about VL.</p><p>[00:24:34] <strong>Nisten Tahiraj:</strong> Yeah, so one thing is that, the 0.5Bs and the small models: I know Xenova in the audience was specifically looking for one around that size, or like a 0.3, to run on WebGPU, because then even at 32-bit, which older browsers will still support, it will still only take two gigs. So that would run anywhere.</p><p>[00:24:58] <strong>Nisten Tahiraj:</strong> But my question. [00:25:00] So shout out to Xenova for all that; I know he's going to do something with it. But my question for you was more about the Max and the larger QwenVL chats: are those also based off of the 72B, and did you find more improvements in going with a larger LLM? And I also wanted to know your opinion on LLaVA.</p><p>[00:25:27] <strong>Nisten Tahiraj:</strong> The LLaVA 1.6 method, where they mosaic together four CLIP models on top to get a larger image, even though it slows down inference, because now it outputs like 2000 embeddings. 
So yeah, what do you think of LLaVA, and is there more stuff to share about the Qwen</p><p>[00:25:47] <strong>Junyang Lin:</strong> VL Max. Yeah, for Plus and Max, sorry, we're not ready to open source them.</p><p>[00:25:57] <strong>Junyang Lin:</strong> I cannot decide these things. Yeah, actually it's built on larger language models, much larger than the Plus, and you can guess whether it is 72 billion; it is not that important. And we have found that the scaling of the language model is really important for the understanding of the VL models.</p><p>[00:26:18] <strong>Junyang Lin:</strong> We have tested it on the MMMU benchmark, and we have found that the Max model is much more competitive and performs much better than QwenVL Plus. Previously many people thought that QwenVL Plus was strong enough, but we found that the Max has much better reasoning capabilities: it can understand things like reasoning games, like poker and things like that, some complex things that people can understand through the vision information, it can somehow understand.</p><p>[00:26:52] <strong>Junyang Lin:</strong> I think the performance might be a bit lower, approaching Gemini Ultra or GPT-4V, for the QwenVL Max. We were just gathering some reviews. I'm not quite sure, but</p><p>[00:27:05] <strong>Alex Volkov:</strong> From the review perspective, I want to say hi to Piotr, our friend here on stage from Roboflow. Piotr is one of the vision experts here on stage. Piotr, welcome. Feel free to introduce yourself briefly, but I definitely know that you got excited about some of the QwenVL Plus stuff, so definitely feel free to share some of your insights here.</p><p>[00:27:30] <strong>Piotr Skalski:</strong> Okay, yeah. First of all, awesome to meet somebody from the Qwen team. Yeah.</p><p>[00:27:36] <strong>Piotr Skalski:</strong> So yeah, I'm from Roboflow, like you said, and I'm responsible there for computer vision and growth. 
So it's somewhere in between being an ML engineer and doing marketing, something like this.</p><p>[00:27:49] <strong>Piotr Skalski:</strong> And yeah, I was experimenting with Qwen-VL Plus and Max last week. Super impressed, in my opinion. I know that you tried to be humble, maybe, but in my opinion, at least on the things that I test, it performs the best compared to other models.</p><p>[00:28:09] <strong>Junyang Lin:</strong> Thank you very much. Thanks for the appreciation.</p><p>[00:28:14] <strong>Piotr Skalski:</strong> Yeah. And especially the fact, so the biggest game changer for me, and I know that there were models that were capable of that before, is the fact that you can ground those predictions and you can, for example, point to a specific element on the image. So it's not only that you can ask questions and get answers and do OCR, but you can straight up do zero-shot detection if you would like.</p><p>[00:28:40] <strong>Piotr Skalski:</strong> Yeah. Which is awesome. And that's something that none of the other popular models can do to that extent, at least on the things that I tested.</p><p>[00:28:55] <strong>Piotr Skalski:</strong> My question is, do you plan to open source it? Because it's awesome that you can try it out through the API, and I highly appreciate the fact that you created the HF space and you can go there and try it.</p><p>[00:29:07] <strong>Piotr Skalski:</strong> But is there a chance that you will open source it, even with a restrictive license?
A permissive license is not necessary.</p><p>[00:29:16] <strong>Junyang Lin:</strong> Yeah, personally, I would like to open source some, but I cannot decide these things. But I think there's a chance. I'm still promoting these things inside the corp, but I cannot say too many things about this stuff. But we will try, because we have found out that we ourselves can also build a very good LMM.</p><p>[00:29:37] <strong>Junyang Lin:</strong> I think the gap between us and the big corps in LMMs is very small. And we have found that our techniques, our training, are quite effective. So maybe one day we'll share it with the community, but for now it is still APIs and demos, and I will try to think about these things.</p><p>[00:29:59] <strong>Junyang Lin:</strong> And also the question about the comparison between us and LLaVA. I have just tried LLaVA 1.6, not quite frequently, I just tried it. I think it's a very good model and it has very good performance in the benchmark results, but I think the limitation of these other open-source models may be that they still lack sufficient pre-training. As Skalski just said, Qwen can do OCR, and you can find that Qwen's reasoning capability is quite strong, because we have done a lot of pre-training work on it.</p><p>[00:30:39] <strong>Junyang Lin:</strong> We have done a lot of data engineering on pre-training, because we have the capability of handling different resolutions and different aspect ratios, so we can use the curated OCR data and put it in the pre-training. And when the vision-language model can understand a lot of textual, like linguistic, information inside the images, it may do something like, as we said, reasoning, and you will find that really powerful, very impressive, things like that.</p><p>[00:31:13] <strong>Junyang Lin:</strong> Yeah, I think the gap between other models and us, or also Gemini Ultra and GPT-4V, is maybe still the lack of large-scale data for training.
Yeah, this is my opinion.</p><p>[00:31:27] <strong>Alex Volkov:</strong> We're waiting for more data, but we're also waiting for you guys too. I just want to thank you for being the champion for open source from within the organization, and I really appreciate all your releases as well. I think Piotr and Nisten, like everybody here on stage, definitely feel that, and thank you for coming and talking about this.</p><p>[00:31:45] <strong>Alex Volkov:</strong> Justin, feel free to stick around, because the next thing we're gonna talk about, you already mentioned, which is Smaug 72B, which is the top of the leaderboard. And I just read through the thread from Bindu Reddy from Abacus AI, and it looks like they didn't even use 1.5. I think they used the previous Qwen.</p><p>[00:32:02] <strong>Junyang Lin:</strong> Yeah, they used the previous Qwen 72B. If they are really based on the base language model, there might not be a lot of differences, because the 1.5 base language model at 72B is actually only slightly better than the original 72B base language model. Yeah.</p><p>[00:32:22] <strong>Alex Volkov:</strong> For the base ones. And very interesting what they</p><p>[00:32:24] <strong>Junyang Lin:</strong> The base one.</p><p>[00:32:25] <strong>Alex Volkov:</strong> So they don't share any techniques, but they promised to open source their techniques. They're saying, our next goal will be to publish these techniques as a research paper and apply them to some of the best Mistral models, including Miqu.</p><p>[00:32:37] <strong>Alex Volkov:</strong> So I got confused. I thought that they had already fine-tuned Miqu, but no, they just fine-tuned on top of Qwen.
And now the top Hugging Face leaderboard model is a fine-tune of Qwen, which is also super cool.</p><p>[00:32:50] <strong>Junyang Lin:</strong> Yeah, I'm very proud of it.</p><p>[00:32:52] <strong>Alex Volkov:</strong> Yeah, congrats.</p><p>[00:32:53] <strong>Junyang Lin:</strong> They are using our model to be the top model. I'm also really expecting their technical report, to see how they reached the top of the benchmark. But I think it is not that kind of difficult, because you have a lot of ways to improve your performance on the benchmark, so we'll still see how it really performs in real scenarios, especially for their chat models, yeah.</p><p>[00:33:18] <strong>Alex Volkov:</strong> Yeah, that's true, [00:33:20] that's often the case. But I just want to shout out that the world is changing super fast. We're definitely watching and monitoring the Hugging Face leaderboard. And performing better than Mistral Medium is impressive. And this looks, at least on the MMLU, this is 77. I think they said they broke the average score of 80. This is the first model that broke the average score of 80 on the open-source leaderboard on Hugging Face, which is super cool, based on Qwen as well, and definitely worth it.</p><p>[00:33:46] <strong>Alex Volkov:</strong> I'm gonna add this link to the show notes, and hopefully we'll find a way to connect you guys with Bindu's team there at Abacus, to see how else this can be improved, and whether or not these techniques can be put on smaller models as well. I think in the open source, that's the last thing.</p><p>[00:34:00] <strong>Junyang Lin:</strong> Expecting the chat. Yeah, I'm really expecting to chat with them. Yeah, continue.</p><p>[00:34:05] <strong>Alex Volkov:</strong> So definitely hoping that some of our friends can connect these awesome teams so they can learn from each other, which I think is the benefit of speaking in public and putting things in open source.
Now, moving on, the last thing that you definitely mentioned is the update from LMSys, which is that quite a few of our friends of the pod are now also part of the Chatbot Arena.</p><p>[00:34:24] <strong>Alex Volkov:</strong> They just announced this yesterday. They've added three of your versions, right? They added 1.5 72B, 1.5 7B, and 1.5 4B, and they also added OpenChat. So shout out to the folks from OpenChat and Alignment Lab AI and some other friends of ours who shipped OpenChat's latest release, and they also added the Nous Hermes fine-tune.</p><p>[00:34:47] <strong>Alex Volkov:</strong> So if you guys remember, we've talked about the Nous fine-tune of Mixtral, which improved on the mixture-of-experts model from Mistral a little bit, based on DPO datasets. So now that's also in the LMSys arena, and it's now powered by Together Compute, which I have no affiliation with besides the fact that they're awesome.</p><p>[00:35:04] <strong>Alex Volkov:</strong> They're sponsoring a bunch of stuff. And we did a hackathon together. Together is great, like you can easily fine-tune stuff on their platform, but now they're also sponsoring the arena, at least to some extent, which is great, because we get more models and the arena keeps going. And if you guys remember, or you probably use it, the LMSys arena is another great way for us to feel what human preference in models is.</p><p>[00:35:27] <strong>Alex Volkov:</strong> And for many of these models, that's more important than actual performance on evaluations, on leaderboards, et cetera. So definitely a great update from LMSys as well.
And I think that, I'm gonna ask my folks here on stage, but Nisten, Far El, if there's anything else in open source that's super interesting this week; I think that's mostly it.</p><p>[00:35:44] <strong>Alex Volkov:</strong> We can talk about Gemini.</p><p>[00:35:48] <strong>Nisten Tahiraj:</strong> There was a dataset, which I think is pretty huge, of HackerNoon that they released. And oh, there was one more thing: HuggingFace made a GPT store.</p><p>[00:35:58] <strong>Alex Volkov:</strong> Oh,</p><p>[00:35:59] <strong>Nisten Tahiraj:</strong> They made their own GPT store. Yes. I think that's a big,</p><p>[00:36:03] <strong>Alex Volkov:</strong> I want to hear about this, for sure. I haven't used it yet, but I invite the Hugging Face folks that are listening to this to come and tell us about it, because I haven't used it yet, so I don't actually have many opinions. But yeah, they released their own open-source GPT store, which is super cool, and we're going to add this maybe in the show notes, but I don't have a lot to say about this.</p><p>[00:36:24] <strong>Alex Volkov:</strong> And I think, in the spirit of... Yeah, go ahead.</p><p>[00:36:27] <strong>Nisten Tahiraj:</strong> Oh, sorry. I'll quickly say that the HackerNoon dataset of tech articles, those are some of the best, because they have a lot of guest developers, and I remember over the years they had the best ones. Those articles, that dataset, is extremely great for any kind of coding or website or whatever work you're doing.</p><p>[00:36:50] <strong>Nisten Tahiraj:</strong> That's because it's step-by-step instructions on how to build something, and all the code for it. It's pretty awesome, and it's at the very beginning on the Jumbotron, if you guys see it, from Daniel van Strien. And yeah, it's MIT licensed.
It's 6.9 million articles, and you can do whatever you want with it.</p><p>[00:37:07] <strong>Nisten Tahiraj:</strong> So, shout out to them.</p><p>[00:37:09] <strong>Alex Volkov:</strong> We'll add this again to the show notes. And as you said something about articles and code, I remembered another thing that's definitely also worth mentioning: Jina Embeddings. If you guys remember, we had a chat with Bo Wang from Jina, a deep dive into embeddings, a while ago, and Jina Embeddings released a fine-tune for code.</p><p>[00:37:25] <strong>Alex Volkov:</strong> So just a quick shout out that embeddings can be fine-tuned, embedding models can be fine-tuned for specific purposes, and definitely embeddings for code. And, for those of us who follow from week to week, we talk about embeddings a lot. We've talked about Nomic Embeddings last week, the fully open-source one, including the training datasets.</p><p>[00:37:42] <strong>Alex Volkov:</strong> We've talked about OpenAI changing embeddings and giving us new ones and cheaper ones. And Jina, we had a deep dive, and I definitely welcome you to go and check out that special episode with Bo Wang from Jina. They trained their own BERT model as the backbone, the LLM backbone that produces the embeddings, and they just released an update to their embeddings, fine-tuned for code retrieval specifically.</p><p>[00:38:03] <strong>Alex Volkov:</strong> And I think for many folks who are building RAG systems, that's something they should be aware of: embedding models can also be fine-tuned for specific purposes, like Q&amp;A and obviously code as well.
So if you haven't tried that yet, and you're doing a bunch of retrieval on top of code, for example using some of the datasets that Nisten just mentioned, where there's probably code in there, definitely check this out.</p><p>[00:38:25] <strong>Alex Volkov:</strong> I think we're moving on to the big company thing, and I don't have a big company transition, I do have this one though.</p><p>[00:38:43] Google finally launches Gemini Ultra</p><p>[00:38:43] <strong>Alex Volkov:</strong> Just in, as we started the space, maybe an hour before, our friends from the big G, Google, finally answered the question that we've been asking for 10 months and three weeks: where is Google? So GPT-4 was released to us after ChatGPT released in, I want to say December 1st, maybe November 30th, of 2022.</p><p>[00:39:06] <strong>Alex Volkov:</strong> Then GPT-4 was released in March of 2023. And throughout this time, there was this famous video of Satya Nadella asking, where is Google, where's this 600-pound gorilla in the room of search? And, we're going to make them dance. And they definitely made them dance. And we've been waiting.</p><p>[00:39:25] <strong>Alex Volkov:</strong> Where's Google? Where's Google? And Google has released quite a few things for us since then. Just for context, I think everybody knows this already. Google is the birthplace of the transformer paper. So most of the recent GenAI explosion can be attributed to the transformer architecture that came out from Google.</p><p>[00:39:43] <strong>Alex Volkov:</strong> Google had trained multiple models, including PaLM, and we've talked about PaLM and PaLM 2, and I don't even remember all the names of the models that they've released for us throughout the years. Google then also, at some point, gave us Bard, which is their interface, the chat interface that people used in order to play with their models, and I think some of this was
At some point gave us BARD, which is their interface, the chat interface that people used in order to play with their models, and I think some of this was Bye.</p><p>[00:40:04] <strong>Alex Volkov:</strong> Bye. Palm, something else as well. And recently, and I think around December, they said, Hey, you know what? We're here and we have this thing called Gemini after the unification of Google Brain and DeepMind under one org. And we're going to give you Gemini Pro right now, but we'll tell you that Gemini Ultra, that was back in December.</p><p>[00:40:23] <strong>Alex Volkov:</strong> The Gemini, I guess December will tell you the Gemini Ultra is coming and it's going to be better than GPT 4 and you're going to get it soon. And we've been like saying when? And today is the day is the answer for those questions. So today we're celebrating, congrats folks at Google who finally released and upgrade to their LLM capabilities.</p><p>[00:40:41] <strong>Alex Volkov:</strong> Not only an upgrade, so much an upgrade that they've killed the Bard brand completely. No more Bard. That's what I'm understanding. No more BARD, even though that's very confusing. If you guys remember a few weeks ago, we've talked about LMSYS changes were barred with Gemini, I think, something like confusing like this, shot up to the top of the charts and just was trailing GPT 4.</p><p>[00:41:05] <strong>Alex Volkov:</strong> So like second best model in LMSYS arena was barred with GPT 4, or sorry, barred with Gemini. See how confusing this is? And now there's no more barred. But there is an LNCS. Anyway, this is like the whole naming is confusing thing, but Google, including a blog post from Sundar and everything, Google comes out with a new update and says, Hey, Bard is no more.</p><p>[00:41:25] <strong>Alex Volkov:</strong> It's now Gemini and the models are also Gemini. So that's confusing. And the models are Gemini Ultra. 
We finally get access to Google's answer to GPT-4 today, which is incredible. That answer is Ultra 1.0. [00:41:40] And we can get this as part of a paid premium tier that's called Gemini Advanced on Google.</p><p>[00:41:46] <strong>Alex Volkov:</strong> So you can actually go right now, you can sign up, it's 20 bucks a month. And it starts, 20 bucks or 30 bucks? I think it's 20.</p><p>[00:41:52] <strong>Nisten Tahiraj:</strong> It's two months free.</p><p>[00:41:54] <strong>Alex Volkov:</strong> Yeah, and you get a two-month trial, because they have to prove themselves to you, because many people will decide whether or not they're going to go with Google or with ChatGPT.</p><p>[00:42:03] <strong>Alex Volkov:</strong> And we're going to talk about which one folks will prefer. I haven't tried it yet. Literally as I woke up, I had to prepare my notes for the space. I just want to say: Google, welcome to the party, we've been waiting for you. And I counted, it's been exactly 10 months, 3 weeks and 4 days since GPT-4 released, that you came with the same level, at least based on benchmarks.</p><p>[00:42:24] <strong>Alex Volkov:</strong> And now we're gonna talk with some folks who actually tried it. Nisten, you tried it, and I think Ray, you also tried it. Let's talk about your first impressions from Bard, oh, or, sorry, Gemini.</p><p>[00:42:35] <strong>Nisten Tahiraj:</strong> One, it's heavily moderated. No one's surprised by that. It does answer and reason nicely, or at least the way it communicates, it's a lot more eloquent, I would say. It feels nicer in the way it reasons stuff out. However, compared to Mistral Medium, or Mixtral, it doesn't quite obey you. I tried my standard question, which is just like, lay out a schedule for building a city on Mars and write the code in C and JavaScript.</p><p>[00:43:10] <strong>Nisten Tahiraj:</strong> And that's a pretty complex question that only the best models get.
And I needed to re-prompt it in order for it to give the answer. And even then, it only wrote some JavaScript. But it was really good JavaScript. However, it didn't do the rest of the task. Okay, it's not bad. It is worth using. Again, very heavily moderated.</p><p>[00:43:33] <strong>Nisten Tahiraj:</strong> As for the vision side of it, it's extremely heavily moderated. I was even telling it to count out, I had an old gaming PC on the floor with two GPUs on the side, and I told it to make me a JSON of all the parts that it sees in the picture. It won't answer questions that have humans in them, or even if they're like Star Wars characters or whatever.</p><p>[00:43:58] <strong>Nisten Tahiraj:</strong> But this, I thought, would be something pretty simple, and even this one it refused to answer. Yes, it is good, I think. But as far as the vision side goes, the open-source models might have it already beat, or will soon.</p><p>[00:44:19] <strong>Ray Fernando:</strong> Yeah, I wanted to add, Ankesh from Google DeepMind actually wrote, because I've been posting some of this stuff, and he says, to preempt any confusion, multimodal queries don't go through Pro/Ultra yet, but that is coming soon too.</p><p>[00:44:33] <strong>Ray Fernando:</strong> Which makes sense, a little bit, of why you're seeing some of that stuff. I've been seeing similar things when I've been doing some image analysis, or even trying to generate images that have people. One of my examples I've been posting on my Twitter feed is having it analyze a meme.</p><p>[00:44:48] <strong>Ray Fernando:</strong> So it's the hot girls meme, or the hot ones meme, and I was like, hey, this is very popular, can you tell me what this meme is? And it's, I'm sorry, I can't, because there's images of people. And then I had to do some other meme analysis with Elon Musk, and it's the same type of queries.
But to add to what Nisten was saying, I've been doing a lot of creative writing tasks, and the writing output has actually been really nice.</p><p>[00:45:10] <strong>Ray Fernando:</strong> And it doesn't have all that extra fluff that you normally would get from ChatGPT 4. And what I find with OpenAI's ChatGPT 4 is that I frequently say, hey, don't use purple prose, which is all that extra fluffy stuff you read that makes people sound smart. It's like, I just want a regular-sounding piece.</p><p>[00:45:27] <strong>Ray Fernando:</strong> And usually ChatGPT would do that and then revert back to its normal state, but I find that Gemini Advanced just keeps going with it and continues with the writing pieces of things. And for coding stuff, it's really strange. You actually cannot upload any CSV or any text files.</p><p>[00:45:43] <strong>Ray Fernando:</strong> They only let you upload images right now. So you only have a picture of a microphone and a picture of the little icon to upload an image. Because I wanted to just do a simple analysis of my tweets with a CSV file, and there's no place that I see to actually upload that. And I could probably upload so many lines, but there's also a character cutoff, too, that doesn't allow me to upload a lot of code from,</p><p>[00:46:03] <strong>Ray Fernando:</strong> A code base.</p><p>[00:46:04] <strong>Alex Volkov:</strong> What's the, I was about to say this next thing. Do we know the context length? Anybody have an idea of where Gemini Ultra is at, around? Because we know that GPT-4 is 128K, and I think they recently opened this up on the UI as well. I've been noticing fewer restrictions. I've been able to paste a lot more code.</p><p>[00:46:21] <strong>Alex Volkov:</strong> My test is, you guys know my test is the transcription of the ThursdAI conversation that I paste, and Claude with the 100K context definitely takes all of it. GPT,
for the Pro level, used to refuse, and now recently it's like, okay, yeah, let me summarize this for you. Have you guys been able to sense the context length of Gemini Ultra?</p><p>[00:46:41] <strong>Alex Volkov:</strong> Is it anywhere close? Akshay, go ahead. Welcome to the stage, buddy.</p><p>[00:46:46] <strong>Akshay Gautam:</strong> Hello, I just wanted to bring up that their official document mentions that it's 32K context length.</p><p>[00:46:53] <strong>Alex Volkov:</strong> Akshay, don't we get greetings of the day?</p><p>[00:46:57] <strong>Akshay Gautam:</strong> I see. Yeah. Yeah. Greetings of the day, everybody. My name is Akshay Kumar Gautam, and I'm an applied AI engineer. I was a data scientist before, but now I work with modeling and stuff. And yeah, I was literally waiting for it. I tried it when it came out, I paid for it, because why not? And a lot of stuff.</p><p>[00:47:14] <strong>Akshay Gautam:</strong> First of all, it's really good at coding. By the way, the context length is 32K, at least that's what they say. Yeah, 32K. And the model is not good at keeping context, that is what I was here to talk about. It will lose sense, for example, if you ask it to do multiple things in a single prompt, it will not.</p><p>[00:47:33] <strong>Akshay Gautam:</strong> Unlike ChatGPT. But with coding, it's better than ChatGPT, in my humble opinion.</p><p>[00:47:41] <strong>Alex Volkov:</strong> So I want to talk about some advantages that Google, the big dog, definitely has, because an additional thing that they released, well, ChatGPT has this too, they released an iOS and Android app, but Android also has integration with the Google Assistant, right?</p><p>[00:47:56] <strong>Alex Volkov:</strong> So you can now join this Advanced or Ultra tier and use this from your Android device. Now, I'm not an Android user, but I definitely understand that the ecosystem is vast, and many people just use this assistant, and we're waiting for Apple.
We don't have anything to say about Apple specifically today, besides the fact that they released maybe the next era of computing.</p><p>[00:48:16] <strong>Alex Volkov:</strong> But there's nothing AI in Siri, still the same Siri from like 2019, with some examples. But Google has now moved everybody who wants to, who pays the 20 bucks a month and has an Android device, basically towards this level of intelligence, basically a GPT-4 level of intelligence. And I saw that Marques Brownlee, MKBHD on YouTube, like one of the best tech reviewers out there,</p><p>[00:48:38] <strong>Alex Volkov:</strong> has been playing with the Android stuff, and he said that the integration means Google Assistant even uses your home stuff. So you can actually ask this level of intelligence to turn on some lights, whatever, and probably with better context. Akshay, do you have any comments on this? Have you played with the Assistant version?</p><p>[00:48:54] <strong>Akshay Gautam:</strong> Two things. First of all, Bing Chat was already available on Android devices, right? Copilot, as it's now called. Copilot uses GPT-4, so it's already really good. And you can actually use a lot of voice stuff with Copilot as well, which was surprising. The Google Assistant, to be honest, in terms of assistants, among Siri, and I have a Samsung device, so it has Bixby, among all the AI assistants, Google Assistant was the best one by far, in terms of how much you can use it. And I'm hoping to get access, because I have paid for the Ultra, but I still don't have access to everything.</p><p>[00:49:29] <strong>Akshay Gautam:</strong> Also, there's no API for Ultra, so you cannot actually test anything as well.</p><p>[00:49:34] <strong>Alex Volkov:</strong> We haven't gotten an API for developers. Sundar Pichai said the developer announcements are going to come next week. iOS hasn't updated yet.
Yeah, go ahead, Nisten.</p><p>[00:49:44] <strong>Nisten Tahiraj:</strong> I just really quickly tested it with the entire llama.cpp file. I am down to 15,000 tokens I cut it down to, and it's still too long. We know it's under 16,000 that you can paste in. I will know [00:50:00] exactly in a few minutes.</p><p>[00:50:03] <strong>Alex Volkov:</strong> So not super, super impressive in terms of long context. I will also</p><p>[00:50:06] <strong>Nisten Tahiraj:</strong> At least not for the UI.</p><p>[00:50:08] <strong>Alex Volkov:</strong> For the UI. Usually, for some reason, they restrict the UI, or they forget to update this, and then the model itself has way longer context. But for now, not extremely impressive comparatively.</p><p>[00:50:18] <strong>Alex Volkov:</strong> And again, we're comparing the two main flagship models, OpenAI's GPT-4 and now Google's Gemini Ultra. And I also want to say one thing: Gemini seems to be optimized only for English as well. Even though it will answer most questions in other languages, it looks like the optimization was focused on English,</p><p>[00:50:36] <strong>Alex Volkov:</strong> including some of the apps as well, which is understandable. But as we're trying to compare apples to apples, GPT-4 is incredibly versatile in multi-language operations as well. LDJ, you have some comments? Welcome, buddy, to the stage, and give us some. Have you played with Ultra so far?</p><p>[00:50:55] <strong>LDJ:</strong> Yes. I was actually wondering, does anybody know of plans for them to integrate this with Google Home? Because I just asked my Google Home right now, are you Gemini? And it said, I'm a Virgo. And then I asked it, what AI model are you running right now? It said, sorry, I don't understand.
So I don't think it's, at least mine, I don't think it's running Gemini right now.</p><p>[00:51:16] <strong>LDJ:</strong> But</p><p>[00:51:17] <strong>Alex Volkov:</strong> No, so I think the announcement was</p><p>[00:51:18] <strong>Junyang Lin:</strong> To put it.</p><p>[00:51:19] <strong>Alex Volkov:</strong> The integration into Google Home will come from the Google Assistant. So if you have an Android device, you'll have Google Assistant there, that you can switch to like a smarter brain, and you can ask it to integrate with your home through the device. So you can ask it to do stuff in your home.</p><p>[00:51:34] <strong>Alex Volkov:</strong> But the Google Home itself, like the Google Home devices that you have, they haven't talked about upgrading them, but maybe at some point, because why not? But I haven't seen anything on this yet. Anything else here?</p><p>[00:51:46] <strong>Junyang Lin:</strong> I think that'd be the perfect. Sorry. Yeah, go on.</p><p>[00:51:48] <strong>Alex Volkov:</strong> Yeah, no, that would be great. I agree with you. Being able to walk around your house and just talk with a GPT-4 level intelligence to do operations, I definitely agree.</p><p>[00:51:55] <strong>Alex Volkov:</strong> That would be great. I gotta wonder, anything else here on Ultra? We've talked about its code performance. We've talked about its inability to talk about people. Anything else interesting that we want to cover so far? And again, folks, it's been two hours and we're already giving you a bunch of info, but we'll play with this going forward.</p><p>[00:52:12] <strong>Nisten Tahiraj:</strong> It's about 8,000, the context length that you</p><p>[00:52:14] <strong>Alex Volkov:</strong> Are you serious? Wow, that's not a lot at</p><p>[00:52:17] <strong>Nisten Tahiraj:</strong> That's as much as I was able to paste, like 7,500.</p><p>[00:52:20] <strong>Alex Volkov:</strong> So yeah, folks, you heard it here first.
You'll probably get more context than you previously got, but it's not a lot comparatively. It's probably a consideration of compute for Google, right? How much context to give you; the model itself probably handles more. And it's also a vision-enabled model.</p><p>[00:52:36] <strong>Alex Volkov:</strong> But I think that we've covered this enough. Gemini Ultra. It's here, it's very impressive from Google, and yet, I want to say, personally, maybe a little bit underwhelming, because they need to convince us to move, and it's going to be the same price. And, I don't know, let me just ask this before we move on.</p><p>[00:52:55] <strong>Alex Volkov:</strong> Anybody here on stage who has access to both and plans to pay for this and not GPT?</p><p>[00:53:03] <strong>Nisten Tahiraj:</strong> I haven't paid for anything since September. But I'm</p><p>[00:53:08] <strong>Junyang Lin:</strong> not the right person for this question. My company pays for my ChatGPT subscription. So I might keep both.</p><p>[00:53:15] <strong>Alex Volkov:</strong> Interesting.</p><p>[00:53:16] <strong>Junyang Lin:</strong> Paying for mine is out of pocket. I'm just going to keep both. I like the OpenAI app because it's just the multimodal picture on my phone.</p><p>[00:53:23] <strong>Junyang Lin:</strong> I'm on the go. For Google, I'm just curious, because it's two months free. That just means they have me hooked. We'll see.</p><p>[00:53:30] <strong>Alex Volkov:</strong> Yeah, it's two months free. And then let's check back in two months and see how many of us kept paying. All right. So Google also releases a llama.cpp wrapper called localllm. I don't know if you guys saw this. It's pretty cool. It's an open-source tool from Google that actually helps you run LLMs locally on CPUs, and then also on Google Cloud, with a super easy integration.</p><p>[00:53:51] <strong>Alex Volkov:</strong> Very interesting choice.
They also call out TheBloke, that you can download models from TheBloke with their tool. And I think it's very funny that if you go on the description of the localllm blog, in the code snippets they say, hey, install OpenAI.</p><p>[00:54:10] <strong>Alex Volkov:</strong> So I found it really funny. But yeah, they have a wrapper there that integrates with Google Cloud as well.</p><p>[00:54:15] OpenAI adds DALL-E watermarking and per API key restrictions</p><p>[00:54:15] <strong>Alex Volkov:</strong> Running through the big-companies area super quick: OpenAI added watermarks to DALL-E images. They use this new metadata standard called C2PA, and it embeds in the metadata.</p><p>[00:54:27] <strong>Alex Volkov:</strong> And so basically, what this means for us is not that much, but when you download images generated with DALL-E, and I assume that the same will come to Microsoft Copilot, they will now have in the metadata, where things like the location and everything else sit, the fact that they have been generated with DALL-E.</p><p>[00:54:43] <strong>Alex Volkov:</strong> This information will sit in the metadata. For now it's only images, not text or voice or anything else from OpenAI. This happens over the API and from the ChatGPT interface as well. This increases the file size a little bit, but it's not super interesting.</p><p>[00:55:00] <strong>Alex Volkov:</strong> This can be stripped. So the lack of presence of this thing does not mean that an image was not generated with DALL-E. It just means that if it is there, the image was definitely generated with DALL-E. And so this is an interesting attempt from OpenAI to say, hey, we're doing as much as we can.</p><p>[00:55:15] <strong>Alex Volkov:</strong> It's not foolproof, but an interesting attempt.
And also, for those of us who develop with OpenAI, I just want to mention the API keys. They keep upgrading the developer experience there, including the API keys part, and now you can restrict usage per API key, which many people have been waiting on for a long time.</p><p>[00:55:33] <strong>Alex Volkov:</strong> Many people have been wanting this. You can create one OpenAI API key for a specific purpose and restrict it to only DALL-E, for example. I don't know if you can restrict based on credits, I don't think so, but you can definitely restrict the usage-related stuff.</p><p>[00:55:49] <strong>Alex Volkov:</strong> That's, I think, all the updates from the big companies and the LLMs and APIs.</p><p>[00:55:53] <strong>Alex Volkov:</strong> This week's buzz is the corner, and I stopped the music too prematurely. This week's buzz is the corner where I talk about the stuff that I learned at Weights & Biases this week. I don't know how many of you had a chance to join our live segments, but we definitely had a build week. I think I mentioned this before, but we actually had a live show on Monday.</p><p>[00:56:19] <strong>Alex Volkov:</strong> We're going to have another one, probably tomorrow. Yeah, tomorrow, I think at noon Pacific, where I interview my team, the GrowthML team at Weights & Biases, about the build-week projects that we built last December to try and see what's the latest and greatest in this world. As we build tools for you in this world, we also want to build internal tools to see what the latest techniques are, stuff like we just talked about.</p><p>[00:56:46] <strong>Alex Volkov:</strong> For example, it gives us a chance to play around with them. It's like an internal hackathon. And what happened was, we built those tools and presented them to the company, and then that was basically it. And I said, hey, hold on a second. 
I learn best publicly. I learn best the way I just learned from Connor and Benjamin.</p><p>[00:57:02] <strong>Alex Volkov:</strong> I learned from Nisten and Far El and all the folks in the audience. Luigi and I had a whole section where he taught me Weights & Biases before. I learn best by being public and talking about what I'm learning as I'm learning it. And so I did the same thing with our folks from the GrowthML team.</p><p>[00:57:15] <strong>Alex Volkov:</strong> Folks literally came up on stage and I asked them about what they built and what they learned. And we're going to summarize those learnings in the live show. That live show, if you're interested, is all over our socials, on the Weights & Biases YouTube and LinkedIn. Yes, LinkedIn, I now need to also participate in that whole thing.</p><p>[00:57:33] <strong>Alex Volkov:</strong> So if you have tips about LinkedIn, let me know. But it's live on LinkedIn, live on YouTube. I think we did X as well and nobody came. We'll probably try to send you to the live YouTube flow. But basically the second part of this is coming up tomorrow. We're interviewing three more folks, and you get to meet the incredible team that I'm part of.</p><p>[00:57:53] <strong>Alex Volkov:</strong> Very smart folks, like Kaggle Masters, and some of them came on Connor's show as well, which is super cool. And I found the first conversation super interesting and insightful. Definitely recommended if you're into understanding how to build projects that actually work within companies, and what the process was.</p><p>[00:58:11] <strong>Alex Volkov:</strong> We have folks who built something from scratch, and we have somebody who runs an actual bot with retrieval and re-ranking and evaluations and all these things, and [00:58:20] has been running it for a year, basically, in production. 
So you can actually try our bot in Discord right now, and in Slack, and on GPTs.</p><p>[00:58:28] <strong>Alex Volkov:</strong> If you want to hear about the difference between a mature RAG-based bot that's in production at a professional AI company and something that somebody can quickly build in a week, we talked about those differences as well. So definitely worth checking out that live show.</p><p>[00:58:46] <strong>Alex Volkov:</strong> Moving on from this week's buzz, and I learned a lot. Okay, so back from this week's buzz, we're moving into vision.</p><p>[00:58:52]</p><p>[00:58:57] <strong>Alex Volkov:</strong> And Bria AI, super quick: they released a new background segmentation model, or background removal model, that's live on Hugging Face. It's called RMBG v1.4, and I think the cool thing about this is that it now runs completely in the browser, thanks to the efforts of our friend Xenova, who is no longer in the audience, I think, from Hugging Face and Transformers.js.</p><p>[00:59:19] <strong>Alex Volkov:</strong> It's super cool. You can remove backgrounds completely without sending any images anywhere, just straight from your browser. That model is called, again, RMBG, and it's not commercially licensed, so you cannot use it for professional stuff, but it's open for you to try and play with. Now to the voice category, the voice and audio category.</p><p>[00:59:39] <strong>Alex Volkov:</strong> We haven't had a lot of audio stuff lately, so I think the main audio stuff that we've talked about was, I want to say, Suno as the latest and greatest, but we're still waiting for some cool music creation stuff from different labs. 
And definitely, I know some of them are coming. But in the voice category, you know that we've been talking about my position on this, and Nisten and I share this position.</p><p>[01:00:01] <strong>Alex Volkov:</strong> I think personally, the faster models come out that can clone your voice, and the faster they come out in open source, the better it is generally for society. I know it's a hot take, but I also know, and I cannot reveal the source, that voice cloning tech is going to be in open source super, super quick.</p><p>[01:00:21] <strong>Alex Volkov:</strong> And I think it's one of those break-the-dam type things: the first major lab will release voice cloning, everybody will see that nothing happened in the world, and then everybody else will release theirs. We know everybody has one. We've known for a long time that Microsoft has, I want to say VALL-E, was that VALL-E?</p><p>[01:00:38] <strong>Alex Volkov:</strong> That clones your voice in under three seconds. There are papers on this from every company in the world. We know that OpenAI has one. They collaborated with Spotify and they cloned Lex Fridman's voice, and it sounds exactly like Lex Fridman. We know that companies like HeyGen, for example, I think they use ElevenLabs.</p><p>[01:00:54] <strong>Alex Volkov:</strong> ElevenLabs has voice cloning as well. None of this is open source; everything is proprietary. So we're still waiting for open source voice cloning from a big company. But for now, we got something called MetaVoice from a smaller company. Not from Meta, it's just called MetaVoice, it's confusing.</p><p>[01:01:08] <strong>Alex Volkov:</strong> It's just a tiny model, a 1.2 billion parameter model. It's trained on 100K hours of data, which is quite significant, but not millions of hours. And it supports zero-shot voice cloning. 
So basically, from a few samples, like a basic sample of your voice, you're going to get a clone of your voice or somebody else's, which is what scares many people in this area.</p><p>[01:01:30] <strong>Alex Volkov:</strong> It has long-form synthesis as well. It's super cool. And it has emotional speech. If you guys remember, we've talked about how important emotion is in voice cloning, because, for those of you who have followed ThursdAI for a while, you may remember myself voice-cloned in kind of Russian, and I'm doing this with a lot of excitement, while the regular voice cloning thing for Alex speaks in a monotone voice that's very clearly not the same kind of person.</p><p>[01:01:56] <strong>Alex Volkov:</strong> So emotional speech is very important. Some of this is with prompt engineering, and some of this happens in voice casting or voice acting. And the best part about this MetaVoice thing is it's Apache 2.0 licensed, and it sounds pretty good. We've talked about multiple TTS models, and now this model is definitely out there.</p><p>[01:02:14] <strong>Alex Volkov:</strong> So if you're building anything and you want a TTS model with voice cloning, I think this is now the best shot you have. It's called MetaVoice. I'm going to be adding this to the show notes as well. And I think we have breaking news from a friend, VB, with another model called NeMo.</p><p>[01:02:30] <strong>Alex Volkov:</strong> So let's take a look. Yeah, definitely a new model from NVIDIA. It's called NeMo. Let me actually use this. 
I want to use the sound as much as possible.</p><p>[01:02:50] <strong>Alex Volkov:</strong> So I'm gonna go and try and find this tweet for you, but basically we have breaking news, literally as we speak. VB, a friend of the pod who's in charge of all the cool voice-related and TTS-related tech at Hugging Face, mentioned that NVIDIA AI released NeMo Canary.</p><p>[01:03:07] <strong>Alex Volkov:</strong> NeMo Canary is at the top of the Open ASR Leaderboard. VB is also one of the folks who run that leaderboard for us. ASR stands for automatic speech recognition. No, I think I'm confusing this. Yes, automatic speech recognition. Cool. Thank you, Nisten. So basically, if you guys remember Whisper, we talked about Whisper a lot.</p><p>[01:03:25] <strong>Alex Volkov:</strong> This is the leaderboard, and Whisper had been on top of this leaderboard for a while. Recently, NVIDIA has done some stuff here, like Parakeet. And now we have a new contender on the ASR leaderboard called NeMo Canary 1B. 1B is not that much; the biggest Whisper, Whisper-large, I think is 2.5B or something.</p><p>[01:03:44] <strong>Alex Volkov:</strong> This is now the top of the ASR leaderboard. It beats Whisper, and it beats Seamless from Meta as well. I don't know about the license here. It supports four languages. Whisper obviously supports a hundred, which we know is best for many low-resource languages as well. It's trained on not that many hours of annotated audio, only 85,000 hours or so, and it's super fast as well.</p><p>[01:04:10] <strong>Alex Volkov:</strong> It's very interesting that NVIDIA does multiple things in this area. We had Parakeet, now we have Canary as well. What else should we look at? I think it beats Whisper by a considerable margin, again, on these specific languages. 
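The metric this leaderboard ranks models by, word error rate, is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. A quick self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") out of six reference words -> 1/6
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 5% means roughly five such edits per 100 reference words, which is the comparison Alex makes below between Canary (around 5) and Whisper (around 8).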
Folks, we've been on this trend for a while, and I think it's clear.</p><p>[01:04:28] <strong>Alex Volkov:</strong> Incredible automatic speech recognition comes on-device very soon. This trend is very obvious and clear. I will add my thoughts on this as somebody who has used Whisper in production for a while: the faster it comes on-device, the better. And specifically, I think this will help me talk about the next topic.</p><p>[01:04:47] <strong>Alex Volkov:</strong> Let's see what else I have to cover. Yeah, I think it's pretty much it. The next topic</p><p>[01:04:51] <strong>Nisten Tahiraj:</strong> I'm trying it right now, by the way. And it's pretty good.</p><p>[01:04:55] <strong>Alex Volkov:</strong> Are you transcribing me in real time, or what are you doing?</p><p>[01:04:58] <strong>Nisten Tahiraj:</strong> Yeah, I was transcribing your voice through the phone to my laptop. Weirdly enough, it doesn't output numbers, it only outputs words. However,</p><p>[01:05:06] <strong>Nisten Tahiraj:</strong> it seems pretty good, huh? I don't know, it seems good to</p><p>[01:05:09] <strong>Nisten Tahiraj:</strong> me. LGTM, looks good to me.</p><p>[01:05:11] <strong>Alex Volkov:</strong> Yeah, looks good to me. Absolutely. The word error rate for Whisper is around 8%, I think, on average for these languages, and for Canary it's less, it's around 5. If I remember correctly, VB told us that word error rate is how many mistakes per 100 words it makes, and this does five mistakes versus eight, I think, on the general datasets.</p><p>[01:05:36] <strong>Alex Volkov:</strong> Quite incredible. This is coming, and I think I'll use this to jump to the next thing.</p><p>[01:05:39] Alex finds a way to plug Vision Pro in spaces about AI</p><p>[01:05:39] <strong>Alex Volkov:</strong> 
The next thing, and we'll cover this briefly, is that, while I haven't used it for the show, since last Friday, basically, I've been existing in reality and in augmented virtual spatial reality from Apple.</p><p>[01:05:52] <strong>Alex Volkov:</strong> And the reason I finally have a chance to connect these two things is because I use a lot of the hand gestures within the Vision Pro from Apple, which was released on Friday, and a lot of voice as well. And obviously we've talked about Siri; we've talked about Google finally stepping up with their assistant.</p><p>[01:06:08] <strong>Alex Volkov:</strong> Siri voice recognition, and also typing, is not that great. And I know, because I've used Whisper in production a bunch. I also use Superwhisper, shout out Neil, on my Mac to dictate a bunch. And all those tools, all the new tools, Whisper and now Canary and all these things, they understand me and my accent very well.</p><p>[01:06:26] <strong>Alex Volkov:</strong> Whereas Siri is on-device. Siri actually has two automatic speech recognition systems. They have the fast one on-device, and they also actually send your voice to the cloud and return something. So you would [01:06:40] actually see a wrong transcription, and then the right one replaces the wrong one. And the right one is actually generally okay, even if with my accent it doesn't always get me, but the wrong one is very bad.</p><p>[01:06:50] <strong>Alex Volkov:</strong> It's like they stopped thinking about ASR, automatic speech recognition, at Apple back in 2019, and that's what they shipped. However, there have been quite a few papers from Apple on this topic, and I know for a fact that we're getting on-device ASR. 
And the reason I'm excited about this in the spatial context as well is because you can talk instead of using hands on a keyboard, and that's very cool. I think that's all I had to connect with spatial computing, in addition to this: I've tried all the AI tools and games and everything, and we're still not there.</p><p>[01:07:19] <strong>Alex Volkov:</strong> There has been one thing that I want to connect. If you guys know, from the diffusion model area, there is a way to generate images in 360 around you, and I thought this was super cool, because this is essentially a holodeck moment, where you can stand in full virtual embedded reality and just say, hey, I want this thing to appear.</p><p>[01:07:39] <strong>Alex Volkov:</strong> And we have text-to-3D models that are coming super soon. We obviously have virtual friends; embedding them in real space needs a robot. But now, with this spatial computing thing, you can actually put an AI friend in the corner that you can always talk to. There are a few attempts at this on the Apple headset,</p><p>[01:07:57] <strong>Alex Volkov:</strong> but not a lot. And I will also ping back to one last thing about Apple. We've talked about this. Apple released the Vision Pro on Friday, the day after last Thursday, and Apple had their shareholder meeting. And in there, Tim Cook said, hey, we launched spatial computing.</p><p>[01:08:15] <strong>Alex Volkov:</strong> We're really happy. This is the next iteration of spatial stuff, blah, blah, blah. I definitely agree with all of this. If you've watched my feed for the past week, that's pretty much all I can talk about besides AI. However, going back to the AI, Tim Cook finally mentioned the word AI on the call, and he's not the only one.</p><p>[01:08:30] <strong>Alex Volkov:</strong> It's very clear where this thing is going. Every earnings call for every major company mentioned AI. 
Tim Cook specifically, finally mentioned AI and said, hey, we're very excited about this technology, and we're going to show you something soon. So I expect this WWDC to be spatial- and AI-related, and I definitely think Apple is thinking about both, just because the way Siri looks in spatial is just incredibly nice.</p><p>[01:08:55] <strong>Alex Volkov:</strong> And I can see how embodying AI in your physical world works, where you have spatial awareness: you can put something in the corner, and it will sound like it's coming from the corner. And I'm waiting for the point where that has a body, like a Tesla Optimus bot with AI.</p><p>[01:09:11] <strong>Alex Volkov:</strong> But before that, we'll definitely get there with spatial computing. So I'm going to have embodied AI agents around me, and I'm going to ask them questions. For some reason, the ChatGPT interface within the headset is horrible. Specifically because, as we all know, you can talk to the iPhone app, but Vision Pro only has access to iPad apps, and you can install the ChatGPT iPad app, but you cannot talk to it, which is a miss, I think, on OpenAI's part.</p><p>[01:09:35] <strong>Alex Volkov:</strong> That's it for my segment about the Vision Pro. I tried as much as possible to connect these things to AI to bring this to you. But, separately from this, my full review of the Vision Pro is: holy s**t, this device is a new category of computing, and I can talk about this in a different space if you're interested.</p><p>[01:09:50] Space reset</p><p>[01:09:50] <strong>Alex Volkov:</strong> And I think it's time to reset the space, as we've gone for an hour here, folks. A little bit more than an hour. 
I'm just gonna play some music, reset the space, and then we're gonna have a conversation with some folks here on stage.</p><p>[01:10:12] Deep dive into DSPy, ColBERT and RAGatouille with Ben Clavié and Connor Shorten</p><p>[01:10:12] <strong>Alex Volkov:</strong> Welcome, everyone, to the second hour of ThursdAI, where we usually still have a bunch of stuff to cover from the news angle, like the Bria stuff and the MetaVoice stuff and the art and diffusion stuff. And maybe you also want some time to talk about Vision Pro. But for now, we have two guests here on stage that I want to welcome and introduce.</p><p>[01:10:31] <strong>Alex Volkov:</strong> We're going to talk about very interesting things that maybe some of you who follow the Twitter/X AI ecosphere have been seeing around, and I really want to say thank you and welcome to Connor and Benjamin for joining us. Maybe let's unmute Connor first and then Benjamin, and just introduce yourselves.</p><p>[01:10:49] <strong>Alex Volkov:</strong> Benjamin, I know you're going through some stuff, buddy, so as much as you can, Benjamin, feel free to talk to us, but we'll try to cover as much as possible. Connor, go ahead, and then Benjamin.</p><p>[01:10:58] <strong>Connor Shorten:</strong> Hey Alex, are you able to hear me first?</p><p>[01:11:00] <strong>Alex Volkov:</strong> Yes, we can hear you loud and clear.</p><p>[01:11:03] <strong>Connor Shorten:</strong> Awesome, cool. I think I've been refreshing the Twitter page and all that, but awesome. So I'm Connor. I'm a research scientist at Weaviate. I also host the Weaviate podcast. And yeah, I've just been so excited about DSPy, and I'm really excited to be diving</p><p>[01:11:15] <strong>Connor Shorten:</strong> into it further.</p><p>[01:11:16] <strong>Alex Volkov:</strong> That's awesome. And I think the Weaviate podcast was the first podcast that I came on as a little bit of a guest, from NeurIPS. 
So we had a great conversation outside by the NeurIPS sign, if you guys want to check that out. But also on the Weaviate podcast, the folks from Weights & Biases had a great chat with you.</p><p>[01:11:29] <strong>Alex Volkov:</strong> That's where I know you from. I actually researched my position and my team based on the conversation you had with them. Very knowledgeable, and thank you for that content, it's really great. Folks should definitely check it out. And I also want to say hi to Benjamin Clavié. Welcome, Benjamin.</p><p>[01:11:44] <strong>Benjamin Clavie:</strong> Hey,</p><p>[01:11:45] <strong>Benjamin Clavie:</strong> thank you for having me. Can you hear me?</p><p>[01:11:47] <strong>Alex Volkov:</strong> Yes, you're coming through loud and clear.</p><p>[01:11:50] <strong>Benjamin Clavie:</strong> Yeah, thank you. Yeah, I made RAGatouille, which you might have seen if you're interested in retrieval at all. I'm</p><p>[01:12:02] <strong>Benjamin Clavie:</strong> physically here, but not fully present, but</p><p>[01:12:05] <strong>Alex Volkov:</strong> What about your background? Could you give us a little bit of background? How did you come to build these things? Is your background in AI? Give us maybe a few brief sentences there.</p><p>[01:12:15] <strong>Benjamin Clavie:</strong> I'll say my background</p><p>[01:12:16] <strong>Benjamin Clavie:</strong> here is basically AI. I've done the stereotypical thing of dropping out of uni and walking straight into NLP, and I've been doing retrieval and NLP for six, seven years now.</p><p>[01:12:25] <strong>Benjamin Clavie:</strong> Very standard background.</p><p>[01:12:27] <strong>Alex Volkov:</strong> So definitely a related background. Okay. So we're here to talk about multiple interesting things. And Connor, maybe let's just start with this: I think the guy behind some of this work, Omar Khattab, is not with us, right? 
But definitely some of the work that we're going to talk about is attributed to him.</p><p>[01:12:45] <strong>Alex Volkov:</strong> So maybe, Connor, can you start us with an introduction to DSPy, and then ColBERT, and then we're going to talk about ColBERT and RAGatouille. Just a brief one, and then we'll dive into what this means for retrieval stuff, definitely as it relates to you guys at Weaviate. RAG is everywhere, and better RAG systems and better</p><p>[01:13:03] <strong>Alex Volkov:</strong> options to prompt these LLMs to retrieve better, everybody's looking for those. So let's start maybe there.</p><p>[01:13:12] <strong>Connor Shorten:</strong> Okay, so I'll try to keep the story going from an intro to DSPy and then take it into retrieval. I think the first thing about DSPy that will capture your interest is the programming model. It has this way of writing initial prompts in a really succinct way, and then you can chain together or compose these graphs of several large language model calls with tool use in the middle, and we can come into retrieval a little bit there as well. But you start off with a really coarse description of what you want it to do, say, re-rank these documents, and then it will optimize the whole description of the task, as well as giving you few-shot examples to put in the prompt.</p><p>[01:13:50] <strong>Connor Shorten:</strong> So that's the first thing that is just super interesting. I'm sure everyone listening has done this manual tweaking of the prompt to try to get it to do your task, and knows how irritating that can be. So that's probably the quickest value add: it will automatically come up with the prompts.</p><p>[01:14:03] <strong>Connor Shorten:</strong> And then, when you want to switch your language model, you've been over there saying please output JSON, four exclamation marks performing better than one. 
And now you switch from GPT-4 to Gemini Ultra, or say you want to see if Qwen can be few-shot prompted to do this.</p><p>[01:14:17] <strong>Connor Shorten:</strong> You can now recompile the prompt by using DSPy, and you can switch your language model without having to then redo the prompt tuning.</p><p>[01:14:24] <strong>Alex Volkov:</strong> So I have to pause right here, Connor, because I'm coming to this as clean as possible, with not a lot of understanding of these things. You said recompile the prompt.</p><p>[01:14:33] <strong>Alex Volkov:</strong> I'm definitely one of the folks who've tweaked prompts, tried again, and saw, okay, it works for GPT-4. I'm definitely one of those folks. What do you mean, compile the prompt, recompile the prompt? Let's talk about the compilation part of this.</p><p>[01:14:44] <strong>Connor Shorten:</strong> Even when I met Omar, I said, compile, it's overloaded. I think this kind of analogy started with calling LLMs the new operating system, and so I think that's the line of thinking behind calling it a compiler. Really, we mean automated prompt [01:15:00] tuning.</p><p>[01:15:00] <strong>Connor Shorten:</strong> But the reason compiling, I think, is the right way to think about it is, let's say you have eight large language model calls, eight parts to your program. That's what I think is really exciting, and that's what I think makes LangChain so popular: people see this gallery of examples of chains where you first analyze some chunks of blog posts, extract the topics, then later on aggregate the topics into a description of the topic, and then maybe pass it to an editor prompt, and then maybe have a council of reviewers. There's this chain, and so with each component of the chain, or I think graph is now the more common abstraction,</p><p>[01:15:35] <strong>Connor Shorten:</strong> you have a prompt there. 
So let's say you have eight language model calls, or however many. I imagine that as this continues to evolve, we're going to see super deep LLM programs that have many LLM calls in the middle. And so you have a prompt for each of those components.</p><p>[01:15:49] <strong>Connor Shorten:</strong> And that's why I think the compiling analogy is great, because you're compiling the prompts for all of these components. And yeah, so that's why I'll defend the compiling.</p><p>[01:16:01] <strong>Alex Volkov:</strong> So I'll just say, from the perspective of a tinkerer, that's something that maybe triggers me a little bit: oh, I need to compile stuff? No, I just write Python code. But you're saying, developers, do not fret. Compiling is not that crazy. It's specifically very helpful and useful for larger applications, and it's very helpful when you want to replace the brain behind the stuff that you're doing, or you want to do this in a structured way.</p><p>[01:16:24] <strong>Alex Volkov:</strong> Am I understanding correctly what we're talking about?</p><p>[01:16:28] <strong>Connor Shorten:</strong> Yeah, I agree completely with that.</p><p>[01:16:29] <strong>Alex Volkov:</strong> Awesome. So that's DSPy, and Omar Khattab, Late Interaction I think the nickname is. We're definitely going to add him to the show notes as well; he's the author of this. DSPy has been around for a while. I definitely know he has been posting about it quite a lot, but recently it has been picking up as well.</p><p>[01:16:46] <strong>Alex Volkov:</strong> And maybe ColBERT is one of the reasons. Can you introduce ColBERT as well, Connor? Or do we still have some stuff about DSPy to cover in the introduction phase?</p><p>[01:16:56] <strong>Connor Shorten:</strong> Okay, I can transition to ColBERT.</p><p>[01:16:58] <strong>Alex Volkov:</strong> Colbert? Colbert? 
How do we, how do you even pronounce this thing?</p><p>[01:17:02] <strong>Connor Shorten:</strong> I was surprised when Omar pronounced it Colbert, because it's BERT, and then there's Stephen Colbert. I'd heard him on the podcast with, I think, Christopher Manning from Stanford, who had asked him about that.</p><p>[01:17:14] <strong>Alex Volkov:</strong> So if Omar, the creator of this, pronounces it Colbert, then unfortunately, even though they're BERT models, I think Colbert is what we're going with, like Stephen Colbert. What is ColBERT? Why is there excitement on my feed around this? Give us an introduction, Connor.</p><p>[01:17:31] <strong>Connor Shorten:</strong> So probably the right way to start thinking about it is: in search, you typically have retrieval and then re-ranking. Retrieval is where you have encodings of the documents. You put each of the documents into an embedding model and you get a vector embedding, and then you're doing just dot-product distances between the query vector and these document vectors.</p><p>[01:17:51] <strong>Connor Shorten:</strong> So there's no interaction between the query and the documents; the representations are encoded completely separately in retrieval. And then you'll typically pass that into a re-ranker. There are three kinds of re-rankers. There are pointwise re-rankers, which take as input the query and the document and output a relevance score, doing the interaction between just this query and this one document.</p><p>[01:18:12] <strong>Connor Shorten:</strong> Then there's pairwise, where you take two documents and the query and have a tournament, two at a time. And then there are the listwise re-rankers, where you're taking all the documents as input at once. 
So the re-rankers are pretty effective, but you have this massive latency overhead by doing it like that.</p><p>[01:18:28] <strong>Connor Shorten:</strong> So what ColBERT introduces is this late interaction. You get the benefit of having this interaction between the query and the document, most similar to the pointwise cross-encoder re-ranker, but you keep the vectors for the documents, and you have this kind of interaction between the individual token vectors.</p><p>[01:18:47] <strong>Connor Shorten:</strong> Right now, what we're doing mostly with vector search, and this is why the BERT thing is actually really important, is using these encoder-only models that output a vector for each of the tokens, but then we pool all those vectors to represent the object with one vector.</p><p>[01:19:02] <strong>Connor Shorten:</strong> But with ColBERT, you keep all the vectors for the query and the document. And then, it's maybe a little hard to just talk you through the math behind this, but you take the maximum similarity of each of those query vectors with all the document vectors. So say you have 100 document vectors, and you're at index 0 of the query vectors: you take the maximum similarity with those 100.</p><p>[01:19:22] <strong>Connor Shorten:</strong> Then you're at the first vector of the query, second, third, and so on, and then you'll average that out. So you now have this late interaction of the vectors between the query and the document. I hope that maybe Benjamin can take the mic from here; I hope that gets the gist of it.</p><p>[01:19:37] <strong>Benjamin Clavie:</strong> Yeah, that was pretty good. 
So just to clarify: with max similarity, when you're using normal vectors, like single-vector representations, you have a single vector for the whole document.</p><p>[01:19:48] <strong>Benjamin Clavie:</strong> When you're using ColBERT, like Connor said, you've got one vector per token, and at retrieval time, what you do is compare every single one of your query tokens, so generally not a lot, maybe 32, with every single token in every single document, and you only keep the highest similarity, and then you sum that up. So you compare every token to every token, and you get this really fine-grained comparison, instead of trying to slot everything into one massive vector, which would probably lose information.</p><p>[01:20:17] <strong>Benjamin Clavie:</strong> Because you're doing it at the token level, it's very clear. I call this a bag of embeddings, because it's quite close to what we do with TF-IDF, but with embeddings instead of just a word count.</p><p>[01:20:29] <strong>Alex Volkov:</strong> Wow. Okay. So let me try. Connor said a bunch of stuff, then, Benjamin, you simplified. Let me try to simplify from my understanding. Okay. A regular RAG system, the regular basic stuff, without even the re-ranking step, Connor, like the basic stuff that people do in the Weaviate examples, for example, or whatever local embeddings you have: let's say you have a vector store with a bunch of information.</p><p>[01:20:49] <strong>Alex Volkov:</strong> You have a user asking a question, and you want to augment the LLM's information because of the knowledge cutoff. And then you embed the user's query in some sort of embedding. We've talked about embeddings multiple times here on ThursdAI. You get some number back, and like Benjamin said, you get one embedding for the whole document or the whole query.</p><p>[01:21:08] <strong>Alex Volkov:</strong> You get just one, not per token. You get one embedding, and then you use that. 
And to compare, the usual similarity score is the way to compare these. Then if you want to go to advanced stuff, then you maybe do some re-ranking. Re-ranking is basically another model step, basically, right, Connor?</p><p>[01:21:28] <strong>Alex Volkov:</strong> Or maybe some model that does re-ranking for you: you retrieve multiple examples, and you choose which one fits better. And you can do this based on several things. The downside of this is, the bigger the documents you embed, the, um, less the concepts in this whole embedding are maybe similar to your query.</p><p>[01:21:47] <strong>Alex Volkov:</strong> And we've all talked about how this kind of similarity is very interesting, because the embedding definitely has dimensions, but it's hard to figure out how a huge document embeds into one vector; it, how should I say, averages everything that happens in there. And the benefit here of cold bear...</p><p>[01:22:06] <strong>Alex Volkov:</strong> Finally, I'm pronouncing this correctly: ColBERT. The benefit is that instead of embedding one time, it embeds per token. And am I getting this correctly? That sounds to me like a lot of compute. Is that correct? Embedding per token sounds... okay, now we can compare each token from the query to each token of the document.</p><p>[01:22:24] <strong>Alex Volkov:</strong> But is there significant overhead in terms of computation time and compute? What's the downside? It sounds better on the surface.</p><p>[01:22:32] <strong>Benjamin Clavie:</strong> So yeah,</p><p>[01:22:33] <strong>Alex Volkov:</strong> Go ahead, Benjamin, please. 
Yeah.</p><p>[01:22:35] <strong>Benjamin Clavie:</strong> Your clarification was quite clear in that, yeah, it's very clear. The problem with single-vector representations is: you've got a long document, and you're essentially asking the model to be like, I'm going to squeeze in every single thing there could be to know about this document into 500 floats or something, which is not a lot of space.</p><p>[01:22:54] <strong>Benjamin Clavie:</strong> But ColBERT takes more storage space, to answer your question; you will need to store more vectors, even though there are compression techniques, and we'll get into that later. But compute-wise, it's essentially the same, because when you're using any sort of transformer model, you'll be attending to every token anyway.</p><p>[01:23:09] <strong>Benjamin Clavie:</strong> The only difference is ColBERT actually stores those, instead of just averaging them at the end.</p><p>[01:23:15] <strong>Alex Volkov:</strong> Oh, so on the output of something like ColBERT, you actually get all of the [01:23:20] embeddings per token and not just one embedding for the whole document. And then it's like the storage is higher, but you can actually use those for better, higher-quality comparisons. That's what we're talking about here.</p><p>[01:23:33] <strong>Alex Volkov:</strong> Is that correct?</p><p>[01:23:35] <strong>Benjamin Clavie:</strong> That's the gist of it, yeah. And then after ColBERT you've got ColBERT v2 and PLAID, which is essentially Omar and team finding out that, yeah, that does take a lot of space, but can we compress the embeddings? So most of the time when you see ColBERT used in production, it actually compresses every single token vector to just one or two bits per dimension.</p><p>[01:23:56] <strong>Benjamin Clavie:</strong> So it doesn't take that much space.</p><p>[01:23:58] <strong>Alex Volkov:</strong> Oh, so ColBERT v2 is, what, a 10x size reduction or something in comparison, right? Something like this. Connor, can you speak about this? 
Cause obviously you're in the vector database space. The more folks host, the better it is for you guys, cause you get paid per token. Can you just speak about the size of this, and the improvement as well?</p><p>[01:24:20] <strong>Connor Shorten:</strong> There's a couple ways you can do this quantization. The most common is just to run k-means on the segments. You divide the vectors, and every two contiguous values you would then cluster, and then reduce the precision to, like, eight bits. So when you quantize the token vectors, you can take down the storage overhead a lot. But yeah, I think Benjamin already said it all.</p><p>[01:24:43] <strong>Alex Volkov:</strong> Okay, so now let me take this into the practical realm, because ColBERT, the original paper, came out in 2020, and I don't remember this off the top of my head, but I have some mental documentation here that I'm using to ask you guys the proper questions. And then ColBERT v2 came out with a significant compression of the data, because they quantize the actual individual embeddings, and performance is essentially the same, I assume.</p><p>[01:25:06] <strong>Alex Volkov:</strong> And that also came out a while ago, and then, Benjamin, I think you're responsible, single-handedly, for the resurrection, or the renewed interest, because all of what we're saying doesn't sound super easy to me. As somebody who... okay, it's super easy for me to use a vector database, like Weaviate, other competitors, local vector stores; they all have very simple tutorials for me to just embed the query, do a regular nearest-neighbor search or whatever, and then just do this for the user.</p><p>[01:25:34] <strong>Alex Volkov:</strong> Now, all of what we're talking about, embedding per token, token-level comparison, all of these things sound complex to me, and that's where RAGatouille comes in, correct? 
So can you talk about, you see all this happening, and then what's your library doing? Why is it responsible for the resurrection of this whole concept?</p><p>[01:25:53] <strong>Benjamin Clavie:</strong> Yeah, I don't know if I'd go as far as resurrection, but yeah, ColBERT is basically used by everyone who is quite aware of search, like pretty much every search startup; people at Google, etc. are using ColBERT. But it never got that big outside the power-user area, and the reason, I think, is something that Omar mentioned the other day: I wouldn't say ColBERT itself isn't usable, but it's not approachable.</p><p>[01:26:16] <strong>Benjamin Clavie:</strong> If you go look at the repo, it's scary. There's a lot of things: how do I store those vectors, et cetera. And the point of RAGatouille is trying to bridge that gap, because we are now at the point, I think, where AI has users that aren't like traditional AI power users, especially in IR. Vectors are complicated.</p><p>[01:26:33] <strong>Benjamin Clavie:</strong> Embeddings are complicated. And the point of RAGatouille was basically like, yeah, but what if you could use ColBERT in just, like, 4 lines of code? And I tried to build that, and it turned out to be quite easy to build, so that's how it came to be.</p><p>[01:26:46] <strong>Alex Volkov:</strong> So you built it, and it's quite easy for you. What is it? Is this like a library wrapper on top of the knowledge of how to run ColBERT in production? What is the library like? Is this the LangChain for ColBERT? Tell us what folks should expect when they open it up and say, okay, I need to use something like this.</p><p>[01:27:03] <strong>Alex Volkov:</strong> This is super interesting. This is higher-quality retrieval. How do I start?</p><p>[01:27:09] <strong>Benjamin Clavie:</strong> Yeah, so I think there's two things here: where I would like it to be, and where it currently is. 
Where I would like it to be is to keep adding more stuff and basically bridge the gap between what's popular in IR research, or retrieval, which is probably a few years ahead of what's actually popular in the mainstream, because it's quite obscure.</p><p>[01:27:26] <strong>Benjamin Clavie:</strong> And then what it is right now: when you open the tool, there's basically two main classes. One that you can use to fine-tune and train ColBERT models, and hopefully more late-interaction models, but right now it's just ColBERT. And it tries to abstract away all the hard stuff: there's a thing called hard negatives, when you're training for retrieval, and you need to mine for hard negatives, and that's done in the background.</p><p>[01:27:48] <strong>Benjamin Clavie:</strong> And then you've got the main one, which you can use to use ColBERT as a re-ranker, or use ColBERT to encode documents in memory, or use ColBERT to create an optimized ColBERT index, which does the compression, etc. So it's basically, yeah, give it your documents, it will process them, and then you end up with something you can play with.</p><p>[01:28:04] <strong>Alex Volkov:</strong> Just from the perspective of somebody that hasn't used this model so far: let's say I already have an existing vector database. Do I need to re-embed everything in there to start using ColBERT with RAGatouille? And is that what you mean by fine-tune, or is there an additional thing that's called fine-tune?</p><p>[01:28:20] <strong>Alex Volkov:</strong> 'Cause this is not like the LLM fine-tune that we've talked about here on ThursdAI multiple times. This is a different fine-tune. What are we fine-tuning? How long does it take? Does it need GPUs? If you don't mind, walk us through how easy this is for the user to do.</p><p>[01:28:36] <strong>Benjamin Clavie:</strong> Yeah, that's a good question. 
So it's actually quite similar to LLM fine-tunes, just on a much smaller scale, because you would actually be fine-tuning the model itself. There's another paper, Omar is everywhere in this, regardless, by Omar and team, called UDAPDR, which is actually a combination of using DSP, so the proto-DSPy,</p><p>[01:28:59] <strong>Benjamin Clavie:</strong> with ColBERT, to fine-tune ColBERT to any unseen domain. So for any new domain, you could technically get a much better retrieval model using that. Right now there's only one implementation; that's something we would like to have in RAGatouille. But yeah, the other question is, can you reuse your existing vectors with this?</p><p>[01:29:17] <strong>Benjamin Clavie:</strong> The answer is no, and that's quite annoying. And when I say fine-tune, I also mean you can fine-tune the model, but you can also just use ColBERT off the shelf and use that to embed your documents and create a new index. If I have to speak of the cons, I would say there's no vector DB, except Vespa, which I don't think qualifies as a modern vector DB in the sense we probably mean here, that can use ColBERT embeddings out of the box.</p><p>[01:29:41] <strong>Benjamin Clavie:</strong> I know there's interest; maybe Connor, you guys will support it at</p><p>[01:29:44] <strong>Connor Shorten:</strong> some point soon. Yeah, we're definitely working on it. I do think that you've maybe understated the contribution of RAGatouille. Before you did this, it was not easy to train your own ColBERT model, and it definitely wasn't something that we saw as frequently.</p><p>[01:30:03] <strong>Connor Shorten:</strong> Yeah, I think that you've definitely evangelized it. I don't necessarily agree that most people doing search were doing it this way. 
Maybe I've just opened a thing, but I think most people have been doing the kind of pooled-vectors thing, and this is very new. And yeah, we are working on adding it.</p><p>[01:30:22] <strong>Alex Volkov:</strong> From my perspective, just judging by the social feeds, I agree. Benjamin, without your work on it, I don't think I'd even have been interested. But I want to maybe ask Connor here as a follow-up. So RAGatouille, you see it blowing up; what piques your interest in how approachable this is?</p><p>[01:30:36] <strong>Alex Volkov:</strong> What does fine-tuning a ColBERT model mean for retrieval? You guys are researching every retrieval technology out there as much as possible, in order to bring this, obviously, to your users as well. Quality of retrieval is of very high importance, but so is storing these vectors in different vector databases.</p><p>[01:30:54] <strong>Alex Volkov:</strong> What do you see in RAGatouille, like, exploding, and how does this translate into people using RAG better?</p><p>[01:31:05] <strong>Connor Shorten:</strong> Yeah, I guess it definitely is just, I think, what I opened with: this kind of retrieve-and-re-rank, collapsing into the one thing. And I think Benjamin just explained it really well. I agree with you, Alex. 
I don't think I would have understood ColBERT as well as I do now if it wasn't for Benjamin and RAGatouille.</p><p>[01:31:21] <strong>Connor Shorten:</strong> So that's what I think. But under the hood, I think it's still like this re-ranking thing, where we still use the pooled vector and an HNSW search to surface the candidates, and then we'll now bring the other token vectors with it.</p><p>[01:31:35] <strong>Connor Shorten:</strong> And then, for Weaviate, that just means opening up, like having a more generic type [01:31:40] for how we store vectors; instead of just one vector, now we have this, like, an open interface, to still let you use the pooled vector, because pooled-vector embedding search is still very popular as well.</p><p>[01:31:51] <strong>Connor Shorten:</strong> The OpenAI embedding. I think the Matryoshka thing, maybe we could talk about that as well; I think that has some flavors of this. I'm not sure if it still has the same kind of hierarchy to it. But there's also, maybe I'm going off topic, but there's also a paper from DeepMind about semantic IDs.</p><p>[01:32:06] <strong>Connor Shorten:</strong> And so semantic IDs, they're like these hierarchical, discrete, quantized things, where, say you have three IDs and they're each eight bits, and the first one would be whether it's about sports or news or something like that. So this is definitely a newer thing, I would say.</p><p>[01:32:25] <strong>Connor Shorten:</strong> And I hope I answered the question. I think I just did a circle around it.</p><p>[01:32:28] <strong>Alex Volkov:</strong> No, you definitely did. I just want to touch on a concept that may not be familiar for folks here on the ThursdAI stage. Matryoshka embeddings came on my radar just recently, after OpenAI released their new embedding models. 
And one of the things they've added in their new embedding models is the ability to reduce dimensions via an API call.</p><p>[01:32:45] <strong>Alex Volkov:</strong> And people started thinking, hey, how did they do this? Usually, when you get an embedding model, you get a fixed number of dimensions. And then some folks started saying there was this paper on Matryoshka embeddings. Matryoshka, if you're not visualizing what this is, is like the Russian dolls thing, where one fits into another.</p><p>[01:33:00] <strong>Alex Volkov:</strong> And there's this paper, and I think the author of Matryoshka embeddings is on my radar as well. Maybe we'll get him on ThursdAI. It actually allows for significantly smaller embeddings, correct me if I'm wrong. And I think folks from Jina definitely talked about trying to train Matryoshka with some other stuff.</p><p>[01:33:17] <strong>Alex Volkov:</strong> So this is a new concept we haven't touched upon yet, but it could potentially be an additional competitor here. I want to scroll back real quick. We have Benjamin back. Benjamin, let's talk about the speed of this for larger documents. That's definitely what I learned about RAGatouille, but also about ColBERT: for larger documents...</p><p>[01:33:36] <strong>Alex Volkov:</strong> I saw something, I think from Omar, about millions of rows or something being significantly faster. Could you speak about the speed of this whole thing? Are we getting a significant improvement in speed? Why would a person who already has a setup consider switching to something like this?</p><p>[01:33:51] <strong>Alex Volkov:</strong> And let's talk about the seconds it takes to run through a bunch of documents 
to find similarities.</p><p>[01:33:59] <strong>Benjamin Clavie:</strong> Okay, so I did miss a few things, so it might have been said already, but there's a trade-off here, in that creating a ColBERT index, as in an optimized one using quantization, like Connor said, is quite slow, like pretty slow, because it has to run k-means on all your embeddings, etc. But the flip side of that is that once your documents are in an optimized index, querying is pretty much constant time: it doesn't matter if you've got 100 million documents or billions, it will take about 50-60 milliseconds. And that's because the indexing optimization step, I think, creates a bunch of centroids that you can use as a gateway to documents, to simplify things.</p><p>[01:34:40] <strong>Benjamin Clavie:</strong> So querying is pretty much constant, and that's a big pro of optimized ColBERT indexes. The flip side is that it also means adding to and deleting from a ColBERT index is very slow, because you need to recompute that. And I think there's space here for some sort of hybrid approach, like also using HNSW for smaller collections, because you don't need that sort of optimization if you've got, like, 10,000 documents or something.</p><p>[01:35:04] <strong>Alex Volkov:</strong> Interesting. Just for my understanding: this is very similar to pre-compilation of some stuff versus runtime execution. You're saying basically you can offload the compilation part, and your users will not suffer from this, right?</p><p>[01:35:20] <strong>Alex Volkov:</strong> You don't have to go and call different APIs for this. You precompile everything, and the benefit here is for larger indices, like significantly larger document stores. You're talking about millions or a hundred million or so. 
But then retrieval is almost real-time, like instant, in milliseconds.</p><p>[01:35:41] <strong>Alex Volkov:</strong> That's, I think, a crazy benefit for folks, especially in enterprises and different places. Yeah, I think it's a significant improvement over regular search and vector comparison. Connor, would you say so as well? Because you guys are in the business of vector comparison.</p><p>[01:36:00] <strong>Alex Volkov:</strong> Are you seeing a significant improvement in retrieval speed here?</p><p>[01:36:08] <strong>Connor Shorten:</strong> Yeah, I think the latency probably isn't too bad, because the way that I understand ColBERT, or Colbert, sorry, I would agree on ColBERT, is that you still have the top-100 search with HNSW, and that latency is pretty low. It's gonna be like five milliseconds at a million scale.</p><p>[01:36:25] <strong>Connor Shorten:</strong> That's like the most hand-wavy thing ever, but then you just bring these quantized vectors into memory to re-rank. It's way faster than the cross-encoder approach, where you're going to take those top 100 results, append them to the query, and send them to an inference container to get back the scores and sort them.</p><p>[01:36:39] <strong>Connor Shorten:</strong> So it's way faster than that. I think maybe one thing out of what you just said that I'd want to parse is, I don't think it's the same analogy as compile it versus compose it at runtime. It's maybe more so like an asynchronous kind of thing, where you can query the index that you currently have, and then in the background the index can start doing that k-means quantization.</p><p>[01:37:00] <strong>Connor Shorten:</strong> That's probably the slowest thing, as Benjamin just mentioned. 
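</p><p>Connor's two-stage flow, a cheap pooled-vector pass to surface candidates and then a late-interaction re-rank over just those candidates, can be sketched end to end. This is a toy sketch: random vectors, brute-force scoring standing in for HNSW, and made-up sizes throughout:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_DOCS = 64, 500

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Each document is a bag of token vectors; the pooled (mean) vector is what
# a normal single-vector index would store.
docs = [normalize(rng.normal(size=(rng.integers(20, 60), DIM))) for _ in range(N_DOCS)]
pooled = normalize(np.stack([d.mean(axis=0) for d in docs]))  # (N_DOCS, DIM)

def maxsim(query_vecs, doc_vecs):
    # Late-interaction score: best match per query token, summed.
    return (query_vecs @ doc_vecs.T).max(axis=1).sum()

def search(query_vecs, top_k=100, final_k=10):
    # Stage 1: pooled-vector similarity surfaces top_k candidates
    # (in production this would be an HNSW index, not a full scan).
    q_pooled = normalize(query_vecs.mean(axis=0))
    candidates = np.argsort(pooled @ q_pooled)[::-1][:top_k]
    # Stage 2: late-interaction re-ranking over just the candidates.
    scored = sorted(((maxsim(query_vecs, docs[i]), i) for i in candidates), reverse=True)
    return [i for _, i in scored[:final_k]]

query = docs[42][:8]           # a query built from document 42's own tokens
assert search(query)[0] == 42  # the true source document ranks first
```

<p>The re-rank touches only the 100 candidates, which is why, as Connor notes, it stays far cheaper than sending each candidate through a cross-encoder.</p><p>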
Like that quantizing of the token vectors. Now, I'm actually not familiar with the detail of exactly how many token vectors you're keeping per document, but let's say it's 512, right?</p><p>[01:37:14] <strong>Connor Shorten:</strong> And now you're going to be running k-means over each of those in parallel, and then you're also trying to multi-thread the per-segment codebook. So I think fitting that codebook is going to be your challenge. And then keeping that fresh, because these codebooks... if that's the way you're doing it... The thing about Matryoshka is, maybe you can get the quantized vectors out of the box with one of the embedding models, but the quantization schemes are pretty dependent on your data particularly. It's not like the embedding models that you get from the common APIs come with the codebooks.</p><p>[01:37:53] <strong>Connor Shorten:</strong> You have to fit these codebooks to your data. So I think the way to think about it would be that we can fit these codebooks asynchronously in the background, and you can query what you currently have, and then the updating and refreshing of the index can happen in a cycle kind of way.</p><p>[01:38:10] <strong>Alex Volkov:</strong> All right. I wanna maybe move towards... okay, let's say folks are interested in trying this. Benjamin, could you speak about how to start? Is RAGatouille the right start? I think you mentioned this briefly; I just want to return to it. Is this only significantly better for a large set of documents?</p><p>[01:38:28] <strong>Alex Volkov:</strong> What are the steps to getting started here, and what should people know? 
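</p><p>The per-segment codebook fitting Connor walks through can be sketched roughly as follows. This is an illustrative product-quantization-style sketch with arbitrary sizes (2-value segments, 16 centroids, plain Lloyd's k-means), not the actual PLAID or Weaviate compression scheme:</p>

```python
import numpy as np

def fit_codebooks(vecs, segment_dim=2, n_centroids=16, iters=10, seed=0):
    """Fit one k-means codebook per contiguous 2-value segment of the vectors."""
    rng = np.random.default_rng(seed)
    n, dim = vecs.shape
    codebooks = []
    for start in range(0, dim, segment_dim):
        seg = vecs[:, start:start + segment_dim]                  # (n, segment_dim)
        centroids = seg[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(iters):                                    # plain Lloyd's k-means
            dists = np.linalg.norm(seg[:, None] - centroids[None], axis=2)
            assign = dists.argmin(axis=1)
            for c in range(n_centroids):
                members = seg[assign == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        codebooks.append(centroids)
    return codebooks

def quantize(vecs, codebooks, segment_dim=2):
    """Replace each segment by the index of its nearest centroid (one byte here)."""
    codes = []
    for i, cb in enumerate(codebooks):
        seg = vecs[:, i * segment_dim:(i + 1) * segment_dim]
        codes.append(np.linalg.norm(seg[:, None] - cb[None], axis=2).argmin(axis=1))
    return np.stack(codes, axis=1).astype(np.uint8)               # (n, dim // segment_dim)

def dequantize(codes, codebooks):
    """Reconstruct approximate vectors by looking the codes back up."""
    return np.concatenate([cb[codes[:, i]] for i, cb in enumerate(codebooks)], axis=1)

# 1,000 token vectors of dim 64: 256,000 bytes as float32, versus
# 32,000 bytes of uint8 codes (plus the small codebooks), an 8x reduction.
rng = np.random.default_rng(1)
vecs = rng.normal(size=(1000, 64)).astype(np.float32)
codebooks = fit_codebooks(vecs)
codes = quantize(vecs, codebooks)
recon = dequantize(codes, codebooks)
```

<p>Fitting the codebooks is the slow, data-dependent part Connor is pointing at; once they exist, encoding a vector is just a nearest-centroid lookup per segment, and that is also why the codebooks go stale as the data changes.</p><p>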
And then I guess we'll ask about where to find you guys and how to keep up to date as developments around this area happen.</p><p>[01:38:43] <strong>Benjamin Clavie:</strong> So if you want to get started, I think RAGatouille is probably the easiest way to try ColBERT. We've got a few example notebooks on the GitHub repository. If you want to contribute more, please do. That's the big thing: I need more documentation, more notebooks. But you can try re-ranking, or indexing in memory, or building your index.</p><p>[01:39:01] <strong>Benjamin Clavie:</strong> You've got fine-tuning pretty much out of the box. So I'd say start there. In terms of retrieval performance, ColBERT is always a really strong performer in the existing IR literature, and we do have a re-ranker, so you can just try it out: just use it to re-rank before you commit to indexing your whole set of documents, just to see how it would perform for you.</p><p>[01:39:21] <strong>Benjamin Clavie:</strong> So that could be an easy way to slot it into any existing pipeline, basically: just retrieve documents, re-rank them, and see what the re-ranker does for you.</p><p>[01:39:29] <strong>Alex Volkov:</strong> And in that case, I think integration with existing libraries also exists, for folks who use LangChain or LlamaIndex. I saw that they also integrate at least some parts of this, correct?</p><p>[01:39:40] <strong>Benjamin Clavie:</strong> Yeah, and I do want to thank them for that, because they basically did this within 24 hours of me releasing RAGatouille. On LlamaIndex you can use ColBERT indexes, and on LangChain you can use ColBERT indexes and ColBERT's re-ranker as well. So if you already use LangChain, you can add an extra ColBERT step using [01:40:00] RAGatouille in three more lines of code, I think.</p><p>[01:40:02] <strong>Alex Volkov:</strong> Incredible. 
So for folks who are interested in trying out what the big dogs use for search: re-ranking, without committing, is a fairly easy way to get started with this and see if you get a significant performance improvement. And Connor, we barely touched on DSPy.</p><p>[01:40:19] <strong>Alex Volkov:</strong> I do want to have a conversation about it, because that's also all over my feed, and basically Omar is all over my feed. How does this all connect with DSPy, or does it? Because DSPy is for the prompts area, and this is more for the retrieval area. Where's the connection point that I'm missing, besides Omar being everywhere?</p><p>[01:40:39] <strong>Connor Shorten:</strong> I think that Omar being everywhere is maybe the biggest connection, because to me it's kinda like: DSPy is optimizing the LLM program prompt part. And then to have the optimization loop connect between that and the retrieval model, there's definitely works like Promptagator and InPars.</p><p>[01:40:59] <strong>Connor Shorten:</strong> Omar has, I think, UDAPDR, something like that, where you use the LLM to generate synthetic queries, and then you fine-tune the embedding model with that. So that would be where the connection would be. DSPy is like a synthetic data framework: you tell it what you want it to do, and it will use the LLMs to generate successful executions of the task, and then you use that to distill it to either small models, or to tune the prompts, or you could fine-tune an embedding model.</p><p>[01:41:25] <strong>Connor Shorten:</strong> I don't think it's quite there, but I think that would be pretty advantageous. Benjamin can take the mic from here.</p><p>[01:41:32] <strong>Benjamin Clavie:</strong> Yeah, I wouldn't say DSPy and ColBERT are directly related. They exist in the same space, but they're definitely very different tools. 
Like Connor mentioned, UDAPDR, which is the paper I mentioned, actually, where you use DSP, and hopefully soon DSPy, to fine-tune a ColBERT to any domain.</p><p>[01:41:50] <strong>Benjamin Clavie:</strong> Any domain it's never been exposed to before, and get it to a state-of-the-art result on that domain. That's a really good application of DSPy to ColBERT. And likewise, you can use ColBERT as a retriever in your DSPy pipeline, but it's just a component; it's not quite the DSPy thing.</p><p>[01:42:08] <strong>Connor Shorten:</strong> I do have something, though, that is very related to retrieval generally.</p><p>[01:42:12] <strong>Connor Shorten:</strong> We saw all these amazing LLM query-router things. I want to give LlamaIndex credit for evangelizing most of this stuff. So one example is, say you have the LLM pick a metadata filter to put on the vector search. Say you have an index of podcast clips and you want to search only where the speaker is Omar Khattab, and you have an LLM predict that filter, and then that would be in the retrieval engine.</p><p>[01:42:38] <strong>Connor Shorten:</strong> And so you have a prompt behind that; same with text-to-SQL. There's a prompt behind how we put these things around retrieval. And so DSPy can optimize the prompts, or optimize the models that do that, to get the maximum performance out. And I don't mean to say anything negative about the existing frameworks, but right now you're locking into the prompts that they have built into the framework to do these things, whereas DSPy opens it up to optimize it for your thing.</p><p>[01:43:06] <strong>Alex Volkov:</strong> Interesting. Yeah, I don't think it's negative necessarily. I think after using some of these frameworks people understand that, and we've seen this from multiple folks. 
They could potentially start with something like LlamaIndex or LangChain, and then quickly figure out that some more</p><p>[01:43:20] <strong>Alex Volkov:</strong> freedom is needed, and DSPy is a potential way to do that. Okay. Connor, anything else? Very interesting. So first of all, you have a bunch of great content on this. You recently did, I think it's at the top of your tweets; I'll definitely add this to the show notes as well.</p><p>[01:43:32] <strong>Alex Volkov:</strong> You did a deep dive into DSPy; was that on the podcast, or was it just a video? We'll definitely send folks there. Anything else you want to add on how to find you and where to find your content? Folks should definitely follow you; we'll add your links.</p><p>[01:43:48] <strong>Connor Shorten:</strong> Thanks, Alex. Yes, I have two podcasts right now: with Omar, of course, and then with Karel D'Oosterlinck, who created this Infer-Retrieve-Rank program. It's one of the coolest examples of DSPy. And then I have one video out so far explaining the whole thing. Quickly, I wanted to point people to the update to DSPy Assertions.</p><p>[01:44:05] <strong>Connor Shorten:</strong> Because I think this is the most important thing with these prompting frameworks. And I think it's important to also understand Instructor from Jason Liu, which is where you use Pydantic to define the schema of the outputs that you want from the language model, and then you validate the outputs to make sure that it outputted JSON with the keys that you wanted.</p><p>[01:44:23] <strong>Connor Shorten:</strong> And so DSPy Assertions is in this similar category, and this is the most common discussion I'm seeing in the DSPy Discord: people looking to add Instructor to DSPy, and jointly looking to do this thing of structured outputs and have this retry mechanism. 
There's a new work from Arnav Sing... oh, sorry, Arnav Singhvi.</p><p>[01:44:43] <strong>Connor Shorten:</strong> We haven't met yet, but it goes into more detail about DSPy Assertions. And I'm going to link it in the description of this chat, cause I highly recommend people check it out.</p><p>[01:44:50] <strong>Alex Volkov:</strong> Awesome. Nisten, just before I give you a question, I will shout out that Jason Liu from the Instructor library came to the Weights & Biases courses, and there's a course that he built with us as well that's free. You can just go to wandb.ai courses; I'll definitely add the link below. It's about structured output and how to force these LLMs to give us better structured output.</p><p>[01:45:09] <strong>Alex Volkov:</strong> It's funny that a person named Jason is building, you know, tools to get LLMs to output JSONs. But that's all I have. Just super quick, Nisten, go ahead. You had a question here.</p><p>[01:45:19] <strong>Nisten Tahiraj:</strong> I just want to say it's pretty amazing that the people we bring here are from the industry we actually use. Like, from last week, I started using Lilac; I might actually start running RAGatouille on that Hacker Neon dataset. And mainly, since some people ask in the comments what I have used: I forced myself to only use open-source models.</p><p>[01:45:45] <strong>Nisten Tahiraj:</strong> Cause I feel like that's the only way they're going to start getting better, if we restrict ourselves to them. I don't recommend you do it just yet; just wait maybe another week or two. But I wanted to ask: we see some limitations with retrieval augmentation systems, like in GPT-4 when people use it.</p><p>[01:46:07] <strong>Nisten Tahiraj:</strong> It only gives three points from the document, doesn't really summarize it and stuff. What are the benefits of going with ColBERT? Is it because it's much faster? Can you feed it many more documents? 
I'm talking from a practical point of view, not necessarily even from a tech person's point of view. Like, as a business who has a lot of customer data, why should they use this versus just putting it on pgvector and doing function calling?</p><p>[01:46:41] <strong>Nisten Tahiraj:</strong> Is this faster that way? And what limitations does using, again, RAGatouille with ColBERT</p><p>[01:46:47] <strong>Benjamin Clavie:</strong> have? That is a good and open question. So limitations: we have a lot right now. The lack of a cloud hosting offering is a big one; there's not really anywhere you can host this except doing it yourself, which is a big problem.</p><p>[01:47:05] <strong>Benjamin Clavie:</strong> And the main reason to use it, I would say, is generalization, because the thing with any of the off-the-shelf embedding models is they look good on benchmarks, and they tend to work quite well, but they've been optimized for those benchmarks. Whereas ColBERT, for instance, like ColBERT v2, has never been trained on the MTEB benchmark for retrieval, etc.</p><p>[01:47:24] <strong>Benjamin Clavie:</strong> The reason it generalizes well is because working at the token level makes it a lot easier for your model to encode information. Whereas when you're trying to squeeze everything into a single vector, it might very well not work well, say, for your custom domain. With ColBERT, you can generally assume it's going to be okay in every domain, and if it's not the best, you can fine-tune it later.</p><p>[01:47:45] <strong>Benjamin Clavie:</strong> That's probably the biggest draw, I'd say.</p><p>[01:47:51] <strong>Alex Volkov:</strong> Awesome. So I definitely wanna thank you guys for coming up and explaining these concepts that have been floating around, in very simple language. 
And I appreciate your patience with me re-asking things in the way that I understand, because I know this is definitely my way to understand, but also for some folks in the audience.</p><p>[01:48:06] <strong>Alex Volkov:</strong> That's how we do it here on ThursdAI, so you're more than welcome to rejoin. I now consider both of you friends of the pod, and I agree with Nisten: it's really cool to see the authors of the libraries and the tools that we use come here to ThursdAI to talk about them, [01:48:20] and obviously, upcoming features as well.</p><p>[01:48:22] <strong>Alex Volkov:</strong> Definitely welcome. Benjamin, thank you for doing a bunch of open source stuff, and evangelizing the whole ColBERT thing to make it simpler for folks. Definitely, thank you. And anything you want to add here that I haven't touched yet? Please go ahead, Benjamin.</p><p>[01:48:36] <strong>Benjamin Clavie:</strong> I do have a few shoutouts, shall we say. One of them is that LangChain and DSPy are not mutually exclusive, and I shared that in the chat. There is now a LangChain x DSPy integration, where you can define your chains in LangChain and still use DSPy to optimize things, which is pretty cool.</p><p>[01:48:53] <strong>Benjamin Clavie:</strong> And in the embedding world, you mentioned Matryoshka embeddings, and we talked about ColBERT, and the people at Jina are actually training a ColBERT model right now using Matryoshka embeddings for compression, as some sort of let's-try-this-out, see how it works. And the final one is, you might have brought this up already, but the people at BAAI trained BGE-M3, a really cool embedding model that in a single pass outputs a dense vector, a ColBERT-style multi-vector embedding, and a SPLADE-style sparse embedding. I won't go into too much detail about that,</p><p>[01:49:26] <strong>Alex Volkov:</strong> I'm sorry. I don't think I covered that. 
Who was that? Sorry. Could you repeat?</p><p>[01:49:31] <strong>Benjamin Clavie:</strong> The people at BAAI, the people who do the BGE</p><p>[01:49:34] <strong>Alex Volkov:</strong> Oh yeah, but yeah. We've talked about their model recently. They,</p><p>[01:49:37] <strong>Benjamin Clavie:</strong> BAAI, yeah,</p><p>[01:49:38] <strong>Alex Volkov:</strong> Oh, I did not know.</p><p>[01:49:39] <strong>Alex Volkov:</strong> So they now have a thing that outputs a regular embedding and also a ColBERT-style embedding.</p><p>[01:49:46] <strong>Benjamin Clavie:</strong> Yeah, the big thing last week was M3, which has a ColBERT-style embedding, a SPLADE-style embedding, which is a sparse retrieval method, and a dense embedding, all from a single model, a total of three.</p><p>[01:49:57] <strong>Alex Volkov:</strong> Oh, that's incredible. Okay. So we're adding some knowledge here. Thank you. Let me just repeat this the way that I hear it: we've talked about the BAAI BGE-M3. M3 basically stands for multiple things. One of them is multilinguality. So they upgraded their embeddings to support not only English, but also, I think, a hundred languages as well.</p><p>[01:50:14] <strong>Alex Volkov:</strong> So now Benjamin, you're saying they're also implementing for us this step, the output of the dense embedding, but also the ColBERT embedding, correct?</p><p>[01:50:25] <strong>Benjamin Clavie:</strong> Yeah, yeah, one of the meanings of M, I think, is</p><p>[01:50:27] <strong>Alex Volkov:</strong> Multi-composability or some... yeah. Multi-functionality. Yes, exactly.</p><p>[01:50:33] <strong>Benjamin Clavie:</strong> You can use it to generate different kinds of embeddings. And I think it's the first non-ColBERT, actually strong multi-vector model. There are issues, as in the vectors are too big, etc.</p><p>[01:50:45] <strong>Benjamin Clavie:</strong> But it's a very nice thing to see happen. 
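To make "three kinds of embedding from one model" concrete, here's a toy sketch, with made-up numbers rather than BGE-M3's real outputs, of how a dense vector and a SPLADE-style sparse token-weight map each score a query/document pair:

```python
import math

# Toy representations (hypothetical numbers): a model like BGE-M3 produces
# a dense vector, a sparse token->weight map, and per-token (ColBERT-style)
# vectors all in a single forward pass.
query_dense = [0.6, 0.8]
doc_dense = [0.8, 0.6]
query_sparse = {"colbert": 1.2, "embedding": 0.7}   # token -> learned weight
doc_sparse = {"colbert": 0.9, "retrieval": 0.4}

def cosine(a, b):
    # Dense scoring: cosine similarity between two pooled vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def sparse_score(q, d):
    # Sparse (SPLADE-style) scoring: sum weights of overlapping tokens,
    # so exact lexical matches are rewarded directly.
    return sum(w * d[t] for t, w in q.items() if t in d)

print(round(cosine(query_dense, doc_dense), 2))      # 0.96
print(round(sparse_score(query_sparse, doc_sparse), 2))  # 1.08
```

The dense score captures overall semantic closeness, while the sparse score only fires on shared tokens ("colbert" here); getting both from one pass is what makes an M3-style model convenient for hybrid retrieval.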
Definitely, like</p><p>[01:50:49] <strong>Alex Volkov:</strong> Oh, definitely a shout out then. We need to get the folks from BAAI here to speak about this. So if you folks know them, definitely connect them to me. I would love to hear about it from the authors of BGE. Yeah, definitely shout out Jina. I think Bo Wang, we've mentioned he's a friend of the pod.</p><p>[01:51:03] <strong>Alex Volkov:</strong> He came when Jina released embeddings, and he often comes here and gives us insights about how embeddings work. Shout out Bo and the team at Jina as well. Connor, your stage, if you want to add anywhere else folks can follow you, or shout out anything else. And then we're going to continue with some more news.</p><p>[01:51:21] <strong>Connor Shorten:</strong> It's been so cool to be a part of the podcast. And I love how it's integrated into X, because this is actually my favorite place to manage communication. So if you want to reach out, here would be great.</p><p>[01:51:31] <strong>Alex Volkov:</strong> Yeah. So definitely give Connor a follow, and the Weaviate podcast is incredible. We, and by we I mean Weights & Biases, had a mutual video together, and Connor hosted our folks. I learned a bunch from it before I joined Weights & Biases as well. A great source of information from both of you.</p><p>[01:51:45] <strong>Alex Volkov:</strong> Thank you guys so much for coming up and explaining these complex-on-the-surface concepts to us, maybe complex implementation-wise too, but making them simpler as well. I think it's very important to talk about them, and you are now considered friends of the ThursdAI community, and hopefully this will get more folks to learn about this, contribute, etc.</p><p>[01:52:05] <strong>Alex Volkov:</strong> And I think with that, we're a bit over the top, like two hours since I started the recording. We had a great show today. Thank you everybody for listening and coming. 
I just wanna summarize this with a few notes: I really enjoy my time here every week, and I really enjoy learning from folks. Nisten, you mentioned today that it's so cool to have the authors of the things we talked about.</p><p>[01:52:25] <strong>Alex Volkov:</strong> So today we also had this benefit. We had Benjamin here, and we had Connor, who covered this. And we also had Justin again from the Qwen team to talk about the Qwen stuff that they released. And it's really cool that the community now connects different people.</p><p>[01:52:36] <strong>Alex Volkov:</strong> So I was able to connect Justin and the Qwen team with the LM Studio folks and the Ollama folks. No, I think only LM Studio. And they were able to work together so that whatever they release is now supported in LM Studio the second they release it. So I love how this community comes together. I encourage everybody who listens to this to also participate.</p><p>[01:52:55] <strong>Alex Volkov:</strong> Either follow everybody who's on stage here, interact with our posts and boost the signal a little bit, or tell your friends: if you're working with friends and they don't listen to ThursdAI, and there's alpha in listening to ThursdAI like today, definitely tell your friends where this alpha can be found.</p><p>[01:53:10] <strong>Alex Volkov:</strong> And with that, I want to thank you all and have a nice Thursday. Bye bye, everyone.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-feb-8-google-gemini-ultra</link><guid isPermaLink="false">substack:post:141511433</guid><dc:creator><![CDATA[Alex Volkov, Connor Shorten, Benjamin Clavie, and Nisten]]></dc:creator><pubDate>Fri, 09 Feb 2024 01:25:42 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141511433/f2c711d41e38d903dd177f1d0356e14d.mp3" length="81978861" type="audio/mpeg"/><itunes:author>Alex Volkov, Connor Shorten, Benjamin Clavie, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6831</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141511433/12a049a8a8c515ed3f0037f65660f823.jpg"/></item><item><title><![CDATA[📖 ThursdAI - Sunday special on datasets classification & alternative transformer architectures]]></title><description><![CDATA[<p>Hello hello everyone, welcome to another special episode (some podcasts call them just.. episodes I guess, but here you get AI news every ThursdAI, and on Sunday you get the deeper dives) </p><p>BTW, I'm writing these words, looking at a 300 inch monitor that's hovering above my usual workstation in the Apple Vision Pro, and while this is an AI newsletter, and I've yet to find a connecting link (there's like 3 AI apps in there right now, one fairly boring chatbot, and Siri... don't get me started on Siri), I'll definitely be covering my experience in the next ThursdAI, because well, I love everything new and technological, AI is a huge part of it, but not the ONLY part! </p><p>📖 It's all about the (big) Datasets </p><p>OK, back to the matter at hand: if you've used, finetuned, trained or heard about an AI model, you may or may not realize how important the dataset the model was trained with is. 
We often talk of this model, that model, and often the only difference is additional data that folks (who I sometimes refer to as alchemists) have collected, curated and structured, and creating/curating/editing those datasets is an art and a science. </p><p>For example, three friends of the pod, namely <a target="_blank" href="https://x.com/ldjconfirmed/">LDJ</a> with Capybara, <a target="_blank" href="https://twitter.com/alignment_lab">Austin</a> with OpenChat and <a target="_blank" href="https://twitter.com/Teknium1/status/1752799892215673313">Teknium</a> with Hermes, have been consistently taking off-the-shelf open source models and making them smarter, more instruction tuned, better for specific purposes. These datasets are paired with different techniques as well; for example, lately the so-called DPO (Direct Preference Optimization) is a technique that has shown promise, since it not only shows a model which answer is correct for a specific query, it shows an incorrect answer as well, and trains the model to prefer one over the other. (see the recent <a target="_blank" href="https://x.com/argilla_io/status/1752760042607063351?s=20">Capybara DPO improvement by Argilla</a>, which improved model metrics across every evaluation)</p><p>These datasets can range from super high quality 16K rows, to millions of rows (Teknium's recently released Hermes, one of the higher quality datasets, comes in at just a tad over 1 million rows), and oftentimes it's an amalgamation of several other datasets into one.  
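The DPO idea described above can be written down compactly: the loss rewards the policy for increasing the log-probability margin of the chosen answer over the rejected one, relative to a frozen reference model. A toy, single-pair sketch (the log-prob numbers and beta are illustrative, and this omits the actual training loop):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    The policy is rewarded for widening the log-prob margin of the chosen
    answer over the rejected one, relative to a frozen reference model;
    beta controls how far the policy may drift from the reference."""
    margin = (policy_chosen_logp - ref_chosen_logp) - \
             (policy_rejected_logp - ref_rejected_logp)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log(sigmoid)

# Hypothetical log-probs: the policy already prefers the chosen answer
# a bit more than the reference does, so the loss dips below log(2).
loss = dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1)
print(loss < math.log(2))  # True
```

At a margin of zero the loss is exactly log(2); the gradient pushes the policy's chosen/rejected gap wider than the reference's, which is the "prefer one over the other" training signal in a single line of math.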
</p><p>In the case of Hermes, Teknium has compiled these 1 million chats from at least <a target="_blank" href="https://twitter.com/Teknium1/status/1752799892215673313">15 different datasets</a>, some his own, some by folks like <a target="_blank" href="https://sub.thursdai.news/p/jan14-sunday-special-deep-dives">Jon Durbin</a>, <a target="_blank" href="https://sub.thursdai.news/p/thursdai-llm-finetuning-deep-dive">Garage bAInd</a>, and shareGPT, from LMsys.org, which was compiled by scraping the very popular <a target="_blank" href="http://sharegpt.com">sharegpt.com</a> website, from folks who used the shareGPT extension to share their GPT-4 conversations. It's quite remarkable how much of these datasets is just conversations that users had with GPT-4! </p><p>Lilac brings Garden</p><p>With that backdrop of information, today on the pod we've got the co-founders of <a target="_blank" href="http://lilacml.com">Lilac</a>, <a target="_blank" href="https://twitter.com/nsthorat">Nikhil Thorat</a> and <a target="_blank" href="https://twitter.com/dsmilkov">Daniel Smilkov</a>, who came on to chat about the new thing they just released called Lilac Garden. </p><p>Lilac is an open source tool (you can find it<a target="_blank" href="https://github.com/lilacai/lilac"> RIGHT HERE</a>) which is built to help make dataset creation, curation and classification more science than art, and to help visualize the data, cluster it and make it easily available. In the case of Hermes, that means over a million rows of data.</p><p>On the pod, I talk with Nikhil and Daniel about the origin of what they both did at Google, working on Tensorflow.js and then something called "Know Your Data", and how eventually they realized that in this era of LLMs, open sourcing a tool that can understand huge datasets, run LLM based classifiers on top of them, or even train specific ones, is important and needed! 
</p><p>To strengthen the point, two friends of the pod (Teknium was in the crowd sending us 👍), LDJ and Austin (aka Alignment Lab) were on stage with us and basically said that "It was pretty much the dark ages before Lilac", since something like the OpenOrca dataset is a whopping 4M rows of text. </p><p>Visualizations in the Garden</p><p>So what does Lilac actually look like? Here's a quick visualization of the top categories of texts from OpenOrca's 4 million rows, grouped by category title and showing each cluster. So you can see here, translation requests make up 66% (around 200K rows) of the translation category, and you can scroll on and on and add filters and really dissect this whole thing up and down. </p><p>The categorization is created by running Lilac on your dataset, which uses embedding algorithms and other neat tricks to quickly chunk and put labels on the categories (AKA classifying them). </p><p>Btw, you can see this view and play around with it yourself, <a target="_blank" href="https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca&#38;query=%7B%7D&#38;viewPivot=true&#38;pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D">here</a></p><p>But running this on your own local machine can be a drag, and can take hours if not days for bigger datasets, including sometimes hanging and not even working 100%. So the Lilac folks created Lilac Garden, a hosted solution where you provide a dataset, and they classify something like 4M rows in 4-5 hours or so. </p><p>Which is definitely not possible on local machines. If you're into that kind of thing, again, Lilac is open source, so you don't have to sign up or pay them, but if speed and this view matter to you, definitely check Lilac out! 
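The kind of clustering described above can be approximated as: embed every row, group nearby embeddings, then title each group. Here's a toy sketch using made-up 2-D "embeddings" and a minimal k-means, standing in for a real embedding model and Lilac's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are text embeddings: two blobs standing in for, say,
# "translation requests" and "math questions" (toy data, not Lilac output).
blob_a = rng.normal(loc=(0, 0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.1, size=(50, 2))
points = np.vstack([blob_a, blob_b])

def kmeans(points, k=2, iters=10):
    centroids = points[:k].copy()
    for _ in range(iters):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        for i in range(k):
            centroids[i] = points[labels == i].mean(axis=0)
    return labels

labels = kmeans(points)
# Each blob ends up entirely in its own cluster, ready to be titled.
print(len(set(labels[:50])), len(set(labels[50:])))  # 1 1
```

The expensive parts Lilac Garden offloads are the embedding passes and the cluster titling (done with an LLM) over millions of rows, not this toy grouping step.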
</p><p>RWKV with Eugene (PicoCreator)</p><p>On the news segment of ThursdAI we mentioned Eagle, the 5th version of RWKV, an attention-free, potential alternative to Transformers that's being developed fully in the open. Later in the show we had the honor of having PicoCreator, one of the front-running folks in the RWKV effort, an attempt to see if Transformers can be beaten with a different type of architecture (an RNN) that doesn't require the attention mechanism, which brings the problem of quadratic attention scaling, making LLMs hard and expensive to run the more context is provided. </p><p>Eugene had some technical issues and joined in the middle of the pod, so we didn't have a full deep-dive. However, I figured it's important to bring this info to you guys, as these efforts may yield AI that runs 10-100x cheaper and potentially faster on devices, using almost infinite context lengths. </p><p>RWKV and other attempts like StripedHyena (from Together AI) and Mamba (from Tri Dao) are worth watching, as they may supersede or join with Transformers to create the next jump in LLM capabilities.</p><p>That's all for this Sunday. Needless to say, with the Vision Pro releasing on a Friday, it's been a full weekend of future exploration, which is the main driver in my personal life! </p><p>P.S. - if you read through to here, you get a gift! A teaser: I have done something different on the pod and recorded a human interest x AI podcast for the first time. I mostly bring the news and sometimes deep dives like this one, but this story I couldn't ignore, so stay tuned if you're into dating x AI and how technology disrupts our lives, and whether this is all moral or not, as I recorded an episode with Sasha Jadan and his new fiancee Karina, whom his AI bot picked out for him after swiping and matching with over 5200 girls on Tinder. The AI also... suggested he propose, which he did. It was a very interesting conversation that I plan to upload soon! 
</p><p>That's it from me this week, see you all on ThursdAI and don't forget, if you liked this, do me a solid, listen to the pod and then leave a review or a 5 star (at least a 4?) on Apple podcasts 🙏 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sunday-special-on-datasets</link><guid isPermaLink="false">substack:post:141380659</guid><dc:creator><![CDATA[Alex Volkov and Daniel Smilkov]]></dc:creator><pubDate>Mon, 05 Feb 2024 01:15:10 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141380659/452089ca0625421a2e2e4c63947b93cb.mp3" length="36450467" type="audio/mpeg"/><itunes:author>Alex Volkov and Daniel Smilkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3037</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141380659/d25acb99594a6d2dd047dfa3d7399135.jpg"/></item><item><title><![CDATA[ThursdAI - Feb 1, 2024- Code LLama, Bard is now 2nd best LLM?!, new LLaVa is great at OCR, Hermes DB is public + 2 new Embed models + Apple AI is coming 👀]]></title><description><![CDATA[<p>TL;DR of all topics covered + Show notes</p><p>* <strong>Open Source LLMs</strong></p><p>* Meta releases Code-LLama 70B - 67.8% HumanEval (<a target="_blank" href="https://twitter.com/AIatMeta/status/1752013879532782075">Announcement</a>, <a target="_blank" href="https://huggingface.co/codellama/CodeLlama-70b-Instruct-hf">HF instruct version</a>, HuggingChat, Perplexity)</p><p>* Together added function calling + JSON mode to Mixtral, Mistral and CodeLLama</p><p>* RWKV (non transformer based) Eagle-7B - (<a target="_blank" href="https://x.com/RWKV_AI/status/1751797147492888651?s=20">Announcement</a>, <a target="_blank" 
href="https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2">Demo</a>, <a target="_blank" href="https://twitter.com/Yampeleg/status/1751850391480721693">Yam's Thread</a>)</p><p>* Someone leaks Miqu, Mistral <a target="_blank" href="https://x.com/altryne/status/1752748034180481533?s=20">confirms</a> it's an old version of their model</p><p>* Olmo from Allen Institute - fully open source 7B model (Data, Weights, Checkpoints, Training code) - <a target="_blank" href="https://allenai.org/olmo">Announcement</a></p><p>* <strong>Datasets & Embeddings</strong></p><p>* Teknium open sources Hermes dataset (<a target="_blank" href="https://twitter.com/Teknium1/status/1752799124775374928">Announcement</a>, <a target="_blank" href="https://huggingface.co/datasets/teknium/OpenHermes-2.5">Dataset</a>, <a target="_blank" href="https://lilacai-lilac.hf.space/datasets#lilac/OpenHermes-2.5">Lilac</a>)</p><p>* Lilac announces Garden - LLM powered clustering cloud for datasets (<a target="_blank" href="https://twitter.com/lilac_ai/status/1752361374640902402">Announcement</a>)</p><p>* BAAI releases BGE-M3 - Multi-lingual (100+ languages), 8K context, multi functional embeddings (<a target="_blank" href="https://twitter.com/BAAIBeijing/status/1752182391983280248">Announcement</a>, <a target="_blank" href="https://github.com/FlagOpen/FlagEmbedding">Github</a>, <a target="_blank" href="https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/BGE_M3/BGE_M3.pdf">technical report</a>)</p><p>* Nomic AI releases Nomic Embed - fully open source embeddings (<a target="_blank" href="https://twitter.com/nomic_ai/status/1753082063048040829">Announcement</a>, <a target="_blank" href="https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf">Tech Report</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Bard with Gemini Pro becomes 2nd LLM in the world per LMsys beating 2 out of 3 GPT4 (<a target="_blank" 
href="https://twitter.com/lmsysorg/status/1750921228012122526">Thread</a>)</p><p>* OpenAI launches GPT mention feature, it's powerful! (<a target="_blank" href="https://x.com/altryne/status/1752755823212667084?s=20">Thread</a>)</p><p>* <strong>Vision & Video</strong></p><p>* 🔥 LLaVa 1.6 - 34B achieves SOTA among open source vision models (<a target="_blank" href="https://twitter.com/imhaotian/status/1752621754273472927">X</a>, <a target="_blank" href="https://llava-vl.github.io/blog/2024-01-30-llava-1-6/">Announcement</a>, <a target="_blank" href="https://llava.hliu.cc/">Demo</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Argmax releases WhisperKit - super optimized (and on device) Whisper for iOS/Macs (<a target="_blank" href="https://twitter.com/reach_vb/status/1752434666659889575">X</a>, <a target="_blank" href="https://www.takeargmax.com/blog/whisperkit">Blogpost</a>, <a target="_blank" href="https://github.com/argmaxinc/WhisperKit">Github</a>)</p><p>* <strong>Tools</strong></p><p>* Infinite Craft - Addicting concept combining game using LLama 2 (<a target="_blank" href="https://t.co/tb51ejeqsK">neal.fun/infinite-craft/</a>)</p><p></p><p>Haaaapy first of the second month of 2024 folks, how was your Jan? Not too bad I hope? We definitely got quite a show today; the live recording turned into a procession of breaking news, authors who came up, deeper interviews and of course... 
news.</p><p>This podcast episode focuses only on the news, but you should know that we had deeper chats with Eugene (<a target="_blank" href="https://twitter.com/picocreator">PicoCreator</a>) from RWKV, and a deeper dive into the dataset curation and segmentation tool called Lilac, with founders <a target="_blank" href="https://twitter.com/nsthorat">Nikhil</a> & Daniel, and also, we got a breaking news segment where folks joined us to talk about the latest open source from AI2 👏</p><p>Besides that, oof, what a week. It started out with the news that the new Bard API (apparently with Gemini Pro + internet access) is now the 2nd best LLM in the world (according to LMSYS at least), then there was the whole thing with Miqu, which turned out to be, yes, a leak of an earlier version of a Mistral model, which they acknowledged, and finally the release of LLaVa 1.6 to become the SOTA of open source vision models was very interesting!</p><p></p><p>Open Source LLMs</p><p>Meta releases CodeLLama 70B</p><p>Benches 67.8% on HumanEval (without fine-tuning), is already available on HuggingChat, Perplexity, TogetherAI, quantized for MLX on Apple Silicon, and has several finetunes, including <a target="_blank" href="https://x.com/rishdotblog/status/1752329471867371659?s=20">SQLCoder which beats GPT-4 on SQL</a></p><p>Has a 16K context window, and is one of the top open models for code</p><p>Eagle-7B RWKV based model</p><p>I was honestly a bit disappointed by the multilingual performance compared to the 1.8B StableLM, but the folks on stage told me not to compare this in a traditional sense to a transformer model, rather to look at the potential here. So we had Eugene from the RWKV team join on stage and talk through the architecture, the fact that RWKV is the first AI model in the Linux Foundation and will always be open source, and that they are working on bigger models! 
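The architectural draw of RWKV-style models is that they process a sequence with a fixed-size recurrent state instead of attention's all-pairs token comparison, so compute grows linearly with context rather than quadratically. A toy sketch of that property (using an illustrative EMA "model", not the actual RWKV update rule):

```python
def rnn_forward(tokens, update, state=0.0):
    """Process a sequence with a constant-size state: each step folds one
    token into the state, so compute is O(n) and memory is O(1) in sequence
    length -- the property RWKV-style models exploit for long contexts,
    versus attention's O(n^2) all-pairs token comparisons."""
    for t in tokens:
        state = update(state, t)
    return state

# Toy "model": an exponential moving average over token values.
# (Illustrative only -- not how RWKV actually mixes tokens.)
ema = lambda s, x: 0.9 * s + 0.1 * x

# The state stays a single number no matter how long the context gets.
print(round(rnn_forward([1.0] * 100, ema), 3))  # 1.0
```

Because nothing in the loop depends on how many tokens came before, the same constant-memory step handles 1K or 1M tokens of context, which is where the "almost infinite context" claims come from.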
That interview will be released soon</p><p>Olmo from AI2 - new fully open source 7B model (announcement)</p><p>This announcement came as breaking news; I got a tiny ping just before Nathan dropped a magnet link on X, and then they followed up with the Olmo release and announcement.</p><p>A fully open source 7B model, including checkpoints, weights, Weights & Biases logs (coming soon), the dataset (Dolma) and just... everything you can ask about this model, they said they will tell you. Incredible to see how open this effort is, and kudos to the team for such transparency.</p><p>They also released a 1B version of Olmo, and you can read the technical report <a target="_blank" href="https://allenai.org/olmo/olmo-paper.pdf">here</a></p><p>Big CO LLMs + APIs</p><p>Mistral handles the leak rumors</p><p>This week the AI twittersphere went ablaze again, this time over an incredibly dubious (quantized only) version of a model that performed incredibly well on benchmarks, which nobody expected, called MIQU, and I'm not linking to it on purpose, and it started a set of rumors that maybe this was a leaked version of Mistral Medium. Remember, Mistral Medium was the 4th best LLM in the world per LMSYS, and it was rumored to be a Mixture of Experts, just larger than Mistral's 8x7B.</p><p>So things didn't add up, and they kept not adding up, as folks speculated that this is a LLama 70B vocab model etc., and eventually this drama came to an end, when Arthur Mensch, the CEO of Mistral, did the thing Mistral is known for, and just acknowledged that the leak was indeed an early version of a model they trained, super quick, once they got access to their cluster, and that it indeed was based on LLama 70B, which they have since stopped using.</p><p>Leaks like this suck, especially for a company that ... 
gives us the 7th best LLM in the world, completely Apache 2 licensed, and it really shows that they dealt with this leak with honor!</p><p>Arthur also proceeded to do a very Mistral thing and opened a pull request to the Miqu HuggingFace readme with an attribution that looks like this, with the comment <strong>"Might consider attribution"</strong> 🫳🎤</p><p>Bard (with Gemini Pro) beats all but the best GPT4 on lmsys (and I'm still not impressed, help)</p><p>This makes no sense, and yet, here we are. A definitely new version of "Bard (with Gemini Pro)", as they call it, on the arena from January 25, is now better than most other models, and it could potentially be because it has internet access?</p><p>But so does Perplexity, and it's nowhere close, which is weird, and it was a weird result that got me and the rest of the team in the ThursdAI green room chat talking for hours! Including getting folks who usually don't reply to reply 😆 It's been a great conversation, and where we finally left off is: Gemini Pro is decent, but I personally don't think it beats GPT4. However, most users don't care about which model serves what, rather which of the 2 choices LMSYS has shown them answered what they asked. And if that question has Google search power behind it, that's likely one of the reasons people prefer it.</p><p>To be honest, when I tried the LMSYS version of Bard, it showed me a 502 response (which I don't think they include in the ELO score 🤔), but when I tried the updated Bard for a regular task, it performed worse (<a target="_blank" href="https://twitter.com/altryne/status/1704154333544382614">in my case</a>) than a 1.6B parameter model running locally.</p><p>Folks from Google replied and said that it's not that the model is bad, it's that I used a person's name, and the model just.. refused to answer. 
😵‍💫 When I removed the last name it did perform OK, nowhere near GPT 4 though.</p><p>In other news, they updated Bard once again today, with the ability to draw images. And again, and I'm sorry if this turns out to be a negative review but, again, Google, what's going on?</p><p>The quality of this image generation is subpar, at least to me and other folks. I'll let you judge which image was created with Imagen (and trust me, I cherry picked) and which one was DALL-E, for the same exact prompt</p><p>This week's Buzz (What I learned with WandB this week)</p><p>Folks, the growth ML team in WandB (aka the team I'm on, the best WandB team, duh) is going live!</p><p>That's right, we're going live on Monday, 2:30 PM Pacific, on all our socials (<a target="_blank" href="https://twitter.com/weights_biases">X</a>, <a target="_blank" href="https://www.linkedin.com/company/wandb/">LinkedIn</a>, <a target="_blank" href="https://www.youtube.com/@WeightsBiases">Youtube</a>) as I'm hosting my team, and we do a recap of a very special week in December, a week where we paused other work and built LLM powered projects for the company!</p><p>I really wanted to highlight the incredible projects, struggles, challenges and learnings of what it takes to take an AI idea and integrate it, even for a company our size that works with AI often, and I think it's going to turn out super cool, so you all are invited to check out the live stream!</p><p>Btw, this whole endeavor is an initiative by yours truly, not some boring corporate thing I was forced to do, so if you like the content here, join the live and let us know how it went!</p><p>OpenAI releases a powerful new feature, @mentions for GPTs</p><p>This is honestly so great, and it went under the radar for many folks, so I had to record a video to explain why this is awesome. You can now @mention GPTs from the store, and they will get the context of your current conversation; you no longer need to switch between GPT windows.</p><p>This 
opens the door for powerful combinations, and I show some in the video below:</p><p>Apple is coming to AI</p><p>Not the Apple Vision Pro, that's coming tomorrow and I will definitely tell you how it is! (I am getting one and am very excited, it better be good)</p><p>No, today on the Apple earnings call, Tim Cook finally said the word AI, and said that they are incredibly excited about this tech, and that we'll get to see something from them this year.</p><p>Which makes sense, given the MLX stuff, the Neural Engine, ML-Ferret and the tons of other stuff we've seen from them this year; Apple is definitely going to step in, in a big way!</p><p>Vision & Video</p><p>LLaVa 1.6 - SOTA in open source VLM models! (<a target="_blank" href="https://llava.hliu.cc/">demo</a>)</p><p>Wow, what a present we got from <strong>Haotian Liu</strong> and the folks at LLaVa: they upgraded the LLaVa architecture and released a few more models, ranging from 7B to 34B, and created the best open source, state of the art vision models! It's significantly better at OCR (really, give it a go, it's really impressive) and they exchanged the LLM backbone with Mistral and Hermes Yi-34B.</p><p>* Better OCR and higher res</p><p>* Uses several bases like Mistral and NousHermes 34B</p><p>* Uses LMSYS's SGLang for faster responses (which we covered a few weeks ago)</p><p>* <strong>SoTA Performance!</strong> LLaVA-1.6 achieves the best performance compared with open-source LMMs such as <a target="_blank" href="https://github.com/THUDM/CogVLM">CogVLM</a> or <a target="_blank" href="https://huggingface.co/01-ai/Yi-VL-34B">Yi-VL</a>. Compared with commercial ones, it catches up to Gemini Pro and outperforms <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">Qwen-VL-Plus</a> on selected benchmarks.</p><p>* <strong>Low Training Cost</strong>. LLaVA-1.6 is trained with 32 GPUs for ~1 day, with 1.3M data samples in total. 
The compute / training data cost is 100-1000 times smaller than others.</p><p>Honestly it's quite stunningly good; however, it does take a lot more GPU memory due to the resolution changes they made. Give it a try in this online <a target="_blank" href="https://llava.hliu.cc/">DEMO</a> and tell me what you think.</p><p>Tools</p><p>Infinite Craft Game (<a target="_blank" href="https://twitter.com/nealagarwal/status/1752716375255101440">X</a>, <a target="_blank" href="https://neal.fun/infinite-craft/">Game</a>)</p><p></p><p>This isn't a tool, but an LLM based little game that's so addicting, I honestly didn't have time to keep playing it, and it's super simple. I especially love this, as it uses LLama and I don't see how something like this could have been scaled without AI before, and the UI interactions are so ... tasty 😍</p><p>All right folks, I can go on and on, but truly, listen to the whole episode, it really was a great one, and stay tuned for the special Sunday deep dive episode with the folks from Lilac, featuring our conversation about RWKV.</p><p>If you scrolled all the way until here, send me the 🗝️ emoji somewhere in DM so I'll know that there's at least one person who read this through, leave a comment and tell 1 friend about ThursdAI! </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-31-2024-code-llama-bard</link><guid isPermaLink="false">substack:post:141297426</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 02 Feb 2024 01:50:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141297426/bb80a5d6cbb91d39526f3f7e4d693162.mp3" length="79281591" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4955</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141297426/ba26a0b84b57ed3d45ddde3a3ea3cc43.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Sunday special on Merging with Maxime LaBonne]]></title><description><![CDATA[<p>Hey everyone, we have an exciting interview today with Maxime Labonne. </p><p>Maxime is a senior <a target="_blank" href="https://www.linkedin.com/in/maxime-labonne/">Machine Learning Scientist at JPMorgan</a>, the author of the <a target="_blank" href="https://packt.link/a/9781804617526">Hands on GNNs</a> book and his own <a target="_blank" href="https://mlabonne.github.io/blog/">ML Blog</a>, creator of <a target="_blank" href="https://twitter.com/maximelabonne/status/1743643451848093941">LazyMergeKit</a> (which we cover on the pod), and holds a PhD in Artificial Intelligence from the Institut Polytechnique de Paris.
</p><p>Maxime has been mentioned on ThursdAI a couple of times before, as he released the first Phi mixture-of-experts, and has previously finetuned OpenHermes using DPO <a target="_blank" href="https://towardsdatascience.com/fine-tune-a-mistral-7b-model-with-direct-preference-optimization-708042745aac">techniques which resulted in NeuralChat7B</a> </p><p>For the past couple of months, following AI on X, it was hard not to see Maxime's efforts show up on the timeline, and one of the main reasons I invited Maxime to chat was the release of NeuralBeagle7B, which at the time of writing was the top-performing 7B model on the LLM leaderboard, and was specifically a merge of a few models. </p><p>Model merging</p><p>Model merging has been around for a while but has recently been heating up, and Maxime has a lot to do with that: when he recently checked, his wrapper on top of <a target="_blank" href="https://github.com/cg123/mergekit">MergeKit</a> by Charles Goddard (which is the library that put model merging into the mainstream), called <a target="_blank" href="https://x.com/maximelabonne/status/1743643451848093941?s=20">LazyMergeKit</a>, was responsible for >50% of the merged models on the HuggingFace hub leaderboard. Maxime also authored a <a target="_blank" href="https://huggingface.co/blog/mlabonne/merge-models">model merging blogpost</a> on Hugging Face and wrote quite a few articles and shared code that helped others to put merged models out.
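For intuition, the simplest of these algorithms, a linear (weighted-average) merge over checkpoints that share an architecture, can be sketched in a few lines. This is a toy illustration over plain Python dicts of floats, not MergeKit's actual implementation:

```python
def linear_merge(state_dicts, weights=None):
    """Weighted-average merge of model checkpoints.

    Each checkpoint maps parameter names to flat lists of floats; all
    checkpoints must share the same keys and shapes. This mirrors the
    idea behind a linear merge, not MergeKit's real code.
    """
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n  # default: equal-weight average
    merged = {}
    for name in state_dicts[0]:
        params = [sd[name] for sd in state_dicts]
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params))
            for i in range(len(params[0]))
        ]
    return merged

# Two hypothetical "finetunes" of the same base, merged 50/50:
model_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
model_b = {"layer.weight": [3.0, 2.0], "layer.bias": [1.0]}
merged = linear_merge([model_a, model_b])  # layer.weight -> [2.0, 2.0]
```

Fancier methods like SLERP, TIES or DARE differ in how they combine the tensors, but the shape of the operation is the same: weights in, weights out, no gradient steps.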
</p><p>Modern-day Alchemy</p><p>This <a target="_blank" href="https://huggingface.co/blog/mlabonne/merge-models">blogpost</a> is a great resource on what model merging actually does, so I won't go into depth on what the algorithms are; please refer to that if you want a deep dive. In a nutshell, model merging is a technique that applies algorithms to the weights of a few models, even a few instances of the same model (like Mistral7B), and creates a new model that often performs better than the previous ones, without additional training! </p><p>Since this is algorithmic, it doesn't require beefy GPUs burning power to keep training or finetuning, and since the barrier of entry is very low, we get some cool and crazy results as you'll see below. </p><p>Crazy as it sounds, this method can also create models of non-standard sizes, like 10B or 120B models, since it's slicing pieces of other models and stitching them together in new ways. </p><p>If you recall, we had a deep dive with Jon Durbin who released Bagel, and Jon specifically mentioned that he created Bagel (based on Everything Everywhere All at Once) as a good base for merges that will include all the prompt formats; you can read and listen to that episode <a target="_blank" href="https://sub.thursdai.news/p/jan14-sunday-special-deep-dives">here</a></p><p>This merge frenzy made HuggingFace change the leaderboard and add a checkbox that hides model merges, because they were flooding the leaderboard and often require much less effort than actually pre-training or even finetuning a model.</p><p>Quite often the top of the leaderboard was overrun with model merges, like in this example of Bagel and its merges by CloudYu (which are not the top ones but still in the top 10 as I write this) </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication.
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>Why does it work? </p><p>Nisten summarized this pretty well in <a target="_blank" href="https://x.com/nisten/status/1741560927906906538?s=20">this</a> now famous copypasta tweet, and I've confirmed with Maxime that this is his current understanding as well: it's quite unclear why this seems to perform so well, but that of course doesn't stop the "folks who look for AI Waifus" from merging.</p><p>It's even led folks like Nathan Lambert from <a target="_blank" href="https://www.interconnects.ai/">interconnects.ai</a> to start paying attention even though he didn't want to! (Still waiting on your writeup Nathan!) </p><p>UPDATE: As of today, Monday Jan 29th, <a target="_blank" href="https://substack.com/profile/10472909-nathan-lambert">Nathan Lambert</a> just released a super comprehensive deep dive into merges, which you can read here 👇👏</p><p>YALL + Automated LLM Evaluation</p><p>Maxime has also worked on so many models of his own that he built a convenient little leaderboard to track their performance, which he called YALL, <a target="_blank" href="https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard">Yet Another LLM Leaderboard</a>, and it's on HuggingFace. You can see that NeuralBeagle is the top dog (sorry, I literally could not resist) </p><p>It uses the Nous evaluations, and Maxime has created an automation called LLM AutoEval that makes it really simple to run evaluations, which you can do in a Colab super easily. </p><p><a target="_blank" href="https://github.com/mlabonne/llm-autoeval">LLM AutoEval</a> is on Github. </p><p>Merge-aology!
</p><p>Since chatting, Maxime has released a Colab and later a HuggingFace space that takes model names and shows the genealogy, nay, Merge-aology of the models: which models each was merged from. It's pretty crazy how deep this rabbit hole goes, and crazier still that these models perform very well after all of these lobotomies! </p><p>Try it out here: <a target="_blank" href="https://huggingface.co/spaces/mlabonne/model-family-tree">https://huggingface.co/spaces/mlabonne/model-family-tree</a></p><p>I really hope you enjoy this special deep dive, I definitely learned a BUNCH from this conversation with Maxime, and I'm very happy that he came on! </p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/merge-deepdive-maxime-labonne</link><guid isPermaLink="false">substack:post:141014996</guid><dc:creator><![CDATA[Alex Volkov and Maxime Labonne]]></dc:creator><pubDate>Sun, 28 Jan 2024 17:30:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141014996/81ff0ba4920b77e2020a8d4cb562a835.mp3" length="34089116" type="audio/mpeg"/><itunes:author>Alex Volkov and Maxime Labonne</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>2130</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141014996/6b39411248293d55321fd45a6e09c676.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Jan 24 - ⌛Diffusion Transformers,🧠 fMRI multimodality, Fuyu and Moondream1 VLMs, Google video generation & more AI news]]></title><description><![CDATA[<p>What A SHOW folks, I almost don't want to write anything in the newsletter to MAKE you listen haha, but I will, as I know many of you don't like listening to me babble.
</p><p>But if you choose one episode to listen to instead of just skimming the show-notes, make it this one. </p><p>We had 2 deep dives: one into the exciting world of multi-modality, where we chatted with the creator of Moondream1, Vik, and the co-founders of Prophetic, Wes and Eric, about their EEG/fMRI multimodal transformer (that's right!), and then a DEEP dive into the new Hourglass Diffusion Transformers with Tanishq from MedArc/Stability. </p><p>More than <strong>1300</strong> tuned in to the live show 🔥 and I got some incredible feedback on the fly, which I cherish, so if you have friends who don't already know about ThursdAI, why not share this with them as well? </p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Stability AI releases StableLM 1.6B params (<a target="_blank" href="https://x.com/StabilityAI/status/1748492613261680810?s=20">X</a>, <a target="_blank" href="https://stability.ai/news/introducing-stable-lm-2?utm_source=X&#38;utm_medium=website&#38;utm_campaign=blog">Blog</a>, <a target="_blank" href="https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/stablelm-2-zephyr-1_6b-Q5_K_M.gguf">HF</a>)</p><p>* InternLM2-Math - SOTA on math LLMs (90% GPT4 perf.)
(<a target="_blank" href="https://x.com/OpenMMLab/status/1749792525488140582?s=20">X</a>, <a target="_blank" href="https://huggingface.co/spaces/internlm/internlm2-math-7b">Demo</a>, <a target="_blank" href="https://github.com/InternLM/InternLM-Math?tab=readme-ov-file">Github</a>)</p><p>* MedArc analysis of the best open source models for medical research finds Qwen-72 to be the best open source doctor (<a target="_blank" href="https://x.com/iScienceLuvr/status/1750506376034734325?s=20">X</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google teases LUMIERE - incredibly powerful video generation (TTV and ITV) (<a target="_blank" href="https://x.com/hila_chefer/status/1749972797537796353?s=20">X</a>, <a target="_blank" href="https://lumiere-video.github.io/">Blog</a>, <a target="_blank" href="https://arxiv.org/abs/2401.12945">ArXiv</a>)</p><p>* 🤗 HuggingFace announces Google partnership (<a target="_blank" href="https://twitter.com/_philschmid/status/1750520073402630428">Announcement</a>)</p><p>* OpenAI releases 2 new embedding models, tweaks turbo models and cuts costs (<a target="_blank" href="https://twitter.com/altryne/status/1750592326563500211">My analysis</a>, <a target="_blank" href="https://x.com/OpenAI/status/1750636119321120942?s=20">Announcement</a>)</p><p>* Google to add 3 new AI features to Chrome (<a target="_blank" href="https://x.com/sundarpichai/status/1749849297800266035?s=20">X</a>, <a target="_blank" href="https://blog.google/products/chrome/google-chrome-generative-ai-features-january-2024/">Blog</a>)</p><p>* <strong>Vision & Video</strong></p><p>* Adept Fuyu Heavy - Third best multimodal model in the world while being ~20x smaller than GPT4V and Gemini Ultra (<a target="_blank" href="https://twitter.com/Maxwell_Nye/status/1750220876824633773">X</a>, <a target="_blank" href="https://www.adept.ai/blog/adept-fuyu-heavy">Blog</a>)</p><p>* FireLLaVA - First LLaVA model with a commercially permissive license from Fireworks (<a target="_blank"
href="https://x.com/lqiao/status/1748243039766925351?s=20">X</a>, <a target="_blank" href="https://fireworks.ai/blog/firellava-the-first-commercially-permissive-oss-llava-model">Blog</a>, <a target="_blank" href="https://huggingface.co/fireworks-ai/FireLLaVA-13b">HF</a>, <a target="_blank" href="https://app.fireworks.ai/models/fireworks/firellava-13b">DEMO</a>)</p><p>* Vikhyatk releases Moondream1 - tiny 1.6B VLM trained on Phi 1 (<a target="_blank" href="https://x.com/vikhyatk/status/1748831249198924057?s=20">X</a>, <a target="_blank" href="https://t.co/LNn6nFs5oY">Demo</a>, <a target="_blank" href="https://huggingface.co/vikhyatk/moondream1">HF</a>)</p><p>* <strong>This week's buzz </strong>🐝🪄 - <strong>What I learned in WandB this week</strong></p><p>* New course announcement from Jason Liu & WandB - LLM Engineering: Structured Outputs (<a target="_blank" href="https://www.wandb.courses/courses/steering-language-models?utm_source=thursdai&#38;utm_medium=referal&#38;utm_campaign=jan-25">Course link</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Meta W2V-BERT - Speech encoder for low resource languages (<a target="_blank" href="https://x.com/reach_vb/status/1750225679898071232?s=20">announcement</a>)</p><p>* 11Labs has a dubbing studio (<a target="_blank" href="https://x.com/altryne/status/1749879475045535789?s=20">my dubbing test</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Instant ID - zero shot face transfer diffusion model (<a target="_blank" href="https://huggingface.co/spaces/InstantX/InstantID">Demo</a>)</p><p>* 🔥 Hourglass Diffusion (HDiT) paper - High Resolution Image synthesis - (<a target="_blank" href="https://x.com/iScienceLuvr/status/1749624496770973816?s=20">X</a>, <a target="_blank" href="https://t.co/tHDmh7w68B">Blog</a>, <a target="_blank" href="https://t.co/2gzaBkwgOd">Paper</a>, <a target="_blank" href="https://github.com/crowsonkb/k-diffusion">Github</a>)</p><p>* <strong>Tools & Others</strong></p><p>* Prophetic announces
MORPHEUS-1, their EEG/fMRI multimodal ultrasonic transformer for Lucid Dream induction (<a target="_blank" href="https://x.com/PropheticAI/status/1750534355242418300?s=20">Announcement</a>)</p><p>* NSF announces NAIRR with partnership from all major government agencies & labs including OAI, WandB (<a target="_blank" href="https://new.nsf.gov/news/democratizing-future-ai-rd-nsf-launch-national-ai">Blog</a>)</p><p>* Runway adds multiple motion brushes for added creativity (<a target="_blank" href="https://x.com/runwayml/status/1749799137762267628?s=20">X</a>, <a target="_blank" href="https://t.co/EUbxc4aA0D">How to</a>)</p><p>Open Source LLMs </p><p>Stability releases StableLM 1.6B tiny LLM</p><p>A super fast tiny model; I was able to run this in LMStudio, which just released an update supporting it. It punches above its weight, specifically on other languages like German/Spanish/French/Italian (beats Phi)</p><p>Has a surprisingly decent MT-Bench score as well</p><p>The license is not commercial per se, but a specific Stability AI membership</p><p>I was able to get above 120tok/sec with this model with LM-Studio and it was quite reasonable, and honestly, it's quite ridiculous how fast we've gotten to a point where we have an AI model that can weigh less than 1GB and has this level of performance 🤯</p><p>Vision & Video & Multimodality</p><p>Tiny VLM Moondream1 (1.6B) performs really well (<a target="_blank" href="https://huggingface.co/spaces/vikhyatk/moondream1">Demo</a>)</p><p>New friend of the pod <a target="_blank" href="https://x.com/vikhyatk/status/1748831249198924057?s=20">Vik (vikhyatk)</a> trained Moondream1, a tiny multimodal VLM with LLaVA on top of Phi 1 (not 2 cause.. issues), and while it's not commercially viable, it's really impressive how fast and how good it is. Here's an example featuring two of my dear friends talking about startups, and you can see how impressively this TINY vision-enabled model can understand this scene.
This is not cherry picked, this is literally the first image I tried and my first result. </p><p>The image features two men sitting in chairs, engaged in a conversation. One man is sitting on the left side of the image, while the other is on the right side. They are both looking at a laptop placed on a table in front of them. The laptop is open and displaying a presentation, possibly related to their discussion.</p><p>In the background, there is a TV mounted on the wall, and a cup can be seen placed on a surface nearby. The scene suggests a casual and collaborative environment where the two men are sharing ideas or discussing a topic.</p><p>Vik joined us on the pod to talk about why he didn't go with Phi-2; he also mentioned that Phi-1.5 was retroactively also MIT'd, its license literally says MIT now on HF 👏 Great conversation, tune in for that at around 00:31:35</p><p>Adept is teasing Fuyu Heavy - their CHONKY VLM</p><p>Adept previously released Persimmon, and then the Fuyu VLM (which is a type of persimmon, we see you Adept), and now they tease the release of Fuyu Heavy, a much bigger model that can compete with or come close to GPT4V and Gemini Ultra on MMMU and MMLU (text) while being approx. 20x smaller. </p><p>While we don't yet get to play with this, they show some great promise in the benchmarks</p><p>* ⭐️ Performance: Excels at multimodal reasoning and matches/exceeds text-based benchmarks.</p><p>* ❗️ Challenges Faced: Dealt with issues related to image data, model stability, and pre-training data scarcity.</p><p>* ✅ Evaluations: Outperforms Gemini Pro on MMLU and MMMU benchmarks.</p><p>AI Summary by Arc Browser (haha see how I cheated here?
I sometimes do shortcut summaries using Arc Max, it's dope, try it) <a target="_blank" href="https://t.co/BZi6EKhS5R">https://t.co/BZi6EKhS5R</a></p><p>Fireworks AI releases FireLLaVA - with a commercially permissive license</p><p>FireLLaVA is the first commercially permissive open-source LLaVA model, a type of multi-modality model called a Vision-Language Model (VLM) that can understand both visual and textual inputs.</p><p>* The original LLaVA model was limited for commercial use as it was trained on data generated by GPT-4, which has non-commercial licenses. </p><p>* <a target="_blank" href="Fireworks.ai">Fireworks.ai</a> recreated the LLaVA training data using an open-source language model, CodeLlama 34B Instruct, to make a commercially viable version. </p><p>* FireLLaVA performs comparably to the original LLaVA model on benchmarks, showing open-source models can generate high-quality data for VLM training.</p><p>* FireLLaVA is available via HuggingFace and through Fireworks.ai's prediction API, enabling new visual capabilities for applications.</p><p>Vik and I chatted about this, and while Fireworks didn't release datasets, they did release an example of how to start collecting them, and it's clear that everyone is clamoring after great vision / image datasets 👏</p><p>Really hoping that many great datasets for multimodal AIs will come out in 2024, giving us increasingly better multimodal LMMs 👏</p><p>Big CO LLMs + APIs (<a target="_blank" href="https://lumiere-video.github.io/">Blog</a>)</p><p>Google announces LUMIERE, a video generation model that shows an incredible push in consistency </p><p>Supports multiple tasks like image to video, text to video, video inpainting, video stylization and more; looks incredible. It seems that they have cracked both spatial and temporal consistency, something that's severely lacking in previous video generation attempts, and it makes character consistency quite remarkable.
Of course, as with other incredible Google papers, we never know if we'll ever see this model or be able to play with it; here's hoping 🤞</p><p>Google will add 3 new AI features to Chrome</p><p>* Chrome is introducing 3 new experimental AI features to make browsing more efficient:</p><p>* Tab Organizer: Chrome will automatically group similar tabs to help with multitasking</p><p>* Custom themes: Users can generate unique browser themes using text prompts and AI image generation</p><p>* Writing help: Chrome will offer suggestions to help users draft messages and posts on websites</p><p>* They are currently only available to US users who opt in on the Experimental Features page </p><p>I think this development is super super important, because making AI accessible via the incredible Chrome platform to billions of people is going to put Gemini in front of grandmas, students, everyone. Quite impressive, and the compute needed to pull something like this off is also quite mind-boggling! 👏 </p><p>Of course, they are not the first browser to add AI; I love the Arc Browser and it has AI previews that I use quite often! </p><p>This week's Buzz (What I learned with Weights & Biases this week)</p><p>Have you, like many of us, had trouble getting structured output (JSON, other structures) from LLMs? Jason also had this problem; that's why he authored the Instructor Library, which makes it easy to guide the LLM to give structured output using Pydantic. Jason has presented at the AI Engineer conference, and recently collaborated with Weights & Biases to launch a free course on how to guide your LLM to give structured outputs!
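The core trick behind libraries like Instructor is simple to sketch: ask the model for JSON, then validate and coerce the reply against a schema before your code touches it (Instructor does this with Pydantic models plus automatic retries). Below is a dependency-free sketch of just the validation step; the schema and the sample reply are made up for illustration, not Instructor's actual API:

```python
import json

# Hypothetical expected schema: field name -> type we require after coercion.
SCHEMA = {"name": str, "age": int, "languages": list}

def validate_llm_json(raw: str, schema: dict) -> dict:
    """Parse an LLM's raw JSON reply and coerce it to a known schema.

    Raises ValueError when a field is missing or can't be coerced, which
    is the signal a library like Instructor uses to re-prompt the model.
    """
    data = json.loads(raw)
    result = {}
    for field, typ in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        try:
            # Coerce e.g. "36" -> 36, mirroring Pydantic's lenient mode.
            value = data[field]
            result[field] = value if isinstance(value, typ) else typ(value)
        except (TypeError, ValueError):
            raise ValueError(f"bad type for field: {field}")
    return result

# Pretend this string came back from a chat completion:
reply = '{"name": "Ada", "age": "36", "languages": ["en", "fr"]}'
person = validate_llm_json(reply, SCHEMA)  # age is coerced to int 36
```

Pydantic adds a lot on top of this (nested models, rich error messages, JSON Schema the model can be prompted with), which is exactly what the course digs into.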
</p><p><a target="_blank" href="https://www.wandb.courses/courses/steering-language-models?utm_source=thursdai&#38;utm_medium=referal&#38;utm_campaign=jan-25">COURSE LINK</a></p><p>Jason is also an independent consultant working with companies on their AI implementations and has many battle-tested examples from implementations across the board, which he shared with us on the pod. </p><p>Give this short course a try if you haven't yet, it's really high quality content, in addition to tons of other stuff we have there, for free 👏</p><p>Voice & Audio </p><p>11Labs has a new dubbing studio and it's really working well</p><p>Check out this short segment of myself speaking in dubbed Russian! It really sounds like me; I sent it to my mom to see if she'd fall for it 😆 She didn't</p><p>AI Art & Diffusion</p><p>Hourglass Diffusion Transformers</p><p>New high resolution diffusion architecture from the K-diffusion and RoPE team (<a target="_blank" href="https://x.com/iScienceLuvr/status/1749624496770973816?s=20">X</a>, <a target="_blank" href="https://t.co/tHDmh7w68B">Blog</a>, <a target="_blank" href="https://t.co/2gzaBkwgOd">Paper</a>, <a target="_blank" href="https://github.com/crowsonkb/k-diffusion">Github</a>)</p><p>The paper presents a new method called HDiT (Hourglass Diffusion Transformers) that shows promise in training models on high resolution images without incurring the significant hardware costs that come with scaling image sizes, replacing latent diffusion models, enabling O(n) complexity and scaling well. </p><p>Utilizing tricks and best practices for transformer architectures, like RoPE (which we've covered on <a target="_blank" href="https://sub.thursdai.news/p/thursdai-special-episode-interview">ThursdAI before</a>), cosine similarity self-attention, RMSNorm, GeGLU, etc., and using something called local self attention, this paper shows incredible promise for high resolution architectures for image creation tools.
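Of those tricks, RoPE is the easiest to sketch: it rotates consecutive feature pairs by an angle proportional to the token's position, so a query-key dot product ends up depending only on the relative distance between the two tokens. A toy, dependency-free version (real implementations operate on batched tensors; the vectors below are made up for illustration):

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to one flat feature vector.

    Each consecutive pair (2i, 2i+1) is rotated by pos * theta_i, with
    theta_i = base^(-2i/d), as in the RoPE formulation.
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)   # frequency for this feature pair
        angle = pos * theta
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, 0.5]
k = [0.3, -0.2, 0.1, 0.9]
# Scores depend only on relative distance: (pos 5, pos 3) == (pos 7, pos 5)
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 7), rope(k, 5))
```

Because rotations preserve lengths and compose by angle differences, shifting both positions by the same amount leaves the attention score unchanged, which is the property that makes RoPE play nicely with tricks like local self attention.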
</p><p>We had the pleasure of hosting <a target="_blank" href="https://x.com/iScienceLuvr">Tanishq Abraham</a>, one of the co-authors (and CEO of MedArc, Director of Research at Stability + PhD at 19), to walk us through the paper and explain the problem and the solution. Additionally, friend of the pod <a target="_blank" href="https://substack.com/profile/35973841-enrico-shippole">Enrico Shippole</a> is a co-author as well 👏 and <a target="_blank" href="https://twitter.com/Birchlabs">Alex Birch</a> joined us silently from the audience 👂 while giving commentary in the group chat.</p><p>P.S - All of these co-authors attribute the bulk of the work to <a target="_blank" href="https://twitter.com/RiversHaveWings">Katherine Crowson</a> from <a target="_blank" href="https://github.com/crowsonkb/k-diffusion?tab=readme-ov-file">k-diffusion</a> 👏 </p><p>Tools & Others </p><p>Prophetic introduces Morpheus-1 - a multimodal foundational model trained on fMRI and EEG signals</p><p>In breaking news fashion, the folks behind <a target="_blank" href="https://twitter.com/PropheticAI">Prophetic</a>, a new startup that announced MORPHEUS-1 just as we were hopping into the space, came to chat with us.</p><p>They are working on a new multimodal ultrasound transformer! That's right, multimodality is not only about images/text folks; we've covered this before, but these chads are actually trying this out: they have trained a transformer architecture to take EEG and fMRI signals and output directions for the ultrasound to activate areas of the brain to induce lucid dreaming. And they are asking for beta testers! </p><p>It's all quite futuristic, and if you're in NY, reach out to them (and then let us know if you had lucid dreams!)
</p><p>Definitely worth a listen on the pod, and check out their video announcement for more details; it was really quite an incredible conversation with <a target="_blank" href="https://twitter.com/weslouis_">Wes</a> and <a target="_blank" href="https://twitter.com/EricWollberg">Eric</a>. </p><p>National Science Foundation launches NAIRR pilot (<a target="_blank" href="https://new.nsf.gov/news/democratizing-future-ai-rd-nsf-launch-national-ai">Blog</a>)</p><p>Partnering with 10 other federal agencies as well as 25 private sector, nonprofit and philanthropic organizations, the NAIRR pilot will provide access to advanced computing, datasets, models, software, training and user support to U.S.-based researchers and educators</p><p>Basically, this is a huge governmental endeavor to provide resources about AI, make sure companies collaborate and keep AI accessible across the board, and tons of government agencies as well as private sector companies have joined hands in this. Just look at this list, it's a veritable who's who of AI in the US (notably, Tesla/X is missing) </p><p>And that's all folks, that's all she wrote (or I guess, I wrote) today! What an incredible show, really thankful for the folks who came out, guests and co-hosts, and see you next week! </p><p>If you scrolled all the way to here and want to show me that you did, your emoji of the week is 🍊 (only cause persimmons don't have emojis), so DM or reply with this and share this pod with 1 friend or tag us on social media! </p><p></p><p>Full Transcription below: </p><p>transcript</p><p>[00:00:00] <strong>Alex Volkov:</strong> Alright, folks, it's time for the sound. Let's get it started today.</p><p>[00:00:11] <strong>Alex Volkov:</strong> Welcome, everyone. Welcome to</p><p>[00:00:13] <strong>Alex Volkov:</strong> this live recording of ThursdAI, the Twitter space, podcast, and newsletter that brings you everything that happened in the AI world, every Thursday, literally almost every Thursday.
My name is Alex Volkov, an AI evangelist with Weights & Biases, and</p><p>[00:00:33] <strong>Alex Volkov:</strong> this is ThursdAI</p><p>[00:00:37] Recap & TL;DR</p><p>[00:00:37] <strong>Alex Volkov:</strong> Alright, recap, here we go. Taking a deep breath. We've talked about an incredible amount of stuff here on ThursdAI for January 24th. The area of open source LLMs was very interesting. We've talked about Stability AI releasing StableLM, tiny version, 1.6 billion parameters, that's really good at different languages, the European languages as well.</p><p>[00:00:58] <strong>Alex Volkov:</strong> And it's not commercially viable for open source, but it is under the Stability membership. So if you have that, it's a great model for you. We've talked about InternLM2 for a state of the art on math LLMs. We briefly mentioned this, but it's getting 90 percent of GPT 4 performance on math, which was quite incredible.</p><p>[00:01:16] <strong>Alex Volkov:</strong> We also had the pleasure of Tanishq Abraham joining us from MedArc for the analysis of open source models as it relates to the medical field. And it turns out that the model called Qwen-72 from Alibaba is the best open source doctor that we have, achieving incredible results and beating even MedPalm 1, which back then was trained by Google as one of the best medical LLMs.</p><p>[00:01:42] <strong>Alex Volkov:</strong> We were also a very multimodal heavy space today. We had the folks from Prophetic join us and talk about their multimodality, which is transformer based but not LLM based, so their multimodality is EEG signals and fMRI signals as they work on hyper focused ultrasound to induce a lucid dream state in your brain.</p><p>[00:02:11] <strong>Alex Volkov:</strong> Their multimodal model is basically taking inputs from EEG and outputting the directions of where to focus this ultrasound, which is super cool.
And I definitely advise you to listen to them. It wasn't planned. I just saw the post, I just commented, "Hey, we're going to talk about this," and they jumped on. Morpheus looks like a cool multimodal attempt, nothing to do with vision, but we talked about vision multimodality as well.</p><p>[00:02:34] <strong>Alex Volkov:</strong> So we've covered Adept, the company that was founded by a few folks from the original Transformers paper. They have previously released the Persimmon models, and then Fuyu-8B was a multimodal model that did not use a vision encoder, a different architecture. They released an announcement, they didn't release any code or weights or a way for us to try this yet, but they announced something called Fuyu Heavy, which is an extension of the previously released Fuyu-8B.</p><p>[00:03:00] <strong>Alex Volkov:</strong> Significantly more trained. And they talked about how difficult it is to train multimodal models, and they claim to have third place in the world after GPT 4 and Gemini Ultra on a bunch of the multimodal metrics and evaluations like MMMU and MMLU. They also talked about the process, how difficult it is to train these models at scale.</p><p>[00:03:20] <strong>Alex Volkov:</strong> So cool from Adept, and we're waiting for some ways to test this. We also talked about FireLLaVA, which is, if you remember, we've talked about LLaVA before multiple times. LLaVA is an open source way to train models in multimodal, like BakLLaVA from folks on stage here, Nisten and Far El, and Obsidian from LDJ who's also on here, and also Moondream.</p><p>[00:03:39] <strong>Alex Volkov:</strong> All of the things we've talked about are based on LLaVA. LLaVA was not commercially permissively licensed because of the dataset. FireLLaVA released the first LLaVA model with a commercially permissive license, from Fireworks AI. And we also had quite an interesting chat with Vik, who is the author of Moondream 1, which is a tiny</p><p>[00:03:59] <strong>Alex Volkov:</strong> 1.6 billion parameter vision language model, also on top of LLaVA, that has Phi 1 as the foundational kind of brain, the LLM brain in it. The conversation with Vik was very interesting. So shout out Vik, thanks for coming up. Specifically because he also mentioned Phi 1; Microsoft, if you guys remember, MIT licensed Phi 2 back in December.</p><p>[00:04:20] <strong>Alex Volkov:</strong> It was a surprise to all of us. And apparently they went back and also changed the license on Phi 1, which is super cool, and Vik told us that he saw this. So Moondream is a very capable, very tiny vision model that works quite well. Definitely worth listening to this conversation with Vik.</p><p>[00:04:36] <strong>Alex Volkov:</strong> We also announced, in the This Week's Buzz category of ours, or segment of ours, about everything Weights & Biases, we announced a new course in our academy from Jason Liu, the author of the Instructor Library. And he has a course now that was released today called LLM Engineering: Structured Outputs.</p><p>[00:04:54] <strong>Alex Volkov:</strong> And as Nisten pointed out, a bunch of the folks in open source are learning from these free YouTube videos, and it's definitely worth checking out the Weights & Biases Academy, because there's a bunch of knowledge there. And it's all for free, just join and register. It's super, super cool. And then we had an incredible honor again of having one of the authors of this paper.</p><p>[00:05:12] <strong>Alex Volkov:</strong> As always, I love when we discuss stuff and the authors of the stuff come to chat with us. So we had Tanishq Abraham.
But also we had Alex Birch in the audience, listening to us while he was working and sending us DMs, from the new paper called Hourglass Diffusion, about high resolution image synthesis.</p><p>[00:05:30] <strong>Alex Volkov:</strong> This paper will be in the show notes, and Tanishq went through the depths of the problem they try to solve. They talked about integrating transformers and diffusion models, previously two separate areas. They weren't the first to come up with this, but they definitely used a bunch of techniques to optimize transformers for the diffusion world and create pixel space, high resolution image synthesis, which shows great promise going forward.</p><p>[00:05:59] <strong>Alex Volkov:</strong> Incredibly insightful conversation from Tanishq, definitely worth a listen. We also covered in this area Instant ID, which is a one shot or zero shot face transfer into diffusion models. So you can upload one picture of yourself and get quite incredible results in image diffusion.</p><p>[00:06:17] <strong>Alex Volkov:</strong> Or like generative images with your face or your kid's faces, which is super cool. I haven't tried my cat. I don't know if it works on cats' faces. I'll try it out. We covered a new state of the art automatic speech recognition system that beats Whisper, or at least runs 30 times faster than Whisper, on different tasks.</p><p>[00:06:36] <strong>Alex Volkov:</strong> We're going to add this to the show notes as well. And a little bit about deepfake audio, with ElevenLabs having released a dubbing studio. And some conversation about whether, or how, it already affects politics.
And then the last thing we've covered is that the National Science Foundation, NSF, announced a new partnership with all the major labs and government agencies around AI. It includes the DOD and DOE, it includes OpenAI and Anthropic, it includes open source folks like Hugging Face, and Meta AI is also participating.</p><p>[00:07:11] <strong>Alex Volkov:</strong> And also Weights & Biases is part of that huge governmental partnership. So I think this is all the stuff that we've covered in this space.</p><p>[00:07:19] Show starts with housekeeping and structure breakdown</p><p>[00:07:19] <strong>Alex Volkov:</strong> We have quite the show for you today, and as always, there's no boring weeks in AI, is there? Some weeks start slow and then pick up, some weeks start crazy from the get go. If you remember, there was one week where one Friday had a bunch of releases, and this week we had a very full week, full of very cool innovations, but also exciting stuff.</p><p>[00:07:47] <strong>Alex Volkov:</strong> And then we have some authors of those things here with us today, and we're going to talk about a bunch of multimodality, which we've been talking about for a while. Obviously the space started with the multimodal GPT-4, and then we just kicked it into high gear. I think that it's time to get started with our default segment. So for those who are new to ThursdAI, we usually segment this into five or six segments, the biggest one being open source LLMs. And then we have big companies' LLMs and APIs. So we usually cover the Google stuff and OpenAI stuff.</p><p>[00:08:18] <strong>Alex Volkov:</strong> Mistral has been here and there, been [00:08:20] in the open source, and now is a big company as well. So depending on what they release, that's where the Mistral stuff falls.
And then we talk about vision and video, which is basically where we'll cover the multimodality stuff, and that section is going to be, I think, the main one today.</p><p>[00:08:36] <strong>Alex Volkov:</strong> There's so much stuff. It's crazy. We also have this corner I call This Week's Buzz. I feel like I have to explain this. Maybe people don't get this dad joke that I put in there. Buzz, as in bees, right? So bees, Buzz. And Weights and Biases, the shorthand for Weights and Biases is WandB.</p><p>[00:08:54] <strong>Alex Volkov:</strong> Weights and Biases, W and B. And for a very funny reason, there's a mascot of ours that's a bee that's holding a wand, because it's WandB. And this little joke has been prevalent in many places. I think I haven't explained it yet. And so This Week's Buzz is actually the corner about everything that I've learned at Weights & Biases every week.</p><p>[00:09:13] <strong>Alex Volkov:</strong> And so in this corner we're going to chat with Jason and announce some cool stuff. The next corner we have is voice and audio, where we usually have a bunch of stuff. We have VB from Hugging Face usually join us. He's like the AI audio person over there. There's not a lot of voice and audio stuff this week.</p><p>[00:09:29] <strong>Alex Volkov:</strong> So I actually don't have anything voice and audio related in my notes. However, if you guys know very cool things that happened this week with voice and audio, please let me know, and we're going to talk about them. We're going to move to AI art and diffusion in the next segment. We're going to talk about some cool things there.</p><p>[00:09:45] <strong>Alex Volkov:</strong> And then the last segment is like a free for all, it's tools and others. So I usually put agents in there. I usually put super cool things there. So I have two exciting things to talk about there.
So this is usually the structure.</p><p>[00:09:58] <strong>Nisten Tahiraj:</strong> I do have one more thing there, and it's W2v-BERT, the speech encoder. I think it's from Meta, and it's supposed to be like 30 times faster than Whisper. So yeah, it's another very efficient automatic speech recognition, ASR, model. So I'll post it in the links.</p><p>[00:10:20] <strong>Alex Volkov:</strong> And I think also we had ElevenLabs announce, yeah, I had a tweet about it actually, with ThursdAI content, that I spoke in English, obviously, and then I asked it to translate to Russian. We'll cover this, ElevenLabs has a dubbing studio.</p><p>[00:10:33] Open Source LLMs</p><p>[00:10:33] <strong>Alex Volkov:</strong> And then, let's go to open source, folks. I think let's go to open source.</p><p>[00:10:55] <strong>Alex Volkov:</strong> All right, let's start with our open source segment here. And I think the first thing we should probably quickly mention is our dear friends at Stability AI, folks who've made a dent in the industry with Stable Diffusion, obviously, but they're training a bunch of other stuff. We've talked about multiple things they did.</p><p>[00:11:12] Stable LM 2 1.6B</p><p>[00:11:12] <strong>Alex Volkov:</strong> We've talked about Stable Video Diffusion and how open source lags behind closed source, but not by that much. And Stability released a new LLM; they had Stable LM before. I think, Nisten, have you used Stability's stuff before? For the LLM stuff?</p><p>[00:11:31] <strong>Nisten Tahiraj:</strong> I have, months ago, so I'm not up to date on it.</p><p>[00:11:35] <strong>Alex Volkov:</strong> Yeah, so</p><p>[00:11:36] <strong>Nisten Tahiraj:</strong> I used it on Google Colabs and</p><p>[00:11:37] <strong>Alex Volkov:</strong> Yeah, so they're not like, they haven't changed the industry in the LLM world as much as they have in the image diffusion world, for sure.
However, there's a big however: they're working on multiple fronts. And it looks like, I had a chance to actually chat with Emad for almost 20 minutes.</p><p>[00:11:52] <strong>Alex Volkov:</strong> Emad is this very incredible person who knows a lot about a lot. The conversation there is basically a stream of consciousness conversation, which I had no trouble following, because we talk about everything here on ThursdAI. But the folks who were with me talking to Emad, they looked at me like, how do you know all this?</p><p>[00:12:11] <strong>Alex Volkov:</strong> And I'm looking at Emad like, how does Emad know all this? That's what happens when you run Stability. So they're training a bunch of different models, and this week they gave us Stable LM, which is a tiny model, a 1.6 billion parameter model. We've been saying this previously.</p><p>[00:12:24] <strong>Alex Volkov:</strong> It's really funny to say small LLM, right? If you expand the LLM abbreviation, it's a small large language model. But this one is tiny. It runs super fast on multiple devices. I think their point is actually edge device running. Obviously we've covered multiple small LLMs before. We've covered Phi, if you remember Phi 1; we're going to talk about Phi with Vik in a second.</p><p>[00:12:47] <strong>Alex Volkov:</strong> We also talked about Phi 2, and I think there's a few others. This Stability release, it's pretty good. It's pretty good. I was itching to play with this, and they released a GGUF. I don't know if you knew this, but apparently Stability has their own CPP and their own GGUF file, which is, for those who are not following all the AI acronyms,</p><p>[00:13:11] <strong>Alex Volkov:</strong> GGUF is a quantized format for models. So apparently Stability's CPP is incompatible with llama.cpp.
And so apparently LM Studio had to add specific support for this, and they did. And so if you want to play with Stability AI's Stable LM, now you can, with LM Studio, and LM Studio, at least in my experience, gave me ridiculous performance.</p><p>[00:13:34] <strong>Alex Volkov:</strong> On this MacBook M3 Max, I got more than 130 tokens per second, which was ridiculously fast. And the model was fairly capable for a small model. I was very impressed. So if you want to play with a small model, you want to do some stuff with this, Stability's is definitely an interesting one.</p><p>[00:13:53] <strong>Alex Volkov:</strong> Support is in LM Studio. Yeah, go ahead.</p><p>[00:13:56] <strong>Nisten Tahiraj:</strong> Yeah, it's a 1.6B, so that means it's 1.6 gigs to run at 8-bit without losing much accuracy. However, that means it has a lot more applications for tiny stuff, because you can get that down to 800 megs and so on. People did find some issues. Again, it's a tiny model, but they found issues with it being able to continue the conversation.</p><p>[00:14:24] <strong>Nisten Tahiraj:</strong> However, for one shot answers, it was extremely capable. So just keep that in mind when using it. It is probably right now the best model for that size. Just keep in mind, if you're going to do something with it, don't expect much in terms of follow up stuff. If you can do it in one shot, great.</p><p>[00:14:48] <strong>Nisten Tahiraj:</strong> Use that. And yeah, that's about all I have to say.</p><p>[00:14:51] <strong>Alex Volkov:</strong> Yeah. An additional thing is that it punches above its weight on other languages.
So if you folks remember when we talked about Mistral, for example, getting compared to OpenAI and Anthropic, et cetera: Mistral Medium, that model is significantly better specifically for the European languages, German, Spanish, French, Italian, all those.</p><p>[00:15:11] <strong>Alex Volkov:</strong> Stability is also playing in that market, looks like, for the smaller sizes. And so this tiny model beats the Phi versions at 3 billion parameters. So it beats models twice its size, even some 7 billion parameter ones, specifically for European languages,</p><p>[00:15:25] <strong>Alex Volkov:</strong> and if you remember, we've talked about MPT from Mosaic, right? Yeah. So this model beats the Mosaic MPT 7B, which back in May was probably the coolest open source model. So that was 7 billion; this beats that on MT-Bench and everything.</p><p>[00:15:40] <strong>Alex Volkov:</strong> It's quite incredible. It beats Falcon 40B. Really, the reason why we bring you these models is not only, hey, use this one. Because Nisten said this one may not be exactly good for your commercial stuff. Also, it's not really commercially viable as-is. There's a specific Stability license.</p><p>[00:15:58] <strong>Alex Volkov:</strong> Stability membership, they call it. You have to apply for a Stability AI membership, and then based on the size of your business you're able to use it; they have to make money somehow. But we bring this to you also to show how fast we're moving from a 30 billion parameter model, to a 7 billion parameter model, and now to a</p><p>[00:16:13] <strong>Alex Volkov:</strong> 1.6 billion parameter model that compresses incredible amounts, trillions of words of human knowledge. Nisten, did we say this can go down to less than a gig, right? If we look super quick,</p><p>[00:16:28] <strong>Nisten Tahiraj:</strong> Yep.
At 4-bit, it should be 800 megs. So we're getting to the point where they'll just fit in a Raspberry Pi Zero with 512 megs, and they'll be conversational [00:16:40] and useful, and even multimodal. So we're almost there.</p><p>[00:16:43] <strong>Alex Volkov:</strong> Yeah, it's quite incredible. And then, okay, so this is the Stability stuff. Meanwhile, I'll say hi to a new guest of ours that I just saw on my timeline.</p><p>[00:16:51] Prophetic announces MORPHEUS-1, an EEG/fMRI multimodal model to induce lucid dreams via focused ultrasound</p><p>[00:16:51] <strong>Alex Volkov:</strong> What's up Wes, how are you?</p><p>[00:16:53] <strong>Wes Louis:</strong> Hey</p><p>[00:16:54] <strong>Wes Louis:</strong> guys, how are you?</p><p>[00:16:55] <strong>Alex Volkov:</strong> Hey, welcome. Folks maybe saw my tweet, maybe didn't, that I love planning for ThursdAI, but I also love breaking news. As I was planning, I was going through my feed, and thankfully my Twitter feed is back at it, giving me the best AI stuff. And Wes, and I think your co-founder is also here.</p><p>[00:17:10] <strong>Alex Volkov:</strong> Eric, yeah. Let me add you real</p><p>[00:17:12] <strong>Alex Volkov:</strong> quick. I didn't plan on this, folks. I literally just tagged them and they came. The video that you guys posted came through my timeline, and I would love to give you the stage for a minute or two to explain what Prophetic is, because the transformer stuff that you discussed, with the EEG and fMRI signals, I really dig it.</p><p>[00:17:30] <strong>Alex Volkov:</strong> Could you summarize that video for us in a brief, like two sentences?
That would be super cool, I think.</p><p>[00:17:38] <strong>Wes Louis:</strong> So</p><p>[00:17:38] <strong>Wes Louis:</strong> this has been something we've been working on for a while.</p><p>[00:17:40] <strong>Wes Louis:</strong> It's really, essentially,</p><p>[00:17:42] <strong>Wes Louis:</strong> a multimodal transformer model that is designed entirely for neural data. And so basically, what we've done is, we built a dataset of EEG and fMRI, and what we're designing is a neurostimulation device to basically induce lucid dreams.</p><p>[00:17:59] <strong>Wes Louis:</strong> And so we built the dataset on heightened prefrontal cortex activity. This is the neural correlate of lucid dreaming. And we basically built a model where you prompt it with your current brain state. We have a set of sensors on the device, and then we output targets for the neurostimulation.</p><p>[00:18:17] <strong>Alex Volkov:</strong> That's quite incredible. So for folks in the audience, we talk about multimodality often, and oftentimes we just mean VLMs, like vision and text, which we're going to cover a bunch today. But today, I think the highlight of this ThursdAI is that multimodality applies to many things. So your multimodality, there's no text in there at all, right?</p><p>[00:18:36] <strong>Alex Volkov:</strong> This is just EEG signals and fMRI signals. Is that correct?</p><p>[00:18:41] <strong>Wes Louis:</strong> Yeah, it's purely prompted with EEG. And one thing I'll say is, everyone talks about multimodal, and, so you're using, let's say, an LLM, and you're prompting it with a photo, for example.
This is similar in many ways, because neural imaging data, particularly EEG, you can nicely get into an image format.</p><p>[00:19:02] <strong>Wes Louis:</strong> And then prompt the model that way. But then on the generation side of things, we use a pretty unique fMRI embedding process that we've come up with ourselves, and ultimately the idea there is that you take this heightened neural activity, and those are candidates for targets for neurostimulation.</p><p>[00:19:20] <strong>Wes Louis:</strong> And, we</p><p>[00:19:21] <strong>Alex Volkov:</strong> What do you, sorry, what do you mean by targets, for folks who have no idea what this means?</p><p>[00:19:26] <strong>Wes Louis:</strong> Yeah. We're using, and this is the other big technology that makes all this work, focused ultrasound. Focused ultrasound, for those that don't know, is this really cutting edge neurostimulation technique that can get quite deep into the brain. Other techniques people may be familiar with, direct current, alternating current, really only get to the surface</p><p>[00:19:47] <strong>Wes Louis:</strong> of the brain, whereas focused ultrasound can get quite deep, but there's also this ability to steer the beam and also create acoustic holograms. And so when we think of heightened neural activity, it really takes the form of these 3D figures. And the idea being that we can create these outputs of fMRI targets and then translate those over to the focused ultrasound.</p><p>[00:20:12] <strong>Alex Volkov:</strong> So this multimodal transformer takes EEG signals as the input, and on the output it prints out those targets.
Those are targets for this technology to then stimulate the brain to go into a specific state.</p><p>[00:20:31] <strong>Wes Louis:</strong> Yes, and all of this is closed loop, in that once you create the stimulation, the model is prompted again with the current brain state, and this is a continuous process of learning and figuring out what sets of tokens lead to this heightened state, and that heightened state is really identified as gamma frequencies, the fastest band of activity.</p><p>[00:20:53] <strong>Wes Louis:</strong> So it's this continuous process until someone gets to a lucid state.</p><p>[00:20:58] <strong>Alex Volkov:</strong> That's quite incredible. So you guys announced the model today, but you're not releasing it open source. This is just an announcement of your efforts, correct? Anything else you want to add here? And I think you started talking about how folks can join the beta if they want to.</p><p>[00:21:12] <strong>Nisten Tahiraj:</strong> Yeah, that's what I</p><p>[00:21:12] <strong>Wes Louis:</strong> would point out, is that we have a beta program. That's really the purpose of this announcement: we're looking for people to sign up. We've had 200 or so in the last two hours. And so this spring we'll have it working. And if you're New York based, or you're willing to come out to New York, we'd be more than happy to have you test out the product.</p><p>[00:21:31] <strong>Alex Volkov:</strong> That's awesome. Congrats folks. Actually, you want to add anything?</p><p>[00:21:33] <strong>Eric Wollberg:</strong> Alex. Hey, how's it going? This is Eric. I'm a</p><p>[00:21:36] <strong>Alex Volkov:</strong> Oh, Eric, yeah.</p><p>[00:21:37] <strong>Eric Wollberg:</strong> co-founder with Wes. Yeah. Hi, thanks for doing this.
Yeah, one thing I think is worth noting is the sequence of how we've released these things. We showcased in October our prototype that we designed with Card79, who notably did Neuralink's design for Elon, and then we also worked with Max Hodak at Science.</p><p>[00:21:52] <strong>Eric Wollberg:</strong> Max Hodak used to run Neuralink for Elon and then spun out Science. So really top consumer BCI kind of design folks. And so now we have this model, right? This ultrasonic transformer, where now we're going to be migrating that onto the technically working prototype and beginning neuromodulation.</p><p>[00:22:08] <strong>Eric Wollberg:</strong> So that's what the beta user program is all about. We've got, yeah, like 225 people signing up in the first two hours. We're excited to have people on board and begin to do this. You have an opportunity, especially if you're early up on that list, to be the first person to achieve an ultrasonically induced lucid dream, which is, you know, I think going to be a pretty watershed moment.</p><p>[00:22:28] <strong>Alex Volkov:</strong> That's super cool. I've tried to lucid dream a lot of times in my life, and I never actually got to a stable one. So I'm excited to follow you guys, but also excited about the technology application of this, because we talk about transformers, and a lot of this is going to LLMs.</p><p>[00:22:42] <strong>Alex Volkov:</strong> This week we're going to talk about transformers as applied to diffusion models as well. And here you are doing full multimodality out of left field. So I love it. And hopefully you guys will do some cool things and keep us up to date, and you're welcome to join on ThursdAI</p><p>[00:22:55] <strong>Alex Volkov:</strong> to talk about this.</p><p>[00:22:57] <strong>Nisten Tahiraj:</strong> Awesome. Thanks, Alex. Thank you, Alex.</p><p>[00:22:58] <strong>Alex Volkov:</strong> Thanks for hopping on, folks.
And as folks know, I love breaking news here on ThursdAI. This was a tiny breaking news item. Thank you, Wes. Thank you, Eric, for joining. Folks, if you want to try the future, sign up for the beta, because why not?</p><p>[00:23:09] <strong>Alex Volkov:</strong> And I think it feels non invasive, right? You put this headset on, and then hopefully you go to sleep, and hopefully you're able to control your dreams, which is like what Vision Pro does for the outside world, but this is inside your dream. It's super cool. All right, let's move on to, I think we're moving on to the big, no, actually we're moving on to the big category for multimodality, as we're already here.</p><p>[00:23:33] <strong>Alex Volkov:</strong> Vision and video and multimodal, or at least VLM multimodal.</p><p>[00:23:38] Adept teases Fuyu Heavy, their flagship multimodal catching up to Gemini Ultra and GPT4V</p><p>[00:23:38] <strong>Alex Volkov:</strong> I'm gonna start with the big dog here, Adept. If you guys remember, Adept Labs were co-founded by a few folks from the original Transformer paper. I don't think they're there anymore, but I feel like I have to add this</p><p>[00:23:52] <strong>Alex Volkov:</strong> prefix every time we talk about Adept. Adept released a few models for us. If you guys remember, Persimmon was a 7B model, or 8B, it was weird, but they released an 8 billion parameter model. It was very interesting back then. They also then, on top of this, released Fuyu. Persimmon is a type of fruit, and Fuyu is a variety of persimmon.</p><p>[00:24:10] <strong>Alex Volkov:</strong> So we see you, Adept, we see your jokes here. Also, I love the LLM naming. And then they released Fuyu back then, and Fuyu was interesting from the perspective that it didn't use a vision encoder; it did something else.
It was very interesting that their approach to vision models allowed them to use non-standard image sizes, because they didn't train it on such a thing.</p><p>[00:24:31] <strong>Alex Volkov:</strong> So back then, that was what was interesting. And now, they've announced, but they haven't released anything. They haven't said, hey, here, use this. I wasn't even able to use it. But they announced Fuyu Heavy. Fuyu Heavy, according to them, and so far Adept have been trustworthy enough for us to trust</p><p>[00:24:48] <strong>Alex Volkov:</strong> what they say, is the third in the world multimodal model, or I guess VLM. So not multimodal like Wes and Eric just told us, but multimodal in the sense of images plus text together. This is the [00:25:00] third in the world model, behind GPT-4 Vision and Gemini Ultra. Gemini Ultra we haven't yet tried, obviously; we don't have access.</p><p>[00:25:08] <strong>Alex Volkov:</strong> If you have access in the audience to Gemini Ultra, and you want to help me, help a brother out, let me try and play with this, please let me know. But so Adept is announcing that Fuyu Heavy, their model, is 20 times smaller than GPT-4 Vision. I have no idea how they even know what size GPT-4 Vision is.</p><p>[00:25:28] <strong>Alex Volkov:</strong> They say that it's around 20 to 30 times smaller, and it comes very close on the multimodality stuff. And they talk about the challenges of creating a large multimodal image based model. The challenges stem from there not being a lot of datasets to properly test with, and the tooling and instrumentation stuff is really hard for images as well.</p><p>[00:25:47] <strong>Alex Volkov:</strong> And so they announced this, and they showed some very incredible performance. And I will remind folks that Adept specifically started with tools that run your computer for you. So their models are specifically tuned on UX, UI and web stuff.
And we're expecting to hear more from them, and finally getting to play with this.</p><p>[00:26:06] <strong>Alex Volkov:</strong> Go ahead, Far El.</p><p>[00:26:09] <strong>Far El:</strong> I just</p><p>[00:26:09] <strong>Far El:</strong> want to say that</p><p>[00:26:10] <strong>Far El:</strong> demos are easy. I'm going to take it with a</p><p>[00:26:14] <strong>Far El:</strong> grain of salt until I actually see the model or am able to test it. The thing is that there is no indication of actual speed of the inference, or whether these examples were cherry picked or not, right? There's a lot of question marks about this, especially when you just come out and make a marketing announcement without actual access to the model.</p><p>[00:26:37] <strong>Far El:</strong> Yeah, it looks cool, but I'm not hyped, just because it's not verified or validated</p><p>[00:26:43] <strong>Nisten Tahiraj:</strong> in any way.</p><p>[00:26:44] <strong>Alex Volkov:</strong> Yeah, I'm with you, I'm with you. Specifically I will say though, about Adept specifically, we've seen stuff from them, we've seen papers from them before. And folks started asking, hey, where's the weights? Where's the weights? And they did say that stuff is coming, but they want to keep a competitive edge.</p><p>[00:27:00] <strong>Alex Volkov:</strong> But we've seen at least a new architecture from them, if you remember, with Fuyu. And so we know</p><p>[00:27:05] <strong>Nisten Tahiraj:</strong> Oh, of course.</p><p>[00:27:06] <strong>Alex Volkov:</strong> yeah, the Fuyu architecture is legit. They were literally able to create a multimodal model without an image encoder back then. We're definitely going to listen to this.
But based on the metrics that they released, if this actually performs as well on MMMU, which is kind of the equivalent of MMLU</p><p>[00:27:25] <strong>Alex Volkov:</strong> for multimodal stuff, it's going to be very exciting, their Heavy model, definitely.</p><p>[00:27:29] Fireworks releases FireLLaVA with a fully commercially viable license</p><p>[00:27:29] <strong>Alex Volkov:</strong> Moving on. Actually, Far El, we'd love to hear what you think about this. And actually, Vik, this is wrapping you into the next conversation. Fireworks AI, which I haven't actually used, released the first LLaVA model with a commercially permissive license.</p><p>[00:27:43] <strong>Alex Volkov:</strong> So LLaVA, we've talked about LLaVA: it's the architecture that allows many of these models to be trained in a multimodal fashion, correct? LLaVA was released, but not with a commercial license, because it was trained on a bunch of data that wasn't marked for commercial and open source licensing.</p><p>[00:28:01] <strong>Alex Volkov:</strong> So a lot of these models that we get, we cannot actually use in production. And Fireworks announced FireLLaVA, the first LLaVA model with commercially permissive licensing. And I think that's super cool, because finally folks will be able to build on this. And as a reminder, LLaMA, the LLM, was released without a commercial license.</p><p>[00:28:19] <strong>Alex Volkov:</strong> And then Llama 2 was released with a commercial license, and then an incredible amount of stuff started happening, because companies who wanted to use this in production actually started looking into it and using Llama 2. And so hopefully the same will start happening with FireLLaVA. I actually am not sure if they released the weights.</p><p>[00:28:36] <strong>Alex Volkov:</strong> I think they did. Yes, they released the weights: FireLLaVA 13B from Fireworks AI on Hugging Face. And yeah, Nisten, go ahead.
You guys trained stuff on top of LLaVA. So please, first of all, introduce the stuff that you've trained, and then also comment on the ability to use this now in production.</p><p>[00:28:56] <strong>Nisten Tahiraj:</strong> Yeah, I just want to say that the entire vision field, open source and non open source, is extremely competitive right now. For example, here we've released BakLLaVA, which is bak-LLaVA, again with the naming. That was three months ago. Also, LDJ here made Obsidian, which is the 3B one, and then they made a 7B as well.</p><p>[00:29:22] <strong>Nisten Tahiraj:</strong> We also have the dev lead of Qwen; he was in the audience as well, and they made the Qwen 14B VL. Oh, and we have Vik as well, who also made a very fast and small model recently. And BakLLaVA was being used as a benchmark, which was pretty interesting, actually. Yeah, the vision LLMs are extremely competitive right now, and I think it's one area where open source can really surpass what you get from any API, because it's something you can run locally on the device and have full control over.</p><p>[00:30:01] <strong>Nisten Tahiraj:</strong> As for FireLLaVA 13B, that's still a Llama 13B base, as far as I saw. I tried to use their inference on their site, but it wasn't working, and I can't complain too much about it, because ours is not working either. Yeah, also to comment a little bit on Fuyu, because I do like that they're trying a completely new approach. They don't use stuff that's similar to CLIP image models, which is what everybody else uses. They do something where they take, I think, groups of pixels, and they serialize it, so the image is just being represented as another string of text, or a string of tokens.
So they can scale</p><p>[00:30:48] <strong>Nisten Tahiraj:</strong> to 8K, 16K, whatever you have; they don't have that limitation that others have in terms of architecture. So it is good to see that approach is working overall. Whether it will be competitive, we'll see. So yeah, I wanted to comment on that. But yeah, I haven't actually tried the Fireworks model itself, but I did see, again, the architecture is similar to LLaVA 13B. That's about all the comments I have on that.</p><p>[00:31:22] <strong>Alex Volkov:</strong> And like you said, interestingly, it's still based on Llama, right? And it's time for new things. And I think this takes us to the next topic of conversation. And again, Vik, I want to introduce you properly this time, or at least let you introduce yourself.</p><p>[00:31:35] Moondream1 from Vik Hyatk - 1.8B VLM</p><p>[00:31:35] <strong>Alex Volkov:</strong> But the next iteration of our conversation about multimodality, like we said, today is a multimodal space, is the existence of very tiny vision models, vision large language models, or large multimodal models; it's really hard to name these things. Vik, welcome to the space. This is your first time; please introduce yourself, and then let's talk about Moondream a little bit.</p><p>[00:31:57] <strong>Vik Hyatk:</strong> Hey folks, hey Alex, thanks for having me. Super excited. My name is Vik. I'm pretty new to the AI space, I think. Like a lot of people, I got into it when that big Stable Diffusion moment happened. And I was like, this is what I need to spend my life working on. So I went out, bought a workstation with a 3090, and started playing around with stuff.</p><p>[00:32:15] <strong>Alex Volkov:</strong> You and me both, brother, you and me both. And, okay, so the reason why you're here, and the reason why I'm calling on you in the vision and video area, is because of Moondream 1.
Can you introduce Moondream 1 a little bit to the audience?</p><p>[00:32:29] <strong>Vik Hyatk:</strong> Yeah, so it's a small language model. It's about 1.6 billion parameters. It's built on top of SigLIP from Google or DeepMind, I forget which one of the two, because that's the vision encoder, and it uses Phi-1.5 as the text model, and then it's trained using the standard LLaVA recipe. So super thankful for the folks that worked on these projects, amazing models they've put together.</p><p>[00:32:52] <strong>Vik Hyatk:</strong> It works. I'm tooting my own horn a little bit here, but it's surprising. I see people post screenshots of them asking questions and it still blows my mind that it works that well.</p><p>[00:33:03] <strong>Alex Volkov:</strong> Let me toot the horn a little bit, because I definitely tried it out. Thank you for the Hugging Face, how can I say, Space that you put up super quick, and the next follow-up is going to be about how to actually use this, but this is based on LLaVA, so the same non-commercial license, correct?</p><p>[00:33:19] <strong>Vik Hyatk:</strong> [00:33:20] Correct. The top piece of feedback I've gotten from people is that they want to see this with a commercially permissive license. I'm working on that. The FireLLaVA folks didn't release the dataset, but thankfully they did talk about their process to create the non-encumbered version of the dataset.</p><p>[00:33:37] <strong>Vik Hyatk:</strong> So I'm working on it. I'll have that out in a couple of days, the dataset at least, and then we can start training models that aren't encumbered like that.</p><p>[00:33:44] <strong>Alex Volkov:</strong> Incredible. And so the next thing that I wanted to talk to you about is Phi-1. So Phi is from Microsoft. Phi-1 was not released with a commercial license. We remember it was trained on synthetic data and TinyStories, like a tiny 1.6 model. So we saw a few releases since then.
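</p><p>The LLaVA-style recipe Vik described above, a frozen SigLIP-like vision encoder whose patch embeddings get projected into the text model's embedding space, centers on a small trained projector. Here is a toy sketch of that projection step with made-up dimensions, not Moondream's actual ones; a two-layer MLP in the LLaVA-1.5 style (the original LLaVA used a single linear layer):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the real components: a SigLIP-like encoder emits one
# embedding per image patch; a Phi-like LM consumes token embeddings.
VISION_DIM, TEXT_DIM = 1152, 2048
W1 = rng.normal(size=(VISION_DIM, TEXT_DIM)) * 0.02
W2 = rng.normal(size=(TEXT_DIM, TEXT_DIM)) * 0.02

def project_patches(patch_embeds: np.ndarray) -> np.ndarray:
    """Two-layer MLP projector: vision space -> LM embedding space.

    In LLaVA-style training, this projector (and optionally the LM) is
    what gets trained; the vision encoder stays frozen.
    """
    h = np.maximum(patch_embeds @ W1, 0.0)  # GELU in practice; ReLU here
    return h @ W2

patches = rng.normal(size=(729, VISION_DIM))  # e.g. a 27 x 27 patch grid
visual_tokens = project_patches(patches)      # shape: (729, TEXT_DIM)
# these rows are prepended to the text token embeddings before the LM runs
```

<p>Because only this projector has to be trained from scratch, the recipe is cheap enough for an individual with consumer GPUs, which is part of why tiny VLMs like this are appearing so quickly.</p><p>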
So obviously we talked just now about StableLM.</p><p>[00:34:01] <strong>Alex Volkov:</strong> Semi-commercial, if you're a part of their membership, and also Phi-2 was MIT licensed. It's a little bit bigger. It's three, I think, billion parameters. Have you tried with Phi-2, and could you speak about that experience?</p><p>[00:34:14] <strong>Vik Hyatk:</strong> Yeah, I did actually. So I was initially working on training Moondream 1 with Phi-2 once it came out. There are some issues with fine-tuning it when you have Flash Attention on, I believe. And so it just takes a lot longer to train. So I went back and looked at Phi-1.5, and I saw that they updated the license for 1.</p><p>[00:34:32] <strong>Vik Hyatk:</strong> 5 to MIT as well.</p><p>[00:34:33] <strong>Alex Volkov:</strong> Oh, really?</p><p>[00:34:35] <strong>Vik Hyatk:</strong> Stick with what works. Yeah.</p><p>[00:34:37] <strong>Alex Volkov:</strong> Wow. I did not know this. So they actually updated the license retroactively.</p><p>[00:34:42] <strong>Vik Hyatk:</strong> Yeah, on the Hugging Face page, at least, it says MIT now.</p><p>[00:34:45] <strong>Alex Volkov:</strong> I love it. Like it would make sense, right? But folks, I don't think we've talked about this. So like breaking news here. Thanks, Vik. Phi-1 is also, we'll check this. We'll double check.</p><p>[00:34:55] <strong>Nisten Tahiraj:</strong> Also three. They're both MIT licensed now. So whatever pressure we put on Microsoft's Azure side, it worked.</p><p>[00:35:03] <strong>Alex Volkov:</strong> Nice. That's incredible. All right, so now this part of your stack of Moondream is now MIT licensed. So LLaVA is the only thing that's holding this back from being used in</p><p>[00:35:14] <strong>Vik Hyatk:</strong> Just the</p><p>[00:35:14] <strong>Unknown:</strong> data set, yeah.</p><p>[00:35:16] <strong>Alex Volkov:</strong> The dataset. Okay. Okay. So definitely there's work being done there.
I will just direct folks' attention to the top of the space, where I had my tests.</p><p>[00:35:25] <strong>Alex Volkov:</strong> I literally just pasted an image. And again, thank you for the demo, Vik. Folks will get the demo in the show notes as well. I pasted an image of two of my friends just sitting and talking across like a TV with some things. Literally the model said, image features two men sitting in chairs, engaging in conversation.</p><p>[00:35:42] <strong>Alex Volkov:</strong> One man sitting on the left side, the other on the right side. That's obvious, but still cool. They're both looking at a laptop placed on the table in front of them. The laptop is open and displaying a presentation, possibly related to their discussion. So this feels like hallucination a little bit, because the model does not know what it displays, but fine.</p><p>[00:35:57] <strong>Alex Volkov:</strong> And so in the background, there's a TV mounted on the wall, a cup placed on a surface nearby. The scene suggests a casual, collaborative environment. This is ridiculous. This is like a super tiny model, and it outputs this scene almost perfectly. And I've tested the same image in a different, bigger model, GPT-4, and it pretty much gives me the same information.</p><p>[00:36:17] <strong>Alex Volkov:</strong> So I was really impressed. So tooting the horn for sure, because the tinier the model is, the better the utilization. And we've talked about different vision-enabled hardware that is possible or not possible based on whether or not it's going to be able to run stuff on, like, a Raspberry Pi. And the smaller these models and the smarter they are, the better we'd be able to use them on cheaper hardware.</p><p>[00:36:40] <strong>Alex Volkov:</strong> Really impressive. What are you planning to do with this? Like, how has the community accepted this? What type of conversations did you get into? And what are you planning to do next here?
Besides training the</p><p>[00:36:51] <strong>Vik Hyatk:</strong> I was blown away by the reception to this. When I put it up, I thought maybe it might get like a hundred likes or something, and then I'd move on to my next project. But I've seen a bunch of super cool demos come out of this. I think the fact that it is small and it runs inference so fast makes a lot of use cases that were previously not possible a lot more viable, like captioning a video in real time, or recaptioning a billion images, and whatnot.</p><p>[00:37:15] <strong>Vik Hyatk:</strong> There's a couple of things I'm working on. Obviously the top thing is getting it to a permissive license. I also, I could use some help on a couple of fronts. So I do want to make it easier to run, get GGUF, Ollama integration and whatnot.</p><p>[00:37:30] <strong>Alex Volkov:</strong> Definitely LM Studio integration. I would love to play around with this in LM Studio, just to see how fast this runs on my machine. MLX would be a cool suggestion as well, the community is very excited about MLX, I don't know if you saw. But LM Studio is a friend of the pod, definitely, it's connected to you too.</p><p>[00:37:46] <strong>Alex Volkov:</strong> I think it's super easy to just add it there. Right, Nisten? It's not difficult.</p><p>[00:37:51] <strong>Nisten Tahiraj:</strong> You just gotta add a JSON file to your model and that's it. Or just message him, 'cause he's very responsive to this stuff, and might even write the JSON for you. And then it will be immediately available for everyone running LM Studio.</p><p>[00:38:06] <strong>Vik Hyatk:</strong> Amazing. Another thing we have going on, by the way, is we're building an agent version of this with Open Interpreter in mind.</p><p>[00:38:13] <strong>Vik Hyatk:</strong> A version of this that's excellent at identifying UI elements, because we want Open Interpreter to have the ability to operate purely off of a local model.
Open Interpreter, by the way, super cool project, check it out, folks, if you haven't already, is a way to have the LLM use your computer.</p><p>[00:38:31] <strong>Vik Hyatk:</strong> So you can do stuff like, just tell the LLM, hey, I want to turn dark mode on, and it'll figure out what buttons to click to enable dark mode for</p><p>[00:38:40] <strong>Alex Volkov:</strong> For folks who follow ThursdAI closely, they remember Killian came on the pod like a week after Open Interpreter was released, and this was, I think, in 2023, our most famous or best-received episode back then. It was a super cool conversation, so shout out Killian Lucas, and definitely Open Interpreter since then has grown a very huge community of people building very cool things.</p><p>[00:39:00] <strong>Alex Volkov:</strong> Recently they released the kind of browsing area, where it controls the computer for you. And it definitely needs eyes for that. And so I think it used GPT-4 Vision, and now you're saying that Open Interpreter will get open source eyes. Is that what I'm hearing?</p><p>[00:39:15] <strong>Vik Hyatk:</strong> Exactly. That's the goal. CogAgent is super promising in this space. They didn't release their datasets, so we're working on replicating that. CogAgent is just too big for most people to run on their computers. It's, I forget, 17 billion parameters or something.</p><p>[00:39:29] <strong>Alex Volkov:</strong> That's CogAgent and CogVLM, right? I think we, yeah, I think we talked about this. Yeah. It's really good</p><p>[00:39:35] <strong>Vik Hyatk:</strong> but yeah, that's another place where, if folks want to get involved, the link in my bio is a Discord, would love to collaborate with folks on getting that dataset together and training that version of the model.</p><p>[00:39:44] <strong>Alex Volkov:</strong> So I think the kind of thing I'm hearing from Fuyu, and from you as well, is that the datasets for vision stuff are the bottleneck to create incredible things, right?
Like datasets for images, datasets for how people use different UIs, for example, all these datasets are kind of the bottleneck for us to get over the next hurdle of getting these models even smaller and even faster performing.</p><p>[00:40:04] <strong>Alex Volkov:</strong> So what are we doing, folks? Let's start building multimodal datasets.</p><p>[00:40:09] <strong>Nisten Tahiraj:</strong> Yeah, and at first for BakLLaVA, we were going to have the dataset also open source, because the code for us is also open source as well. So it's not just open weights, it is fully open. However, the data we couldn't release, so that's not available. And yeah, it's pretty hard to make datasets for vision, because with text it's very easy now to manipulate, modify, do whatever you want to the data, and you can do that at large scale. With images, there just aren't that many tools, that many ready-to-go datasets, and the open source models just started getting good at them.</p><p>[00:40:52] <strong>Nisten Tahiraj:</strong> So yeah, that's going to remain a challenge for the time being. But again, if anybody here is like a grad student, or you're at a company or something in academia, the biggest contribution you can make probably is in the datasets, because the models will get replaced. You'll always have better models coming and going, but the datasets are forever.</p><p>[00:41:15] <strong>Nisten Tahiraj:</strong> If you want to make an impact in this field, get your professor, university, whatever, to put some money toward datasets. We need datasets for images. With images. Yeah.</p><p>[00:41:27] <strong>Alex Volkov:</strong> And we need them bigger and bigger, at an ever-increasing scale. All right, Vik, thank you so much for joining us. Thank you for taking us through how you created Moondream.
And thanks for telling us what's next, how [00:41:40] the community can help besides just datasets provided and testing.</p><p>[00:41:45] <strong>Alex Volkov:</strong> What else would you need?</p><p>[00:41:48] <strong>Nisten Tahiraj:</strong> I, I have a</p><p>[00:41:49] <strong>Vik Hyatk:</strong> list of issues on GitHub where I'm looking for help with various things. But besides that, compute always helps. I'm currently limited on how many things I can do, because my 4090s can only do so many matrix multiplications at a given time. So if anyone has compute that they can give me access to run these, that would be super appreciated.</p><p>[00:42:09] <strong>Alex Volkov:</strong> Yes, I've seen this time and time again on ThursdAI, on stage folks ask for sponsorship for compute. I'm actually getting DMs from different companies like, hey Alex, the space is super cool, can we sponsor someone? Can we? And I'm like, no, I already work with Weights & Biases, I don't need sponsorship.</p><p>[00:42:25] <strong>Alex Volkov:</strong> I would want to connect folks that work on super cool things and need compute to keep going with different companies asking around about, like, compute specifically. So I'll definitely keep you in mind. And go ahead, Nisten. You had a thing you want to say?</p><p>[00:42:38] <strong>Nisten Tahiraj:</strong> Yeah, just really quickly, this is a very effective way to make projects that are impactful. For example, with BakLLaVA, Far El here, and Suntex, they just put out a readme, and tweeted something out, and we got compute. And we got it from Together Computer. So they sponsored that project, and it ended up being a very impactful project that a lot of people use.</p><p>[00:43:05] <strong>Nisten Tahiraj:</strong> That works pretty well. I just say, be careful with conditional stuff.
If they're gonna start talking about an NDA, just ignore them, because then you're doing an exchange, you're basically doing work for that person, so that's just a job contract, that's not a sponsorship. If someone's sponsoring an open source model</p><p>[00:43:27] <strong>Alex Volkov:</strong> Better be.</p><p>[00:43:28] <strong>Nisten Tahiraj:</strong> there should not be like an NDA, that's not, that's no longer a</p><p>[00:43:32] <strong>Alex Volkov:</strong> Better be open source after that. Yes, absolutely. So Vik, I'll keep you in mind when people reach out to me. Folks in the audience, if you work at a company that wants to be featured forever in the open source community, definitely reach out to Vik, and we want more of this.</p><p>[00:43:47] <strong>Alex Volkov:</strong> We want more tiny models that perform incredibly well. We want them to be built into different tools that we can all use, without relying on or paying anyone, by just using our machines. So definitely we'll keep that in mind. Vik, welcome to the community of ThursdAI. You're more than welcome to keep joining and participating in this.</p><p>[00:44:06] <strong>Alex Volkov:</strong> I think it's time for us to move on, folks. It's been around 40 minutes. I think we're actually good on time. I think it's time for us to move on to this week's buzz. I really want to do a music transition here for this week's buzz, with like bees buzzing, etc.</p><p>[00:44:20] <strong>Alex Volkov:</strong> But maybe for next week. Let me just play the regular music, and we'll transition and talk with Jason a little bit.</p><p>[00:44:24] This week's buzz - Jason Liu launches a new course with Weights & Biases for free</p><p>[00:44:24] <strong>Alex Volkov:</strong> All right, welcome to this week's buzz, where I talk about some cool things that happened or I learned at Weights & Biases. Weights & Biases is, ooh, that was an abrupt music stop.
Weights & Biases is the system of record for all your LLM needs. So pretty much most of the folks up on stage who train models use Weights & Biases.</p><p>[00:44:52] <strong>Alex Volkov:</strong> It's incredible, the ubiquity; Weights & Biases is pretty much present everywhere. I just saw StableQuan, one of our friends of the pod, train something and post a Weights & Biases snapshot of his loss curve going down, and I literally just asked, hey, do you mind putting a link to the dashboard?</p><p>[00:45:08] <strong>Alex Volkov:</strong> And he did. So if you wanna check out how his model is going, he's training something super cool. Oh, he's training a mixture, 4x400 million parameters. So he's training like a tiny MoE, Mixtral-style. StableQuan just posted a chart with the train loss from Weights & Biases, and I just asked, hey, can we follow along with the training? And he posted a link to the Weights & Biases dashboard, which is super cool.</p><p>[00:45:34] <strong>Alex Volkov:</strong> Which got a reaction from the Weights & Biases CEO. And so I love seeing this in the wild. So folks, if you're training models, please put those dashboards up so people can follow along. It's really nice. But in other news from Weights & Biases this week, I want to say hi to Jason Liu.</p><p>[00:45:47] <strong>Jason Liu:</strong> Yeah, Jason Liu.</p><p>[00:45:48] <strong>Alex Volkov:</strong> Jason Liu. Welcome, Jason. I've seen you around. I've seen you, I think, at the AI Engineer event from Swyx. I don't know if we ran into each other there, but you had a talk there as well. Yeah.</p><p>[00:45:58] <strong>Jason Liu:</strong> Yeah, it was "Pydantic is All You Need." It did pretty well on YouTube, so I'm pretty</p><p>[00:46:02] <strong>Alex Volkov:</strong> It did great. I also talked with a bunch of people.
I think I was interviewing folks outside of the stage while you were giving the talk, but it was very well received. And this is on a similar topic to what we're going to talk about now. So please feel free to introduce yourself briefly.</p><p>[00:46:15] <strong>Alex Volkov:</strong> And then we're going to talk about the stuff that we did together.</p><p>[00:46:19] <strong>Jason Liu:</strong> Great. Yeah. So I'm Jason. In the past year and a half, I've been mostly doing a lot of applied AI consulting. Before that, I spent the past, like, eight years just doing machine learning. So I did the big data wave, the machine learning wave, the neural networks and deep learning wave.</p><p>[00:46:32] <strong>Jason Liu:</strong> And now we get generative AI. So it's been a lot of fun. And in my spare time I work on a library called Instructor. So now we have Instructor in, I think, JavaScript, Python, and Elixir. And the general idea is that we want to bring just functions and structs into LLMs, and make LLMs feel a lot more backwards compatible with existing code, rather than creating new abstractions to handle some of these things.</p><p>[00:46:55] <strong>Jason Liu:</strong> And I think that's been pretty well received in the community.</p><p>[00:46:57] <strong>Alex Volkov:</strong> Absolutely. So Instructor is definitely where I know you from. And today we have an announcement together. So feel free to announce the cool thing that we did and that you worked on really hard.</p><p>[00:47:09] <strong>Jason Liu:</strong> Yeah, so we're starting a new series around the idea of using schemas and structures to prompt language models. And I think at the end of this week, we're going to release the first part of an LLM engineering series.
And the first part really is just an introduction on how we can use things like structure to prompt LLMs a lot better, right?</p><p>[00:47:30] <strong>Jason Liu:</strong> In the past, we would just, like, beg the language model to give us JSON. Now we have things like JSON mode and function calling and tools, which give us the ability to get more structure. But we still need a lot more tools and ways of thinking about how we can reason about these structures. And so part one is going to be around justifying and motivating why we might want to do this.</p><p>[00:47:54] <strong>Jason Liu:</strong> And then I think in February or March we'll start working on part two, which uses a lot of the new Weights & Biases observability tools to look at how I've solved a lot of LLM problems in production with a lot of my consulting clients.</p><p>[00:48:07] <strong>Alex Volkov:</strong> So just to highlight for folks, Weights & Biases has a free courses area, the Weights & Biases Academy. And some very prominent folks in the industry have collaborated with Weights & Biases to basically teach. So we teach you, for free, how to do these things. So we have courses on things like training LLMs from scratch, fine-tuning, et cetera.</p><p>[00:48:24] <strong>Alex Volkov:</strong> And then Jason is announcing a new course today that he wrote and recorded, and we helped edit a little bit, publish, and also obviously talk about and promote a little bit: how to actually ask your model to give you what you need as a developer, as an AI developer, in structured output, which uses the Instructor library.</p><p>[00:48:42] <strong>Alex Volkov:</strong> Correct, Jason?</p><p>[00:48:43] <strong>Jason Liu:</strong> Yeah, these ideas can be used in other libraries as well, right? So for the Python community, we're really using a library called Pydantic, and so this is supported in things like LangChain and Marvin.
And so even if you don't use a library like Instructor, learning how to think about prompt infrastructure is still something that's going to be really applicable and valuable for everyone listening.</p><p>[00:49:05] <strong>Alex Volkov:</strong> And you mentioned before, there's a bunch of stuff that OpenAI came up with, like JSON mode, for example, etc. There are functions from back in June. But also, the other LLMs, they don't necessarily follow the same kind of new abstractions that OpenAI releases. I think Anthropic just recently announced that they're moving from system messages, or moving to just messages, things like that.</p><p>[00:49:27] Function calling in Open Source LLMs</p><p>[00:49:27] <strong>Alex Volkov:</strong> And also we have open source, which is all over the place. So I guess my question is, with these libraries, with this Pydantic approach and Instructor, would that apply to other LLMs? Does this apply to open source, which we talk a lot about?</p><p>[00:49:40] <strong>Jason Liu:</strong> Yeah, so right now there are only a few open source models that support function calling. So if you've looked at some of the work from the Functionary team, they have been training, I think, Mixtral now with function calling, same with the folks at Nous Research with Teknium. There's been a lot of progress in the open source world in getting things like function calling.</p><p>[00:49:58] <strong>Jason Liu:</strong> If you want more structured outputs [00:50:00] too, there's a great library called Outlines that can use something like the Hugging Face Transformers library to also do structured extraction. And again, they also support things like Pydantic. And the goal of the course really is to show you how to think about and how to model these problems in a particular way.</p><p>[00:50:15] <strong>Alex Volkov:</strong> Absolutely.
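</p><p>As a tiny illustration of the schema-first idea Jason is describing, here is what validating an LLM's output against a Pydantic model looks like. The model and fields are invented for the example; Instructor's contribution is wiring a schema like this into the API call itself and retrying when validation fails.</p>

```python
import json
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

# Pretend this string came back from an LLM asked to extract a person.
llm_output = '{"name": "Ada", "age": 36}'

# Validation raises immediately if a field is missing or the wrong type,
# instead of letting malformed JSON pass silently downstream.
person = Person(**json.loads(llm_output))
print(person.name, person.age)  # Ada 36
```

<p>Roughly speaking, with Instructor you pass a model like this as the `response_model` on a patched OpenAI client and get back a validated `Person` object instead of raw text; the same schema idea carries over to Outlines and other constrained-decoding tools.</p><p>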
And I think, John Durbin in the audience, I think Airoboros was trained on function calling as well, if I'm not mistaken, John. So folks who haven't heard our conversation with John, definitely go and check out the deep dive with John about Bagel, which now includes the Airoboros dataset, which now includes function calling as well.</p><p>[00:50:33] <strong>Alex Volkov:</strong> So that's awesome. The open source also moves there. Go ahead, Nisten.</p><p>[00:50:37] <strong>Nisten Tahiraj:</strong> Also really quick, the Nous vision model ended up being good at function calling, although it had other drawbacks. It was good at function calling because of the Airoboros, like, thousand-something functions dataset. And as far as I saw, the newer Bagel models, so Bagel 7B, are also good at that, at function calling.</p><p>[00:50:57] <strong>Alex Volkov:</strong> So, the Beagle model series from Maxime Labonne. Again, shout out Maxime Labonne, who came on the pod last week, and the full deep dive with him will be released this Sunday, so make sure you're subscribed. We don't talk about function calling there, we talk about NeuralBeagle. NeuralBeagle is one of the top performing 7 billion parameter models, it's a merge, it's a cool conversation about merging.</p><p>[00:51:16] <strong>Alex Volkov:</strong> But let me get back to Jason just real quick. Jason, you're also doing independent consulting, you said, in multiple places, and you're helping them build. I've got to tap into your experience from actual hands-on AI building in companies. Could you give us a little bit of: what do companies struggle with?</p><p>[00:51:32] <strong>Alex Volkov:</strong> Like, what's the first obvious thing that comes to mind that AI builders have probably already solved in their minds?
What do you have to go through to not only build for them, but also educate them on, as you join a company and start helping them out with AI stuff?</p><p>[00:51:47] <strong>Jason Liu:</strong> Yeah. So one of the biggest things I noticed is that when we look at something like a RAG application, really what it looks like is a recommendation system. If you went on Netflix, for example, and you watched a bunch of movies and the recommendations didn't get better, it would be a really terrible experience, and you'd probably lose a lot of customers.</p><p>[00:52:03] <strong>Jason Liu:</strong> But for a lot of companies these days that are using things like agents or retrieval, we are in a situation where, you know, no matter how many users you get, if you don't improve your language model, if you don't improve your embeddings, the product doesn't really get any better. And so one of the big things I'm focusing on this year is helping these companies build a better feedback loop and a data flywheel.</p><p>[00:52:22] <strong>Jason Liu:</strong> And so we can know for sure that as we get more users, there are these network effects that improve the models that we want to train. And so I think step one is being able to fine-tune your own embedding models and your re-rankers, and go from there, and then see what comes up in the future.</p><p>[00:52:39] <strong>Alex Volkov:</strong> Awesome. So definitely, folks, give Jason a follow. The course, I think we're releasing it today, but I haven't seen any social mentions yet, but it's really worth watching. I watched a few of these and will follow along as well. And this is a course series now. So we're going to start with this, and then we're going to continue with the monitoring tools that Weights & Biases has.</p><p>[00:52:56] <strong>Alex Volkov:</strong> Correct?</p><p>[00:52:58] <strong>Jason Liu:</strong> Yeah, the first course is like 30 minutes. It's super quick.
The real goal is to show you what's possible and get you thinking about some new ideas. And then the next course will be deeply integrated with the observability tools from Weights & Biases, and specifically around the experiences I've gotten from consulting production clients.</p><p>[00:53:13] <strong>Alex Volkov:</strong> Incredible. Thank you, Jason. Thank you for joining us. And thank you, folks who worked on the course together with you. I'm excited to see this. And again, the reminder, there's a bunch of free stuff there. There's a bunch of knowledge drops there. And hopefully I will be able to tap into this community and also build more things.</p><p>[00:53:29] <strong>Alex Volkov:</strong> Go ahead, Nisten, and then we'll move on.</p><p>[00:53:31] <strong>Nisten Tahiraj:</strong> Yeah, I just want to say that a lot of us here that got good at machine learning did so from just a random YouTube series. The Karpathy series on building one from scratch. The Full Stack, it's just pronounced like that, their LLM one from way back in April and March. So I'm really looking forward to this one, because doing YouTube tutorials is actually extremely efficient.</p><p>[00:53:53] Breaking News - HuggingFace announces a collaboration with Google</p><p>[00:53:53] <strong>Nisten Tahiraj:</strong> But on that note, we have breaking news.</p><p>[00:53:56] <strong>Alex Volkov:</strong> Wait, we have breaking news. Hold up. You know what this means.</p><p>[00:54:11] <strong>Alex Volkov:</strong> Yes, Nisten, go ahead now.</p><p>[00:54:14] <strong>Nisten Tahiraj:</strong> Philipp Schmid, who is a friend of the pod and has been here.</p><p>[00:54:18] <strong>Alex Volkov:</strong> Here, yes.</p><p>[00:54:18] <strong>Nisten Tahiraj:</strong> Definitely. Yeah, dev lead at Hugging Face, he's also the one that did the integrations, I might be wrong, but the integrations with AWS Bedrock and also with Cloudflare Workers.
Yeah, so now it looks like he's been working on doing an integration</p><p>[00:54:35] <strong>Nisten Tahiraj:</strong> with Google, where you'll be able to just take whatever models or fine-tunes and stuff you have on HuggingFace, and then use Google's infrastructure, both their TPUs and NVIDIA H100s, they're advertising this, that Google owns, to continue training, fine-tuning, serving, deploying stuff via HuggingFace.</p><p>[00:54:55] <strong>Nisten Tahiraj:</strong> This is a very interesting move. Google's jumping in more on the open source side there. I don't know what this means, but this is a very interesting development.</p><p>[00:55:06] <strong>Alex Volkov:</strong> I know what this means. This means that, if Hugging Face ever goes public, buy their stock. That's what this means. Hugging Face is literally embedded into, like, the infrastructure of AI, and definitely worth following. And the more integrations they have, the better it is for the open source community as well.</p><p>[00:55:25] <strong>Alex Volkov:</strong> All right, folks. Thanks, Nisten.</p><p>[00:55:26] <strong>Nisten Tahiraj:</strong> This is not financial. By the</p><p>[00:55:28] <strong>Alex Volkov:</strong> Financial advice, but they're also not public yet. Look, I don't think this moves, yeah, I don't think this moves the needle in terms of Google investing.</p><p>[00:55:36] Hourglass Diffusion Transformers deep dive with Tanishq Abraham</p><p>[00:55:36] <strong>Alex Volkov:</strong> Alright folks, we're moving forward, and where we're moving forward is also into kind of diffusion mode, and I'm very excited to introduce Tanishq.</p><p>[00:55:45] <strong>Alex Volkov:</strong> Tanishq, have you been here before? Remind me, please. I don't think you've been here on stage before.</p><p>[00:55:50] <strong>Tanishq Abraham:</strong> I, I don't think I've been on stage</p><p>[00:55:52] <strong>Alex Volkov:</strong> No. All right.
So I'm very excited to have you here. Thank you for joining us. So folks, one of the coolest things that came out, at least in the research area, this week was this paper from</p><p>[00:56:03] <strong>Alex Volkov:</strong> from multiple authors, some of them friends of the pod, like Enrico, if you remember the chat we did with Enrico on RoPE scaling, he's on the paper as well. Katherine Crowson, who we should mention, I don't think she's been here, but we've talked about some stuff that she did. Stefan Baumann, Alex Birch, Tanishq, you're on there, Daniel Kaplan, and then Enrico, a friend of ours, Nico.</p><p>[00:56:23] <strong>Alex Volkov:</strong> Tanishq has been a friend of the pod behind the scenes, you guys didn't know this, but we met at NeurIPS, so we've met before. Tanishq, do you mind introducing yourself just briefly for the audience who haven't met you or followed you so far?</p><p>[00:56:34] <strong>Tanishq Abraham:</strong> Yeah, sure. My name is Tanishq. I am a research director at Stability AI and also CEO of MedARC, which is a medical AI research organization. I've also been involved with fast.ai, been working on diffusion models for</p><p>[00:56:48] <strong>Tanishq Abraham:</strong> I guess the past year and a half or so. Yeah, so I do all kinds of stuff.</p><p>[00:56:53] <strong>Tanishq Abraham:</strong> Generative AI,</p><p>[00:56:53] <strong>Tanishq Abraham:</strong> medical AI. Yeah.</p><p>[00:56:55] <strong>Alex Volkov:</strong> You also just briefly skipped over the fact that you got your PhD at 19, right? Is that correct?</p><p>[00:57:01] <strong>Tanishq Abraham:</strong> Yes, that's correct. I got</p><p>[00:57:02] <strong>Tanishq Abraham:</strong> it. That was last year. Yes.</p><p>[00:57:03] <strong>Alex Volkov:</strong> So if folks in the audience don't know what this means: there are not many 19-year-old PhDs, and Tanishq is one of them. And also we met once, I think a year and a half ago.
And then the next time we met in Europe. I just remember every detail of our conversation. But that's beside the point.</p><p>[00:57:17] <strong>Tanishq Abraham:</strong> yes.</p><p>[00:57:19] <strong>Alex Volkov:</strong> Thanks</p><p>[00:57:19] <strong>Tanishq Abraham:</strong> met at the Stability AI</p><p>[00:57:21] <strong>Alex Volkov:</strong> Lunch party. That was super cool. And since then, many things have changed. And I really want to talk to you in that area, right? So this paper, shout out to all the authors, because I'm looking at this. I've seen multiple folks share this paper. The paper is talking about high resolution image synthesis.</p><p>[00:57:39] <strong>Alex Volkov:</strong> With something called Hourglass Diffusion Transformers. And I will pin your great thread about this here on top of the space, and it will be in the show notes. Could you briefly tell us the problem this tries to solve? And then we're going to go into how this actually approaches solving it.</p><p>[00:57:57] <strong>Tanishq Abraham:</strong> Yeah, definitely.</p><p>[00:57:58] <strong>Tanishq Abraham:</strong> Yeah. So first of all, I should of course preface this by saying it's mostly, of course,</p><p>[00:58:01] <strong>Tanishq Abraham:</strong> Kat's genius work here. And we were just lucky to be able to help her on this project. But yeah, just to get it started.</p><p>[00:58:06] <strong>Alex Volkov:</strong> Just one tiny second, because it's worth a shout out. So Kat, by Kat you refer to Katherine Crowson, right? And if folks ever used Stable Diffusion before, either in Automatic1111 or whatever, and you [00:58:20] choose anything with K-dash, that's, this is the Katherine, right?</p><p>[00:58:24] <strong>Alex Volkov:</strong> This is, K-Diffusion is her area. A very incredibly prolific person in this area. I don't know many facts about her, but everybody who I talked to from this paper, including Enrico, everybody's referring to Kat, that's her work. 
So maybe a huge shout out to Kat and yeah, go ahead, please.</p><p>[00:58:40] <strong>Tanishq Abraham:</strong> Yeah yeah, she's like, she was one of the original AI art people, so yeah, she helped start the field in a way. Anyway,</p><p>[00:58:46] <strong>Tanishq Abraham:</strong> To provide some context of</p><p>[00:58:48] <strong>Tanishq Abraham:</strong> what this paper is about: the idea is that, if you want to do high resolution generation, so think like 1024 by 1024, the typical approaches these days utilize some sort of multi-stage approach. The most common one, like Stable Diffusion, is this sort of latent diffusion, where you have to encode it with some sort of autoencoder into some latent space, and you're doing diffusion on the latent space, and you're not actually doing it on the actual pixels.</p><p>[00:59:15] <strong>Tanishq Abraham:</strong> And so that comes with some disadvantages. For example, I don't know if people are doing things like image editing with Stable Diffusion, but you realize you don't have a whole lot of fine-grained control at the actual pixel level.</p><p>[00:59:30] <strong>Tanishq Abraham:</strong> It's difficult to do that because it's happening in the latent space rather than in the pixel space. So there are various different things where it has its own challenges. Of course, latent diffusion has a lot of different advantages too, but for some applications it may not be ideal.</p><p>[00:59:44] <strong>Tanishq Abraham:</strong> And then on top of that, the other aspect that we wanted to look into, basically, was the fact that we're seeing people move towards transformer models for diffusion as well. And of course, in the past, most of the diffusion models have been with a U-Net architecture, a convolutional U-Net.</p><p>[01:00:02] <strong>Tanishq Abraham:</strong> Stable Diffusion also uses a convolutional U-Net. 
But there have been a lot of papers examining the use of transformers. And, of course, the nice thing about transformers is, people know how to train them, they're quite scalable, so people would rather use transformers for diffusion over something like a U-Net.</p><p>[01:00:18] <strong>Tanishq Abraham:</strong> But again, the problem is that, so far, it's mostly only been applied to the latent diffusion scenario, mainly because it would be very hard to do this at pixel scale because of the quadratic complexity of attention. So if you wanted to scale up to higher resolution, with the number of pixels, you're going to have quadratic scaling.</p><p>[01:00:40] <strong>Tanishq Abraham:</strong> So it would be very difficult to train this with, I guess, enough resources or whatever. So that's the problem that we're trying to solve: what can we do to resolve the quadratic complexity of the transformer architecture that allows us to then train a diffusion transformer in pixel space?</p><p>[01:00:58] <strong>Tanishq Abraham:</strong> So that's what the hourglass diffusion transformer tries to address.</p><p>[01:01:02] <strong>Alex Volkov:</strong> Thank you for the brief introduction. I will try to recap the way I understand this, so folks who are not machine learning scientists in the audience would be able to follow along. But basically, this whole wave of Gen AI has two big infrastructures so far, right?</p><p>[01:01:15] <strong>Alex Volkov:</strong> The diffusion side, the Stability AI image models and video models, they're based on diffusion, or you said latent diffusion, correct? And then there's the LLM area, basically based on transformers. 
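To put rough numbers on the quadratic-attention problem Tanishq describes, here is a back-of-the-envelope sketch; the patch size of 4 and the 8x-downsampling autoencoder are illustrative assumptions (typical of these models), not the paper's exact settings:

```python
# Back-of-the-envelope cost of full self-attention over an image.
# Attention over N tokens scores O(N^2) pairs, and token count N itself
# grows with the square of resolution, so pixel space blows up fast.

def attention_pairs(resolution, patch=4):
    """Number of token pairs scored by full self-attention when an
    image is split into non-overlapping patch x patch tokens."""
    n_tokens = (resolution // patch) ** 2
    return n_tokens ** 2

# Pixel space at 1024x1024: (1024/4)^2 = 65,536 tokens.
pixel = attention_pairs(1024)

# Latent space (8x downsampled, as in Stable Diffusion): a 128x128
# latent -> (128/4)^2 = 1,024 tokens.
latent = attention_pairs(1024 // 8)

print(f"pixel-space pairs:  {pixel:,}")
print(f"latent-space pairs: {latent:,}")
print(f"ratio: {pixel // latent}x")  # 4096x more attention work in pixel space
```

The 4096x gap is why pixel-space diffusion transformers were impractical before tricks like the hourglass hierarchy.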
And we've seen a bunch of stuff going back and forth in techniques between them, right?</p><p>[01:01:31] <strong>Alex Volkov:</strong> So LoRA, I think, is a thing: many people in the diffusion area trained LoRAs on different concepts, and then obviously fine tuning with LoRAs became a thing, and back and forth. We've seen different approaches going back and forth. I think you said the open source area in LLMs, in transformers specifically, has a bunch of super cool tricks and optimization techniques, flash attention, different things, right?</p><p>[01:01:54] <strong>Alex Volkov:</strong> There's a bunch of stuff that people developed in one area that wasn't necessarily applicable to diffusion models. And so you guys set out to try and unify those two, or at least use some of the tricks, and it looks</p><p>[01:02:09] <strong>Alex Volkov:</strong> like you succeeded to an extent. Yeah. Go ahead please.</p><p>[01:02:12] <strong>Tanishq Abraham:</strong> Yeah, I think it's about: now that we have this transformer architecture, we can try to apply some of the tricks that people have been using, things like RoPE embeddings; there are other tricks like RMSNorm. These are the sorts of tricks, for example, that are used in the Llama architecture, these sorts of similar architectural decisions, and you could take those sorts of best practices and try to see if they help with diffusion now.</p><p>[01:02:33] <strong>Tanishq Abraham:</strong> So yeah, I think that's the idea. And that's like another interesting thing about our paper: people were exploring diffusion transformers, but they were using very kind of old architectures for diffusion transformers. And here we're trying to also apply all these tricks that we see.</p><p>[01:02:47] <strong>Tanishq Abraham:</strong> People are applying in the LLM space, and trying to apply that to diffusion. 
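RMSNorm, one of the Llama-style tricks mentioned here, is simple enough to sketch in a few lines; this is a NumPy stand-in for what would normally be a PyTorch module, and `gain` stands for the learned per-channel scale:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm as used in Llama-style transformers: normalize each
    token's feature vector by its root-mean-square, then rescale with
    a learned gain. Unlike LayerNorm, there is no mean subtraction."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

# One "token" with 4 features; gain initialized to ones.
x = np.array([[3.0, -1.0, 2.0, 0.5]])
y = rms_norm(x, gain=np.ones(4))
print(np.sqrt(np.mean(y * y)))  # ~1.0: features now have unit RMS
```

Dropping the mean subtraction makes it cheaper than LayerNorm while behaving similarly in practice, which is why LLM architectures adopted it.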
Yeah, that was also an important part of our paper as well.</p><p>[01:02:54] <strong>Alex Volkov:</strong> And of course, you mentioned RoPE, and I want to shout out a friend of the pod, Enrico, from Nous Research. Wait, I don't actually remember if Enrico is part of Nous Research. Maybe, so he and Nous Research worked on the RoPE paper together. And for folks who are interested in hearing about RoPE, we had a deep dive during the summer, one of the coolest episodes.</p><p>[01:03:12] <strong>Alex Volkov:</strong> Most of it back then went above my head, but it was super cool going back there and saying, hey, oh, I learned this. RoPE is basically a way to extend context windows and do a bunch of other things for transformer-based large language models. And I wonder, how does RoPE come into play here? And Enrico is part of the authors here on the paper.</p><p>[01:03:29] <strong>Alex Volkov:</strong> So he contributed at least part of that work, I assume. Enrico?</p><p>[01:03:34] <strong>Tanishq Abraham:</strong> Yeah. I think the RoPE stuff is something we haven't fully explored the full potential of, I think. But at least for what we were doing, we saw improvements in performance just using RoPE over other sorts of position embeddings.</p><p>[01:03:50] <strong>Tanishq Abraham:</strong> But yeah, I think there's definitely potential for allowing the model to handle larger resolutions or do things like this because of the RoPE embeddings that we have in the model. Yeah, it's, I think, also meant for future work.</p><p>[01:04:02] <strong>Alex Volkov:</strong> Incredible. You guys use all these techniques. You introduce, or I guess start formally announcing, this concept of diffusion transformers, which is the mixture of these two things. And what are some of the results that you get? 
You've trained a few models to test.</p><p>[01:04:15] <strong>Alex Volkov:</strong> How do you even measure that you're getting performance? Are you just looking at algorithms, or are you actually generating images? Can you talk us through the process of validating these theories and papers?</p><p>[01:04:26] <strong>Tanishq Abraham:</strong> Yeah, but I just want to, I guess, take a step back to clarify: we didn't necessarily invent the concept of diffusion transformers. That is something that people have already developed. But the idea we focus on here is that, in the past, diffusion transformers were done in the latent space because of this quadratic complexity.</p><p>[01:04:45] <strong>Tanishq Abraham:</strong> So we basically have a different type of transformer architecture, which is this hourglass transformer, that enables like O(N) scaling, so a linear complexity. So it will scale with the number of pixels much better; it won't blow up like you have with the attention quadratic complexity.</p><p>[01:05:07] <strong>Tanishq Abraham:</strong> So that was the main trick that we're using. So we have some tricks in there that allow it to have that property. And that's what enables us to do it in the pixel space, as opposed to the latent space that the previous diffusion transformers were using. And then on top of that, we are adding all these additional transformer tricks, which no one had tried out before with diffusion transformers.</p><p>[01:05:27] <strong>Tanishq Abraham:</strong> So those are the main contributions of this paper. And yeah, the other thing worth mentioning is that the way this architecture is able to do this is partly because it is a very hierarchical architecture.</p><p>[01:05:45] <strong>Tanishq Abraham:</strong> So it's actually able to process at different image resolutions. 
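The hierarchy idea, linear-cost attention at the fine levels and full attention only once the image has been downsampled, can be roughly sketched as follows; the window size and level counts are made-up illustration values, not the paper's configuration:

```python
def level_cost(n_tokens, window=None):
    """Token-pair count for one attention level. Windowed (local)
    attention scores pairs only inside each window, so cost grows
    linearly in n_tokens; window=None means full global attention."""
    if window is None:
        return n_tokens ** 2
    n_windows = n_tokens // window
    return n_windows * window ** 2  # == n_tokens * window: linear in N

# Hourglass-style pyramid: fine levels local, coarsest level global.
# Illustrative numbers for a 256x256 token grid, windows of 64 tokens.
levels = [
    (256 * 256, 64),   # finest level: local attention only
    (128 * 128, 64),   # after one 2x downsample: still local
    (64 * 64, None),   # coarsest level: global attention is cheap here
]
total = sum(level_cost(n, w) for n, w in levels)
naive = (256 * 256) ** 2  # full attention at the finest level alone

print(f"hierarchical total:   {total:,}")
print(f"naive full attention: {naive:,}")
```

Even with a fully global coarse level, the total stays orders of magnitude below flat full attention, which is the property that makes pixel-space training feasible.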
And for example, at the high resolutions, we use this sort of local attention, which is what is giving this linear scaling, but then at the low resolutions, we were able to do the regular attention.</p><p>[01:06:01] <strong>Tanishq Abraham:</strong> Yeah, there's also this hierarchical processing of the image resolution. That's also, I think, an important point, which also enables higher fidelity generation. And yeah, in terms of testing the</p><p>[01:06:13] <strong>Alex Volkov:</strong> Yeah. And so the next question is, how do you actually test the architecture? How do you validate that these approaches you tried are actually better than where the field has previously been?</p><p>[01:06:26] <strong>Tanishq Abraham:</strong> Yeah. We looked at two datasets. One, we did ImageNet generation. So class-conditional ImageNet generation. That is, passing in an ImageNet class, you generate images of that class. So if you pass in a zebra [01:06:40] class, you're generating zebras, or if you pass in some sort of dog class, you generate dogs.</p><p>[01:06:43] <strong>Tanishq Abraham:</strong> We train a model for that, at a resolution of 256 by 256, and that's one of the experiments where we compare to other architectures. And the interesting thing is that, of course, we're comparing to other architectures that are using, for example, latent diffusion, where the architecture is functioning on the latent space and not on the pixel space. But we have our architecture functioning on the pixel space, using this hourglass transformer, and it's getting better results than with the latent space.</p><p>[01:07:19] <strong>Tanishq Abraham:</strong> We're beating, for example, the previous Diffusion Transformer model, which was using the latent space. And then another interesting dataset that we used was the FFHQ. 
It's this sort of dataset of high resolution faces, and this is at a 1024 by 1024 resolution, so this is, you know, very difficult to be able to train, especially in pixel space, at a scale of 1024 by 1024.</p><p>[01:07:47] <strong>Tanishq Abraham:</strong> And actually, there are not many other diffusion models that are trained on this dataset. There are a bunch of GAN models, for example, but not really many diffusion models. There's only one or two that we actually found in the literature, because it can be a bit difficult because of the</p><p>[01:08:01] <strong>Tanishq Abraham:</strong> pixel scale, or the resolution of the images. But yeah, we managed to train a model with our architecture, and it trains quite fast. And basically, I guess at this point now, we would be the best diffusion model for that dataset.</p><p>[01:08:18] <strong>Tanishq Abraham:</strong> And we are measuring with FID. But of course, FID as a metric also has its problems; it does have some bias towards GANs, so GANs tend to have a lower FID in terms of the bias of the FID. So when we look at it qualitatively, honestly, we think it's quite comparable to the GANs, might be better than the GANs, honestly.</p><p>[01:08:41] <strong>Tanishq Abraham:</strong> So we may do more evaluations and study that further. But honestly, this may be one of the state-of-the-art models for this FFHQ dataset, though it's a bit hard when you're using FID as a metric. But that's of course the problem; everyone's using that metric in the literature. But yeah, I think that, again, that's another really interesting result that we observed.</p><p>[01:09:01] <strong>Tanishq Abraham:</strong> And then, of course, we do</p><p>[01:09:02] <strong>Alex Volkov:</strong> I want to follow up with a question here real quick. 
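For listeners unfamiliar with FID: it fits a Gaussian to features of real images and another to features of generated images, then measures the Fréchet distance between the two. A minimal sketch of the math; real FID uses Inception-v3 features, while the random vectors here just exercise the formula:

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0, None))) @ v.T

def fid(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 sqrt(C_a C_b)).
    Uses Tr sqrt(A B) == Tr sqrt(A^1/2 B A^1/2) to stay symmetric."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    sqrt_a = _sqrtm_psd(cov_a)
    tr_covmean = np.trace(_sqrtm_psd(sqrt_a @ cov_b @ sqrt_a))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b) - 2 * tr_covmean)

rng = np.random.default_rng(0)
same = rng.normal(size=(2048, 16))
shifted = rng.normal(loc=0.5, size=(2048, 16))
print(fid(same, same))     # ~0: identical feature distributions
print(fid(same, shifted))  # positive: the mean shift shows up in the score
```

Because the score depends on the feature extractor's inductive biases, different model families (GANs vs diffusion) can be favored, which is the bias Tanishq is alluding to.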
For folks it's, like, hard to follow much of this, but they've used something like Stable</p><p>[01:09:09] <strong>Tanishq Abraham:</strong> oh, sorry.</p><p>[01:09:10] <strong>Alex Volkov:</strong> No, that's all great. This is all recorded. Folks can pause, go research, and come back and listen to you.</p><p>[01:09:15] <strong>Alex Volkov:</strong> This is great. You did the deep dive. I really appreciate it. I just want to bring this back a little bit upwards towards, like,</p><p>[01:09:21] <strong>Unknown:</strong> Sure.</p><p>[01:09:22] Effects on the industry from Hourglass Diffusion Transformers</p><p>[01:09:22] <strong>Alex Volkov:</strong> how does this affect the industry, given that we have stuff like Stable Diffusion out, and that keeps getting better? Midjourney is getting reality-adjacent to the point where it's really hard to distinguish; there are different upscalers that take the outputs and then run some upscaling. How does this affect the industry, in your mind?</p><p>[01:09:40] <strong>Alex Volkov:</strong> Will this accelerate some stuff? Will this be applied to different areas that diffusion models have not traditionally been in? Let's say this is a building block that you've created. How does this affect us in three, six months?</p><p>[01:09:54] <strong>Tanishq Abraham:</strong> Yeah, I think this is just a kind of new, unique direction to explore. Of course, I think latent diffusion is still a very interesting, valuable direction, but it's always good to have different directions to explore. And honestly, this architecture can be applied to latent diffusion as well, and maybe we get even better results. For example, we can do maybe multi-megapixel level synthesis by combining this method with latent diffusion or something like this as well.</p><p>[01:10:23] <strong>Tanishq Abraham:</strong> So it's not even like it's 
limited to just the pixel space. That's what we're showing; that's something that is interesting about this. But again, it can also be applied to latent diffusion, and of course, these models could be scaled up further. There's a whole lot of future work to explore here, I think.</p><p>[01:10:39] <strong>Tanishq Abraham:</strong> And yeah, of course it's computationally efficient. And I think the nice thing is, yeah, moving towards the transformer architecture, when people understand the transformer architecture at this point. I think people understand how to scale it, and different tricks.</p><p>[01:10:55] <strong>Tanishq Abraham:</strong> And I think, by introducing this architecture, this is a good way for us to try to bring some of those advances in transformers into the diffusion model field as well. So I think that's the other interesting aspect of this.</p><p>[01:11:12] <strong>Alex Volkov:</strong> For me, not a machine learning scientist, reading this, the highlight of interesting things was how the open source community moves in different areas, but also bringing over some of the learnings, bringing over some of the talent, the tooling around making things available.</p><p>[01:11:28] <strong>Alex Volkov:</strong> And I think that's very exciting. We also have Alex Birch, is that correct? The name, also in the audience. So shout out Alex. And then, what else did we not cover on this stage? What is the last thing that you want to say? Or maybe shout out some of the co-authors. Feel free, the stage is yours.</p><p>[01:11:44] <strong>Tanishq Abraham:</strong> Yeah, I'm just looking at some comments that Alex has. So he thinks, for example, that with this model, there's potential to achieve more realistic textures than even Midjourney. 
So I think we have observed that with the model. Because that's the thing about using latent diffusion: when you're not doing it at the pixel level, it's a bit</p><p>[01:12:07] <strong>Tanishq Abraham:</strong> difficult to get those textures accurately. But if you're doing it at the pixel level, I think you're able to get those textures much better. And we've observed that with the models that we've been training. And yeah, I definitely agree with Alex there.</p><p>[01:12:22] <strong>Tanishq Abraham:</strong> Yeah, I think it may have potential to achieve really realistic textures, and that's something that I guess we could look forward to, hopefully. Yeah.</p><p>[01:12:31] <strong>Alex Volkov:</strong> That's incredible, cause I think the realism comes from the imperfections, especially textures and skin, et cetera. And diffusion models, at least for many folks, are more easily identifiable by the smoothness of edges and different things. So definitely, more textures are there for humans in real pictures.</p><p>[01:12:50] <strong>Alex Volkov:</strong> And then we're looking forward to more of that in diffusion models. That's incredible. So definitely, thank you for breaking this down for us, Tanishq. Thank you, and Katherine and Alex and Enrico and everybody else who worked on this. I think we have some questions from folks on stage here. Vik, go ahead, please.</p><p>[01:13:05] <strong>Vik Hyatk:</strong> Yeah, another question.</p><p>[01:13:06] <strong>Vik Hyatk:</strong> I just wanted to say, I played around with the repository a bit. It's a great way for anyone interested in getting into diffusion models to get started. It's not your typical research code base. 
It's super clean.</p><p>[01:13:19] <strong>Vik Hyatk:</strong> You're not going to run into a bunch of dependency issues and whatnot.</p><p>[01:13:22] <strong>Vik Hyatk:</strong> So that</p><p>[01:13:23] <strong>Vik Hyatk:</strong> was amazing. It's also super compute efficient, so you don't need a ton of compute to start to see good results. I'd strongly recommend checking it out. If anyone was feeling intimidated</p><p>[01:13:32] <strong>Vik Hyatk:</strong> before,</p><p>[01:13:32] <strong>Vik Hyatk:</strong> don't be.</p><p>[01:13:34] <strong>Alex Volkov:</strong> Incredible.</p><p>[01:13:35] <strong>Tanishq Abraham:</strong> Yeah. That comes down to, again, Kat's genius. This is a code base that she's been working on for quite some time, and I also really enjoy working with it.</p><p>[01:13:42] <strong>Tanishq Abraham:</strong> It's one of my favorite diffusion model code bases. So I definitely agree that anyone who's interested in playing around with diffusion models should check it out.</p><p>[01:13:49] <strong>Alex Volkov:</strong> So that's on Kat's GitHub, we're going to add this in the show notes, called k-diffusion, correct? It's now</p><p>[01:13:55] <strong>Alex Volkov:</strong> part of that existing code base, but now with the Hourglass Diffusion Transformer. Get used to saying Hourglass Diffusion Transformers from now on, folks. Hourglass Diffusion Transformers, HDiTs, are now a thing.</p><p>[01:14:06] <strong>Alex Volkov:</strong> And Tanishq, thank you so much. And Alex, for joining in from the comment area. And thank you for working on this work. 
Hopefully this will get the recognition it deserves, and definitely, as a foundational block, get us higher performance, lower hardware-requirement models that look way better.</p><p>[01:14:22] <strong>Alex Volkov:</strong> Incredible.</p><p>[01:14:23] Open source models in medical fields</p><p>[01:14:23] <strong>Alex Volkov:</strong> Tanishq, I wanted to follow up with you, because MedARC is something that you're now CEO of, medical things, and then you had a tweet today that I really wanted to talk to you about, specifically because Qwen was involved, and we have folks from Qwen, usually friends of the pod as well, they join us. Could you,</p><p>[01:14:37] <strong>Alex Volkov:</strong> let's talk through this please, let's talk through how open source is catching up in the medical space.</p><p>[01:14:42] <strong>Alex Volkov:</strong> Could you briefly summarize the recent work from you guys?</p><p>[01:14:46] <strong>Tanishq Abraham:</strong> Yeah. Sure. Yeah. I've been</p><p>[01:14:48] <strong>Tanishq Abraham:</strong> quite busy with all kinds of different research projects. So that was another ongoing research project that we're working on at MedARC, and I shared some progress on that this morning. So basically, at MedARC, we're of course interested in [01:15:00] developing open source medical language models.</p><p>[01:15:03] <strong>Tanishq Abraham:</strong> So that's something that we're heavily interested in. And of course, in order to be able to do that, we wanted to understand what the current capabilities of these language models, the open source language models, look like. No one had done a very proper analysis of this, as far as I could tell. And yeah, basically what we did is we added this suite of tasks known as the MultiMedQA 
suite of tasks. So this is a bunch of tasks, a total of nine tasks, that came from different other papers and stuff, but Google put them together as their sort of evaluation benchmark. This is the evaluation benchmark that Google was using to evaluate their MedPaLM models and whatever models they had.</p><p>[01:15:44] <strong>Tanishq Abraham:</strong> And then the community, the medical AI community, has been using that. It's been used to evaluate GPT-4</p><p>[01:15:49] <strong>Unknown:</strong> and all kinds of</p><p>[01:15:50] <strong>Tanishq Abraham:</strong> other models as well. And yeah, we at MedARC added it to the LM eval harness. That's the common sort of evaluation framework for open source language models.</p><p>[01:15:59] <strong>Tanishq Abraham:</strong> Everyone, I think, uses LM eval harness to evaluate their models on various tasks. So now it's in there, so people can easily also evaluate whatever models they have on these medical tasks. And once we added it into LM eval harness, we just wanted to do a comprehensive analysis of a whole bunch of models in the open source space, just to see these sorts of generalist models.</p><p>[01:16:21] <strong>Tanishq Abraham:</strong> They're not necessarily particularly trained on medical data. Of course, they've probably seen some in their pre-training or whatever, but that's not their main purpose, and that's not their main focus in their pre-training. And I was just curious what their performance would look like and how it compares to other models like GPT-4.</p><p>[01:16:36] <strong>Tanishq Abraham:</strong> GPT-4 is a generalist language model as well. It's not necessarily trained on medical either, but it's really good at that. In fact, prompt-engineered GPT-4 is state of the art on this benchmark, actually.</p><p>[01:16:48] <strong>Alex Volkov:</strong> I remember this. 
I remember where Google came up with a specific medical model, and then GPT-4, basically with prompt engineering, became the top one on that benchmark, right? This was quite incredible, that the most generic</p><p>[01:17:00] <strong>Alex Volkov:</strong> model we have. Yeah,</p><p>[01:17:02] <strong>Tanishq Abraham:</strong> Yeah, it's called MedPrompt. That's the state of the art, this prompt-engineered GPT-4; it's called MedPrompt. And so they do a whole bunch of tricks like dynamic few-shot and GPT-4-written chain of thought, all kinds of tricks that they throw at GPT-4, and they got state of the art.</p><p>[01:17:18] <strong>Tanishq Abraham:</strong> And then of course, they used the same tricks to later claim that GPT-4 is better than Gemini as well. It's not just for medicine; they use it for general prompt engineering as well. But anyway, so yeah, overall the point is, I wanted to evaluate how the open source models do on this benchmark.</p><p>[01:17:38] <strong>Tanishq Abraham:</strong> And so I evaluated a whole bunch of models. I evaluated Llama, Mistral, Mixtral. I evaluated the Yi series of models. I evaluated Qwen. Yeah, so I evaluated a whole bunch of models here, and basically what I found out is, first of all, Llama 2 is not that great compared to all these other models, actually. And it's interesting, because in the literature people are still fine-tuning Llama 2 for medical purposes, but it actually doesn't have a very good base capability for medical knowledge.</p><p>[01:18:09] <strong>Tanishq Abraham:</strong> So Llama 2 is not very good at medical stuff, but the models that are quite good are basically the Yi series of models, so Yi 34B is really good, as well as the Qwen series of models. 
So Qwen 72B is the state-of-the-art open source model, and this is not with any sort of prompt engineering or anything like that.</p><p>[01:18:28] <strong>Tanishq Abraham:</strong> This is just five-shot prompting, and it's beating MedPaLM version 1. So MedPaLM version 1 was released in November of 2022, and that was the first sort of, yeah, that was Google's model that they had. And this Qwen 72B is beating MedPaLM 1 without any sort of prompt engineering or any of these tricks.</p><p>[01:18:50] <strong>Tanishq Abraham:</strong> And yeah, I think that's really, honestly, quite impressive because</p><p>[01:18:54] <strong>Alex Volkov:</strong> Yes.</p><p>[01:18:55] <strong>Alex Volkov:</strong> I want to shout out Junyang Lin, or Justin Lin, a friend of the pod, the technical lead working on Qwen, for such an incredible achievement. And thank you for testing this. Because, Nisten, you worked on AI in medicine as well. We've been waiting, this is going to happen.</p><p>[01:19:11] <strong>Alex Volkov:</strong> Want it or not, there are several doomers that say, hey, never trust an AI doctor. But many people already go to ChatGPT to maybe get a second opinion. And Google has obviously been working on this, MedPaLM and MedPaLM 2.</p><p>[01:19:22] <strong>Alex Volkov:</strong> I think for many people it's going to be easier to digest this idea if the model that talks to them fully runs on their computer: open source, no internet, no data sharing.</p><p>[01:19:33] <strong>Alex Volkov:</strong> I think that's a very important piece of this as well. And it's great to see that we're now getting some cool comparisons, but definitely open source is coming on strong on this one.</p><p>[01:19:42] <strong>Unknown:</strong> Yeah.</p><p>[01:19:43] <strong>Nisten Tahiraj:</strong> Yeah. 
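"Five-shot prompting," as used in these evaluations, just means prepending five worked question-answer examples ahead of the test question so the model can infer the task format. A minimal sketch; the questions below are toy placeholders, not actual MultiMedQA items:

```python
def build_few_shot_prompt(examples, question, k=5):
    """Concatenate k solved Q/A pairs ahead of the test question;
    the model completes the final 'Answer:' line."""
    shots = examples[:k]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# Toy stand-ins for benchmark items:
examples = [
    ("Which vitamin deficiency causes scurvy?", "Vitamin C"),
    ("What organ produces insulin?", "The pancreas"),
    ("What is the normal adult resting heart rate range?", "60-100 bpm"),
    ("Which blood type is the universal donor?", "O negative"),
    ("What does BMI stand for?", "Body mass index"),
]
prompt = build_few_shot_prompt(examples, "What organ filters blood?")
print(prompt)
```

Evaluation harnesses score the model's completion of that final line against the reference answer, with no hand-tuned instructions involved, which is why beating a prompt-engineered system this way is notable.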
I had the same thing as Tanishq with the Llama models: you can train them on good medical data, but they don't perform great at the base. I'll tell you, GPT-4 is still king when it comes to it. And the product I worked on last year in March is still going. Dr.</p><p>[01:20:04] <strong>Nisten Tahiraj:</strong> Gupta.ai is still going. It's just a very well prompt-engineered doctor, with a good RAG system too; that was one of the first. But I will say, the main concern now, and why I think open source will basically completely dominate medical AI, is that if they're dependent on some kind of API endpoint, that makes the hospital and people's medical data really vulnerable to malware and foreign intelligence groups, which have been wreaking havoc with medical data and ransomware.</p><p>[01:20:42] <strong>Nisten Tahiraj:</strong> So that's their main concern. And the only way we're going to solve that is by having models that they run locally. So I'm really glad Tanishq actually took on the task of benchmarking some of these, because you have the entire medical safety field and all the funding and all the people, and I have yet to meet an AI safety person that even knows how to rename a file in Linux, let alone actually write some kind of benchmark.</p><p>[01:21:07] <strong>Nisten Tahiraj:</strong> So I'm glad someone's actually taken on the challenge of making open medical, yeah, medical LLM benchmarks.</p><p>[01:21:19] <strong>Tanishq Abraham:</strong> Yeah, I completely agree. Yeah, I definitely think open source is the future for medical AI and medical LLMs. 
And I think hospitals and doctors will be more comfortable when they know they have access to the model and this is the model they're using, rather than when it's behind some API, where not only is malware a concern, but Open-</p><p>[01:21:40] <strong>Tanishq Abraham:</strong> AI will just change the model or something like this too. These are all concerns, and we see this already happening with the models that OpenAI has. There needs to be complete transparency when working with these kinds of more crucial applications.</p><p>[01:21:55] <strong>Tanishq Abraham:</strong> And by doing all this open source, I think that provides the transparency that doctors and hospitals and healthcare systems will be comfortable with. That's why I'm really excited about working in this area, and I think there's really a lot of potential here.</p><p>[01:22:09] <strong>Alex Volkov:</strong> Incredible. Thank you for this work, Tanishq, and thank you for bringing us the question of which of the models. Surprisingly, Qwen. If you gave me all the models we've talked about, I wouldn't have assumed Qwen was the best performing, but hey, we'll take what we can get.</p><p>[01:22:22] <strong>Alex Volkov:</strong> Qwen 72B, the best open source doctor, folks. You heard it here, based on this research.</p><p>[01:22:30] <strong>Tanishq Abraham:</strong> Yeah. Thank you for letting me share all this work.</p><p>[01:22:32] <strong>Alex Volkov:</strong> That's incredible. And as a friend behind the scenes, but now friend of the pod, you're always welcome. Thank you for the deep dive on the Hourglass Diffusion Transformers, and thank you to the authors as well.
Alex, who I think is still in the audience, and Katherine and Rico and some other folks. And definitely, for MedARC, keep us up to date.</p><p>[01:22:48] <strong>Alex Volkov:</strong> We'll keep reporting, and the stage is yours whenever you want it. I think, folks, we're moving forward. Nisten, unless you have, or sorry, Tanishq, do you have one last thing you want to</p><p>[01:22:57] <strong>Tanishq Abraham:</strong> I would just say, first of all, please follow all of our Hourglass Diffusion authors, they all deserve your support, and also please follow MedARC as well.</p><p>[01:23:06] <strong>Alex Volkov:</strong> 100 percent worth following, and they'll definitely be in the show notes for folks who are listening to this while driving and cannot click that follow button. As we're an hour and a half into the space, let me reset [01:23:20] this a little bit for folks. If you just recently joined us, you're listening to ThursdAI, where we talk about everything.</p><p>[01:23:26] <strong>Alex Volkov:</strong> Everything incredible and interesting in the world of AI: open source, LLMs, and the big companies we cover. We also had a deep dive today about vision and video. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases. We're here every week and we keep up to date so you don't have to. So if you were off Twitter, or you don't participate in Twitter at all and you're just listening to this on the podcast, we've got you; we're going to cover everything that's most important and send it to you, so definitely check out</p><p>[01:23:52] <strong>Alex Volkov:</strong> thursdai.news for that. And I think we're moving towards the big companies area, which we haven't touched yet. We briefly covered in the breaking news that Hugging Face just announced a partnership with Google.
So you'd be able to very easily run models from Hugging Face on TPUs and those GPUs Google has, which is incredible, because Google has those but doesn't really give them out.</p><p>[01:24:15] <strong>Alex Volkov:</strong> I think they're all reserved for Colab or something. But also, everything that I have today in the big-company LLMs and APIs section is from Google.</p><p>[01:24:25] Google teases LUMIERE, SOTA video generation models</p><p>[01:24:25] <strong>Alex Volkov:</strong> So the next thing we're going to talk about is Lumiere. I don't know if you guys saw the video, but I definitely did. I think, Far El, you sent this in our group chat first, but by that time it was already spreading around.</p><p>[01:24:37] <strong>Alex Volkov:</strong> So there's obviously the whole area we've talked about: Stable Video Diffusion releases very short videos, image-to-video and text-to-video. Then there are the front runners in closed source, which are Runway and Pika, and there's another one, Firework. Oh, and Leonardo is doing some incredible things.</p><p>[01:24:54] <strong>Alex Volkov:</strong> All of them have very short videos, and the consistency between frames is not incredible. And Lumiere has shown a video, and sure, you could say it could be very cherry-picked, et cetera, but it feels like this is another significant step in this direction.</p><p>[01:25:13] <strong>Alex Volkov:</strong> For folks who haven't watched the video yet, it's definitely worth watching. I'm going to add it; it's already at the top of the space. Basically, they announced a bunch of stuff that Lumiere can do besides just generation. Video inpainting is one that they've announced.</p><p>[01:25:28] <strong>Alex Volkov:</strong> They announced text-to-video, image-to-video, and inpainting.
And they claim realistic, diverse, and coherent motion, specifically around the motion of the characters, which has been lacking in all these video synthesis models. I will say,</p><p>[01:25:44] <strong>Alex Volkov:</strong> it's pretty remarkable to even discuss that, oh, this text-to-video model is not as good as that one. It's really incredible that we're at this point where we can say, all highbrow, "Oh yeah, I prefer this output." We're typing text and getting a video back.</p><p>[01:25:59] <strong>Alex Volkov:</strong> It's ridiculous on the surface, even saying this. A year and a half ago this wouldn't even have seemed possible. But with that said, we're moving forward; hedonic adaptation is a thing. We're getting used to these tools day to day, and then we're like, okay, yeah, this tool is better.</p><p>[01:26:15] <strong>Alex Volkov:</strong> They said that existing video models synthesize distant keyframes followed by temporal super resolution, which makes temporal consistency difficult to achieve. Temporal consistency is basically about what the characters do throughout the video.</p><p>[01:26:30] <strong>Alex Volkov:</strong> You've all seen these videos where the face changes from frame to frame, et cetera. And this series of videos from Lumiere looks very consistent, spatially and temporally: where the characters are in the frame, but also across time. And they attribute this to the different methods they've used. I will not go into it, but I think the tasks are very interesting.</p><p>[01:26:53] <strong>Alex Volkov:</strong> They have video editing applications, image-to-video, inpainting, and stylized generation. Something I also liked: you'd be able to take an image and then generate videos based on that style, not necessarily that image.
So very impressive from the folks at Google, as always.</p><p>[01:27:08] <strong>Alex Volkov:</strong> I haven't played with it; I don't think there's a way for us to play with it yet. But there's a paper, and maybe some of the ideas in it could be reproduced in open source. The model is shown in the paper from quite a lot of folks: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, and a bunch of others on the paper.</p><p>[01:27:25] <strong>Alex Volkov:</strong> A very visually appealing demo as well, so we'll definitely add the video to the show notes. And I think we have one more thing here in diffusion stuff. Yes, the last thing I wanted to talk about is InstantID. We've moved off Lumiere; Lumiere is super, super cool, but we haven't seen it work ourselves.</p><p>[01:27:43] <strong>Alex Volkov:</strong> Hopefully they release it. Google has a track record here; like when Dreambooth was released, everybody was using it. And I think that's pretty much it in the big companies and open source.</p><p>[01:27:55] InstantID - 0 Shot face transfer diffusion models</p><p>[01:27:55] <strong>Alex Volkov:</strong> The other thing I wanted to mention is InstantID. We've mentioned it briefly before, but it's been pretty much everywhere on my timeline. If you haven't played with it, I very strongly encourage you to, because InstantID is a technique to create diffusion images with your face.</p><p>[01:28:11] <strong>Alex Volkov:</strong> And we've all probably tried this once, with, like I said, Dreambooth from Nataniel Ruiz, who's a dear friend of the pod and has been here a couple of times. There are other techniques as well to transfer your face into a latent diffusion model.
And they all used to take multiple images of your face and some amount of training.</p><p>[01:28:32] <strong>Alex Volkov:</strong> And InstantID is basically a technique you can try right now, super quick, zero-shot with one image. You can generate images with your face, or your kid's face, or whatever. And I just want to highlight how impressively fast we're moving towards these types of tools. This used to take fine-tuning.</p><p>[01:28:52] <strong>Alex Volkov:</strong> This used to take GPUs and knowledge, there's tools like Kohya; this used to take LoRAs, and before LoRAs, Dreambooth. Actually, there are a couple of companies I know that were built on providing the fine-tuning experience around this, where you upload images and get this huge, like four-gigabyte, Stable Diffusion file specifically trained on you as a concept.</p><p>[01:29:13] <strong>Alex Volkov:</strong> And now there's a zero-shot transfer thing called InstantID, with a Hugging Face demo included here; I will attach it soon. You just upload one image of yourself. Literally, me and Nisten and Tanishq and Umesh, the non-anons here on stage, would be able to use our profile pictures and just generate ourselves with a cowboy hat in noir style, and it will look like us</p><p>[01:29:36] <strong>Alex Volkov:</strong> most of the time. I've tested InstantID on my kids, and I'm not going to post it because of privacy, but my kid loved it incredibly. He was Superman, and it looked like him. It's unbelievable that it was able to transfer this with one image. It's quite incredible how fast we've moved here.</p><p>[01:29:52] <strong>Alex Volkov:</strong> Definitely, if you haven't tried InstantID but you have tried avatars before, try InstantID, you'll be blown away. It runs on your Mac as well, not that great, but it runs through Pinokio.
Definitely worth noting how fast we're moving on this generation, and shout out to whoever built it.</p><p>[01:30:08] <strong>Alex Volkov:</strong> And there are quite a few technologies like this now, highlighting how fast we're moving. I think that's pretty much it.</p><p>[01:30:15] Voice and Audio - New tech challenges Whisper</p><p>[01:30:15] <strong>Alex Volkov:</strong> So we've covered our diffusion. Let's move to voice and audio. Nisten, you brought us this news, so I definitely want you to pull up the tweet and let's talk about the faster encoder ASR.</p><p>[01:30:25] <strong>Alex Volkov:</strong> And while you pull that up, I will say that this week Eleven Labs announced a big funding raise, and they also released their dubbing studio. And if you followed Twitter at all, not even AI Twitter, for the past week and a half, two weeks, you've maybe seen the dubbed video of the Argentinian prime minister, or I don't know if he's a prime minister or president, probably president, right?</p><p>[01:30:55] <strong>Alex Volkov:</strong> Yes, president. Milei. He went to the World Economic Forum and gave a speech in Spanish, and then there was a dubbed version. These global summits of leaders have instant translation in the ear, into any language, done by a human that knows both languages.</p><p>[01:31:14] <strong>Alex Volkov:</strong> And then somebody said, hey, okay, this is one example, and they posted a HeyGen version. If you remember HeyGen, we've talked about it, quite an incredible translation, dubbing, and lip-sync service, where you can upload yourself and get an instant avatar. Somebody used HeyGen on the whole speech.</p><p>[01:31:29] <strong>Alex Volkov:</strong> And that went ridiculously viral. I think there were like 50 million views on it on X.
And that was mostly a combination of [01:31:40] Milei being very viral in his opinions, stoking some controversy, but also because you literally hear the person speak in English with a Spanish accent, where this didn't happen; he literally spoke in Spanish.</p><p>[01:31:52] <strong>Alex Volkov:</strong> Quite incredible technology, and people have been shocked and said, oh my God, this is coming for all of us in deepfakes. Fine, we've talked about this multiple times. So Eleven Labs now has an alternative to this, called the Eleven Labs Dubbing Studio. And I've actually used it on a trailer for ThursdAI, of me speaking in English, and asked it to dub me in Russian, a language that I do speak, my mother tongue from Ukraine, and it sounded ridiculously cool.</p><p>[01:32:18] <strong>Alex Volkov:</strong> Here's a quick snippet of me from a ThursdAI show three weeks ago that I dubbed into Russian for your entertainment.</p><p>[01:32:28] A gadget for children, for parents who do not want to buy their children iPhones, because then Instagram will destroy their brains. This is the perfect device for that.</p><p>[01:32:36] It looks like a toy. In fact, you can talk to the rabbit, it is very cute, and there is one simple interface: voice.</p><p>[01:32:43] <strong>Alex Volkov:</strong> How should I say this: so far, these voice-cloning models did not work on me, specifically because my accent in English is not that great. But because my accent is probably Russian, the Russian version of me sounded really close to me.</p><p>[01:32:54] <strong>Alex Volkov:</strong> For the first time, I was like, oh, okay. All right.
And Eleven Labs released this dubbing studio, and hopefully models like these are now coming to open source.</p><p>[01:33:04] AI deepfake of Biden caused controversy on mass media about AI</p><p>[01:33:04] <strong>Alex Volkov:</strong> Because there's also a thing where a recording of Biden saying something like "stay home" is going around, and everybody in the media is making a big fuss about, oh my God, AI is coming for all of us.</p><p>[01:33:15] <strong>Alex Volkov:</strong> And there's a big cry for folks to say we should build tools to detect against this, et cetera. My stance remains the same; I think we've talked about this multiple times. The only way through these woods is for everybody to know that their voice is very easily fakeable with three or ten seconds of audio.</p><p>[01:33:31] <strong>Alex Volkov:</strong> It's time for humanity to adapt to this situation. There's no panacea here. You should just know that blindly trusting a voice without knowing the source, just don't do that, because it might as well be fake. I don't know if you want to add anything.</p><p>[01:33:44] <strong>Alex Volkov:</strong> Yeah, go ahead.</p><p>[01:33:45] <strong>Nisten Tahiraj:</strong> Really quick, I want to say we already have laws to deal with this. More law is not necessarily going to fix the issue because fraud is illegal in a free market. And if you want, or at least people that are more in politics and stuff, if you want to solve the issue, do the job you already have.</p><p>[01:34:05] <strong>Nisten Tahiraj:</strong> You already have a list of spam callers, which have been identified without an AI. Can you shut them down? People love to imagine problems and love to think of doom or whatever in the future, and then they completely ignore the stuff in front of them.
All of us do this, but yeah, again, fraud is illegal.</p><p>[01:34:27] <strong>Nisten Tahiraj:</strong> Can you shut it down, as a job, as a government? You don't need a new law, you don't need to make speeches about AI. You just need to shut down fraud when it's identified. Otherwise, all of these tools and conferences and stuff are pointless.</p><p>[01:34:42] <strong>Alex Volkov:</strong> As predicted.</p><p>[01:34:43] <strong>Nisten Tahiraj:</strong> That's what I'm gonna</p><p>[01:34:44] <strong>Alex Volkov:</strong> Yeah, no, that's great. As predicted, the first election-related deepfake type thing. The media was all over it, and the doomers were like, here we go, it came sooner than we thought. And no, we've literally been talking about this for the past year.</p><p>[01:34:57] <strong>Alex Volkov:</strong> Elections are coming, these things are going to happen. The technology was there even before; now it's just a little bit more accessible. The laws are in place. Make it more difficult for grandmas to get spam calls, not for the open source stuff. So hopefully, and this is my stance, the more prevalent these technologies are, the better the chance that people will just get used to them being everywhere.</p><p>[01:35:19] <strong>Alex Volkov:</strong> And definitely for those of us who have our audio out there, we're doomed, right? So my usual suggestion here is: come up with a key phrase with your loved ones that only you know, like the Terminator scene with the dog. Make sure that if you get a call at 3 a.</p><p>[01:35:34] <strong>Alex Volkov:</strong> m., and it sounds like a bad-quality version of your relative calling from an unknown phone, you verify it's them by asking something like, hey, remember we went to Hawaii, when you never went to Hawaii. And they say, oh yeah, of course.
But also, most of those will probably be LLMs, so you can probably just</p><p>[01:35:53] <strong>Alex Volkov:</strong> prompt-trick them, the spammy LLM calls that sound like your relative.</p><p>[01:35:57] W2V BERT ASR gets whisper quality with significantly less parameters</p><p>[01:35:57] <strong>Alex Volkov:</strong> Alright, moving forward. Unless, Nisten, you want to add some stuff about this W2V-BERT speech encoder? I've added it to the top of the space.</p><p>[01:36:07] <strong>Nisten Tahiraj:</strong> Yeah, just really quickly, I'm gonna do the paper reading on it, 'cause</p><p>[01:36:10] <strong>Alex Volkov:</strong> Oh, hell yeah!</p><p>[01:36:11] <strong>Nisten Tahiraj:</strong> It's a pretty nice paper, so stay tuned for that at some point when we announce it. It's from MIT and some people from Google. So it's another really nice encoder-only model, and it seems to be potentially up to 30 times faster,</p><p>[01:36:29] <strong>Nisten Tahiraj:</strong> so this could</p><p>[01:36:30] <strong>Alex Volkov:</strong> than Whisper,</p><p>[01:36:31] <strong>Nisten Tahiraj:</strong> be quite useful. It could be quite useful for those making assistants that run on local or low-resource devices, but also for stuff on the web. It is now officially supported by the Transformers library. We'll wait on Xenova; I'm guessing it's probably going to be available via WebGPU and such.</p><p>[01:36:55] <strong>Nisten Tahiraj:</strong> Yeah, it's nice to see that field also going forward, because we already have excellent speech recognition. We know it works really well. We just need it to work on more low-power devices and mobile.</p><p>[01:37:08] <strong>Alex Volkov:</strong> Absolutely. And looking at some stats here, it covers a bunch more languages than standard Whisper: 143 languages. And you can fine-tune it on specific languages as well to make it better.
And VB benchmarked it on Mongolian, and it beat Whisper in less than 1,200 steps. So: a smaller model, fine-tunable, super cool, and the best part is the MIT license.</p><p>[01:37:29] <strong>Alex Volkov:</strong> There have been other ASRs, but they're not under this license, and now we're getting a state-of-the-art tiny model under it. I think that's most of the stuff that I wanted to cover.</p><p>[01:37:39] NSF announces a new initiative called NAIRR</p><p>[01:37:39] <strong>Alex Volkov:</strong> No, I wanted to cover one last thing. One last thing: the National Artificial Intelligence Research Resource, N A I R R,</p><p>[01:37:47] <strong>Alex Volkov:</strong> which is coming to us from the United States National Science Foundation, collaborating with other agencies. All of these incredible three-letter agencies are collaborating in this now: the NSF, DARPA and NASA, and NIST, which is the Institute of Standards and Technology, and DOD and DOE, all of these.</p><p>[01:38:11] <strong>Alex Volkov:</strong> But also, the private sector is joining: companies like Anthropic and OpenAI, and Palantir, and Google, and EleutherAI, and Hugging Face, and Weights & Biases. Obviously, I saw this and went, oh, that's cool, Weights & Biases is participating in this incredible effort. They're all joining together in this initiative to promote and support AI research and advance safe, secure, and trustworthy AI.</p><p>[01:38:33] <strong>Alex Volkov:</strong> And it's also great to see folks like Hugging Face here, and Meta is represented as well, folks who push open source, because these government organizations have to have people who promote open source too. And they've organized it as follows.
There are four focus areas; the open-access one enables AI researchers to access diverse AI resources via the NAIRR pilot portal.</p><p>[01:38:56] <strong>Alex Volkov:</strong> So definitely expect there to be government grants for GPUs for different things. I don't know how easily those will be obtainable, but we've had folks from Canada on before who talked about how you can ask for grants to train or fine-tune. The kind of stuff Tanishq was talking about, researching which open source model is better at medical QA, could happen through the government. They also focus on security, and there's something called NAIRR Classroom, which, let me see.</p><p>[01:39:22] <strong>Alex Volkov:</strong> Oh, it brings new communities into education, training, and user support. A very government-like approach. However, it's definitely good to see the companies that participate in this; it's not only government, it's the private sector as well. NVIDIA is there, AMD is there, EleutherAI, like we said, so open source folks are represented as well.</p><p>[01:39:43] <strong>Alex Volkov:</strong> A huge chunk of companies. It's good to see the government actually moving towards some standardization, which may be needed; hopefully less regulation, more standardization. And with that, we are pretty much done with the news for [01:40:00] this week, which was great.</p><p>[01:40:01] <strong>Alex Volkov:</strong> I want to say thank you. A huge thank you again, first of all to the listeners who come here and listen, and to the folks on stage who help me from week to week to bring you the latest and greatest in AI news.</p><p>[01:40:11] <strong>Alex Volkov:</strong> Thank you so much, and we'll let you go on this Thursday, and we'll see you next week.</p><p>[01:40:14] <strong>Alex Volkov:</strong> Take care, everyone. Bye bye.</p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-24-diffusion-transformers</link><guid isPermaLink="false">substack:post:141055137</guid><dc:creator><![CDATA[Alex Volkov, Eric Wollberg, Wes Louis, Vik, and Nisten]]></dc:creator><pubDate>Fri, 26 Jan 2024 01:13:03 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/141055137/27763db5e327bf29d4c4c69a50d62cfe.mp3" length="96715460" type="audio/mpeg"/><itunes:author>Alex Volkov, Eric Wollberg, Wes Louis, Vik, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6045</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/141055137/f4b0b97d942e00c9b68696adb6fcb917.jpg"/></item><item><title><![CDATA[📅 ThursdAI Jan 18 - Nous Mixtral, Deepmind AlphaGeometry, LMSys SGLang, Rabbit R1 + Perplexity, LLama 3 is training & more AI news this week]]></title><description><![CDATA[<p>👋 Hey there, been quite a week, started slow and whoah, the last two days were jam-packed with news, I was able to barely keep up! But thankfully, the motto of ThursdAI is, we stay up to date so you don’t have to! </p><p>We had a milestone, 1.1K listeners tuned into the live show recording, it’s quite the number, and I’m humbled to present the conversation and updates to that many people, if you’re reading this but never joined live, welcome! 
We’re going live every week on ThursdAI, 8:30AM pacific time.</p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Nous Hermes Mixtral finetune (<a target="_blank" href="https://twitter.com/Teknium1/status/1746990384738357731">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO">HF DPO version</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT">HF SFT version</a>)</p><p>* NeuralBeagle14-7B - From Maxime Labonne<strong> (</strong><a target="_blank" href="https://twitter.com/maximelabonne/status/1747350120067154227"><strong>X</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/mlabonne/Beagle14-7B"><strong>HF</strong></a><strong>,)</strong></p><p>* It's the best-performing 7B parameter model on the Open LLM Leaderboard (when released, now 4th)</p><p>* We had a full conversation with Maxime about merging that will release as a standalone episode on Sunday! 
</p><p>* LMsys - SGLang - up to 5x faster inference (<a target="_blank" href="https://twitter.com/lmsysorg/status/1747675649412854230">X</a>, <a target="_blank" href="https://lmsys.org/blog/2024-01-17-sglang/">Blog</a>, <a target="_blank" href="https://github.com/sgl-project/sglang">Github</a>)</p><p>* NeuralMagic applying #SparseGPT to famous models to compress them with 50% sparsity (<a target="_blank" href="https://twitter.com/neuralmagic/status/1747330381257252948">X</a>, <a target="_blank" href="https://arxiv.org/abs/2301.00774">Paper</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* 🔥 Google Deepmind solves geometry at Olympiad level with 100M synthetic data (<a target="_blank" href="https://twitter.com/GoogleDeepMind/status/1747651817461125352">Announcement</a>, <a target="_blank" href="https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/?utm_source=twitter&#38;utm_medium=social">Blog</a>)</p><p>* Meta announces Llama3 is training, will have <strong>350,000</strong> H100 GPUs (<a target="_blank" href="https://twitter.com/altryne/status/1748057569816416451/video/1">X</a>)</p><p>* OpenAI releases guidelines for upcoming elections and removes restrictions for war use (<a target="_blank" href="https://openai.com/blog/how-openai-is-approaching-2024-worldwide-elections">Blog</a>)</p><p>* Sam Altman (in Davos) doesn't think that AGI will change things as much as people think (<a target="_blank" href="https://twitter.com/altryne/status/1747652118033396124/video/1">X</a>)</p><p>* Samsung S24 has AI everywhere, including real-time translation of calls (<a target="_blank" href="https://twitter.com/MKBHD/status/1747680740429496829">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Meta releases MAGNet (<a target="_blank" href="https://twitter.com/reach_vb/status/1747011614815969628">X</a>, <a target="_blank" href="https://huggingface.co/collections/facebook/magnet-659ef0ceb62804e6f41d1466">HF</a>)</p><p>* 
<strong>AI Art & Diffusion & 3D</strong></p><p>* Stable Diffusion runs 100% in the browser with WebGPU, Diffusers.js (<a target="_blank" href="https://twitter.com/cyrildiagne/status/1746580637379690751">X thread</a>)</p><p>* DeciAI - Deci Diffusion - A text-to-image 732M-parameter model that’s 2.6x faster and 61% cheaper than Stable Diffusion 1.5 with on-par image quality</p><p>* <strong>Tools & Hardware</strong></p><p>* Rabbit R1 announces a deal with Perplexity, giving a full year of Perplexity Pro to Rabbit R1 users; Perplexity will be the default search engine on Rabbit (<a target="_blank" href="https://www.perplexity.ai/search?q=%s&#38;focus=internet">link</a>)</p><p>Open Source LLMs </p><p>Nous Research releases their first Mixtral Finetune, in 2 versions DPO and SFT (<a target="_blank" href="https://x.com/NousResearch/status/1746988416779309143?s=20">X</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO">DPO HF</a>)</p><p>This is the first Mixtral finetune from Teknium1 and the Nous team, trained on the Hermes dataset. It comes in two variants, SFT and SFT+DPO, and is a really capable model; they call it their flagship! </p><p>This is the first Mixtral finetune to beat Mixtral Instruct, and it's potentially the best open source model available right now! 👏 </p><p>It's already available at places like Together endpoints, with GGUF versions by TheBloke, and I’ve been running this model on my Mac for the past few days. Quite remarkable considering it's only January and this is the best open chat model available to us. </p><p>Make sure you use ample system prompting for it, as it was trained with system prompts in mind. </p><p>LMsys new inference 5x with SGLang & RadixAttention (<a target="_blank" href="https://lmsys.org/blog/2024-01-17-sglang/">Blog</a>)</p><p>LMSys introduced SGLang, a new interface and runtime for improving the efficiency of large language model (LLM) inference.
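</p><p>A big part of the speedup comes from reusing KV cache across requests that share a prompt prefix (the RadixAttention idea from the blog title). As a toy illustration of the lookup idea only, not SGLang's actual code (names here are made up), a prefix cache might look like this:</p>

```python
# Toy illustration of RadixAttention-style prefix reuse (not SGLang's
# actual implementation): requests that share a prompt prefix reuse
# the KV cache computed for that prefix instead of recomputing it.

class PrefixCache:
    def __init__(self):
        # Maps a tuple of prefix tokens to a (mock) KV-cache handle.
        self.entries = {}

    def longest_prefix(self, tokens):
        """Return the longest cached prefix of `tokens` and its handle."""
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.entries:
                return list(key), self.entries[key]
        return [], None

    def store(self, tokens, handle):
        self.entries[tuple(tokens)] = handle


cache = PrefixCache()
system_prompt = ["<sys>", "You", "are", "a", "helpful", "assistant"]
cache.store(system_prompt, "kv:system")

# A new request that shares the system prompt only needs the KV cache
# computed for its unshared suffix.
request = system_prompt + ["What", "is", "RadixAttention", "?"]
shared, handle = cache.longest_prefix(request)
suffix = request[len(shared):]
print(f"reused {len(shared)} tokens, recompute {len(suffix)}")  # reused 6 tokens, recompute 4
```

<p>Per the LMSys blog, the real runtime keeps these shared prefixes in a radix tree over the GPU KV cache with LRU eviction and handles many reuse patterns (few-shot prompts, multi-turn chat, tree search); the sketch above only shows the core lookup idea.</p><p>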
It claims to provide up to 5x faster inference speeds compared to existing systems like Guidance and vLLM. </p><p>SGLang was designed to better support complex LLM programs through features like control flow, prompting techniques, and external interaction. It co-designs the frontend language and backend runtime.</p><p>- On the backend, it proposes a new technique called RadixAttention to automatically handle various patterns of key-value cache reuse, improving performance. </p><p>- Early users like LLaVa reported SGLang providing significantly faster inference speeds in their applications compared to other options. The LMSys team released code on GitHub for others to try it out.</p><p>Big CO LLMs + APIs</p><p>Meta AI announcements (link)</p><p>These #BreakingNews came during our space, Mark Zuckerberg posted a video on Instagram saying that Llama3 is currently training, and will be open sourced! </p><p>He also said that Meta will have <strong>350K</strong> (that’s not a typo, 350,000) H100 GPUs by end of the year, and a total of <strong>~600,000 H100</strong> equivalent compute power (including other GPUs) which is… 🤯 (and this is the reason why I had to give him double GPU rich hats)</p><p>Deepmind releases AlphaGeometry (<a target="_blank" href="https://deepmind.google/discover/blog/alphageometry-an-olympiad-level-ai-system-for-geometry/?utm_source=twitter&#38;utm_medium=social">blog</a>)</p><p>Solving geometry at the Olympiad gold-medalist level with 100M synthetic examples</p><p>AlphaGeometry is an AI system developed by Google DeepMind that can solve complex geometry problems on par with human Olympiad gold medalists</p><p>It uses a "neuro-symbolic" approach, combining a neural language model with a symbolic deduction engine to leverage the strengths of both</p><p>The language model suggests useful geometric constructs to add to diagrams, guiding the deduction engine towards solutions</p><p>It was trained on over 100 million synthetic geometry examples 
generated from 1 billion random diagrams </p><p>On a benchmark of 30 official Olympiad problems, it solved 25 within time limits, similar to the average human medalist</p><p>OpenAI releases guidelines for upcoming elections. (<a target="_blank" href="https://openai.com/blog/how-openai-is-approaching-2024-worldwide-elections">Blog</a>)</p><p>- OpenAI is taking steps to prevent their AI tools like DALL-E and ChatGPT from being abused or used to spread misinformation around elections</p><p>- They are refining usage policies for ChatGPT and enforcing limits on political campaigning, impersonating candidates, and discouraging voting</p><p>- OpenAI is working on technology to detect if images were generated by DALL-E and labeling AI-generated content for more transparency  </p><p>- They are partnering with organizations in the US and other countries to provide users with authoritative voting information through ChatGPT</p><p>- OpenAI's goal is to balance the benefits of their AI while mitigating risks around election integrity and democratic processes</p><p>Microsoft announces Copilot Pro</p><p>Microsoft announced new options for accessing Copilot, including Copilot Pro, a $20/month premium subscription that provides access to the latest AI models and enhanced image creation. </p><p>Copilot for Microsoft 365 is now generally available for small businesses with no user minimum, and available for additional business plans. </p><p>This week’s Buzz (What I learned with WandB this week)</p><p>Did you know that ThursdAI is not the FIRST podcast at Weights & Biases? (Shocking, I know!) </p><p>Lukas, our CEO, has been a long-time host of the Gradient Dissent pod, and this week, we had two of the more prolific AI investors on as guests, Elad Gil and Sarah Guo. </p><p>It’s definitely worth a listen. It’s more of a standard 1:1 or sometimes 1:2 interview, so after you finish with ThursdAI and are seeking more of a deep dive, it’s definitely recommended to extend your knowledge.
</p><p>AI Art & Diffusion</p><p>Zero shot face adapted image gen - 3 different tech approaches </p><p>What used to take ages now takes seconds: there are quite a few approaches to generating images with real human faces in a zero-shot capacity, providing just a few face photos. Gradio folks call it Zero-shot face-adapted image generation and there are 3 tools to generate those:</p><p> 1⃣IPAdapter</p><p> 2⃣PhotoMaker</p><p> 3⃣InstantID</p><p>Here’s a great <a target="_blank" href="https://x.com/Gradio/status/1747688294333505963?s=20">summary thread</a> from Gradio folks for this fast advancing field! Remember we had to finetune on faces for a long time? DreamBooth and then LoRAs, and now we have this exciting development. </p><p>Tools & Hardware</p><p>Rabbit R1 partners with Perplexity</p><p>The R1 device that was just announced is about to sell through its first 50K units in just a few days, which is remarkable. I definitely pre-ordered one, and can’t wait to get my hands on it. Jesse, the founder, has been all over X, getting incredible recognition, and after a few conversations with Aravind Srinivas, they agreed to make a deal right on X.</p><p>Today they hopped on a space and announced that all the first 100K early buyers of Rabbit are going to get a full year PRO subscription of Perplexity (one of the best AI search engines out there) for free! I sure as heck didn’t expect it, but the email was sent just a few minutes after the X space, and now guess who uses Perplexity Pro? </p><p>Here’s <a target="_blank" href="https://www.perplexity.ai/search/What-episode-did-nASHIVVFSci9aICuNi7XhA?s=c#9c048721-5545-49c8-bd68-80ae362ed784"><strong>an example</strong></a> of Perplexity searching ThursdAI content (it doesn’t always get it right tho)!
</p><p>I guess that’s it for today. As I’m writing this, there’s other incredible stuff getting released: Codium open sourced AlphaCodium (here’s <a target="_blank" href="https://twitter.com/itamar_mar/status/1747957348293824676">a link to</a> the founder talking about it) but I didn’t have a second to dive into this; hopefully I’ll bring Itamar to ThursdAI next time and chat about it! </p><p>Have a great weekend all 🫡  (please give us a good review on Apple iTunes, apparently it really helps discovery!) </p><p>Full Transcription for convenience: </p><p>[00:00:02] <strong>Alex Volkov:</strong> Hey everyone, happy Thursday. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases, and this is ThursdAI.</p><p>[00:00:13] <strong>Alex Volkov:</strong> We had such a great show today, over 1100 of you tuned in to the live recording, which is incredible.</p><p>[00:00:30] I also wanted to say that if you're not subscribed to <a target="_blank" href="thursdai.news">thursdai.news</a> newsletter, please go ahead and do so, because I send a full blog with the links to the show notes and to the speakers that we have on stage, and you should be able to follow up.</p><p>[00:00:46] <strong>Alex Volkov:</strong> There's a bunch of multimedia, like videos, that are not coming through in the audio only podcast format. So please subscribe to thursdai.news as well. This live recording, we also hosted Maxime Labonne, who's a senior machine learning scientist with J.P. Morgan,</p><p>[00:01:04] <strong>Alex Volkov:</strong> and the author of several models and merged models, lately the NeuralBeagle model that we've talked about. We had a great conversation with Maxime. And that full episode will be posted as a Sunday special evergreen content episode.
So please stay tuned for that.</p><p>[00:01:29] <strong>Alex Volkov:</strong> It's been an incredibly illuminating conversation in the world of merging and MergeKit and everything else that Maxime does, and it was a super cool conversation. So that's coming soon.</p><p>[00:01:41] <strong>Alex Volkov:</strong> And, as I've been doing recently, the following is going to be a 7 minute segment, from the end of the live recording, summarizing everything we've talked about.</p><p>[00:01:54] <strong>Alex Volkov:</strong> I hope you've been enjoying these TLDR intros. Please let me know in the comments if this is something that's helpful to you.</p><p>[00:02:05] ThursdAI Jan18 TL;DR recap by Alex</p><p>[00:02:05] <strong>Alex Volkov:</strong> Alright, we started with talking today, ThursdAI, January 18th. We were talking about Nous Hermes, the Mixtral fine tune that came out from Teknium and the folks at Nous. It was one of the first fine tunes of Mixtral, the mixture of experts model from Mistral, that came from the Nous Research folks.</p><p>[00:02:35] <strong>Alex Volkov:</strong> And it released in two versions, the SFT only version and the SFT plus DPO version, given the different datasets they were trained on and actually different capabilities. It looks, based on the community, like the DPO version is very well performing. I've been running this on my MacBook with LM Studio and it really performs well.</p><p>[00:02:53] <strong>Alex Volkov:</strong> So shout out, and folks should try this. This looks like by far the best new Hermes model based on just benchmarks. They're trained on the best open source model, which is currently Mixtral. Mixtral is number 7 in the world based on LMSys Arena, and that's an open source model that we all get to use.</p><p>[00:03:10] <strong>Alex Volkov:</strong> Then we've covered NeuralBeagle14-7B from Maxime Labonne.
Maxime also joined us for a full interview that you can hear as part of the podcast episode, and Maxime released NeuralBeagle, which is a merge plus a DPO fine tune. And it's one of the top performing 7 billion parameter models on the Open LLM Leaderboard.</p><p>[00:03:30] <strong>Alex Volkov:</strong> It was released just a few days ago, and now it's fourth. So the speed with which things change is quite incredible. We then covered LMSys's SGLang, an attempt at 5x inference performance via a bunch of techniques together on the frontend and the backend: RadixAttention on the backend, and the SGLang way to run inference code on the frontend, which combine into almost a 5x performance improvement on inference.</p><p>[00:03:56] <strong>Alex Volkov:</strong> 5x is incredible. Nisten mentioned that it does less than 5x on longer sequences, and then we had a conversation about where it could improve significantly, which is agents, since agents are sending short sequences. Alignment Lab told us that this could be a significant improvement in that area.</p><p>[00:04:13] <strong>Alex Volkov:</strong> So our agents are about to run way faster. A 5x improvement is just incredible. And we also mentioned that on the same day this was released, another optimization was shouted out by Tim Dettmers of QLoRA fame, called Marlin, that also improves some significant inference techniques by 4x.</p><p>[00:04:34] <strong>Alex Volkov:</strong> And I wonder if those can be compiled together in some way. Quite impressive. We also covered Neural Magic doing sparsification, and we did a short deep dive. Thank you, Alignment, and thank you, Austin, for explaining what sparsification means. They do this for major models and compress them with sparsification to around 50% sparsity.</p><p>[00:04:55] <strong>Alex Volkov:</strong> It's zeroing out the weights that you don't actually use. And it makes the models significantly smaller.
We covered DeciLM a little bit. We didn't actually get to the diffusion. I'll just read out those updates as well. Then we covered that OpenAI had new guidelines for upcoming elections, and they're trying to add techniques for folks to identify DALL-E generated images.</p><p>[00:05:18] <strong>Alex Volkov:</strong> And they're adding restrictions to how their LLMs are used in the context of voter suppression, etc. We then talked about DeepMind and AlphaGeometry, where DeepMind released and open sourced a model called AlphaGeometry that uses a neuro-symbolic approach with two models, and solves geometry at almost a gold medal level at the Olympiad.</p><p>[00:05:42] <strong>Alex Volkov:</strong> So geometry Olympiads, and quite an impressive release from DeepMind, shout out. It was trained on a hundred million synthetic examples, sourced from more than one billion or so random diagrams, and it's quite impressive. So shout out DeepMind as well. We also briefly mentioned Samsung, which has the Samsung S24, the flagship phone that Apple now needs to compete with, that has AI everywhere, uses the new Qualcomm chip, and has AI in pretty much everything,</p><p>[00:06:10] <strong>Alex Volkov:</strong> summarization everywhere. There's like a button with the sparkles with AI. And one cool thing that we haven't mentioned, but I saw in MKBHD's review on Twitter, is that they added real time translation of calls. So you can literally call people who speak a different language, and on-device translation, after you download the model on device, will actually be able to translate this in real time.</p><p>[00:06:30] <strong>Alex Volkov:</strong> So you can read what the other person said in a different language, but also hear it. And that's quite cool. Then we had a deep interview with Maxime Labonne, the author of many things. Recently, we've talked about Phixtral, the mixture of experts of the Phi models.
We've talked about merges.</p><p>[00:06:46] <strong>Alex Volkov:</strong> Maxime had a great explanation on his blog, and then on the Hugging Face blog, about what merges are, what MergeKit does, and how that plays into the whole ecosystem. The top LLM leaderboard has now been taken over by merges, likely because merging models does not require additional compute or additional training, and it's fairly easy to do with just code. Merging takes and combines,</p><p>[00:07:11] <strong>Alex Volkov:</strong> using different algorithms like SLERP and others, different weights from different models, including potentially building models of novel sizes. So we've seen 10 billion parameter models, and even 120 billion parameter ones, so you can use those techniques to combine or merge models in different ways.</p><p>[00:07:31] <strong>Alex Volkov:</strong> There's also FrankenMerge, which combines different models into one. So we dove into that, the inspiration for merging, and what it actually does. Maxime also released LazyMergekit, which is a thin wrapper on top of MergeKit from Charles Goddard. So shout out to Charles.</p><p>[00:07:47] <strong>Alex Volkov:</strong> So we had a very interesting interview about merging, and thank you, Maxime, for joining us. Definitely worth a listen as well. And then we had breaking news from Big Zuck and the Meta team: he gave an update about the number of GPUs that they have. And by the end of this year, they're talking about 350,000 H100s, and overall 600,000 H100s or equivalents of compute, which they're going to use for AI and the metaverse.</p><p>[00:08:14] <strong>Alex Volkov:</strong> Definitely a great update. They're training Llama 3 right now. There's stuff that we didn't get to, but I wanted [00:08:20] to update on, and I will add it in the show notes.
There's Stable Diffusion code that runs 100 percent in the browser with WebGPU and Diffusers.js, a thread from ClipDrop CEO Cyril Diagne.</p><p>[00:08:32] <strong>Alex Volkov:</strong> And there's also, we've talked about DeciAI, the company that releases a bunch of models. They released DeciDiffusion, a text to image model with only 732 million parameters. It's twice as fast and 61 percent cheaper than Stable Diffusion with the same image quality, so that's getting improved.</p><p>[00:08:51] <strong>Alex Volkov:</strong> But I think they're talking about Stable Diffusion 1.5, so not SDXL or the new one. And DeciAI also released DeciCoder, and we also covered Stable Code, a coding model that runs locally on device, a 3 billion parameter model that beats Code Llama 7B. I think that's most of the stuff we talked about.</p><p>[00:09:09] <strong>Alex Volkov:</strong> And then one of the major things that Umesh brought: we've talked about corporate drama, maybe a new segment in ThursdAI, where Microsoft did some things that actually disrupted workflows and companies' actual products built on top of Microsoft, which is considerably not great, and led to a fight.</p><p>[00:09:30] <strong>Alex Volkov:</strong> Hopefully not, but potentially a legal battle as well, and that's not something that should be done by a cloud provider such as Microsoft. Very ugly. In addition to this, we also talked about Microsoft announcing Copilot Pro, which is now open for small businesses for 20 bucks a month with no minimum seats as well.</p><p>[00:09:46] <strong>Alex Volkov:</strong> And I think that's most of the things that we've mentioned</p><p>[00:09:49] <strong>Alex Volkov:</strong> Let's go.</p><p>[00:09:51] <strong>Sounds:</strong> to all of you.</p><p>[00:09:57] <strong>Alex Volkov:</strong> from, I guess</p><p>[00:09:59] <strong>Sounds:</strong> all of you.
Namaskaram to</p><p>[00:10:07] <strong>Alex Volkov:</strong> 2024, we all need to get used to saying 2024; at this point we have a bunch of AI news. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases, and I'm joined on stage here with dear friends, co-hosts of ThursdAI. Podcast, newsletter, live X recording, community, I don't know, a bunch of other stuff as well.</p><p>[00:10:29] <strong>Alex Volkov:</strong> Nisten does paper readings, and is a semi-regular part of this as well. Welcome everyone. Welcome.</p><p>[00:10:33] Introduction to the Session's Structure</p><p>[00:10:33] <strong>Alex Volkov:</strong> I will just say a few things before we get started. So first of all, for those of you who are new, who are listening to this for the first time, first of all, welcome.</p><p>[00:10:41] <strong>Alex Volkov:</strong> It's great that you have found us. Please DM me with how you found us. I would love to know, as I'm looking into the channels, et cetera. However, I will say that we've been here every week, pretty much at the same time. I don't think we've changed time since the summer.</p><p>[00:10:55] <strong>Alex Volkov:</strong> So 8:30 AM Pacific, and we try to do this every Thursday. I think we missed one or two. I was sick once, apologies. But other than that, we're here to talk about AI every week. And what happens often is, as we talk about things, different breaking news happens and folks announce different stuff on Thursdays, and we cover pretty much everything. A very broad spectrum of AI changes. So I know there's like spaces to talk about diffusion, specifically art spaces as well. So we cover diffusion to an extent, but, I guess, our main focus is open source LLMs. We love those. We have a bunch of folks here on stage.
They're training and fine tuning the greatest kind of open source models, and definitely follow up on the different, how should I say, different techniques, like the merging stuff that we're going to talk about at length later, and we hopefully get to hear about them first before they take over Hugging Face, which was the case, I think, with some of the models and some of the techniques.</p><p>[00:11:54] <strong>Alex Volkov:</strong> And I see two more folks joining us as well from different areas of the open source community. So I will say welcome LDJ and welcome Alignment. LDJ, you've been missing in action. I was just saying, how are you, man? Welcome back.</p><p>[00:12:08] <strong>Luigi Daniele:</strong> Yeah, I'm doing good. Glad to be</p><p>[00:12:10] <strong>Alex Volkov:</strong> Yeah. And also we have Austin AKA Alignment Lab. What's up Austin?</p><p>[00:12:16] <strong>Alignment Lab:</strong> Oh, dude, I'm doing great. I was actually just in a call with LDJ and he was like, oh, ThursdAI is starting and I was like, let's go.</p><p>[00:12:22] <strong>Alex Volkov:</strong> Yeah, that's exactly what I like to hear, that the calendar event is popping off and ThursdAI is starting.</p><p>[00:12:27] Open Source AI: Nous Hermes Mixtral Finetune + DPO deep dive</p><p>[00:12:27] <strong>Alex Volkov:</strong> So with that, I think it's time for the open source stuff.</p><p>[00:12:44] <strong>Sounds:</strong> Open Source AI, let's get it started.</p><p>[00:12:48] <strong>Alex Volkov:</strong> All right, so welcome to probably the biggest, the most fun, the most content-full section of ThursdAI, where we talk about open source LLMs and LMMs.
I guess we should also start mentioning those, because a bunch of these models that we see are also multimodal, and I guess we'll start with</p><p>[00:13:08] <strong>Alex Volkov:</strong> the Nous Hermes fine tune on Mixtral. We've been waiting for this. Mixtral was released, I want to say, a month or so ago, a month and a half ago, and now we're getting one of the top kind of datasets and fine tunes trained on Mixtral, and we're getting this in multiple formats.</p><p>[00:13:25] <strong>Alex Volkov:</strong> Again, shout out Teknium. If you guys don't follow Teknium yet, what are you even doing showing up on Thursday? Definitely give Teknium a follow. But the Mixtral fine tune is available and it comes in two variants: SFT plus DPO, and SFT only. SFT is supervised fine tuning, and DPO is direct preference optimization.</p><p>[00:13:45] <strong>Alex Volkov:</strong> This is not a new technique; it has definitely been around for a while. Many people are using DPOs at this point. We've talked about DPO multiple times. I think we also saw, Nisten, correct me if I'm wrong, that the actual Mixtral Instruct is also DPO, right? We saw this in the paper.</p><p>[00:14:00] <strong>Alex Volkov:</strong> So DPO is everywhere. And this is not the first time that the SFT and DPO pair is getting released separately. I think we've chatted with Jon Durbin, who, shoutout Jon, is in the audience. And that conversation is on the feed. So definitely check out the conversation with Jon.</p><p>[00:14:16] <strong>Alex Volkov:</strong> And the Bagel models were also released separately with SFT and the DPO version as well. And I think Jon back then mentioned that each one has different things it's good at. And I also would love to figure out which one of the new Nous Hermes Mixtral fine tunes is best at what.</p><p>[00:14:33] <strong>Alex Volkov:</strong> Teknium has a bunch of stuff in the thread, so I'll link this below for examples.
And I will say, on the comparisons to Mixtral Instruct: Teknium posted a bunch of comparisons to Mixtral Instruct. And it's interesting that not all of the benchmarks look like improvements.</p><p>[00:14:51] <strong>Alex Volkov:</strong> There's a few, I think on GPT4All and HellaSwag, where the base model, at least the non-DPO base model, still wins just by a little bit. But everything else, like ARC, AGIEval, and MMLU, are significant improvements. And we're gonna probably continue to see those improvements. Shoutout. If you have tried it, please let me know.</p><p>[00:15:08] <strong>Alex Volkov:</strong> I will say this last thing, that finally, after setting up LM Studio again, shoutout to LM Studio, we'll get to chat with LM Studio at one point, hopefully soon. The first thing I do now is download these models because it's super, super easy. Both LM Studio and Ollama; there was a tiny, I think, quantization thing in the beginning, and now there isn't, and now it works great.</p><p>[00:15:33] <strong>Alex Volkov:</strong> And these models, I've loaded them up on my Mac before a flight. And I was just able to chat with this AI with no internet connection or a poor internet connection. It was really something. I know we've talked about this multiple times. Hey, put this on a thumb drive and then have all of human knowledge, quote unquote.</p><p>[00:15:51] <strong>Alex Volkov:</strong> I'm not really saying it's all human knowledge, but I've been actually able to do this before my flight and it was really cool.</p><p>[00:15:57] <strong>Alex Volkov:</strong> And I think the last thing to mention here is that Teknium suggests making liberal use of system prompts. So all of the Hermes models; there's now a bunch of Hermes models flying around, definitely.
The most famous one is, I think, Hermes 7B, but there's also the Yi version, and this seems to beat the Yi version as far as our friend Wolfram Ravenwolf tested.</p><p>[00:16:22] <strong>Alex Volkov:</strong> This is probably the best Nous model out of them all so far. Obviously it's based on the best open source model, Mixtral, and definitely make liberal use of system prompts. Yeah, roleplay is suggested, setting expectations, specifications, and everything else you can think of. Very easy to do with LM Studio.</p><p>[00:16:39] <strong>Alex Volkov:</strong> I haven't [00:16:40] dove into how to actually steer these models for exactly the task that I do. LDJ, you said that you want to tell me how to use LM Studio in regard to this. So I would love to hear from you. First of all, have you had a chance to try these models specifically? And second of all, let's talk about system prompts in LM Studio a little bit, because I think it's a part that people are definitely missing.</p><p>[00:17:02] <strong>Luigi Daniele:</strong> Yeah. A lot of the latest models, like Hermes and I think maybe Dolphin too, are trained with system prompts. So if you really want to get the best use out of it, definitely use that, and it's the same thing with ChatGPT really, where you give instructions of how you maybe want to have it respond to you, or maybe add in a few threats of what you would do to the AI if it does not respond correctly, and surprisingly that seems to actually sometimes</p><p>[00:17:28] <strong>Luigi Daniele:</strong> give good results. I personally try to always say please and thank you, but yeah yeah.
And there's also prefixes and suffixes, which I think I talked to you about, Alex,</p><p>[00:17:36] <strong>Alex Volkov:</strong> You briefly mentioned this, but maybe worth giving a little bit of a heads up for folks.</p><p>[00:17:41] <strong>Luigi Daniele:</strong> yeah, I think it really is worth maybe just a sit down and a video with me and you actually going through it, because,</p><p>[00:17:47] <strong>Alex Volkov:</strong> Sure.</p><p>[00:17:47] <strong>Luigi Daniele:</strong> it's a decent amount to go through, but, yeah, on the model card of most models, if you just look at something called prefix or suffix that is usually described in the model card, then you apply that to the LM Studio settings on the right panel in the chat settings.</p><p>[00:18:03] <strong>Luigi Daniele:</strong> And yeah, you just make sure you have those things right. If you don't, there's a good chance you're not actually using the model correctly. And it's not going to give you the best results.</p><p>[00:18:10] <strong>Alex Volkov:</strong> And they differ by base model as well. Like we've seen different base models have different things that you want to add there. And you may think you're getting the same performance, but it's underperforming a little bit. I'll also say, for folks who are using Macs with Apple Silicon, there's a little hidden checkbox there that I don't know if it's on by default already.</p><p>[00:18:30] <strong>Alex Volkov:</strong> It's called use Apple Metal. And definitely make sure that's on for you. Significant improvement in performance and inference. Alright, so I think that's Nous Hermes. Anything else from folks here on stage who want to talk about this model, how it was trained, and the difference with DPO? Folks, feel free to chime in.</p><p>[00:18:45] <strong>Alignment Lab:</strong> The cool thing about DPO is it's a reinforcement learning technique.
I don't know if anyone else has had a chance to read the paper about it, but essentially what occurred was that some researchers found that transformers already have a baked-in optimal reward function.</p><p>[00:19:03] <strong>Alignment Lab:</strong> And so what DPO is really doing is just training the model on that reward function, just biasing it towards the selected good example when you give it good and bad example pairs. Not directly unique to this model, but it is super interesting because it really opens up a whole bunch of possibilities for what you can do with the model, now that you can give it negative examples and get more performance out of it.</p><p>[00:19:27] <strong>Alex Volkov:</strong> DPO is ranking different outputs in terms of preference. So can you talk about the pairs stuff? Everybody says DPO pairs, like what do they mean by pairs? Could you say more about this?</p><p>[00:19:38] <strong>Alignment Lab:</strong> Instead of training on, like, typically what you would do is you would build your dataset. And that would be like your good dataset. You'd have a weaker model than the one that you used to synthesize the dataset, or just bad examples of responses, for every single example in the dataset.</p><p>[00:19:54] <strong>Alignment Lab:</strong> So if you have one that's like, how do I make a cup of tea? And then instructions about how to make a cup of tea, then you'd also have that paired with a negative example: a response to how do I make a cup of tea, where the response is something else, like how to build a Lego house or whatever.</p><p>[00:20:08] <strong>Alignment Lab:</strong> And when you go to actually train it, you show it both at once, and you tell it which one is the positive and which one's the negative, and you just bias it towards the positive.
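The pairwise objective described here can be made concrete with a minimal, illustrative DPO loss in plain Python. The beta value and log-probabilities below are toy numbers of my own, not from any real model; real implementations compute these per response from the policy being trained and a frozen reference copy of it:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one (good, bad) pair.

    Each argument is the summed log-probability a model assigns to a
    response: 'policy' is the model being trained, 'ref' is the frozen
    starting model. The loss shrinks as the policy prefers the chosen
    response more (relative to the reference) than the rejected one.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)): the standard Bradley-Terry preference loss
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Toy numbers: a policy that has drifted toward the chosen answer...
loss_better = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# ...versus a policy that has not moved from the reference at all.
loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)
```

The beta coefficient controls how far the policy is allowed to drift from the reference model while chasing the preference signal; the untrained case gives exactly -log(0.5), and any shift toward the chosen response lowers the loss.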
It's quite similar, conceptually, to the way that OpenChat does the C-RLFT training, although OpenChat actually has a specific token for the good and bad examples that it has weighted.</p><p>[00:20:34] <strong>Alignment Lab:</strong> But functionally, the idea is the same. You're just doing reinforcement learning, which lets you take data where you may have bad examples in there, and rather than having to remove them and waste data, you can now make a good example and get more out of it than you would have by just replacing it.</p><p>[00:20:50] <strong>Alignment Lab:</strong> So it lets you recoup extra performance out of bad data.</p><p>[00:20:54] <strong>Alex Volkov:</strong> Thanks for the explanation. And definitely, we've seen, at least in my playing with the bigger model and the DPO version of Nous Hermes Mixtral, this feels like the DPO version behaves a little better. I actually don't know how to attribute this to the technique or to the datasets, but it's really good.</p><p>[00:21:13] <strong>Alignment Lab:</strong> Yeah, we've noticed if we do a regular supervised fine tune first, like just normal fine tuning, and then we DPO over that, the models push just much further than either thing alone, too. I don't know if that's unilaterally true, because we do a fairly specific kind of model when we make these big releases, but it seems, at least for the case of just general reasoning skill, it helps a lot.</p><p>[00:21:37] <strong>Alex Volkov:</strong> Yeah, it's super cool. And I guess the downside of this, not the downside, but the outcome of some of this, is for folks who want to just use a model and are maybe tuning in to ThursdAI to know which model is good to use, or maybe they're reading the LocalLLaMA stuff.
So maybe we should do like a recap and also a simplification, LDJ, for system messages and the prefixes, Alignment, with DPO versus SFT. Just simplify and say, hey folks, use this. Because right now there's so many, you can choose between quantization methods.</p><p>[00:22:11] <strong>Alex Volkov:</strong> There's at least four or five different ones for you to choose from. And LM Studio says on a few of them, use this, it's recommended, but it says recommended for five different ones. There's different quantization providers as well, right? So TheBloke is obviously the most familiar one,</p><p>[00:22:26] <strong>Alex Volkov:</strong> there's now a choice between DPO or SFT or DPO plus SFT, and we haven't even begun to talk about merges, which is coming as well. So there's a lot of choice and we need to simplify this for folks. So definitely, just to simplify: the Hermes models are usually very well behaved and great for roleplay as well.</p><p>[00:22:43] <strong>Alex Volkov:</strong> Try them out. If you have the room to run Mixtral for your stuff, Mixtral is definitely by far the best open source model that we have. Go ahead, Levent.</p><p>[00:22:52] <strong>Alignment Lab:</strong> Yeah, so Mixtral, that model's architecture is very similar to a comparatively old architecture that's been tried and true before. And so because of that, there's a lot of efficiencies that we just haven't integrated into the modern stack, but that will come.</p><p>[00:23:09] <strong>Alignment Lab:</strong> And there's a bunch of new ones that people have been making. And between the new quantization methods that you can do with Mixtral, because since it's a sparse MoE, it doesn't actually need all of its weights as much as each other. So some of them are, like, less important.
It lets you quantize those quite a lot without actually hurting the model's performance very much.</p><p>[00:23:27] <strong>Alignment Lab:</strong> And you can also offload these layers when they're not being used. And then you can do expert pre-caching, where you predict some experts ahead of time, which lets you get faster inference speed. And at the end of the day, if QuIP#, which is a 2-bit quantization method, continues to prove out that it's as performant as it claims, we could end up running Mixtral on 4 gigs of VRAM, like on a laptop.</p><p>[00:23:58] <strong>Alex Volkov:</strong> And</p><p>[00:23:59] <strong>Nisten Tahiraj:</strong> We will.</p><p>[00:24:00] <strong>Alex Volkov:</strong> we will.</p><p>[00:24:00] <strong>Nisten Tahiraj:</strong> it to perform a bit better.</p><p>[00:24:02] <strong>Alex Volkov:</strong> So I guess this takes us to the next, go ahead, Nisten, and it's going to take us to the next optimization stuff.</p><p>[00:24:09] <strong>Nisten Tahiraj:</strong> We could definitely have it run on 4 gigs. I've had it a little above 4. However, the point is to have it run well. The quantization still makes it a little bit unfit for anything other than very short conversations. And we'll get it there.</p><p>[00:24:30] <strong>Alex Volkov:</strong> All right. So in this, in, in this</p><p>[00:24:32] <strong>Nisten Tahiraj:</strong> we'll have Mixtral under 4 gigs very soon and it'll be good.</p><p>[00:24:37] <strong>Nisten Tahiraj:</strong> Yes.</p><p>[00:24:37] <strong>Alex Volkov:</strong> And that's a promise. That's a promise.</p><p>[00:24:39] LMSys SGLang - increased inference by 5X</p><p>[00:24:39] <strong>Alex Volkov:</strong> So what happens is, once you go and put those bigger models on slower hardware, which is possible, you then wait a painfully long time for inference to actually happen. But this takes us to the next thing from the folks from LMSys.
They released fast and expressive LLM inference with RadixAttention and SGLang.</p><p>[00:24:59] <strong>Alex Volkov:</strong> So folks from [00:25:00] LMSys, if you guys remember, from models like Vicuna that took Llama and trained it on additional datasets, and LMSys Arena and all these places, like we definitely trust them at least with some of the evaluation stuff. I think, is MMLU also in LMSys's Arena? Or at least they test on MMLU. They released an inference optimization kind of collection of techniques.</p><p>[00:25:24] <strong>Alex Volkov:</strong> I don't think it's one specific technique because there's like RadixAttention. Yeah, go ahead.</p><p>[00:25:28] <strong>Alignment Lab:</strong> It's where all this was going in the first place between all these sort of different prompting programming frameworks and inference engines. What they've done is they built out the back end with the end goal of having an extremely controllable, steerable compiling system for programming outputs from, like, an AI in the way, like a Pydantic, or in the way that you would typically use sort of structured grammars and sampling techniques.</p><p>[00:25:58] <strong>Alignment Lab:</strong> And way more. It's hard to explain in, in summary in a way that's very easily grokkable without getting too technical, but it's a combination of many things that we've been doing individually, which were always gonna be one big thing, they just saw it first and did it first, and now, when you're looking at it, it seems very obvious that this is probably how things should look going forward</p><p>[00:26:17] <strong>Alex Volkov:</strong> so let's actually talk about</p><p>[00:26:18] <strong>Bluetooth:</strong> overall, just a</p><p>[00:26:19] <strong>Alex Volkov:</strong> they have. 
Yeah, they propose, like, co-designing the backend runtime and the frontend language, which is, like Alignment Lab said, a structured domain-specific language embedded in Python to control the inference generation process. It's called a domain-specific language, a DSL.</p><p>[00:26:35] <strong>Alex Volkov:</strong> I, I think many folks have been using some of this. I think DSPy as well is being, like, mentioned in the same breath. And then this language is, like, executed in interpreter mode or in compiler mode. And on the backend they have this RadixAttention technique for automatic and efficient KV cache reuse.</p><p>[00:26:53] <strong>Alex Volkov:</strong> I don't know if that's, like, MoE specific or not yet, but definitely. The combination of those two plus the code that they've released shows just incredible results. Like folks, we live in an age, and we've talked about multiple of those techniques. We live in the age where somebody like this can come up and say, Hey, here's an example of a set of techniques that if you use them, you get</p><p>[00:27:12] <strong>Alex Volkov:</strong> 5x improvement on inference. In the same breath that we're saying, Hey, we're going to take Mixtral and put it in 4GB, and we've seen this obviously with Stable Diffusion, which we're going to mention, that runs fully in the browser, we're now seeing releases like this from a very reputable place. A collection of techniques that have been used to some extent by some folks, and now all under one roof, under one, like, GitHub</p><p>[00:27:35] <strong>Alex Volkov:</strong> thing that actually improves the inference by 5x on all of the major evaluations, at least that they've tested, that we always talk about. So 5x on MMLU and HellaSwag, it's significantly more performant, all these things. Quite impressive. 
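The KV-cache reuse idea behind RadixAttention can be illustrated with a toy prefix cache. The real SGLang implementation manages GPU KV blocks with a radix tree; this sketch only shows why prompts that share a prefix (same system message, different questions) save compute:

```python
class PrefixCache:
    """Toy illustration of RadixAttention-style KV-cache reuse:
    store per-token 'KV state' keyed by token prefixes, so a new
    prompt only 'computes' the tokens past its longest cached prefix.
    """
    def __init__(self):
        self.cache = {}      # prefix tuple -> simulated KV state
        self.computed = 0    # counts simulated attention computations

    def run(self, tokens):
        tokens = tuple(tokens)
        # Find the longest already-cached prefix.
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tokens[:i] in self.cache:
                hit = i
                break
        # "Compute" only the uncached suffix, caching every new prefix.
        for i in range(hit, len(tokens)):
            self.computed += 1
            self.cache[tokens[:i + 1]] = f"kv({i})"
        return self.computed

cache = PrefixCache()
cache.run(["sys", "You", "are", "helpful", "Q1"])   # 5 fresh tokens
cache.run(["sys", "You", "are", "helpful", "Q2"])   # reuses 4, computes 1
print(cache.computed)  # 6 total computations instead of 10
```

This is also why, as discussed later in the episode, the win is biggest for many short, prefix-sharing calls and smaller for one long summarization prompt.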
One thing that I would definitely want to shout out is that the maintainer of LLaVA the LMM, kind of the visual Llama, also replied and said that the execution of LLaVA is actually written in the report itself.</p><p>[00:28:07] <strong>Alex Volkov:</strong> And it improves LLaVA execution by 5x as well. And by execution, I mean like inference speed, basically. So without going, like, too much into RadixAttention, because honestly, it's way too heavy for the space. It's quite incredible that we get stuff like this from places like LMSys, specifically in the area of running smaller models, sorry, running bigger models with smaller hardware.</p><p>[00:28:33] <strong>Alex Volkov:</strong> Go ahead, Nisten.</p><p>[00:28:36] <strong>Nisten Tahiraj:</strong> I'll say something. So it does automate a lot of the tricks that people have been pulling, and it works great for large amounts of smaller prompts. Once you go to longer prompts, the benefit is not that much compared to vLLM. I think it felt like five or ten percent faster compared to vLLM. So again, I haven't taken a very deep dive into it.</p><p>[00:29:01] <strong>Nisten Tahiraj:</strong> Just want to make people aware that it's fantastic for smaller prompts and stuff. But for longer ones, you don't necessarily need to switch your whole stack to it. vLLM still works fine. Yeah, I think if you're doing what you would normally be doing with vLLM, which is, like, processing large amounts of data or serving for just general purposes,</p><p>[00:29:24] <strong>Nisten Tahiraj:</strong> probably there's no need to switch your stack. I think what it specifically feels optimized for is agent frameworks, in which you have many models communicating short strings back to each other. One model wearing many hats. 
And the optimizations, just while we're on the topic, are crazy right now.</p><p>[00:29:43] <strong>Nisten Tahiraj:</strong> There's still three papers with major inference optimizations for Mixtral alone, as well as for vLLM, and they seem to compose everything pretty well. Having an alternative to vLLM that's similarly performant is huge, because vLLM is a big bottleneck on a lot of stacks because of the way that it handles attention offloading on the CPU.</p><p>[00:30:00] <strong>Nisten Tahiraj:</strong> It feels a lot like when llama.cpp got, like, offloading the same week that speculative decoding came out with Hugging Face Transformers, and everything just got a hundred times faster, like half a year ago or so.</p><p>[00:30:12] <strong>Alex Volkov:</strong> Yeah, it definitely felt like that day when LMSys released the SGLang optimization that we're just now talking about. I don't have a link for this, but also released from IST Austria is Marlin, which is a 4-bit kernel. I think the way I know it's cool is that Tim Dettmers from QLoRA retweeted this and said this is a huge step forward.</p><p>[00:30:33] <strong>Alex Volkov:</strong> And Tim Dettmers is the guy who, in CUDA MODE, codes CUDA kernels within like a night or something that others plan for 3 months and then finish. So I know that Tim Dettmers, when he says something is a huge deal, he probably knows what's up. So Marlin released the same day that, like, the SGLang released, and it's a linear kernel for LLM inference with near-ideal</p><p>[00:30:53] <strong>Alex Volkov:</strong> 4x speedup up to batch sizes of 16 to 32 tokens. And they came out pretty much the same day, yesterday, on January 17th. So I'm going to add this in the show notes. So Marlin is also, like, an exciting optimization. 
And Nisten, I fully agree with you, where we see these breakthroughs or collections of methods that suddenly are finally collected in the same place.</p><p>[00:31:11] <strong>Alex Volkov:</strong> A bunch of papers that haven't released code as well, or haven't played with different things. And it's very exciting to see them keep coming out, we're only at the beginning of this year. And I think to the second point that you just mentioned, with agent frameworks specifically, RAG, Retrieval Augmented Generation, this benefit is significant like you said, because of the short strings these agents communicate back and forth with each other.</p><p>[00:31:34] <strong>Alex Volkov:</strong> Last week we've talked with one such author from Crew AI; Crew specifically is an orchestration of different agents that do different tasks and coordinate and talk to each other, and improving inference there. Many of them run on GPT-4, and I haven't fully gotten into how to do this yet, but SGLang also says that their, like, LLM programming can actually work with various backends.</p><p>[00:31:55] <strong>Alex Volkov:</strong> So OpenAI as well, and Anthropic and Gemini and local models. That's very interesting if they actually improve OpenAI inference in Python. But DSPy RAG, so RAG on DSPy from Omar Khattab, is definitely mentioned in the SGLang report. I know I'm throwing like a lot of, a lot of acronyms at you guys.</p><p>[00:32:14] <strong>Alex Volkov:</strong> So SGLang is the stuff we talked about, the new language from the LMSys org that speeds up some stuff. DSPy I haven't talked about yet, so we'll cover it, but one of the tasks, on, on, on DSPy's RAG, so retrieval, is mentioned that it gets like a significant boost. Like Nisten and Austin said, not necessarily for longer context prompts.</p><p>[00:32:35] <strong>Alex Volkov:</strong> 30,000 tokens for summarization, maybe this technique that caches a bunch of 
stuff between calls is not going to be super helpful, but for fast execution of multiple things it's a definitely significant 5x. And like I think Alignment said, it's only the beginning of the optimization cycles that we see, and it's quite exciting to see them come out.</p><p>[00:32:56] <strong>Alex Volkov:</strong> I think we've covered two optimization techniques, SGLang, and then Marlin as well. I'll put a link in the show notes as well.</p><p>[00:33:03] NeuralMagic, compressing models with sparsification</p><p>[00:33:03] <strong>Alex Volkov:</strong> And I think now it's time to move to, yeah, one, one, one thing that we're going to chat about is Neural Magic, and I'll definitely defer to folks on stage. Feel free to talk about Neural Magic, because I saw [00:33:20] somebody told me it's cool, but I have no idea how to even simplify this.</p><p>[00:33:23] <strong>Alex Volkov:</strong> So Austin, if you want to take a lead on this one, definitely feel free.</p><p>[00:33:28] <strong>Alignment Lab:</strong> Okay, Neural Magic. This is actually the first conversation I think that me and LDJ both geeked out really hard on when we were talking, because we were both the only people the other person knew who even knew about this company. Neural Magic has been making miracles in the corner for years.</p><p>[00:33:44] <strong>Alignment Lab:</strong> I first got interested in them because they had made a BERT model that was initially, it was nearly, I think, a gig on your computer to run, and it spoke English perfectly well and all this other stuff. And they had compressed it to the point that the full model, completely on your computer, was like 15 megabytes, and what blew my mind was like, how does that even know English?</p><p>[00:34:06] <strong>Alignment Lab:</strong> And it was at like 96 percent the original accuracy, despite all of that. They specialize in these, like, optimization and compression techniques. 
And so what they do typically is they have a stack, which they wrote a paper about a while ago, which I'll post in the comments here.</p><p>[00:34:22] <strong>Alignment Lab:</strong> It's called Optimal BERT Surgeon, which is basically a process in which they have a teacher model and a student model. In the student model they use distillation in the more traditional sense than it's more commonly used in now, where you're just training on a model's output; they use the actual logits. They basically load both models in during the training run, and train the smaller model to behave like the larger model, and while they're doing that, they're also pruning it, which is, essentially, you reduce the weights that are not getting used during training to zero, which lets your computer not have to calculate them, so it moves much faster.</p><p>[00:34:58] <strong>Alignment Lab:</strong> And then they also quantize, which is where you reduce the accuracy. Basically, without getting too technical, you're literally summarizing the parameters of the model, such that it's literally a smaller file. And they do this all at once, which takes the larger model and compresses it into the student model that's starting out smaller, and then they're quantizing the student model and pruning it, so it's both running faster and literally getting smaller. And as far as I'm aware, there's nobody who's even coming close as far as being able to compress a model so much. And recently, I think about two months ago, we first saw that they're integrating transformers with Sparsify Alpha, which is now just out and it's called Sparsify on the GitHub.</p><p>[00:35:43] <strong>Alignment Lab:</strong> Totally check it out. You can make a tiny llama and do all that stuff to it and make it microscopic. It's amazing. And</p><p>[00:35:49] <strong>Alex Volkov:</strong> Here, Austin, just real quick. 
So we've been talking about quantization for folks who are, like, not following the space super closely. There's different quantization techniques, and some of them create, like, small files, but the performance, or like the accuracy, is getting lowered.</p><p>[00:36:03] <strong>Alex Volkov:</strong> How is sparsification different from quantization, at least on the basic level? Are they compatible? Could you use both of them on the same file? What is this thing, sparsification?</p><p>[00:36:15] <strong>Alignment Lab:</strong> So in reality, probably if it were, like, more accessible of a tool, we would all likely just be doing both every single training run. But since there's always new quantization techniques, it doesn't make sense to. But with sparsification, the specific difference is rather than taking the same model and reducing the accuracy of its calculations to make it smaller, the model's staying the same size physically on your drive, but you're reducing the weights that aren't getting used to a zero value.</p><p>[00:36:50] <strong>Alignment Lab:</strong> And what that does is just means your GPU just has to do fewer calculations for the model to do inference, and it makes it just much faster.</p><p>[00:36:59] <strong>Alex Volkov:</strong> All</p><p>[00:36:59] <strong>Nisten Tahiraj:</strong> Also, for the next BakLLaVA version, Neural Magic did make a CLIP model for us. So shout out to them. They were able to cut the size down to about four times smaller.</p><p>[00:37:14] <strong>Nisten Tahiraj:</strong> So we'll, we'll have that out soon. And yeah, also for anybody else that wants to learn about sparsity, just look up Nir Shavit on YouTube. N I R S H A V I T. He's the OG MIT professor that pioneered sparsity and has a lot of videos out, and Neural Magic is his company. 
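To tie the thread together, here's a toy sketch of the three moves in the stack described above, logit distillation, pruning (sparsification), and quantization. All function names are illustrative simplifications; Neural Magic's real tooling (SparseML/Sparsify) is far more sophisticated:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student): train the student to match the teacher's
    full output distribution, not just its final answers."""
    p, q = softmax(teacher_logits), softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sparsify(weights, sparsity=0.5):
    """Pruning/sparsification: same tensor shape, but the smallest
    weights become exact zeros that hardware can skip at inference."""
    k = int(len(weights) * sparsity)
    cutoff = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= cutoff else w for w in weights]

def quantize(weights, bits=2):
    """Quantization: same number of weights, fewer representable values,
    so the file on disk literally gets smaller."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / levels
    return [round((w - lo) / step) * step + lo for w in weights]

w = [0.81, -0.02, 0.40, 0.05, -0.63, 0.01]
print(sparsify(w))                       # [0.81, 0.0, 0.4, 0.0, -0.63, 0.0]
print(len(set(quantize(w, bits=2))))     # at most 4 distinct values at 2 bits
print(distillation_loss([4.0, 1.0], [4.0, 1.0]))  # 0.0 for a perfect match
```

The contrast the panel draws is visible here: `sparsify` keeps the tensor the same shape but skippable, while `quantize` keeps every weight but shrinks the set of values it can take, and the two compose.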
And yeah, it's looking really promising in the future because they can optimize at a deep level for CPU inference.</p><p>[00:37:45] <strong>Nisten Tahiraj:</strong> And it's not necessarily just quantization, it's also that they are reducing the amount of unused weights. So yeah, expect to see a lot more stuff about sparsity from the GPU poor side of the spectrum, because that's where the benefits are yet to be reaped.</p><p>[00:38:02] <strong>Nisten Tahiraj:</strong> Anyway, shout out to Neural Magic as well.</p><p>[00:38:04] <strong>Alex Volkov:</strong> Shout out to Nir Shavit and Neural Magic, it looks cool, and they just got into sparsifying fine tuned models as well. I think they sparsified some new models, and I don't know if they got to OpenChat yet, but I think some folks are waiting for Phi sparsification, definitely. The area of smaller models running on smaller hardware is advancing super, super fast.</p><p>[00:38:26] Stable Code from Stability AI - 3B coding model beating CodeLlama</p><p>[00:38:26] <strong>Alex Volkov:</strong> Let's move on, folks, because we've been in the open source area for quite a while, and then we also need to get to the end of our conversations here and start doing deep dives. So Stable Code was released from Stability. A brief review: it's a 3 billion parameter language model</p><p>[00:38:41] <strong>Alex Volkov:</strong> from Stability AI. It does code completion, and obviously it runs offline 'cause it's a small model and you can run it. They claim it can run on MacBook Airs as well. And they say something like without GPU. Interesting. Accurate completion across 18 languages at a level comparable to models twice its size.</p><p>[00:38:57] <strong>Alex Volkov:</strong> This is versus CodeLlama. Interesting comparison to CodeLlama at this point, because we've seen a bunch of other models already beat, I think, CodeLlama on different metrics. But people still compare themselves to the big dog. And it's very interesting. 
They use a multi stage process, pre training on natural language,</p><p>[00:39:15] <strong>Alex Volkov:</strong> fine tuning on code datasets to improve programming language performance. And it supports fill in the middle and expanded context sizes compared to previous versions of Stable Code. And I think, oh yeah, Stability now has like a commercial membership plan, because everybody's thinking about, okay, how is</p><p>[00:39:33] <strong>Alex Volkov:</strong> Stability going to make money. So they have this membership where you can use their models. So it's not like fully open source. I think you can use these models commercially if you participate in this membership, otherwise you can use them for research. So Stable Code, check it out. I think it's new on, on Hugging Face.</p><p>[00:39:48] <strong>Alex Volkov:</strong> I think from today, I believe,</p><p>[00:39:50] Discussion on Neural Beagle 7B & Model merging</p><p>[00:39:50] <strong>Alex Volkov:</strong> And I think the last thing that I want to chat about in open source just briefly is NeuralBeagle 7B from Maxime, who's in the audience and is going to come up hopefully in the interview in a few</p><p>[00:39:59] <strong>Alex Volkov:</strong> minutes, I want to say maybe 20 minutes, Maxime. NeuralBeagle, back when I added this to my notes, was the top performing 7 billion parameter fine tune in, in, in the open source LLM leaderboard. It's no longer the top performing, it was definitely number 4, at least.</p><p>[00:40:14] <strong>Alex Volkov:</strong> And it's a merge plus a DPO, that's what I saw from Maxime. A merge of, actually, interesting what it's a merge of, so let's go into the model card and check this out.</p><p>[00:40:24] <strong>Alex Volkov:</strong> But Maxime looks like he has a bunch of models, and NeuralBeagle, the, this NeuralBeagle14, 7 billion parameters, has an average of 60 on the, all the scores, 46 on AGI eval. 
And yeah, it's one of the top performing models and it's a merge of different things. And it already has a demo space that I'll link in the show notes as well.</p><p>[00:40:43] Insights on LazyMergekit</p><p>[00:40:43] <strong>Alex Volkov:</strong> Yeah, it uses LazyMergekit, which is a Colab that Maxime also, we're going to chat about and figure out what this means, what this merging thing means. But definitely, I think that this model triggered one of the Nathans in AI, who said, Hey, I wanted to ignore this merge business for a while, but I guess I can't anymore because merges are not to be ignored at this point.</p><p>[00:41:04] <strong>Alex Volkov:</strong> And this is a merge of UNA TheBeagle and the distilabeled Marcoro slerp, which is also a merge. So if you guys hear me and you're, like, confused, like what do all these things mean? Hopefully we'll be able to clarify this one. Maxime. Maxime also had a tweet where there's now a Colab where you can take a model like this and basically map out the genealogy of these models.</p><p>[00:41:25] <strong>Alex Volkov:</strong> What is based on what? And it's quite cool to see. And what else should I say about this model? I think that's pretty much it. It's very performant. I actually haven't had the chance to use this, but it's right up there and it's a merge model. There is, there's the [00:41:40] checkbox, like we said, in the open LLM leaderboards.</p><p>[00:41:42] <strong>Alex Volkov:</strong> If for some reason you don't want to see the merge models and want to see only trained models, you can uncheck that. But definitely the merge models are competing for the top of the LLM leaderboards right now. Haven't seen a lot of them on the LMSys arena, so it's going to be interesting to see how they treat the merge models.</p><p>[00:42:02] <strong>Alex Volkov:</strong> And I think that's most of open source, and we've given this corner almost 40 minutes, so I think it's time to move on a little bit here, folks. 
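Before moving on, the "slerp" that keeps showing up in these merge names is spherical linear interpolation applied to the model weights. A toy sketch on plain vectors (illustrative only; mergekit's real implementation works tensor-by-tensor over full state dicts):

```python
import math

def slerp(w1, w2, t):
    """Spherical linear interpolation between two weight vectors --
    the 'slerp' in merge names. t=0 gives model A, t=1 gives model B,
    and intermediate t moves along the arc between them."""
    dot = sum(a * b for a, b in zip(w1, w2))
    n1 = math.sqrt(sum(a * a for a in w1))
    n2 = math.sqrt(sum(b * b for b in w2))
    omega = math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))
    if omega < 1e-8:                      # near-parallel: fall back to lerp
        return [(1 - t) * a + t * b for a, b in zip(w1, w2)]
    s1 = math.sin((1 - t) * omega) / math.sin(omega)
    s2 = math.sin(t * omega) / math.sin(omega)
    return [s1 * a + s2 * b for a, b in zip(w1, w2)]

a, b = [1.0, 0.0], [0.0, 1.0]
print(slerp(a, b, 0.0))   # [1.0, 0.0] -- pure model A
print(slerp(a, b, 1.0))   # [0.0, 1.0] -- pure model B
mid = slerp(a, b, 0.5)    # halfway point stays on the unit circle
```

The design point versus plain averaging: slerp preserves the magnitude of the interpolated weights (the midpoint here still has norm 1, where a straight average would shrink it), which is one reason it's a popular merge method.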
So I'll, yeah, I don't have breaking news here, so I'll just do this, a small transition so I can take a breath, haha.</p><p>[00:42:17] <strong>Sounds:</strong> Namaskaram to all of</p><p>[00:42:22] DeepMind's AlphaGeometry</p><p>[00:42:22] <strong>Alex Volkov:</strong> Moving to big company LLMs and APIs, and I think the biggest player in this whole area is DeepMind. DeepMind released a Nature article, which they always do, they always publish in Nature, this time the link to the Nature article didn't really work, but hopefully they fixed it by now, and they released AlphaGeometry. So they've released like a bunch of stuff, AlphaFold, if you remember, AlphaGo, AlphaZero, they had a model that self trains to play anything, not only chess, not only Go, and now they've released AlphaGeometry, that solves geometry almost at a gold medal level at the Olympiad. So they have this, how should I say, this nice chart that says the previous state of the art on this Olympiad gold medalist standard got to ten problems solved; there's like time limits. I'm not sure what the time limits actually are. I don't have it in my notes. But you have to solve these very difficult geometry problems. Folks compete for the gold medals in this Olympiad. And AlphaGeometry now comes very close to the gold medalist standard.</p><p>[00:43:29] <strong>Alex Volkov:</strong> So the gold medalist answers 25.9 problems solved, and AlphaGeometry now answers 25, and they claim that the previous state of the art answered 10, just 10. So they more than doubled it and they're getting close to the Olympiad gold. I think I saw like a tweet from Nat Friedman or somebody. 
That says they would offer a $1,000,000 prize for somebody who solves the Geometry Olympiad at the gold medal level, and now we're getting there.</p><p>[00:43:53] <strong>Alex Volkov:</strong> They use the neuro symbolic approach, where they combine a language model with a symbolic deduction engine to leverage the strength of both. Which some folks compare to thinking fast and slow, where you have system 1, system 2 thinking, or at least it outlines system 1, system 2 thinking.</p><p>[00:44:09] <strong>Alex Volkov:</strong> In this case, this does actually help. They have the neuro symbolic approach; I don't think I've seen this before. And I think the most interesting part: it was trained on over a hundred million synthetic geometry examples generated from one billion random diagrams.</p><p>[00:44:27] <strong>Alex Volkov:</strong> Completely, solely synthetic geometry examples. This whole dataset for training of this model that beats humans at geometry, which was previously very difficult, is fully synthetic. And I think that's super cool. We only began this year, but definitely this is going to be the year where fully synthetic datasets are going to rule.</p><p>[00:44:49] <strong>Alex Volkov:</strong> And yeah, opinions, folks here on stage. Have you read about this? What's interesting to you? I would love to hear folks kind of chime in on this as well, because I think it's like super cool, and kudos to them for releasing this. Also, I saw somebody said, I think Bindu said, that they released this open source, but I haven't seen anything.</p><p>[00:45:06] <strong>Alex Volkov:</strong> Definitely Luigi, go, and then Nisten.</p><p>[00:45:09] <strong>Luigi Daniele:</strong> Yeah, it's funny that you brought up Nat Friedman having that bet up. Because I remember that too, and now I'm thinking, I wonder if he'd be willing to give up like the million dollars or whatever the money is to DeepMind. 
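The neuro-symbolic loop described above can be sketched as a toy: a symbolic engine deduces everything it can from the current facts, and when it stalls, a stand-in for the language model proposes an auxiliary construction (in AlphaGeometry, an auxiliary point) that unlocks further deduction. Everything below, the facts, rules, and canned "proposal", is illustrative, not DeepMind's actual system:

```python
def symbolic_closure(facts, rules):
    """Symbolic deduction: apply rules until no new facts appear
    (the fast, exhaustive 'system 2' engine in the loop)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def propose_construction():
    """Stand-in for the language model: when deduction stalls,
    'guess' an auxiliary construction. Here it's a canned suggestion;
    the real model proposes it from learned intuition."""
    return "aux_point"

# Rules: frozenset of premises -> conclusion (toy geometry-flavored facts).
rules = [
    (frozenset({"AB=CD", "aux_point"}), "triangles_congruent"),
    (frozenset({"triangles_congruent"}), "goal_proved"),
]
facts = symbolic_closure({"AB=CD"}, rules)
if "goal_proved" not in facts:                  # deduction stalled...
    facts = symbolic_closure(facts | {propose_construction()}, rules)
print("goal_proved" in facts)  # True
```

The division of labor mirrors the "thinking fast and slow" comparison in the episode: the symbolic engine is exhaustive but can't invent new objects, and the model's only job is the creative step of suggesting what to add.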
Ha</p><p>[00:45:20] <strong>Luigi Daniele:</strong> It was done by Google DeepMind, so that'd be funny.</p><p>[00:45:25] <strong>Nisten Tahiraj:</strong> How has Google not discovered AGI yet and fallen so behind?</p><p>[00:45:30] <strong>Nisten Tahiraj:</strong> This almost feels like an internal illness or something. Something's going on. Because yeah.</p><p>[00:45:40] <strong>Alignment Lab:</strong> I don't think that Google needs to compete is the thing. I just don't think they're incentivized to release anything into the space because they don't have to. There's really not anything here except money to lose for them.</p><p>[00:45:51] <strong>Alignment Lab:</strong> They already have all the data and stuff. Yeah, and back to the geometry problems, I can't wait to test this, if they release it, as to how it does when given really random, very long numbers. If it still solves the problem, then that, that will be extremely impressive. And yeah, I've done those Math Olympiads with geometry questions, and they're not easy at all.</p><p>[00:46:18] <strong>Alignment Lab:</strong> You have to picture stuff in 3D, 4D and whatever in your head. They're very tricky problems. So yeah, this is pretty huge. That's all. Yeah.</p><p>[00:46:26] <strong>Alex Volkov:</strong> Quite, quite huge and kudos on them. Umesh, I think you actually found the source, right? I just</p><p>[00:46:32] <strong>Umesh Rajiani:</strong> Yeah, so there is a GitHub repo on Google DeepMind. So if you go to Google DeepMind on GitHub and then AlphaGeometry, you can find the code repo for that. So Nisten, if you want to test it out, it's there for you. So I'm taking your</p><p>[00:46:47] <strong>Alex Volkov:</strong> harp on this just like for a little bit. Did Google release code for us finally? Did Google, like, open source something? Welcome back, Google.</p><p>[00:46:54] <strong>Umesh Rajiani:</strong> Yeah, so this is like a first release kind of thing, coming out of Google. 
So it's going to be, yeah, it is quite interesting.</p><p>[00:47:01] <strong>Alex Volkov:</strong> Definitely moves us towards like more generalist</p><p>[00:47:04] <strong>Bluetooth:</strong> I'll have it up in a sec.</p><p>[00:47:05] <strong>Alex Volkov:</strong> Yeah, listen, please put this up and we'll add this to the show notes as well. Definitely the question, how have they not solved AGI yet? Solving math at the Olympiad level seems like moving us forward, definitely. This neuro symbolic approach where they combine language models with a symbolic deduction engine, which I have no idea what symbolic deduction means in this case.</p><p>[00:47:24] <strong>Alex Volkov:</strong> But leveraging the strength of both, this seems like going towards the right path. We've seen, I think, similar things with vision as well, where you combine kind of vision heads into one model so they can understand. I don't think this model was multi modal at all. Doesn't look like it, but maybe I'm wrong here.</p><p>[00:47:42] <strong>Alex Volkov:</strong> And I think, yeah, the solutions for this thing are verifiable by machines. I saw this one tweet that will go down in history. Somebody said, computers have always been good for calculations, so I don't understand the big deal here. And I think it's really funny to, like, keep this tweet behind the scenes.</p><p>[00:48:04] <strong>Alex Volkov:</strong> Alright, so shout out to DeepMind for this fairly incredible release. Hopefully some of the techniques they used will be then used by folks in other areas as well, to get us AIs that are significantly better at geometry and different things. Oh yeah, Umesh, just before, before we continue, you want to talk about this NeuroSymbolic thing? Cause we've talked about this. 
I think Daniel Jeffries talked about this last time we've talked about Rabbit.</p><p>[00:48:27] <strong>Alex Volkov:</strong> If you guys remember, this was at the end of the last space and we've talked about Rabbit LAM, Large Action Model. And Umesh, you just mentioned something, that they also use NeuroSymbolic to an extent, right?</p><p>[00:48:39] <strong>Umesh Rajiani:</strong> Yeah, so the LAM, Large Action Model, is basically based on Neuro Symbolic Programming; specifically, when they are talking about training the model from the actions that you're going to perform, they are encoding Neuro Symbolic Programming to train the model or capture the actions, basically.</p><p>[00:48:55] <strong>Umesh Rajiani:</strong> So that's what they're trying to do in theory, they are saying; we have to see what comes out in practice.</p><p>[00:48:59] <strong>Alex Volkov:</strong> Yeah, and based at least on their examples, it looks very compelling, and potentially like being able to solve a bunch of stuff, or like to remember based on your actions. So neuro symbolic is not a new approach. I apologize. I will edit this. Definitely Rabbit said this, you're right, and hopefully we're going to get to see this LAM thing.</p><p>[00:49:19] <strong>Alex Volkov:</strong> So back to OpenAI, as elections are happening right now and everybody was fearing, like, Hey, what's going to happen with deepfakes, et cetera. OpenAI released their guidelines toward elections; as they prepare for elections, obviously, they're aware that they're happening. And I think the few interesting things there are that they're taking steps to prevent their tools like DALL-E and ChatGPT from being abused.</p><p>[00:49:38] <strong>Alex Volkov:</strong> I don't know. We have open source, so I don't know if folks will go to GPT-4 to generate, let's say, propaganda. But DALL-E, for example, starts to integrate some cryptography into their images, which is very interesting. 
Cryptography solutions which, again, in case you download the actual file and then send it, could be a thing.</p><p>[00:49:58] <strong>Alex Volkov:</strong> But I don't know if [00:50:00] somebody takes a screenshot of a DALL-E generation, if that will apply at all. There are definitely, like, usage policies for stuff like ChatGPT, enforcing limits on political campaigning and impersonating candidates and discouraging voting. And then they want to run ahead of what happened with Facebook and Cambridge Analytica, and like all these things they want to get ahead of, which makes sense.</p><p>[00:50:18] <strong>Alex Volkov:</strong> As for the technology they use to detect whether images were generated by DALL-E, I haven't seen any release from them that says, Hey, we'll build a tool for you to actually identify if those are generated images or not. It's going to be interesting, because like with LLM writing, all of these tools that you use to, like, dump AI text in there, they all can be obscured with another LLM.</p><p>[00:50:38] <strong>Alex Volkov:</strong> I don't know if it's a futile attempt, but definitely a worthwhile one. And at least in the basic UI, I think blocking some attempts at destabilizing democracy, I think it's a good idea. And I think that's mostly it. 
I think there's one other mention: somehow they silently removed the terms and conditions clause that says their outputs are not to be used for war or weapons development.</p><p>[00:51:04] <strong>Alex Volkov:</strong> And I think they removed that, and I think they also, like, signed something with the Department of Defense, but I think that's all for OpenAI.</p><p>[00:51:11] Microsoft announces Copilot Pro</p><p>[00:51:11] <strong>Alex Volkov:</strong> And then I wanted to mention about Microsoft, and Umesh, definitely feel free to chime in here as well, because it underlines the benefit of open source. But quickly, Microsoft announced Copilot, we've talked about Copilot, the kind of previously Bing Chat, Copilot everywhere.</p><p>[00:51:25] <strong>Alex Volkov:</strong> So they've announced different paid plans for Copilot Pro, 20 bucks a month premium, and it does enhanced image creation, which we don't even get in, in, in DALL-E, like, by default, and it's now generally available for small businesses with no user minimum. So if you guys remember, we've talked about Copilot before when Microsoft announced it for large enterprises; it integrates into Microsoft 365 everywhere.</p><p>[00:51:49] <strong>Alex Volkov:</strong> And now the Copilots are also open for smaller businesses. And soon there's going to be, like, this Copilot Studio to build custom GPTs. Very cool for small businesses. We'll see how much folks will actually use this. And there's also some Microsoft saga, where they've changed some stuff in their pipeline.</p><p>[00:52:04] Corporate Drama - Microsoft Azure changing moderation flows and breaking products</p><p>[00:52:04] <strong>Alex Volkov:</strong> So Umesh, you mentioned this in the beginning. We'd love to hear from you what's been going on, as you guys are big Azure users through Microsoft.</p><p>[00:52:11] <strong>Umesh Rajiani:</strong> Ooh, it happened</p><p>[00:52:15] <strong>Umesh Rajiani:</strong> the day before yesterday. 
We got a call from one of our clients, a very big financial institution. We have a deterministic pipeline that was constructed using Azure Studio, in fact, and we worked together with a core Microsoft team to make sure it was</p><p>[00:52:36] <strong>Umesh Rajiani:</strong> properly deterministic, because there are legal implications and everything. And then the tool started failing. We had some function calling that would go into the knowledge base of the company, and that function calling was getting triggered using, what you'd call, the deterministic intent from users' prompts.</p><p>[00:52:56] <strong>Umesh Rajiani:</strong> And that entire function calling was failing. We carried out all types of work; it was very frantic, because it was a front-end tool and it started having an impact. And remember, it had been working for six months. It worked without any problems for six months, and suddenly it just stopped working.</p><p>[00:53:14] <strong>Umesh Rajiani:</strong> The reason was that there were two words in the definition of the tool. That definition was informing the pipeline what the tool is all about, and that's how the tool was getting invoked, and those two words were getting flagged by the OpenAI API.</p><p>[00:53:32] <strong>Umesh Rajiani:</strong> Well, basically the Azure OpenAI API, not OpenAI's direct API. We are routing it through Azure, and it's a separate instance of GPT-4 with separate guidelines. They mimic some of the guidelines that are there in OpenAI, but Microsoft has its own guidelines, and they changed the guidelines without actually informing the clients. That's what triggered it. So we literally had legal people in, and literally had a fight, an open fight, with Microsoft. 
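To make the failure mode concrete, here is a hypothetical sketch (the tool name and description are invented for illustration, not the client's actual pipeline) of an OpenAI-style function definition. The description field is ordinary free text that a moderation layer sitting in front of the model also scans, so a silent change in moderation rules can stop the tool from ever being invoked, without any user prompt changing:

```python
# Hypothetical tool definition in the OpenAI function-calling format.
# The "description" is plain text the model (and any content filter in
# front of it) reads; two unlucky words in it can trip a moderation
# filter even though the application code never changed.
tool = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",  # invented example name
        "description": "Look up internal policy documents for the client.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search terms"}
            },
            "required": ["query"],
        },
    },
}

# The model matches user intent against this description to pick the tool:
print(tool["function"]["description"])
```

The point is not this particular schema, but that the tool description travels through the same filtered channel as prompts do.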
If you were in that room, you would have seen. It was really bad. And eventually there were talks about legal cases and stuff like that.</p><p>[00:54:08] <strong>Umesh Rajiani:</strong> Eventually this company is modifying the contract with Microsoft, where Microsoft will be liable to inform the company before they change any kind of guidelines. And what happened after that is the beauty of it: at the beginning of my startup, beginning of the year, we implemented some solutions where the client has a direct contract with Microsoft, and we implemented those solutions on the backing of those contracts.</p><p>[00:54:34] <strong>Umesh Rajiani:</strong> So in the last two days I've gone back to the clients for whom we implemented solutions so that they have a direct contract with Microsoft, because we don't want to be a party involved as far as the SLAs are concerned. This is very dangerous if you're developing solutions for</p><p>[00:54:49] <strong>Umesh Rajiani:</strong> people, and the core service through which you are driving the entire application pipeline is getting changed without any kind of data-contract backing, so to say. This is a great learning for us. I've always been a proponent of open source solutions, and I think this has given us one more booster, because now we can go back to new clients and say: hey guys, if possible, if we can give you the kind of solution you're looking for, let's go with an open source solution rather than a closed source one.</p><p>[00:55:20] <strong>Umesh Rajiani:</strong> So</p><p>[00:55:20] <strong>Alex Volkov:</strong> And this is a huge reason why, right? It's very interesting. In this area, definitely feel free to chime in on this a little bit more. The outputs of LLMs are usually non-deterministic. 
And so this has to be built into your understanding when you build tools on top of these models.</p><p>[00:55:36] <strong>Alex Volkov:</strong> But this is not that. This is not them adding a different model, or something you can switch. They're adding something in between, some policy thing, without announcing it to the customers. And supposedly, if you go to Azure instead of OpenAI, you go for maximum stability, as underlined by the fact that when OpenAI had downtime after Dev Day, Microsoft Azure's GPT-4 endpoints were all fine.</p><p>[00:56:02] <strong>Alex Volkov:</strong> They were all green, right? So supposedly you'd go for the stability and the corporate backing. There are also the various ISO certifications and HIPAA compliance, all these things that Microsoft Azure offers on top of OpenAI. But here we have a case that underlines</p><p>[00:56:17] <strong>Alex Volkov:</strong> how important open models that you host yourself are, even if you host them on Azure as well, because then nobody can change the moderation endpoints on you and suddenly decide that a few words in your prompt are not to be used anymore.</p><p>[00:56:32] <strong>Umesh Rajiani:</strong> Yeah, but Alex, this had nothing to do with the prompt, actually. It was the definition of the function. And I would draw an analogy to what you call data contracts. I don't know how many people are aware of data contracts, but when you have</p><p>[00:56:47] <strong>Umesh Rajiani:</strong> ownership of data within a very large organization, let's say 20,000 or 30,000 people and up, you have data contracts, where the data originates from a particular source and some other division is using that data. 
So you have a contract between those two, and that data contract details the data definitions, and the signatory of the contract is responsible for ensuring that if they change any kind of data structure or data definition,</p><p>[00:57:14] <strong>Umesh Rajiani:</strong> then the receiver of the data, the client of the data contract, is informed. That is part of your data contract, and that's how these large organizations function. What we need is that kind of framework, where you have a data contract with the service provider.</p><p>[00:57:30] <strong>Umesh Rajiani:</strong> So even if you're going with an open source solution, if your open source solution is hosted by someone, then you need to have that kind of contract in place. It's not just that open source is the solution for everything; it's about who is providing the inference. If you are controlling the inference, then you are secure, because you are not going to make changes without understanding the repercussions of those changes.</p><p>[00:57:52] <strong>Umesh Rajiani:</strong> But if you are, let's say, hosting an open source model on Amazon Bedrock, for example, and they have a system prompt that sits in front of the prompt that goes to the model, then you have to make sure that Amazon adheres to its responsibility in terms of giving you the required inference.</p><p>[00:58:12] <strong>Alex Volkov:</strong> Absolutely. Thanks for walking us through it. First of all, it sucks that it happened, and hopefully now Microsoft, like you said, has [00:58:20] changed their approach here. Nisten, go ahead if you want to follow up.</p><p>[00:58:26] <strong>Nisten Tahiraj:</strong> Yeah. So for us, this has been amazing. I already have clients lining up to pay for the Baklava API, so I'll just say that first, before it's even out. 
However, it is extremely unfortunate for those who built, let's say, apps for a hospital or for a therapist, because those kinds of applications just had a moderation engine added, apparently for safety, and now whoever was relying on these applications, they just stop working.</p><p>[00:59:02] <strong>Nisten Tahiraj:</strong> Out of nowhere. And this is an extremely immature thing to do. This is something you expect from a random startup run by kids, not from freaking Microsoft, and it is pretty worrisome that this safety hysteria has gotten to the point where you're literally breaking medical applications in production without notifying people.</p><p>[00:59:27] <strong>Nisten Tahiraj:</strong> You lost people's trust now. You're not going to gain that back for a couple of years. And I hope they realize it and don't do this again. Don't break production and make changes for people in prod who are relying on this for things like SOC 2, or, as in Umesh's case, who have signed service-level agreements.</p><p>[00:59:49] <strong>Nisten Tahiraj:</strong> Because those people lose all their money if they don't provide the service. And it's really bad. That's all I have to say. It's pretty bad.</p><p>[00:59:58] <strong>Alex Volkov:</strong> Yep. Very bad look from Microsoft. I remember when OpenAI talked about sunsetting some models and there was a developer outcry that said: hey, we use those, we haven't had time to change how we work with prompts, et cetera, for the newer models.</p><p>[01:00:15] <strong>Alex Volkov:</strong> And so OpenAI actually went back and said: hey, we heard you, and deprecation is going to be pre-announced, in advance. 
It's going to be way longer. Umesh, yeah, let's go ahead.</p><p>[01:00:27] <strong>Umesh Rajiani:</strong> Yeah, very quickly: I think you've raised a very valid point, Alex. All the models that they put out of service, they should make open source. I think that's the best solution.</p><p>[01:00:39] <strong>Alex Volkov:</strong> Ah, I wish that were the case. We're still waiting for, potentially, an open source GPT 2.5. We haven't seen any open sourcing from OpenAI for a while, besides some GitHub code. I agree with you: there should be a way for folks to keep doing the same exact thing they're doing.</p><p>[01:00:52] <strong>Alex Volkov:</strong> In my own example, I use Whisper. No matter what their API says, or what they deem inappropriate to translate, the Whisper that I use is self-hosted, and it will stay the same version until I decide, and test everything. All right, folks, we're moving forward, I think, just quickly.</p><p>[01:01:10] <strong>Alex Volkov:</strong> There's not a lot of stuff in the vision area. I'll mention briefly that we've been here for more than an hour, so I'll recap the space a little bit. If you're joining, let me just play the music, then I'll recap, and then we'll get into the interview. You're listening to ThursdAI. Those of you who just joined us, welcome. If you haven't been here before, this is a weekly space all about AI and open source; as our friend of the pod, Jan, just tweeted out, everybody in the LLM and open source space is in here, which is very great to see.</p><p>[01:01:45] <strong>Alex Volkov:</strong> We've covered open source stuff, we've covered corporate drama, and then we're moving on to an interview. 
Thank you.</p><p>[01:01:53] This week's Buzz from Weights & Biases</p><p>[01:01:53] <strong>Alex Volkov:</strong> And then we're going to talk about AI art and diffusion, if we have time at the end. There's a brief mention I want to make, but first let me just reintroduce myself.</p><p>[01:02:01] <strong>Alex Volkov:</strong> My name is Alex Volkov. I'm the AI Evangelist with Weights & Biases, and we have a small segment here for Weights & Biases that I want to bring. I just came back a few days ago from a San Francisco hackathon that we helped sponsor with Together AI and LangChain. It was a pretty cool hackathon.</p><p>[01:02:20] <strong>Alex Volkov:</strong> It was very brief, just a few hours at AGI House, but the theme was RAG versus fine-tuning, and I promised I'd bring back some learnings. So there were a bunch of projects that did different things. They used Together's endpoint for fine-tuning.</p><p>[01:02:35] <strong>Alex Volkov:</strong> If you can fine-tune on your own models and your own GPUs, that's one thing, but for many AI engineers that's very difficult to do. So there's a bunch of startups, Together being one, that offer very simple fine-tuning. I'll add a link in the show notes to the presentation I gave there, which talks about how easy it is to fine-tune using their endpoints.</p><p>[01:02:56] <strong>Alex Volkov:</strong> The folks that won the hackathon, and some folks won different prizes, basically used both RAG and fine-tuning. And it looks like there was also a paper released afterwards from some folks trying to identify what's better: is it doing RAG on top of fine-tuned models, or just doing basic RAG?</p><p>[01:03:13] <strong>Alex Volkov:</strong> I don't think we have a clear answer yet. This hackathon definitely wasn't the end-all of answers. 
However, it does look like doing RAG on top of a fine-tuned model improves just a little bit over basic RAG. And it looks like RAG wins over a regular fine-tune for information-retrieval tasks as well.</p><p>[01:03:30] <strong>Alex Volkov:</strong> So definitely do not skip RAG. And from the open source perspective, which we love here on ThursdAI, getting more RAG-related models is definitely going to happen. I think we saw some from Jon Durbin, and I think I saw Teknium mention something about function-calling</p><p>[01:03:47] <strong>Alex Volkov:</strong> datasets coming from Nous as well. So that area is still to be explored. But it looks like the combination of fine-tuning and RAG wins just a little bit over basic RAG. I think that's the outcome of that hackathon. Next week in this corner of Weights & Biases there's going to be an interview with Jason.</p><p>[01:04:06] <strong>Alex Volkov:</strong> Stay tuned for that.</p><p>[01:04:07] BREAKING NEWS - Meta announces Llama 3 is training and will be open source</p><p>[01:04:07] <strong>Alex Volkov:</strong> Many folks have been DMing me, because right now we have breaking news. Breaking news actually happening right now.</p><p>[01:04:17] <strong>Sounds:</strong> AI breaking news coming at you, only on ThursdAI.</p><p>[01:04:27] <strong>Alex Volkov:</strong> You know I love to use this sound, everyone. We have some updates from BigZuck. I don't know if you guys have seen this, because it's over on Threads, and I don't know how many of us are on Threads; I definitely know I barely go there. We have some updates from BigZuck, specifically around training Llama 3.</p><p>[01:04:43] <strong>Alex Volkov:</strong> There are key updates about the long-term vision. I think the summary is: they have an insane amount of GPUs this year. 
He literally says that by the end of this year, they'll have around 350,000 NVIDIA H100s. I'm going to repeat this slowly for the people in the back: 350,000 NVIDIA H100s, and overall 600,000 H100s or equivalents of compute if you include other GPUs.</p><p>[01:05:13] <strong>Alex Volkov:</strong> You remember those hats that people wear, the GPU-poor and GPU-rich hats? I think Zuck can stack the GPU-rich hats one on top of the other and it still won't be enough, because 600,000 H100s of compute is just ridiculous. And he talks about two major parts of their vision: AI and the Metaverse are connected.</p><p>[01:05:32] <strong>Alex Volkov:</strong> I love how it was Metaverse, and then suddenly AI started being a thing, and now, oh, they're connected. I definitely expect AI to exist in some form of virtual world. But he definitely talks about Llama 3, and Llama 3 is coming; they're currently training it, per BigZuck.</p><p>[01:05:48] <strong>Alex Volkov:</strong> We knew that was coming, or at least expected it, but I think now it's more of a confirmation. And I'm very excited about Llama 3. I'll just mention that it hasn't been a year since Llama 1 yet. We're in January; Llama was released around February 12th or 13th.</p><p>[01:06:06] <strong>Alex Volkov:</strong> So it hasn't been a year yet, and here we are training the third Llama model. We've had just an incredible amount of innovation on top of it. So we're definitely expecting it, and we're obviously going to cover this as much as possible. That's most of it, I think.</p><p>[01:06:23] <strong>Alex Volkov:</strong> Oh, and the last thing Zuck added, and I think it's relevant to ThursdAI as well, where we have to start talking about hardware, is that he says: I think lots of people will talk to AIs frequently through the day using smart glasses, like what we're building with Ray-Ban Meta.</p><p>[01:06:38] <strong>Alex Volkov:</strong> And I think we've [01:06:40] talked about their smart glasses; they're multimodal glasses. They have a camera built in; you can press a button and actually pass the image into the LLM. They're making improvements in speed as well. I'll add one more thing: we've talked about how Meta is adding a bunch of AI into every chat and nobody necessarily used them.</p><p>[01:06:58] <strong>Alex Volkov:</strong> Recently a friend of mine, maybe because I'm an AI evangelist so he felt free to do this in our chats, just added an AI bot to our chat. Literally my DM with a friend who has nothing to do with AI; it's not part of his world, he does something else. Recently he said: hey, let me add this thing.</p><p>[01:07:14] <strong>Alex Volkov:</strong> So Meta is definitely letting folks experiment with AI more than some other places, and he just added the AI to our chat. It was super cool. So here's an update from BigZuck: Llama 3 is training, and they have a lot of GPUs. They're super GPU-rich, and hopefully we'll get the benefit.</p><p>[01:07:30] <strong>Alex Volkov:</strong> Go ahead, Nisten. Yeah,</p><p>[01:07:36] <strong>Nisten Tahiraj:</strong> H100s? Yeah, they're going to need that if they're going to have visual stuff from people's glasses. But it's an insane amount, that's all. I just ran some quick calculations, running some numbers based off the alleged GPT-4 leaks of the amount of GPU hours it might take, let's say, if they used all those Meta GPUs.</p><p>[01:08:08] <strong>Nisten Tahiraj:</strong> That is, to do a GPT-4-level model. 
I'm getting numbers that it would take less than a week, pretty much, to train. Yeah, this is an insane amount of GPUs, for people that don't have good references for this.</p><p>[01:08:18] <strong>Alex Volkov:</strong> I think it's insane enough to maybe open a new category on top of GPU-rich. It's just quite incredible, and hopefully they're committed to open-sourcing Llama 3. Umesh, you had a comment as well?</p><p>[01:08:32] <strong>Umesh Rajiani:</strong> Yeah, what if Llama 3 is going to be multimodal? Then they will need those GPUs.</p><p>[01:08:37] <strong>Alex Volkov:</strong> I'm really hoping it will be. They're training the models, and multimodality is something they've talked about. It's time to move towards the LMM world and multimodality, and they'll need all those GPUs to crank out the vision part of this. Hopefully multimodal in other areas too; as a reminder, Meta has released a bunch of attempts at multimodality in other areas, not only image:</p><p>[01:08:59] <strong>Alex Volkov:</strong> IMU motion units, and they've talked about fMRI signals, like incredible stuff. But definitely other modalities, like audio. Live video would be super cool; I think this year is the year of live video. So hopefully not only vision, and if it's vision, then hopefully it's live video.</p><p>[01:09:18] <strong>Alex Volkov:</strong> Alright folks, we're coming up on two hours,</p><p>[01:09:20] <strong>Alex Volkov:</strong> and with that, I think this is the summary of today's ThursdAI. Thank you everyone for joining. If you haven't subscribed yet, definitely feel free to subscribe at thursdai.news. I appreciate everyone's time and attention here. 
Thank you so much to the co-hosts and guests for today's pod, and a shoutout to everyone.</p><p>[01:09:36] <strong>Alex Volkov:</strong> And I have to end this on the very happy note of the alchemy thing, because one thing that came out of the conversation with Maxime, who does the merges, and Nisten and everyone, is that a lot of this is alchemy: a lot of it is trying to see how things work when you combine models without continuing to train them, and they still perform better.</p><p>[01:09:55] <strong>Alex Volkov:</strong> So I have to end on this very happy tune, which represents the alchemy that we're all doing. And we love it. Thank you everyone for joining this ThursdAI. I will see you next week. Cheers. And we'll add this banger to the show notes as well. Bye everyone.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-18-nous-mixtral-deepmind</link><guid isPermaLink="false">substack:post:140821949</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 19 Jan 2024 00:27:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/140821949/fe4399622a8142cc9997a27780bcb574.mp3" length="50885627" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4240</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/140821949/143a428e6a52334d76a8f3082ec7269b.jpg"/></item><item><title><![CDATA[🔥 ThursdAI Sunday special - Deep dives into Crew AI with Joao then a tasty Bagel discussion with Jon Durbin]]></title><description><![CDATA[<p>ThursdAI - Sunday special deep dive: interviews with João and Jon, AI agent crews and Bagel merges. 
</p><p>Happy Sunday, dear reader, </p><p>As you know by now, the ThursdAI pod is not a standard interview-based podcast; we don't focus on 1:1 guest/host conversations, but from time to time we do! This week I was very lucky to have one invited guest and one surprise guest, and I'm very happy to bring you both conversations today. </p><p>Get your Crew together - interview with João Moura, creator of CrewAI</p><p>We'll first hear from <strong>João Moura</strong>, the creator of CrewAI, the latest agent framework. João is a director of AI engineering at Clearbit (recently acquired by HubSpot) and created CrewAI for himself, to automate many of the things he didn't want to keep doing, for example posting more on LinkedIn. </p><p>Crew has been getting a lot of engagement lately, and we get into it with João: it's been trending #1 on GitHub, and received #2 product of the day when Chris Messina hunted it (to João's complete surprise) on Product Hunt. </p><p>CrewAI is built on top of LangChain and is an agent framework focused on orchestrating role-playing, autonomous agents. </p><p>In our chat with João we go into the inspiration, the technical challenges and the success of CrewAI so far, how maintenance for Crew is now partly a family effort, and what's next for Crew.</p><p>Merges and Bagels - chat with Jon Durbin about Bagel, DPO and merging</p><p>The second part of today's pod was a conversation with <strong>Jon Durbin</strong>, a self-described AI tinkerer and software engineer. Jon is a Sr. applied AI researcher at Convai, and is well known in our AI circles as a master finetuner and dataset curator. </p><p>This interview was not scheduled, but I'm very happy it happened! 
If you've been following along with the AI/finetuning space, Jon's Airoboros dataset and models have often been mentioned and cited, and Jon's latest work on the Bagel models took the lead on the HuggingFace Open LLM Leaderboard.</p><p>So when I mentioned on X (as I often do) that I was going to cover this on ThursdAI, Jon came up to the space and we had a great conversation, in which he shared a LOT of deep insights into finetuning, DPO (Direct Preference Optimization) and merging. </p><p>The series of Bagel datasets and models was inspired by the movie Everything Everywhere All at Once (which is a great movie, watch it if you haven't!), alluding to Jon trying to throw as many datasets together as he could, and not only datasets! </p><p>There has been a lot of interest in merging models recently; specifically, many folks are using <a target="_blank" href="https://github.com/cg123/mergekit">MergeKit</a> to merge models with other models (and often a model with itself) to create larger/better models without additional training or GPU requirements. This is solely an engineering thing; some call it frankensteining, some frankenmerging.</p><p>If you want to learn about merging, <a target="_blank" href="https://twitter.com/maximelabonne/status/1744867841436700850">Maxime Labonne</a> (the author of Phixtral) has co-authored a great deep dive on the <a target="_blank" href="https://huggingface.co/blog/mlabonne/merge-models">Huggingface</a> blog; it's a great resource to quickly get up to speed.</p><p>So given the merging excitement, Jon set out to create a model that can be an incredible merge base; many models use different prompt techniques, and Jon tried to cover as many as possible. 
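For a feel of what a merge does mechanically, here is a toy sketch of my own (not MergeKit's actual code; real tools also implement SLERP, TIES, and passthrough methods): the simplest merge is just a weighted average of corresponding weights, with no gradients or training involved:

```python
def linear_merge(state_dicts, weights=None):
    """Toy linear merge: weighted average of matching parameters.

    This is only the core idea behind weight merging: pure arithmetic
    on the weights, no training. Tools like MergeKit add smarter
    methods (SLERP, TIES, passthrough/franken-merging) on top.
    """
    if weights is None:
        # Default: average all models equally.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Two tiny pretend "models", each a single scalar parameter:
model_a = {"layer.weight": 1.0}
model_b = {"layer.weight": 3.0}
print(linear_merge([model_a, model_b]))  # equal-weight average
```

In a real merge the values would be full tensors and the keys a model's entire state dict, but the operation per parameter is this simple, which is why merging needs no GPUs.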
Jon also released a few versions of the Bagel models, DPO and non-DPO, and we had a brief conversation about why the DPO versions are more factual and better at math, but not great for role-playing (which is, unsurprisingly, what many agents are using these models for) or creative writing. The answer is, as always, the dataset mix! </p><p>I learned a TON from this brief conversation with Jon. If you're interested in the incredible range of techniques in the open source LLM world, DPO and merging are definitely at the forefront of this space right now, and Jon is right at the crossroads of them, so it's definitely worth a listen, and I hope to get Jon back to learn more in future episodes, so stay tuned! </p><p>So I'm in San Francisco, again... </p><p>As I mentioned in the previous newsletter, I was invited to step in for a colleague and fly to SF to help co-host a hackathon with friends from Together Compute and LangChain at AGI House in Hillsborough, CA. The hackathon's theme was Finetune vs. RAG, because, well, we don't know what works better, and for what purpose.</p><p>The keynote speaker was Tri Dao, Chief Scientist @ Together and the creator of Flash Attention, who talked about SSMs, state space models, and Mamba. 
</p><p>Harrison from LangChain gave a talk with a deep dive into 5 techniques for knowledge assistants, starting with basic RAG and going all the way to agents 👏</p><p>I also gave a talk, but I couldn't record a cool gif like that for myself; thanks to Lizzy I got a pic as well 🙂 Here is the link to my slides if you're interested (<a target="_blank" href="https://gamma.app/docs/RAG-vs-Fine-tune-Hackathon-iovrvc077fn9i6l">SLIDES</a>)</p><p>More than 150 hackers got together to try and find this out, and it was quite a blast for me to participate and meet many of the folks hacking, hear what they worked on, what worked, what didn't, and how they used WandB, Together and LangChain to achieve some of the incredible results they hacked together in a very short time. </p><p>The projects showcased a range of creative applications leveraging RAG, finetuning, and large language models. Several projects, like Magic RAG, CareerNavigator-AI, and CompetitionAI, used RAG for document retrieval and knowledge enhancement. Others, like rags2pizza and Naturalist DALL-E, focused more on finetuning models for specific domains. Some projects compared finetuning and RAG, finding that combining both gives superior performance over using either alone, but that result wasn't conclusive. </p><p>My vote as a judge (which I did not expect to be) eventually went to the team that built the <strong>OptiMUS</strong> project: they generated a synthetic dataset, cleaned it up, finetuned a model on it, and showed how they want to optimize AI agents. They used WandB to track their work, and I hope they take this project forward and keep making advancements in AI. Congrats on the win, Ali and Shayan; hope you enjoy the WandB-branded AirPods (even I don't have those) and the Meta Quest, well deserved! </p><p>Thank you for tuning in! See you next week! </p><p>Full Transcription:</p><p>[00:00:00] <strong>Alex Volkov:</strong> Hi. Welcome back to ThursdAI, the Sunday special episode. 
This is Alex Volkov, and I'm recording this in a gorgeous space in San Francisco, where I was invited to judge a hackathon. I'm now hanging out with a few friends from Cerebral Valley, so thank you, Valley folks, for letting me use this place for recording. Today we have a special episode for you. If you hear this on a Sunday, today's not a Thursday: we oftentimes have special guests on the pod, where conversations go deeper.</p><p>[00:00:45] <strong>Alex Volkov:</strong> I usually reserve that slot for a Sunday special release, so that's what you're hearing now. In today's episode we actually have two conversations, although I only planned on one. The first part is the planned one, where you hear from João Moura. He is a director of AI at Clearbit, now acquired by HubSpot, and he's also the creator of CrewAI, an agentic AI framework that runs by orchestrating</p><p>[00:01:14] <strong>Alex Volkov:</strong> </p><p>[00:01:15] <strong>Alex Volkov:</strong> digital AI agents and having them work together.</p><p>[00:01:19] <strong>Alex Volkov:</strong> And I think you'll hear from João why this piqued interest for many folks, specifically because, as we caught up with João,</p><p>[00:01:29] <strong>Alex Volkov:</strong> CrewAI was trending on GitHub and getting number two on Product Hunt at the same time. It's a really cool framework, and I think the underlying power of it is that it can use open source, local models. A lot of previous agent attempts used GPT-4, for example, and CrewAI can use things like Mistral or Mixtral running in LM Studio or Ollama on your Mac, which I think is super cool.</p><p>[00:01:55] <strong>Alex Volkov:</strong> And I think on-device AI plus something like this framework is going to be very, very powerful. It was a great conversation with João. And, surprising to me, the second guest was not planned. You may have heard on the previous ThursdAI that the Bagel series of models from a self-proclaimed AI tinkerer, Jon Durbin, have taken over the leaderboards on Hugging Face, including a bunch of merges. We haven't done a deep dive into merges, MergeKit and frankenstein models.</p><p>[00:02:32] <strong>Alex Volkov:</strong> But if you've been on ThursdAI for a while, you've probably heard about them. Merging is a technique to take a model, or different models, and create bigger or different models by a dissection-and-recombination process over the layers of those models, just based on weights, without any training or continued fine-tuning, which is incredibly interesting.</p><p>[00:02:58] <strong>Alex Volkov:</strong> Jon goes into this a little bit; he created Bagel based on an inspiration that I'll let you hear about at the end. It's a very fascinating conversation. I took a lot from it, and unfortunately we didn't have time for a long deep dive, but I learned a lot from Jon, and hopefully he'll come back on the podcast and we'll be able to dive even deeper and talk with Jon about how to create datasets, why DPO is better than PPO, and all of these great things. So we had two great guests, I had a blast having them on the pod, and I probably should do more of these deep dives.</p><p>[00:03:37] <strong>Alex Volkov:</strong> So please let me know what you think. Don't forget to subscribe to the newsletter, where I send a summary; in the newsletter you'll find my trip report, quote unquote, for the hackathon. It was co-sponsored with Together AI and LangChain; Harrison was there, and I gave a brief talk as well. And I added a bunch of pictures.</p><p>[00:03:57] <strong>Alex Volkov:</strong> So if you're hearing this in your car, check out the newsletter afterwards on thursdai.news.</p><p>[00:04:02] <strong>Alex Volkov:</strong> And with that, I give you our first guest, João Moura. All right, everyone. Welcome back to ThursdAI, and we have a great guest today. 
João Moura from, I want to say, Clearbit, if I'm not mistaken. João, could you please introduce yourself and what you do, and then we're going to talk about the thing we're here to talk about.</p><p>[00:04:36] <strong>Joao Moura:</strong> A hundred percent. Thank you for having me. First of all, you got my name right, it's hard to pronounce. I go by Joao, to make it easier for everyone. I work at Clearbit, yes, but we just got acquired by HubSpot. I'm not sure anymore if I'm Joao from Clearbit, from HubSpot, or from Crew AI. Everything at once.</p><p>[00:04:54] <strong>Alex Volkov:</strong> Awesome.</p><p>[00:04:58] <strong>Alex Volkov:</strong> I think it's your first time here on stage. Welcome. We met in San Francisco at the Ollama open source event, and I think Teknium was there and a bunch of other folks from Ollama, and I met you and we had a brief conversation, and you mentioned Crew AI to me, and it sounded super, super interesting. And then, this week and the previous week, there was an explosion of interest in Crew AI,</p><p>[00:05:17] <strong>Alex Volkov:</strong> so I would love to hear from you how your last few weeks have been going, since the time that we spent together. A lot of stuff happened to Crew AI. Could you just, without saying what Crew AI is, recap your experience of the past two weeks?</p><p>[00:05:33] <strong>Joao Moura:</strong> A hundred percent, a hundred percent. First of all, that Ollama event the other day was so good. Had so much fun at it.</p><p>[00:05:41] <strong>Alex Volkov:</strong> It was</p><p>[00:05:41] <strong>Joao Moura:</strong> The last couple of weeks have been intense, I gotta tell you. The thing got, like, blown up out of proportion. 
Like, I have a lot of DMs, a lot of messages, a lot of issues, and not that many pull requests, I want to say, but it has been a lot of fun.</p><p>[00:05:59] <strong>Joao Moura:</strong> Crew AI just seems to have a lot of interest from different people. I think this idea of building AI agents is something that captivates most of the tinkerers out there, like how you can automate your life away. And it seems to have been resonating with a bunch of engineers out there.</p><p>[00:06:16] <strong>Joao Moura:</strong> The last couple of weeks have been very intense in terms of writing code late at night, and having to spare a few hours to answer DMs and help with the Discord community. And I actually ended up recruiting my wife to help me with that. So if you see Bianca on Discord or over GitHub issues, that's my wife helping me out, making sure I get it all covered.</p><p>[00:06:41] <strong>Alex Volkov:</strong> Definitely shout out Bianca, thanks for helping. And as well, you're now trending on GitHub, I think number one, and I think somebody submitted this to Product Hunt as well?</p><p>[00:06:50] <strong>Joao Moura:</strong> That was a thing. So I have been working on this, and as an engineer working on an open source project, you don't think about this project as a product necessarily from the get go. But as it starts to get more traction, it got the interest of this one guy. I don't know if it's a big deal or not, but it seems that he hunts a lot of products on Product Hunt.</p><p>[00:07:14] <strong>Joao Moura:</strong> And for some reason he hunted this one. The super fun thing is that I have been part of, and I have seen, other product launches, and I know how much effort goes into preparing those, to be ready for it and have social media content ready to post about it. 
And I had none of that.</p><p>[00:07:36] <strong>Joao Moura:</strong> I woke up in the morning and there was a message from a VC saying, Hey, congratulations on your launch. And I was like, what is this guy talking about? I have no clue. It was very interesting, because I opened Product Hunt's website and I'm searching, how do I cancel this? I didn't want to launch this, at least not right now.</p><p>[00:07:58] <strong>Joao Moura:</strong> And in Product Hunt's documentation, they mention that you have two options: either you send them a super urgent message so that they can pull the brakes on it, or you run with it.</p><p>[00:08:13] <strong>Joao Moura:</strong> And at the end of the day, I was like, I'm just going to run with it. I'm going to see how it goes. And it turns out we ended the day as [00:08:20] number two.</p><p>[00:08:20] <strong>Joao Moura:</strong> And that was, that was something else. Thanks.</p><p>[00:08:25] <strong>Alex Volkov:</strong> Chris, the number one hunter. I think he hunted most of the products on Product Hunt, so shout out Chris. And definitely, I saw this, and what a surprise to wake up to and then end up as product number two. It definitely helped the stats, probably. So with this excitement, let's talk about why it's so exciting.</p><p>[00:08:43] <strong>Alex Volkov:</strong> Could you give us a brief primer on Crew AI? We've talked about agents before; we obviously talked about AutoGPT previously, and GPT Engineer from Anton Osika, and a bunch of other very interesting projects. Could you give us the brief primer on Crew AI, what it is?</p><p>[00:08:57] <strong>Alex Volkov:</strong> And then we're going to talk about why you built it and the orchestration stuff.</p><p>[00:09:02] <strong>Joao Moura:</strong> A hundred percent. Crew AI is a very thin framework. It's a Python framework. 
It's in the process of being converted to TypeScript as well, but it's a Python framework that allows you to build a group of AI agents. You can think about it as if AutoGen and ChatDev had a child.</p><p>[00:09:21] <strong>Joao Moura:</strong> That's the way that I usually describe it. So you're going to have a group of AI agents that are role playing in order to perform a complex series of tasks. And you can do all sorts of automations with it, and you can plug it into all sorts of different systems out there. I think that's the easiest way to describe it right now.</p><p>[00:09:43] <strong>Alex Volkov:</strong> Awesome. And you briefly mentioned AutoGen; could you talk about the inspiration here? What made you start this as Clearbit was getting acquired, or around that time? There's a bunch of other orchestration platforms out there, a bunch of agents. What made you write your own instead of taking something off the shelf in open source?</p><p>[00:10:06] <strong>Joao Moura:</strong> So it turns out this is a fun story. So we're back to my wife again, always propping me up. I love her. She's so great. She was telling me, hey, you have been doing all this amazing work at Clearbit. Because at Clearbit, we have been doing work with LLMs for the past year.</p><p>[00:10:22] <strong>Joao Moura:</strong> And at a scale that I believe not many have. And she was like, you should be sharing more about this. You're leading these efforts and you're building all these complex systems at scale, and this could definitely help and benefit other people. So she was telling me that I should do a better job at posting online on things like LinkedIn and Twitter.</p><p>[00:10:41] <strong>Joao Moura:</strong> And Twitter, I think I'm okay with, but LinkedIn was always hard for me. 
I feel like there is a higher threshold for how good your idea must be before you post it on LinkedIn. So I was considering how I can do better LinkedIn posting. And because I was so excited about AI agents, I was like, can I build a couple of agents that will actually help me out with this, where I can shovel in my draft and rough ideas?</p><p>[00:11:11] <strong>Joao Moura:</strong> And it's going to come up with some guidance and a better post that I can just edit and post. It turns out that I could, and that's how I started Crew AI. I looked into AutoGen, and I was not a huge fan of how, one, they didn't have the option to execute tasks sequentially. They also have a lot of assumptions on how the agents should work together.</p><p>[00:11:34] <strong>Joao Moura:</strong> And I think the way that they work together should vary depending on the tasks that you're trying to accomplish. I was not a huge fan of it. ChatDev, on the other side, I saw a lot of good stuff in it, but it just didn't feel like a production system, right? It has a game-like UI, something that you would experiment with, but not something that you would deploy in production.</p><p>[00:11:56] <strong>Joao Moura:</strong> So that's how I came up with this idea: maybe I should do something myself, so I can build this LinkedIn automation, and if that works, then I can build other sorts of automations. And that's how I started Crew AI. I built five agents, from a social network researcher all the way to a chief content officer, to help me create great ideas so that I can post them on LinkedIn.</p><p>[00:12:23] <strong>Joao Moura:</strong> And it works great. I went from never posting on LinkedIn to posting three to four times every week. And I love what I post, and it seems other people do as well. So from that point on, 
I decided that I want to create more automations, and that's how Crew AI came to be. I just abstracted what I learned from that experience into this framework that I could then use to build other sorts of automations, and things took off from there.</p><p>[00:12:50] <strong>Alex Volkov:</strong> Wow, that's an incredible story. As with a lot of engineering stories, when people create cool things, laziness is somewhere in there: I want to automate something that I don't want to do but definitely need done. I definitely have a bunch of those as well, at least for ThursdAI, the collection stuff and the other stuff that I would love to just happen for me.</p><p>[00:13:10] <strong>Alex Volkov:</strong> So I definitely want to check out Crew AI for that and create a ThursdAI collection crew. Could you mention the technical challenges here? You did mention that it's based on LangChain, if I'm not mistaken, and you mentioned that there are not a lot of pull requests for people to help out with. Could you talk about the technical challenges you ran into?</p><p>[00:13:30] <strong>Joao Moura:</strong> Yes. So basically, when I started to build this out, I realized pretty quickly that agents are just as useful as how many tools you can connect them with. And when I was looking online, I realized that both LlamaIndex and LangChain already had all these amazing tools that you could run with.</p><p>[00:13:52] <strong>Joao Moura:</strong> So I wanted to make sure that people could use those tools too, and build crews that use them. Because of that, I took the decision to build Crew AI around LangChain, so that if anyone wants to hook it up with their GitHub or their Gmail, there are already tools built out for that, and they're pretty easy to plug in and just work.</p><p>[00:14:15] <strong>Joao Moura:</strong> And it seems LlamaIndex tools also work. 
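</p><p>To make the tools point concrete, here is a minimal sketch of the general pattern: a tool as a name, a description, and a callable, advertised inside a role-playing agent's prompt. This is an illustrative assumption about how such frameworks wire tools in, not LangChain's or Crew AI's actual classes, and the gmail_search tool below is hypothetical:</p>

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A tool is just a described callable the LLM can ask for by name."""
    name: str
    description: str
    func: Callable[[str], str]

def build_agent_prompt(role: str, goal: str, tools: list[Tool]) -> str:
    """Assemble a role-playing prompt that advertises the available tools."""
    tool_lines = "\n".join(f"- {t.name}: {t.description}" for t in tools)
    return (
        f"You are {role}. Your goal: {goal}\n"
        f"To use a tool, reply 'USE <tool name>: <input>'. Available tools:\n"
        f"{tool_lines}"
    )

# Hypothetical prebuilt tool, standing in for the GitHub/Gmail tools mentioned above.
gmail = Tool("gmail_search", "search the user's inbox", lambda q: f"3 emails about {q}")
prompt = build_agent_prompt("a research assistant", "summarize this week's AI news", [gmail])
```

<p>Any prebuilt tool that fits this shape, whichever library it comes from, can then be dropped into an agent without new glue code.</p><p>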
I'm putting together some experiments around that to share with more people. But basically, those were some of the initial decisions that led to this design. I think some of the technical challenges that came from it: it's just realizing that as people start creating all these different crews for these different use cases, there are so many edge cases, right?</p><p>[00:14:38] <strong>Joao Moura:</strong> You know that you can try to steer LLMs your way, but especially if you're using open source LLMs and smaller LLMs, they have a harder time just sticking with a given format.</p><p>[00:14:54] <strong>Joao Moura:</strong> I started to add a bunch of guardrails in Crew AI that actually make it way better than what you would get with any other agent framework out there. For example, one of them is, if you're running out of iterations, like your agent is stuck in a cycle or taking too long to come up with an answer, it's going to force it to come up with an answer if it goes over a certain number of iterations that you can define.</p><p>[00:15:21] <strong>Joao Moura:</strong> Another one is, if it tries to use the same tool twice in a row, it's going to prevent it from doing that and guide it towards moving on. Another one is that it has caching. So every tool any agent uses is going to be cached, so that if any other agent in the group decides to use the same tool, it doesn't need to actually execute it.</p><p>[00:15:41] <strong>Joao Moura:</strong> So I think a lot of the challenges come from how I can add all these guardrails to make sure that, independently of what use case the person is building a group of agents for, it's still going to run smoothly. 
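</p><p>The three guardrails described above, a forced answer once the iteration budget runs out, no back-to-back use of the same tool, and a shared cache of tool results, can be sketched in a toy agent loop. This is an illustration of the ideas with a scripted stand-in for the LLM, not Crew AI's real implementation:</p>

```python
def run_agent(decide, tools, max_iterations=5):
    """Toy agent loop with three guardrails: force an answer when the
    iteration budget runs out, block the same tool twice in a row, and
    cache tool results so repeat calls are free."""
    cache, history, last_tool = {}, [], None
    for _ in range(max_iterations):
        action = decide(history)            # stand-in for an LLM decision
        if action["type"] == "answer":
            return action["text"]
        tool, arg = action["tool"], action["arg"]
        if tool == last_tool:               # guardrail: same tool twice in a row
            history.append(("hint", "you just used that tool; move on"))
            continue
        if (tool, arg) not in cache:        # guardrail: cache tool results
            cache[(tool, arg)] = tools[tool](arg)
        history.append((tool, cache[(tool, arg)]))
        last_tool = tool
    # guardrail: out of iterations, force a best-effort answer
    return f"best effort: {history[-1][1] if history else 'no data'}"

calls = []
tools = {"search": lambda q: calls.append(q) or f"results for {q}"}
script = [
    {"type": "tool", "tool": "search", "arg": "crewai"},
    {"type": "tool", "tool": "search", "arg": "agents"},  # blocked: repeat tool
    {"type": "answer", "text": "CrewAI orchestrates role-playing agents."},
]
answer = run_agent(lambda history: script.pop(0), tools)
```

<p>In this scripted run, the second search is blocked by the repeat-tool guardrail, so the underlying tool only executes once before the agent is nudged to answer.</p><p>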
And that's where a lot of the work has been put in.</p><p>[00:16:01] <strong>Joao Moura:</strong> So you mentioned local models as well</p><p>[00:16:04] <strong>Alex Volkov:</strong> We mentioned we met at the Ollama event, and Ollama, shout out Ollama folks, is a CLI to download and run open source models on your hardware, basically. Many of the previous agent attempts, AutoGPT and different ones, they used maybe GPT-4 or something.</p><p>[00:16:20] <strong>Alex Volkov:</strong> We're getting to the tools, and we heard previously in the space from Jon Durbin that there are models now that are better for specific tasks, like function calling as well. Joao, could you speak a little bit about the difference that you see? Could Crew AI work with both, right? Open source and also API ones?</p><p>[00:16:39] <strong>Alex Volkov:</strong> And could you [00:16:40] talk a little about the difference that you see between the open source models as we have them right now versus the online models, and which ones you would prefer for your tasks?</p><p>[00:16:50] <strong>Joao Moura:</strong> It turns out, I think the fact that Crew AI supports local models is something that made it take off, because that's something that I wanted from the get go. These agents, especially if you're trying to automate complex tasks, can become rather costly if you want to run them 24/7.</p><p>[00:17:09] <strong>Joao Moura:</strong> But with the ability to use local models, you can basically just set and forget, and they're going to keep doing work for you. So I wanted to make sure to support local models because of that. Crew AI supports any of the vendors that you're going to find support for in LangChain. 
So you can use any of the open source models out there, or Bedrock, GPT, you name it.</p><p>[00:17:30] <strong>Joao Moura:</strong> And you can also use Ollama, you can also use LM Studio; whatever is the best way you have to run your models locally, you can use that. I specifically, personally, love Ollama. Ollama is amazing. I love the guys that built it as well, and I think it's so easy to use that I ended up using that. And I have been using some of the smaller models.</p><p>[00:17:51] <strong>Joao Moura:</strong> Shout out to Nous Research. I love that OpenHermes 2.5 model. It's just amazing and so small, I can't believe how good it is. That's one that I use a lot; I'm using OpenHermes 2.5 just because of how well it works, but I also tried Mistral, I also tried Solar, I also tried Nexus, so many models out there, so good.</p><p>[00:18:19] <strong>Joao Moura:</strong> One thing that I want to call out as well is that these local models definitely struggle a little bit more compared to GPT-4 in terms of sticking with a given format. I'm also collecting all my execution data so that I can fine tune agentic models. Similar to how you have instruct models and chat models, I want to make sure that we start to see more agentic models out there.</p><p>[00:18:46] <strong>Joao Moura:</strong> I have seen some closed source ones that you're not able to touch. So I'm building an open source dataset that I can then use to fine tune those models, and then you're basically going to have these agents run on local models without a glitch. That would be, at least, the end goal.</p><p>[00:19:05] <strong>Joao Moura:</strong> That's incredible, incredible specifically because</p><p>[00:19:08] <strong>Alex Volkov:</strong> we've had interviews with a bunch of folks who build agentic stuff. 
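</p><p>On the local-model setup Joao describes, running a crew against Ollama or LM Studio usually means talking to a locally served, OpenAI-style chat endpoint. Here is a hedged sketch of building such a request; the default port and path are assumptions about a typical local setup, the model name is illustrative, and nothing is actually sent:</p>

```python
import json

def local_chat_request(model: str, user_msg: str,
                       base_url: str = "http://localhost:11434/v1") -> tuple[str, str]:
    """Build (but don't send) an OpenAI-style chat request for a local server."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "temperature": 0.2,
    })
    return url, body

# e.g. an OpenHermes 2.5 variant served locally (model name is an assumption):
url, body = local_chat_request("openhermes2.5-mistral", "Draft a LinkedIn post")
```

<p>Swapping base_url, for example to LM Studio's local server, is all it would take to move a crew between local backends.</p><p>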
So one of the more successful episodes of last year for ThursdAI was an interview with Killian Lucas from Open Interpreter, and the open source community here definitely opened a thread with Killian specifically to say, hey, when users run a bunch of this stuff, we would love to have</p><p>[00:19:27] <strong>Alex Volkov:</strong> users opt in, maybe, to some telemetry or analytics, to be able to build datasets for the tasks that were completed or not completed. I don't know if you have this plan, but this is definitely a benefit to the community if you do have a way for folks to log their stuff. I also mentioned that I probably should reach out to you separately to see if these runs for these agents in a crew could be logged in Weights & Biases with the integration.</p><p>[00:19:50] <strong>Alex Volkov:</strong> I would definitely be more than happy to participate and see if we can look at the execution of your agents on Weights & Biases. As well, before I let Umesh in, who wanted to ask you a bunch of questions, he's been running agents of his own, I want to say:</p><p>[00:20:06] <strong>Alex Volkov:</strong> what are the next plans for Crew AI? Where are you planning to take this? With many of these projects, suddenly people ask for a UI, because maybe they don't want to do installing and Python stuff. You already mentioned TypeScript. Could you give us a little bit of a sense of the future, where are you planning to take this?</p><p>[00:20:23] <strong>Joao Moura:</strong> I think what we are getting now is a bunch of feature requests from a bunch of different sides. So there is some prioritization going on so that I can figure out what to focus on next. 
One thing that seems to be a no brainer to me, though, is that we need to have a UI for this.</p><p>[00:20:37] <strong>Joao Moura:</strong> I think this would be pretty cool and unlock a lot of use cases for people out there. I know there are other people that have been building UIs for their businesses that are being built around this. I just think an open source version would be better. So I'm definitely already working on the UI for this.</p><p>[00:20:53] <strong>Joao Moura:</strong> You're going to be able to put your agents together, bring all your crews together, and then you can basically have these agents run by themselves. I might look into offering an option where we can even host it for you, and I'm still figuring out what that would look like.</p><p>[00:21:10] <strong>Joao Moura:</strong> Maybe that's too far ahead, but yeah, I think the UI for it makes a lot of sense. Another thing is that it seems a lot of the use cases go back to very similar tools over and over again. And even though you can hook them up with LangChain or LlamaIndex tools, those might still require some configuration.</p><p>[00:21:30] <strong>Joao Moura:</strong> It might not be as straightforward for some people. So we might take an opinionated take on a tool-specific repository and package that you can use, so that, let's say, if someone wants to create an agent that does RAG, they might be able to do that with one line versus having to build a custom</p><p>[00:21:51] <strong>Joao Moura:</strong> tool for that. So that's another thing that we have been looking at as well. I think there are so many use cases. One thing that I'm trying to do more now is just chat with more people that are using this. 
Especially on the business side of things, to understand what other use cases we could support there.</p><p>[00:22:08] <strong>Joao Moura:</strong> But yeah, a lot of interesting things cooking.</p><p>[00:22:11] <strong>Alex Volkov:</strong> I'm looking forward to hearing more about Crew AI and upcoming things. I think Umesh here has been doing agents for a while and has a few questions as well. Umesh, go ahead.</p><p>[00:22:23] <strong>Umesh Rajiani:</strong> Yeah. Hey, Joao, thank you for coming in. Almost 80, 90 percent of our workflow now is agentic workflow. We are employing the generative AI library of Google for Gemini, and also doing a lot of work using AutoGen.</p><p>[00:22:41] <strong>Umesh Rajiani:</strong> And we got introduced to Crew AI, I think, four weeks ago through one of my engineers and found it pretty interesting. There are going to be a lot of pull requests coming in from us, actually, because we are thinking about a few things. I just wanted to ask you one particular question about the process part.</p><p>[00:22:59] <strong>Umesh Rajiani:</strong> Your current library, as I understand, is a linear process library, and what we are employing with AutoGen is also a bit of a graph of actions, as well as the DAG approach. The DAG approach can be implemented using your process. But do you have a graph-of-actions workflow in planning, or something that is coming up?</p><p>[00:23:24] <strong>Joao Moura:</strong> Yes. So this idea of processes, I want this to be one of the cornerstones of Crew AI. My understanding is that, as I said earlier, a lot of the different outcomes that you're going to get, a lot of the magic, happens when you define through what processes these agents are going to work together, right?</p><p>[00:23:43] <strong>Joao Moura:</strong> And there are so many options out there. 
Like, you can have them work sequentially, you can have them work in a group, as if they're in a meeting, you can have a consensus strategy where they can kind of bid to see who is going to take on the task and even evaluate the results.</p><p>[00:23:59] <strong>Joao Moura:</strong> So there's just a lot of different processes that can be implemented there. And the idea is to implement all these processes so that people can have some work happen in parallel if they want to, or sequentially, or whatnot. About a graph-specific API, I'm not sure how much I can tell about it, but we have been talking with the LangChain folks about it.</p><p>[00:24:19] <strong>Joao Moura:</strong> And there are some things that have been cooking there.</p><p>[00:24:23] <strong>Umesh Rajiani:</strong> Enough said. One last question. So currently it is all Python, but most of our implementations now, because of the latency and the complexity of the workflows that we are implementing, mostly our applications are enterprise applications.</p><p>[00:24:36] <strong>Umesh Rajiani:</strong> We are employing a lot of Rust for a compiled workflow. So do you have any plans of porting it to Rust, or are you looking for some support in that area?</p><p>[00:24:47] <strong>Joao Moura:</strong> Yeah. So we are porting it to TypeScript right now, and there's some work being done to build an API where you might be able to just spin it up as a service.</p><p>[00:24:58] <strong>Joao Moura:</strong> And you can then [00:25:00] add agents, create agents, all through an API, so you don't have to create one yourself. You just need to figure out how you want to host it. I haven't thought about porting it to Rust yet, but I would be open to that idea. For sure. 
If I can get enough people to help out, I'll create a repository and we can get things working, for sure.</p><p>[00:25:16] <strong>Umesh Rajiani:</strong> I'll reach out to you separately. Thanks, Alex, for allowing me to ask questions. Of course I have many questions, but I'll reach out to him on his Discord.</p><p>[00:25:23] <strong>Alex Volkov:</strong> Yeah, thank you, Umesh. And João, I just want to recap on the awesome success of Crew AI. I agree with you. We've had and talked about many frameworks like this, but the ability to run this completely on your machine, to not pay somebody else, to use Ollama.</p><p>[00:25:43] <strong>Alex Volkov:</strong> I didn't know that you also support LM Studio. Shout out LM Studio, a friend of the pod; hopefully we're going to get them on the next ThursdAI. So I didn't know that I can open up a local model on LM Studio and then the crew would use this API. I definitely want to play with this now.</p><p>[00:26:00] <strong>Alex Volkov:</strong> I want to give you a few minutes to just talk to the community. A lot of things are happening in this world. I find it very interesting how the AI engineers, the folks from a traditional software engineering background, are building the tools, building the RAG systems, let's say using LangChain.</p><p>[00:26:17] <strong>Alex Volkov:</strong> On the other side, we have a bunch of machine learning folks who are building the models, fine tuning the models, working in that space, and reading the papers. And I do see a connection between them, and obviously my role at Weights & Biases specifically is to connect these two worlds. 
I do want to see more people that train models also think about agentic behaviors as well.</p><p>[00:26:37] <strong>Alex Volkov:</strong> We heard Jon Durbin before talk about how there are specific datasets for RAG, and specific datasets for execution and function calling. I think the Airoboros dataset has a bunch of function calling as well. So I definitely want to see a connection here. João, please feel free to talk to the community in terms of what you need to make Crew AI the best crew ever.</p><p>[00:26:57] <strong>Alex Volkov:</strong> Where can they find you, what can you get help with? The floor is yours. Feel free to take over and ask for everything. The community will provide.</p><p>[00:27:06] <strong>Joao Moura:</strong> A hundred percent. And just to tap into what you said there, I agree. I think the magical thing that happened last year, with GPT taking the world by storm, is that it connected two groups of engineers that in the past didn't talk very much.</p><p>[00:27:22] <strong>Joao Moura:</strong> And that was AI and ML engineers with regular software engineers. I have managed teams in both areas in the past, and I definitely have seen that there isn't much interaction there. But right now it's amazing to see all the great stuff that has been coming up from those two groups interacting more together.</p><p>[00:27:40] <strong>Joao Moura:</strong> It has been a lot of fun. About Crew AI: yes, I would say give me a follow on Twitter, or X I would say now, so give me a follow on X, and I definitely will keep posting and sharing more about Crew AI and all the things related to LLMs, agents, and everything else. You can learn more about Crew AI by looking into its GitHub.</p><p>[00:28:00] <strong>Joao Moura:</strong> You can go to my profile slash crewAI. I'll probably add the link to my X account as well. 
From there, if you have follow up questions, or if you want to see what people have been cooking with it, I would say join the Discord community. We have around 500 people there, and it has been growing daily.</p><p>[00:28:18] <strong>Joao Moura:</strong> So if you join that, you might be able to see other use cases and things like that. If you're curious about it but you're not sure what you could build with it, there's a bunch of examples in the README, and even some videos that I recorded of crews doing stock analysis or trip planners and all that.</p><p>[00:28:38] <strong>Joao Moura:</strong> There's a lot of content there that you can consume in order to get your ideas. And if you do decide to give it a try, don't miss out on the custom GPT. It's also linked in the README, and it can help you write the code. It can help you with ideas for the agents, ideas for the roles or for the tasks, or anything around using Crew AI.</p><p>[00:28:58] <strong>Joao Moura:</strong> If you're also curious about contributing to the project, GitHub has a bunch of issues. My wife, again, has been flagging and tagging all of them. So thank you so</p><p>[00:29:07] <strong>Joao Moura:</strong> much.</p><p>[00:29:07] <strong>Alex Volkov:</strong> Shout out, Bianca.</p><p>[00:29:08] <strong>Joao Moura:</strong> You can find all the ones that are tagged with help wanted, or the ones that are related to questions, and you can help answer them as well. And we're going to be writing new documentation from scratch, so this might be a great opportunity to help with simpler stuff as well, if you're into that.</p><p>[00:29:24] <strong>Alex Volkov:</strong> Awesome. And I think I saw something, I don't know if I have a link,</p><p>[00:29:28] <strong>Alex Volkov:</strong> that generates documentation on the fly from just the code itself, and it looks super cool. I'll try to send this to you. Joao, thank you so much for joining ThursdAI. 
This is your first time here; hopefully not the last.</p><p>[00:29:40] <strong>Alex Volkov:</strong> Congrats on the success of Crew AI. It's been great meeting you and then having you on. Thank you for coming, and folks should definitely check out Crew AI, give Joao a follow, and we will expect more. I can't wait to run a few crews myself to help me with ThursdAI tasks, especially on local models.</p><p>[00:29:58] <strong>Alex Volkov:</strong> It was super cool. Thank you for coming, man.</p><p>[00:30:01] <strong>Joao Moura:</strong> I love it. Thank you so much for having me. Catch you folks online.</p><p>[00:30:04] <strong>Alex Volkov:</strong> Awesome, and your audio quality was great by the way, thanks for testing out your mic.</p><p>[00:30:07]</p><p>[00:30:11] Bagel models top the leaderboard, from Jon Durbin</p><p>[00:30:11] <strong>Alex Volkov:</strong> We're moving forward to the top open source models on the LLM leaderboard, and their creator. So if you open the Open LLM Leaderboard on Hugging Face, which we often talk about, we've discussed the difference between human evaluation and the automatic evaluations that the Open LLM Leaderboard runs.</p><p>[00:30:32] <strong>Alex Volkov:</strong> You will see a bunch of models. The top three are from cloudyu, and they're, I think, merges of Yi-34B, and then a 34B as well, but it's not based on Mixtral. And the rest of the list is a bunch of Jon Durbin Bagel examples. So there are like six models there that are basically based on Jon's Bagel DPO versions.</p><p>[00:31:00] <strong>Alex Volkov:</strong> And I just wanted to shout this out, and shout out Durbin for working this hard and releasing these models.</p><p>[00:31:06] <strong>Alex Volkov:</strong> Let's see if we can hear from the man himself. 
Hey, Jon.</p><p>[00:31:09] <strong>Jon Durbin:</strong> Hey, how's it going?</p><p>[00:31:10] <strong>Alex Volkov:</strong> Good. Thanks for joining us. I don't remember if you've ever been on stage, so feel free to briefly introduce yourself to the audience who doesn't know you. And definitely they should, and they should follow you as well.</p><p>[00:31:22] <strong>Jon Durbin:</strong> Yeah, I'm a software engineer. I'm an AI tinkerer. I've been doing synthetic data stuff since, I guess, maybe April, with the Airoboros project. It's been tons of fun. Lately I've been mostly working on the Bagel models. If you're wondering where the Bagel name came from, it's from Everything, Everywhere, All at Once.</p><p>[00:31:37] <strong>Jon Durbin:</strong> Great movie. Yeah, so that's kind of the premise of the model: all the prompt formats, all the data sources, all the training techniques; there's NEFTune, there's DPO, just fun stuff there. As far as the leaderboard, that wasn't really my goal. If you look at the actual token count per dataset, I think the largest amount of tokens is actually probably the Cinematika dataset, which is movie scripts converted to roleplay format.</p><p>[00:32:07] <strong>Jon Durbin:</strong> So it's interesting that it does so well, but really I was targeting the model for general purpose as a merge base, because I know that MergeKit is so popular now. So I was trying to come up with a base model that has a little bit of everything and every prompt format, so that anyone who wants to do this alchemy with MergeKit</p><p>[00:32:28] <strong>Jon Durbin:</strong> can use the Bagel series as a base. Because if you have an Alpaca-based model and a Vicuna-based model, they're not going to merge very well. It'll have weird stray user tokens or whatever. 
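To make the prompt-format mismatch concrete, here is a rough sketch of how the same exchange renders under an Alpaca-style versus a Vicuna-style template. The exact template strings vary between finetunes, so treat these as illustrative approximations rather than any specific model's official format:

```python
# Render one exchange under two common instruction templates to show why
# weight-merging models trained on different formats is fragile: the merged
# model has seen conflicting scaffolding strings around each turn.

def alpaca_format(instruction: str, response: str) -> str:
    # Alpaca-style: plain-text headers around each field.
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Response:\n" + response
    )

def vicuna_format(instruction: str, response: str) -> str:
    # Vicuna-style: USER/ASSISTANT turn markers on a single line.
    return "USER: " + instruction + " ASSISTANT: " + response

prompt = "Name a public-domain book."
answer = "Pride and Prejudice."

# The two surface forms share almost none of the surrounding tokens,
# so averaged weights reinforce neither layout cleanly.
print(alpaca_format(prompt, answer))
print(vicuna_format(prompt, answer))
```

Merging two checkpoints that each expect one of these layouts leaves neither format cleanly reinforced, which is the failure mode being described here.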
The idea with Bagel is to be a good base.</p><p>[00:32:42] <strong>Alex Volkov:</strong> I also saw quite a lot of work you're doing on new DPO datasets. Could you talk about those?</p><p>[00:32:48] <strong>Jon Durbin:</strong> Yeah, I keep cranking out new DPO datasets to enhance the stuff that's lacking right now.</p><p>[00:32:54] <strong>Jon Durbin:</strong> I think even the Yi 34B might be a little bit overcooked. I used QLoRA for both the supervised fine-tuning stage and DPO. And it turns out with DPO, you really need to use an incredibly low learning rate. I was using maybe a 50x smaller learning rate for the DPO phase than the supervised fine-tuning phase, and even then [00:33:20] I stopped the run about halfway through and killed it, because the evals started spiking all over the place.</p><p>[00:33:26] <strong>Jon Durbin:</strong> Yeah, still lots of stuff to learn, and I'd love to do a full-weight fine-tune of the Yi 34B. I'm probably going to work on a Solar 10.7B version of it next, and maybe a DeepSeek 67B. I'm curious if DeepSeek's deeper network is actually going to improve things in any sort of way.</p><p>[00:33:46] <strong>Alex Volkov:</strong> Awesome. Jon, thank you so much for joining and thank you so much for the deep dive. So I have two questions for you real quick. I did not expect you to join, so this is not a full-blown interview, but I'm very happy that I have you. First of all, you mentioned that there's like two versions, DPO and non-DPO, of Bagel.</p><p>[00:34:01] <strong>Alex Volkov:</strong> And you mentioned the differences between them. You said the DPO version is more factual and truthful, but not great for RP. I wasn't sure what RP is. Roleplay?</p><p>[00:34:10] <strong>Jon Durbin:</strong> Roleplay,</p><p>[00:34:11] <strong>Alex Volkov:</strong> Yeah. And then creative writing. Could you give us a little bit of a sense of the DPO versus non-DPO version? 
Is that just dataset based, or is there something more going on behind the scenes that makes the one model behave differently than the other?</p><p>[00:34:27] <strong>Jon Durbin:</strong> Yeah, so really for all of the Bagel series, you basically have two phases of training. There's the regular supervised fine-tuning stage; you can look at the Bagel repository, everything is completely open source and reproducible. In the supervised fine-tuning phase it's just a ton of datasets. And then I take that fine-tuned model and apply DPO, direct preference optimization, to it.</p><p>[00:34:52] <strong>Jon Durbin:</strong> And I have quite a few DPO datasets in there, but really, the DPO landscape is sparse right now. You basically have DPO datasets from NVIDIA, the HelpSteer dataset, which is a human-annotated one where they ran a bunch of prompts against LLMs and then had humans rank them.</p><p>[00:35:14] <strong>Jon Durbin:</strong> Then there's the LMSYS-Chat-1M, where you can find the exact same prompt sent to multiple models. And so you can take the GPT-4 answers, use those as the preferred answer, and then the Vicuna 33B or something as the rejected answer, because you're assuming the GPT-4 one is better.</p><p>[00:35:31] <strong>Jon Durbin:</strong> Same with the Orca DPO pairs; I know Argilla just did a new release of that, which is better. But we don't have a ton of DPO datasets that are specifically for creative writing tasks and stuff. I made one which is actually based on the Airoboros 2.2 compared to the Airoboros 3 series, where I actually rewrote most of the creative writing prompts with a different prompt and some other stuff.</p><p>[00:35:59] <strong>Jon Durbin:</strong> I actually used the March version of GPT-4, which is better. So in that case you get basically three to four times the number of tokens in the output. 
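The pairing recipe Jon describes for LMSYS-style logs (the same prompt answered by several models, with the stronger model's answer assumed to be preferable) can be sketched in a few lines. The record layout and model names below are illustrative, not the schema of any released dataset:

```python
# Build DPO preference pairs from logs where the same prompt was answered
# by multiple models, preferring a designated "strong" model's answer.

def build_dpo_pairs(logs, strong_model="gpt-4"):
    """logs: list of {"prompt", "model", "response"} dicts."""
    by_prompt = {}
    for row in logs:
        by_prompt.setdefault(row["prompt"], {})[row["model"]] = row["response"]

    pairs = []
    for prompt, answers in by_prompt.items():
        if strong_model not in answers:
            continue  # no preferred answer available for this prompt
        chosen = answers[strong_model]
        for model, rejected in answers.items():
            if model == strong_model:
                continue
            # Heuristic assumption baked in here: the strong model's answer
            # is always better than the weaker model's answer.
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

logs = [
    {"prompt": "What is DPO?", "model": "gpt-4",
     "response": "Direct Preference Optimization is a training method..."},
    {"prompt": "What is DPO?", "model": "vicuna-33b",
     "response": "A kind of training."},
]
pairs = build_dpo_pairs(logs)
print(len(pairs))  # one chosen/rejected pair for the shared prompt
```

The weakness Jon notes follows directly from the heuristic: whatever domains the "strong" model is weak at (creative writing, roleplay) inherit that weakness in the resulting preference data.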
So there's that DPO dataset, which I make myself in the Bagel code. But otherwise there's really no roleplay-focused data in any of the DPO datasets.</p><p>[00:36:21] <strong>Jon Durbin:</strong> So what happens is you take that supervised fine-tuned model from the first phase, and you apply DPO to it, and it kind of experiences forgetting of some of what it learned during the fine-tuning, stuff like creative writing and roleplay. Same with code. So if you look at my Twitter feed, you can see that I've released a Python DPO dataset that'll hopefully fix some of that stuff.</p><p>[00:36:44] <strong>Jon Durbin:</strong> I just released another contextual question answering DPO dataset for better RAG performance after the DPO phase. I put out just a few minutes ago Gutenberg DPO, which is basically: I parse maybe 14 or 15 books from Project Gutenberg that are public domain into chapters, and then create prompts to actually write those chapters. And then I create summaries, so you have the previous chapter summary inside the prompt, and then I use that to prompt one of the local LLMs (I used Dolphin, OpenChat, and Llama-2 13B) to get the rejected values. The outputs from these models are fine in some cases, but they're short, and you'll notice with most LLMs, when you write a story, it's always a happy ending, and it ends with something like, and they walked into the forest and lived happily ever after.</p><p>[00:37:37] <strong>Jon Durbin:</strong> It's boring and cliche. My hope with the Gutenberg stuff is that when you actually prompt it to write a chapter of a book, it's gonna be from human writing, from popular books. 
They're a little bit old-timey, because they have to be to be public domain, but,</p><p>[00:37:52] <strong>Alex Volkov:</strong> Yeah.</p><p>[00:37:53] <strong>Jon Durbin:</strong> hopefully it will improve the writing and creativity of whatever Bagel models I do in the future. So I'm trying to improve that, but there's still a lot of stuff I need to do. I think the next thing I'll do before I actually make another Bagel model is use something like the Goliath 120B to make a roleplay-centric dataset for DPO. That way it doesn't completely forget how to do that as well.</p><p>[00:38:15] <strong>Alex Volkov:</strong> Awesome. And I'm just looking at the number of datasets. Like you said, everything, everywhere, all at once, and this is why it's called Bagel, Everything Bagel. It's just an insane amount of datasets. I'm just gonna run through them real quick: AI2 ARC, Airoboros, Belebele, Bluemoon.</p><p>[00:38:30] <strong>Alex Volkov:</strong> You have Capybara in there, Cinematika, EmoBank, Gutenberg, LMSYS chat, like tons, tons of stuff. It's incredible how well the model performs. Jon, one thing that I wanted to follow up on before we move on. You mentioned something that's better for RAG as well. You mentioned a DPO dataset that's better for RAG.</p><p>[00:38:45] <strong>Alex Volkov:</strong> Is that the contextual DPO that you released?</p><p>[00:38:49] <strong>Jon Durbin:</strong> Yep.</p><p>[00:38:50] <strong>Alex Volkov:</strong> What makes it better for RAG purposes? 
Could you maybe give two sentences about this?</p><p>[00:38:56] <strong>Jon Durbin:</strong> This is actually something you can reproduce with the Airoboros tool as well, if you wanted to generate your own data. I have this instructor in there called counterfactual contextual, and what that does is it makes a bunch of fake facts. Like it'll say the Battle of Midway happened in the Civil War, something like that, and it'll put that into context and then ask a question about it.</p><p>[00:39:19] <strong>Jon Durbin:</strong> And then it'll have the real version of the fact as well, World War II, Battle of Midway. And then the idea is that you want to train the model to always attend to the context, and not try to base the answers on what it knows from the base pre-training. For example, suppose you have a virtual world, a different planet where the sky is purple.</p><p>[00:39:41] <strong>Jon Durbin:</strong> And you ask the model, what color is the sky, based on your lore book or whatever. You want to make sure that the model always obeys your context and answers accordingly, and doesn't say the sky is blue, because I know the sky is blue. So the dataset that I put in there has a bunch of those kinds of things.</p><p>[00:39:59] <strong>Jon Durbin:</strong> You can't just put in the fake facts, because then the model will just, you know, learn to answer incorrectly. So for every fake version of the context, you have to put in a real version of the context as well. The other thing that makes it better for RAG is I actually stuff more than one piece of context into it. Because with RAG, the retrieval accuracy is the hardest part, so you want to retrieve more than one document.</p><p>[00:40:23] <strong>Jon Durbin:</strong> So suppose you want to retrieve ten documents. 
If you want to stuff all ten of those into a single prompt and then you want to provide references to the user, you have to know which segment of the prompt it came from. This dataset also includes, like, you can put metadata into the prompt for each section that you retrieve, and then when you ask for references in the output, it'll actually only reference that segment.</p><p>[00:40:47] <strong>Jon Durbin:</strong> A bunch of stuff like that. Yeah, I put in irrelevant context as well, to try to confuse the model, because retrieval is very noisy. All of that kind of stuff is in there.</p><p>[00:40:57] <strong>Alex Volkov:</strong> First of all, I think from the whole community, thank you for everything that you do and your work. And I really appreciate your time here on ThursdAI. You're more than welcome to always join us. I didn't expect you to be here when I talked about</p><p>[00:41:09] <strong>Alex Volkov:</strong> the stuff that you just released, but it's really, really awesome when people from the community who work on the stuff that they do also come and have a chance to speak about it. So Jon, you're always welcome on ThursdAI. I would love to invite you again and talk deeper.</p><p>[00:41:20] <strong>Alex Volkov:</strong> And as you release the next stuff that you're working on, I know you're working on a bunch of next stuff, you're more than welcome to come here and discuss, or even DM me before, so we'll know what to chat about. I will definitely mention the DPO datasets in the fine-tuning hackathon that I'm going to this week.</p><p>[00:41:35] <strong>Alex Volkov:</strong> And so thank you for that. That was why I wanted to do a little bit of a deep dive. [00:41:40] And also I want to shout you out as one of the most active users of Weights & Biases. You posted the recap that we sent, and you have two reports there. 
And you're part of like the top 10 percent of most active users, with 2,500</p><p>[00:41:53] <strong>Alex Volkov:</strong> hours trained in '23 and like 900-plus models. So that's incredible. I just wanted to shout this out.</p><p>[00:42:02] <strong>Jon Durbin:</strong> Yeah, I'm a little addicted.</p><p>[00:42:03] <strong>Alex Volkov:</strong> Yeah, it's amazing. It's amazing. And I appreciate everything that you do, and I think the community does as well</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/jan14-sunday-special-deep-dives</link><guid isPermaLink="false">substack:post:140689559</guid><dc:creator><![CDATA[Alex Volkov, João Moura, Jon Durbin, and Umesh Rajani]]></dc:creator><pubDate>Mon, 15 Jan 2024 01:53:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/140689559/a01a9a9b8cda48d61d91fe48cf9eade9.mp3" length="30513377" type="audio/mpeg"/><itunes:author>Alex Volkov, João Moura, Jon Durbin, and Umesh Rajani</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>2543</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/140689559/a8435c38b3319a3ce765d9472ed04dd9.jpg"/></item><item><title><![CDATA[📅 ThursdAI Jan 11 - GPTs store, Mixtral paper, Phi is MIT + Phixtral, 🥯 by Jon Durbin owns the charts + Alex goes to SF again and 2 deep dive interviews 🎙️ ]]></title><description><![CDATA[<p>Hey hey everyone, how are you this fine ThursdAI? 👋 I’m gud, thanks for asking!</p><p>I’m continuing my experiment of spilling the beans and telling you about everything we talked about in advance, both on the pod and in the newsletter, so let me know if this is the right way to go or not; for the busy ones it seems that it is. 
If you don’t have an hour and 15 minutes, here’s a short video recap of everything we chatted about:</p><p>ThursdAI - Jan 11 2024 TL;DR</p><p>TL;DR of all topics covered + Show notes</p><p>* <strong>Open Source LLMs</strong></p><p>* 🔥 Bagel from Jon Durbin is now top of the LLM leaderboard (<a target="_blank" href="https://x.com/jon_durbin/status/1743586067108831472?s=20">X</a>, <a target="_blank" href="https://huggingface.co/jondurbin/bagel-dpo-34b-v0.2">HF</a>, <a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/1916896/llm_comparisontest_confirm_leaderboard_big_news/">Wolframs deep dive and scoring</a>)</p><p>* OpenChat January Update - Best open source 7B LLM (<a target="_blank" href="https://x.com/openchatdev/status/1744985660870795635?s=20">X</a>, <a target="_blank" href="https://huggingface.co/openchat/openchat-3.5-0106">Hugging Face</a>)</p><p>* Our friends at NousResearch announce a seed round of 5.2M as their models pass 1.2 million downloads (<a target="_blank" href="https://x.com/nousresearch/status/1744865872563618128?s=12">X</a>)</p><p>* Argilla improved (Distillabeled?) 
the DPO enhanced Neural Hermes with higher quality DPO pairs (<a target="_blank" href="https://x.com/argilla_io/status/1745057571696693689?s=20">X</a>)</p><p>* New MoEs are coming out like hotcakes - PhixTral and DeepSeek MoE (<a target="_blank" href="https://x.com/XueFz/status/1745280372043296956?s=20">X</a>, <a target="_blank" href="https://x.com/osanseviero/status/1745402823682970036?s=20">Omar Thread</a>, <a target="_blank" href="https://twitter.com/maximelabonne/status/1744867841436700850">Phixtral Thread</a>)</p><p>* Microsoft makes Phi MIT licensed 👏</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* OpenAI adds personalization & team tiers (<a target="_blank" href="https://openai.com/blog/introducing-chatgpt-team">Teams announcement</a>)</p><p>* OpenAI launches GPT store (<a target="_blank" href="https://openai.com/blog/introducing-the-gpt-store">Store announcement</a>, <a target="_blank" href="chat.openai.com/gpts">Store link</a>)</p><p>* Mistral Medium tops the LMsys human evaluation arena, is the best LLM overall after GPT4 👏 (<a target="_blank" href="https://twitter.com/lmsysorg/status/1745061423724875891?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1745061423724875891%7Ctwgr%5E58a43f98e08b74e94594e238390ee283b99e9430%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkwDbXPrKj%2Fvirtual-grass-touching-not-recorded">X</a>)</p><p>* <strong>Hardware</strong></p><p>* Rabbit R1 is announced, $200 with no subscription, everybody has a take (<a target="_blank" href="https://twitter.com/rabbit_hmi/status/1744781083831574824?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1744781083831574824%7Ctwgr%5E58a43f98e08b74e94594e238390ee283b99e9430%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkwDbXPrKj%2Fvirtual-grass-touching-not-recorded">X</a>)</p><p>* <strong>This weeks Buzz from Weights & Biases</strong></p><p>* Hackathon with Together, Langchain and WandB (and ME!) 
this weekend in AGI house (<a target="_blank" href="https://x.com/togethercompute/status/1744771806400233685?s=20">X</a>, <a target="_blank" href="https://partiful.com/e/AlntdLtxh9Jh1J6Pcsma">Signup</a>)</p><p>* <strong>Video</strong></p><p>* Bytedance releases MagicVideo-V2 video gen that looks great and passes Pika labs in human tests (<a target="_blank" href="https://x.com/arankomatsuzaki/status/1744918551415443768?s=20">X</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Luma launched their online version of Genie and it's coming to the API (<a target="_blank" href="https://twitter.com/LumaLabsAI/status/1744778363330535860?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1744778363330535860%7Ctwgr%5E58a43f98e08b74e94594e238390ee283b99e9430%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkwDbXPrKj%2Fvirtual-grass-touching-not-recorded">X</a>)</p><p>* <strong>Show notes and links mentioned</strong></p><p>* MergeKit (<a target="_blank" href="https://github.com/cg123/mergekit">github</a>)</p><p>* Jon Durbins Contextual DPO dataset (<a target="_blank" href="https://huggingface.co/datasets/jondurbin/contextual-dpo-v0.1">HuggingFace</a>)</p><p>* Phixtral from Maxime Labonne (<a target="_blank" href="https://twitter.com/maximelabonne/status/1744867841436700850">X</a>, <a target="_blank" href="https://huggingface.co/mlabonne/phixtral-4x2_8">HuggingFace</a>)</p><p>* WandGPT - our custom Weights & Biases GPT (<a target="_blank" href="https://wandb.me/gpt">GPT store</a>)</p><p>* Visual Weather GPT by me - <a target="_blank" href="https://chatg.pt/artweather">https://chatg.pt/artweather</a></p><p>* Ask OpenAI to not train on your chats - <a target="_blank" href="https://privacy.openai.com/policies">https://privacy.openai.com/policies</a></p><p>AI Hardware</p><p>It seems that the X conversation had a new thing this week: the AI hardware startup Rabbit showcased their new $200 device (no subscriptions!) 
at CES, and everyone and their mom had an opinion! We had quite a long conversation about that with (his first time on ThursdAI 👏) as we both pre-ordered one. However, there were quite a few red flags, like for example: GPUs are costly, so how would an AI device that runs its AI in the cloud cost just a one-time 200 bucks?</p><p>There were other interesting things they showed during the demo, and I’ll let you watch the <a target="_blank" href="https://twitter.com/rabbit_hmi/status/1744781083831574824">full 30 minutes</a>, and if you want to read more, here’s a great deeper dive into this from .</p><p>UPDATE: As I’m writing this, the CEO of Rabbit (who’s also on the board of Teenage Engineering, the amazing company that designed this device) tweeted that they sold out the initial first AND second batch of 10K units, netting a nice $2M in hardware sales in 48 hours!</p><p>Open Source LLMs</p><p><strong>Mixtral paper dropped (ArXiv, Morgans take)</strong></p><p>Mistral finally published the paper on Mixtral of experts, the MoE that's the absolute best open source model right now, and it's quite the paper. Nisten did a full paper reading with explanations on an X space, which I co-hosted, and we had almost 3K people tune in to listen. <a target="_blank" href="https://x.com/nisten/status/1744562277947171060?s=20">Here's the link</a> to the live reading X space by <a target="_blank" href="https://x.com/nisten/">Nisten</a>.</p><p>And here are some notes, courtesy of <a target="_blank" href="https://twitter.com/morgymcg/status/1744691603292217412">Morgan McGuire</a> (who's my boss at WandB btw 🙌)</p><p><strong>Strong retrieval across the entire context window</strong></p><p><em>Mixtral achieves a 100% retrieval accuracy regardless of the context length or the position of passkey in the sequence.</em></p><p><strong>Experts don't seem to activate based on topic</strong></p><p><em>Surprisingly, we do not observe obvious patterns in the assignment of experts based on the topic. 
For instance, at all layers, the distribution of expert assignment is very similar for ArXiv papers (written in Latex), for biology (PubMed Abstracts), and for Philosophy (PhilPapers) documents.</em></p><p><em>However...</em></p><p><em>The selection of experts appears to be more aligned with the syntax rather than the domain</em></p><p><strong>Datasets - </strong>No info was provided as to which datasets Mixtral used to pretrain their incredible models 😭</p><p><strong>Upsampled multilingual data</strong></p><p><em>Compared to Mistral 7B, we significantly upsample the proportion of multilingual data during pretraining. The extra capacity allows Mixtral to perform well on multilingual benchmarks while maintaining a high accuracy in English</em></p><p><strong>Mixtral Instruct Training</strong></p><p><em>We train Mixtral – Instruct using supervised fine-tuning (SFT) on an instruction dataset followed by Direct Preference Optimization (DPO) on a paired feedback dataset and </em>was trained on <a target="_blank" href="https://twitter.com/CoreWeave">@CoreWeave</a></p><p>Jon Durbins Bagel is the 🤴 of open source this week</p><p>6 of the top 10 are Bagel-based models or merges of it. If you remember Airoboros, Bagel includes that dataset, and there are two varieties there, the DPO and the non-DPO versions of Bagel, including two merges from Cloudyu, which are non-trained merges with MergeKit, based on Bagel. Jon's pro tip for selecting DPO vs non-DPO models is:</p><p>FYI, the DPO version is more factual, truthful, better at math, etc., but is not great for RP, creative writing, etc. 
Use non-DPO for those tasks!</p><p>Bagel includes an impressive number of datasets mixed together, which are all linked from the model card, but here they are:</p><p>"ai2_arc, airoboros, apps, belebele, bluemoon, boolq, capybara, cinematika, drop, emobank, gutenberg, lmsys_chat_1m, mathinstruct, mmlu, natural_instructions, openbookqa, pippa, piqa, python_alpaca, rosetta_code, slimorca, spider, squad_v2, synthia, winogrande, airoboros 3.1 vs airoboros 2.2.1, helpsteer, orca_dpo_pairs"</p><p>Jon also shared his end of the year WandB report and has trained a whopping 917 models this year for a total of ~2500 hours, and is in the top 10% of active users (among 800K or so users)</p><p>I didn't know that Jon was going to join, but was so happy that he joined the live recording that we ended up chatting for 20 minutes, and there were so many nuggets in that conversation, about how to prepare DPO datasets, which other ones Jon has been releasing, and just a bunch more gold, that I decided to CUT that out and post it as a separate special deep dive episode that's going to get released as the Sunday special. Stay tuned for that!</p><p>Nous Research announces $5.2 million funding seed round as they cross 1.1 million model downloads on the hub</p><p>Congrats to Karan, Emozilla, Teknium, Bowen, Shivani and the rest of the Nous team on this great news! 👏 We expect to hear more from them in the coming year, with a consistent commitment to open source, to keep open sourcing the best models, and with the upcoming Forge news!</p><p>With investors like Balaji, OSS capital, Vipul from Together, Nous completes the $5.2M seed round, and we had Karan (one of the co-founders of Nous) on the pod to chat to us about what they are planning to do with that money and what their continuous commitments to open source are!</p><p>In addition, they just recently passed 1.1 million downloads on the hub with Nous-Hermes-2-34B being their best model! 
🤴</p><p>OpenChat Jan update becomes the leading open source 7B model (X, <a target="_blank" href="https://huggingface.co/openchat/openchat-3.5-0106">Hugging Face</a>)</p><p>This update mainly enhanced training methodology, in-context learning & coding skills, outperforming the last 1210 release on 7 out of 8 benchmarks! It scores <strong>71.3</strong> on HumanEval and 65.8% on MMLU 👏</p><p>The previous version of OpenChat trails just behind OpenHermes on the human evals on the LMsys arena, but both are incredible 7B models.</p><p>Argilla</p><p>- Argilla used their Distilabel tool to build a preference dataset from ratings and critiques of AI response pairs, taking around 3 hours </p><p>- The original dataset assumed the GPT-4/3.5 responses were always best, but Argilla found this was not always the case</p><p>- Their dataset confirmed ~4,000 pairs had the same rating, 7,000 pairs were unchanged, and <strong>~2,000 times the rejected response was preferred</strong>  </p><p>- Improving existing DPO datasets with higher quality pairs is important for model fine-tuning</p><p>- They are releasing an improved version of the popular Orca Pairs DPO dataset from Intel, and a new OpenHermes model outperforming baselines with <strong>54% fewer DPO pairs</strong></p><p>Big CO LLMs + APIs</p><p>OpenAI has a big week, launches GPTs store and team pro accounts (<a target="_blank" href="https://openai.com/blog/introducing-the-gpt-store">Blog</a>)</p><p>Things of note about the store:</p><p>* My GPTs are getting feedback and crossed <strong>10K chats</strong>, was #6 on lifestyle and then disappeared, but has gained 2x more chats in the 24 hours since the store launched!</p><p>* Discoverability is great, trending GPTs are shown clearly, and folks are getting a lot of exposure</p><p>* Copycats already started copying a bunch of the great GPTs, see this example of what happens when you search for Gymstreak, most of the top GPTs are already being copy-catted.</p><p>Team accounts:</p><p>$25/mo per user for annual plans, with a minimum of 2 users</p><p>The biggest confusion was from folks who didn't understand that OpenAI trains on Pro conversations, and there's an option to opt out!</p><p>This weeks Buzz (What I learned with WandB this week)</p><p>Weights and Biases (and ME!) are going to AGI house to lead a Rag vs Finetune hackathon with cool prizes!</p><p>There's still time to RSVP, with incredible guest speakers; this Hackathon is organized together with... LangChain, TogetherCompute and AGI house - If you're in the SF area, and you wanna hack on some cool RAG things and get awesome prizes (and meet me!) join the waitlist here <a target="_blank" href="https://partiful.com/e/AlntdLtxh9Jh1J6Pcsma">https://partiful.com/e/AlntdLtxh9Jh1J6Pcsma</a></p><p>Vision & Video</p><p>Luma released GENIE on Web and iOS. If you remember, we covered the GENIE text-to-3d model they first released on Discord a while ago, and now it's incorporated into the Luma website, and it delivers significantly higher quality 3D assets.</p><p>The generations are free for now, and they look awesome! Here are some of mine, I created a Bee holding a Wand (get it? WandB? 😆) and a Polish bear (internal joke) and they look so cool!</p><p>Friend of the pod and recent LUMA hire <a target="_blank" href="https://twitter.com/daken_">Arthur Islamov</a> jumped on and also told us that this is coming to the API, so developers would be able to automate asset creation and generate tons of 3D objects programmatically, and use cool prompt techniques to make sure they are a bit better every time maybe? 
Great news!</p><p>AI Art & Diffusion</p><p>Bytedance announces MagicVideo-V2 (<a target="_blank" href="https://t.co/nZOlH58Ev5">Arxiv</a>, <a target="_blank" href="https://t.co/nZOlH58Ev5">Project</a>)</p><p>We didn't get anything besides quite a few cherry-picked videos and a paper, so we can't use this yet, but wow, some of these videos look incredible!</p><p>MagicVideo-V2 integrates the text-to-image model, video motion generator, reference image embedding module and frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate an aesthetically pleasing, high-resolution video with remarkable fidelity and smoothness. It demonstrates superior performance over leading Text-to-Video systems such as Runway, Pika 1.0, Morph, Moon Valley and Stable Video Diffusion model via user evaluation at large scale</p><p>Lastly, I had the greatest time interviewing my new friend João Moura, the creator of CrewAI, which has been popping off, was #1 trending on GitHub and #2 product of the day on Product Hunt, and is essentially an AI framework that lets you create a crew of AI agents to do tasks for you. I will be polishing up that conversation and posting it together with the deep dive with Jon, so stay tuned, but here’s a sneak preview of how cool this is, and expect that episode to drop soon!</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-11-gpts-store-mixtral</link><guid isPermaLink="false">substack:post:140604135</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 12 Jan 2024 00:59:08 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/140604135/3f6e9204910efda7efec415e0cfd7f1b.mp3" length="55209121" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4601</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/140604135/4ea912a94ab7d55681c8dd3a82069f2d.jpg"/></item><item><title><![CDATA[📅 ThursdAI Jan 4 - New WizardCoder, Hermes2 on SOLAR, Embedding King? from Microsoft, Alibaba upgrades vision model & more AI news]]></title><description><![CDATA[<p>Here’s a TL;DR and show notes links</p><p>* <strong>Open Source LLMs</strong></p><p>* New WizardCoder 33B V1.1 - 79% on HumanEval (<a target="_blank" href="https://twitter.com/WizardLM_AI/status/1742906065359167730">X</a>, <a target="_blank" href="https://huggingface.co/WizardLM/WizardCoder-33B-V1.1">HF</a>)</p><p>* Tekniums Hermes 2 on SOLAR 10.7B (<a target="_blank" href="https://x.com/Teknium1/status/1742041640775348460?s=20">X</a>, <a target="_blank" href="//huggingface.co/TheBloke/Nous-Hermes-2-SOLAR-10.7B-GGUF">HF</a>)</p><p>* Microsoft - E5 SOTA text embeddings w/ Mistral (<a target="_blank" href="https://twitter.com/osanseviero/status/1742555240363074000">X</a>, <a target="_blank" href="https://huggingface.co/intfloat/e5-mistral-7b-instruct">HF</a>, <a target="_blank" href="https://arxiv.org/abs/2401.00368">Paper</a>, <a target="_blank" href="https://twitter.com/Yampeleg/status/1742640557422268834">Yams Thread</a>)</p><p>* <strong>Big CO LLMs + 
APIs</strong></p><p>* Samsung is about to announce some AI stuff</p><p>* OpenAI GPT store to come next week</p><p>* Perplexity announces a $73.6M Series B round</p><p>* <strong>Vision</strong></p><p>* Alibaba - QWEN-VL PLUS was updated to 14B (<a target="_blank" href="https://x.com/JustinLin610/status/1742184229453320451?s=20">X</a>, <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">Demo</a>)</p><p>* OSU SeeAct - GPT4V as a generalist web agent if grounded (<a target="_blank" href="https://twitter.com/ysu_nlp/status/1742398541660639637">X</a>, <a target="_blank" href="https://arxiv.org/abs/2401.01614">Paper</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Nvidia + Suno release NeMo Parakeet, which beats Whisper on English ASR (<a target="_blank" href="https://twitter.com/HaseoX94/status/1742286314513506307">X</a>, <a target="_blank" href="https://huggingface.co/nvidia/parakeet-rnnt-1.1b">HF</a>, <a target="_blank" href="https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b">DEMO</a>)</p><p>* <strong>Tools & Agents</strong></p><p>* Stanford - Mobile ALOHA bot - Open source cooking robot (<a target="_blank" href="https://t.co/gFcgiuTxrg">Website</a>, <a target="_blank" href="https://twitter.com/DrJimFan/status/1742940245262471346">X thread</a>)</p><p>Open Source LLMs</p><p>WizardCoder 33B reaches a whopping 79% on HumanEval pass@1</p><p>State of the art LLM coding in open source is here: a whopping 79% on HumanEval, with WizardLM fine-tuning DeepSeek Coder to produce the best open source coder, edging closer to GPT4 and passing GeminiPro and GPT3.5 👏 (at least on some benchmarks)</p><p>Teknium releases Hermes 2 on top of SOLAR 10.7B</p><p>I downloaded it with LM Studio and have been running it; it's very capable. 
Right now SOLAR models are still on top of the Hugging Face leaderboard, and Hermes 2 now comes in 7B (Mistral), 10.7B (SOLAR) and 33B (Yi) sizes.</p><p>On the podcast I told a story of how this week I actually used the 33B version of Capybara for a task that GPT kept refusing to help me with. It was honestly kind of strange: a simple request to translate kept failing with an ominous “network error”.</p><p></p><p>This only highlighted how important the local AI movement is, and now I have actually had the experience myself of a local model coming through when a capable hosted one didn’t.</p><p>Microsoft releases a new SOTA text embeddings model, E5, finetuned on synthetic data on top of Mistral 7B</p><p>We present a new, easy way to create high-quality text embeddings. Our method uses synthetic data and requires less than 1,000 training steps, without the need for complex training stages or large, manually collected datasets. By using advanced language models to generate synthetic data in almost 100 languages, we train open-source models with a standard technique. Our experiments show that our method performs well on tough benchmarks using only synthetic data, and it achieves even better results when we mix synthetic and real data.</p><p>We had the great pleasure of having <a target="_blank" href="https://sub.thursdai.news/p/thursdai-oct-26-jina-embeddings-sota">Bo Wang again</a> (one of the authors of the previously SOTA Jina embeddings and a previous podcast guest) to do a deep dive into embeddings and specifically E5 with its decoder-only architecture. 
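The query-side instruction Bo discussed is easy to make concrete. Here is a minimal sketch, assuming the "Instruct: {task}, Query: {query}" template from the E5-Mistral model card (treat the exact format, task string and helper names as illustrative assumptions), together with the back-of-envelope math for why 4096-dimensional vectors raise storage costs compared with a smaller model like bge-large (1024 dimensions):

```python
# Sketch of E5-Mistral-style query formatting: queries carry a one-sentence task
# instruction, while passages/documents are embedded as-is (format per the model
# card; treat the exact template as an assumption).
def detailed_instruct(task: str, query: str) -> str:
    return f"Instruct: {task}\nQuery: {query}"

task = "Given a web search query, retrieve relevant passages that answer the query"
print(detailed_instruct(task, "how do I run a 7B model locally?"))

# Storage cost per vector at float32: dimensions x 4 bytes.
def bytes_per_vector(dims: int, bytes_per_dim: int = 4) -> int:
    return dims * bytes_per_dim

print(bytes_per_vector(4096))  # 16384 bytes (~16 KB) per E5 vector
print(bytes_per_vector(1024))  # 4096 bytes (~4 KB) per bge-large-en-v1.5 vector
```

At a million chunks that difference is roughly 16 GB versus 4 GB of raw float32 vectors before any index overhead, which is the storage concern in a nutshell.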
While the approach Microsoft researchers took here is interesting, and despite E5 claiming a top spot on the MTEB leaderboard (pictured above), this model doesn't seem to be super practical for most purposes folks use embeddings for right now (RAG), for the following reasons:</p><p>* Context length limitation of 32k, with a recommendation not to exceed 4096 tokens.</p><p>* Requires a one-sentence instruction for queries, adding complexity for certain use cases like RAG.</p><p>* Model size is large (14GB), leading to higher costs for production use.</p><p>* Alternative models like bge-large-en-v1.5 are significantly smaller (1.35GB).</p><p>* Embedding size is 4096 dimensions, increasing the cost for vector storage.</p><p>Big CO LLMs + APIs</p><p>OpenAI announces that the GPT store is coming next week!</p><p>I can't wait to put the <a target="_blank" href="https://chatg.pt/artweather">visual weather GPT</a> I created in the store and see how the store promotes it, and if I get some revenue share like OpenAI promised during dev day. My daughter and I are frequent users of <a target="_blank" href="https://chatg.pt/alice">Alice - the kid painter</a> as well, a custom GPT that my daughter named Alice, which knows it's speaking to kids over voice, and generates coloring pages. 
We'll see how much this store lives up to the promises.</p><p>This week's Buzz (What I learned with WandB this week)</p><p>This week was a short one for me, so not a LOT of learnings, but I did start this course from W&B, called Training and Fine-tuning Large Language Models (LLMs).</p><p>It features great speakers like Mark Saroufim from Meta, Jonathan Frankle from Mosaic, and Wei Wei Yang from Microsoft, along with W&B MLEs (and my teammates) Darek Kleczek and Ayush Thakur, and covers the end-to-end of training and fine-tuning LLMs!</p><p>The course is available <a target="_blank" href="http://wandb.me/llm-course">HERE</a>; it's around 4 hours, and well worth your time if you want to get a little more knowledge about the type of stuff we report on ThursdAI.</p><p>Vision</p><p>SeeAct - GPT4V as a generalist web agent if grounded (<a target="_blank" href="https://twitter.com/ysu_nlp/status/1742398541660639637">X</a>, <a target="_blank" href="https://arxiv.org/abs/2401.01614">Paper</a>)</p><p>In June OSU NLP released <a target="_blank" href="https://osu-nlp-group.github.io/Mind2Web/">Mind2Web</a>, a dataset for developing and evaluating web-acting agents (LLMs that click buttons and perform tasks), with 2,350 tasks from over 130 websites: stuff like booking flights, finding folks on Twitter, finding movies on Netflix, etc.</p><p>GPT4 without vision (just reading the website HTML/text) was terrible at this, succeeding at only around 2% of tasks.</p><p>With new vision LMMs, websites are a perfect place to start training, because the visual (how the website is rendered) is now paired with HTML (the grounding), and SeeAct uses GPT4-V to do this.</p><p>SeeAct is a generalist web agent built on LMMs like GPT-4V. 
Specifically, given a task on any website (e.g., “Compare iPhone 15 Pro Max with iPhone 13 Pro Max” on the Apple homepage), the agent first performs <strong>action generation</strong> to produce a textual description of the action at each step towards completing the task (e.g., “Navigate to the iPhone category”), and then performs <strong>action grounding</strong> to identify the corresponding HTML element (e.g., “[button] iPhone”) and operation (e.g., CLICK, TYPE, or SELECT) on the webpage.</p><p>SeeAct achieves a 50% score on the Mind2Web evaluation task!</p><p>QWEN-VL was updated to PLUS (14B) and it's pretty good compared to GPT4V</p><p>Capabilities include: image captioning, visual question answering, visual grounding, OCR, visual reasoning. We had a chat with Junyang Lin, the tech lead for Qwen at Alibaba, on the pod, and he mentioned specifically that they noticed that adding a larger "brain" (as in, LLM) to vision models significantly increases the performance and vision understanding of the LMMs.</p><p>While this model is not yet released, you can demo it <a target="_blank" href="https://huggingface.co/spaces/Qwen/Qwen-VL-Plus">here</a>, and Junyang told us that a release is coming, like the previous QWEN models before it.</p><p>I noticed the advanced OCR capabilities and understanding; this example was really spot on. Notice the "logo for Browser company": the model understood that this text was in fact a logotype! (Which even GPT4V failed at in my test.)</p><p>Voice</p><p>Parakeet from NVIDIA beats Whisper on English with a tiny model (<a target="_blank" href="https://nvidia.github.io/NeMo/blogs/2024/2024-01-parakeet/">blog</a>)</p><p>Brought to you by <a target="_blank" href="https://twitter.com/NVIDIAAI">@NVIDIAAI</a> and <a target="_blank" href="https://twitter.com/suno_ai_">@suno_ai_</a>, Parakeet beats Whisper and reclaims first place. The models are released under a commercially permissive license! 
The models inherit the same FastConformer architecture and come in two flavors: RNNT (1.1B & 0.6B) and CTC (1.1B & 0.5B). Each model is trained on 65K hours of English data (40K hours of private proprietary data from the Suno & NeMo teams) over several hundred epochs. Key features of the Parakeet models: 1. They don't hallucinate (if the audio sample has silence, the output is silent). 2. They are quite robust to noisy audio (if the audio sample has non-vocal sounds, they output silence).</p><p>We had the great pleasure of having VB from the Audio team at HuggingFace, and he went in depth into the ways in which Parakeet is better than Whisper (higher quality transcriptions while also being much, much faster); it was trained on only 65K hours vs a few million for Whisper. We also covered that, because of this different architecture, Parakeet is not able to receive any guidance for words that are hard for it to understand. For example, with Whisper, I often provide <strong>ThursdAI</strong> in the initial_prompt parameter to help guide Whisper on what it should say.</p><p>Regardless, having a model that's super fast, beats Whisper, and is commercially licensed to build on top of is incredible! <a target="_blank" href="https://huggingface.co/spaces/nvidia/parakeet-rnnt-1.1b">Here's a demo </a>for you to try it out, and it's available with the NVIDIA NeMo framework.</p><p>Coqui shuts down :(</p><p>We've had Josh from Coqui on our pod before, when they released XTTS, and they have been friends ever since. 
It's sad to see Coqui shut down, and we want to wish the whole team an easy and great transition 👏 You guys did a great job and we're rooting for each and every one of you.</p><p>* Coqui is closing down.</p><p>* The team is praised for being small yet impactful, competing with big tech despite limited resources.</p><p>* Coqui began as the Machine Learning Group at Mozilla, creating DeepSpeech, Common Voice, and TTS.</p><p>* Spun out as Coqui in 2021 to accelerate their mission.</p><p>* Major achievement: XTTS, with openly released model weights for versions 1 and 2.</p><p>* 2021: Coqui STT v1.0 released, Coqui Model Zoo and SC-GlowTTS launched.</p><p>* 2022: YourTTS went viral, numerous open-source releases, team expansion.</p><p>* 2023: Coqui Studio webapp and API launched, XTTS open release, first customers acquired.</p><p>* Acknowledgment of the community, investors, customers, and partners for their support.</p><p>* Partners include HuggingFace, Mozilla, Masakhane, Harvard, Indiana University, Google, MLCommons, Landing AI, NVIDIA, Intel, and Makerere University.</p><p>* Future of generative AI in 2024 predicted to grow, with open-source playing a significant role.</p><p>* Coqui TTS remains available on Github for further innovation.</p><p>Tools</p><p>Stanford's Mobile ALOHA bot is open sourced, shows off cooking</p><p>Back in March, Stanford folks introduced ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation), basically a 4-arm robot that a human operator can use to perform tasks requiring fine motor skills, like cracking an egg or tying a zip tie. Well now, just 10 months later, they are introducing the Mobile version. 
A mounted ALOHA rig that a human operates to demonstrate tasks like cooking or calling the elevator; it is able to learn from those demonstrations and then perform them. The operating gear can be easily detached for autonomous operation, and it's mobile, so the compute and battery pack sit on the wheeled base.</p><p>Recently Meta released a huge dataset of first person operations called <a target="_blank" href="https://ego-exo4d-data.org/">Ego-Exo 4D</a>, which combines first person and third person perspectives for a big variety of tasks, such as cooking, cleaning, sports, healthcare and rock climbing, and this open hardware from Stanford is an additional example of how fast robotics is advancing into the physical world.</p><p></p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p>And just like that, the first ThursdAI of the year is done! 🫡 Thank you for being a subscriber, see you next week 👏</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jan-4-new-wizardcoder-hermes2</link><guid isPermaLink="false">substack:post:140372916</guid><dc:creator><![CDATA[Alex Volkov, Umesh Rajani, and Nisten]]></dc:creator><pubDate>Fri, 05 Jan 2024 00:48:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/140372916/9f61ff6cd546e996c67d4edf8146e38b.mp3" length="71246779" type="audio/mpeg"/><itunes:author>Alex Volkov, Umesh Rajani, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5937</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/140372916/dea52128c77c828e2c4db3a7d5d507b9.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Dec 28 - a BUNCH of new multimodal OSS, OpenAI getting sued by NYT, and our next year predictions]]></title><description><![CDATA[<p>Hey hey hey (no longer ho ho ho 🎄) hope you had a great Christmas! And you know that many AI folks have dropped tons of OpenSource AI goodies for Christmas; here’s quite a list of new things, including at least 3 new multi-modal models, a dataset, and a paper/technical report from the current top model on the HF leaderboard from Upstage. </p><p>We also had the pleasure of interviewing the folks who released the Robin suite of multi-modals, aligning them to “good responses”, and that full interview is coming to ThursdAI soon, so stay tuned.</p><p>And we had a full 40 minutes with an open stage to get predictions for 2024 in the world of AI, which we fully intend to cover next year, so scroll all the way down to see ours, and reply/comment with yours! 
</p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source LLMs</strong> </p><p>* Uform - tiny(1B) multimodal embeddings and models that can run on device (<a target="_blank" href="https://github.com/unum-cloud/uform">HF</a>, <a target="_blank" href="https://unum-cloud.github.io/uform/">Blog</a>, <a target="_blank" href="https://github.com/unum-cloud/uform?tab=readme-ov-file#encoder">Github</a>, <a target="_blank" href="https://huggingface.co/spaces/aavetis/ugen-image-captioning">Demo</a>)</p><p>* Notux 8x7B - one of the first Mixtral DPO fine-tunes - (<a target="_blank" href="https://huggingface.co/posts/alvarobartt/458568556037122">Thread</a>, <a target="_blank" href="https://huggingface.co/spaces/argilla/notux-chat-ui">Demo</a>)</p><p>* Upstage SOLAR 10.7B technical report (<a target="_blank" href="https://arxiv.org/abs/2312.15166">arXiv</a>, <a target="_blank" href="https://twitter.com/winglian/status/1740081525826167060">X discussion</a>, <a target="_blank" href="https://twitter.com/hunkims/status/1740216908472058142">followup</a>)</p><p>* Capybara dataset open sourced by LDJ (<a target="_blank" href="https://twitter.com/ldjconfirmed/status/1739590400417964396">Thread</a>, <a target="_blank" href="https://huggingface.co/datasets/LDJnr/Capybara">HF</a>)</p><p>* Nous Hermes 34B (finetunes Yi34B) - (<a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">Thread</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">HF</a>)</p><p>* Open Source long context pressure test analysis (<a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/18s61fb/pressuretested_the_most_popular_opensource_llms/">Reddit</a>)</p><p>* Robin - a suite of multi-modal (Vision-Language) models - (<a target="_blank" href="https://x.com/irinarish/status/1738333642244436154?s=20">Thread</a>, <a target="_blank" href="https://sites.google.com/view/irinalab/blog/robin-v1-0">Blogpost</a>, <a 
target="_blank" href="https://huggingface.co/agi-collective">HF</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Apple open sources ML-Ferret multi-modal model with referring and grounding capabilities (<a target="_blank" href="https://github.com/apple/ml-ferret?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://github.com/apple/ml-ferret?tab=readme-ov-file#checkpoints">Weights</a>, <a target="_blank" href="https://arxiv.org/abs/2310.07704">Paper</a>)</p><p>* OpenAI & Microsoft are getting sued by the New York Times for copyright infringement during training (<a target="_blank" href="https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf">Full Suit</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Midjourney v6 alpha is really good at recreating scenes from movies (<a target="_blank" href="https://twitter.com/ProperPrompter/status/1739989870502900109">thread</a>)</p><p>Open Source LLMs </p><p>Open source doesn't stop even during the holiday break! Maybe the holidays are exactly the time to catch up to the big companies? </p><p>This week we got a new <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B">34B Nous Hermes model</a>, the first <a target="_blank" href="https://huggingface.co/posts/alvarobartt/458568556037122">DPO fine-tune of Mixtral</a> and the <a target="_blank" href="https://twitter.com/ldjconfirmed/status/1739590400417964396">Capybara dataset</a>, but by far the biggest news of this week was in multimodality. Apple quietly open sourced ml-ferret, a multimodal grounding model able to compete with even GPT4-V sometimes, Uform released tiny multi-modal and embedding versions for on-device inference, and the AGI collective gave NousHermes 2.5 eyes 👀</p><p>There's no doubt that '24 is going to be the year of multimodality, and this week we saw an early start of that right on ThursdAI. 
</p><p>Ml-Ferret from Apple (<a target="_blank" href="https://github.com/apple/ml-ferret?tab=readme-ov-file">Github</a>, <a target="_blank" href="https://github.com/apple/ml-ferret?tab=readme-ov-file#checkpoints">Weights</a>, <a target="_blank" href="https://arxiv.org/abs/2310.07704">Paper</a>)</p><p>Apple has been in the open source news lately, as we've covered their MLX release previously and the LLM in a flash paper that discusses inference on memory-constrained devices, and Apple folks had one more gift to give. Ml-Ferret is a multimodal grounding model, based on Vicuna (for some... reason?), which is able to take referrals from images (think highlighted or annotated areas) and then ground its responses with exact coordinates and boxes. </p><p>The interesting thing about the referring is that it can be any shape: a bounding box or even an irregular region (like the ferret in the above example or the cat tail below). </p><p>Ferret was trained on a large new dataset called GRIT containing over 1 million examples of referring to and describing image regions (which AFAIK hasn't been open sourced yet).</p><p>According to Ariel Lee (our panelist), these weights are only delta weights and need to be combined with Vicuna weights to be able to run the full Ferret model properly. </p><p>Uform - tiny (1.5B) MLLMs + vision embeddings (<a target="_blank" href="https://github.com/unum-cloud/uform">HF</a>, <a target="_blank" href="https://unum-cloud.github.io/uform/">Blog</a>, <a target="_blank" href="https://github.com/unum-cloud/uform?tab=readme-ov-file#encoder">Github</a>, <a target="_blank" href="https://huggingface.co/spaces/aavetis/ugen-image-captioning">Demo</a>)</p><p>The folks at Unum have released a few gifts for us, with an Apache 2.0 license 👏 Specifically, they released 3 vision embedding models and 2 generative models. 
</p><p>Per the <a target="_blank" href="https://unum-cloud.github.io/uform/#features">documentation</a>, the embeddings can yield 2-3x speedups for search compared to CLIP-like models, and 2-4x inference speed improvements given the tiny size. The embeddings have a multi-lingual version as well, supporting well over 20 languages. </p><p>The generative models can be used for image captioning, and since they are tiny, they are focused on running on device, and are already converted to ONNX and Core ML formats. See the results below compared to LLaVa and InstructBLIP, both in the 7B range.</p><p>I've tried a few images of my own (you can try the <a target="_blank" href="https://huggingface.co/spaces/aavetis/ugen-image-captioning">demo here</a>), and while there were hallucinations, this tiny model showed a surprising amount of understanding for its size! </p><p>Also shoutout to Ash</p><p>Robin suite of multimodal models (<a target="_blank" href="https://x.com/irinarish/status/1738333642244436154?s=20">Thread</a>, <a target="_blank" href="https://sites.google.com/view/irinalab/blog/robin-v1-0">Blogpost</a>, <a target="_blank" href="https://huggingface.co/agi-collective">HF</a>)</p><p>The folks at the CERC-AAI lab in MILA-Quebec have released a suite of multi-modal models: they finetuned and released a fork of NousHermes2.5 that can understand images, building on CLIP and SigLIP as the image encoder. 
</p><p>In fact, we did a full interview with Irina, Kshitij, Alexis and George from the AGI collective; that full interview will be released on ThursdAI soon, so stay tuned, as they had a LOT of knowledge to share, from fine-tuning the CLIP model itself for better results, to evaluation of multimodal models, to dataset curation/evaluation issues, and tips from Irina on how to get a government supercomputer compute grant 😈 </p><p>Big CO LLMs + APIs</p><p>OpenAI is being sued by NYT for copyright infringement during training (<a target="_blank" href="https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf">Lawsuit</a>)</p><p>The New York Times is suing OpenAI and Microsoft for copyright infringement, seeking damages (amount unclear) and removal of NYT data from OpenAI models. The full lawsuit is a worthwhile read, and it includes a whopping 100 pages of examples of GPT4 completing NYT articles verbatim. I personally wasn't able to reproduce this behavior in the chatGPT app, but some folks on X suggested that it's possible in the OpenAI playground, with the right prompt and <a target="_blank" href="https://twitter.com/paul_cal/status/1740468163899519145">NYT URL in the prompt. </a></p><p>This lawsuit came after a round of attempted negotiations between NYT and OpenAI, which apparently failed, and it's worth noting a few things. First, OpenAI (<a target="_blank" href="https://newmedialaw.proskauer.com/2023/12/19/anthropic-joins-the-party-offers-copyright-shield-to-enterprise-ai-customers/">along with almost every other AI company</a>) has a "Copyright shield" feature, where they protect the users of these services from getting sued for copyright violations, so there is no direct exposure for customers of OpenAI. Another thing of note: the NYT information was compiled not by OpenAI directly; rather, OpenAI (like almost every other LLM maker) used the CommonCrawl dataset (among others), which did the crawling and collection of text itself. 
</p><p>Per the CommonCrawl license, OpenAI should have reached out to the owner of each individual URL in that dataset and worked out the copyright on their own, which is a bit difficult to do, as CommonCrawl includes 3-5 billion pages collected each month. </p><p>Regardless of the claims, the hottest takes I saw in regards to this are that by the time anything moves with this lawsuit, we will be on GPT-6 or so and it won't matter by then, or that OpenAI will have to retrain a model without NYT data, which I personally find quite ludicrous and very unlikely to happen. </p><p>If this lawsuit actually sets a precedent, this will IMO be a very bad one for the US, considering other countries like Japan are already getting ahead of this, declaring all scraped data as fair use if used for training (<a target="_blank" href="https://petapixel.com/2023/06/05/japan-declares-ai-training-data-fair-game-and-will-not-enforce-copyright/">source</a>). </p><p>Of course, all of X became IP experts overnight, and the debates are very interesting: some are confusing technical terms, some are claiming that OpenAI will just cave and pay NYT, while some super libertarian ones take it all the way down to whether AI has human rights, and if it does, whether preventing it from learning from copyrighted material is like preventing people from reading Hemingway. </p><p>This week's buzz (What I learned in WandB this week)</p><p>This week, we sent out our annual emails of wrapped cards for everyone who used Weights & Biases to train models this year. This is a yearly tradition, similar to Spotify Wrapped, but for ML purposes, and this year the cards were generated with Stable Diffusion XL, generating hundreds of thousands of images based on autogenerated model run names! 
</p><p>Another interesting thing I noticed is just how many folks shared their stats screenshots right from the email we sent, including not only how many hours they spent training models this year, but also how many other features they used, like reports and sweeps. And I noticed just how many folks don't use reports, which is a shame, as it's such a cool feature! WandB literally has a built-in blogging platform for all your ML needs, and it includes live widgets of every metric you're tracking in your runs; it's really great. </p><p>AI Art & Diffusion</p><p>Midjourney v6 is incredible at recreating actual movie stills and scenes (<a target="_blank" href="https://twitter.com/ProperPrompter/status/1739989870502900109">Thread</a>)</p><p></p><p>Another potential lawsuit waiting to happen? We already saw lawsuits against StabilityAI for supposed copyright infringement, and Stability did a lot of work to exclude proprietary art from their training datasets; however, the new incredible version of Midjourney shows just… mind-blowing accuracy in recreating scenes from movies, and cartoon styles. Just look at some of these examples (collected by some folks on X). </p><p>This, plus the above lawsuit news coming for OpenAI & Microsoft from the New York Times, is setting up '24 to be the year where copyright law and AI finally meet for real. And we'll keep reporting on the outcomes. </p><p>Predictions for '24 </p><p>In the last 20 minutes of the pod recording we opened up the floor to folks giving us their predictions for AI developments in the year 2024, and I also asked this question on X itself. The idea was to come back next year during our yearly summary and see which predictions we hit, and which predictions we were not even remotely thinking about! 
</p><p>Here's a list of predictions with their category (Thanks to AI for helping me sort these from different sources and transcripts) </p><p>* <strong>Open Source LMs</strong></p><p>* 1GB models with Mixtral performance levels - Nisten</p><p>* Continual pretraining and building on top of each other's work - Irina Rish</p><p>* Smaller models trained on more data - Irina Rish</p><p>* Consolidation and standardization of models - Irina Rish</p><p>* Agents running on 7B models with capabilities like web search and code interpretation - Shroominic</p><p>* End of dominance of transformer architecture - Far El</p><p>* Marriage of reinforcement learning and language models - Far El</p><p>* New benchmarking standards - Far El</p><p>* Plug and play weights for expertise - Umesh</p><p>* Self-improving pipeline framework - Umesh</p><p>* <strong>Big Companies/APIs</strong></p><p>* Mistral to become a major player, surpassing companies like Anthropic - Alex Volkov</p><p>* Apple AI device with multimodal capabilities - Umesh</p><p>* Google Gemini Pro commoditizing APIs - Umesh</p><p>* Model that can ace undergrad computer science curriculum - George Adams</p><p>* Extremely good but expensive model (~$1 per response) - Shroominic</p><p>* Apple spatial computing + AI product innovation - John Doe</p><p>* Real-time multilingual translation app/device - Umesh</p><p>* <strong>Vision/Video</strong></p><p>* AI-generated full length feature film - Umesh</p><p>* Artist AI model galleries for art generation - Umesh</p><p>* Real-time video understanding and multimodal models - Alex Volkov</p><p>* Public release of high quality, fast voice cloning tech - Alex Volkov</p><p>* 3D model/animation generation for video games - tobi</p><p>* Meta will outperform most companies in video AI and mixed reality - Alex Volkov</p><p>* <strong>Other</strong></p><p>* Localized national AI models - Ravi</p><p>* Rise in use of deepfakes - Ravi</p><p>* Surge in metadata embedding for ownership identification 
- <a target="_blank" href="R.AI">R.AI</a>.S.E</p><p>* Advances in AI for biology/healthcare - Ravi, Ash Vardanian</p><p>* A model capable of completing an undergrad CS curriculum at an A level by the end of the year - George Adams</p><p>* AI device, fully capable of multimodal capabilities, from Apple - Educated Guess</p><p>* Development in domain-specific LMs for bio applications, especially in synthetic biology - Ravi</p><p>* <strong>Twitter Prediction</strong></p><p>* CodeInterpreterAPI V2 - Shroominic</p><p>* Gemini will NOT outperform ChatGPT - Alex Northstar</p><p>* Tech slowdown in mass adoption, human creativity as bottleneck - “charles harben”</p><p>* Biology and Robots - Sinan</p><p>* Code LLMs near junior developer productivity - Karthik Kannan</p><p>* Tokenizers will work - Geronimo</p><p>* LLM curve plateaus, focus on refining and multimodal, OpenAI settles with NYT - hokiepoke</p><p>* Fully generated, rigged, voiced game characters, minimal human intervention - Rudzinski Maciej</p><p>* AI affects politics - 𝕄𝕏𝕊ℍℝ🤖</p><p>* Audio reaches DallE3 level, video and 3D advancements, new cool modality - Darth thromBOOzyt</p><p>* Synthetic data will be huge - Leo Tronchon</p><p>Ok, now that our predictions are here, we'll come back next year and see who hit which predictions! </p><p>If you have predictions of your own, please reply to this email/substack and post them here as well, so we'll have a record 🫡 </p><p>With that, I want to wish you a happy new year, and as always, see you here next week 👏</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/dec-28</link><guid isPermaLink="false">substack:post:140159150</guid><dc:creator><![CDATA[Alex Volkov, Umesh Rajani, Nisten, Ariel N. 
Lee, and Irina Rish]]></dc:creator><pubDate>Fri, 29 Dec 2023 00:41:19 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/140159150/c48e9eb553203fa35604e8f99eb429cc.mp3" length="90105354" type="audio/mpeg"/><itunes:author>Alex Volkov, Umesh Rajani, Nisten, Ariel N. Lee, and Irina Rish</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5631</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/140159150/97a40eb62299ded0335492d9feed91b9.jpg"/></item><item><title><![CDATA[🎄ThursdAI - LAION down, OpenChat beats GPT3.5, Apple is showing where it's going, Midjourney v6 is here & Suno can make music! ]]></title><description><![CDATA[<p>Hey everyone, happy ThursdAI!</p><p>As always, here's a list of things we covered this week, including show notes and links, to prepare you for the holidays. </p><p><strong>TL;DR of all topics covered:</strong> </p><p>* <strong>Open Source</strong> <strong>AI</strong></p><p>* OpenChat-3.5-1210 - a top performing open source 7B model from OpenChat team beating GPT3.5 and Grok (<a target="_blank" href="https://twitter.com/openchatdev/status/1736840031266918616">link</a>, <a target="_blank" href="https://huggingface.co/openchat/openchat-3.5-1210">HF</a>, <a target="_blank" href="https://api.together.xyz/playground/chat/openchat/openchat-3.5-1210">Demo</a>)</p><p>* LAION 5B dataset taken down due to CSAM allegations from Stanford (<a target="_blank" href="https://cyber.fsi.stanford.edu/news/investigation-finds-ai-image-generation-models-trained-child-abuse">link</a>, <a target="_blank" href="https://stacks.stanford.edu/file/druid:kh752sm9123/ml_training_data_csam_report-2023-12-21.pdf">full report pdf</a>) </p><p>* FLASK - New evaluation framework from KAIST - based on skillset (<a target="_blank" href="https://x.com/SeonghyeonYe/status/1682209670302408705?s=20">link</a>)</p><p>* Shows a larger difference between open/closed source</p><p>* Open leaderboard reliability issues, 
vibes benchmarks and more</p><p>* HF releases a bunch of MLX ready models (LLama, Phi, Mistral, Mixtral) (<a target="_blank" href="https://x.com/awnihannun/status/1737510739987120248?s=20">link</a>)</p><p>* New transformer-alternative architectures - Hyena & Mamba are heating up (<a target="_blank" href="https://twitter.com/natolambert/status/1737495286778331486">link</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Apple - LLM in a Flash paper is making the rounds (<a target="_blank" href="https://twitter.com/_akhaliq/status/1737300118070534468">AK</a>, <a target="_blank" href="https://twitter.com/atiorh/status/1737912777153609918">Takeaways thread</a>)</p><p>* Anthropic adheres to the messages API format (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1737154202034671838">X</a>)</p><p>* Microsoft Copilot finally has plugins (<a target="_blank" href="https://twitter.com/Microsoft365/status/1737231388032577727">X</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* AI music generation platform Suno is now part of Microsoft Copilot plugins and creates long, beautiful songs (<a target="_blank" href="https://x.com/rowancheung/status/1737488785997287657?s=20">link</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Midjourney v6 is out - better text, great at following instructions (<a target="_blank" href="https://x.com/skirano/status/1737854626995859837?s=20">link</a>)</p><p>Open Source AI</p><p>We start today with a topic I didn't expect to be covering: the LAION 5B dataset was taken down after a report from the Stanford Internet Observatory found instances of CSAM (Child Sexual Abuse Material) in the vast dataset. The <a target="_blank" href="https://stacks.stanford.edu/file/druid:kh752sm9123/ml_training_data_csam_report-2023-12-21.pdf">report</a> identified hundreds to thousands of such images, using Microsoft's PhotoDNA to match image hashes, starting from a sample of NSFW-marked images. 
</p><p>LAION 5B was used to train Stable Diffusion: versions 1.4 and 1.5 were trained on a lot of images from that dataset, while SD2, for example, was only trained on images not marked as NSFW. The report is very thorough, walking through the methodology used to find and verify those types of images. Worth noting that LAION 5B itself is not an image dataset; it only contains links to images and their descriptions from alt tags. </p><p>Obviously this is a very touchy topic, given the way this dataset was scraped from the web and how many image models were trained on it. The report doesn't allege any concrete influence on the models trained on the dataset, and outlines a few methods for preventing issues like this in the future. One unfortunate outcome of such a discovery is that this type of work can only be done on open datasets like LAION 5B, while closed-source datasets never get nearly this level of scrutiny. This can slow down the advancement of open source multi-modal models, while closed source models will continue having these issues and still prevail. </p><p>The report says they found and validated between several hundred and a few thousand instances of verified CSAM imagery, which, considering the size of the dataset, is infinitesimally small; still, it shouldn't exist at all, and better techniques for cleaning these scraped datasets should exist. The dataset was taken down for now from HuggingFace and other places. 
</p><p>New version of a 7B model that beats chatGPT, from the OpenChat collective (<a target="_blank" href="https://twitter.com/openchatdev/status/1736840031266918616">link</a>, <a target="_blank" href="https://huggingface.co/openchat/openchat-3.5-1210">HF</a>, <a target="_blank" href="https://api.together.xyz/playground/chat/openchat/openchat-3.5-1210">Demo</a>)</p><p>Friend of the pod <a target="_blank" href="https://twitter.com/AlpayAriyak">Alpay Ariyak</a> and team released OpenChat 7B (1210), a new version of one of the top models in the 7B world, which scores competitively against chatGPT 3.5 and Grok, with very high benchmark results (63.4% on HumanEval, compared to GPT3.5's 64%) </p><p>Scrutiny of open source benchmarks and leaderboards being gamed</p><p>We've covered state of the art models on ThursdAI, and every time we did, we covered the benchmarks and evaluation scores, whether that's the popular MMLU (Massive Multitask Language Understanding) or HumanEval (Python coding questions), and almost always we've referred to the <a target="_blank" href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">HuggingFace Open LLM leaderboard</a> for the latest and greatest models. This week there was a long thread on the HuggingFace forums, which HF eventually had to shut down, alleging that a new contender for the top spot used something called UNA to beat the benchmarks without revealing its methods, and folks are suggesting that it must be a gaming of the system, as a model that's trained on the benchmarks can easily top the charts. 
</p><p>This adds to recent observations from friend of the pod Bo Wang from Jina AI that the BGE folks have stopped focusing on the <a target="_blank" href="https://huggingface.co/spaces/mteb/leaderboard">MTEB</a> (Massive Text Embedding Benchmark) leaderboard as well, as it also seems to be gamed (<a target="_blank" href="https://x.com/bo_wangbo/status/1737741665799209228?s=20">link</a>)</p><p>This kicked off a storm of discussion about different benchmarks and evaluations, our ability to measure whether or not we're advancing, and the openness of these benchmarks. Even Andrej Karpathy chimed in, saying that the only way to know is to read the r/LocalLlama comment section (i.e. vibes-based eval) and check the Elo score on the <a target="_blank" href="https://arena.lmsys.org/">LMSys chatbot arena</a>, which pits 2 random LLMs against each other behind the scenes and lets users choose the best answer. </p><p>LMSys also has a leaderboard; it only includes models they have explicitly added to their Arena, and merges 3 different scores: the Elo score from human raters, the MT-Bench score, and MMLU. </p><p>The latest ranking shows that Mixtral is the highest ranking open source model at this point, and that a few other Apache 2.0 models like OpenChat (the previous version; the one from today should score even higher) and OpenHermes are inching closer as well, earning honorable mentions given their license and size! </p><p>And with HuggingFace's model lineage features, where you can trace finetunes back to the models they were fine-tuned on, the leaderboards are still a good place to check out, just... self-evaluation and running models on your own tasks is always a good idea! 
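The arena's Elo ranking is simple enough to sketch. Here is a minimal illustration assuming a standard chess-style Elo update with a K-factor of 32 — note this is an assumption for illustration; LMSys's actual leaderboard fits ratings over the whole battle history rather than updating sequentially like this:

```python
def elo_expected(r_a, r_b):
    """Expected score of model A against model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return updated ratings after one blind head-to-head comparison.

    score_a is 1.0 if A's answer was chosen, 0.0 if B's was, 0.5 for a tie.
    The K-factor (32 here) controls how much a single vote moves the rating.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at 1000; model A wins one user vote
a, b = elo_update(1000, 1000, 1.0)
print(round(a), round(b))  # 1016 984
```

The nice property for an arena like this is that upsets matter: a low-rated model beating a high-rated one gains many points, while the favorite winning gains almost none, so ratings converge toward each model's true relative strength as votes accumulate.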
Also a good idea: additional benchmarks, like the one proposed by KAIST this week called <a target="_blank" href="https://twitter.com/SeonghyeonYe/status/1682209670302408705">FLASK</a>, which shows quite a significant distance between closed source and open source models across several skills. </p><p>This week's Buzz (What I learned this week in Weights & Biases)</p><p>This week we kicked off a build week internally, which unfortunately I wasn’t able to be a super active participant in, due to lying on my couch with a fever for most of the week. Regardless, I noticed how important it is to have these build weeks/hack weeks from time to time, to actually use some of the new techniques we often talk about, like <a target="_blank" href="https://jxnl.github.io/instructor/blog/2023/11/05/chain-of-density/#original-prompt">chain-of-density prompting</a> or agent fine-tunes. I also got paired with my colleague Anish on our project, and while we worked on it (to be revealed later), he gave a kick-ass webinar on the famous deeplearning.ai platform on the topic of enhancing performance for LLM agents in automation, which more than 5K folks tuned into! Anish is a wealth of knowledge, so check it out if this topic interests you 👏</p><p>Big CO LLMs + APIs</p><p>Apple - LLM in a Flash + MLX stuff</p><p>Apple has been in the AI news more and more lately, having recently released the MLX framework for running models directly on Apple silicon devices without a lot of dependencies, which was always possible, but not optimized. 
This got many folks to start converting models to an MLX compatible format, and there's now even a new tag on HF for those converted models. </p><p>But the main news this week doesn't stop there: folks from Apple also released the LLM in a Flash paper, which shows advances in running LLMs in hardware-restricted environments like smartphones, where memory is limited. It shows interesting promise, and, combined with the MLX work, a glimpse that Apple is likely moving towards on-device or partially on-device inference at some point. </p><p>Anthropic moves towards a messages API </p><p>Anthropic Claude finally gives us some DX and introduces a messages API similar to OpenAI's. </p><p>Voice</p><p>Microsoft Copilot now has plugins and can create songs! </p><p>Microsoft Copilot (FKA Bing Chat) now has plugins (probably not new this week, but we haven't yet reported on it), and one of the coolest is SUNO, an audio generation platform that has been around for a while. It's now super easy to create whole songs directly from the Microsoft Copilot interface! </p><p>Here’s my 1-shot attempt at creating a holiday jingle for ThursdAI, it’s not good, but it’s fun 😂 </p><p>And I’ve seen some quite decent examples, like <a target="_blank" href="https://x.com/karpathy/status/1737518588159041845?s=20">return to monkey</a> </p><p>AI Art & Diffusion</p><p>Midjourney v6 looks stunning and follows prompts very well</p><p>Midjourney finally dropped their version 6, and it looks really, really good. Notably, it's likely the highest quality/fidelity diffusion model out there that we can use, has better support for text, and follows prompts closely. 
DALL-E is still very impressive for folks, given that iterating via the chatGPT interface is very easy and convenient, but still, just look at some of these MJv6 generations 😻 </p><p>Nick gave it a very detailed prompt with 8 specific color assignments, and besides the image looking insane, MJ nailed the super complex prompt! </p><p> 35mm film still, two-shot of a 50 year old black man with a grey beard wearing a brown jacket and red scarf standing next to a 20 year old white woman wearing a navy blue and cream houndstooth coat and black knit beanie. They are walking down the middle of the street at midnight, illuminated by the soft orange glow of the street lights --ar 7:5 --style raw --v 6.0</p><p>And just for fun, here’s a comparison of all previous versions of MJ for the same prompt, just to… feel the progress 🔥</p><p>Thanks for reading all the way through. I think I got more than I bargained for during NeurIPS, and I came back with a fever and was debating whether to even record/send this week's newsletter, but now that I’m at the end of it, I’m happy that I did! Though, if you listen to the full recording, you may hear me struggling to breathe a bit 😅 </p><p>So I’ll go rest up before the holidays, wishing you a merry Christmas if you celebrate it 🎄 See you next week 🫡 </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-laion-down-openchat-beats</link><guid isPermaLink="false">substack:post:139996808</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 22 Dec 2023 00:45:59 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/139996808/37a7e3f48608a7d9a010d817eaa7a8db.mp3" length="58751474" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4896</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/139996808/57b054265c05a1170fa504d6b4516195.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Live @ NeurIPS, Mixtral, GeminiPro, Phi2.0, StripedHyena, Upstage 10B SoTA & more AI news from last (insane) week]]></title><description><![CDATA[<p>Wow what a week. I think I’ve reached a level where I’m not fazed by incredible weeks or days that happen in AI, but I… guess I still have much to learn! 
</p><p>TL;DR of everything we covered (aka Show Notes) </p><p>* <strong>Open Source LLMs</strong> </p><p>* Mixtral MoE - 8X7B experts dropped with a magnet link again (<a target="_blank" href="https://x.com/GuillaumeLample/status/1734216541099507929?s=20">Announcement</a>, <a target="_blank" href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1">HF</a>, <a target="_blank" href="https://x.com/AravSrinivas/status/1734603265801613670?s=20">Try it</a>)</p><p>* Mistral 0.2 instruct (<a target="_blank" href="https://twitter.com/osanseviero/status/1734315723709731055">Announcement</a>, <a target="_blank" href="https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2">HF</a>)</p><p>* Upstage Solar 10B - Tops the HF leaderboards (<a target="_blank" href="https://en.upstage.ai/newsroom/solar10b-huggingface-no1">Announcement</a>)</p><p>* Together - StripedHyena architecture and new models (<a target="_blank" href="https://x.com/togethercompute/status/1733213267185762411?s=20">Announcement</a>)</p><p>* EAGLE - a new decoding method for LLMs (<a target="_blank" href="https://twitter.com/hongyangzh/status/1733169111625064833">Announcement</a>, <a target="_blank" href="https://github.com/SafeAILab/EAGLE">Github</a>)</p><p>* <a target="_blank" href="http://Deci.ai">Deci.ai</a> - new SOTA 7B model</p><p>* Phi 2.0 weights are finally available from Microsoft (<a target="_blank" href="https://x.com/SebastienBubeck/status/1735050282210615431?s=20">HF</a>)</p><p>* QuIP - LLM quantization & compression (<a target="_blank" href="https://x.com/tsengalb99/status/1733222467953422702?s=20">link</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Gemini Pro access over API (<a target="_blank" href="https://twitter.com/sundarpichai/status/1734952757722001626">Announcement</a>, <a target="_blank" href="https://twitter.com/abacaj/status/1734965635262669174">Thread</a>)</p><p>* Uses per-character pricing, not per-token</p><p>* Mistral releases API inference server - La Plateforme (<a 
target="_blank" href="https://docs.mistral.ai/api/">API docs</a>)</p><p>* Together undercuts Mistral by 70% on serving Mixtral and announces OAI <a target="_blank" href="https://x.com/togethercompute/status/1734680608855728541?s=20">compatible API</a></p><p>* OpenAI is open sourcing again - Releasing Weak-to-strong generalization <a target="_blank" href="https://openai.com/research/weak-to-strong-generalization">paper</a> and github! (<a target="_blank" href="https://twitter.com/OpenAI/status/1735349718765715913">announcement</a>)</p><p>* <strong>Vision</strong></p><p>* Gemini Pro API has vision <strong>AND video</strong> capabilities (<a target="_blank" href="https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini">API docs</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Stability announces Zero123 - Zero Shot image to 3d model (<a target="_blank" href="https://x.com/StabilityAI/status/1735009826814513342?s=20">Thread</a>)</p><p>* Imagen 2 from Google (<a target="_blank" href="https://twitter.com/GoogleDeepMind/status/1734954295655534780">link</a>)</p><p>* <strong>Tools & Other</strong></p><p>* Optimus from Tesla is coming, and it looks incredible</p><p>This week started on Friday, as we saw one of the crazier single days in the history of OSS AI that I can remember, and I’ve been doing this now for... jesus, 9 months! 
</p><p>In a <strong>single day</strong>, we saw a <a target="_blank" href="https://twitter.com/GuillaumeLample/status/1734216541099507929">new Mistral model</a> release called <strong>Mixtral</strong>, which is a Mixture of Experts (like GPT4 is rumored to be) of 8x7B Mistrals, and beats GPT3.5; we saw a <strong>completely </strong><a target="_blank" href="https://twitter.com/togethercompute/status/1733213267185762411"><strong>new architecture</strong></a><strong> that competes with Transformers</strong> called Hyena, from Tri Dao and Together.xyz, + 2 new models trained with that architecture; we saw a <strong>new SoTA 2-bit quantization method </strong>called<strong> </strong><a target="_blank" href="https://cornell-relaxml.github.io/quip-sharp/"><strong>QuIP</strong></a><strong> </strong>from Cornell, AND a new <a target="_blank" href="https://twitter.com/hongyangzh/status/1733169111625064833">3x faster decoding method</a> for showing tokens to users after an LLM has done its “thinking”. </p><p>And the best thing? All those advancements are stackable! What a day! </p><p>Then I went to NeurIPS2023 (which is where I am right now, writing these words!), which I cover at length in the second part of the podcast, but figured I’d write about it here as well, since it was such a crazy experience. </p><p>NeurIPS is the biggest AI/ML conference; I think they estimated 15K people from all over the world attending! Of course this brings many companies to sponsor, set up booths, give out swag and try to recruit! </p><p>Of course with my new position at Weights & Biases I had to come as well and experience this for myself!</p><p>Many of the attendees are customers of ours, and I was not expecting this amount of love: just an incredible stream of people coming up to the booth and saying how much they love the product! 
</p><p>So I manned the booth, did interviews and live streams, and connected with a LOT of folks, and I gotta say, this whole NeurIPS thing is quite incredible for the ability to meet people! </p><p>I hung out with folks from Google, Meta, Microsoft, Apple, Weights & Biases, Stability, Mistral, HuggingFace, and PhD students and candidates from most of the top universities in the world, from KAIST to MIT and Stanford, Oslo and Shanghai; it's really a worldwide endeavor!</p><p>I also got to meet many of the leading figures in AI, all of whom I got to come up to and say hi, shake their hand, introduce myself (and ThursdAI), and chat about what they or their team released and presented at the conference! Truly an unforgettable experience!</p><p>Of course, this week's Buzz is that everyone here loves W&B, from the PhD students to literally every big LLM lab! They all came up to us (yes yes, even researchers at Google, who kinda low-key hate their internal tooling) and told us how awesome the experience was! (besides the xAI folks, Jimmy wasn’t that impressed haha) And of course I got to practice the pitch so many times, since I manned the W&B booth! </p><p>Please do listen to the above podcast, there’s so much detail in there that doesn’t make it into the newsletter, as it’s impossible to cover it all, but it was a really fun conversation, including my excited depiction of this week's NOLA escapades! </p><p>I think I’ll end here, cause I can go on and on about the parties (there were literally 7 at the same time last night: Google, Stability, OpenAI, Runway, and I’m sure there were a few more I wasn’t invited to!) and about New Orleans food (it’s my first time here, I ate a soft shell deep fried crab and turtle soup!) and I still have the poster sessions to go to, and workshops! 
I will report more on my X account and the Weights & Biases X account, so stay tuned for that there, and as always, thanks for tuning in, reading and sharing ThursdAI with your friends 🫡 </p><p>P.S - Still can’t really believe I get to do this full time now and share this journey with all of you, bringing you all with me to SF, and now NeurIPS, and tons of other places and events in the future! </p><p>— Alex Volkov, AI Evangelist @ Weights & Biases, Host of ThursdAI 🫡</p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-live-neurips-mixtral-geminipro</link><guid isPermaLink="false">substack:post:139791916</guid><dc:creator><![CDATA[Alex Volkov and Umesh Rajani]]></dc:creator><pubDate>Thu, 14 Dec 2023 23:24:53 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/139791916/829ab8cd80c1f695528d8ae79eead521.mp3" length="105047534" type="audio/mpeg"/><itunes:author>Alex Volkov and Umesh Rajani</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6565</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/139791916/3a77cbfd294b5dbbb8588a2cf965ffdb.jpg"/></item><item><title><![CDATA[🌉 ThursdAI Dec 7th - Gemini is out-ish, Grok is out, OSS AI Event in SF, Waymo rides, and more AI news from the past week 👏]]></title><description><![CDATA[<p>ThursdAI December 7th TL;DR</p><p>Greetings of the day everyone (as our panelist Akshay likes to sometimes say) and Happy first candle of Hanukkah for those who celebrate! 
🕎 </p><p>I'm writing this newsletter from the back of a Waymo self-driving car in SF, as I'm here for just a few nights (again) to participate in the Open Source AI meetup, which was co-organized by Ollama, Nous Research and Alignment Labs, and hosted by A16Z in their SF office. </p><p>This event was the highlight of this trip; it was quite a packed meetup in terms of AI talent, and I got to meet quite a few ThursdAI listeners, mutuals on X, and AI celebs </p><p>We also recorded the podcast this week from the arena, thanks to Swyx and Alessio from the latentspace pod for hosting ThursdAI this week from their newly built out pod studio (and apologies everyone for the rocky start and the cutting-out issues; luckily we had local recordings so the pod version sounds good!) </p><p>Google finally teases Gemini Ultra (and gives us Pro)</p><p>What a week folks, what a week. As I was boarding the flight to SF to meet with open source folks, Google announced (finally!) the release of Gemini, their long rumored, highly performant model, with a LOT of fanfare! </p><p>Blogposts authored by Sundar and Demis Hassabis, beautiful demos of capabilities never seen before, comparisons to GPT-4V, which the Ultra version of Gemini outperforms on several benchmarks, rumors that Sergey Brin, the guy whose net worth is north of $100Bn, is listed as a core contributor on the paper, and reports on benchmarks (somewhat skewed) showing Ultra beating GPT-4 on many coding and reasoning evaluations! </p><p>We've been waiting for Gemini for such a long time that we spent the first hour of the podcast basically discussing it and its implications. 
We were also fairly disillusioned by the sleight-of-hand tricks Google's marketing department played with the initial launch video, which purportedly shows Gemini as a fully multi-modal AI that reacts to a camera feed + user voice in real time, when in fact it quickly became clear (from their developer blog) that it was not video+audio but rather images+text (the same two modalities we already have in GPT-4V), and given some prompting, it's quite easy to replicate most of it. We also discussed how we, again, got a tease, and not even a waitlist, for the "super cool" stuff, while getting a GPT3.5-level model today in the Bard upgrade. </p><p>To me, the most mind-blowing demo video was actually one of the other ones in the announcement, which showed that Gemini has agentic behavior in understanding user intent: it asks for clarifications, <a target="_blank" href="https://twitter.com/GoogleDeepMind/status/1732447645057061279">creates a PRD</a> (Product Requirement Document) for itself, and then generates Flutter code to create a UI on the fly, based on what the user asked it! This is pretty wild, as we all should expect that Just In Time UI will come to many of these big models! </p><p>Tune in to the episode if you want to hear more takes, opinions and frustrations, as none of us actually got to use Gemini Ultra, and the experience with Gemini Pro (which is now live on Bard) was, at least for me, underwhelming</p><p>This week's buzz (What I learned in Weights & Biases this week) </p><p>I actually had a blast talking about W&B with many folks in the open source and fine-tuning community this week and last. I already learned that W&B doesn't only help huge companies (like OpenAI, Anthropic, Meta, Mistral and tons more) train their foundational models, but is widely used by the open source fine-tuning community as well. 
I've met with folks like Wing Lian (aka Caseus), maintainer of Axolotl, who uses W&B together with Axolotl, and got to geek out about W&B; met with Teknium and LDJ (Nous Research, Alignment Labs), and in fact got LDJ to walk me through some of the ways he uses and has used W&B in the past, including how it's used to track model runs, show artifacts in the middle of runs, and run mini-benchmarks and evaluations for LLMs as they finetune. </p><p>If you're interested in this, here's an episode of a new “series” of me learning publicly (from scratch), so if you want to learn from scratch with me, feel free to check it out: </p><p>Open Source AI in SF meetup</p><p>This meetup was the reason I flew in to SF; I was invited by dear friends in the open source community, and couldn't miss it! There was such a talent density there, it was quite remarkable. Andrej Karpathy, whose video about LLMs I just finished re-watching, Jeremy Howard, folks from Mistral, A16Z, and tons of other startups, open source collectives, and enthusiasts, all came together to listen to a few lightning talks, but mostly to mingle and connect and share ideas.</p><p>Nous Research announced that they are a company (no longer just a Discord collective of ragtag open sourcers!) and that they are working on Forge, a product offering of theirs that runs local AI, has a platform for agent behavior, and is very interesting to keep an eye on. </p><p>I've spent most of my time going around, hearing what folks are using (hint: a LOT of Axolotl), what they are finetuning (mostly Mistral) and what the future holds (everyone's waiting for the next Llama or the next Mistral). Funnily enough, there was not a LOT of conversation about Gemini there at all, at least not among the folks that I talked to! </p><p>Overall this was really really fun, and of course, being in SF, at least for me, especially now as an AI Evangelist, feels like coming home! So expect more trip reports! 
</p><p>Here's a recap and a few more things that happened this week in AI: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Apple released MLX - machine learning framework on apple silicon</p><p>* Mamba - transformers alternative architecture from Tri Dao</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Google Gemini beats GPT-4V on a BUNCH of metrics, shows cool fake multimodal demo</p><p>* Demo was embellished per the google developer blog</p><p>* Multimodal capabilities are real</p><p>* Dense model vs MOE</p><p>* Multimodal on the output as well</p><p>* For 5-shot, GPT-4 outperforms Gemini Ultra on MMLU </p><p>* <strong>AlphaCode</strong> <strong>2</strong> is here, and Google claims it performs better than 85% of competitive programmers in the world, and even better when collaborating with a competitive programmer.</p><p>* <strong>Long context prompting for Claude 2 shows 27% - 98% increase by using prompt techniques</strong></p><p>* <a target="_blank" href="http://X.ai">X.ai</a> finally released Grok to many Premium+ X subscribers. (<a target="_blank" href="https://x.com/nathanwchan/status/1733012293536174194?s=20">link</a>)</p><p>* <strong>Vision</strong></p><p>* OpenHermes Vision finally released - something was not right there, back to the drawing board</p><p>* <strong>Voice</strong></p><p>* Apparently Gemini beats Whisper v3! As part of a unified model no less</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Meta releases a standalone Emu AI art generator website: https://imagine.meta.com</p><p>* <strong>Tools</strong></p><p>* JetBrains finally releases their own AI-native companion + subscription</p><p>That's it for me this week. This Waymo ride took extra long, as it seems that in SF, during night rush hour, AI is at a disadvantage against human drivers. Maybe I'll take an Uber next time. 
</p><p>P.S - here’s Grok roasting ThursdAI </p><p></p><p>See you next week, and if you've scrolled all the way here for the emoji of the week, it's hidden in the middle of the article, send me that to let me know you read through 😉 </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-dec-7th-gemini-is-out-ish</link><guid isPermaLink="false">substack:post:139607889</guid><dc:creator><![CDATA[Alex Volkov and Latent.Space]]></dc:creator><pubDate>Fri, 08 Dec 2023 07:35:07 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/139607889/9f03ff5f9ac14785a5ecdb785e9fbc87.mp3" length="107158167" type="audio/mpeg"/><itunes:author>Alex Volkov and Latent.Space</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6697</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/139607889/c5c515ed23104ea3edf3b41b1773db87.jpg"/></item><item><title><![CDATA[🎉 ThursdAI Nov 30 // ChatGPT 1 year celebration special episode // covering the past 1 year in LLM/OSS AI 🥳]]></title><description><![CDATA[<p>🎶 Happy birthday to you, happy birthday to you, happy birthday chat GPT-eeeeeeee, happy birthday to you. </p><p>Hey everyone, welcome to this special edition of ThursdAI, where you're probably gonna get two emails and two podcast episodes today, and you can choose which one you want to listen to; we actually recorded both of them live, it's just that they went a little long. 
</p><p>ThursdAI - The only podcast that brings you yearly recaps since chatGPT was released (😂) </p><p>This one is the more celebratory one: today marks one year from the release of chatGPT, and we (and by we I mean I, Alex) decided to celebrate it by recapping not just the last week in AI but <strong>the last year (full timeline posted at the bottom of this newsletter)</strong></p><p>Going month by month (with a swoosh sound in the editing), we covered the most important things that happened in LLMs and open source AI since chatGPT was released and unlocked everyone's imagination! </p><p>We also covered Meta stepping in with LLaMa, and then everything that happened since in multi-modality, vector databases, agents, and everything everything everything. It was one hell of an hour and a half, and we had <strong>almost 1K audience members</strong>! So I recommend you listen to this one first, and then the week's updates later, because there were some incredible releases this week as well (as there are every week)</p><p>I think it's important to do a Spotify-wrapped type thing for AI, for something like one year of chatGPT, and I think we'll be doing this every year, so hopefully in a year we'll see you here on November 30th covering the next year in AI.</p><p>And hopefully by next year an AI system will actually help me summarize all this, because it's a lot of work. But with that, I will just leave you with the timeline and no notes, and you should listen to everything, because we talked about everything live! </p><p>I hope you enjoy this special birthday celebration! 
(OpenAI sure did, check out this incredibly cute little celebration video they just posted) </p><p>Here’s the full timeline with everything important that happened month by month that we’ve covered:</p><p>* <strong>December 2022</strong> - ChatGPT becomes the fastest growing product in history</p><p>* GPT3.5 with 4K context window, instruction finetuning and conversational RLHF </p><p>* <strong>January</strong></p><p>* Microsoft invests additional $10B into OpenAI (Jan 23, <a target="_blank" href="https://openai.com/blog/openai-and-microsoft-extend-partnership">Blog</a>)</p><p>* <strong>February</strong> </p><p>* LLaMa 1 - Biggest Open Source LLM (February 24 - <a target="_blank" href="https://ai.meta.com/blog/large-language-model-llama-meta-ai/">Blog</a>)</p><p>* No commercial license</p><p>* 30% MMLU</p><p>* No instruction fine-tuning (RLHF)</p><p>* ChatGPT unofficial APIs exist</p><p>* <strong>March</strong> (the month of LLM superpowers)</p><p>* <strong>ChatGPT API</strong> (March 1, <a target="_blank" href="https://openai.com/blog/introducing-chatgpt-and-whisper-apis#LoganKilpatrick">announcement</a>)</p><p>* Developers can now build chatGPT powered apps</p><p>* All clones so far were completion based and not conversation based</p><p>* LLama.cpp from ggerganov + Quantization (<strong>March 10</strong>, <a target="_blank" href="https://github.com/ggerganov/llama.cpp/issues/33#issuecomment-1465108022">Blog</a>)</p><p>* Stanford - <strong>Alpaca 7B - </strong>Finetune on self-instruct GPT3.5 dataset (March 13, <a target="_blank" href="https://crfm.stanford.edu/2023/03/13/alpaca.html">Blog</a>)</p><p>* <strong>GPT4 release</strong> + chatGPT upgrade (March 14 - <a target="_blank" href="https://www.youtube.com/watch?v=outcGtbnMuQ">GPT-4 demo</a>)</p><p>* 67.0% HumanEval | 86.4% MMLU</p><p>* 8K (and 32K) context windows</p><p>* Anthropic announces <strong>Claude</strong> + Claude instant (March 14 - <a target="_blank" 
href="https://www.anthropic.com/index/introducing-claude">Blog</a>)</p><p>* 56.0% HumanEval </p><p>* Folks previously from OpenAI left to found Anthropic as a research lab, then pivoted from research to commercial</p><p>* LMSYS Vicuna 13B - Finetuned on <a target="_blank" href="shareg.pt">shareg.pt</a> exports (March 30, <a target="_blank" href="https://lmsys.org/blog/2023-03-30-vicuna/">Blog</a>)</p><p>* <strong>April (Embeddings & Agents)</strong></p><p>* <strong>AutoGPT</strong> becomes the fastest-starred GitHub project + writes its own code (April 1, <a target="_blank" href="https://twitter.com/SigGravitas/status/1642181498278408193">Blog</a>)</p><p>* Agents start to pop up like mushrooms after the rain</p><p>* <strong>LLaVa</strong> - Multimodality in open source begins (April 18, <a target="_blank" href="https://llava-vl.github.io">Blog</a>)</p><p>* CLIP + Vicuna smushed together to give LLMs eyes</p><p>* Bard improvements </p><p>* <strong>May</strong> (Context windows)</p><p>* Mosaic <strong>MPT-7B</strong> with 64K context, trained on 1T tokens, commercial license (May 5, <a target="_blank" href="https://www.mosaicml.com/blog/mpt-7b">Blog</a>) </p><p>* Anthropic updates <strong>Claude</strong> with a 100K context window (May 11, <a target="_blank" href="https://www.anthropic.com/index/100k-context-windows">Blog</a>)</p><p>* LLongBoi summer begins (context windows are being stretched)</p><p>* Nvidia shows <strong>Voyager</strong> agents that play Minecraft + memory stored in a vector DB (May 27, <a target="_blank" href="https://twitter.com/DrJimFan/status/1662115266933972993?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1662219615559376905%7Ctwgr%5E74d94ec2931ec56015a60453a0da08692b3f2bd3%7Ctwcon%5Es3_&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1ynKOapEvenJR%2Fthis-week-in-ai-falcon-openai-roadmap-mind-reading-and-more">Blog</a>)</p><p>* <strong>June</strong></p><p>* <strong>GPT-3.5-turbo + functions API</strong> (June 6, <a target="_blank" 
href="https://openai.com/blog/function-calling-and-other-api-updates">Blog</a>)</p><p>* GPT-3.5 and 4 got a boost in capabilities and steerability </p><p>* Price reduction on models + <strong>75% reduction on the ada embeddings</strong> model</p><p>* LLaMa context window extended to 8K with RoPE scaling</p><p>* AI Engineer self-determination essay by <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> </p><p>* <strong>July</strong></p><p>* Code Interpreter GA - ChatGPT can code (July 11, <a target="_blank" href="https://www.latent.space/p/code-interpreter">Blog</a>)</p><p>* Anthropic <strong>Claude 2</strong> - (July 11 - <a target="_blank" href="https://www.anthropic.com/index/claude-2">Blog</a>)</p><p>* 100K context window</p><p>* 71% HumanEval</p><p>* <strong>LLaMa 2</strong> (July 18 - <a target="_blank" href="https://ai.meta.com/blog/llama-2/">Blog</a>)</p><p>* Base & Chat models (RLHF)</p><p>* Commercial license </p><p>* 29.9% HumanEval | 68.9% MMLU </p><p>* <strong>August</strong></p><p>* Meta releases Code Llama, code-finetuned models</p><p>* <strong>September</strong></p><p>* DALL-E 3 - Adds multimodality on output and chat-to-image gen (Sep 20, <a target="_blank" href="https://openai.com/dall-e-3">Blog</a>)</p><p>* Mistral 7B, top-performing open source LLM, via torrent link (Sep 27, <a target="_blank" href="https://mistral.ai/news/about-mistral-ai/">Blog</a>)</p><p>* <strong>GPT4-V</strong> (vision & voice) - Adds multimodality on input (Sep 27, <a target="_blank" href="https://openai.com/blog/chatgpt-can-now-see-hear-and-speak">Blog</a>)</p><p>* <strong>October</strong></p><p>* <strong>OpenHermes</strong> - Mistral 7B finetune that tops the charts, from Teknium / Nous Research (Oct 16, <a target="_blank" 
href="https://twitter.com/Teknium1/status/1714010838959612329?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1714010838959612329%7Ctwgr%5Eb524cb8422e746a8a9391a334b5bc74e5a6ca15e%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1RDxllVbqolxL%2Fthursdai-oss-llms-mojo-at-mac-baidu-ai-pi-and-claude-updates-more">Announcement</a>)</p><p>* Inflection Pi gets connected to the web + supportPi mode (Oct 16, <a target="_blank" href="https://twitter.com/inflectionAI/status/1714018923916534226?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1714018923916534226%7Ctwgr%5Eb524cb8422e746a8a9391a334b5bc74e5a6ca15e%7Ctwcon%5Es1_&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1RDxllVbqolxL%2Fthursdai-oss-llms-mojo-at-mac-baidu-ai-pi-and-claude-updates-more">Blog</a>)</p><p>* Adept releases multimodal <strong>FuYu 8B</strong> (Oct 19, <a target="_blank" href="https://twitter.com/marktenenholtz/status/1715008550919864581?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1715008550919864581%7Ctwgr%5Eb524cb8422e746a8a9391a334b5bc74e5a6ca15e%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1RDxllVbqolxL%2Fthursdai-oss-llms-mojo-at-mac-baidu-ai-pi-and-claude-updates-more">blog</a>)</p><p>* <strong>November</strong></p><p>* <strong>Grok</strong> from xAI - with realtime access to all of X's content</p><p>* <strong>OpenAI dev day</strong> </p><p>* Combined mode for MMIO (multimodal on input and output)</p><p>* <strong>GPT-4 Turbo with 128K context, 3x cheaper than GPT-4</strong></p><p>* Assistants API with retrieval capabilities</p><p>* <strong>Shareable GPTs</strong> - custom versions of GPT with retrieval, DALL-E, Code Interpreter and vision</p><p>* Chatbots with real business use-cases, for example WandBot (which we just launched today! 
<a target="_blank" href="https://wandb.ai/wandbot/wandbot_public/reports/RAGs-To-Riches-Bringing-Wandbot-Into-Production--Vmlldzo1ODU5ODk0">Blog</a>) </p><p>* Has vector storage memory</p><p>* Available via Discord/Slack</p><p>* And custom GPT!</p><p>* <strong>Microsoft</strong> has copilot everywhere in office</p><p>Aaaand now we’re here! </p><p>What an incredible year, can’t imagine what the next year holds for all of us, but 1 thing is for sure, ThursdAI will be here to keep you all up to date! </p><p>P.S - If you scrolled all the way to here, DM me the 🎊 emoji so I know you celebrated with us! It really helps me to know that there is at least a few folks out of the thousands that get this newsletter that scrolls all the way through! </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-30-chatgpt-1-year-celebration</link><guid isPermaLink="false">substack:post:139312484</guid><dc:creator><![CDATA[Alex Volkov, Nisten, yam, Umesh Rajani, and Piotr Skalski]]></dc:creator><pubDate>Thu, 30 Nov 2023 22:33:53 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/139312484/7ac5c9b9c402a1685fee8e51c6fb6737.mp3" length="81229633" type="audio/mpeg"/><itunes:author>Alex Volkov, Nisten, yam, Umesh Rajani, and Piotr Skalski</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5077</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/139312484/6c76c7c551d61c8741e8dd77c4459626.jpg"/></item><item><title><![CDATA[🦃 ThursdAI Thanksgiving special - OpenAI ctrl+altman+delete, Stable Video, Claude 2.1 (200K), the (continuous) rise of OSS LLMs & more AI news]]></title><description><![CDATA[<p>ThursdAI TL;DR - November 23 </p><p>TL;DR of all topics covered: </p><p>* 
<strong>OpenAI Drama</strong></p><p>* Sam... there and back again. </p><p>* <strong>Open Source LLMs</strong> </p><p>* Intel finetuned Mistral and is on top of the leaderboards with <strong>neural-chat-7B </strong>(<a target="_blank" href="https://x.com/Yampeleg/status/1727679553714217421?s=20">Thread</a>, <a target="_blank" href="https://t.co/rjxz0U3NNQ">HF</a>, <a target="_blank" href="https://t.co/EJ5AOZxPVF">Github</a>)</p><p>* And trained on new Habana hardware! </p><p>* Yi-34B Chat - 4-bit and 8-bit chat finetunes of Yi-34B (<a target="_blank" href="https://t.co/hgUdms5Lah">Card</a>, <a target="_blank" href="https://huggingface.co/spaces/01-ai/Yi-34B-Chat">Demo</a>)</p><p>* Microsoft released Orca 2 - it's underwhelming (<a target="_blank" href="https://x.com/erhartford/status/1726809360117219483?s=20">Thread</a> from Eric, <a target="_blank" href="https://huggingface.co/microsoft/Orca-2-13b">HF</a>, <a target="_blank" href="https://www.microsoft.com/en-us/research/blog/orca-2-teaching-small-language-models-how-to-reason/">Blog</a>)</p><p>* System2Attention - Uses LLM reasoning to figure out what to attend to (<a target="_blank" href="https://x.com/jaseweston/status/1726784511357157618?s=20">Thread</a>, <a target="_blank" href="https://arxiv.org/abs/2311.11829">Paper</a>)</p><p>* Lookahead decoding to speed up LLM inference by 2x (<a target="_blank" href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/">Lmsys blog</a>, <a target="_blank" href="https://github.com/hao-ai-lab/LookaheadDecoding">Github</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Anthropic Claude 2.1 - <strong>200K context</strong>, 2x fewer hallucinations, tool use finetune (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1727001773888659753">Announcement</a>, <a target="_blank" href="https://www.anthropic.com/index/claude-2-1">Blog</a>, <a target="_blank" href="https://x.com/GregKamradt/status/1727018183608193393?s=20">Ctx length analysis</a>)</p><p>* InflectionAI 
releases Inflection 2 (<a target="_blank" href="https://x.com/inflectionAI/status/1727350646938960205?s=20">Announcement</a>, <a target="_blank" href="https://inflection.ai/inflection-2">Blog</a>)</p><p>* Bard can summarize YouTube videos now </p><p>* <strong>Vision</strong></p><p>* Video-LLaVa - open source video understanding (<a target="_blank" href="https://github.com/PKU-YuanGroup/Video-LLaVA">Github</a>, <a target="_blank" href="https://x.com/_nateraw/status/1726783481248977037?s=20">demo</a>)</p><p>* <strong>Voice</strong></p><p>* OpenAI added voice for free accounts (<a target="_blank" href="https://twitter.com/OpenAI/status/1727065166188274145">Announcement</a>)  </p><p>* 11Labs released speech-to-speech, including intonations (<a target="_blank" href="https://twitter.com/elevenlabsio/status/1727460218345242979">Announcement</a>, <a target="_blank" href="https://elevenlabs.io/speech-synthesis">Demo</a>)</p><p>* whisper.cpp - with an OpenAI-like drop-in replacement API server (<a target="_blank" href="https://twitter.com/ggerganov/status/1726860885602476102">Announcement</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Stable Video Diffusion - Stability releases text2video and img2video (<a target="_blank" href="https://x.com/altryne/status/1727039443956088883?s=20">Announcement</a>, <a target="_blank" href="https://www.fal.ai/models/svd">Try it</a>)</p><p>* ZipLoRA - combine diffusion LoRAs together - Nataniel Ruiz (Announcement, <a target="_blank" href="https://t.co/1mzMAftwjt">Blog</a>)</p><p>* Some folks are getting NeRFs out of SVD (Stable Video Diffusion) (<a target="_blank" href="https://x.com/KaiZhang9546/status/1727177336120868959?s=20">link</a>)</p><p>* LCM everywhere - in Krea, in <a target="_blank" href="https://twitter.com/tldraw/status/1726632746779652211">tldraw</a>, in <a target="_blank" href="https://www.fal.ai/dynamic">Fal</a>, on <a target="_blank" 
href="https://huggingface.co/spaces/radames/Real-Time-Latent-Consistency-Model">Hugging Face</a></p><p>* <strong>Tools</strong></p><p>* Screenshot-to-html (<a target="_blank" href="https://twitter.com/DevDminGod/status/1725175630029803538">Thread</a>, <a target="_blank" href="https://github.com/abi/screenshot-to-code">Github</a>)</p><p>Ctrl+Altman+Delete weekend</p><p>If you're subscribed to ThursdAI, then you most likely know the full story of the crazy OpenAI weekend. Here's my super, super quick summary (and if you want full blow-by-blow coverage, Ben Tossell has a great one <a target="_blank" href="https://bensbites.beehiiv.com/p/whats-going-open-ai">here</a>)</p><p>Sam got fired, Greg quit, Mira flipped, then Ilya flipped. Satya played some chess, there was an interim CEO for 54 hours, all employees sent hearts then signed a letter, none of the 3 co-founders is on the board anymore (Ilya's still at the company), the company is aligned AF going into '24, and Satya is somehow a winner in all this.</p><p>The biggest winners to me are the open source folks, who suddenly got tons of interest, and specifically, everyone seems to converge on OpenHermes 2.5 Mistral from Teknium (Nous Research) as the best model around! </p><p>However, I want to shout out the incredible cohesion that came out of the folks at OpenAI. I created a list of <a target="_blank" href="https://x.com/altryne/status/1726108290508374017?s=20">around 120 employees on X</a> and all of them were basically aligned the whole weekend, from sending ❤️s, to signing the letter, to showing how happy they are that Sam and Greg are back! </p><p>Yay</p><p>This Week's Buzz from WandB (aka what I learned this week)</p><p>As I'm still onboarding, the main thing I've learned this week is how transparent Weights & Biases is internally. 
During the whole OAI saga, Lukas, the co-founder, sent a long message in Slack addressing the situation (after all, OpenAI is a big customer for W&B; GPT-4 was trained on W&B end to end) and answering questions about how this situation could affect us and the business. </p><p>Additionally, another co-founder, Shawn Lewis, shared a recording of his update to the BOD of WandB about our progress on the product side. It’s really, really refreshing to see this information voluntarily shared with the company 👏 </p><p>The first core value of W&B is Honesty, and it includes transparency outside of matters like personal HR stuff, and after hearing about this during onboarding, it’s great to see that the company lives it in practice 👏 </p><p>I also learned that almost every loss curve image you see on X is a W&B dashboard screenshot ✨ and while we do have a share functionality, it’s not built for viral X sharing haha, so in the spirit of transparency, here’s a video I recorded and shared with product + a feature request to make these screenshots way more attractive + clearly W&B </p><p>Open Source LLMs </p><p>Intel passes Hermes to take SOTA with a DPO Mistral finetune (<a target="_blank" href="https://x.com/Yampeleg/status/1727679553714217421?s=20">Thread</a>, <a target="_blank" href="https://t.co/rjxz0U3NNQ">Hugging Face</a>, <a target="_blank" href="https://t.co/EJ5AOZxPVF">Github</a>)</p><p>Yes, that Intel, the... oldest computing company in the world, not only comes out strong with the best (on benchmarks) open source LLM, it also does DPO, was trained on completely new hardware, and comes with an Apache 2 license! </p><p>Here's Yam's TL;DR for the DPO (Direct Preference Optimization) technique: </p><p><em>Given a prompt and a pair of completions, train the model to prefer one over the other. </em>This model was trained on prompts from SlimOrca's dataset, where each has one GPT-4 completion and one LLaMA-13B completion. 
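</p><p>To make that concrete, here's a minimal sketch of the DPO loss in plain Python (my own illustration of the published DPO formula, with made-up log-probs; this is not Intel's actual training code):</p>

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) completion pair.

    The policy is pushed to widen the log-prob margin of the chosen
    completion over the rejected one, relative to a frozen reference
    model -- no separate reward model or PPO loop needed."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): the loss shrinks as the margin grows
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the chosen (GPT-4) completion -> small loss
small = dpo_loss(-5.0, -20.0, -10.0, -10.0)
# Policy prefers the rejected (LLaMA-13B) completion -> large loss
large = dpo_loss(-20.0, -5.0, -10.0, -10.0)
```

<p>In a real finetune, the two log-probs per completion come from the model being trained and a frozen copy of it; the numbers above are just made up to show the shape of the loss.</p><p>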
The model was trained to prefer the GPT-4 completion over the LLaMA-13B one.</p><p>Additionally, even though there is custom hardware involved here, Intel fully supports the Hugging Face trainer, and the whole repo is very clean and easy to understand, replicate, and build things on top of (like LoRA)</p><p>LMSys Lookahead decoding (<a target="_blank" href="https://lmsys.org/blog/2023-11-21-lookahead-decoding/">Lmsys</a>, <a target="_blank" href="https://github.com/hao-ai-lab/LookaheadDecoding">Github</a>)</p><p>This method significantly speeds up LLM decoding, sometimes by more than 2x, using Jacobi iteration tricks (don't ask me). It's compatible with the HF transformers library! I hope this comes to open source tools like LLaMa.cpp soon! </p><p>Big CO LLMs + APIs</p><p>Anthropic Claude comes back with 2.1, featuring a 200K context window and tool use</p><p>While folks on X thought this was new, Anthropic actually announced Claude with 200K back in May, and just gave us the 100K context window, which for the longest time was the largest context window around. I always thought they didn't have a reason to release 200K, since none of their users actually wanted it, and that it was a marketing/sales decision to wait until OpenAI caught up. Remember, back then GPT-4 was 8K and some lucky folks got 32K! </p><p>Well, OpenAI released GPT-4 Turbo with 128K, so Anthropic re-trained and released Claude to regain the upper hand. I also love the tool use capabilities. </p><p>Re: the longer context window, a bunch of folks tested whether the 200K context window is actually all that great, and it turns out that besides being very expensive to run (you pay per token), it also loses a bunch of information at long lengths, per needle-in-the-haystack searches. 
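</p><p>These needle-in-the-haystack tests are easy to sketch yourself. Here's a toy illustration of the setup (my own sketch of the idea, not the actual harness behind the analysis linked next):</p>

```python
def build_haystack(needle, filler_sentence, total_words, depth_pct):
    """Bury one fact (the needle) at a chosen depth inside filler text.

    You'd then send the result to the model with a question about the
    needle; grading the answers across depths and context lengths is
    what produces recall charts."""
    reps = total_words // len(filler_sentence.split())
    words = ((filler_sentence + " ") * reps).split()
    insert_at = int(len(words) * depth_pct / 100)
    words.insert(insert_at, needle)
    return " ".join(words)

needle = "The secret ingredient is saffron."
prompt = build_haystack(
    needle, "The grass is green and the sky is blue.", 500, 90)
# The needle now sits ~90% of the way into the context; you'd ask the
# model "What is the secret ingredient?" and grade its answer.
```

<p>Swap in real documents and grow the total length up to the model's context limit, then grade the answers per depth and length to chart recall degradation.</p><p>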
<a target="_blank" href="https://twitter.com/GregKamradt/status/1727018183608193393">Here's an analysis </a>by Greg Kamradt that shows that: </p><p>* Starting at ~90K tokens, recall performance at the bottom of the document started to get increasingly worse</p><p>* Less context = more accuracy - This is well known, but when possible, reduce the amount of context you send to the models to increase their ability to recall.</p><p>I had similar issues back in May with their 100K token window (<a target="_blank" href="https://twitter.com/altryne/status/1656813286602711040">source</a>)</p><p>Voice & Audio</p><p>ElevenLabs has speech-to-speech</p><p>Creating a significant jump in capabilities, ElevenLabs now allows you to be the actor behind the voice! With speech-to-speech, it transfers the pauses, the intonation, and the emotion into the voice generation. Here's my live reaction and comparison: </p><p>* Notable: whisper.cpp now supports a server compatible with OpenAI's API (<a target="_blank" href="https://x.com/ggerganov/status/1726860885602476102?s=20">Announcement</a>, <a target="_blank" href="https://github.com/ggerganov/whisper.cpp/pull/1380">Github</a>)</p><p>AI Art & Diffusion</p><p>Stable Video Diffusion - text-2-video / img-2-video foundation model (<a target="_blank" href="https://twitter.com/altryne/status/1727039443956088883">Announcement</a>, <a target="_blank" href="https://t.co/pvjjBiQCAj">Hugging Face</a>, <a target="_blank" href="https://t.co/5i68VemyAF">Github</a>, <a target="_blank" href="https://www.fal.ai/models/svd">DEMO</a>) </p><p>Stability has done it again: Stable Video lets you create incredibly consistent videos from images or just text! The clips are short for now, but they're working on extending them, and the videos look incredible! 
(And thanks to friends at Fal, you can try it right now, <a target="_blank" href="https://www.fal.ai/models/svd">here</a>)</p><p>And here’s a quick gif I created with DALL-E 3 and Fal to celebrate the Laundry Buddy team at OAI while the outage was happening </p><p>Tools</p><p>Screenshot to HTML (<a target="_blank" href="https://github.com/abi/screenshot-to-code">Github</a>)</p><p>I… what else is there to say? Someone used GPT4-Vision to… take screenshots and iteratively re-create the HTML for them. As someone who used to spend months on this exact task, I’m very, very happy it’s now automated! </p><p>Happy Thanksgiving 🦃 </p><p>I am really thankful to all of you who subscribe and come back every week, thank you! I wouldn’t be here without all your support, comments, and feedback! Including this incredible art piece that Andrew from spacesdashboard created just in time for our live recording, just look at those little robots! 😍 </p><p>See you next week (and of course the emoji of the week is 🦃, DM or reply!) </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-thanksgiving-special-openai</link><guid isPermaLink="false">substack:post:139116179</guid><dc:creator><![CDATA[Alex Volkov, Umesh Rajani, and Nisten]]></dc:creator><pubDate>Thu, 23 Nov 2023 22:21:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/139116179/01448679727e96395a6b217ac102ccd8.mp3" length="111506330" type="audio/mpeg"/><itunes:author>Alex Volkov, Umesh Rajani, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6969</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/139116179/a6ead03e41a9da79129c89c036a41f19.jpg"/></item><item><title><![CDATA[📅 ThursdAI Nov 16 - Live AI art, MS copilots everywhere, EMUs from Meta, sketch-to-code from TLDraw, Capybara 34B and other AI news!]]></title><description><![CDATA[<p>Hey y'all, welcome to this special edition of ThursdAI. This is the first one that I'm sending in my new capacity as the AI Evangelist at <a target="_blank" href="https://reflect.app/g/altryne/a0467fc75abc43c3b89556318330ce5d">Weights & Biases (on the growth team)</a></p><p>I made the announcement last week, but this week is my first official week at W&B, and oh boy... how humbled and excited I was to <a target="_blank" href="https://twitter.com/altryne/status/1725130934423519459">receive all the inspiring and supportive</a> feedback from the community, friends, colleagues and family 🙇‍♂️ </p><p>I promise to continue my mission of delivering AI news, positivity and excitement, and to be that one place where we stay up to date so you don't have to. 
</p><p>This week we also had one of our biggest live recordings yet, with 900 folks tuned in so far 😮 and it was my pleasure to again chat with folks who "made the news". We had a brief interview with <a target="_blank" href="https://twitter.com/steveruizok">Steve Ruiz</a> and <a target="_blank" href="https://twitter.com/tldraw">Lou from TLDraw</a> about their incredible GPT-4 Vision enabled "make real" functionality, and I finally got to catch up with my good friend Idan Gazit, who's heading the GitHub Next team (the birthplace of GitHub Copilot), about how they see the future. So definitely, definitely check out the full conversation! </p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source LLMs</strong> </p><p>* Nous Capybara 34B on top of Yi-34B (with 200K context length!) (<a target="_blank" href="https://twitter.com/WolframRvnwlf/status/1724562856346071287">Eval</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Capybara-34B">HF</a>) </p><p>* Microsoft - Phi 2 will be open sourced (barely) (<a target="_blank" href="https://x.com/EldanRonen/status/1724875098610712820?s=20">Announcement</a>, Model)</p><p>* HF adds finetune chain genealogy (<a target="_blank" href="https://x.com/julien_c/status/1724486580662829328?s=20">Announcement</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Microsoft - Everything is CoPilot (<a target="_blank" href="https://www.theverge.com/23961007/microsoft-ignite-2023-news-ai-announcements-copilot-windows-azure-office">Summary</a>, <a target="_blank" href="http://copilot.microsoft.com">copilot.microsoft.com</a>)</p><p>* CoPilot for work and 365 (<a target="_blank" href="https://blogs.microsoft.com/blog/2023/03/16/introducing-microsoft-365-copilot-your-copilot-for-work/">Blogpost</a>)</p><p>* CoPilot studio - low code "tools" builder for CoPilot + GPTs access (<a target="_blank" href="https://twitter.com/altryne/status/1724862274257502662">Thread</a>)</p><p>* 
OpenAI Assistants API cookbook (<a target="_blank" href="https://cookbook.openai.com/examples/assistants_api_overview_python">Link</a>)</p><p>* <strong>Vision</strong></p><p>* 🔥 TLdraw make real button - turn sketches into code in seconds with vision (<a target="_blank" href="https://twitter.com/altryne/status/1724917107274457336">Video</a>, <a target="_blank" href="http://makereal.tldraw.com">makereal.tldraw.com</a>)</p><p>* Humane Pin - Orders are out, shipping early 2024, multimodal AI agent on your lapel</p><p>* <strong>Voice & Audio</strong></p><p>* 🔥 DeepMind (Youtube) - <strong>Lyria</strong> high quality music generations you can HUM into (<a target="_blank" href="https://deepmind.google/discover/blog/transforming-the-future-of-music-creation/">Announcement</a>)</p><p>* EmotiVoice - 2000 different voices with <strong>emotional synthesis </strong>(<a target="_blank" href="https://github.com/netease-youdao/EmotiVoice">Github</a>)</p><p>* Whisper V3 is top of the charts again (<a target="_blank" href="https://twitter.com/reach_vb/status/1724912958826770711">Announcement</a>, <a target="_blank" href="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard">Leaderboard</a>, <a target="_blank" href="https://huggingface.co/openai/whisper-large-v3">Github</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* 🔥 Real-time LCM (latent consistency model) AI art is blowing up (<a target="_blank" href="https://x.com/nickfloats/status/1725194622807523691?s=20">Krea</a>, <a target="_blank" href="https://x.com/gorkemyurt/status/1724485367892709526?s=20">Fal Demo</a>)</p><p>* 🔥 Meta announces EMU-video and EMU-edit (<a target="_blank" href="https://twitter.com/_akhaliq/status/1725186401006653527">Thread</a>, <a target="_blank" href="https://emu-video.metademolab.com/">Blog</a>)</p><p>* Runway motion brush (<a target="_blank" href="https://x.com/runwayml/status/1723033256067489937?s=20">Announcement</a>)</p><p>* <strong>Agents</strong></p><p>* 
Alex's Visual Weather GPT (<a target="_blank" href="https://twitter.com/altryne/status/1722498086256411075">Announcement</a>, <a target="_blank" href="https://chatg.pt/artweather">Demo</a>) </p><p>* AutoGen, Microsoft's agents framework, now supports the Assistants API (<a target="_blank" href="https://twitter.com/pyautogen/status/1724201430398140439">Announcement</a>)</p><p>* <strong>Tools</strong></p><p>* Gobble Bot - scrape everything into 1 long file for GPT consumption (<a target="_blank" href="https://x.com/rafal_makes/status/1723786488578556203?s=20">Announcement</a>, <a target="_blank" href="https://gobble.bot/">Link</a>)</p><p>* ReTool state of AI 2023 - <a target="_blank" href="https://retool.com/reports/state-of-ai-2023">https://retool.com/reports/state-of-ai-2023</a></p><p>* Notion Q&A AI - search through a company Notion and QA things (<a target="_blank" href="https://www.notion.so/releases/2023-11-14">announcement</a>)</p><p>* GPTs shortlinks + analytics from Steven Tey (https://chatg.pt) </p><p>This Week's Buzz from WandB (aka what I learned this week)</p><p>Introducing a new section in the newsletter called "The Week's Buzz from WandB" (AKA What I Learned This Week).</p><p>As someone who joined Weights and Biases without prior knowledge of the product, I'll be learning a lot. I'll also share my knowledge here, so you can learn alongside me. Here's what I learned this week:</p><p>The most important thing I learned this week is just how prevalent and how much of a leader Weights & Biases is. W&B's main product is used by most of the foundation LLM trainers, including OpenAI. </p><p>In fact, GPT-4 was completely trained on W&B!</p><p>It's used by pretty much everyone besides Google. In addition, it's not only about LLMs: W&B products are used to train models in many, many different areas of the industry. 
</p><p>Some incredible examples: a pesticide dispenser on John Deere tractors that only sprays weeds and not actual produce, and Big Pharma companies using W&B to help create better drugs that are now in trials. It's just incredible how much machine learning there is outside of LLMs. But I'm also absolutely floored by how ubiquitous W&B is in the LLM world.</p><p>W&B has two main products, Models & Prompts. Prompts is the newer one, and we're going to dig into both of these more next week! </p><p>Additionally, it's striking how many AI Engineers, API users such as myself and many of my friends, have no idea who W&B even is, or if they do, they've never used it!</p><p><strong>Well, that's what I'm here to change, so stay tuned!</strong> </p><p>Open source & LLMs</p><p>In the open source corner, we have the first Nous finetune of Yi-34B, a great model that we covered in <a target="_blank" href="https://sub.thursdai.news/p/nov-09#details">the last episode</a> and that is now finetuned with the Capybara dataset by ThursdAI cohost <a target="_blank" href="https://twitter.com/ldjconfirmed">LDJ</a>! Not only is it a great model, it now <a target="_blank" href="https://twitter.com/WolframRvnwlf/status/1724562856346071287">tops the charts for resident reviewer</a> WolframRavenwolf on r/LocalLLaMA (and X) </p><p>Additionally, OpenHermes 2.5 7B from <a target="_blank" href="https://x.com/Teknium1">Teknium</a> is now in second place on the HuggingFace leaderboard. It was released recently but we haven't covered it until now; I still think Hermes is one of the more capable local models you can get! </p><p>Also in open source this week, guess who loves it? Satya (and Microsoft) </p><p>They love it so much that they not only created this awesome slide (although, what's SLMs? Small Language Models? I don't like it), they also announced that LLaMa and Mistral are coming to Azure services as inference! 
</p><p>And they gave us a little treat: <a target="_blank" href="https://twitter.com/SebastienBubeck/status/1724854157004190095">Phi2 is coming</a>. They said open source (but folks looking at the license saw that it's research-only), and supposedly it's a significantly more capable model at only 2.7B weights (super, super tiny) </p><p>Big Companies & APIs</p><p>Speaking of Microsoft, they <a target="_blank" href="https://www.theverge.com/23961007/microsoft-ignite-2023-news-ai-announcements-copilot-windows-azure-office">announced so much</a> during their Ignite event on Wednesday (15th) that it's impossible to cover all of it in this newsletter, but here are the main things that got me excited! </p><p>CoPilot everywhere, everything is CoPilot</p><p>Microsoft rebranded Bing Chat to Copilot; it now lives on <a target="_blank" href="http://copilot.microsoft.com">copilot.microsoft.com</a> and it's basically a free GPT-4, with vision and DALL-E capabilities. If you're not signed up for OpenAI's Plus membership, this is as good as it gets for free! </p><p>They also announced CoPilot for 365, which means that everything from Office (Word, Excel!) to your mail and your Teams conversations will have a CoPilot that can draw from your organization's knowledge and help you do incredible things. Things like helping book appointments, pulling in relevant people for a meeting based on previous documents, summarizing that meeting, and scheduling follow-ups, plus like a TON more stuff. DALL-E integration will help you create awesome PowerPoint slides. </p><p>(P.S. all of this will allegedly be data-protected and won't be shared with MS or trained on)</p><p>They literally went and did "AI everywhere" with CoPilot, and it's kinda incredible to see how hard Microsoft is betting the farm on AI while Google... where's Google™? 
</p><p>CoPilot Studio</p><p>One of the more exciting things for me was the CoPilot Studio announcement: a low-code tool for your IT department to extend your company's CoPilot for your organization. Think getting HR data from your HR system, or your sales data from Salesforce! </p><p>They will launch with 1100 connectors for many services, but will allow you to easily build your own. </p><p>One notable thing: Custom GPTs will also be a connector! You will literally be able to connect your CoPilot with your (or someone's) GPTs! Are you getting this? AI employees are coming faster than you think! </p><p>Vision</p><p>I've been waiting for cool vision demos since the GPT-4V API was launched, and oh boy did we get them! From friend of the pod Robert Lukoshko's <a target="_blank" href="https://x.com/Karmedge/status/1724818581966180390?s=20">auto screenshot analysis</a>, which takes screenshots periodically and sends you a report of all you did that day, to Charlie Holtz's live webcam narration by <a target="_blank" href="https://twitter.com/charliebholtz/status/1724815159590293764">David Attenborough</a> (which is available on <a target="_blank" href="https://github.com/cbh123/narrator">Github</a>!) </p><p>But I think there's one vision demo that takes the cake this week, by our friends (<a target="_blank" href="https://twitter.com/steveruizok">Steve Ruiz</a>) from TLDraw, a whiteboard canvas primitive. They added a sketch-to-code button: sketch something out, GPT-4 Vision analyzes it, GPT-4 writes the code, and you get live code within seconds. It's so mind-blowing that I'm still collecting my jaw off the floor here. And if you ask it nicely to add JS interactivity, the result will be interactive 🤯</p><p>GPT-4V is truly as revolutionary as I imagined it to be when Greg announced it on stage 🫡 </p><p>P.S - Have you played with it? Do you have cool demos? 
DM me with 👁️‍🗨️ emoji and a cool vision demo to be included in the next ThursdAI</p><p>AI Art & Diffusion & 3D </p><p>In addition to the TLDraw demo, one mind-blowing demo after another is coming this week from the AI art world, using LCM (Latent Consistency Models) + a whiteboard. This is yet another see-it-to-believe-it type thing (or <a target="_blank" href="https://www.fal.ai/dynamic">play with it</a>) </p><p>(video from <a target="_blank" href="https://twitter.com/LinusEkenstam/status/1725131315492634807">Linus</a>)</p><p>Dear friends from <a target="_blank" href="http://Krea.ai">Krea.ai</a> were the first to implement this insanity, which lets you see real-time AI art generation almost as fast as you type your prompts, followed up by the wizards at Fal getting generations down to several milliseconds (shoutout <a target="_blank" href="https://twitter.com/gorkemyurt">Gorkem</a>!). The real-time drawing thing is truly, truly mind-blowing. So mind-blowing that folks are feeding their webcam streams into this and seeing near-real-time generation of their webcam feeds on the fly.  </p><p>Meta announcing new Emus (Video & Edit) </p><p>Meta doesn't want me to relax, and during the space, announced their text-to-video and text-based image editing models.</p><p>Emu Video produces great videos from a prompt, and Emu Edit is really interesting: it allows you to edit parts of images by typing, think "remove the tail from this cat" or "remove the hat from this person" </p><p>They have this to say, which... dayum. </p><p>In human evaluations, our generated videos are strongly preferred in quality compared to all prior work– 81% vs. Google’s Imagen Video, 90% vs. Nvidia’s PYOCO, and 96% vs. Meta’s Make-A-Video. Our model outperforms commercial solutions such as RunwayML’s Gen2 and Pika Labs</p><p>It's really compelling, can't wait to see if they open source this. Video is coming, y'all! 
</p><p>Audio & Sound</p><p>Deepmind + Youtube announced Lyria (<a target="_blank" href="https://deepmind.google/discover/blog/transforming-the-future-of-music-creation/">blogpost</a>)</p><p>This new music model is pretty breathtaking, but we only got a glimpse, not even a waitlist for this one. However, check out the pre-recorded demos: the folks at DeepMind have a model you can hum into, sing into, and it'll create a full-blown track for you, with bass, drums, and singing! </p><p>Not only that, it will also license vocals from musicians (à la Grimes) and split the revenue between you and them if you post it on Youtube! </p><p>Pretty cool Google, pretty cool! </p><p>Agents & Tools</p><p>Look, I gotta be honest, I wasn't sure whether Agents and Tools should be one category or two, but I guess GPTs are kinda tools, so I'm combining them for this one. </p><p><strong>GPTs (</strong><a target="_blank" href="https://chatg.pt/artweather"><strong>My Visual Weather</strong></a><strong>, </strong><a target="_blank" href="https://x.com/simonw/status/1724815901378347425?s=20"><strong>Simons Notes</strong></a><strong>)</strong></p><p>This week, the GPT that I created, Visual Weather GPT, has blown up, with over 5,000 chats opened and many, many folks using it and texting me about it. A super cool way to check out all the capabilities of a GPT. If you remember, I thought of this idea a few weeks ago when we got a sneak preview of the "All Tools" mode, but now I can share it with you all in the form of a GPT that will browse the web for real-time weather data and create a unique art piece for that location and its weather conditions! </p><p>It's really easy to make as well, and I fully expect everyone to start making their own versions very soon. I think we're inching towards the era of JIT (just-in-time) software, where you'll create software as you need it, and it'll be as easy as talking to ChatGPT! 
</p><p>Speaking of, friend of the pod Steven Tey from Vercel (whose <a target="_blank" href="http://dub.sh">dub.sh</a> I use and love for <a target="_blank" href="http://thursdai.news">thursdai.news</a> links) has released a GPT link shortener called <a target="_blank" href="http://chatg.pt">chatg.pt</a>, and you can register and get your own cool short link like <a target="_blank" href="https://chatg.pt/artweather">https://chatg.pt/artweather</a> 👏 And it'll give you analytics as well! </p><p>Pro tip for the weather GPT: you can ask for a specific season or style in parentheses and then send those as greeting cards to your friends. Happy upcoming Thanksgiving everyone! </p><p>Speaking of Thanksgiving, we're not taking a break. Next ThursdAI, November 23, join us for a live discussion and podcast recording! We'll have many thanks, cool AI stuff, and much more!  </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-16-live-ai-art-ms-copilots</link><guid isPermaLink="false">substack:post:138930783</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 17 Nov 2023 00:55:11 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/138930783/74c938dc4b79e8b10df4db1a28f3211f.mp3" length="102425791" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6402</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/138930783/c4277468b616f80db12b7c2cafe0f8d3.jpg"/></item><item><title><![CDATA[📅 ThursdAI - OpenAI DevDay recap (also X.ai grōk, 01.ai 200K SOTA model, Humane AI pin) and a personal update from Alex 🎊]]></title><description><![CDATA[<p>Hey everyone, this is Alex Volkov 👋 </p><p>This week was an 
incredibly packed one: it started strong on Sunday with the x.ai Grok announcement, continued Monday with all the releases during OpenAI Dev Day, then topped off with Github Universe Copilot announcements, and to top it all off, we postponed the live recording to see what <a target="_blank" href="http://hu.ma.ne">hu.ma.ne</a> has in store for us as AI devices go (they finally announced the Pin with all its features) </p><p>In between, we had a new AI unicorn from Hong Kong, <a target="_blank" href="http://01.ai">01.ai</a>, led by ex-Google China lead Kai-Fu Lee, which dropped Yi, a new SOTA 34B model with a whopping 200K context window and a commercial license.</p><p>Above all, this week was a monumental one for me personally. ThursdAI has been a passion project for the longest time (240 days), and it has led me to incredible places: being invited to the <a target="_blank" href="http://ai.engineer">ai.engineer</a> summit to do media, then getting invited to OpenAI Dev Day (to also do podcasting from there), interviewing and befriending folks from <strong>HuggingFace</strong>, <strong>Github</strong>, <strong>Adobe</strong>, <strong>Google</strong>, <strong>OpenAI</strong> and of course open source friends like <strong>Nous Research</strong> and <strong>Alignment Labs</strong>, interviewing authors of papers, hackers of projects, and fine-tuners, and of course all of you who tune in from week to week 🙏 Thank you!</p><p>It's all been so humbling and fun, which makes me ever more excited to share the next chapter. Starting Monday <strong>I'm joining Weights & Biases as an AI Evangelist</strong>! 🎊</p><p>I couldn't be more excited to continue ThursdAI's mission of spreading knowledge about AI, connecting the AI engineers and the fine-tuners, the data scientists and the GenAI folks, the super-advanced cutting-edge stuff and the folks who fear AI, all with the backing of such an incredible and important company in the AI space.  
</p><p>ThursdAI will continue as an X space, newsletter and podcast, as we gradually find a common voice and continue bringing awareness of W&B's incredible brand to newer developers, products and communities. Expect more on this very soon! </p><p>Ok now to the actual AI news 😅 </p><p><strong>TL;DR of all topics covered</strong>: </p><p>* <strong>OpenAI Dev Day</strong></p><p>* GPT-4 Turbo with 128K context, 3x cheaper than GPT-4</p><p>* Assistants API - OpenAI's new agent API, with retrieval memory, code interpreter, function calling and JSON mode </p><p>* GPTs - Shareable, configurable GPT agents with memory, code interpreter, DALL-E, browsing, custom instructions and actions</p><p>* Privacy Shield - OpenAI's lawyers will protect you from copyright lawsuits </p><p>* Dev Day emergency pod with <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a>  with Swyx, Alessio, Simon and me! (<a target="_blank" href="https://latent.space/p/devday">Listen</a>)</p><p>* <strong>OpenSource LLMs</strong> </p><p>* 01.ai launches Yi-34B, a commercially licensed model with a 200K context window that tops all HuggingFace leaderboards across all sizes (<a target="_blank" href="https://x.com/kaifulee/status/1721321096727994590?s=20">Announcement</a>)</p><p>* <strong>Vision</strong></p><p>* GPT-4 Vision API finally announced; rejoice, it's as incredible as we imagined it to be</p><p>* <strong>Voice</strong></p><p>* OpenAI TTS models with 6 very realistic, multilingual voices, no cloning tho</p><p>* <strong>AI Art & Diffusion</strong></p><p>* <2.5 seconds full SDXL inference with FAL (<a target="_blank" href="https://x.com/isidentical/status/1721629870277427372?s=20">Announcement</a>)</p><p>OpenAI Dev Day</p><p>So much to cover from OpenAI that it gets its own section in today's newsletter. </p><p>I was lucky enough to get invited to, and attend, the first-ever OpenAI developer conference (AKA Dev Day), and it was an absolute blast. 
It was also incredible to attend together with <strong>all 8.5 thousand</strong> of you who <a target="_blank" href="https://twitter.com/altryne/status/1720284705915109546">tuned into our live stream</a> on X as we walked to the event, then watched the keynote together (thanks Ray for the restream) and talked with OpenAI folks about the updates. Huge shoutout to <a target="_blank" href="https://twitter.com/ldjconfirmed">LDJ</a>, <a target="_blank" href="https://twitter.com/nisten">Nisten</a>, <a target="_blank" href="https://twitter.com/RayFernando1337">Ray</a>, Phlo, Swyx and many other folks who held the space while we were otherwise engaged with deep dives, meeting folks and doing interviews! </p><p>So now for some actual reporting! What did we get from OpenAI? Omg, we got so much, as developers and as users (and as attendees, more on this later) </p><p>GPT4-Turbo with 128K context length</p><p>The major thing announced is a new model, GPT-4-turbo, which is supposedly faster than GPT-4 while being 3x cheaper (2x on output) and having a <strong>whopping 128K context length</strong>, while also being more accurate (with significantly <a target="_blank" href="https://x.com/swyx/status/1722441535235768372?s=20">better recall and attention</a> throughout this context length)</p><p>With JSON mode, significantly improved function calling capabilities, an updated knowledge cut-off (April 2023), and higher rate limits, this new model is already being implemented across products and is a significant upgrade for many folks</p><p>GPTs - A massive shift in the agent landscape by OpenAI</p><p>Another (semi-separate) thing that Sam talked about was GPTs, their version of agents </p><p>not to be confused with the Assistants API, which is also agents, but for developers; they are not the same, and it's confusing</p><p>GPTs, I think, are a genius marketing move by OpenAI and replace Plugins (which never even found product market 
fit) in many regards. </p><p>GPTs are instances of, well... GPT4-turbo, that you can create by simply chatting with BuilderGPT. They can have their own custom instruction set and capabilities that you can turn on and off, like browsing the web with Bing, creating images with DALL-E, and writing and executing code with Code Interpreter (bye bye Advanced Data Analysis, we don't miss ya). </p><p>GPTs also have memory: you can upload a bunch of documents (and your users can as well) and the GPT will vectorize them and extract the relevant information out of those documents. So think: your personal tax assistant that has all 3 years of your tax returns</p><p>And they have eyes, GPT4-V is built in, so you can drop in screenshots, images and all kinds of combinations of things.</p><p>Additionally, you can define actions (similar to how Plugins were developed previously, via an OpenAPI schema) and the GPT will be able to use those actions to do tasks outside of the GPT context, like send emails, check stuff in your documentation and much more; pretty much anything that's possible via API is now possible via actions. </p><p>One big thing that's missing for me: GPTs are reactive, so they won't reach out to you or your user when there's something new, like a new email to summarize or a completed task, but I'm sure OpenAI will close that gap at some point. </p><p>GPTs are not Assistants; they are similar but not the same, and it's quite confusing. GPTs are created online and then shareable with links. </p><p>Speaking of which, I created a GPT that uses several of the available tools (browsing for real-time weather info and date/time) and generates on-the-fly, never-seen-before weather art for everyone. 
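To make the actions idea concrete, here's a minimal, hypothetical OpenAPI schema for a weather-lookup action, written as a Python dict for readability. The server URL, endpoint and operation are made up for illustration; the point is that actions are declared in the same OpenAPI format Plugins used, and the GPT decides when to call them based on the operation names and descriptions.

```python
# Hypothetical OpenAPI schema for a custom GPT "action".
# The server URL and /weather endpoint are illustrative placeholders,
# not a real API. The GPT reads operationId, summaries and parameter
# schemas to decide when (and how) to call the action.
weather_action = {
    "openapi": "3.0.0",
    "info": {"title": "Weather Lookup", "version": "1.0.0"},
    "servers": [{"url": "https://api.example.com"}],
    "paths": {
        "/weather": {
            "get": {
                "operationId": "getCurrentWeather",
                "summary": "Get current conditions for a city",
                "parameters": [
                    {
                        "name": "city",
                        "in": "query",
                        "required": True,
                        "schema": {"type": "string"},
                    }
                ],
                "responses": {"200": {"description": "Current conditions"}},
            }
        }
    },
}
```

A schema like this (pasted into the GPT builder as JSON or YAML) is all the GPT needs; when a user asks about the weather, it can call getCurrentWeather with the right query parameters on its own.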
It's really fun to play with, let me know what you think (<a target="_blank" href="https://x.com/altryne/status/1722498086256411075?s=20">HERE</a>); the image above was generated by the <a target="_blank" href="https://x.com/altryne/status/1722498086256411075?s=20">Visual Weather GPT</a></p><p>Unified "All tools" mode for everyone (who pays)</p><p>One tiny thing that Sam mentioned on stage is in fact huge IMO: the removal of the mode selector in ChatGPT. All premium users now have access to one interface that is multimodal on input and output (I call it MMIO) - this mode understands images (vision) + text on input, and can browse the web and generate images, text and graphs (as it runs code) on the output. </p><p>This is a significant capabilities upgrade for the many folks who use these tools but previously had to choose between DALL-E mode and Browse or Code Interpreter mode. The model now intelligently selects which tool to use for a given task, and this means more and more "generality" for the models, as they learn and gain new capabilities in the form of tools. </p><p>This, in addition to a <strong>MASSIVE 128K</strong> context window, means that ChatGPT has been significantly upgraded, and you still pay $20/mo 👏 Gotta love that!</p><p>Assistants API (OpenAI Agents)</p><p>This is the big announcement for developers: we all got access to a new and significantly improved Assistants API, which improves our experience in several areas: </p><p>* <strong>Creating Assistants</strong> <strong>- </strong>Assistants are OpenAI's first foray into the world of AGENTS, and it's quite exciting! You can create an assistant via API (not quite the same as GPTs, we'll cover the differences later), each with its own set of instructions (that you don't have to pass each time with the prompt), tools like code interpreter and retrieval, and functions. 
Also, you can select models, so you don't have to use the new GPT-4-turbo (but you should!)</p><p>* <strong>Code Interpreter </strong>- Assistants are able to write and execute code now, which is a whole world of excitement! Having code abilities (executed in a safe environment on OAI's side) is a significant boost in many regards, and many tasks require bits of code "on the fly", for example time-zone tasks. You no longer have to write that code yourself; you can ask your assistant</p><p>* <strong>Retrieval</strong> - OpenAI (and apparently <a target="_blank" href="https://twitter.com/altryne/status/1721989500291989585">QDrant</a>!) have given all developers built-in RAG (retrieval augmented generation) capabilities plus document uploading and understanding. You can upload files like documentation via the API, or let your users upload files, and it will parse and extract information out of them! This is another huge, huge thing; basically, memory is built in for you now</p><p>* <strong>Stateful API</strong> - this API introduces the concept of threads, where OpenAI manages the state of your conversation. You can assign 1 user per thread, then just send the responses back to the user and send the user's queries to the same thread. No longer do you have to send the whole history back and forth! It's quite incredible; however, it raises the question of pricing and calculating tokens. Per OpenAI (I asked), if you would like to calculate costs on the fly, you'd have to use the get-thread endpoint and then count the number of tokens already in the thread (and it can be a LOT, since the context length is now 128K tokens) </p><p>* <strong>JSON and better function calling</strong> - You can now set the API to respond in JSON mode, which is an incredible improvement for devs and something we could previously only approximate via functions. Functions got an upgrade too, with the ability to call multiple functions. 
Functions are added as "actions" in the assistant creation, so you can give your assistant abilities that it will execute by returning functions with the right parameters for you to call. Think: "set the mood" will return a function to call the smart lights, and "play" will return a function that calls the Spotify API</p><p>* <strong>Multiple Assistants can join a thread</strong> - you can create specific assistants that all join the same thread with the user, each with its own custom instructions, capabilities and tools</p><p>* <strong>Parallel Functions</strong> - this is also new: the Assistants API can now return several functions for you to execute, which could enable the creation of scenes. For example, in a smart home, you ask to "set the mood" and several functions are returned from the API: one that turns off the lights, one that starts the music, and one that turns on mood lighting. </p><p>Vision</p><p>GPT-4 Vision </p><p>Finally, it's here: multimodality for developers to implement, the moment I personally have been waiting for since GPT-4 was launched (and ThursdAI started) back on March 14 (240 days ago, but who's counting). </p><p>GPT-4 Vision takes images and text, and can handle many vision-related tasks: analysis, understanding, summarization, captioning. 
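A minimal sketch of what a vision request body looks like, assuming the launch-week preview model name and a placeholder image URL; the key bit is that the message "content" becomes a list mixing text parts and image_url parts.

```python
# Sketch of a GPT-4 Vision chat-completions request body.
# "gpt-4-vision-preview" was the model name at launch; the image URL is
# a placeholder. POST this as JSON to the chat completions endpoint
# (with your API key) to get a text answer about the image.
def build_vision_request(question: str, image_url: str) -> dict:
    return {
        "model": "gpt-4-vision-preview",
        "messages": [
            {
                "role": "user",
                # content is a LIST of parts, mixing text and images
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": 300,
    }

req = build_vision_request(
    "What's happening in this frame?",
    "https://example.com/frame_001.jpg",
)
```

The same shape extends to frame-by-frame video analysis: append multiple image_url parts to a single message, one per extracted frame.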
Many folks are splitting videos frame by frame and analyzing whole videos already (in addition to whispering the video to get what's said) </p><p>Hackers and developers like friend of the pod Robert created quick hacks like a <a target="_blank" href="https://x.com/Karmedge/status/1721777152658444773?s=20">browser extension</a> that lets you select any screenshot on the page and ask GPT-4 Vision things about it; another friend of the pod, SkalskiP, created a <a target="_blank" href="https://x.com/skalskip92/status/1722031692347793575?s=20">hot dog classifier Gradio</a> space 😂 and is maintaining an awesome list of vision experiments on <a target="_blank" href="https://github.com/roboflow/awesome-openai-vision-api-experiments">Github</a></p><p>Voice</p><p>Text to speech models</p><p>OpenAI decided to help us all build agents properly, and agents need not only ears (for which they gave us Whisper, and released V3 as well) but also a voice. We finally got TTS from OpenAI: 6 very beautiful, emotional voices that you can use easily and cheaply. You can't generate more or clone voices yet (that's only for friends of OpenAI like Spotify and others), but you can use the 6 we got (plus a secret pirate one they apparently trained but never released!) </p><p>They sound ultra-realistic and are multilingual as well; just pass different languages and voila. Friend of the pod Simon Willison created a <a target="_blank" href="https://simonwillison.net/2023/Nov/7/ospeak/">quick CLI tool called ospeak</a> to pipe text into, and it'll use your OAI key to read that text out with those super nice voices! </p><p>Whisper v3 was released! 
</p><p><a target="_blank" href="https://github.com/openai/whisper/discussions/1762">https://github.com/openai/whisper/discussions/1762</a></p><p>The large-v3 model shows improved performance over a wide variety of languages, and the plot below includes all languages where Whisper large-v3 performs lower than 60% error rate on Common Voice 15 and Fleurs, showing 10% to 20% reduction of errors compared to large-v2:</p><p>HUMANE</p><p><strong>Humane AI Pin is ready for pre-order at $699</strong></p><p>The HUMANE Pin was finally announced, and here is the breakdown: they have a clever way to achieve "all-day battery life" with a hot-swap system, a magnetic booster that you swap out when you get low on battery (pretty genius TBH)</p><p>It's passive, so it's not "always listening", but there is apparently a wake word, and you can activate it by touch. Runs on the T-Mobile network (which sucks for folks like me where T-Mobile just doesn't have reception in their neighborhood 😂 )</p><p>No apps, just AI experiences powered by OpenAI, with a laser-projected UI on your hand, and voice controls</p><p>AI voice input will allow interactions like asking for information (it has browsing) and, judging from the demo, is SIGNIFICANTLY better than "Siri" or "Ok Google": rewriting your messages for you, catching you up on multiple messages and even searching through them! You can ask for retrieval from previous messages</p><p>The Pin is multimodal: voice input and vision</p><p>Holding the microphone tab while someone's speaking to you in a different language will automatically translate that language for you, and then translate you back into that language with your own intonation! 
Bye bye language barriers!</p><p>And with vision, you can do things like tracking calories by showing it what you ate, or buying things you're seeing in the store, but online, and taking pictures and videos that are then stored, transcribed, in your personal AI memory</p><p>Starting at $699, with a $24/mo payment that comes with unlimited AI queries, storage and service (again, just T-Mobile), a Tidal music subscription and more.</p><p>I think it's lovely that someone is trying to take on the Google/Apple duopoly with a completely re-imagined AI device, and I can't wait to pre-order mine and test it out. It will be an interesting balancing act with 2 phone numbers, but also a monthly payment that basically makes the device useless if you stop paying.</p><p>Phew, this was a big update, not to mention there's a whole 2-hour podcast I want you to listen to on top of this. Thank you for reading, for subscribing, for participating in the community; I can't wait to finally relax after this long week (still jet-lagged) and prepare for my new Monday! </p><p>I want to send a heartfelt shoutout to my friend <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> who not only lets me on to <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a> from time to time (including the last recap emergency pod), but is also my lifeline to SF, where everything happens! Thanks man, I really appreciate all you did for me and ThursdAI 🫡</p><p>Can't wait to see you all on the next ThursdAI, and as always, comments and congratulations are welcome as replies or DMs; send me the 🎉 for this one, I'd really appreciate it! </p><p>— Alex </p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/nov-09</link><guid isPermaLink="false">substack:post:138742667</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 09 Nov 2023 23:52:23 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/138742667/d3b7f07c4edc0fb2b479611a11ae29e9.mp3" length="111675445" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6980</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/138742667/240ca6612794d38bc331981d786422dc.jpg"/></item><item><title><![CDATA[📅 ThursdAI Nov 02 - ChatGPT "All Tools", Bidens AI EO, many OSS SOTA models, text 2 3D, distil-whisper and more AI news 🔥]]></title><description><![CDATA[<p>ThursdAI November 2nd</p><p>Hey everyone, welcome to yet another exciting ThursdAI. This week we have a special announcement: my co-host and I will be hosting a shared X space live from OpenAI Dev Day! Monday next week (and we will likely follow up with interviews, analysis and potentially a shared episode!)</p><p>Make sure you set a reminder on X (<a target="_blank" href="https://thursdai.news/next?rel=openai">https://thursdai.news/next</a>); we’re going to open the live stream early, 8:30am on Monday, and we’ll live stream all throughout the keynote! 
It’ll be super fun!</p><p>Back to our regular schedule: we covered a LOT of stuff today, and again we were lucky enough to have BREAKING NEWS and the authors of said breaking news (VB from HuggingFace and Emozilla from Yarn-Mistral-128K) join us and talk a little bit in depth about their updates!</p><p></p><p>[00:00:34] Recap of Previous Week's Topics</p><p>[00:00:50] Discussion on AI Embeddings</p><p>[00:01:49] Gradio Interface and its Applications</p><p>[00:02:56] Gradio UI Hosting and its Advantages</p><p>[00:04:50] Introduction of Baklava Model</p><p>[00:05:11] Zenova's Input on Distilled Whisper</p><p>[00:10:32] AI Regulation Week Discussion</p><p>[00:24:14] ChatGPT new All Tools mode (aka MMIO)</p><p>[00:35:45] Discussion on Multimodal Input and Output Models</p><p>[00:36:55] BREAKING NEWS: Mistral YaRN 7B - 128K context window</p><p>[00:37:02] Announcement of Mistral Yarn Release</p><p>[00:46:47] Exploring the Limitations of Current AI Models</p><p>[00:47:25] The Potential of Vicuna 16k and Memory Usage</p><p>[00:49:43] The Impact of Apple's New Silicon on AI Models</p><p>[00:51:23] Introduction to New Models from Nous Research</p><p>[00:51:39] The Future of Long Context Inference</p><p>[00:53:42] Exploring the Capabilities of Obsidian</p><p>[00:54:29] The Future of Multimodality in AI</p><p>[00:58:48] The Exciting Developments in CodeFusion</p><p>[01:06:49] The Release of the Red Pajama V2 Dataset</p><p>[01:12:07] The Introduction of Luma's Genie</p><p>[01:16:37] Discussion on 3D Models and Stable Diffusion</p><p>[01:17:08] Excitement about AI Art and Diffusion Models</p><p>[01:17:48] Regulation of AI and OpenAI Developments</p><p>[01:18:24] Guest Introduction: VB from Hugging Face</p><p>[01:18:53] VB's Presentation on Distilled Whisper</p><p>[01:21:54] Discussion on Distillation Concept</p><p>[01:27:35] Insanely Fast Whisper Framework</p><p>[01:32:32] Conclusion and Recap</p><p></p><p>Show notes and links:</p><p>* <strong>AI Regulation</strong></p><p>* Biden 
Executive Order on AI was signed (<a target="_blank" href="https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/">Full EO</a>, <a target="_blank" href="https://www.aisnakeoil.com/p/what-the-executive-order-means-for">Deep dive</a>)</p><p>* UK AI regulation forum (<a target="_blank" href="https://x.com/BloombergUK/status/1719686487858364849?s=20">King AI speech, no really</a>, <a target="_blank" href="https://x.com/arthurmensch/status/1720166234090512583?s=20">Arthur from Mistral</a>)</p><p>* Mozilla - Joint statement on AI and openness (<a target="_blank" href="https://open.mozilla.org/letter/">Sign the letter</a>)</p><p>* <strong>Open Source LLMs</strong></p><p>* Together AI releases RedPajama 2, a 25x larger dataset (30T tokens) (<a target="_blank" href="https://together.ai/blog/redpajama-data-v2">Blog</a>, <a target="_blank" href="https://x.com/togethercompute/status/1719041744371884537?s=20">X</a>, <a target="_blank" href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">HF</a>)</p><p>* Alignment Lab - OpenChat-3.5, a ChatGPT-beating open source model (<a target="_blank" href="https://huggingface.co/openchat/openchat_3.5">HF</a>)</p><p>* Emozilla + Nous Research - Yarn-Mistral-7b-128k (and 64K), the longest context window (<a target="_blank" href="https://x.com/theemozilla/status/1720107186850877662?s=20">Announcement</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k">HF</a>)</p><p>* LDJ + Nous Research release Capybara 3B & 7B (<a target="_blank" href="https://x.com/ldjconfirmed/status/1718912501998289287?s=20">Announcement</a>, <a target="_blank" href="https://huggingface.co/NousResearch/Obsidian-3B-V0.5">HF</a>)</p><p>* LDJ - Obsidian 3B - the smallest open source multimodal model (<a target="_blank" href="https://huggingface.co/NousResearch/Obsidian-3B-V0.5">HF</a>, <a target="_blank" 
href="https://huggingface.co/nisten/obsidian-3b-multimodal-q6-gguf">Quantized</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* ChatGPT "all tools" MMIO mode - combines vision, browsing, ADA and DALL-E into 1 model (<a target="_blank" href="https://twitter.com/altryne/status/1718469565431062977">Thread</a>, <a target="_blank" href="https://x.com/ldjconfirmed/status/1718468139845812474?s=20">Examples</a>, <a target="_blank" href="https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExpert/main/_system-prompts/all_tools.md">System prompt</a>)</p><p>* Microsoft CodeFusion paper - a tiny (75M parameter) model beats a 20B GPT-3.5-turbo (<a target="_blank" href="https://x.com/far__el/status/1718978907142078900?s=20">Thread</a>, <a target="_blank" href="https://arxiv.org/abs/2310.17680">ArXiv</a>)</p><p>* <strong>Voice</strong></p><p>* Hugging Face - Distil-Whisper - a 2x smaller, English-only version of Whisper (<a target="_blank" href="https://x.com/iscienceluvr/status/1719900807574020478?s=46">X</a>, <a target="_blank" href="https://arxiv.org/abs/2311.00430">paper</a>, <a target="_blank" href="https://github.com/huggingface/distil-whisper">code</a>)</p><p>* <strong>AI Art & Diffusion & 3D</strong></p><p>* Luma - text-to-3D Genie bot (<a target="_blank" href="https://x.com/LumaLabsAI/status/1719792922646987152?s=20">Announcement</a>, <a target="_blank" href="https://lumalabs.ai/genie">Try it</a>)</p><p>* Stable 3D & Sky changer</p><p>AI Regulation IS HERE</p><p>Look, to be very frank, I want to focus ThursdAI on all the news we get from week to week and bring a positive outlook, so politics, doomerism and regulation weren't on the roadmap. However, with weeks like these, it's really hard to ignore, so let's talk about it.</p><p>President Biden signed an Executive Order, citing the old, wartime-era Defense Production Act (looks like the US gov. also has "one weird trick" to make the government move faster), and it wasn't as bombastic as people thought. 
X being X, there were so many takes before this executive order was even released, about regulatory capture by the big AI labs and about how open source is no longer going to be possible, and if you visit Marc Andreessen's feed you'll see he's only reposting AI-generated memes to the tune of "don't tread on me" about GPU and compute rights.</p><p>However, at least on the face of it, the executive order was mild; it discussed many AI risks and focused on regulating models from huge compute runs (~28M H100 hours // $50M worth). Here's the relevant section.</p><p>Many in the open source community <a target="_blank" href="https://x.com/ClementDelangue/status/1719020491929682207?s=20">reacted</a> to the FLOPs limitation with the response that it's very much a lobbyist-driven decision, and that applications should be regulated, not only compute.</p><p>There's much more to say about the EO; if you want to dig deeper, I strongly recommend this piece from AI Snake Oil:</p><p>and check out Yann LeCun's <a target="_blank" href="https://twitter.com/ylecun">whole feed</a>.</p><p>UK AI safety summit in Bletchley Park</p><p>Look, did I ever expect to include the King of England in an AI weekly recap newsletter? Maybe if he were AI-generated or something, but no, this is the real king, addressing the topic of AI safety!</p><p>This video was played for the attendees of a multi-day AI safety summit in Bletchley Park, where AI luminaries (<a target="_blank" href="https://twitter.com/ylecun/status/1720174065338511755">Yann LeCun</a>, <a target="_blank" href="https://x.com/RishiSunak/status/1720187297558065441?s=20">Elon Musk</a>, <a target="_blank" href="https://twitter.com/arthurmensch/status/1720166234090512583">Mistral CEO Arthur Mensch</a>, <a target="_blank" href="https://twitter.com/NaveenGRao/status/1720008907429425447">Naveen Rao</a>) attended and talked about the risks and benefits of AI and regulation. 
I think Naveen Rao had a great recap <a target="_blank" href="https://twitter.com/NaveenGRao/status/1720008907429425447">here</a>, but additionally, there were announcements about a Safety Institute in the UK, and they outlined what actions the government can take.</p><p>In other regulation-related news, Mozilla has a joint statement on AI safety and openness (<a target="_blank" href="https://x.com/natolambert/status/1719821847565578435?s=20">link</a>) that many signed, which makes the case for openness and open source as the way to AI safety. Kudos to Mozilla; we stand by the letter 🤝</p><p>Big CO LLMs + APIs</p><p>OpenAI - ChatGPT "all tools" aka MMIO mode (that's now dubbed "confidential")</p><p>Just a week before the first Dev Day from OpenAI, we were hanging out in X spaces talking about what the regulation might bring, when a few folks noticed that their ChatGPT interface looked different, and saw a very specific popup message saying that ChatGPT can now talk with documents and "<strong>use tools without switching</strong>" - see and interact with DALL-E and Advanced Data Analysis (FKA Code Interpreter) all in one prompt.</p><p>While many X takes <a target="_blank" href="https://twitter.com/altryne/status/1718639938416079341">focused solely</a> on just how many "chat with your PDF" startups OpenAI just "killed" (and indeed, the "work with PDFs" functionality seemed new: ChatGPT could now accept file uploads, search within them, go to a specific page, even do a basic summary of PDF files), I was interested in the second part!</p><p>Specifically because, with GPT-4V now basically enabled for everyone, this "combined" mode makes ChatGPT the first MMIO model that we have: multimodal on input (text, voice, images) and output (text, images).
You see, most multimodal models so far have been multimodal only on the input, i.e., they take in text or images or a combination, and while playing around with the above, we noticed some incredible use-cases that are now available!</p><p><strong>ChatGPT (for some lucky folks) can now do all these things in one prompt with shared context:</strong></p><p>* <strong>Read and interact with PDFs</strong></p><p>* <strong>See and understand images + text</strong></p><p>* <strong>Browse & Search up to date info with Bing</strong></p><p>* <strong>Write and execute code with ADA</strong></p><p>* <strong>Generate images with DALL-E</strong></p><p><strong>All in the same prompt</strong>, one after another, and often for several steps and iterations.</p><p>One such example: I asked it to "get the current weather in Denver and generate an image based on the conditions" and we got this incredible, almost on-the-fly "weather" UI, showing the conditions (it was the first snow in CO this year), weather, humidity and everything.
Now, DALL-E is OK with text but not great, but it's incredible with scenery, so having this "on the fly UI" with real-time info was a super great way to show off the capabilities of a general model.</p><p>We also saw prompts from folks who uploaded a picture of an obscure object and asked DALL-E to "add" this object to an already generated image. So DALL-E now has eyes, and can understand and "draw" some of these objects into other images, which was an amazing thing to see, and I can't wait to play around with this functionality.</p><p>We noticed a few more things, specifically that DALL-E images are now stored on the same disk that you get access to with ADA, so you can then ask ChatGPT to upscale, crop and do things with those images, for example, or generate code with those images as a background!</p><p>So many new potential use-cases have opened up that we spent a long evening / night on X spaces trying to throw the kitchen sink at this mode, fearing that it was a fluke by OpenAI and they weren't meant to release it - and we were right! Today on the ThursdAI live recording, some users reported that they no longer have access to it (and they miss it!), and some reported that it's now called something like "Confidential".</p><p>Someone also leaked the full prompt for this "all tools" mode and it's a doozy! The "All Tools" omni-prompt takes a whopping 2,756 tokens, but it's also using the <strong>GPT-4 32k model</strong>, with a 32,767 token context window.
(<a target="_blank" href="https://raw.githubusercontent.com/spdustin/ChatGPT-AutoExpert/main/_system-prompts/all_tools.md">link</a>)</p><p>I guess we're going to see the announcement on Dev Day (did you <a target="_blank" href="https://thursdai.news/next?rel=openai">set a reminder</a>?)</p><p>This new mode that we saw and played with, added to the many, many leaks and semi-confirmed modes coming out of Reddit, makes it seem like ChatGPT is going to have an all-out birthday party next week and is about to blow some people's minds! We're here for it! 👏</p><p>Open Source LLMs</p><p>CodeFusion - a 75M-parameter model based on diffusion for <strong>Code</strong> Generation</p><p>CodeFusion claimed that GPT-3.5-turbo is 20B parameters, and the paper was then taken down from ArXiv with a note that this claim was unsubstantiated (<a target="_blank" href="https://x.com/iScienceLuvr/status/1718817117213163807?s=20">X link</a>)</p><p>The paper itself discusses using diffusion to generate code, needing much less data to reach a very good coding level, with a model small enough to fit in a chip's cache (not even main memory) and be very, very fast.
Of course, this is only theoretical and we're going to wait a while to see if it replicates, especially since the PDF was taken down because the 20B-parameter note was attributed to a Forbes article.</p><p>The size of the model, and its performance on some coding tasks, make me very excited about a tiny-models-on-edge/local future!</p><p>I find the parameter obsession folks have with OpenAI models incredible, because parameter size really doesn’t matter - it's a bad estimate anyway; OpenAI can train their models for years while keeping them at the same parameter count, and they would be vastly different models at the start and finish!</p><p>Together releases a massive 30T-token dataset - RedPajama-Data-v2 (<a target="_blank" href="https://twitter.com/togethercompute/status/1719041744371884537">Announcement</a>, <a target="_blank" href="https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2">HF</a>)</p><p>This massive dataset is 25x the previous RedPajama, and is completely open, deduplicated, and holds an enormous wealth of data to train models from. For folks who were talking their "there are no more tokens" book, this came as a surprise for sure! It's also multilingual, with tokens in English, French, Italian, German and Spanish.
Kudos to Together Compute for this massive open source effort 👏</p><p>Open source Finetunes Roundup</p><p>This week was another crazy one for open source fine-tuners, releasing SOTA after SOTA, many of them on ThursdAI itself 😅 Barely possible to keep up (and that's quite literally my job!)</p><p>Mistral 7B - 128K (and 64K) (<a target="_blank" href="https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k">HF</a>)</p><p>The same folks who brought you the YaRN paper - Emozilla, Bowen Peng and Enrico Shippole (frequent friends of the pod; we had quite a few conversations with them in the past) - have released the longest-context Mistral fine-tune, able to take 64K and a whopping 128K tokens in its context length, making one of the best open source models compatible with book-length prompts and very, very long memory!</p><p>Capybara + Obsidian (<a target="_blank" href="https://huggingface.co/NousResearch/Obsidian-3B-V0.5">HF</a>, <a target="_blank" href="https://huggingface.co/nisten/obsidian-3b-multimodal-q6-gguf">Quantized</a>)</p><p>Friend of the pod (and weekly co-host) LDJ releases two Nous Research models: Capybara (trained on StableLM 3B and Mistral 7B) and Obsidian, the first vision-enabled multimodal 3B model that can run on an iPhone!</p><p>Capybara is a great dataset that he compiled, and the Obsidian model uses the LLaVA architecture for input multimodality and even shows some understanding of humor in images!</p><p>Alignment Lab - OpenChat-3.5, a chatGPT-beating open source model (<a target="_blank" href="https://twitter.com/alignment_lab/status/1720135417754702217">Announcement</a>, <a target="_blank" href="https://huggingface.co/openchat/openchat_3.5">HF</a>)</p><p>According to friends of the pod Alignment Lab (of OpenOrca fame), we get a Mistral finetune that beats
ChatGPT on many code-based evaluations (from March; we all think ChatGPT has become much better since then).</p><p>OpenChat is by nature a conversationally focused model, optimized to provide a very high quality user experience in addition to performing extremely well on reasoning benchmarks.</p><p>Open source is truly unmatched, and in the face of a government-regulation week, open source is coming out in full force!</p><p>Voice</p><p>HuggingFace Distil-Whisper - 6x the speed of Whisper, within 1% WER (<a target="_blank" href="https://twitter.com/sanchitgandhi99/status/1719409022246220184">Announcement</a>, <a target="_blank" href="https://huggingface.co/distil-whisper/distil-large-v2">HF</a>)</p><p>The Hugging Face folks released a distillation of Whisper: a process (and a paper) in which they use a "teacher" model, like the original OpenAI Whisper, to "teach" a smaller "student" model, transferring capabilities from one to the other while making the model smaller!</p><p>This yields a significantly smaller model (2x smaller) with comparable (and on some use-cases even better) performance, while being 6x faster!</p><p>Distil-Whisper is now included in the latest transformers (and transformers.js) releases, and you can start using this faster Whisper today! 👏</p><p>That's it for today folks; it's been a busy, busy week, and many more things were announced, so make sure to join our space. And if you have read all the way to here, reply or DM me the 🧯 emoji - it's how I know who the most engaged readers are!</p><p></p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p></p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-nov-02-chatgpt-all-tools</link><guid isPermaLink="false">substack:post:138537392</guid><dc:creator><![CDATA[Alex Volkov, Luigi Daniele, Nisten, and Xenova]]></dc:creator><pubDate>Fri, 03 Nov 2023 03:05:54 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/138537392/99c3a52bbd41bf012229ace55f605999.mp3" length="92184539" type="audio/mpeg"/><itunes:author>Alex Volkov, Luigi Daniele, Nisten, and Xenova</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5761</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/138537392/f220dae8d3d83f7867e1e70bea00cc85.jpg"/></item><item><title><![CDATA[📅 ThursdAI Oct-26, Jina Embeddings SOTA, Gradio-Lite, Copilot crossed 100M paid devs, and more AI news]]></title><description><![CDATA[<p>ThursdAI October 26th</p><p>Timestamps and full transcript for your convenience</p><p>## [00:00:00] Intro and brief updates</p><p>## [00:02:00] Interview with Bo Wang, author of Jina Embeddings V2</p><p>## [00:33:40] Hugging Face open sourcing a fast Text Embeddings</p><p>## [00:36:52] Data Provenance Initiative at dataprovenance.org</p><p>## [00:39:27] LocalLLama effort to compare 39 open source LLMs +</p><p>## [00:53:13] Gradio Interview with Abubakar, Xenova, Yuichiro</p><p>## [00:56:13] Gradio effects on the open source LLM ecosystem</p><p>## [01:02:23] Gradio local URL via Gradio Proxy</p><p>## [01:07:10] Local inference on device with Gradio - Lite</p><p>## [01:14:02] Transformers.js integration with Gradio-lite</p><p>## [01:28:00] Recap and bye bye</p><p>Hey everyone, welcome to ThursdAI, this is Alex Volkov, I'm very happy to bring you another weekly installment of 📅 ThursdAI.</p><p>ThursdAI - Recaps of the
most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p>TL;DR of all topics covered:</p><p>* <strong>Open Source LLMs</strong></p><p>* JINA - jina-embeddings-v2 - First OSS embeddings models with 8K context (<a target="_blank" href="https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/?123">Announcement</a>, <a target="_blank" href="https://huggingface.co/jinaai/jina-embeddings-v2-base-en?ref=jina-ai-gmbh.ghost.io">HuggingFace</a>)</p><p>* Simon Willison guide to Embeddings (<a target="_blank" href="https://simonwillison.net/2023/Oct/23/embeddings/">Blogpost</a>)</p><p>* Hugging Face - Text embeddings inference (<a target="_blank" href="https://twitter.com/jerryjliu0/status/1712943016590381554">X</a>,<a target="_blank" href="https://github.com/huggingface/text-embeddings-inference"> Github</a>)</p><p>* Data Provenance Initiative - public audit of 1800+ datasets (<a target="_blank" href="https://twitter.com/ShayneRedford/status/1717209456348434534">Announcement</a>)</p><p>* Huge open source LLM comparison from r/LocalLLama (<a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/17fhp9k/huge_llm_comparisontest_39_models_tested_7b70b/">Thread</a>)</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* NVIDIA research new spin on Robot Learning (<a target="_blank" href="https://blogs.nvidia.com/blog/2023/10/20/eureka-robotics-research/?utm_source=www.therundown.ai&#38;utm_medium=referral&#38;utm_campaign=nvidia-s-robot-autonomy-breakthrough">Announcement</a>, <a target="_blank" href="https://eureka-research.github.io/">Project</a>)</p><p>* Microsoft / Github - Copilot crossed 100 million paying users (<a target="_blank" href="https://twitter.com/ashtom/status/1717288896453579078">X</a>)</p><p>* RememberAll open source (<a target="_blank" href="https://github.com/reductoai/remembrall">X</a>)</p><p>* 
<strong>Voice</strong></p><p>* Gladia announces multilingual near-real-time Whisper transcriptions (<a target="_blank" href="https://x.com/altryne/status/1717540855437819939?s=20">X</a>, <a target="_blank" href="https://www.gladia.io/blog/real-time-transcription-powered-by-whisper-asr?utm_source=Twitter&#38;utm_medium=Blog&#38;utm_campaign=Blog_conversion&#38;utm_content=recall">Announcement</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Segmind releases SSD-1B - 50% smaller and 60% faster version of SDXL (<a target="_blank" href="https://blog.segmind.com/introducing-segmind-ssd-1b/">Blog</a>, <a target="_blank" href="https://huggingface.co/segmind/SSD-1B">Hugging Face</a>, <a target="_blank" href="https://www.segmind.com/models/ssd-1b">Demo</a>)</p><p>* <strong>Prompt techniques</strong></p><p>* How to use seeds in DALL-E to add/remove objects from generations (by - <a target="_blank" href="https://x.com/simonw/status/1717651997091082375?s=20">Thread</a>)</p><p></p><p>This week was a mild one in terms of updates; believe it or not, we didn't get a new state-of-the-art open source large language model this week. However, we did get a <strong>new state-of-the-art embeddings model from JinaAI (supporting 8K sequence length)</strong>.</p><p>We also had quite the quiet week from the big dogs: OpenAI is probably sitting on updates until Dev Day (which I'm going to cover for all of you, thanks to Logan for the invite), Google had some leaks about Gemini (we're waiting!) and another AI app builder thing, Apple is teasing new hardware (but nothing AI related) coming soon, and Microsoft / GitHub announced that <a target="_blank" href="https://x.com/ashtom/status/1717288896453579078?s=20">Copilot has 100 million paying users</a>! (I tweeted this and Idan Gazit, Sr.
Director of GitHub Next, where Copilot was born, tweeted that "we're literally just getting started" and <a target="_blank" href="https://x.com/altryne/status/1717339200930959379?s=20">mentioned November 8th</a> as... a date to watch, so mark your calendars for some craziness in the next two weeks)</p><p>Additionally, we covered the <strong>Data Provenance Initiative</strong>, which helps sort and validate licenses for over 1800 public datasets, a massive effort led by Shayne Redford with assistance from many folks including friend of the pod <a target="_blank" href="https://x.com/EnricoShippole/status/1717253993271873721?s=20">Enrico Shippole</a>. We also covered another <a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/17fhp9k/huge_llm_comparisontest_39_models_tested_7b70b/">massive evaluation effort</a> by a user named WolframRavenwolf on the LocalLLama subreddit, who evaluated and compared 39 open source models and GPT-4. Not surprisingly, the best model right now is the one we covered <a target="_blank" href="https://thursdai.news/oct-19">last week</a>: OpenHermes 7B from Teknium.</p><p>Two additional updates were covered. One of them is Gladia AI, which released their version of Whisper over web-sockets, and I covered it on <a target="_blank" href="https://x.com/altryne/status/1717540855437819939?s=20">X with a reaction video</a>; it allows developers to stream speech to text with very low latency, and it's multilingual as well, so if you're building an agent that folks can talk to, definitely give this a try. And finally, we covered SegMind's SSD-1B, a distilled version of SDXL, making it 50% smaller in size and 60% faster in generation speed (you can play with it <a target="_blank" href="https://www.segmind.com/models/ssd-1b">here</a>)</p><p>This week I was lucky to host 2 deep dive conversations, one with <a target="_blank" href="https://twitter.com/bo_wangbo">Bo Wang</a>, from <a target="_blank" href="https://twitter.com/jinaai_">Jina AI</a>, and we covered
embeddings, vector latent spaces, dimensionality, and how they retrained BERT to allow for longer sequence lengths. It was a fascinating conversation; even if you don't understand what embeddings are, it's well worth a listen.</p><p>And in the second part, I had the pleasure of having <a target="_blank" href="https://twitter.com/abidlabs">Abubakar Abid</a>, head of Gradio at Hugging Face, to talk about Gradio and its effect on the open source community, then joined by <a target="_blank" href="https://twitter.com/intent/follow?original_referer=https%3A%2F%2Fspacesdashboard.com%2F&#38;ref_src=twsrc%5Etfw%7Ctwcamp%5Ebuttonembed%7Ctwterm%5Efollow%7Ctwgr%5Ewhitphx&#38;screen_name=whitphx">Yuichiro</a> and <a target="_blank" href="https://twitter.com/xenovacom">Xenova</a> to talk about the next iteration of Gradio, called Gradio-lite, that runs completely within the browser, no server required.</p><p>A fascinating conversation: if you're a machine learning engineer, AI engineer, or just someone who is interested in this field, we covered a LOT of ground, including Emscripten, Python in the browser, Gradio as a tool for ML, WebGPU and much more.</p><p>I hope you enjoy this deep dive episode with two authors of this week's updates, and hope to see you in the next one.</p><p>P.S - if you've been participating in the emoji of the week, and have read all the way up to here, your emoji of the week is 🦾, please reply or DM me with it 👀</p><p></p><p>Timestamps and full transcript for your convenience</p><p>## [00:00:00] Intro and brief updates</p><p>## [00:02:00] Interview with Bo Wang, author of Jina Embeddings V2</p><p>## [00:33:40] Hugging Face open sourcing a fast Text Embeddings</p><p>## [00:36:52] Data Provenance Initiative at dataprovenance.org</p><p>## [00:39:27] LocalLLama effort to compare 39 open source LLMs +</p><p>## [00:53:13] Gradio Interview with Abubakar, Xenova, Yuichiro</p><p>## [00:56:13] Gradio effects on the open source LLM ecosystem</p><p>## [01:02:23]
Gradio local URL via Gradio Proxy</p><p>## [01:07:10] Local inference on device with Gradio - Lite</p><p>## [01:14:02] Transformers.js integration with Gradio-lite</p><p>## [01:28:00] Recap and bye bye</p><p>Full Transcription:</p><p>[00:00:00] <strong>Alex Volkov:</strong> Hey, everyone. Welcome to ThursdAI. My name is Alex Volkov, and I'm very happy to bring you another weekly installment of ThursdAI. This week was actually a mild one in terms of updates, believe it or not. We didn't get a new state-of-the-art open source large language model this week. However, we did get a new state-of-the-art embeddings model. And we're going to talk about that. We got very lucky that one of the authors of this embeddings model, called Jina Embeddings V2, Bo Wang, joined us on stage and gave us a masterclass in embeddings and shared some very interesting things, including some stuff they haven't shared yet. So definitely worth a listen. Additionally, we covered the Data Provenance Initiative that helps sort and validate licenses for over 1800 public datasets. A massive effort led by Shayne Redford with assistance from many folks, including friend of the pod Enrico Shippole.</p><p>[00:01:07] We also covered the massive effort by another user named WolframRavenwolf on the LocalLLama subreddit. Uh, that effort evaluated and compared 39 open source models ranging from 7 billion parameters to 70 billion parameters and threw in the GPT-4 comparison as well. Not surprisingly, the best model right now is the one we covered last week from friend of the pod Teknium, called OpenHermes 7B.</p><p>[00:01:34] Two additional updates we've covered. One of them is Gladia AI, a company that offers transcription and translation APIs, released their version of Whisper over WebSockets. So live transcription, and I covered it on X with a reaction video. And I'll add that link in the show notes. It allows developers like you to stream speech to text.
With very low latency and high quality, and it's multilingual as well. So if you're building an agent that your users can talk to, definitely give this a try. And finally, Segmind, a company that just decided to open source a distilled version of SDXL, making it 50% smaller in size and, in addition to that, 60% faster in generation speed. The links to all these will be in the show notes.</p><p>[00:02:23] But this week I was lucky to host two deep dives, one with Bo Wang, which I mentioned. Uh, we covered embeddings, vector latent spaces, dimensionality, and how they retrained a BERT model to allow for a longer sequence length. It was a fascinating conversation. Even if you don't understand what embeddings are, it's well worth the listen. And I learned a lot; now I hope you will as well. In the second part, I had the pleasure to have Abubakar Abid, the head of Gradio at Hugging Face, to talk about Gradio. What is it? Uh, its effect on the open source community. And then joined by Yuichiro and Xenova to talk about the next iteration of Gradio, called Gradio-lite, that runs completely within the browser. No server required. We also covered a bit of what's coming to Gradio in the next release, on October 31st.</p><p>[00:03:15] A fascinating conversation. If you're a machine learning engineer, AI engineer, or just somebody who's interested in this field, you've probably used Gradio: even if you haven't written any Gradio apps, every model on Hugging Face usually gets a Gradio demo.</p><p>[00:03:30] And we covered a lot of ground, including Emscripten, Python in the browser, Gradio as a tool for machine learning, WebGPU, and so much more.</p><p>[00:03:38] Again, a fascinating conversation. I hope you enjoy this deep dive episode. Um, I'm humbled by the fact that sometimes the people who produced the updates we cover actually come to ThursdAI and talk to me about the things they released.
And I hope this trend continues, and I hope you enjoy this deep dive episode. And, um, I'll see you in the next one. And now I give you ThursdAI, October 26.</p><p>Oh, awesome. It looks like Bo, you joined us. Let's see if you're connected to the audience, and can you unmute yourself, so we can see if we can hear you?</p><p>[00:04:22] <strong>Bo Wang:</strong> Hi, can you hear me? Oh, we can hear you fine, awesome. This, this, this feature of, of Twitter.</p><p>[00:04:30] <strong>Alex Volkov:</strong> That's awesome. This, this usually happens: folks join and it's their first space and then they can't leave us. And so let me just do a little, maybe... Maybe, actually, maybe you can do it, right? Let me just have you present yourself.</p><p>[00:04:42] I think I followed you a while ago, because I've been mentioning embeddings and the MTEB leaderboard on Hugging Face for a while. And, obviously, embeddings are not a new concept, right? We started with Word2Vec ten years ago, but now, with the rise of LLMs and AI tools, and many people wanting to understand the similarity between a user query and an actual thing they stored in some database, embeddings have seen a huge boom.</p><p>[00:05:10] And also we've seen all the vector databases pop up like mushrooms after the rain. I think Spotify just released a new one. And my tweet was like, hey, do we really need another vector database? But Bo, I think I started following you because you mentioned that you were working on something that's...</p><p>[00:05:25] It's coming very soon, and finally this week it was released. So actually, thank you for joining us, Bo, and thank you for doing your first ever Twitter space. How about we start with your introduction of who you are and how you're involved with this effort, and then we can talk about Jina.</p><p>[00:05:41] <strong>Bo Wang:</strong> Yes, sure. Basically I have a very different background.
I guess I was originally from China, but my bachelor was more related to text retrieval. I have a retrieval background rather than a pure machine learning background, I would say. Then I came to Europe. I came to the Netherlands like seven or eight years ago as a, as an international student.</p><p>[00:06:04] And I was really, really lucky and met my supervisor there. She basically guided me into the world of multimedia information retrieval, multimodal information retrieval, this kind of thing. And that was around 2015 or 2016. So I also picked up machine learning there, because when I was doing my bachelor, it was not really hot at that moment.</p><p>[00:06:27] It's like 2013, 2014. Then machine learning became really good. And then I was really motivated: okay, how can I apply machine learning to, to search? That is, that is my biggest motivation. So when I was doing my master, I, I collaborated with my friends in, in, in the US, in China, in Europe. We started with a project called MatchZoo.</p><p>[00:06:51] And at that time, embedding-based search was just nothing. We basically built an open source software that became at that time the standard of neural retrieval or neural search, this kind of thing. Then when BERT got released, our project basically got quiet, because everyone's focus basically shifted to BERT, but it's quite interesting.</p><p>[00:07:16] Then I graduated and started to work as a machine learning engineer for three years in Amsterdam. Then I moved to Berlin and joined Jina AI three years ago as a machine learning engineer. Then basically always doing neural search, vector search, how to use machine learning to improve search. That is my biggest motivation.</p><p>[00:07:37] That's it.</p><p>[00:07:38] <strong>Alex Volkov:</strong> Awesome. Thank you. And thank you for sharing with us and, and coming up. And Jina
AI is the company that you're now working at, and the embeddings model that we're going to talk about is from Jina AI. I will just mention the one thing that I missed in my introduction: the reason why embeddings are so hot right now.</p><p>[00:07:53] The reason why vector DBs are so hot right now is that pretty much everybody does RAG, Retrieval Augmented Generation. And obviously, for that, you have to store some information in embeddings, you have to do some retrieval, you have to figure out how to do chunking of your text, you have to figure out how to do the retrieval, like all these things.</p><p>[00:08:10] Many people understand that, whether or not in-context learning is this incredible thing for LLMs and you can do a lot with it, you may not want to spend as many tokens from your allowance, right? Or you may not have enough context window in some other LLMs. So embeddings are one of the main ways to interact with these models right now, which is RAG.</p><p>[00:08:33] And I think we've covered open source embeddings compared to OpenAI's Ada-002 embedding model a while ago, on ThursdAI. And I think it's been clear that models like GTE and BGE, I think those are the top ones, at least before you guys released, on the Hugging Face big embedding model kind of leaderboard - and thank you Hugging Face for doing this leaderboard.</p><p>[00:09:02] They are great for open source, but I think recently it was talked about that they're lacking some context length. And Bo, if you don't mind, please present what you guys open sourced this week, or released this week - I guess it's open source as well. Please talk through Jina Embeddings v2 and how it differs from everything else we've talked about.</p><p>[00:09:21] <strong>Bo Wang:</strong> Okay, good. Basically, it's not like we just started with embeddings - we've been doing it for, how can I say, maybe two point five years. But previously we were doing it at a much smaller scale.
Basically we built all the algorithms, all the platform, even like a cloud fine-tuning platform, to help people build better embeddings. So there is a not really open source, but closed source project called Finetuner, which we built to help users build better embeddings.</p><p>[00:09:53] But we found, okay, maybe we were too early, because people were not even using embeddings - how could they fine-tune embeddings? So we decided to make a move. Basically, we scaled up our, how can I say, ambition. We decided to train, train our own embeddings. So six months ago, we started to train from scratch - but not really from scratch, because in embedding training, normally you have to train in two stages.</p><p>[00:10:23] In the first stage, you need to pre-train on a massive scale of, like, text pairs. Your objective is to bring these text pairs as close as possible, as possible, because these text pairs should be semantically related to each other. In the next stage, you need to fine-tune with carefully selected triplets, all this kind of thing.</p><p>[00:10:43] So we basically started from scratch, but by collecting data. I think it was like six months ago; we were working with three to four engineers together, basically scouting every possible pair from the internet. Then we basically created like one billion, 1.2 billion sentence pairs from there. And we started to train our model based on the T5.</p><p>[00:11:07] Basically it's a very popular encoder-decoder model on the market. But if you look at the MTEB leaderboard or all the models on the market, the reason why they only support 512 sequence length is constrained actually by the backbone itself. Okay, we figured out another reason after we released the V1 model.</p><p>[00:11:31] Basically, if you look at the leaderboard, the Massive Text Embedding Benchmark leaderboard - that is the one Alex just mentioned - sorry, it's really bad, because everyone is trying to overfit the leaderboard.
That naturally happens, because if you look at BGE, GTE, the scores would never be that high if you didn't add the training data into the, into the... That's really bad.</p><p>[00:12:00] And we decided to take a different approach. Okay, the biggest problem we want to solve first: improving the quality of the embeddings. The second thing we want to solve is enabling users to have longer context length. If we want users to have longer context length, we have to rework the BERT model, because basically every embedding model's backbone was from BERT or T5.</p><p>[00:12:27] So we basically started from scratch: why not just borrow the latest research from large language models? Every large language model wants large context. Why not just borrow those research ideas into masked language modeling? So we basically borrowed some ideas, such as rotary position embeddings or ALiBi, maybe you know these, and reworked BERT.</p><p>[00:12:49] We call it JinaBERT. So basically now JinaBERT can handle much longer sequences. So we trained BERT from scratch. Now BERT has been a byproduct of our embeddings. Then we use this JinaBERT to contrastively train the models on the semantic pairs and triplets, and that finally allows us to encode 8K content.</p><p>[00:13:15] <strong>Alex Volkov:</strong> Wow, that's impressive. Just, just to react to what you're saying: pretty much everyone uses BERT, or at least used BERT, right? At least on the MTEB leaderboard. I've also noticed many other examples that use BERT or DistilBERT and stuff like this. What you're saying, if I'm understanding correctly, is this was the limitation on sequence length</p><p>[00:13:36] for other embedding models in the open source, right? And the OpenAI one that's not open source, that does have 8,000 sequence length. Basically, sequence length, if I'm explaining correctly, is just how much text you can embed without chunking.</p><p>[00:13:51] Yes.
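The sequence-length point Alex is making here, that a model's limit decides how much text you can embed without chunking, can be sketched in a few lines (the 512/8K numbers are from the conversation; the overlap value and helper are illustrative, not any library's API):

```python
# Sketch of why sequence length matters for embeddings: a model with a 512-token
# limit forces you to split ("chunk") longer documents before embedding, while an
# 8K-context model can often embed a whole document in one shot.
def chunk(tokens: list, max_len: int = 512, overlap: int = 64) -> list:
    # Overlapping windows so no sentence is cut off without context at a boundary.
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens), step)]

doc = list(range(8000))            # stand-in for 8,000 token ids
print(len(chunk(doc, 512)))        # 18 chunks at a 512-token limit
print(len(chunk(doc, 8192)))       # 1 chunk with an 8K-class model
```

Fewer chunks means fewer vectors to store and rank, and less risk of retrieval returning a fragment with its meaning split across two embeddings.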
And you're basically saying that you guys saw this limitation and then retrained BERT to use rotary embeddings. We've talked about rotary embeddings multiple times here; we had the folks behind the YaRN paper for extending context windows. For ALiBi, we follow Ofir Press.</p><p>[00:14:08] I don't think Ofir ever joined ThursdAI, but Ofir, if you hear this, you're welcome to join as well. So ALiBi is another way to extend context windows, and I think the Mosaic folks used ALiBi, and some other folks as well. Bo, could you speak more about borrowing the context tricks from there, retraining BERT into JinaBERT, and whether or not JinaBERT is also open source?</p><p>[00:14:28] <strong>Bo Wang:</strong> Oh, we actually want to make JinaBERT open source, but I need to align with my colleagues. That's really a decision to be made. And the idea is quite naive. I don't want to dive too much into technical details, but basically ALiBi removed the position embeddings from large language model pre-training.</p><p>[00:14:55] And the ALiBi technique allows us to train on shorter sequences but do inference at very long sequence lengths. So in the end, if my memory is correct, the authors of the ALiBi paper basically trained models on 512 and 1,024 sequence lengths but were able to do inference on 16K sequence lengths.</p><p>[00:15:23] If you expand it further, you're not capable, because that's the limitation of the hardware, the limitation of the GPU. So they actually tested 16K sequence lengths. So what we did is just borrow this idea from the autoregressive models into the masked language models: integrate ALiBi, remove the position embeddings from BERT, and add the ALiBi slopes and all the ALiBi stuff back into BERT.</p><p>[00:15:49] And we just borrowed the way we train BERT, or something from RoBERTa, and retrained BERT. 
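The ALiBi idea Bo describes (drop learned position embeddings, add a distance-proportional penalty to attention scores) can be sketched in a few lines. This is a simplified symmetric variant suitable for a bidirectional encoder; the single slope value is illustrative, not the paper's per-head slopes:

```python
import numpy as np

def alibi_bias(seq_len, slope=0.25):
    """ALiBi (sketch): instead of learned position embeddings, add a linear
    penalty to attention scores proportional to query-key distance. Because
    the penalty is defined for any distance, a model trained on short
    sequences can still run on longer ones at inference time."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[None, :] - pos[:, None])

def attention_weights(scores, seq_len, slope=0.25):
    """Softmax over raw attention scores plus the ALiBi bias."""
    biased = scores + alibi_bias(seq_len, slope)
    e = np.exp(biased - biased.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With equal raw scores, nearby tokens receive more attention than distant ones, which is the inductive bias that lets the model extrapolate past its training length.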
I never imagined BERT could be a byproduct of our embedding model, but this happened. We could open source it; maybe I have to discuss it with my colleagues.</p><p>[00:16:09] <strong>Alex Volkov:</strong> Okay. So when you talk to your colleagues, tell them, first of all, that you already said you may do this on the ThursdAI stage.</p><p>[00:16:15] So your colleagues are welcome to join as well. And when you open source this, you guys are welcome to come here and tell us about it. We love the open source; the more you guys do, the better. And the more it happens on the ThursdAI stage, the better, of course, as well. Bo, you guys released Jina Embeddings version 2, correct?</p><p>[00:16:33] Jina Embeddings version 2 has a sequence length of 8K tokens. Just for folks in the audience, 8,000 tokens is, I want to say, maybe around 6,000 words in English, right? And different in other languages as well. Could you talk about multilinguality too? Is it multilingual, is it only English?</p><p>[00:16:53] How does that appear within the embedding model?</p><p>[00:16:57] <strong>Bo Wang:</strong> Okay, actually, our Jina Embeddings v2 is English-only, so it's a monolingual embedding model. If you look at the MTEB benchmark, all the public multilingual models are multilingual across many languages at once. But to be frank, I don't think that's a fair solution.</p><p>[00:17:18] I think at least every major language deserves its own.</p><p>[00:17:24] We decided to choose another, hard way. We will not train a multilingual model; we will train bilingual models. Our first targets will be German and Spanish. What we are doing at Jina AI is basically fixing our English embedding model as it is, just keeping it as is, but continuously adding German data and Spanish data into the embedding model.</p><p>[00:17:51] And our embedding model cares about two things. We make it bilingual. 
So it's either German-English, or Spanish-English, or Japanese-English, whatever. And what we are doing is building this embedding model so it also works monolingually. So imagine you have a German-English embedding model.</p><p>[00:18:12] If you search in German, you'll get German results. If you use English, you'll get English results. But we also care about the cross-linguality of this bilingual model. So imagine you encode two sentences, one in German and one in English, which have the same meaning. We also want these vectors to be mapped into a similar semantic space.</p><p>[00:18:36] Because I'm a foreigner myself. Imagine I buy some stuff in the supermarket: sometimes I have to use Google Translate to translate, for example, milk into Milch in German, and then put it into the search box. I really want this bilingual model to happen. And I believe every major language, at least, deserves such an embedding model.</p><p>[00:19:03] <strong>Alex Volkov:</strong> Absolutely. And thanks for clarifying this, because one of the things that I often talk about here on ThursdAI, as a founder of Targum, which translates videos, is just how much language barriers are preventing folks from conversing with each other. And definitely embeddings are part of how people bridge those barriers, right?</p><p>[00:19:21] So it's a huge thing that you guys are working on, and especially helpful. On sequence length, I think we have a question from the audience: what do sequence lengths actually allow people to do? I guess Jina has worked with some other folks in the embedding space. Could you talk about what longer sequence lengths now unlock for people who want to use open source embeddings?</p><p>[00:19:41] Obviously. 
My answer here is, well, OpenAI's embedding model is the one that's most widely used, but that one you have to do online: you have to send your data to OpenAI, you have to have a credit card with them, you have to be from a supported country, and so on. Could you talk a little bit about what sequence length unlocks once you guys release something like this?</p><p>[00:20:02] <strong>Bo Wang:</strong> Okay, actually, we didn't think too much about applications. For most vector embedding applications, you can imagine search and classification. You build another layer, a classifier, to classify items based on the representation. You can build some clustering. You can do some anomaly detection on NLP text.</p><p>[00:20:22] This is something I can imagine. But the most important thing, I have to be frank with you, because we are also writing a technical report, something like a paper we'll maybe submit to an academic conference: longer embeddings don't always work better. That is because sometimes, if the important message is at the front of the document you want to embed, then it makes the most sense to just encode, let's say, 256 tokens</p><p>[00:20:53] or 512. But sometimes, if you have a document where the answer is in the middle or at the end, then you will never find it if the message is truncated. Another situation we find very interesting is clustering tasks. Imagine you want to visualize your embeddings: for clustering tasks, a longer sequence length almost always helps.</p><p>[00:21:21] And to be frank, I don't care too much about the application. What we're offering is, how can I say, a key. We unlocked this 512 sequence length limit up to 8K, and people can explore it. Let's say I only need 2K; then people just set the tokenizer max length to 2K,</p><p>[00:21:44] then embed. 
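Bo's point about truncation cuts both ways: an answer past the window is simply never embedded, while a user who only needs 2K can cap the window themselves. A toy sketch of truncation versus chunking, with a plain token list standing in for a real tokenizer:

```python
def chunk_tokens(tokens, max_len):
    """Split a token list into windows of at most max_len tokens.
    With an 8K-capable model you can often skip chunking entirely;
    with a 512-token backbone, anything past the first window is
    lost unless you chunk."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# A 1,000-token "document" whose answer sits at the very end.
doc = [f"tok{i}" for i in range(1000)]
truncated = doc[:512]            # what a 512-token backbone embeds
chunks = chunk_tokens(doc, 512)  # what you embed if you chunk instead
```

The truncated view silently drops the tail of the document; chunking keeps it, at the cost of more vectors to store and search.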
Based on their needs. I just don't want people to be limited by the backbone, by the 512 sequence length. I think that's the most important thing.</p><p>[00:21:55] <strong>Alex Volkov:</strong> That's awesome. Thank you for that, and thank you for your honesty as well. I love it. I appreciate the fact that there's research and there's application, and you don't necessarily have to be limited by having an application set in mind.</p><p>[00:22:07] We do research because you're just opening up doors, and I love hearing that. Bo, maybe the last thing that I would love to talk to you about, as the expert here, is the topic of dimensions. Dimensionality with embeddings I think is very important. OpenAI, I think, is one of the highest ones:</p><p>[00:22:21] the thing that they give us is like 1,200 dimensions as well. You guys, I think Jina is around 500 or so, is that correct? Could you talk a bit about that concept in broad strokes for people who may not be familiar? And then also talk about why the state-of-the-art OpenAI one is so far ahead,</p><p>[00:22:39] and what it will take for the open source embeddings to catch up in dimensionality?</p><p>[00:22:46] <strong>Bo Wang:</strong> You mean the dimensionality of the vectors? Okay, basically we follow a very standard BERT size. The only things we modified are the ALiBi part and some of the training.</p><p>[00:22:58] Our small model's dimensionality is 512, and the base model is 768, and we also have a large model that hasn't been released, because the training is too slow. Even though the model size is small, we have so much data to train on. The large model's dimensionality is 1,024. And if my memory is correct, text-embedding-ada-002</p><p>[00:23:23] has a dimensionality of 1,536, something like that, which is a very strange dimensionality, I have to say. But I would say that with dimensionality, longer might be better or more expressive, but shorter means that when you are doing vector search, it's going to be much faster.</p><p>[00:23:48] So it's something you have to balance: whether query speed, retrieval speed, is more important to you. And if I remember correctly, some of the vector databases charge you by dimensionality, so it's actually quite expensive if your dimensionality is too high.</p><p>[00:24:13] So it's a balance between expressiveness, speed, and the cost you want to invest. It's very hard to determine, but I think 512, 768, and 1,024 are very common for BERT.</p><p>[00:24:34] <strong>Alex Volkov:</strong> So great to hear that a bigger model is also coming but hasn't been released yet. So there's the base model and the small model for embeddings, and we're waiting for the next one as well.</p><p>[00:24:46] I wanted to ask you to maybe simplify for the audience the concept of dimensionality. What is the difference between embeddings that were stored with 512 dimensions and the 1,536 or whatever OpenAI does? What does it mean for quality? You mentioned the speed, right? It's easier to look up nearest neighbors within a 512-dimension space. What does it actually mean for the quality of lookup, for the different ways that strings can compare? 
Could you maybe simplify the whole concept, if possible, for people who don't speak embeddings?</p><p>[00:25:19] <strong>Bo Wang:</strong> Okay, maybe let me quickly start with the most basic version.</p><p>[00:25:24] Imagine you type something into a search box right now. When it's doing the matching, that's actually also an embedding, but, to make a simple version of it, it's a binary embedding. Imagine there are 3,000 words in English. There are many more, definitely, but imagine it's 3,000 words; then the vector has 3,000 dimensions.</p><p>[00:25:48] What current search or matching solutions do is just this: if the query has a token and your document has this token, then the occurrence will be one, and the query token will match your document token. It's also about the frequency a term appears with, how rare it is.</p><p>[00:26:12] But the current solution is basically matching by the English word. With a neural network, if you know about, for example, ResNet or a lot of different classification models, they basically output the class of an item, but if you chop off the classification layer, it will give you a vector.</p><p>[00:26:36] Basically this vector is the representation of the information you want to encode, a compressed version of the information in a certain dimensionality such as 512 or 768, something like this. So it's a compressed list of real-valued numbers, which we normally call dense vectors,</p><p>[00:26:57] because, how can I say it in English, it's much more dense, right? Compared to the traditional way we store vectors, which is much more sparse: there are a lot of zeros and ones, because zero means a word is absent and one means it's present. 
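Bo's binary, vocabulary-sized vectors can be made concrete. A tiny sketch with a hypothetical four-word vocabulary: lexical overlap finds exact word matches but scores zero for a cross-language synonym, which is exactly the gap dense semantic vectors fill:

```python
import numpy as np

# Hypothetical tiny vocabulary; a real one has tens of thousands of entries.
VOCAB = ["milk", "cow", "dairy", "price"]

def sparse_vector(text):
    """Classic lexical representation (sketch): one slot per vocabulary
    word, 1.0 if the word occurs in the text, 0.0 otherwise -- mostly
    zeros, hence 'sparse'."""
    words = text.lower().split()
    return np.array([1.0 if w in words else 0.0 for w in VOCAB])

def overlap(a, b):
    """A lexical match only happens where both vectors have a 1."""
    return float(np.dot(a, b))
```

Note the failure mode Bo's supermarket example points at: the German word for milk shares no token with "milk", so lexical overlap is zero even though the meaning is identical.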
When one exists, there's a match, and you've got the search result.</p><p>[00:27:16] So these dense vectors capture more of the semantics; if you only match by the occurrence of a token or a word, you might lose the semantics.</p><p>[00:27:31] <strong>Alex Volkov:</strong> Thank you. More dimensions, basically, if I'm saying this correctly, means more axes of similarity, more things two strings or tokens can be similar on, and that basically means a better match across more kinds of similarity. And I think the basic stuff is covered by Simon Willison in the first pinned tweet here; Simon Willison did a basic intro into what embedding dimensions mean and why they matter.</p><p>[00:28:00] And I specifically love the fact that there's arithmetic that can be done. I think somebody wrote a paper even before this whole LLM thing where, if you take the embedding for Paris, subtract France, and add Germany, you get something closer to Berlin, for example, right?</p><p>[00:28:19] So there are concepts inside these things where even arithmetic works, and if you take King and subtract male, you get something closer to Queen, and stuff like this. It's really interesting. And also, Bo, you mentioned visualization as well. It's really impossible to visualize</p><p>[00:28:36] 1,024 or more dimensions, right? We humans perceive maybe three, maybe three and a half, four with time, whatever. And usually what happens is those many dimensions get downscaled to 3D in order to visualize neighborhoods. I think we've talked with folks from Arize; they have software called Phoenix that allows you to visualize embeddings for clustering and for semantics.</p><p>[00:29:02] Atlas does this as well, right? Nomic AI's Atlas does this as well. 
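The arithmetic Alex mentions can be demonstrated with toy vectors. The embedding values below are invented for illustration; real models like word2vec exhibit the same behavior in hundreds of dimensions:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings chosen so that the classic
# king - man + woman ~= queen relationship holds.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def nearest(vec, table):
    """Return the word whose embedding has the highest cosine similarity."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(table, key=lambda w: cos(vec, table[w]))

# "Subtract male, add female" lands nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
```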
You can provide embeddings and see clustering for concepts, and it's really pretty cool. If you haven't played with this, if you only did vector DBs and stored your stuff after chunking but never visualized how it looks, I strongly recommend you do. And with that, thank you so much for joining us, explaining the internals, and sharing some exciting things about what's to come. JinaBERT is hopefully coming, a retrained version of BERT, which, how should I say, I see everywhere; it's the backbone of a lot of NLP tasks. It's great to see that you guys are retraining it for longer sequences using tricks like ALiBi and, I think you said, rotary position embeddings. We're hoping to see some open source action from this, and also that Jina Embeddings' large model is coming with more dimensions; we're waiting for that. Hopefully you guys didn't stop training it. And I just want to tell folks why I'm excited for this, and this kind of takes us to the next</p><p>[00:30:08] point as well: while I love OpenAI, I honestly do, I'm going to join their Dev Day, report from their Dev Day, and tell you all the interesting things that OpenAI does, we've been talking, and we'll be talking today, about local inference, about running models on edge, about running models of your own.</p><p>[00:30:28] Nisten is here; he even works on some bootable stuff that you can run completely off the grid. And so far, we've been focused on open source LLMs, for example, right? 
So I see Far El in the audience from Skunkworks, and many other fine-tuners, like Teknium and Alignment Lab; all these folks are working on local LLMs, and they haven't gotten to GPT-4 level yet.</p><p>[00:30:51] We're waiting for that, and they will. But the whole point of them is that you run them locally, they're uncensored, you can do whatever you want, you can fine-tune them on whatever you want. However, the embeddings part is the glue that connects them to an application, and the reason is that there's only so much context window, and context window is expensive. Even if, theoretically, the YaRN paper, whose authors we've talked with, allows you to extend the context window to 128,000 tokens, the hardware requirements for that are incredible, right?</p><p>[00:31:22] Everybody in the world of AI engineering has switched to retrieval augmented generation. Basically, instead of shoving everything into the context, they switched to: hey, let's use a vector database. Say Chroma, or Pinecone, or Weaviate, all of those, Vectorize from Cloudflare, the other one from Spotify, I forget its name, or even Supabase now has one.</p><p>[00:31:43] Everybody has a vector database these days, it seems, and the reason is that all the AI engineers now understand that you need to turn some text into embeddings and store them in some database. And many pieces of that still required internet, OpenAI API calls, credit cards, all these things.</p><p>[00:32:03] And I think it's great that we've finally gotten to a point where, first of all, there are embeddings that match whatever OpenAI has given us, and now you can run them locally as well. You don't have to go to OpenAI; if you don't want a hosted service, you can probably run them yourself. I think Jina Embeddings' base model is very tiny.</p><p>[00:32:20] The small model is 770 megabytes, I think. 
Maybe a little bit more, if I'm reading this correctly.</p><p>[00:32:27] <strong>Bo Wang:</strong> Sorry, it's half precision, so you need to double it to make it FP32.</p><p>[00:32:33] <strong>Alex Volkov:</strong> Oh yeah, it's half precision. So it's already quantized, you mean?</p><p>[00:32:37] <strong>Bo Wang:</strong> Oh no, it's just stored as FP16,</p><p>[00:32:39] <strong>Alex Volkov:</strong> if you store it as FP16.</p><p>[00:32:43] Oh, if you store it as FP16. But the whole point is, the next segment in ThursdAI today is going to be less about updates and more about very specific things. We've been talking about local inference as well, and these models are tiny: you can run them on your own hardware, on edge via Cloudflare, let's say, or on your computer.</p><p>[00:32:58] And you can now go almost end to end, application-wise: from the point of your user inputting a query, to embedding this query, running a match, a vector search, KNN or whatever nearest-neighbor search you want for that query, and retrieving it all from local, open source pieces. You can basically go offline.</p><p>[00:33:20] And this is what we want in the era of upcoming regulation of what AI can and cannot be, and the era of open source models getting better and better. We talked last week about how Zephyr and, I think, Teknium's Mistral fine-tunes are also matching GPT-3.5 on some things. All of those models you can download, and nobody can tell you not to run inference on them.</p><p>[00:33:40] Hugging Face open sourcing a fast Text Embeddings Inference Server with Rust / Candle</p><p>[00:33:40] <strong>Alex Volkov:</strong> But the actual applications still require the web, or they used to. And now I'm loving this new move. 
Even the application layer, even the RAG systems, retrieval augmented generation, even the vector databases, and even the embeddings are now coming to open source, coming to your local computer.</p><p>[00:33:57] And this will just mean more applications, either on your phone or your computer, and I absolutely love that. Bo, thank you for that, and thank you for coming to the stage here and talking about the things that you guys open sourced. Hopefully we'll see more open source from Jina, and everybody should follow you and Jina as well.</p><p>[00:34:13] Thank you for joining. I think the next thing that I want to talk about is actually in this vein as well. Let me go find this. Of course, we love Hugging Face, and if you look at the last pinned tweet, it's a tweet from Jerry Liu from LlamaIndex, obviously.</p><p>[00:34:33] We're following Jerry and whatever they're building and doing over at LlamaIndex, because they implement everything super fast; I think they also added support for Jina extremely fast. He talks about this thing where Hugging Face open sourced something in Rust and Candle, their ML framework built on Rust. Basically, they open sourced a server called Text Embeddings Inference that you can run on your hardware, on your Linux boxes, and basically get the same thing that you get from OpenAI embeddings.</p><p>[00:35:07] Because an embedding model is just that, a model. You could use this model with Transformers, but it wasn't as fast. And as Bo previously mentioned, there are latency considerations for user experience, right? 
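The fully offline pipeline described above (embed at ingest, embed the query, brute-force nearest-neighbor search) fits in a few lines. The `embed` function here is a deterministic stand-in that hashes text to a random unit vector, not a real model; in practice you'd swap in a local embedding checkpoint:

```python
import numpy as np

def embed(text, dim=64):
    """Stand-in embedder (hypothetical): a real offline pipeline would call
    a local model; here we hash the text to a deterministic unit vector
    just to wire the pipeline together without any network calls."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class LocalIndex:
    """Minimal in-memory vector store: embed at ingest time, brute-force
    cosine search at query time. Everything stays on your machine."""
    def __init__(self):
        self.texts, self.vecs = [], []

    def add(self, text):
        self.texts.append(text)
        self.vecs.append(embed(text))

    def query(self, text, k=1):
        q = embed(text)
        sims = np.stack(self.vecs) @ q  # cosine, since vectors are unit-norm
        return [self.texts[i] for i in np.argsort(-sims)[:k]]
```

Replacing brute force with an approximate nearest-neighbor index is what the vector databases Alex lists provide, but the shape of the pipeline is exactly this.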
If you're building an application, you want it to be as responsive as possible.</p><p>[00:35:24] You need to look at all the places in your stack and ask: hey, what slows me down? For many of us, the actual inference, say using GPT-4, waiting on OpenAI to respond and stream that response, is what slows many applications down. But for many people who do embeddings, say you have a chat or search interface, you need to embed every query the user sends you.</p><p>[00:35:48] And one such slowness is: how do you actually embed this? So it's great to see that Hugging Face is working on improving that. You previously could do this with Transformers, and now they've released this specific server for embeddings called Text Embeddings Inference.</p><p>[00:36:04] And I think it's four times faster than the previous way to run this, and I absolutely love it. So I wanted to highlight this in case you are interested. You don't have to; you can use OpenAI embeddings. Like we said, we love OpenAI, and it's very cheap. But if you are interested in going the local embedding way, if you want to go end to end, completely offline, if you want to build an offline application, using their inference server I think is a good idea.</p><p>[00:36:29] It also shows what Hugging Face is doing with Rust and Candle; definitely a great effort from Hugging Face, and yeah, I just wanted to highlight that. Let's see. Before we're joined by the Gradio folks, and I think there are some folks in the audience from Gradio ready to come up here and talk about local inference, with 15 minutes left:</p><p>[00:36:52] Data Provenance Initiative at dataprovenance.org</p><p>[00:36:52] <strong>Alex Volkov:</strong> I wanted to also mention the Data Provenance Initiative. Let me actually find this announcement, and then quickly... 
paste this here. I was hoping that Enrico could be here. There's a guy named Shane Longpre,</p><p>[00:37:05] and he released this massive effort together with many people. This effort is called the Data Provenance Initiative, and it now lives at dataprovenance.org. Hopefully somebody can send me the direct link to the tweet so I can add it.</p><p>[00:37:23] It is a massive effort to take 1,800 instruction and alignment datasets that are public and go through them to identify multiple things. You can filter them, exclude them, you can look at creators, and, most importantly, you can look at licenses. Why would you do this? Well, I don't know if somebody who builds an application necessarily needs this, but for everybody who wants to fine-tune models, the data is the most important key, and building datasets and running them through your fine-tuning efforts is basically the number one thing that many people in the fine-tuning community do, right?</p><p>[00:38:04] Data wranglers. And now, thank you, Nisten, thank you so much, a friend of the pod, Enrico, is now pinned to the top of the space, the nest, whatever it's called. Enrico Shippole, who we've talked with previously in the context of extending, I think, LLaMA to first 16K and then 128K context.</p><p>[00:38:24] I think Enrico is part of the team on the YaRN paper as well, and he joined this effort; I was hoping Enrico could join us to talk about this. But basically, if you're doing anything with data, this seems like a massive effort: many datasets from LAION, and we've talked about LAION, and Alpaca, GPT4All, Gorilla, all these datasets.</p><p>[00:38:46] It's very important, when you release your model as open source, that you have the license to actually release it. You don't want to get exposed; you don't want to get sued, whatever. 
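Filtering datasets by license, the workflow the Data Provenance Initiative enables, reduces to a simple check once the metadata exists. The records and the allow-list below are hypothetical, not the initiative's actual schema:

```python
# Hypothetical records illustrating the kind of metadata the Data
# Provenance Initiative catalogs; names and fields are made up.
datasets = [
    {"name": "dataset_a", "license": "apache-2.0"},
    {"name": "dataset_b", "license": "cc-by-nc-4.0"},  # non-commercial
    {"name": "dataset_c", "license": "mit"},
]

# Example allow-list of licenses that permit commercial use.
COMMERCIAL_OK = {"apache-2.0", "mit", "cc-by-4.0"}

def commercially_usable(records):
    """Keep only datasets whose license permits commercial fine-tuning."""
    return [r["name"] for r in records if r["license"] in COMMERCIAL_OK]
```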
And if you're into finding datasets and creating different mixes to fine-tune different models, this is a very important thing.</p><p>[00:39:03] We want to shout out Shane Longpre, Enrico, and everybody who worked on this, because I love these efforts for the open source community. It just makes it easier to fine-tune and train models, it makes it easier for us to advance and get better and smaller models, and it's worth celebrating, and ThursdAI is the place to celebrate this, right?</p><p>[00:39:27] LocalLLaMA effort to compare 39 open source LLMs + GPT-4</p><p>[00:39:27] <strong>Alex Volkov:</strong> On the topic of, how should I say, extreme efforts happening in the community, on the same topic I want to add another one, and this one I think I have a way to pull up, so give me just a second. Yes: a Twitter user named Wolfram Ravenwolf, who is a participant in the LocalLLaMA community on Reddit and is now pinned to the nest at the top of the space, did this massive effort of comparing open source LLMs. He tested 39 different models ranging from 7 billion parameters to 70 billion, and also compared them to ChatGPT and GPT-4.</p><p>[00:40:06] And I just want to circle back to something we've said in the previous space as well, and I welcome folks on stage to jam in here. I've seen the same kind of point from Hugging Face folks; I think Clem said the same thing. It's really unfair to take an open source model like Mistral 7B and start comparing it to GPT-4.</p><p>[00:40:26] It's unfair for several reasons. But I also think it can obscure, for some people doing this comparison, just how far we've come in the past year in open source models. 
OpenAI has the infrastructure, they're backed by Microsoft, and they have the pipelines to serve these models way faster.</p><p>[00:40:47] And also, those models don't run on local hardware; they don't run on one GPU. It's a whole amazing MLOps effort to bring you this speed. When you're running open source models locally, they're small, and there are drawbacks and trade-offs that you have to bake into your evaluation.</p><p>[00:41:09] So comparing to GPT-4, which is super general at many, many things, will just lead to your disappointment. However, and we've been talking about this with other open source models, if you have a different benchmark in your head, if you're comparing open source to open source, then it's a completely different ballgame.</p><p>[00:41:26] Then you start seeing things like: hey, we're noticing that a 7-billion-parameter model is beating a 70-billion one. We're noticing that size is not necessarily king, because, if you guys remember, three months ago we talked about Falcon 180B. 180B was like three times the size of the next largest model.</p><p>[00:41:47] And it was incredible that Falcon open sourced this, but nobody really was able to run 180B, because it's huge. And once we did run it, we saw that the difference between it and LLaMA was not great at all, maybe a few percentage points on the evaluations.</p><p>[00:42:04] However, the benefits we're seeing are from tinier and tinier models, like Mistral 7B, for example, which is the one the fine-tuners of the world now prefer to everything else. 
So when you're about to evaluate whatever next model that's coming up that we're going to talk about, please remember that comparing to big companies backed by billions of dollars, running on massive distributed hardware, is just going to lead to disappointment.</p><p>[00:42:34] However, when you do comparisons like the one now pinned to the space, that is the way to actually do this. However, on specific tasks, like, say, coding... go ahead, Nisten.</p><p>[00:42:46] <strong>Nisten Tahiraj:</strong> I was going to say, we're still a bit early to judge; for example, Falcon could have used a lot more training.</p><p>[00:42:53] There are also other areas where larger models have a big effect, stuff like very long context summarization, where you want to use the 70B. And as far as I'm getting it, and this is probably inaccurate right now, the more tokens you have, the more meat you have in there, and the larger the thoughts can be.</p><p>[00:43:23] So that's the principle we're going by. Mistral will do extremely well on small analytical tasks and on benchmarks, and it's amazing as a tool, but that doesn't necessarily mean it'll be good at thinking big. You still need the meat there, the amount of tokens, to do that. Now, you could chop the task up and do it one piece at a time, but anyway, it's just something to keep in mind, because lately we also saw the announcement of the long-context Llama 2 70B, which started getting really good at summarization.</p><p>[00:44:04] So again, there's one particular area, summarization, where it looks like you need bigger models. 
And I've tested it myself with Falcon and such, and it's pretty good at summarization. I just want to give them the benefit of the doubt that there is still something that could be done there.</p><p>[00:44:28] I wouldn't just outright dismiss it.</p><p>[00:44:31] <strong>Alex Volkov:</strong> Yeah, absolutely, and I want to join this non-dismissal. Falcon open sourced Falcon 40B fully commercially before, and it was the biggest open source model at the time. Then they gave us 180B; they didn't have to, and we appreciate the open sourcing.</p><p>[00:44:46] We're not going to say no. Bigger models have more information, maybe more of a world model in them, and there's definitely a place for that, for sure. The next thing you mentioned, and I strongly connect to that, Nisten, thank you, is that GPT-4, for example, is very generalized. It does many, many things well.</p><p>[00:45:08] It's kind of impressive, and whatever Gemini is going to be from Google, soon, hopefully; we're always waiting on ThursdAI for the breaking news to come on a Thursday, when we're going to be talking about something else and Google suddenly drops Gemini on us. There are also other rumors about Google's other stuff.</p><p>[00:45:22] Whatever OpenAI's Arrakis was before they stopped training it, and whatever comes next from OpenAI, will probably blow everything we expect in terms of generality out of the water. And the open source models, as they currently are, are really great at focused tasks, right? So the coder models, for example, like Glaive Coder, recently released by Anton Bacaj, I think are doing very well on the evaluations for code.</p><p>[00:45:51] However, on general stuff, they're probably less good. And for open models, expecting generality on the same level as GPT-4 is, I think, going to lead to disappointment. 
But for specific tasks, I think we're coming close to things that a year ago seemed state of the art. If you guys remember, it's not even a year since ChatGPT was released, right?</p><p>[00:46:14] I think ChatGPT was released in the middle of November, not even as an API, just the UI. So we're coming up on one year; I think the Dev Day will actually be one year. That was 3.5. Many people still use 3.5 for applications, but if you're paying for ChatGPT Plus and you have a task to solve, you're not going to go with 3.5 just because you feel like it.</p><p>[00:46:35] You know that 4 is better. But now we have open source models that are way smaller, and they're actually getting to some of the levels of 3.5, and the effort above is actually an effort to figure out which ones. And so I strongly recommend, first of all, getting familiar with the LocalLLaMA subreddit.</p><p>[00:46:54] If you don't use Reddit, I feel you. I've been a Reddit user for a long time, and I stopped; some parts of Reddit are really annoying. This is actually a very good one, where I get a bunch of my information outside of Twitter. And I think Andrej Karpathy also recommended it recently, which then became an item on that subreddit.</p><p>[00:47:12] It was really funny. And this massive effort was done by this user, and he did a full comparison of 39 different models. And he outlined the testing methodology as well. We've talked about testing and evaluation methodology between ourselves; it's not easy to evaluate these things. A lot of it is gut feeling.</p><p>[00:47:31] A lot of the evaluation, and Nisten and I have our own prompts that we try on every new model, right? For many people, a lot of this is gut feel. 
And many people also talk about the problem with evals, and I think Bo mentioned the same thing with the embedding leaderboards:</p><p>[00:47:48] it becomes a sport for people to fine-tune and release models just to put their name on the board, overfitting on whatever metrics and evaluations happen there. And then there's a whole discussion on Twitter about whether or not the new model that beats that other model on some score was actually trained on the evaluation data.</p><p>[00:48:09] But definitely the gut-feeling evaluation is important, and definitely having different things to test for is important. Those of you who come to ThursdAI know my specific gut-feel tests are about translation and multilingual abilities, for example, and instruction following. Some other people, like Jeremy Howard, have their own approach.</p><p>[00:48:29] Everybody has their own approach. I think what's interesting there is what the community provides, right? We're like this decentralized brain evaluating every new model. And for now, the community has definitely landed on Mistral as the top model, at least in the 7B range, while Falcon, even though it's huge and can do some tasks like Nisten said, is rated less so, and Llama was there before. So if you start measuring the community responses to open source models, you start noticing better what does what. And this guy's effort actually outlined the methodology. And I want to shout out friends of the pod: Teknium, the go-to for many, many things, specifically because OpenHermes, which we've talked about before, and which was fine-tuned from Mistral 7B, is probably at the top of the leaderboard there, and that matches my experience too, right?</p><p>[00:49:22] So we talked last week about OpenHermes, and how you're able to run OpenHermes on your
M1 or M2 Mac, basically, with LM Studio, and shout out to LM Studio, they're great. I've tested this, and it seems to be a very well-rounded model, especially for one you can run yourself, and compared to GPT-4 and other models, this model is really great for specific things.</p><p>[00:49:45] It's good for coding, though not the best for coding; I think there's a coding equivalent. And I just really encourage you, if you're interested, to figure out what to use. And we've talked about this before: what to use is an interesting concept, because if you come to these spaces every week and hear, oh, this model is now state of the art, that model is state of the art,</p><p>[00:50:05] you may end up not building anything, because you'll always keep chasing the latest and greatest. The differences are not vast from week to week; we're just seeing better scores. But it's well worth checking out this effort, for the methodology, and for the confirmation it gives you.</p><p>[00:50:21] Let's say you felt that Mistral is better; now you can actually understand why. Also, friends of the pod: I think Jon Durbin's Airoboros model is really great, and it's also up there. And what Nisten highlighted is that bigger models sometimes excel at different things, like summarization, or just having more knowledge.</p><p>[00:50:38] That's also outlined there. And you can also see models that are not that great, that maybe look good on the leaderboards but don't necessarily perform as well; you can see them in that effort too.</p><p>[00:50:49] So maybe let me actually reset the space. Everybody who joined in the middle of me speaking is wondering, why is this guy speaking,</p><p>[00:50:56] and what's going on here? You are welcome; you're in the ThursdAI space. On ThursdAI we meet every week to talk about everything that happens in the world of AI. 
If you're listening to this and enjoying it, you're the target audience. Generally we talk about everything from open source LLMs, and now embeddings.</p><p>[00:51:13] We talk about the big company APIs. There aren't a lot of updates from OpenAI this week; I think they're staying quiet and will release everything at their Dev Day in a week and a half. And Anthropic obviously, and Claude, and Microsoft and Google, all these things we cover as much as possible.</p><p>[00:51:29] We also cover voice and audio. And in that vein, I want to shout out our friends from Gladia, and actually, let me just pin this right now. Gladia just released streaming for Whisper, and I've been waiting for something like this to happen. Sometimes, as an AI engineer, you don't want to host everything yourself, and you want to trust that the WebSocket infrastructure is going to be there when you don't want to build it out. And I'm not getting paid for this.</p><p>[00:51:53] This is my personal take: if I had to implement something like the voice interface of ChatGPT, I would not build it myself. I would not trust my own MLOps skills for that. I've been following Gladia since I wanted to implement some of their stuff, and they just launched WebSocket</p><p>[00:52:11] Whisper transcription streaming. It's multilingual, and it's quite fast, and I definitely recommend folks check it out, or check out my review of it, try the demo, and if you want it, use it. Because we talked last week about the voice interface for ChatGPT, where you can actually have a FaceTime-style call with ChatGPT, and that's incredible.</p><p>[00:52:30] And I think we're more and more removing the screen from talking to your AI agents, with the latest releases in text to speech too, like ElevenLabs and XTTS, which we've covered as well. 
With the advances there in speed, you can start getting interfaces where you talk, and the AI listens and answers back to you very fast.</p><p>[00:52:52] Worth checking out, and definitely an update. Thank you.</p><p>[00:52:57] <strong>Nisten Tahiraj:</strong> Okay. So this is a complete product?</p><p>[00:53:00] <strong>Alex Volkov:</strong> Yeah, this is a full product: pay a little bit, get a WebSocket, and then you use that WebSocket to stream, and you can embed this into your applications very fast. As for setting that up yourself, I think you can do this with Coqui, which we also covered.</p><p>[00:53:13] Gradio Interview with Abubakar, Xenova, Yuichiro</p><p>[00:53:13] <strong>Alex Volkov:</strong> Alright, I think it's time to reset the space again. This is ThursdAI. I want to thank Bo, who is still on stage. Bo, you're welcome to stay with us a little bit, and now we're moving on to the second part of the show.</p><p>[00:53:30] Welcome, Abubakar. Welcome, Xenova, Joshua. Welcome to some folks in the audience from Hugging Face. It's great to see you here on ThursdAI. Well, Xenova is always here, or hopefully, but Abubakar, I think this is your first time.</p><p>[00:53:41] I'll do a brief intro, and then we can go and talk about Gradio as well.</p><p>[00:53:45] The first inference I ever ran on a machine learning model was a year and something ago, and it was via Gradio, because I got this weights file and thought, okay, I can probably run something with the CLI, but how do I actually visualize this? And back then, Gradio
was the way. And I think since then, well, you were already part of Hugging Face by then, everybody who visited a model page and tried a demo has probably experienced Gradio, even without knowing that this is what's behind all the demos. So welcome. Please feel free to introduce yourself.</p><p>[00:54:17] Give us maybe two or three lines on how you explain Gradio to folks, and then we can talk about some exciting stuff you guys released this week.</p><p>[00:54:25] <strong>Abubakar Abid:</strong> Awesome. Yeah, first of all, thank you again for having me and for having several folks from the Gradio team here. I've known you, Alex, for a long time.</p><p>[00:54:32] I think you were one of the early users of Gradio, or at least one of the early users of Gradio Blocks and some of these viral demos. So I've seen this podcast develop over time, and it's a real honor to be able to come here and talk about Gradio.</p><p>[00:54:45] Hi everyone, I'm Abubakar. I lead the Gradio team at Hugging Face. The way we describe Gradio is that it's basically the fastest way to build a GUI or an app from a machine learning model. Traditionally, taking a machine learning model to production, or at least letting users try it out, has meant that you need to know a lot of front end: you need to know how to set up a server, web hosting, you have to figure all of these things out so other people can play around with your machine learning model. But Gradio lets you do all of that with just a few lines of Python, as I think Joshua was mentioning earlier.</p><p>[00:55:18] And Gradio has been used by a lot of people. We've been very lucky with timing. We started Gradio a few years ago, late 2019. 
It grew out of a project at Stanford, then spun out to be a startup, then we got acquired by Hugging Face, and we've been growing Gradio within that ecosystem.</p><p>[00:55:32] But we're very lucky, because this time has coincided with a lot of real developments in machine learning. I come from an academic background; before 2019 I was doing my PhD at Stanford. And everyone's been doing machine learning for a while now, but</p><p>[00:55:45] with the types of machine learning models people used to build, you built it, you published a paper, and that was it. Since then, people have been building machine learning models that other people actually want to use and play around with. Things have become very exciting, and that's led to a lot of people building Gradio demos. I was looking at the stats recently: something like more than three or four million Gradio demos have been built since we started the library.</p><p>[00:56:09] And yeah, so recently we released something called Gradio Lite, which lets you run</p><p>[00:56:13] Gradio effects on the open source LLM ecosystem</p><p>[00:56:13] <strong>Alex Volkov:</strong> Wait, before Gradio Lite, Abubakar, if you don't mind, I just want to highlight how important this is to the ecosystem, right? I'm originally a front end engineer. I do component libraries for breakfast, and basically, I don't want to do them. It's really nice to have a component library, maybe Tailwind UI or shadcn; even front end engineers don't like building things from scratch.</p><p>[00:56:35] Switching to machine learning folks, who build the model, say, and want to run some inference, that's not their cup of tea at all. Just thinking about installing JavaScript packages, running npm, all these things, that's not where they live at all. 
And so Gradio allows us to do all of this in Python.</p><p>[00:56:51] And I think, let's start there: that on its own is incredible, and it led to so many demos happening in Gradio. And you guys built out pretty much everything else for them, everything that you would need. I think recently you've added components like chat, because you noticed that many people talk to LLMs and need the chat interface, right?</p><p>[00:57:10] There's a bunch of multimodal stuff for video and so on. Could you talk about the component approach, how you think about providing tools for people who don't have to be designers?</p><p>[00:57:20] <strong>Abubakar Abid:</strong> Yeah, absolutely. So that's exactly right. Most of the time, as a machine learning developer, you don't want to be thinking about writing front end components. That's coupled with an interesting insight we had about machine learning models:</p><p>[00:57:31] the components for machine learning models tend to be much more reusable than in other kinds of applications, right? So one thing I want to be clear about is that Gradio is actually not meant to be a general build-web-apps-in-Python tool. That's not our goal. We're heavily optimized toward building machine learning apps.</p><p>[00:57:50] And what that means is that the types of inputs and outputs people tend to work with are a little bit more contained. So we have a library right now of about 30 different types of inputs and outputs. What does that mean? Things like images, image editing, video inputs and outputs, chatbots as outputs, JSON, data frames: various types of input and output components that come prepackaged with Gradio.</p><p>[00:58:15] And then when you build a Gradio application, you basically say, hey, this is my function, these are my inputs, and these are my outputs. 
And then Gradio takes care of everything else: stringing everything together, sending the messages back and forth, and pre-processing and post-processing everything in the right way.</p><p>[00:58:29] So yeah, you just have to define your function in the backend, and your inputs and your outputs, and then Gradio spins up a UI for you.</p><p>[00:58:36] <strong>Alex Volkov:</strong> I found it really funny, and sent the laughing emoji, when you said that Gradio was not meant to build full-scale web apps, because I think the first time we talked, you reached out because I had joined whatever open source effort was running for Stable Diffusion, this was before Automatic1111, I think, and you told me, hey, Alex, you did some stuff that we didn't mean for you to do. I had injected a bunch of JavaScript and a bunch of CSS; I had to go full-on front end developer, because I was limited by the tool. And even despite the limitations, I think we did a bunch of stuff with just raw JavaScript injection. Since then, it's very interesting. You mention Gradio demos: Automatic</p><p>[00:59:16] 1111, which for most people is maybe the only way they know how to run Stable Diffusion, is now getting investment from NVIDIA and the like; I saw a bunch of stuff that Automatic does. So it's very interesting how you started and how the community picked it up. Can you talk about the bigger projects, like Automatic1111 and some others, that are taking Gradio and pushing it to the absolute limit?</p><p>[00:59:37] <strong>Abubakar Abid:</strong> Yeah, absolutely. I'm, like, perpetually shocked by Automatic1111, every time I see a plug-in, or, I think like you said, NVIDIA, and now IBM or something, released a plug-in for Automatic1111? It's crazy. 
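The pattern Abubakar describes there (define a backend function, declare input and output components, let Gradio wire up the UI) can be sketched in a few lines. The toy `classify` function below is a hypothetical stand-in for a real model, and the Gradio calls are left as comments so the sketch runs even without `gradio` installed:

```python
def classify(text: str) -> str:
    """Toy stand-in for a real ML model: labels text by a keyword check."""
    return "positive" if "good" in text.lower() else "negative"

# Wiring it into a UI is the part Gradio handles (requires `pip install gradio`):
#
#   import gradio as gr
#   demo = gr.Interface(fn=classify,      # the backend function
#                       inputs="text",    # prepackaged input component
#                       outputs="label")  # prepackaged output component
#   demo.launch()                         # Gradio spins up the UI

print(classify("This was a good demo"))  # prints "positive"
```

Swapping `classify` for a real inference function is essentially the whole integration; the `inputs` and `outputs` strings name Gradio's prepackaged components.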
But yeah, basically ever since we started Gradio, we've been noticing that Gradio seems to work for 90 percent of the use cases, but for the last 10 percent, people are pushing the limits of what's possible with Gradio.</p><p>[01:00:06] And so we've progressively increased what's possible. In the early days of Gradio, there was actually just one class, called Interface. What that did was allow you to specify some inputs, some outputs, and a single function. And we quickly realized, okay, people are trying to do a lot more.</p><p>[01:00:20] So about a year and a half ago, we released Gradio Blocks, which allows you to have arbitrary layouts. You can have multiple functions strung together and connect inputs in different ways. And that is what allowed these very, very complex apps like Automatic1111, SD.Next, and the equivalents in other domains as well.</p><p>[01:00:36] Of course, the Oobabooga text generation web UI, and there are similarly complex demos in the audio space as well, and in music generation. So these super complex apps with multiple tabs, all of that is possible with this new architecture we laid out, called Gradio Blocks.</p><p>[01:00:55] And Gradio Blocks is this whole system for specifying layouts and functions. And it's defined in a way that's intuitive to Python developers. A lot of these web frameworks in Python have popped up, and one of the things I've noticed, as someone who knows Python but really not much JavaScript, is that they very much come from the perspective of a JavaScript engineer, these React-inspired frameworks and so on.</p><p>[01:01:21] And that's not very intuitive to a Python developer, in my opinion. And so we've defined this whole thing.
Where you can build these arbitrary web apps, but still in this Pythonic way. And we're actually about to take this a step further, and maybe I can talk about this at some point, but next week we're going to release Gradio</p><p>[01:01:38] 4.0, which takes this idea of being able to control what's happening on the page to the next level. You can have arbitrary control over the UI and UX of any of our components. You can build your own components and use them within a Gradio app and get all of the features that you want in a Gradio app,</p><p>[01:01:52] like API usage, pre-processing, post-processing. Everything just works out of the box, but now with your own level of control. Yeah. Awesome.</p><p>[01:02:01] <strong>Alex Volkov:</strong> And it's been honestly great to see just how much enablement something as simple as Gradio has given folks who don't necessarily want to install npm and CSS packages.</p><p>[01:02:11] It's given so much enablement to the open source community, because people release, like you said, different significant things. Many of them you're maybe not even aware of, right? They're running in some Discord, they're running on some subreddit. It's not like you guys follow everything that happens.</p><p>[01:02:23] Gradio local URL via Gradio Proxy</p><p>[01:02:23] <strong>Alex Volkov:</strong> An additional thing I want to mention that's very important: when you run Gradio locally, you can actually expose it via their server, basically sharing my local machine. And that's been a blast; that's been a very, very important feature, because people may be sitting behind a proxy or whatever.</p><p>[01:02:39] You can share your local instance with folks, unfortunately only for 72 hours. But actually,</p><p>[01:02:44] <strong>Abubakar Abid:</strong> that's about to change in 4.0. So actually, we've been very lucky, because Gradio has been developed along with the community. Like you said, oftentimes we don't know what people are using Gradio for until they come to us and tell us that something doesn't work, and then they'll link to their repo and it's this super complex Gradio app and we're like, what?</p><p>[01:03:01] Okay, why are you even trying that? That's way too complicated. But then we realize the extent of what people are building. And you mentioned the share links as well, which I want to briefly touch upon. This is one of the things we released in the early days of Gradio, because we realized people don't want to worry about hosting their machine learning apps.</p><p>[01:03:19] Oftentimes you want to share your machine learning app with a colleague. Let's say you're the engineer and you have a colleague who's a PM who wants to try it out. Or, if you're in academia, you want to share it with fellow researchers or your professors, whatever it may be.</p><p>[01:03:33] And why do all of this hosting stuff if you're just building an MVP, right? So we built this idea of a share link. When you launch your Gradio app, you just say share equals true, and what that does is use something called FRP, fast reverse proxy, to expose your local port to an FRP server running on a public</p><p>[01:03:53] machine, which forwards any request from the public URL to your local port. Long story short, it makes your Gradio app available on the web for anyone to try. It runs for 72 hours by default, but now, as part of 4.0,</p><p>[01:04:08] and we'll announce this, you can actually build your own share servers. 
So we have instructions for how to do that very easily, and you can point your Gradio instance at that share server. So if you have an EC2 instance running somewhere, just point to it, and then you can have that share link running for as long as you want. You can share your share servers with other people at your company or your organization, and they can use that share link and run it for however long they want.</p><p>[01:04:30] Wait,</p><p>[01:04:31] <strong>Nisten Tahiraj:</strong> wait, wait, is this out? Which branch is this? Is this going to be out?</p><p>[01:04:34] <strong>Abubakar Abid:</strong> This is going to be out on Tuesday, with Gradio 4.0. We're going to launch on Tuesday.</p><p>[01:04:41] <strong>Nisten Tahiraj:</strong> It's the most useful feature of Gradio, I'd say, especially when you make a Google Colab that you want people to just run in one click.</p><p>[01:04:49] Like, how are they even going to use this model? You just throw the entire Gradio interface in there, set share equals true, and then they can just give the link to their friends. It makes it really easy, especially with Google Colab. But now that you can host your own, this is huge.</p><p>[01:05:09] This is going to take it to another level. I have more questions for Google.</p><p>[01:05:14] <strong>Alex Volkov:</strong> Nisten, thank you. I just want to touch upon the Google Colab thing. I think at some point Google started restricting how long you can run a Colab for, and I think you guys are the reason, this exact thing that Nisten said.</p><p>[01:05:30] People just kept running the Gradio thing with the URL within Google Colab and exposing Stable Diffusion. They didn't build Colab for that, and I think they quickly had to figure out how to work around this.</p><p>[01:05:41] <strong>Abubakar Abid:</strong> Yeah. 
And their approach is literally blacklisting the names of specific GitHub repos, which, I completely understand where Colab is coming from, right?</p><p>[01:05:50] They're giving these GPUs away for free; they have to prioritize certain use cases. But we're working with the Colab team, and we're seeing if there are other ways. Right now it's a blacklist on Automatic1111 and some other repos, so we're hoping we can find another way that's not so restrictive.</p><p>[01:06:05] <strong>Nisten Tahiraj:</strong> No, but it still works. You can just fork the repo. It works for everything else. It works for LLMs, so if anybody else really needs it, Gradio works on Colab. Well, as far as language stuff goes; I haven't done that much beyond that.</p><p>[01:06:18] <strong>Abubakar Abid:</strong> Yeah, so Gradio works on Colab for sure. And early on, one of the decisions we had to make was:</p><p>[01:06:25] should we use the default Python runtime, or should we change the interpreter and things like that? Because building GUIs is not necessarily Python's strength; oftentimes you want to re-render everything, and you want to do certain things that may not be what Python is suited for.</p><p>[01:06:42] But early on we decided, yeah, we want to stick with the default Python runtime. One of the reasons was things like Colab, because we wanted people to be able to run Gradio wherever they normally run Python, without having to change their workflows. And Gradio works in Colab.</p><p>[01:06:56] We had to do a lot of trickery to make it work, but it works. It's just these certain very specific apps that have become too popular and apparently consume too many resources. 
They're blacklisted by Colab right now.</p><p>[01:07:10] Local inference on device with Gradio - Lite</p><p>[01:07:10] <strong>Alex Volkov:</strong> Alright, thank you for this intro to Gradio. To continue, we have on stage Xenova, who introduced himself, the author of Transformers.js. We've been talking with Bo in the audience, somebody who also just recently open sourced, with Jina, their embeddings model. A lot of what we love to cover on ThursdAI is about being as open source and as local as possible, for different reasons, including not-getting-restricted reasons.</p><p>[01:07:36] And you guys just recently launched Gradio Lite, and we actually have Yuichiro here on stage as well. So I would love to have you, Abubakar, introduce it, and maybe have Yuichiro follow up with some of the details. What is Gradio Lite? How does it relate to running models on device and open source?</p><p>[01:07:52] And yeah, please introduce it.</p><p>[01:07:54] <strong>Abubakar Abid:</strong> Yeah, absolutely. Like you mentioned, one of the things we think about a lot at Gradio is the open source ecosystem: where can open source LLMs, for example, really shine, and things like that.</p><p>[01:08:06] And one of those places is on device. On device, or in the browser, open source has a huge edge over proprietary models. So we were thinking about how Gradio can be useful in this setting, and we were thinking about the in-browser application in particular. And we were very lucky to have Yuichiro actually reach out to us.</p><p>[01:08:25] Yuichiro has this fantastic track record; if you don't already know him, he built Streamlit Lite, which is a way to run Streamlit apps in the browser. And then he reached out to us and basically had this idea of doing something similar with Gradio as well. 
And basically, he almost single-handedly refactored much of the Gradio library so that it could run</p><p>[01:08:43] with Pyodide, in WebAssembly, and basically just run in the browser. I'll let Yuichiro talk more about that, but basically, if you know how to use Gradio, then you know how to use Gradio Lite. You just write the same Python code, but wrap it inside gradio-lite tags, and then it runs within the browser, in the front end.</p><p>[01:08:59] You can execute arbitrary Python, and it just works. Yuichiro, if you want to share a little bit more about that, or introduce yourself.</p><p>[01:09:08] <strong>Yuichiro:</strong> All right, hey, can you hear me? Well, thank you very much for the quick introduction to Gradio Lite and Streamlit Lite too.</p><p>[01:09:15] Well, as Abubakar explained,</p><p>[01:09:18] there was</p><p>[01:09:18] originally a kind of technological movement around edge computing for Python. It was started by Pyodide, which is a CPython runtime compiled for WebAssembly that can run completely in web browsers. It triggered a big bang of edge-computing Python,</p><p>[01:09:43] starting with projects that were ported to the WebAssembly runtime, and it inspired many other Python frameworks, including Streamlit and other existing frameworks: PyScript, HoloViz Panel, Shiny for Python, things like that. So there was a huge movement to make Python frameworks compatible with WebAssembly and the web browser environment.</p><p>[01:10:13] And I thought that was a great opportunity to make machine learning and data science stuff run completely in the web browser, including transformers and much more of the existing machine learning ecosystem. 
And I first created Streamlit Lite, which was a forked version of Streamlit for WebAssembly.</p><p>[01:10:36] And yeah, the remaining story is the same as what Abubakar introduced. So technically it was not my original idea; there was a huge movement around this kind of thing, and I simply followed that flow and transferred the same approach to the Gradio repository.</p><p>[01:10:58] <strong>Alex Volkov:</strong> Awesome. Thank you so much. Okay, so can we talk about what we can actually do now with the ability to run Gradio entirely in the browser? Could maybe both of you give some examples? And then I'd also like to add Xenova to the conversation, because much of this stuff is using Transformers.js, correct?</p><p>[01:11:18] Can we talk about what is now actually possible, compared to when I run Gradio on my own machine with a GPU and can run Stable Diffusion?</p><p>[01:11:27] <strong>Nisten Tahiraj:</strong> I just want to say, for the audience, to prepare: it's crazy that this can happen at all.</p><p>[01:11:32] <strong>Abubakar Abid:</strong> Yeah, I was honestly blown away the first time Yuichiro showed me a demo as well.</p><p>[01:11:36] Imagine you have any sort of machine learning model. Practically, not quite anything, but a really good speech recognition model, for example, running completely in your browser. Meaning that now you can take that demo and put it inside GitHub Pages.</p><p>[01:11:51] You can host it anywhere. We've seen people embed Gradio demos, now with Gradio Lite, inside Notion. So you have a Notion page, you can take that demo and embed it inside Notion. One of the things we launched alongside Gradio Lite is something called the Gradio Playground.</p><p>[01:12:07] Now, the Gradio Playground, you can actually just Google this, you can find it. 
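The "wrap the same Python inside gradio-lite tags" idea that Abubakar described amounts to a single static HTML page, which is why it can live on GitHub Pages or inside a Notion iframe. Below is a sketch based on the Gradio Lite quickstart; the CDN URLs and tag name are assumptions to verify against the current documentation:

```html
<html>
  <head>
    <!-- Gradio Lite bundle: loads Pyodide (CPython on WebAssembly) in the page -->
    <script type="module" crossorigin
            src="https://cdn.jsdelivr.net/npm/@gradio/lite/dist/lite.js"></script>
    <link rel="stylesheet"
          href="https://cdn.jsdelivr.net/npm/@gradio/lite/dist/lite.css" />
  </head>
  <body>
    <!-- The same Python you would write for server-side Gradio, wrapped in
         tags; it runs entirely in the visitor's browser, no server needed. -->
    <gradio-lite>
import gradio as gr

def reverse(text):
    return text[::-1]

gr.Interface(fn=reverse, inputs="textbox", outputs="textbox").launch()
    </gradio-lite>
  </body>
</html>
```

Because everything executes client-side via Pyodide, there is no backend to maintain, which is exactly the hosting concern the scikit-learn docs example below is about.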
But basically what it allows you to do is write code in the browser, and as you're editing the code, you can see live previews of your Gradio application. And basically what's happening is it's taking that Gradio code, wrapping it inside Gradio-Lite tags, and just running it.</p><p>[01:12:27] It's just a straightforward application of Gradio-Lite. And we're excited by this personally because, for one, it allows us to write interactive documentation. You can try stuff and immediately see the results. We're also excited because we've seen interest from other libraries, including, for example, scikit-learn, who want to embed Gradio demos within their documentation.</p><p>[01:12:49] Within their docs, right? But they were hesitant before, because they didn't want to have a separate server running these Gradio applications and have to worry about maintaining those servers, making sure they were up all the time, making sure they could handle the load. Now they can write the demos in their docs, and everything will just run in the user's browser.</p><p>[01:13:07] They won't have to worry about maintaining anything, since it's all in the same code base. So that's another cool application we're excited by: this potential for interactive documentation that maybe other maintainers or other libraries might want to include.</p><p>[01:13:22] So yeah, stuff like security, privacy, serverless-type stuff, hosting, and all of that. And then also these interactive docs.</p><p>[01:13:30] <strong>Alex Volkov:</strong> I think the demo that you mentioned with the translation within Notion, from VB from Hugging Face, I think that was great. 
I'm trying to find the actual link, but basically, because Notion allows you to embed basically iframes, right?</p><p>[01:13:42] So he embedded this whole Gradio-Lite interface to translate, I think using BERT or something very similar, and it all runs within the Notion page. I think that's awesome. Joshua, you want to chime in here and say how Transformers.js is built into this, and how this now allows way more people to use Transformers in a UI way?</p><p>[01:14:02] Transformers.js integration with Gradio-Lite</p><p>[01:14:02] <strong>Xenova:</strong> Yeah, sure. So first of all, almost everything that we are talking about now has been, like,</p><p>[01:14:12] <strong>Abubakar Abid:</strong> led by the Gradio team.</p><p>[01:14:14] <strong>Xenova:</strong> And I am here piggybacking and being like, whoa, look at this, Transformers.js is now working. That's really not what we're talking about today.</p><p>[01:14:23] It's the amazing work that the team has been able to achieve. This has been going on for quite a while; it was codenamed something like Gradio-Wasm and is now finally being released as Gradio-Lite. And the Transformers.js side of it is just, oh, by the way, there's this library called Transformers.js, you can sort of use it with it.</p><p>[01:14:48] Oh, sorry. You've been way too humble.</p><p>[01:14:51] <strong>Abubakar Abid:</strong> No, no, absolutely</p><p>[01:14:52] <strong>Xenova:</strong> not. I think so much has been done by you and the amazing Gradio team; it just so happens that these things are coinciding. And now you can end up using Transformers.js with Gradio and Gradio-Lite.</p><p>[01:15:07] And obviously this is also made possible by... okay, everyone, stick with me, it's going to get a little complicated when I try to explain this. 
But transformers.js.py, which is, are you ready? A JavaScript port of a Python library, turned back into something that can run in a Python environment.</p><p>[01:15:29] Okay, are we all caught up? That's transformers.js.py, which Yuichiro wrote, obviously with his experience bringing Streamlit to the browser. It's sort of his invention, which is quite funny, but that's how Transformers.js is able to run</p><p>[01:15:49] inside Gradio-Lite. There are other ways, but from what you'll see in the documentation, that's sort of the go-to way. And it's</p><p>[01:15:57] <strong>Alex Volkov:</strong> Yeah, I want to ask about this, because I saw on Discord something like "from transformers.js import transformers.js".</p><p>[01:16:04] So maybe you could talk about this part that Xenova tried to explain, because it was a little complex. Transformers.js you can install through npm and then run, right? And then it runs in the Node environment and the browser environment. Gradio-Lite is basically Python within JavaScript.</p><p>[01:16:19] So then you have to turn Transformers into Python in order to get it into Gradio-Lite, so that it runs within the JavaScript context again? Is that correct? Am I getting this right?</p><p>[01:16:30] <strong>Nisten Tahiraj:</strong> If I could say something for the audience: what's happening here is that there's a layer called Pyodide, and it uses WebAssembly to run Python at near-native speed.</p><p>[01:16:44] So it runs in the browser. And as you go down that stack, there's a virtual machine and a compiler and all that stuff in there, and that's how Python is able to run at near-native speed. And this means that with PyScript, you can have, inside the same index.html, your JavaScript code and your objects and stuff.</p><p>[01:17:06] And you can have just straight Python code in there. You just add the tag and dump the Python in as is, nothing else. And the crazy part is that it can access JavaScript objects now. So you can do the math in Python in the browser, because JavaScript can't do math well, and then you can access those objects.</p><p>[01:17:30] So this is a whole crazy stack here, with Pyodide and Emscripten. And again, that's only WebAssembly, so that's CPU-only for now, because there's still a mountain of work left. And to finish it off, Emscripten is like your POSIX layer, your Unix layer. There's basically an operating system being built inside the browser here.</p><p>[01:17:54] So that's why things are getting complicated, but just to keep that in mind, that's the base.</p><p>[01:17:59] <strong>Yuichiro:</strong> Yeah, what Nisten talked about is everything, because we can access JS objects from the Python world inside the browser. If you import transformers.</p><p>[01:18:10] js.py on Gradio-Lite, under the hood Transformers.js is still being imported in the browser environment. When you write Python code as a Gradio-Lite application in the browser, what you do is simply use the original JavaScript version of Transformers.js, just proxied from the Python code through the proxying mechanism provided by Pyodide.</p><p>[01:18:42] What transformers.js.py does is just a thin proxying layer, some glue code bridging these two worlds, Python and JavaScript. That's it.</p><p>[01:18:56] <strong>Abubakar Abid:</strong> Yeah, just zooming out a little bit. Basically, transformers.js underscore py lets you run everything that Transformers.</p><p>[01:19:03] js does. And what Transformers.js does is let you run a lot of the models, a lot of the tasks. 
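The "thin proxying layer" idea can be sketched in plain Python. This is only an illustration of the concept, not the real transformers.js.py implementation: `JsProxy` and `fake_js_module` are inventions standing in for Pyodide's real Python-to-JavaScript bridge and the real Transformers.js module.

```python
# Illustrative sketch of a thin proxy layer between Python and a JS-like
# object. In a real browser, Pyodide exposes actual JavaScript objects;
# here a dict of functions stands in for the JavaScript side.

class JsProxy:
    """Forward attribute access from Python to a JS-like object."""

    def __init__(self, js_object):
        self._js = js_object

    def __getattr__(self, name):
        # In Pyodide, this lookup would cross the Python/JS boundary.
        return self._js[name]

# Stand-in for the JavaScript transformers.js module (hypothetical).
fake_js_module = {
    "pipeline": lambda task: (lambda text: {"task": task, "input": text}),
}

transformers = JsProxy(fake_js_module)
pipe = transformers.pipeline("sentiment-analysis")
print(pipe("I love ThursdAI"))
```

Python code calls `transformers.pipeline(...)` as if it were a native module, while every call is actually serviced by the JavaScript side; that is the whole trick the glue code performs.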
There are a lot of models there that you can now run in your browser, right? We're talking about all of the NLP-related tasks, things like translation and LLMs, but also a lot of the vision tasks and a lot of the audio stuff.</p><p>[01:19:22] We're talking about speech recognition that's powered by Transformers, what Josh has been doing with Transformers.js. And I think Transformers.js just released, for example, speech generation, text to speech. So now you can do that within Transformers.js, which means you can do it within transformers.</p><p>[01:19:34] js.py, which means now you can do it within Gradio-Lite as well.</p><p>[01:19:40] <strong>Alex Volkov:</strong> That's incredible. And I think the biggest part for me is that now you guys have ported Gradio, which is ubiquitous in machine learning (everybody who releases a model uses either this or Streamlit, but I think it's the clear winner between the two, as far as I'm concerned), to the browser. And we've always been talking about models getting smaller, models being loaded into the browser.</p><p>[01:20:08] Browsers</p><p>[01:20:09] <strong>Abubakar Abid:</strong> getting more powerful and,</p><p>[01:20:10] <strong>Alex Volkov:</strong> and WebGPU making browsers more powerful, yeah. And I'm getting to WebGPU, because we have Arthur here on stage, and I would love to introduce you guys, unless you're already familiar. The more we see this move, the more the need for something like a built-in component library becomes very interesting.</p><p>[01:20:25] Even though this world already has a bunch of libraries. But with this, you're also porting the people with Gradio experience, right? With the existing frameworks, with the existing Gradio interfaces, to this world. 
I find it very exciting, so thank you.</p><p>[01:20:38] And I want to introduce Arthur. Arthur, feel free to unmute yourself, maybe introduce yourself briefly, and then feel free to chime in to this conversation.</p><p>[01:20:46] <strong>Arthur Islamov:</strong> Okay, so I did quite a lot of things with ONNX to create the Diffusers.js library and to load Stable Diffusion in the browser, and now I'm working on the SDXL version.</p><p>[01:20:58] So I was going to ask, do you know if there are plans to add a WebGPU backend for PyTorch? Because when that happens, it'll be so much easier, as a WebGPU backend can be launched on any platform, not just in the browser but also locally without the browser, using the Metal backend, or DirectX, or Vulkan on Linux.</p><p>[01:21:28] So I guess when that happens, we'll enter a whole new era, as you'll be able to run those PyTorch models in the browser with GPU acceleration.</p><p>[01:21:40] <strong>Xenova:</strong> I can tag on to this. The TL;DR of it is: it's not at the point</p><p>[01:21:46] where I'm comfortable upgrading the ONNX Runtime Web runtime to support the WebGPU backend right now, just because there are quite a few issues still left to solve before we get to the point where you can run these models completely on WebGPU.</p><p>[01:22:07] The main issue at the moment is that, when you're generating text, a lot of the buffers aren't reused properly when you start decoding. 
That's leading to quite a massive performance bottleneck, just because you're transferring memory between CPU and GPU every single time you're decoding.</p><p>[01:22:31] So that's not quite there yet. However, with things like image classification, and encoder-only models in general, those are getting quite good; BERT is pretty fast. With Segment Anything, when you're just doing the encoding step, the ONNX Runtime team has got it to the point where it used to take around 40 seconds and now it takes around 4 seconds.</p><p>[01:22:55] And that's currently being worked on in a dev branch of Transformers.js, just making sure the integration is working. It's almost there. I keep saying it's almost there, but the amazing Microsoft team has been really working hard on this. And if you just look at the commit history on GitHub, microsoft/onnxruntime, and you go to the web version,</p><p>[01:23:18] there are just so many amazing people working on it, and it's slowly getting to the point where... this will be released with Transformers.js version 3, when we upgrade the ONNX Runtime version to probably 1.17, which will be the next one; it's currently 1.16.1. And literally, from the user's perspective, it's as simple as adding a line of code saying, basically, use WebGPU instead of WebAssembly.</p><p>[01:23:46] And in the case where it's not supported, it'll fall back to the WebAssembly implementation. And this will completely transfer to how Gradio-Lite works, just because, as was mentioned, it uses Transformers.js under the hood. 
So any benefits that you see in Transformers.js, you'll see in transformers.js.py, which you'll see in Gradio-Lite, which is great.</p><p>[01:24:11] TL;DR: coming soon. It's an annoying answer to give, but it's so close. And I guess this is also good because it aligns with the time when more browsers will support WebGPU without flags. I know Chrome is leading the charge, along with other Chromium-based browsers.</p><p>[01:24:30] But if you look at things like Safari and Firefox, they're quite far behind, to the point that it's not ready for mass adoption yet. But once it is, and once the WebGPU support in the ONNX Runtime backend has improved, you'll definitely be seeing that in Transformers.js.</p><p>[01:24:48] So hopefully that answers the question. I</p><p>[01:24:52] <strong>Nisten Tahiraj:</strong> think stuff's about to get crazy on the front end because of this, because you have all your WebGL stuff, your maps, your 3D, your games. Now you can have an LLM generate code for them, manipulate those objects, move stuff around on screen in 3D, and the AI</p><p>[01:25:14] does all of that within your machine. But I do want to say that for Pyodide itself, it might take a long time to get WebGPU support, because it depends on Emscripten. And if you want to do anything with Python, like open a file, write a file, output a file, you can only do what Emscripten gives you, and Emscripten is like the base layer</p><p>[01:25:39] of the operating system. It fools your apps into thinking that there's an operating system there when there isn't. And as far as I've seen, like two, three months ago, WebGPU support was really early on, and it might take a while for Emscripten to support that. 
So you're going to have to do that other ways, by going straight to WebGPU instead of through that layer.</p><p>[01:26:06] So it might get a bit complex</p><p>[01:26:09] <strong>Alex Volkov:</strong> there. I agree that stuff is about to get crazy. Go ahead, Arthur, and then we'll follow up on Gradio 4, and then we'll conclude.</p><p>[01:26:18] <strong>Nisten Tahiraj:</strong> Yeah, I</p><p>[01:26:18] <strong>Arthur Islamov:</strong> just wanted to say that yesterday, or a few days ago, I saw this distilled Stable Diffusion model. I saw that they previously released, not the XL version, but the ordinary 2.1 or something like that, the distilled one.</p><p>[01:26:35] So I'm thinking of trying to make my demo work with that distilled model without 64-bit, so with just ordinary 32-bit, which will work in almost any browser without any additional flags or launching with special parameters.</p><p>[01:26:54] <strong>Alex Volkov:</strong> Yeah. Arthur, you can't just mention an item on my updates list and not talk about it, right?</p><p>[01:26:59] Folks, let me just briefly cover what Arthur said. A company called SegMind introduced a distilled version of SDXL. SDXL, Stable Diffusion XL, is something they released a while ago; we've covered it multiple times. It has way better quality generations, obviously, but also way better text understanding, right?</p><p>[01:27:20] And it has two parts; there's a refiner part in addition. And so this company basically distilled it. Distillation we've talked about multiple times before: it's when you train your own model on data generated by a stronger model, the way people do with GPT-4, so you basically distill its smartness into your own model.</p><p>[01:27:37] They did this for SDXL. They call it SegMind Stable Diffusion 1B, and it's 50 percent smaller and 60 percent faster than SDXL. 
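The distillation idea described above can be sketched as a toy computation: a big "teacher" model's softened output distribution becomes the training target for a smaller "student". The numbers and the temperature value here are made up for illustration; this is the generic knowledge-distillation loss, not SegMind's actual training recipe.

```python
import math

# Toy sketch of distillation: the teacher's soft predictions become
# the student's training targets. All numbers are illustrative.

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [4.0, 1.0, 0.5]   # what the big model predicts
student_logits = [3.0, 1.5, 0.2]   # what the small model currently predicts

# A higher temperature softens the teacher's distribution so the student
# also learns how the teacher ranks the "wrong" answers.
T = 2.0
teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)

# Cross-entropy of the student against the teacher's soft targets:
# the quantity a distillation loss drives down during training.
loss = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
print(round(loss, 4))
```

Training repeatedly nudges the student's logits to shrink this loss, which is how the "smartness" of the larger model ends up in the smaller, faster one.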
Again, just to put in some time frames: what Abubakar and I talked about, where I first experienced Gradio, that was Stable Diffusion 1.4, a year and a couple of months ago.</p><p>[01:27:57] Since then, we got multiple iterations of Stable Diffusion, then there's SDXL, which is the XL version. It generates 1024 by 1024 images. And now, a few months after they released that, we have a version that's 50 percent smaller and 60 percent faster.</p><p>[01:28:16] And so what Arthur is now talking about, Diffusers.js, is the ability to load Stable Diffusion in the browser. Now there's a model that's half the size and 60 percent faster, which is good for the browser context. So I pinned it to the top of the tweet. Check out SegMind, it's definitely super cool.</p><p>[01:28:34] And the advancements that we see from week to week, this is obviously super cool as well. And Arthur, sorry to interrupt with this, but you had one of my tasks that I had to finish before we wrap up and talk about this. So have you already introduced it to Diffusers.js? Have you tried it?</p><p>[01:28:52] <strong>Nisten Tahiraj:</strong> I have</p><p>[01:28:52] <strong>Arthur Islamov:</strong> tried to convert it to ONNX, but it didn't work, or maybe some of my code didn't work. So I guess I will try again on the weekend, and yeah, most likely I will get it running.</p><p>[01:29:06] <strong>Alex Volkov:</strong> I think we had some folks from SegMind react, so let's try to connect there and hopefully get it running as well, so that we all benefit. And I guess, maybe as the last part of this conversation, Abubakar, and thank you for joining, Yuichiro, and Ali and the folks from Hugging Face, it is great to see all of you. I think you mentioned some folks that joined before, like VB and some other folks from Hugging Face. 
We're big fans here on ThursdAI, and we always welcome you guys.</p><p>[01:29:33] Could you talk about what's coming in version four? Because I think you gave us one tidbit, but give us an update on that. I would love to hear.</p><p>[01:29:40] <strong>Abubakar Abid:</strong> Yeah, yeah, definitely. So we're launching Gradio 4.0 on Tuesday, October 31st. And basically, the team has been working very hard. You mentioned earlier that people are building these very, very complex apps with Gradio, honestly, stuff that we did not anticipate when we were designing Gradio.</p><p>[01:29:57] And more and more, what we want to do is almost take ourselves out of this feedback loop and let people build what they want to build, let the community build whatever you imagine and just be able to put that in a Gradio app. Let me be a little bit more concrete.</p><p>[01:30:11] So what is Gradio 4.0 going to introduce? For example, it's going to introduce the idea of custom components. So if you know a little bit of Python and a little bit of JavaScript, you can build your own component and use it within a Gradio app, just like you use our 30 or so built-in components.</p><p>[01:30:27] Speaking of the built-in components, we're redesigning some of the components from scratch, particularly the media components. Things like image, audio, and video are going to be much, much nicer, and they're going to be fully accessible. One of the things we're realizing is that at Gradio, we're not just building a product for a specific audience; we're building tools that let people build apps for many different audiences.</p><p>[01:30:50] And so we want to make sure that all of the core components are accessible, so that it's easy to do the right thing and build accessible web applications. 
We're also switching over from WebSockets to server-sent events. There are several reasons for this, and we'll talk more about it on Tuesday.</p><p>[01:31:07] We're having a long livestream as well, but there are several reasons why server-sent events are the way to go for Gradio. That's more of an internal refactor; you probably won't notice things, though you might notice some speedups in certain situations. It'll unlock a lot of things later on.</p><p>[01:31:22] We're open sourcing the share-links process, the share servers at Gradio. So everyone will be able to set up their own custom share links. So instead of whatever.gradio.live, you can have whatever custom URL you want for your share links, on your own domain.</p><p>[01:31:42] And then a lot of other changes as well; we'll talk more about that on Tuesday. The team has been working super hard, so I'm excited to get it out for you guys to try.</p><p>[01:31:51] <strong>Alex Volkov:</strong> That's so awesome, and I can't wait to see this. I think the share links are such a powerful virality thing, once people start adding this to their domains and start running different Gradio interfaces within Colab, outside of Colab, with their own domains.</p><p>[01:32:08] I think it's going to be super cool, especially if they don't expire. I've received many of these links over DMs from multiple people, I think even people in the audience. And adding them to custom domains, thank you for open sourcing that. 
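For context on the WebSockets-to-SSE switch mentioned above, a server-sent event is just text over a long-lived HTTP response: each message is a frame of "field: value" lines ended by a blank line. The helper below is a generic sketch of that wire format from the SSE specification, not Gradio's actual implementation.

```python
# Sketch of the server-sent events (SSE) wire format. Frames travel over
# one HTTP response with Content-Type: text/event-stream; the browser's
# EventSource API reassembles them and fires one event per frame.

def sse_frame(data, event=None):
    """Format one server-sent event frame."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Multi-line payloads become repeated "data:" lines.
    lines.extend(f"data: {part}" for part in data.splitlines())
    return "\n".join(lines) + "\n\n"

print(sse_frame("generation step 1", event="progress"))
```

Because SSE is plain HTTP, it passes through proxies and load balancers that often break WebSocket upgrades, which is one common reason libraries make this switch.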
That's great.</p><p>[01:32:21] <strong>Abubakar Abid:</strong> I think part of it is also that we want to reduce the load on our share servers.</p><p>[01:32:25] We're getting too many of these links being created</p><p>[01:32:27] and stuff.</p><p>[01:32:27] <strong>Alex Volkov:</strong> Yes, absolutely. And I think the accessibility features are great. Folks, definitely follow Abubakar, follow Yuichiro, and the folks on stage, Ali as well, to stay tuned to what's coming to Gradio, and then make sure to update your Gradio interfaces to the new accessible ones, because what you're building is no longer just demos.</p><p>[01:32:46] Every new model is getting a Gradio interface, and accessibility is very important at the level of web applications. With that, I want to thank you guys for coming up and sharing Gradio-Lite with us, which is very much in accordance with what we love to talk about here: open source, open source LLMs, on-device inference, and taking control of your own LLMs.</p><p>[01:33:07] I think, Nisten, you briefly talked about how crazy it's going to be when there's an LLM built into your website or web application that runs on the GPU of your device and is able to do stuff, and you can interact with it basically offline. That's great.</p><p>[01:33:27] Now that we've concluded the interview with the Gradio folks: one of the things that we love most of all on ThursdAI is breaking news, and we actually have some breaking news. Nisten, go ahead, please, present the breaking news that you just sent.</p><p>[01:33:39] <strong>Nisten Tahiraj:</strong> I pasted a Gradio space above.</p><p>[01:33:43] If you click on it, that's what it is. It's Coqui's new release, a new voice model. This is huge, because they're allowing fine-tuning on their voice model. 
And one criticism of the open source voice models has been that the dataset for training them has been of poor quality, the microphones and the datasets that people use to train the models have been bad.</p><p>[01:34:12] So this is pretty important in that regard, because it's one of the very few, there's the one that Xenova released, and the Coqui one, that are open source and usable when it comes to text to speech, that are somewhat pleasant, and that run relatively fast. Otherwise, it's pretty hard to have text to speech.</p><p>[01:34:37] Yeah, and the part that you can fine-tune: they open sourced the fine-tuning code. So go there and</p><p>[01:34:43] <strong>Alex Volkov:</strong> get that, yeah. Thank you, Nisten. The folks from Coqui, when they released XTTS, which is the open source text to speech. We know ElevenLabs, we know Play.ht, we know OpenAI has one that Spotify uses for translation, and OpenAI hasn't released it.</p><p>[01:34:59] We'll see next week if they're going to give us an API for that. All of those require a lot of money; ElevenLabs is basically rolling in cash, because everybody wants to get their AIs to talk, right? And so previously here we talked about the listening part; we've talked about the streaming from Gradio, where now you can have Whisper basically streaming.</p><p>[01:35:18] The other part of that was: hey, once your LLM listens and thinks, which is the inference part, you also want it to talk to you. And TTS, text to speech, is the way to do that. And Coqui, we had a chat with Joshua when they released XTTS, which was very exciting. And now, live on stage, live on ThursdAI,</p><p>[01:35:34] because this is why ThursdAI exists, many people release stuff on Thursdays, there is their own fine-tuning with minutes of data. 
So you can create a voice. Maybe this is going to be disappointing for folks here on stage, but everybody who spoke on stage for more than a minute is now basically public for everybody else to take their voice and clone it with XTTS.</p><p>[01:35:56] It was possible before, somebody just had to pay money for it, but now... And Ali's laughing because Ali didn't talk yet, but everybody's basically going to get voice cloned. It's very easy. We're going towards this future, and if this future scares you, there's no escape from it. Even VALL-E from Microsoft, when it was released, they talked about maybe 30 seconds of voice being enough to clone.</p><p>[01:36:18] But XTTS now gives us basically a framework, and they said you can even add new languages to Coqui, to XTTS, and then use this. Xenova, can we use Coqui within Transformers.js, or not yet? I think we can. Not yet. Not yet. Okay, so soon you'll be able to do this all completely within the browser,</p><p>[01:36:41] hopefully once the integration with WebGPU lands. So here we have it, folks. We had an incredible ThursdAI today. We started by talking with Bo and the folks from the Jina embeddings team, who released, how should I say, the open source embeddings most comparable to OpenAI's.</p><p>[01:36:59] And that was great. Bo actually gave us a masterclass in how embeddings work, and the Jina embedding models are available now. We talked with Abubakar and Yuichiro and Xenova and Arthur and everyone on stage, the team behind Gradio. If you haven't used Gradio, you probably have used Gradio, you just didn't know that it's Gradio. And this interface, or rather library, that started for demos only has scaled all the way up to something like Automatic1111, where multiple people contribute thousands of contributions, including, I think, NVIDIA and IBM now. 
It's like full businesses run on this, quote unquote, component library.</p><p>[01:37:37] And I just want to invite you to join ThursdAI next week as well, because some of this was planned, but definitely not all of it. This is the way to stay up to date. And next week we're going to see some more incredible things; I think some very interesting things are coming up.</p><p>[01:37:52] I will have a personal announcement to make that's going to be very surprising to some folks here on stage. But we'll definitely keep ThursdAI going, significantly more. And with that, I just want to thank you for joining us. It's been a pleasure to have a space where the Gradio folks and the Jina folks can come and talk about what they released,</p><p>[01:38:12] and we can actually ask them questions. I want to thank everybody who joined on stage. Nisten, thank you as always for joining, Xenova, Arthur. We were joined by new folks that we'll introduce next time. Thank you so much for joining us this week, and obviously thank you to the folks in the audience who join every week; I see Enrico in there, and Junaid and Tony and some other folks that I love to see from week to week.</p><p>[01:38:33] If you missed any part of this, any part at all, or if the internet connection got stuck for you: ThursdAI is a live recording, but then it gets released as a podcast episode. If you're subscribed, and you should already be subscribed to ThursdAI on Apple or Spotify, you'll get this episode, hopefully very quickly edited, if I don't get lost in some other interesting stuff, like Coqui.</p><p>[01:38:57] Thank you. We will also release a newsletter with all the links and the conversations with the Gradio team and Bo, and all the updates in the form of links as well. And with that, I thank you for joining. It's been two hours. 
It's been a lovely time and now I need to go and actually edit the podcast.</p><p>[01:39:12] See you here next week. Thank you and yeah, please share with your friends as much as possible. The more crowd there is, the better these will be. And yeah, help and participate. Thank you all and have a good rest of your week. Bye bye.</p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-26-jina-embeddings-sota</link><guid isPermaLink="false">substack:post:138319368</guid><dc:creator><![CDATA[Alex Volkov, Bo, Abubakar Abid, Xenova, and Nisten]]></dc:creator><pubDate>Thu, 26 Oct 2023 22:31:43 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/138319368/1c5ca0c10932e8bad66c20baea57217d.mp3" length="95719600" type="audio/mpeg"/><itunes:author>Alex Volkov, Bo, Abubakar Abid, Xenova, and Nisten</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5982</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/138319368/954f11e140a74b6b62b9879722a0dd6a.jpg"/></item><item><title><![CDATA[🔥 ThursdAI Oct 19 - Adept Fuyu multimodal, Pi has internet access, Mojo works on macs, Baidu announces ERNIE in all apps & more AI news]]></title><description><![CDATA[<p>Hey friends, welcome to ThursdAI Oct - 19. Here’s everything we covered + a little deep dive after the TL;DR for those who like extra credit. 
</p><p>ThursdAI - If you like staying up to date, join our community</p><p>Also, here’s the reason why the newsletter is a bit delayed today: I played with Riffusion to try and get a cool song for ThursdAI 😂</p><p>ThursdAI October 19th</p><p>TL;DR of all topics covered: </p><p>* <strong>Open Source MLLMs</strong> </p><p>* Adept open sources Fuyu 8B - multimodal model trained on understanding charts and UI (<a target="_blank" href="https://x.com/AdeptAILabs/status/1714682075763405257?s=20">Announcement</a>, <a target="_blank" href="https://huggingface.co/adept/fuyu-8b">Hugging Face</a>, <a target="_blank" href="https://huggingface.co/spaces/adept/fuyu-8b-demo">Demo</a>)</p><p>* Teknium releases Open Hermes 2 on Mistral 7B (<a target="_blank" href="https://twitter.com/Teknium1/status/1714010838959612329">Announcement</a>, <a target="_blank" href="https://huggingface.co/teknium/OpenHermes-2-Mistral-7B">Model</a>)</p><p>* NEFTune - a "one simple trick" to get higher quality finetunes by adding noise (<a target="_blank" href="https://x.com/tomgoldsteincs/status/1712498076340932855?s=20">Thread</a>, <a target="_blank" href="https://github.com/neelsjain/NEFTune">GitHub</a>)</p><p>* Mistral is on fire, most fine-tunes are on top of Mistral now</p><p>* <strong>Big CO LLMs + APIs</strong></p><p>* Inflection Pi got internet access & new therapy mode (<a target="_blank" href="https://twitter.com/inflectionAI/status/1714018923916534226">Announcement</a>)</p><p>* Mojo 🔥 is working on Apple silicon Macs and has LLaMa.cpp level performance (<a target="_blank" href="https://twitter.com/Modular_AI/status/1714020585775448473">Announcement</a>, <a target="_blank" href="https://twitter.com/tairov/status/1714103695321829551">Performance thread</a>)</p><p>* Anthropic Claude.ai rolled out to an additional 95 countries (<a target="_blank" href="https://x.com/AnthropicAI/status/1714025126516432996?s=20">Announcement</a>) </p><p>* Baidu AI announcements - ERNIE 4, multimodal foundational 
model, integrated with many applications (<a target="_blank" href="https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-0-foundation-model-leading-a-new-wave-of-ai-native-applications-301958681.html">Announcement</a>, <a target="_blank" href="https://x.com/Baidu_Inc/status/1714185973318443024?s=20">Thread</a>)</p><p>* <strong>Vision</strong></p><p>* Meta is decoding brain activity in near real time using non intrusive MEG (<a target="_blank" href="https://twitter.com/AIatMeta/status/1714635316554772716">Announcement</a>, <a target="_blank" href="https://ai.meta.com/blog/brain-ai-image-decoding-meg-magnetoencephalography/">Blog</a>, <a target="_blank" href="https://ai.meta.com/static-resource/image-decoding">Paper</a>)</p><p>* Baidu YunYiduo<strong> </strong>drive - Can use text prompts to extract precise frames from video, and summarize videos, transcribe and add subtitles. (<a target="_blank" href="https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-0-foundation-model-leading-a-new-wave-of-ai-native-applications-301958681.html">Announcement</a>)</p><p>* <strong>Voice & Audio</strong></p><p>* Near real time voice generation with <a target="_blank" href="http://play.ht">play.ht</a> - under 300ms (<a target="_blank" href="https://x.com/play_ht/status/1714382990523167197?s=20">Announcement</a>)</p><p>* I'm having a lot of fun with Airpods + chatGPT voice (<a target="_blank" href="https://x.com/altryne/status/1714375314036645938?s=20">X</a>)</p><p>* Riffusion - generate short songs with sound and singing (<a target="_blank" href="https://www.riffusion.com/riffs/9d0d122c-71e4-4ddd-bcff-0b6ebd2c3a75">Riffusion</a>, <a target="_blank" href="https://twitter.com/altryne/status/1715110205095125007">X</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Adobe releases Firefly 2 - lifelike and realistic images, generative match, prompt remix and prompt suggestions (<a target="_blank" href="https://x.com/mreflow/status/1711825046719856795?s=20">X</a>, 
Firefly)</p><p>* DALL-E 3 is now available to all chatGPT Plus users (<a target="_blank" href="https://x.com/OpenAI/status/1715050642560151963?s=20">Announcement</a>, <a target="_blank" href="https://cdn.openai.com/papers/dall-e-3.pdf">Research paper</a>!) </p><p>* <strong>Tools</strong></p><p>* LMStudio - a great and easy way to download models and run on M1 straight on your mac (<a target="_blank" href="https://lmstudio.ai/">Download</a>)</p><p>* <strong>Other</strong></p><p>* ThursdAI is adhering to the techno-optimist manifesto by Pmarca (<a target="_blank" href="https://a16z.com/the-techno-optimist-manifesto/?utm_source=substack&#38;utm_medium=email">Link</a>)</p><p>Open source MLLMs</p><p>Welcome to the multimodal future with Fuyu 8B from Adept</p><p>We've seen and covered many multimodal models before, and in fact, most models will start being multimodal, so get ready to say "MLLMs" or... until we come up with something better. </p><p>Most of them so far have been pretty heavy; IDEFICS was 80B parameters, etc. </p><p>This week we received a new 8B multimodal model with great OCR abilities from Adept, the same guys who gave us Persimmon 8B a few weeks ago; in fact, Fuyu is a type of persimmon tree (we see you, Adept!)</p><p>In the podcast I talked about having 2 separate benchmarks for myself, one for chatGPT or any multimodal model coming from huge companies, and another for open source/tiny models. Given that Fuyu is a tiny model, it's quite impressive! Its OCR capabilities are strong, and the QA is really on point (as well as captioning)</p><p>An interesting thing about the Fuyu architecture is that, because it doesn't use the traditional vision encoders, it can scale to arbitrary image sizes and resolutions, and is really fast (large image responses under 100ms)</p><p>Additionally, during the release of Fuyu, Arushi from Adept authored <a target="_blank" href="https://x.com/itsamks/status/1714683180782071829?s=20">a thread</a> about how bad visualQA evaluation datasets are, which... 
they really are bad, and I hope we get better ones! </p><p>NEFTune - 1 weird trick of adding noise to embeddings makes models better (<a target="_blank" href="https://twitter.com/neeljain1717/status/1712125846553554987">announcement thread</a>)</p><p>If you guys remember, a "this one weird trick" was discovered by KaiokenDev back in June to extend the context window of LLaMa models, which then turned into RoPE scaling and YaRN scaling (which we covered in a <a target="_blank" href="https://sub.thursdai.news/p/thursdai-sunday-special-extending#details">special episode with the authors</a>) </p><p>Well, now we have a similar "1 weird trick" where just adding some noise to the embeddings at training time can grow model performance by up to 25%! </p><p>The results vary per dataset of course; however, considering how easy it is to try, it's worth a shot: </p><p>It's as simple as doing this in your forward pass
if training:
    # NEFTune: add uniform noise in [-1, 1], scaled by alpha / sqrt(seq_len * dim)
    # per the NEFTune paper (alpha is a tuning hyperparameter, e.g. 5)
    embed = orig_embed(x)
    noise = torch.empty_like(embed).uniform_(-1, 1) * alpha / math.sqrt(embed.numel())
    return embed + noise
else:
    return orig_embed(x)</p><p>We should be happy that the "free lunch" tricks like this exist. </p><p>Notably, we had a great guest, <a target="_blank" href="https://twitter.com/intent/follow?original_referer=https%3A%2F%2Fspacesdashboard.com%2F&#38;ref_src=twsrc%5Etfw%7Ctwcamp%5Ebuttonembed%7Ctwterm%5Efollow%7Ctwgr%5Ewinglian&#38;screen_name=winglian">Wing Lian</a> the maintainer of Axolotl, a very popular tool to streamline fine-tuning, chime in and say that in his tests, and among the Discord folks, they couldn't reproduce some of these claims (as they are adding everything that's super cool and beneficial for finetuners to their library), so it remains to be seen how far this "trick" scales, and what else needs to be done here. </p><p>Similarly, back when the context extension trick was discovered, there was a lot of debate about its effectiveness from Ofir Press (author of ALiBi, another context scaling method), and further iterations of the trick made it into a paper and a robust method, so this development is indeed exciting! </p><p>Mojo 🔥 now supports Apple silicon Macs and has LLaMa.cpp level performance!</p><p>I've been waiting for this day! We've covered Mojo from Modular a couple of times and it seems that the promise behind it is starting to materialize. Modular promises an incredible, unbelievable 68,000x boost over vanilla Python, and it's been great to see that develop.</p><p>Today (October 19) they released native support for Mojo on Apple silicon, which most developers use, and you can use it right now via the CLI. </p><p>A friend of the pod, Aydyn Tairov, hopped on the live recording and talked to us about his LLama.🔥 project (<a target="_blank" href="https://github.com/tairov/llama2.mojo">Github</a>) that he ported to Apple silicon, showing incredible, LLaMa.cpp-like performance without crazy optimizations! 
</p><p>Aydyn collected many LLaMa implementations, including Llama.cpp, llama2.c by Karpathy and many others, and included his LLama.mojo (or Llama.🔥), and saw that the Mojo one comes very, very close to LLama.cpp and significantly beats the Rust, Go and Julia examples (on specific baby llama models) </p><p>The Mojo future is bright, and we'll keep updating with more, but for now, go play with it! </p><p>Meta is doing near-real time brain → image research! 🤯</p><p>We've talked about fMRI (and EEG) signals being translated to diffusion imagery before, and this week, Meta has shown that while fMRI signals to brain imagery is pretty crazy on its own, using something called MEG (non-invasive magnetoencephalography) they can generate, and keep generating, images based on the brain signals, in near real time! </p><p>[TK video here]</p><p>I don't have a LOT to say about this topic, besides the fact that as an Aphant (I have Aphantasia) I can't wait to try this on myself and see what my brain actually "sees" </p><p>Baidu announces ERNIE and a bunch of AI-native products including maps, drive, autonomous ride hailing and more. </p><p>Baidu has just wrapped up their biggest conference of the year, BaiduWorld, where they announced a new version of their foundational model called ERNIE 4, which is multimodal (of unknown size) and is now integrated into quite a few of their products, many of which are re-imagined with AI. 
</p><p>A few examples beyond a basic LLM chat-like interface: a new revamped map experience with an AI assistant (with voice) built in to help you navigate and find locations; a new office management app called InfoFlow that handles appointments and time slots, and apparently even does travel booking; and an AI "Google Drive"-like product called YunYiduo, that is able to find video content based on what was said and when, and even pinpoint specific frames, summarize and do a bunch of other incredible AI stuff. Here's a translated video of someone interacting with YunYiduo and asking for a bunch of stuff one after another. </p><p>Disclosure: I don't know if the video is edited or in real time. </p><p>Voice & Audio</p><p>Real time voice for agents is almost here, chatGPT voice mode is powerful</p><p>I've spent maybe 2 hours this week with chatGPT in my ear, using the new voice mode + AirPods. It's almost like... being on a call with chatGPT. I started talking to it in the store, asking for different produce to buy for a recipe, then drove home and asked it to "prepare" me for the task (I don't usually cook this specific thing), and then during my cooking, I kept talking to it, asking for next steps. With the new iOS the voice mode shows up as a live activity and you can pause it and resume without opening the app: </p><p>It was literally present in my world, without me having to watch the screen or type. </p><p>It's a completely new paradigm of interaction when you don't have to type anymore, or pick up a screen and read, and it's wonderful! </p><p>Play.ht shows off an impressive <300ms voice generation for agents</p><p>After spending almost 2 hours talking to chatGPT, I was thinking, why aren't all AI assistants like this, and the answer was, well... 
generating voice takes time, which takes you out of your "conversation flow" </p><p>And then today, <a target="_blank" href="http://play.ht">play.ht</a> showed off a new update to their API that generates voice in <300ms, and that can be a clone of your voice, with your accent and all. We truly live in unprecedented times. </p><p>I can't wait for agents to start talking and seeing what I see (and remembering everything I heard, via Tab or Pendant or Pin) </p><p>Riffusion is addictive, generating song snippets with life-like lyrics!</p><p>We've talked about music gen before; however, Riffusion is a new addition and is now generating short song segments with VOICE! Here are a few samples, and honestly, I've procrastinated writing this newsletter because it's so fun to generate these, and I wish they went for longer! </p><p>AI Art & Diffusion</p><p>Adobe releases Firefly 2, which is significantly better at skin textures, realism, and hands. Additionally they have added a style transfer feature, which is wonderful: upload a picture of something with a style you'd like, and your prompt will be generated in that style. It works really, really well. The extra detail on the skin is just something else, and though I did cherry pick this example (the other hands were a dead give-away), the hands are getting better across the board! </p><p>Plus they have a bunch of prompt features, like prompt suggestions, the ability to remix other creations and more; it's really quite developed at this point. </p><p>Also: DALL-E is now available to 100% of Plus users and enterprise, have you tried it yet? What do you think? Let me know in the replies!</p><p>That’s it for October 19. If you're into AI engineering, make sure you listen to the previous week's podcast where Swyx and I recapped everything that happened on stage and off it at the seminal AI Engineer summit. And make sure to share this newsletter with your friends who like AI! 
</p><p>For those who are 'in the know', emoji of the week is 📣, please DM or reply with it if you got all the way here 🫡 and we'll see you next week (where I will have some exciting news to share!)</p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-19-adept-fuyu-multimodal</link><guid isPermaLink="false">substack:post:138118559</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 20 Oct 2023 02:22:07 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/138118559/4e53b3e098cdbfb458a4ce1f586525eb.mp3" length="86340794" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5396</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/138118559/2bdfa1bf9c9a468ff18811df1b3efb24.jpg"/></item><item><title><![CDATA[A week of horror, an AI conference of contrasts]]></title><description><![CDATA[<p>A week of horror, an AI conference of contrasts</p><p>Hi, this is Alex. In the podcast this week, you'll hear my conversation with Miguel, a new friend I made in <a target="_blank" href="http://AI.engineer">AI.engineer</a> event, and then a recap of the whole <a target="_blank" href="http://Ai.engineer">Ai.engineer</a> event I had with Swyx after the end. </p><p>This newsletter is a difficult one for me to write, honestly, I wanted to skip this one entirely, struggling to fit the current events into my platform and the AI narrative, however, decided to write one anyway, as the events of the last week have merged into 1 for me in a flurry of contrasts. 
</p><p>Contrast 1 - Innovation vs Destruction</p><p>I was invited (among a few other Israelis or Israeli-Americans) to the <a target="_blank" href="http://ai.engineer">ai.engineer</a> summit in SF, to celebrate the rise of the AI engineer, and I was looking forward to that very much. Meeting many of you (shoutout to everyone who listens to ThursdAI who I've met face to face!) and talking to new friends of the pod, interviewing speakers, meeting and making connections was a dream come true. </p><p>However, a few days before the conference began, in stark contrast to this dream, I had to call my mom, who was sheltering 20km from the Gaza strip border, to ask if our friends and family were alive and accounted for, and to hear sirens as rockets flew above her head, while Hamas terrorists murdered, pillaged and kidnapped, in what seems to be the 10x equivalent of the 9/11 terror attack, relative to population size. </p><p>I grew up in Ashkelon, rocket attacks are nothing new to me, we've learned to live with them (thank you, Iron Dome heroes), but this was something else entirely, a new world of terror. </p><p>So back to the conference: given that there's not a lot to be gained by doom scrolling and watching (basically snuff) films coming out of the region, and given that all my friends and family were accounted for, I decided not to give the terrorists what they want (which is to get people into a state of terror) and instead to choose to have compassion, without empathy, towards the situation, and not bring sadness to every conversation I had there (over 200 I think) </p><p>So participating at an AI event, which hosts and celebrates folks who are literally at the pinnacle of innovation, building the future, using all the latest tools, while also hurting and holding the dear ones in my thoughts, was a very stark contrast between past and future, and huge credit goes to Dedy Kredo, CTO of Codium, who was in the same position, and gave a hell of a talk, with a kick-ass (no backup recording!) 
demo live, and then shared this image: </p><p>This is his co-founder, Itamar, who was called to reserve duty to protect his family and country, sitting with his rifle and his dashboard, seeing destruction + creation, past and future, negativity and positivity all at once. As Dedy masterfully said, we will prevail 🙏 </p><p>Contrast 2 - Progress // Fear</p><p>At the event, Swyx and Benjamin gave me a media pass and free rein, and I asked to be teamed with a camera-person to go around the event and do some (not live) interviews. I was teamed with the lovely Stacey, from Chico, CA. Stacey has nothing to do with AI, in fact she's a wedding photographer, however she definitely listened with interest to the interviews I was holding, and to the speakers on stage. </p><p>While we were taking a break, I looked out the window and saw a driverless car (Waymo) zip by, and since they only started operating after I left SF 3 years ago, I hadn't yet had a chance to ride in one. </p><p>So I asked Stacey and some other folks if they'd like to go for a ride, and to my complete bewilderment, Stacey said "no 😳", and when I asked why not, she didn't want to admit it but then said that it's scary. </p><p>This struck me, and since that moment, I've had as many conversations with Stacey as I had with other folks who came to be AI.engineers, since this was such a stark contrast between progress and fear. I was basically walking, almost hand in hand, with a person who doesn't use or understand AI, and fears it, amongst the folks who are building the future, exist at the pinnacle of innovation and discuss how to connect more AI to more AI, and how to build complete autonomous agents to augment human productivity and bring about the world of abundance. 
</p><p>This contrast was echoed by several new friends of mine, who came to the <a target="_blank" href="http://AI.engineer">AI.engineer</a> and SF for the first time, from countries where English is not the first language, and where Waymos are not zipping about on the streets freely, and it highlighted for me how much of this shift is global, and how concentrated the decision making, the building, the innovation is, within the arena: SF, California, and the US. It's almost expected that AI is going to speak English, and to use/build it, we have to speak it as well, while most of the world doesn't use English as their first language. </p><p>Contrast 3 - Technological // Spiritual</p><p>This contrast was intimate and personal to me. You see, this <a target="_blank" href="http://ai.engineer">ai.engineer</a> event was the first event of this size, professional, with folks talking "my language", since I had burned out this summer. If you've followed for a while, you may remember we talked about LK-99 and superconductors, and I overclocked myself back then so much (scaling a whole other podcast, hosting 7 spaces in 2 weeks, creating a community of 1,500 and following all the news 24/7) that I just didn't want to go on speaking, doing spaces, recording podcasts... I was just done. 
</p><p>Luckily my friend Junaid sent me a meditation practice recording with the saying "fill your own cup, before you give out to others" </p><p>That recording led me to discover effortless meditation (via AirChat, DM me for details), and to go on a 5-day free virtual retreat, which changed my life in several very significant ways and just keeps going.</p><p>So back to the conference: coming from that place, a place of center, a place where I balance my own inner peace + interact with 150+ people (I think, I haven't really counted), equalizing energy among them, connecting many folks to many other folks, introducing and getting introduced, giving interviews, remembering names and faces, this was an incredible experience for me, and many of my new friends asked me what's going on, and why and how I meditate for an hour every day. </p><p>So the contrast between my own spiritual journey and the technological journey was not a very strong one; beyond 1 very... interesting and weird conversation I had, everyone else seemed to be into self improvement and spiritual growth. I've met many folks who went on vipassana retreats, had meditation practices themselves, or used to have one and wanted to come back to it. </p><p>Maybe it was self selection, but I really enjoyed seeing and hearing all the stories from folks, as I believe that, given contrast number 2, the way we talk about AI, the way we build AI, the way we then show AI to the world, needs to take into account compassion towards those who don't understand it, who fear it, and want nothing to do with it. </p><p>Not to mention that I doubt any of the doomers are walking on their own spiritual path. </p><p>I wanted to just send this newsletter, but as I always do, I pasted it into Claude.ai and asked for a review. And Claude suggested I end it on a positive note, because the middle was heavy, and even gave me some suggestions as to how to end it on a positive note. 
So instead of me doing it, here's an AI, doing a human job on ending something on a positive note: </p><p>[Claude] Here are some examples of how you could end your newsletter on a more hopeful, forward-looking note:</p><p>* Remind readers of the optimism and excitement you felt at the AI conference, and how seeing the potential of technology firsthand gives you hope for the future.</p><p>* Share an insight or realization you had about how to stay grounded and centered even amidst chaos - something your spiritual practices have taught you.</p><p>* Affirm that while destruction and violence will always be part of the human experience, you believe innovative thinkers will lead us to a brighter tomorrow.</p><p>* Note how connecting deeply with other humans, even strangers, fills you with hope. Finding common ground and shared humanity counterbalances the forces that divide.</p><p>* Point to a specific AI-related project or breakthrough that gives you hope for progress. e.g. medical advancements, climate change solutions, etc.</p><p>* Share an inspirational quote or verse that reflects resilience and perseverance in difficult times.</p><p>* End by expressing gratitude - for your community of readers, your loved ones, the chance to have profound human experiences, or your inner stillness amidst outer turmoil.</p><p>The key is ending on a note of hope, possibility, love, resilience - something that uplifts readers after processing weighty material. Let me know if these spark any ideas for how to conclude your newsletter on a high note!</p><p></p><p>P.S - I know this newsletter is not what you expected, and maybe not what you signed up for, and I deliberated if I even should write it and what if anything should I post on the podcast. However, this week was an incredibly full of contrast, of sadness and excitement, of sorrow and bewilderment, so I had to share my take on all this. 
</p><p>P.P.S - as always, if you read all the way to the end, dm me the ☮️ emoji </p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/a-week-of-horror-an-ai-conference</link><guid isPermaLink="false">substack:post:137935648</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 13 Oct 2023 19:21:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137935648/d574a47a898a1670d4b10bc9ff8e769a.mp3" length="85699336" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5356</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/137935648/98800b84fc739baefff4a81be60d308c.jpg"/></item><item><title><![CDATA[📅 ThursdAI Oct 4 - AI wearables, Mistral fine-tunes, AI browsers and more AI news from last week]]></title><description><![CDATA[<p></p><p>Boy am I glad that not all AI weeks are like <a target="_blank" href="https://sub.thursdai.news/p/thursdai-sep-28-gpt4-sees-speaks?r=2imipa&#38;utm_campaign=post&#38;utm_medium=web">last week</a>, where we had so much news and so many things happening that I was barely able to take a breath for the week! 
</p><p>I am very excited to bring you this newsletter from San Francisco this week, the AI mecca, the arena, the place where there are so many AI events and hack-a-thons that I don’t actually know how people get any work done!</p><p>On that topic, I’m in SF to participate in the <a target="_blank" href="https://ai.engineer/">AI.engineer</a> summit (by <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> and <a target="_blank" href="https://substack.com/profile/32664430-benjamin-dunphy">Benjamin Dunphy</a>) next week, to host spaces and interviews with the top AI folks here, and to discuss with the audience what an AI engineer is. If you have any questions you’d like me to ask, please comment with them and I’ll make sure to try to answer them. </p><p>ThursdAI - subscribe eh? ↴</p><p>Here’s a table of contents of everything we chatted about: </p><p>[00:00:00] Intro and welcome</p><p>[00:04:53] Alex in San Francisco - AI Engineer</p><p>[00:07:32] Reka AI - Announcing a new multimodal foundational model called Yasa-1 </p><p>[00:12:42] Google adding Bard to Google Assistant</p><p>[00:18:56] Where is Gemini? 
</p><p>[00:23:06] Arc browser adding Arc Max with 5 new AI features</p><p>[00:24:56] 5 seconds link AI generated previews</p><p>[00:31:54] Ability to run LLMs on client side with WebGPU</p><p>[00:39:28] Mistral is getting love from Open Source, </p><p>[00:48:04] Mistral Open Orca 7B </p><p>[00:58:28] Acknowledging the experts of ThursdAI</p><p>[01:01:14] Voice based always on AI assistants</p><p>[01:09:00] Airchat adds voice cloning based translation tech</p><p>[01:14:23] Effects of AI voice cloning on society</p><p>[01:21:32] SDXL IKEA LORA</p><p>[01:23:17] Brief Recap</p><p>Show notes: </p><p>Big Co</p><p>* <strong>Google - adding Bard to Google Assistant (</strong><a target="_blank" href="https://twitter.com/Google/status/1709582904446050642"><strong>Announcement</strong></a><strong>)</strong><em>Come on google, just give us Gemini already!</em></p><p>* <strong>Reka AI - Multimodal Yasa-1 from Yi Tay and team (</strong><a target="_blank" href="https://twitter.com/YiTayML/status/1709265184576204820"><strong>Announcement</strong></a><strong>)</strong><em>With Yi Tay from Flan/Bard fame as chief scientist! 
But I wasn’t able to test myself!</em></p><p>* <strong>Arc - first browser AI features (</strong><a target="_blank" href="https://x.com/altryne/status/1709650830016872512?s=20"><strong>My thread</strong></a><strong>, </strong><a target="_blank" href="https://screenstudio.lemonsqueezy.com?aff=mAlzE"><strong>Brief video review</strong></a><strong>, </strong><a target="_blank" href="https://arc.net/gift/baefb8b4"><strong>Arc Invite</strong></a><strong>)</strong><em>I love Arc, I recommend it to everyone I meet, now with AI preview features it’s even more a non brainer, strongly recommend if you like productivity</em></p><p>Open Source LLMs</p><p>* <strong>Mistral vs LLaMa 2 boxing match (</strong><a target="_blank" href="https://x.com/charliebholtz/status/1709626774038978831?s=20"><strong>link</strong></a><strong>)</strong><em>A fun little battle arena to select which responses you personally find better to see the difference between Mistral 7B and LLaMa 13B</em></p><p>* <strong>Mistral-7B-OpenOrca (</strong><a target="_blank" href="https://www.reddit.com/r/LocalLLaMA/comments/16y6r3x/a_7b_better_than_llama_65b_now_mistral_orca_is_out/"><strong>announcement</strong></a><strong>)</strong><em>The folks from Alignment labs do it again! Great finetune that comes very close (98%) to LLaMa 70B on benchmarks! </em></p><p>* <strong>SynthIA-7B-v1.3 - (</strong><a target="_blank" href="https://huggingface.co/migtissera/SynthIA-7B-v1.3"><strong>Huggingface</strong></a><strong>)</strong><em>An uncensored finetune on top of Mistral that Reddit claims is a great model, especially since a chain of thought is somehow built in apparently</em></p><p>VISION</p><p>* Radiologists thread about GPT-4 V taking over radiology (or maybe not?) 
(<a target="_blank" href="https://x.com/cxbln/status/1709585434689569066?s=20">Thread</a>)</p><p>Voice</p><p>* <strong>AirChat added voice clone + translation features (</strong><a target="_blank" href="https://www.getairchat.com/airchat/babelfish?r=6304e57d-c556-412d-a973-b0a071ee4b25"><strong>Room</strong></a><strong>, </strong><a target="_blank" href="https://x.com/petrikajander/status/1708852873872695310?s=20"><strong>Demo</strong></a><strong>)</strong><em>I’ve been an avid AirChat user (it’s Naval’s voice-based social media platform) for a while, and am very excited they are destroying language barriers with this feature! </em></p><p>* <strong>Tab was revealed in a great demo by Avi Schiffman (</strong><a target="_blank" href="https://x.com/altryne/status/1708488579314540595?s=20"><strong>Demo</strong></a><strong>)</strong><em>Go Avi! Rooting for you brother, competition makes folks stronger!</em></p><p>* <strong>Rewind announced the Rewind Pendant (</strong><a target="_blank" href="https://x.com/dsiroker/status/1708933247902892412?s=20"><strong>Announcement</strong></a><strong>)</strong><em>I ordered one, but Rewind didn’t announce a date for when this hits the market; going to be interesting to see how well they do!</em></p><p>AI Art and Diffusion</p><p>* <strong>IKEA LoRA - generate IKEA-style tutorials for everything with SDXL (</strong><a target="_blank" href="https://twitter.com/ostrisai/status/1707731736774562015"><strong>Announcement</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/ostris/ikea-instructions-lora-sdxl"><strong>HuggingFace</strong></a><strong>)</strong></p><p>* <strong>DALL-E 3 seems to be available to all Plus members now</strong></p><p><em>This week’s pod was generated by talking to chatGPT, it’s so fun, you gotta try it!</em></p><p></p><p>No long-form breakdown this week, but we covered a bunch of it in the show, and I highly recommend listening to it!</p><p>Don’t forget to follow me on X to be aware of the 
spaces live from ai.engineer event in SF, the conference will be live-streamed as well on youtube! </p><p>See you next week 🫡 </p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-oct-4-ai-wearables-mistral</link><guid isPermaLink="false">substack:post:137707014</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Thu, 05 Oct 2023 22:25:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137707014/7800bfa33e0d66a3cce0da05dc6e0c05.mp3" length="84812046" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5301</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/137707014/388d8eb5b7e14e6bd3b1e1288b4511bc.jpg"/></item><item><title><![CDATA[📅🔥ThursdAI Sep 28 - GPT4 sees, speaks and surfs, Cloudflare AI on GPUs,Mistral 7B, Spotify Translates, Meta AI everywhere, Qwen14B & more AI news from this INSANE week]]></title><description><![CDATA[<p>[00:00:00] Intro and welcome everyone</p><p>[00:00:52] GPT4 - Vision from OpenAI</p><p>[00:05:06] Safety concern with GPT4-V</p><p>[00:09:18] GPT4 can talk and listen as well</p><p>[00:12:15] Apple rumors, on device inference, and Siri</p><p>[00:17:01] OpenAI Voice Cloning Tech used in Spotify to translate podcasts</p><p>[00:19:44] On the risks of Voice Cloning tech being open sourced</p><p>[00:26:07] Alex statement on purpose of ThursdAI</p><p>[00:27:53] “AGI has been achieved internally”;</p><p>[00:32:10] OpenAI, Jonny Ive and Masa are rumored to be working on a 
hardware device</p><p>[00:33:51] Cloudflare AI - Serverless GPU on a global scale</p><p>[00:37:13] Cloudflare AI partnership with HuggingFace to allow you to run many models in your own</p><p>[00:40:34] Cloudflare announced the Vectorize DB and embeddings on edge</p><p>[00:46:52] Cloudflare AI Gateway - proxy LLM calls, caching, monitoring, statistics and fallback</p><p>[00:51:15] Part 2 - intro and recap</p><p>[00:54:14] Meta AI announcements, bringing AI agents to 3 billion people next month</p><p>[00:56:22] Meta announces EMU image model to be integrated into AI agent on every platform</p><p>[00:59:38] Meta RayBan glasses upgraded to spatial computing, with AI and camera access</p><p>[01:00:39] On the topic of smart glasses, Google Glass, and the society-wide acceptance to have</p><p>[01:05:37] Safety and societal implications of everyone having glasses and recording everything</p><p>[01:12:05] Part 3 - Open Source LLMs, Mistral, QWEN and CapyBara</p><p>[01:21:27] Mistral 7B - SOTA 7B general model from MistralAI</p><p>[01:23:08] On the topic of releasing datasets publicly and legal challenges with obtaining that</p><p>[01:24:42] Mistral GOAT team giving us a torrent link to a model with an Apache 2 license.</p><p></p><p>Truly, I’ve been doing these coverages in one form or another for the past 9 months, and <strong>I don’t remember a week this full</strong> of updates, news, state-of-the-art open source models and more.</p><p>So, here’s to acceleration (and me finally facing the fact that I need a niche, deciding what I’ll update on and what I won’t, and being transparent with all of you about it)</p><p>On a separate note, these past two weeks ThursdAI got exposure from Yann LeCun (RTs), was joined on stage by the VP of DevRel at Cloudflare and their counterpart at HuggingFace, had the CEO of Anaconda on stage this episode, and had the chief scientist of Mistral join in the audience 😮 ThursdAI is really shaping up to be the place where this community meets, and I 
couldn’t be more humbled by and proud of the show, the experts on stage who join from week to week, and the growing audience 🙇‍♂️ OK, now let’s get to the actual news!</p><p>ThursdAI - Weeks like this one highlight how important it is to stay up to date on AI news. Subscribe, I’ve got some cool stuff coming! 🔥</p><p>All right so here’s everything we’ve covered on ThursdAI, September 28th:</p><p>(and if you’d like to watch the episode video with the full transcript, <a target="_blank" href="https://thursdai.news/sep28-descript">it’s here for free</a>):</p><p>Show Notes + Links</p><p>* <strong>Vision</strong></p><p>* 🔥 OpenAI announces GPT4-Vision (<a target="_blank" href="https://x.com/OpenAI/status/1706280618429141022?s=20">Announcement</a>, <a target="_blank" href="https://t.co/m1hnCFcTgT">Model Card</a>)</p><p>* Meta glasses will be multimodal + AI assistant (<a target="_blank" href="https://twitter.com/boztank/status/1707105576424198290">Announcement</a>)</p><p>* <strong>Big Co + API updates</strong></p><p>* Cloudflare AI on Workers, serverless GPU, Vector DB and AI monitoring (<a target="_blank" href="https://blog.cloudflare.com/workers-ai/">Announcement</a>, Documentation)</p><p>* Cloudflare announces partnerships with <a target="_blank" href="https://blog.cloudflare.com/partnering-with-hugging-face-deploying-ai-easier-affordable/">HuggingFace</a>, Meta</p><p>* Anthropic announces a $4 billion investment from Amazon (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1706202966238318670">Announcement</a>)</p><p>* Meta announces AI assistant across WhatsApp, Instagram</p><p>* <strong>Open Source LLM</strong></p><p>* 🔥 Mistral AI releases - Mistral 7B - beating LLaMa2 13B (<a target="_blank" href="https://mistral.ai/news/announcing-mistral-7b/">Announcement</a>, <a target="_blank" href="https://huggingface.co/mistralai">Model</a>)</p><p>* Alibaba releases Qwen 14B - beating LLaMa2 34B (Paper, Model, <a target="_blank" 
href="https://t.co/xJtK6qtpvD">Vision Chat</a>)</p><p>* <strong>AI Art & Diffusion</strong></p><p>* Meta shows off EMU - new image model</p><p>* Still waiting for DALL-E3 😂</p><p>* <strong>Tools</strong></p><p>* Spotify translation using OpenAI voice cloning tech</p><p>Vision</p><p>GPT 4-Vision</p><p>I’ve been waiting for this release since March 14th (<a target="_blank" href="https://twitter.com/altryne/status/1635736338397020160">literally</a>), have been talking about it on every ThursdAI since, and have been comparing every open source multimodal image model (IDEFICS, LLaVA, QWEN-VL, NeeVa and many others) to it, and none came close!</p><p>And here we are: a brief rumor about the upcoming Gemini release (potentially a multimodal big model from Google), and OpenAI decided to release GPT-4V, and it’s as incredible as we’ve been waiting for!</p><p>From creating components from a picture of UI, to solving <a target="_blank" href="https://x.com/skirano/status/1707468861929381959?s=20">complex math problems</a> with LaTeX, to helping you get out of a parking ticket by looking at a picture of a complex set of parking rules, X folks report that GPT4-V is incredibly helpful and unlocks so many new possibilities!</p><p>Can’t wait to get access, and most of all, for OpenAI to land this in the API for developers to start building this into products!</p><p>On the pod, I’ve talked about how I personally don’t believe AGI can work without vision, and how personal AI assistants are going to need to see what I see to be really helpful in the real world, and we’re about to unlock this 👀 Super exciting.</p><p>I will add this one last thing: here’s Ilya Sutskever, OpenAI chief scientist, talking about AI + Vision, and this connects with our previous reporting that GPT-4 is not natively multimodal (while we’re waiting for the rumored Gobi)</p><p>If you need more use-cases, check out this great breakdown by friend of the pod, <a target="_blank" 
href="https://twitter.com/skalskip92/">SkalskiP</a> (Piotr), a vision engineer at Roboflow; the breakdown got really high Hacker News rankings.</p><p><a target="_blank" href="https://blog.roboflow.com/gpt-4-vision/">https://blog.roboflow.com/gpt-4-vision/</a></p><p>Meta RayBan smartglasses will have multimodal AI 👀</p><p>To add to the above increased interest in AI (and to rumors about OpenAI working with Jony Ive from Apple + Masayoshi Son on a rumored hardware device), Meta has announced a new iteration of their RayBan glasses that will include a camera to help you go live, an AI agent in the glasses, and most of all, multimodality, by which they mean the AI agent in there (we don’t know if it’s LLaMa based or something else) will have access to the camera, and to what you see.</p><p>Depending on how well this works, it may be revolutionary in its own right!</p><p></p><p>I’ve been on a multimodality kick since that incredible March 14th day, and I’m very excited that it’s here! 🙌</p><p></p><p>Big CO + API updates</p><p>Cloudflare AI - Serverless GPU inference, VectorDB and AI Gateway</p><p>I was blown away by this, so much so that I hopped on an emergency space on Wednesday to talk all about it. Some of you know, I created https://targum.video a year ago, and it was accepted to the Cloudflare Workers Launchpad. 
The whole website and backend run on Workers, but the GPU inference I had to build in Python and host on a LambdaLabs GPU machine.</p><p>So starting today, with the announcement of GPU inference, folks can build something like Targum end to end on Cloudflare.</p><p>If you’d like all the details: I was really humbled to have Ricky Robinette (VP Developer Experience @ Cloudflare) and Philipp Schmid from Hugging Face join the X space on launch day (to my complete surprise), and you can find that conversation <a target="_blank" href="https://share.descript.com/view/e4VYV80NfYY">here</a> (it’s going to be on the pod soon after I find some time to edit this 😅)</p><p>Here are my notes from that conversation:</p><p>* Inference on edge is here</p><p>* <strong>Serverless</strong> GPUs on Cloudflare’s edge network</p><p>* Integrated with the Workers platform</p><p>* What is the Workers platform</p><p>* Give example of the many tools it has</p><p>* Targum example for what is done on Workers and what is done on GPU</p><p>* Easy to get started and deploy</p><p>* Will have a free tier 🔆</p><p>* Models and Use cases</p><p>* LLMs - LLaMa 7B</p><p>* Embeddings - BGE-base</p><p>* Text Classification - DistilBERT</p><p>* Translation - m2m100</p><p>* ASR - Whisper</p><p>* Preselected models right now</p><p>* Vectorize - an edge-native vector DB</p><p>* Integrates with wrangler and ecosystem</p><p>* Supports existing vectors from OpenAI Ada (importable)</p><p>* Metadata can include R2 objects, KV storage and more!</p><p>* Build and deploy full RAG apps, including your own local models, all inside 1 platform</p><p>* AI Gateway</p><p>* Proxy for OpenAI (and other providers’) calls</p><p>* Shows a usage dashboard</p><p>* Global Coverage:</p><p>* <strong>Plan to be in 100 data centers by the end of this year</strong></p><p>* <strong>And nearly everywhere by the end of 2024</strong></p><p>* WebGPU in Workers</p><p>* Many HF models support ONNX</p><p>* WebGPU is now supporting FP-16</p><p>* This 
could open a new path to run smaller models within Workers even without CFAI</p><p>* Partnership with HuggingFace</p><p>* 1-click deploy in a dropdown on half a million models</p><p>* Serverless inference - no more downloading and uploading</p><p>* Way faster as well</p><p>* Cloudflare will have a de-facto proxy/mirror of HF? 🤔</p><p>I’m very very excited by the HuggingFace partnership and you can hear it in the recording!</p><p>Meta announces AI assistant across chat apps, Instagram, WhatsApp, Messenger</p><p>I haven’t tested this yet, but this is going to bring incredible AI experiences to over 3B people around the world!</p><p>In addition to just “chat with AI”, Meta has partnered with many celebs to “embody” them into AI characters, which I found… a bit unsettling? But I guess we’ll see how much this will affect the “personas” of the AI assistants.</p><p>Open Source LLM</p><p>Qwen 14B with chat and vision versions</p><p>The QWEN model from Alibaba, which we’ve already talked about multiple times and which was then taken down from the web, comes back with a vengeance!</p><p>The Qwen team returns with a 14B model that beats LLaMa2 34B on most evaluations, including a VL version (only 7B) which, according to my tests, was the best performing open source vision model even at 7B.</p><p>It was really cool to see the Qwen authors interact with Yam and me on Twitter; it’s like crossing the great firewall, and hopefully we’ll have that team on a ThursdAI recording at some point!</p><p>🔥 Mistral 7B (<a target="_blank" href="https://twitter.com/MistralAI/status/1706877320844509405">torrent tweet</a>) - SOTA LLM</p><p>The Mistral team made news when they raised $113 million without a product, just 3 co-founders, back in June, and the “takes” on Twitter were “we’re in a bubble, bla bla bla”. And yesterday, this GOATed team just posted a tweet with a magnet torrent link, and no description. 
So of course everybody downloaded it and found the new SOTA 7B model, one that outperforms the much larger LLaMa 2 13B and the MUCH larger LLaMa 34B on several benchmarks!</p><p>It even comes very close to the Code LLaMa performance benchmarks on code, while being a general model, which is incredible.</p><p>Needless to say, the team delivered on the promise, and seeing them commit this fully to open source, by dropping a model with an Apache 2 license straight to BitTorrent, is a great sight to see!</p><p>Also, we caught glimpses of <a target="_blank" href="https://twitter.com/GuillaumeLample">Guillaume Lample</a> in the audience while we were gassing Mistral up, and potentially at some point we may get Mistral folks to join a ThursdAI live space? 🫡</p><p>AI Art + Diffusion</p><p>Meta introduced EMU - a diffusion model integrated into its AI offerings with a /imagine command, available for free in all their products, and it looks really good!</p><p>I wonder if it will do the same “chat with image” thing as DALL-E3 was announced to do, but in any case, giving this, for free, in this quality, to so many people, is remarkable 🙇‍♂️ Kudos to the team at Meta for ALL the releases today! Can’t wait to play with them.</p><p>Tools</p><p>Spotify translates podcasts using stealth OpenAI tech</p><p>Spotify announced translations for podcasts, using some secret OpenAI voice cloning tech, and we had a long discussion about the implications of voice cloning, deepfakes and everything in between with Peter Wang and other folks on the pod, definitely recommended listening!</p><p>I love this, absolutely, not just because you may want to listen to the ThursdAI pod in your native language (and I could finally show my mom who doesn’t speak English what I’m doing!) 
but also because language barriers should NOT exist, and Targum.video, this, and all the models Meta is releasing are a great testament to how fast those barriers are coming down!</p><p>I’m very very happy with this development and will keep you posted as it develops.</p><p>With that, I should probably stop here. It’s been an absolutely insane week, and if this summary helped, like, share and consider a premium subscription?</p><p></p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p><p></p><p>P.S - If you scrolled all the way to here, send me 🧨 in a DM on any platform 😉, I may have something for you</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-28-gpt4-sees-speaks</link><guid isPermaLink="false">substack:post:137501010</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 29 Sep 2023 03:16:21 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137501010/dd359059734fd99bf21f129796ca7f30.mp3" length="96983403" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6061</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/137501010/5185839c5162fe8b8690eaecd1c6bf1b.jpg"/></item><item><title><![CDATA[📆 ThursdAI Sep 21 - OpenAI 🖼️ DALL-E 3, 3.5 Instruct & Gobi, Windows Copilot, Bard Extensions, WebGPU, ChainOfDensity, RememberAll]]></title><description><![CDATA[<p>Hey dear ThursdAI friends, as always I’m very excited to bring you this edition of ThursdAI, September 21st, which is packed full of goodness: updates, great 
conversations with experts, breaking AI news and not 1 but 2 interviews</p><p><p>ThursdAI - hey, psst, if you got here from X, don’t worry, I don’t spam, but def. subscribe, you’ll be the coolest, most up-to-date AI person you know!</p></p><p>TL;DR of all topics covered</p><p>* <strong>AI Art & Diffusion</strong></p><p>* 🖼️ DALL-E 3 - High quality art, with a built-in brain (<a target="_blank" href="https://twitter.com/sama/status/1704547625482203560">Announcement</a>, <a target="_blank" href="https://x.com/bio_bootloader/status/1704549892004397391?s=20">Comparison to MJ</a>)</p><p>* Microsoft - Bing will have DALL-E 3 for free (<a target="_blank" href="https://youtu.be/rd9mYTcT91A">Link</a>)</p><p>* <strong>Big Co LLMs + API updates</strong></p><p>* Microsoft - Windows Copilot 🔥 (<a target="_blank" href="https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/">Announcement</a>, <a target="_blank" href="https://youtu.be/5rEZGSFgZVY">Demo</a>)</p><p>* OpenAI - GPT3.5 instruct (<a target="_blank" href="https://twitter.com/GrantSlatton/status/1703913578036904431">Link</a>)</p><p>* OpenAI - Finetuning UI (and finetuning your finetunes) (<a target="_blank" href="https://x.com/OfficialLoganK/status/1704181284036300970?s=20">Announcement</a>, <a target="_blank" href="https://x.com/OfficialLoganK/status/1704532065281044982?s=20">Link</a>)</p><p>* Google - Bard has extensions (<a target="_blank" href="https://twitter.com/altryne/status/1704154333544382614">twitter thread</a>, <a target="_blank" href="https://x.com/TheTuringPost/status/1704487874945749254?s=20">video</a>)</p><p>* <strong>Open Source LLM</strong></p><p>* Glaive-coder-7B (<a target="_blank" href="https://x.com/abacaj/status/1704956542242447376?s=20">Announcement</a>, <a target="_blank" href="https://huggingface.co/glaiveai/glaive-coder-7b">Model</a>, <a target="_blank" href="https://t.co/lMhVm2ZNpk">Arena</a>)</p><p>* Yann LeCun testimony in front of the US Senate (<a 
target="_blank" href="https://x.com/ylecun/status/1704674116526035292?s=20">Opening Statement</a>, <a target="_blank" href="https://twitter.com/altryne/status/1704613075008282630">Thread</a>)</p><p>* <strong>Vision</strong></p><p>* Leak: OpenAI GPT4 Vision is coming soon + Gobi multimodal? (<a target="_blank" href="https://www.theinformation.com/articles/openai-hustles-to-beat-google-to-launch-multimodal-llm">source</a>)</p><p>* <strong>Tools & Prompts</strong></p><p>* Chain of Density - a great summarizer prompt technique (<a target="_blank" href="https://twitter.com/AlphaSignalAI/status/1703825582889263473">Link</a>, <a target="_blank" href="https://arxiv.org/pdf/2309.04269.pdf">Paper</a>, <a target="_blank" href="https://smith.langchain.com/hub/langchain-ai/chain-of-density?organizationId=ebbaf2eb-769b-4505-aca2-d11de10372a4">Playground</a>)</p><p>* Cardinal - AI-infused product backlog (<a target="_blank" href="https://www.producthunt.com/posts/cardinal-6?utm_campaign=email-notification&#38;utm_medium=email&#38;utm_source=new_launch_update">ProductHunt</a>) </p><p>* Glaive Arena - (<a target="_blank" href="https://arena.glaive.ai/">link</a>)</p><p>AI Art + Diffusion</p><p>DALL-E 3 - High quality art, with a built-in brain</p><p>DALL-E 2 was the reason I went hard into everything AI. I have a condition called aphantasia, and when I learned that AI tools could help me regain a part of my brain that’s missing, I was in complete AWE. My first “AI” project was a <a target="_blank" href="https://x.com/altryne/status/1557167298837991424?s=20">Chrome extension that injects prompts</a> into the DALL-E UI to help with prompt engineering. </p><p>Well, now not only is my extension no longer needed, prompt engineering for AI art itself may die a slow death with DALL-E 3, which is going to be integrated into the chatGPT interface, and chatGPT will be able to help you… chat with your creation, ask for modifications, alternative styles, and suggest different art directions! 
</p><p>In addition to this incredible new interface, which I think is going to change the whole AI art field, the images are of mind-blowing quality, coherence of objects and scene elements is top notch, and the ability to tweak tiny details really shines! </p><p>Another thing they really fixed is hands and text! Get ready for SO many memes coming at you! </p><p>Btw, I <a target="_blank" href="https://twitter.com/altryne/status/1598902799625961472">created</a> a conversational generation bot in my Telegram chatGPT bot (back before there was an API, using Stable Diffusion, and I remember how addicting this was!) and so did my friends <a target="_blank" href="https://x.com/krea_ai/status/1577063124288479233?s=20">from Krea</a> :) so y’know… where are our free DALL-E credits, OpenAI? 🤔 </p><p>Just kidding. An additional notable thing: DALL-E will now be integrated into the chatGPT Plus subscription (and enterprise), will refuse to generate any living artist’s art, and has a very very strong bias towards “clean” imagery. </p><p>I wonder how fast it will come to an API, but this is incredible news!</p><p>P.S - if you don’t want to pay for chatGPT, apparently DALL-E 3 conversational is <a target="_blank" href="https://x.com/MParakhin/status/1704563792645079143?s=20">already being rolled out</a> as a free offering for Bing Chat 👀 Only for a certain percentage of users, but will be free for everyone going forward!</p><p>Big Co LLM + API updates</p><p>Copilot, no longer just for code?</p><p>Microsoft announced some breaking news on #thursdai, where they confirmed that Copilot is now a piece of the new Windows, and will live just a shortcut away from many many people. I think this is absolutely revolutionary, as just last week we chatted with Killian from Open Interpreter, and having an LLM run things on my machine was one of the main reasons I was really excited about it! 
</p><p>And now we have a full-on, baked-in AI agent inside the world's most popular operating system, running for free for all the mom-and-pop Windows computers out there, just a shortcut away! </p><p>Copilot will be a native part of many apps, not only Windows; here’s an example of a PowerPoint Copilot! </p><p>As we chatted on the pod, this will put AI into the hands of so so many people for whom even opening the chatGPT interface is a stretch, and I find it an incredibly exciting development! (I will not be switching to Windows for it tho, will you?)</p><p>Btw, shoutout to <a target="_blank" href="https://twitter.com/MParakhin">Mikhail Parakhin</a> who led the BingChat integration and is now in charge of the whole Windows division! It shows how dedicated to AI Microsoft is, and it really seems that they don’t want to “miss” this revolution like they did with mobile!</p><p>OpenAI releases GPT 3.5 instruct turbo! </p><p>For many of us who used GPT3 APIs before it was cool (who else has the 43-character API key 🙋‍♂️), the “instruct” models were all the rage, and then OpenAI basically told everyone to switch to the much faster and more RLHF’d chat interfaces.</p><p>Well now, they’ve brought GPT3.5 back, with instruct and turbo modes; it’s no longer a chat model, it’s a completion model, and it’s <a target="_blank" href="https://x.com/GrantSlatton/status/1703913578036904431?s=20">apparently much better at chess</a>? </p><p>An additional interesting thing: it includes logprobs in the response, so you can actually build much more interesting software (by asking for several responses and then looking at the log probabilities). For example, if you’re asking the model for a multiple-choice answer to a question, you can rank the answers based on logprobs! 
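</p><p>As a quick illustration of that idea (a hypothetical helper, not something from the episode), here is one way you might turn the per-option logprobs a completion response reports into a ranking:</p>

```python
import math

def rank_choices(option_logprobs):
    """Rank multiple-choice options by the model's log-probability.

    option_logprobs maps an option label (e.g. "A") to the logprob the
    completion API reported for that answer token. Returns a list of
    (label, normalized probability) pairs, most likely first.
    """
    # Logprobs are log(p): exponentiate to recover probabilities,
    # then normalize so the candidate options sum to 1.
    probs = {opt: math.exp(lp) for opt, lp in option_logprobs.items()}
    total = sum(probs.values())
    return sorted(
        ((opt, p / total) for opt, p in probs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Logprobs as they might come back for a 3-option question.
ranking = rank_choices({"A": -0.36, "B": -1.61, "C": -3.00})
print(ranking[0][0])  # the model's top pick: "A"
```

<p>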
</p><p>Listen to the pod, Raunak explains this really well!</p><p>FineTune your finetunes</p><p>OpenAI also released a UI for finetuning GPT3.5 and upped the number of concurrent finetunes to 3, and now you can finetune your finetunes!</p><p>So you can continue finetuning already finetuned models!</p><p></p><p>Bard extensions are like chatGPT plugins but more native. </p><p>While we wait for Gemini (c’mon Google, just drop it!), the upcoming incredible multimodal LLM that will allegedly beat GPT-4, Google is shoving new half-baked features into Bard (remember Bard? It’s like the 5th most used AI assistant!) </p><p>You can now opt in and @ mention stuff like Gmail, YouTube, Drive and many more Google services, and Bard will connect to them, do a search (not a vector search apparently, just a keyword search) and will show you results (or summarize your documents) inside the Bard interface. </p><p>The @ UI is really cool and reminded me of Cursor (where you can @ different files or documentation), but in practice, from my 2 checks, it really didn’t work at all and was worse than just a keyword search. 
</p><p>Open Source LLM</p><p>Glaive-coder-7B reaches an incredible 63% on HumanEval</p><p>Friends of the pod <a target="_blank" href="https://twitter.com/abacaj">Anton Bacaj</a> and <a target="_blank" href="https://twitter.com/csahil28">Sahil Chaudhary</a> have open sourced a beast of a coder model, <strong>Glaive-coder-7B</strong>. With just a tiny 7B parameters, this model achieves an enormous 63% on HumanEval@1, which is higher than LLaMa 2, Code LLaMa and even GPT 3.5 (based on technical reports) 🔥 (table from the Code LLaMa release for reference; the table is now meaningless 😂) </p><p>Yann LeCun testimony in front of the US Senate</p><p>Look, we get it, the meeting of the CEOs (and Clem from HuggingFace) made more waves, especially around that huge table; who wasn’t there? Elon, Bill Gates, Sundar, Satya, Zuck, IBM, Sam Altman</p><p>But IMO the real deal government AI thing was done by Yann LeCun, chief scientist at Meta AI, who came in hot with a very pro-open-source opening statement, and was very patient with the very surprised senators on the committee. The opening statement is worth <a target="_blank" href="https://x.com/ylecun/status/1704883900458176798?s=20">watching in full</a> (I transcribed it with Targum cause… duh) and Yann actually retweeted! 🫶 </p><p>Here’s a little taste, where Yann is saying, literally, “make progress as fast as we can” 🙇‍♂️</p><p>He was also asked: what happens if the US over-restricts open source AI, and our adversaries… don’t? Will we be at a disadvantage? Good questions, senators, I like this thinking, more of this please. </p><p></p><p>Vision</p><p>Gobi and GPT4-Vision are incoming to beat Gemini to the punch? 
</p><p>According to <a target="_blank" href="https://www.theinformation.com/articles/openai-hustles-to-beat-google-to-launch-multimodal-llm">The Information</a>, OpenAI is gearing up to give us the vision model of GPT-4 ahead of the hinted upcoming release of Gemini, a multimodal model from Google (that’s also rumored to be released very soon; I’m sure they will release it on the next ThursdAI, or the one after that!) </p><p>This seems to be the case for both DALL-E 3 and the leak about GPT-4 Vision, because apparently Gemini is multimodal on the input (it can take images and text) AND the output (it can generate text and images), and OpenAI maybe wants to get ahead of that. </p><p>We’ve seen leaked images of GPT-4 Vision in the chatGPT UI, so it’s only a matter of time. </p><p>The most interesting thing from this leak was the model codenamed GOBI, which is going to be a “true” multimodal model, unlike GPT-4 Vision. </p><p>Here’s an explanation of the difference from <a target="_blank" href="https://twitter.com/Yampeleg/status/1702095404802637874">Yam Peleg</a>, ThursdAI expert on everything language models!</p><p>Voice</p><p>Honestly, nothing major happened with voice since last week 👀 </p><p>Tools</p><p>Chain of Density</p><p>The Salesforce AI team has developed a new technique for improving text summarization with large language models. Called Chain of Density (CoD), this prompting method allows users to incrementally increase the informational density of a summary.</p><p>The key insight is balancing the right amount of detail and main ideas when summarizing text. With CoD, you can prompt the model to add more detail until an optimal summary is reached. This gives more control over the summary output.</p><p>The Salesforce researchers tested CoD against vanilla GPT summaries in a human preference study. 
The results showed people preferred the CoD versions, demonstrating the effectiveness of this approach.</p><p>Overall, the Salesforce AI team has introduced an innovative way to enhance text summarization with large language models. By tuning the density of the output, CoD prompts can produce higher quality summaries. It will be exciting to see where they take this promising technique in the future.</p><p>RememberAll - extend your LLM context with a proxy</p><p>We had Raunak from RememberAll on the pod this week, and that interview is probably coming on Sunday, but I wanted to include this in Tools as it’s super cool. </p><p>Basically, with a 2-line code change, you can send your API calls through the RememberAll proxy, and it will extract the key information, embed and store it in a vector DB for you, and then inject it back into responses.</p><p>Super clever way to extend memory; here’s a preview from Raunak (<a target="_blank" href="https://x.com/raunakdoesdev/status/1702431233073102997?s=20">demo</a>) and a fuller interview is coming soon! </p><p>Cardinal has launched on ProductHunt, from my friends Wiz and Mor (<a target="_blank" href="https://cardinalapp.io/?ref=producthunt">link</a>)</p><p>Quick friendly plug: Wiz and Mor are friends of mine and they have just launched Cardinal, an AI-infused product backlog that extracts features, feature-request discussions, and more from customer feedback across tons of sources. </p><p><a target="_blank" href="https://www.producthunt.com/posts/cardinal-6?utm_campaign=email-notification&#38;utm_medium=email&#38;utm_source=new_launch_update">Go give them a try</a>; if you’re looking to make your product backlog work better, it’s really really slick! </p><p></p><p>Hey, if you arrived here, do me a quick favor? Send me a DM with this emoji 🥔, and then share this newsletter with 1 friend who, like you, loves AI? </p><p>Thanks, I expect many potatoes in my inbox! 
See you next ThursdAI 🫡</p><p></p><p>Here’s the full transcript (no video this time, I’m finishing this up at 10:30 and video will take me at least 3 more hours, apologies 🙇‍♂️)</p><p>[00:10:21] Alex Introduces Yam Peleg</p><p>[00:10:57] Alex Introduces Nisten Tahiraj</p><p>[00:11:10] Alex Introduces Far El</p><p>[00:11:24] Alex Introduces Xenova</p><p>[00:11:44] Alex Introduces Roie S. Cohen</p><p>[00:11:53] Alex Introduces Tzafrir Rehan</p><p>[00:12:16] DALL-E 3 - An AI art model with a brain, coming to chatGPT plus</p><p>[00:20:33] Microsoft launches Windows Copilot</p><p>[00:30:46] OpenAI leaks, GPT-4 Vision, Gobi</p><p>[00:38:36] 3.5 instruct model from OpenAI</p><p>[00:43:03] Raunak intro</p><p>[00:43:25] Bard Extensions allow access to Gmail, YouTube, Drive</p><p></p><p>FULL transcript: </p><p></p><p>[00:00:00] <strong>Alex Volkov:</strong> So, ThursdAI is this wonderful thing that happened, and happened organically as well.</p><p>[00:00:26] And basically what happens is we have this live recording every Thursday, every ThursdAI, on Twitter spaces. I'm very grateful to share the stage with experts in their fields, and we all talk about different things, because AI updates are so multidisciplinary right now. It's really hard for even experts in their own field to follow everything.</p><p>[00:00:51] I find this mixture of experts type model on stage very conducive, because we all go and find the most up to date things from the last week. And then we have folks whose specialization it is, for example, to comment on them. And you guys in the audience get the benefit of this. And it just happened organically through many conversations we had on Spaces since GPT 4 was launched.</p><p>[00:01:16] Literally the day, March 14th, 2023, aka Pi Day. It was the first day we started these spaces, and since then the community has grown to just... an incredible amount of people who join, quality experts, top-of-their-field people. 
I'm, I'm just so humbled by all of this. And since then, many folks told me, like Roy here in the audience, that, Hey, Alex, you're doing this at the weirdest hour.</p><p>[00:01:42] Thursday a.m. in San Francisco, nobody's gonna come. It's really hard to participate in the actual live recording. And so, I started a newsletter and a podcast for this. And so, if you aren't able to make it, I more than welcome you to register to the newsletter. You know what? Even if you are here every week, register to the newsletter, because why not?</p><p>[00:02:03] Because, share it with your friends. We're talking about everything AI related. Hopefully, hopefully no hype. And I have friends here to reduce the hype when I'm getting too hypey. Definitely none of the, Hey, here's a new AI tool that will help you fix the thing you don't need fixing.</p><p>[00:02:18] And I think that's, that's been resonating with the community. And so, as you now are here, you're also a participant in this community. I welcome everybody to tag ThursdAI on their news about AI, or #thursdAI, or just the ThursdAI pod, which probably should join this so people get some more visibility. But you are part of the community now. Those of you who come back, those of you who listen in, those of you who share, all of them. All of these things are very helpful for the community to grow and for us to just know about more stuff.</p><p>[00:02:49] It's actually an incredible signal when two or three or more of you react under a piece of news and say, hey, we probably should cover this in ThursdAI. It really helps, truly. I think with that, yeah, I think this intro is enough intro. Welcome. What's up, Tzafrir? How are you?</p><p>[00:03:06] <strong>Tzafrir Rehan:</strong> All's well. Thank you very much. I wanted to, to strengthen your point about the time factor. So we expand. 
So anyone here who wants to be a little bit interested in generative technologies and breaking news, and have some things to do in the meanwhile, and also is looking to actually build something cool from all of this:</p><p>[00:03:31] time is the limiting factor here. That's like the, the hardest resource here. Having this group, and having everyone explore everything together, it's a lifesaver. It's like an order of magnitude improvement on our ability to move forward, each one individually, and that's a group together. Just to give examples:</p><p>[00:03:53] So I'm interested in generative images, videos, and audio. And for each of these, there are hundreds of models right now available, with the ability to make fine-tunes on specific datasets. For some of these, generating a single asset like a video can take hours. Training takes hours. If you want to explore a little bit, like the effect of different prompts, just generating hundreds of samples takes hours.</p><p>[00:04:26] So without this group, it would be impossible to even know where to go and where to invest my time. And the name of the game right now is to just choose what you invest your time on, to actually get things done and keep up. So thank you. Thank you. Thank you for you and for this group, and let's have fun.</p><p>[00:04:46] <strong>Alex Volkov:</strong> Thank you. Thank you everyone. I definitely feel superpowered by the people in this group who can like back me up on, I read one tweet and then I saw some people react to this tweet, but I didn't have the time or the capability or the experience to dive in.</p><p>[00:05:00] And then there's folks here who did, and then we're going to complete each other. And I think our motto, I haven't shared it since we started, but our motto is: we stay up to date so you don't have to. And "have to," I think, is the operating word. 
You want to stay up to date, and you're welcome to stay up to date, and you're welcome to tag us and talk with us and leave comments here in the chat as well, but you don't have to anymore, because there's a, there's a newsletter that will update you, and there's folks on stage who will talk about this.</p><p>[00:05:26] I want to briefly cover one tiny thing that I did on the podcast that I think I will start doing as well. So, so far, editing this hour and a half, two hours that we have here live was a pain, but I just decided to lean into this, because the conversation we're having here is so much more informative and interesting that any type of summary that I want to do, or wanted to do, is not going to do it justice.</p><p>[00:05:50] And so I had some different feedback from different folks about the length of the podcast. Some people said, yeah, 25 minutes, just the updates, is like the right spot. And yeah, the podcast is moving towards: this is going to be the live recording. I'm going to edit this, don't worry.</p><p>[00:06:04] But besides that, the podcast will be this conversation going forward, as much as I'm able to edit this and ship both the newsletter and the podcast in time on Thursday. But with that, Tzafrir, thank you for the kind words, man. I appreciate you being here and sharing with us your expertise.</p><p>[00:06:20] I want to say hi to Xenova and Arthur.</p><p>[00:06:22] We'll start with Xenova. Welcome, Josh. How are you?</p><p>[00:06:27] <strong>Xenova:</strong> Yeah. Hey. Yeah, pretty good. Been busy, busy, busy.</p><p>[00:06:33] For those who don't know, I'll just quickly introduce myself. I am the creator of Transformers.js, which is a JavaScript library for running HuggingFace Transformers directly in the browser, or Node, or Deno, or maybe Bun soon.</p><p>[00:06:49] Who knows when that gets sorted out properly, but any JavaScript environment that you're, that you're looking for. And, yeah, I recently joined HuggingFace, which is exciting. 
Now I'm able to sort of work on it basically full time. And yeah, lots of, lots of exciting things are, are in the pipeline.</p><p>[00:07:06] <strong>Alex Volkov:</strong> It's been incredible to have you here and then see your progress with Transformers.js,</p><p>[00:07:10] and then you joining HuggingFace, man. I appreciate the time here.</p><p>[00:07:13] Arthur, thank you for joining. Please feel free to introduce yourself.</p><p>[00:07:18] <strong>Arthur Islamov:</strong> Okay. So, my name is Arthur, and I'm fixing and making WebAssembly work with big models.</p><p>[00:07:25] So, soon you will be able to run anything huge in the browser. I'm particularly interested in diffusion models, so right now I'm making Stable Diffusion 2.1 work in the browser, and then I have some plans to make SDXL, and maybe LLaMA and other models too, with all that work done.</p><p>[00:07:50] <strong>Alex Volkov:</strong> That's awesome. Thank you for joining.</p><p>[00:07:52] <strong>Far El:</strong> Yo, what's up? Yeah, I'm, my name is Farouk. I'm like founder of Nod.ai, where we build autonomous agents, and also working on Skunkworks.ai, which is an open source group where we are pushing the boundaries of what we can do with LLMs and AI as a whole, really.</p><p>[00:08:10] Our first, like, major project is this open source MoE architecture that we've been tinkering around with for the last couple months. We're also exploring even more exotic AI architectures to try to get to GPT-4-level capability for open source.</p><p>[00:08:28] <strong>Alex Volkov:</strong> Awesome. Awesome. Awesome. And Nisten, welcome, brother.</p><p>[00:08:33] <strong>Nisten Tahiraj:</strong> Yeah. Hey everyone, I'm Nisten Tahiraj, and I'm terminally online. That's the introduction. Thank you. Yeah, I, I'm also, I'm a dev in Toronto. I worked on the first doctor wrapper, which is still doing pretty well. Like no complaints so far, six months later, knock on wood. 
And yeah, recently started doing a lot more open source stuff.</p><p>[00:09:03] Put out a bunch of open source doctor models on, on HuggingFace, which I still need to write a benchmark for, because there are no safety benchmarks that are public. And yeah, lately been working with Farouk to make the whole Skunkworks AI mixture-of-experts model more usable, because it's still, it's not even bleeding edge.</p><p>[00:09:26] And this one is more like hemorrhaging-edge technology. It takes like three people to get it to work. And yeah, I've been extremely interested on the WebGPU side ever since Xenova on a random tweet just gave me the command to start Chrome Canary properly, and then I was able to load a whole 7B model.</p><p>[00:09:48] And yeah, I'm thinking next, for the future, if, if things go okay. I mean, my goal that I've set myself is to have some kind of distributed mixture of experts running via WebGPU, and then having Gantt.js encrypt the connections between the, the different nodes and experts. And we'll see how that plays out, because everything is changing so quickly.</p><p>[00:10:14] But yeah, it's, it's good to be here. And I'm glad I found this Twitter space randomly way back in</p><p>[00:10:21] Alex Introduces Yam Peleg</p><p>[00:10:21] <strong>Alex Volkov:</strong> Yeah, for a long time. I just want to welcome Yam to the stage. And Yam doesn't love introducing himself, but I can do it for you, Yam, this time if you'd like.</p><p>[00:10:31] All right. So, I will just run through the speakers on stage just real quick. Yam, thank you for joining us. Folks, Yam is our, I could say, resident... 
machine learning engineer extraordinaire: everything from datasets and training large language models, understanding the internals of how they work, and baking a few of his own. Definitely the guy who, if we found an interesting paper, will be able to explain it to us.</p><p>[00:10:57] Alex Introduces Nisten Tahiraj</p><p>[00:10:57] <strong>Alex Volkov:</strong> Nisten, I call you like the AI engineer hacker type. Like, the stuff that you sometimes do, we're all in awe of: being able to run stuff on CPU and doing different, like, approaches that, like, nobody thought of before.</p><p>[00:11:10] Alex Introduces Far El</p><p>[00:11:10] <strong>Alex Volkov:</strong> Far El, you're doing, like, great community organizing, and we're waiting to see the MoE from Skunkworks.</p><p>[00:11:15] And folks should definitely follow Far El for that and join the Skunkworks OSS, it's really hard for me to say, Skunkworks OSS efforts in the Discord.</p><p>[00:11:24] Alex Introduces Xenova</p><p>[00:11:24] <strong>Alex Volkov:</strong> Xenova is our run-models-on-the-client guy. So Transformers.js, everything related to ONNX, and everything related to quantization and making the models smaller.</p><p>[00:11:35] All of that. All models, all modalities, but I think the focus is on, on the browser, and the WebGPU stuff, but obviously you introduced yourself.</p><p>[00:11:44] Alex Introduces Roie S. Cohen</p><p>[00:11:44] <strong>Alex Volkov:</strong> We have Roie, who's a DevRel at Pinecone, which he didn't say, but Pinecone and vector DBs and context windows and, and discussions about RAG, like all of these things, Roie is our go-to.</p><p>[00:11:53] Alex Introduces Tzafrir Rehan</p><p>[00:11:53] <strong>Alex Volkov:</strong> And Tzafrir also introduced himself: everything vision, audio, and excitement. So a very well-rounded group here, and I definitely recommend everybody to follow. 
And now with that, now that we are complete, let's please start with the updates, because we have an incredible, incredible Thursday, literally every week, right, folks?</p><p>[00:12:12] Literally every week we have an incredible Thursday.</p><p>[00:12:16] DALL-E 3 - An AI art model with a brain, coming to ChatGPT Plus</p><p>[00:12:16] <strong>Alex Volkov:</strong> So we'll start with, with two big ones. I want to say the first big update was obviously DALL-E 3. So I will just share briefly about my story with DALL-E, and then I would love folks on stage to also chime in. Please raise your hand so we don't talk over each other. DALL-E, when it came out, when the announcement came out for DALL-E 2, I want to say it was a year ago, a year and a half ago, maybe, in January, February or something, this blew me away.</p><p>[00:12:47] I have something called aphantasia, where, I don't know if you saw this, but like, I don't have like the visual mind's eye, so I can't like visually see things, and it's been a thing with me all my life, and then here comes the AI tool that can draw. Very quickly then I turned, I noticed Stable Diffusion, for example, and I just, like,</p><p>[00:13:04] it took off from there. Everything that I have, all my interest in AI, started from DALL-E basically. And DALL-E 3 seems like the next step in all of this. And the reason I'm saying this is because DALL-E 3 is visually incredible, but this is not actually like the biggest part about this, right? We have Midjourney.</p><p>[00:13:22] I pinned somebody's comparison between DALL-E and Midjourney. And Midjourney is beautiful and gorgeous, and is a way smaller team. DALL-E 3 has this beautiful thing where it's connected to ChatGPT. So not only is it like going to be not separate anymore, you're going to have the chat interface into DALL-E 3.</p><p>[00:13:41] ChatGPT will be able to help you as a prompt engineer, and you'd be able to chat with the creation process itself. 
So you will ask for an image, and if you don't know how to actually define what you want in this image, which types, you'd be able to just chat with it. You will say, you know what, actually make it darker, make it more cartoony, whatever.</p><p>[00:14:01] And then ChatGPT itself, with its brain, is going to be your prompt engineer buddy in the creation. And I think, quality aside, and the quality is really, really good, the thing they're highlighting for, for DALL-E 3 is the ability to have multiple objects and subjects from your prompt in one image, because it understands them.</p><p>[00:14:23] But also definitely the piece where you can keep talking to an image is changing the image creation UI significantly, where Midjourney, with all, all the love we have for Midjourney, is still stuck in Discord. They're still working on the web. It's, it's taking a long time, and we've talked about Ideogram challenging them from the side.</p><p>[00:14:44] We know that Google has multiple image models, like Imagen and different ones. They have like three, I think, at this point, that they haven't yet released. And DALL-E, I think, is the first multimodal-on-the-output model that we'll get, right? So multimodal on the output means that what you get back towards you is not only text generation, and we saw some other stuff, right?</p><p>[00:15:06] We saw some graphs, we saw Code Interpreter can run code, etc. But this is multimodal on the output. And very exciting. I, I, the DALL-E 3 news took Twitter by storm. Everybody started sharing this, including us. We can't wait to play with DALL-E 3. I welcome folks on stage. I want to start with Tzafrir's reaction, but definitely to share what we think about this.</p><p>[00:15:26] And the last thing I'll say is that now that the community is growing, suddenly people DM me. So first of all, you're all welcome to DM me about different stuff. I see, I see somebody in the audience who DMed me. I think she's still here. 
So shout out about joining the beta test for DALL-E 3, which now they, they're able to share about. Funny tidbit: right now it's baked into the UI.</p><p>[00:15:48] So DALL-E 3 is going to be baked into the ChatGPT and ChatGPT Enterprise UIs. However, when they tested this, they tested it via a plugin. So OpenAI actually built a plugin and had like restricted access to this plugin. And folks who like talked with this plugin, the plugin ran the DALL-E ChatGPT version behind the scenes.</p><p>[00:16:06] And we don't have access to it yet. I don't know if anybody on stage has access. Please tell me if you do. The access is coming soon, which is interesting from OpenAI. And I think that's most of the DALL-E stuff that I had. And I want to, please, please, buddy, I want to hear from Tzafrir, please.</p><p>[00:16:23] And please raise your hand. I really need us to not talk over each other.</p><p>[00:16:30] Thank you.</p><p>[00:16:31] <strong>Tzafrir Rehan:</strong> So yeah, DALL-E 3 is looking amazing. I did see some examples that people with early access were generating, and it's far more detailed and coherent than the things we are used to seeing from Stable Diffusion. And much less randomness, I would say. And what's exciting here is a few changes in the paradigm of how it works.</p><p>[00:16:56] For example, like you said, it doesn't expect you to know all the intricacies. You can describe in your natural language what you want to see, and it will use GPT, however they are powering it, for generating a prompt to make the whole image. That's the one thing. The other thing is that it's not text-to-image.</p><p>[00:17:21] It's more a conversation. Similar to how ChatGPT is a conversation between you and the assistant, DALL-E 3 is a chat. So you can see in the video that they released. 
You generate one image, and then you discuss if you want to make changes to it, if you want to make more variations, and it will be very interesting to see the flow.</p><p>[00:17:44] From the AI artist perspective, I think it will be met with a little bit of hesitation, at least not knowing how much fine control they are providing, whether they are letting you influence all these various parameters that the model uses. That is a lot of the workflow for generating AI art.</p><p>[00:18:06] And when you want to make a piece for release as an artist, you spend a lot of time fine-tuning it.</p><p>[00:18:13] And today with Stable Diffusion, and with Midjourney, we have a lot of fine-grained control over changing the parameters by a little bit, adding one more word. That's one thing, and another thing is that artists usually actually want to have that control over the prompt. For example, this week I saw an interesting example, I'll try to find it for you, where the artist adds the words "event horizon" to an image.</p><p>[00:18:44] Now, the image is not of space, but the model does take that idea of the event horizon shape, and makes the image more shaped like an event horizon. So those are the kinds of tricks that right now prompt engineers use to make very specific changes in the image. So I'm interested in knowing if DALL-E 3 will allow that kind of control.</p><p>[00:19:08] And most of all, finally: we had DALL-E 2 very early in the game, before Stable Diffusion even gave us the first clunky models, before everything, and then there was so much work in Midjourney, and so many interesting things coming out in image generation, and OpenAI was always like hanging back.</p><p>[00:19:30] We had this very basic DALL-E 2, which sometimes works and usually doesn't, gives you very weird results. So yeah, good to see that they are still working on actually innovating and thinking of the next step and how we can combine all of these technologies. 
To make something that's much more fun as a user experience.</p><p>[00:19:53] <strong>Alex Volkov:</strong> Absolutely. And I will remind some folks of the internals behind kind of diffusion models, like Stable Diffusion, et cetera. OpenAI actually made the whole field happen, I think, with, was it ViT, the Vision Transformer, that they released, and,</p><p>[00:20:05] <strong>Yam Peleg:</strong> they released the first diffusion. The first diffusion model.</p><p>[00:20:08] <strong>Alex Volkov:</strong> Yes. And so like the whole field is all due to OpenAI, and it's great. Tzafrir, I join you in the, it's super great to see them innovate and give us some new UIs for this, because I heard from multiple people who have access to this that you can get lost in just chatting to a picture, to the creation process.</p><p>[00:20:26] It's like a whole new creation process, basically, like prompting, but chatting. I'm very excited about this, very excited.</p><p>[00:20:31] So we'll definitely talk more about this.</p><p>[00:20:33] Microsoft launches Windows Copilot</p><p>[00:20:33] <strong>Alex Volkov:</strong> I want to move on to the next thing, which is exciting. And so, until today, basically, the word Copilot meant GitHub Copilot, at least for those of us with VS Code, those of us who write code. GitHub Copilot obviously is the autocomplete engine that gives you code abilities.</p><p>[00:20:50] And many of us use it, many of us don't use it. But today, I think, Microsoft, who owns GitHub and who is very close with OpenAI, has announced Copilot for Windows. And it's coming soon with the Windows update. And we've seen some previews about this in some discussions. And I find it very interesting that Microsoft is innovating in AI, whereas we're waiting for Google to come up with Gemini.</p><p>[00:21:18] We're waiting for Google to, we're going to talk about Bard updates as well. But Copilot for Windows will be able to be just like a shortcut away. 
I think Windows+C is the new shortcut, and you'd be able to ask it for different things. And for those of us in the audience who didn't join us in the previous ThursdAIs, we</p><p>[00:21:40] talked with Killian from this open source project called Open Interpreter. And one of the things that we all like in Open Interpreter is that it runs on my machine and it generates code, and some of that code could be AppleScript. And so it's very easy to run stuff on the Mac using AppleScript. You can open Calendar, you can send emails, you can do a bunch of stuff.</p><p>[00:21:58] And so it was beautiful to see that, like, even an open source agent like Open Interpreter is able to run code and then activate stuff on your computer. And I think Killian mentioned, like, Microsoft's Copilot is coming. And not just a week later, exactly a week later after that discussion, we now have Windows Copilot.</p><p>[00:22:16] Which is going to be able to run Windows for you. It's going to be able to open apps and shut down apps. It's going to be able to just, like, be a ChatGPT, but living inside Windows. And I think it's going to be based on GPT-4. It only makes sense with the Microsoft-OpenAI collaboration. And like, I can't overstate this for a second.</p><p>[00:22:38] GPT-4 was released in March, right? ChatGPT was released less than a year ago, in November sometime. And now the next version of the world's probably most common operating system, Windows, is going to have AI built in as a companion. How insane is this, folks? I, I, I, I have a Windows machine, because I have an NVIDIA GPU, blah, blah, blah, so I'm not only on the Mac, and I'm really excited to, like, play with this.</p><p>[00:23:09] An additional thing that they've announced together with this update, connecting to the previous thing that we said, is that Bing Chat and Windows Copilot will both have DALL-E 3 built in for free. 
So DALL-E 3 is going to be available to ChatGPT Plus subscribers, the ones of us who pay the 20 bucks.</p><p>[00:23:32] However, through Bing, you'll be able to get it for free, and it's going to be part of Windows. Right, so, my mom, who probably doesn't use Windows, okay, her husband, my mom's husband, uses Windows; he'd be able to use GPT-4 to run his Windows and also generate images. I think that's incredible, and only Microsoft can give it out for free.</p><p>[00:23:52] I think that's mostly it in the Microsoft update. However, it's breaking news. Literally, they released the tweet once we started the space, so I'm sure more stuff will come out of there. But I invite folks on stage to chime in with the Windows Copilot news. What do you think about this? Whether or not, you know, this is going to change many people's usage of Windows, or, or not.</p><p>[00:24:16] <strong>Nisten Tahiraj:</strong> I mean, the whole using-software thing is all up in the air now, right? Everyone's in creative mode. Yeah, it's pretty hard to predict what's going to be the, the better interface. Voice is getting really good. Open Interpreter showed that it can do a whole bunch of stuff. You can also accidentally delete all the JSON files on your computer, but I think those, those issues will be worked out.</p><p>[00:24:43] Yeah, it is hard to, it's hard to call, because again, Bing is still a free beta service. They haven't quite figured out how to fully monetize that, because that's not cheap to run, especially considering that it is the multimodal image one, so. Yeah, don't have that much of an opinion.</p><p>[00:25:05] I think it's still too early to call as to how interfaces will change.</p><p>[00:25:09] <strong>Alex Volkov:</strong> I agree. I just, I'm excited that AI that we've come to know for less than a year is now baked into an operating system for everyone, right? Even going to a website like ChatGPT and registering is not for everyone, and they will. 
They will definitely lower the bar for usage here. What's up, Yam?</p><p>[00:25:28] <strong>Yam Peleg:</strong> Hi. I just want to say that, because everything is so early, we've seen really great infrastructure for RAG, but we haven't seen a wide-scale product using RAG at this scale. So, and, and it makes sense at the end.</p><p>[00:25:47] I mean, you have a lot of information scattered around all different software and different devices. It's, I think it's the perfect idea to just merge everything with RAG and just allow you to chat with whatever information you have everywhere. And Microsoft is perfectly positioned to do that. And I'm looking forward.</p><p>[00:26:13] I think that, I think it's a great idea. I don't know if the implementation will be great. We need to see, I think it will, but we need to see. But I think, as a concept, it's a great concept.</p><p>[00:26:26] <strong>Alex Volkov:</strong> Something that I saw from a person who's very close with the Microsoft team: for some reason, the guy behind Bing, his name is Mikhail Parakhin, and he has this like very non-branded Twitter account that barely has an avatar image.</p><p>[00:26:43] And he's been doing, he's open. Yeah. He's been doing, he's been doing like customer support basically on Twitter. Like people will say, oh, Bing has this, has that. And he's been very, very responsive to some people. And so two things that he did say: first of all, DALL-E 3 is already part of Bing for some percentage of the population.</p><p>[00:27:00] So if you use Bing, and we've talked about Bing before, about image and vision, if you use Bing, go try and generate images with it. It used to be DALL-E 2, but if you get good ones, you may be getting DALL-E 3, which is incredible. You may already have this. 
And the second thing is, I saw somebody commented that he is now head of Windows, right?</p><p>[00:27:17] So the guy behind Bing, the guy who pushed AI into Bing, is now moving to be head of Windows. And I think this, together with this release, shows us just how much Microsoft is serious about AI everywhere, and is determined to not miss this new wave like they missed the mobile wave. And everybody says that Apple overtook Microsoft and Microsoft was like late to mobile.</p><p>[00:27:37] And it just goes to show like how much they invest in this whole thing. And I find it like very, very good, because for many people, even going to a website is a barrier of entry. And then when it's just like one click in their operating system of choice, I think it's going to shove AI into way more people's faces.</p><p>[00:27:54] I also want to say that Microsoft, out of the big ones, is fairly based in terms of safety and regulation, which we usually don't talk about, we can talk about it maybe in the next space, but like, we could have worse than Microsoft, which is surprising for me, because I used to hate on Internet Explorer most of my life.</p><p>[00:28:12] And so now Microsoft is very based. I think, unless there are more comments on Windows Copilot here, folks, then we can move on to the next stuff from OpenAI, actually.</p><p>[00:28:22] <strong>Nisten Tahiraj:</strong> So my last one is, I've started using Edge Canary as my daily browser just because of the sidebar and the splitting. So if you have a widescreen monitor, it's actually very handy, because you can have Code Interpreter on one side, and I'll show an image of it very quickly.</p><p>[00:28:39] And I have Bing, which has an excellent voice back and forth. 
And it has really good voice generation, which normally would be very expensive if you're paying for it, but it's in beta. And then I have the actual work, and on the sidebar you can have... anyway, this interface is a bit convoluted, and the Edge browser is, it's still a little bit clunky, but overall, it's been working pretty well for me.</p><p>[00:29:06] So I, I don't know. I sort of see the browser as being more and more important. That's your operating system. Some people disagree. They're trying, like Sean is, is trying to do more of the OS-native stuff with his tool that lets you run multiple ones. But yeah, you can see the screenshot of how I started using it with voice, so.</p><p>[00:29:28] In general, I see it as you'll just talk to it back and forth. I think that's,</p><p>[00:29:32] <strong>Alex Volkov:</strong> at least that's what I want. Were you referring to Swix's Godmode app where you can run all the LLMs in like a window?</p><p>[00:29:39] <strong>Nisten Tahiraj:</strong> Yes, but that one, for example, on the Mac, right, there's an icon right beside the clock. And you just click that and it pops up, so it's unintrusively there.</p><p>[00:29:49] And it adds to your experience instead of getting in the way. And I, I do like that part, because it is using real estate on the screen efficiently. But again, if you use a wider monitor, so can Edge, with all of its right-sidebar shortcuts, because then you can add your Discord, your Outlook, and stuff there too, right where the GPT, like right where I use the Code Interpreter window, and even have some completion and document-writing stuff too now.</p><p>[00:30:19] So that's how I see it. I, it's again, it's up in the air what people will find most helpful.</p><p>[00:30:25] <strong>Alex Volkov:</strong> Absolutely. And I've been using Bing somewhat as well. And yes, the sidebar can also read from the page, right? 
So the Bing chat in the sidebar has access to the page if you give it access.</p><p>[00:30:37] And that, for like summarization and different things, that's really, really excellent as well. Like, it complements your browsing experience. So I'm assuming that they're doing some stuff with the Copilot.</p><p>[00:30:46] OpenAI leaks, GPT-4 Vision, Gobi</p><p>[00:30:46] <strong>Alex Volkov:</strong> All right, folks, we're moving forward, because we have much to cover. And there's more news from OpenAI.</p><p>[00:30:52] They actually came before DALL-E, and we were supposed to talk about them first, and then DALL-E, but sorry, and then DALL-E came out. And now let's cover some news from OpenAI. So, it feels like the theme behind all of this news is OpenAI is trying to rush stuff to the door, or to announce some stuff to the door, because they know, or they hear, or they saw the information from Google breaking out about Gemini, the multimodal,</p><p>[00:31:19] huge model from, from Google that is potentially GPT-4-like, can do images in the input, and is multimodal on the output as well. And so we don't know many, sorry, we don't know much information about Gemini so far, but we do know that The Information, kind of, the publication called The Information, reported that Gemini is coming very soon.</p><p>[00:31:40] And we see the response from OpenAI in multiple places, right? So DALL-E 3 is one of them. OpenAI released, so, The Information also leaked 
about OpenAI gearing up to give us vision. For those of you who remember, pretty much every space since March we're talking about GPT-4, that is also multimodal on the input. And yeah, we can probably go into the details of whether or not it's fully multimodal, versus Gobi, and I would love for you to participate in this. But basically GPT-4, when they announced it, they showed the demo of it: they gave it some screenshot, they gave it like a sketch of a website, and it was able to code that. And then we didn't get that feature, the multimodality from GPT-4, we didn't get it.</p><p>[00:32:20] The only people who got it, and me and Nisten interviewed the CEO of this, is Be My Eyes, which is this app for blind folks, and they just like shoved GPT-4 vision in there to help those with eyesight issues. And it seems that now Google is finally stepping into the arena, sorry for the pun, and that we may get GPT-4 vision very soon.</p><p>[00:32:42] I actually saw some screenshots of how it looks inside the GPT-4 ChatGPT interface. And the additional exciting thing is, they have a different model, with the codename Gobi, that apparently is in the works at OpenAI. And that one is going to be multimodal, and like, fully. So, Yam, I would love, if you can, for you to repeat what we talked about last night, about the differences and how GPT-4 is multimodal, but not fully.</p><p>[00:33:06] I would love for you to expand on this.</p><p>[00:33:09] <strong>Yam Peleg:</strong> Yeah. First, it's important to understand that there is a huge difference in infrastructure between the two companies. And the infrastructure dictates what is possible or not possible, what is hard or not hard. From the rumors, nothing is confirmed, but from the rumors, the, the structure and the size of GPT-4 is,</p><p>[00:33:34] it was chosen to fit the hardware, the infrastructure, to actually run the model. It doesn't matter if you have the best model in the world if you cannot just serve it. 
So Google is using its own hardware, which it is not sharing with anyone else, and it's important to understand this. So when we see that Google is doing, according to the rumors,</p><p>[00:33:58] an insane training run, or preparing to ship or serve an insane model that is multimodal on the input and on the output, the reason, I think, the reason OpenAI didn't release GPT-4 with the image head is simply because it's probably expensive. It's not that easy to deploy something like this, especially not with the amount of people that use OpenAI's services.</p><p>[00:34:31] And I think this is the reason for what we see at the moment. Now, it's important to understand that according to rumors, again, nothing is confirmed, take it with a grain of salt, according to the rumors, which make sense, GPT-4 is first a language model. It was trained as a language model, just a language model.</p><p>[00:34:53] And once it was trained, they added an image head to the frozen model. Basically, this reduced the risk of something going wrong with full multimodality end to end. And moreover, it allows you to just use the model on its own, and if you want, you can plug in the head.</p><p>[00:35:14] It's flexible, you can use it with or without the head. Now, the thing is that you do pay a price, because, again with a grain of salt, there are caveats to this, but we have already seen multiple times that multimodality, when done right, benefits both modalities.</p><p>[00:35:36] So GPT-4 allegedly did not benefit from the multimodality. And this is the difference between GPT-4 and the new rumored model that we have. According to the rumors, the rumored model was trained end to end, images and text, throughout the whole training.
So, if it's true, if everything is true, we should expect a better model even if you just use it for text. We should expect a better model because the images influence the text, and the text influences the images, and so on and so forth.</p><p>[00:36:12] <strong>Alex Volkov:</strong> That's great. I have one follow-up question. You spoke about benefits from training on text and vision, and I remember Ilya Sutskever also talked about this, I think with Jensen, the CEO of NVIDIA, and he talked about it in different other places. Could you speak to some of those potential benefits, of how a multimodal model trained on text and images is actually better?</p><p>[00:36:37] <strong>Yam Peleg:</strong> If I remember correctly, Ilya gave the perfect example for this. You can, if you really want, describe what the color red means with text, or what objects are red. All of this will be nothing like just seeing the color red. So there is a difference between actually training on images</p><p>[00:37:04] versus training on text that describes the images, which is just a different sensation. So the whole, you can say, world model inside the language model is influenced by the images. And I think color is just a great example, and if I remember correctly, that was the example he gave in this interview.</p><p>[00:37:27] <strong>Alex Volkov:</strong> Yeah, absolutely. And I think the other thing he said is it's obviously better at stuff like math or physics, where it's able to actually read the graphs and everything. It just arrives at the answer faster. But also, like Yam correctly pointed out, the world model of this model is way better, because it's able to see, basically.</p><p>[00:37:50] So we have potentially exciting news. One thing I will add is that, Yam, I think you're correct: OpenAI just didn't want to spend the GPU cycles on the vision model and on being able to attach a head with vision.
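As an aside for readers: the rumored recipe Yam describes, train the language model alone, freeze it, then bolt on an image head, can be sketched in a few lines of PyTorch. This is a toy stand-in (a single `Linear` layer plays the "backbone"), not anything resembling GPT-4's actual, unconfirmed architecture.

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained language model backbone.
backbone = nn.Linear(8, 8)

# Freeze it, as in the rumored recipe: train the LM first, then keep it fixed.
for p in backbone.parameters():
    p.requires_grad = False

# A trainable "image head" that projects vision features into the LM's space.
image_head = nn.Linear(4, 8)

img_feats = torch.randn(2, 4)          # a batch of fake image features
out = backbone(image_head(img_feats))  # head feeds the frozen backbone
out.sum().backward()                   # gradients flow only into the head

frozen_grad = backbone.weight.grad     # None: the backbone never updates
head_grad = image_head.weight.grad     # populated: only the head trains
```

The upside Yam mentions falls out of this structure: the backbone can be served with or without the head, since unplugging the head leaves the original language model untouched.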
I think it's exciting. I do want to highlight that Microsoft likely has</p><p>[00:38:08] the bandwidth for that, because Bing has the ability to have vision now. I don't know if it's, like, the full one. I don't know if they did some work, because the examples that I tested with Bing vision gave lower quality responses on images than I was expecting from GPT-4 from the examples.</p><p>[00:38:25] So maybe they did some stuff for optimization speed. But yeah, it definitely feels like infrastructure was gearing up for this, and hopefully we'll see it soon from OpenAI.</p><p>[00:38:36] 3.5 instruct model from OpenAI</p><p>[00:38:36] <strong>Alex Volkov:</strong> Another thing we saw from OpenAI, and I think this is the last one, we have a bunch of OpenAI updates, is the 3.5</p><p>[00:38:42] Instruct model. And unlike the ChatGPT model, 3.5 Instruct is very similar to how OpenAI APIs were actually working before the ChatGPT explosion, right? Before you were able to, like, do back-and-forth conversation, before it was RLHF'd for conversation purposes. And I saw many, many folks get very excited about 3.5</p><p>[00:39:05] Instruct, because it's very similar to what we had before ChatGPT, but it's much faster. Now, we don't know if it's faster because way fewer people use it because it's new, or faster because they actually did some TurboMagic on it. But we'd love to invite folks on stage, maybe Roy, maybe Mr.</p><p>[00:39:21] Yang, to talk about the instruct endpoint in the API versus the regular chat endpoint. If you have anything to add here from what you read, please feel free to add it.</p><p>[00:39:36] <strong>Nisten Tahiraj:</strong> I used it in the playground to just, like, write an agreement for the site, like a privacy agreement.</p><p>[00:39:41] It was pretty good for that. It's just annoying that the context window is so small. It's only a 4K context window.
And it's more like only three and a half K, because some of it will be your prompt. I think it has some other very good usability uses which we haven't experimented with yet.</p><p>[00:40:02] Like, one person got it to play chess very well. And I think it's really worth looking at for stuff like doing automation, or continuing some work on your desktop, for example with Open Interpreter, and it'll be able to continue generating in that regard. So there are quite a few things to explore there.</p><p>[00:40:26] I'm just glad it's cheap and it's good. That's what we want at the end of the day.</p><p>[00:40:34] <strong>Alex Volkov:</strong> Yeah, it's cheap, and I think many folks were surprised that they had to switch to the chat interface for ChatGPT to get, like, the benefits and the speed, and now they're happy that they have the instruct model of old. They also added logprobs.</p><p>[00:40:47] So I would love to ask folks on stage, because I'm not entirely sure what, like, logprobs is in the API response, and I saw Alex Graveley and some other folks getting excited about logprobs. And I want to say, just before, I want to say hello to Ronak, if I'm pronouncing this correctly. Ronak. And we're going to talk about RememberAll in a second, or in a few minutes, but if you have comments on the Instruct API and logprobs, feel free to share.</p><p>[00:41:18] <strong>Raunak Chowdhuri:</strong> Yeah, I do. Logprobs is awesome. It basically gives you, like, token-level probability distributions from the model. So normally, when you are using GPT-4 or GPT-3, you just get words back when you're querying the model. But what logprobs allows you to do is see the probability distribution output by the model, which is normally sampled with, like, the temperature parameter.</p><p>[00:41:43] And you can use that to do a lot of, like, really interesting things.
Like, for example, if you're asking GPT to solve a multiple-choice question, it's really useful to actually understand the model's confidence in whether it's A, B, C, or D. And you can actually get that directly from the model by examining that probability distribution from the logprobs.</p><p>[00:42:05] So it actually provides a lot more insight into what the model is thinking, and I think that's a pretty useful technology. You can actually do a lot of clever things with it. Like, someone built something called JSONformer, which is basically a tool where, if you have a model that exposes logprobs, you can sample only the tokens</p><p>[00:42:24] that are valid JSON tokens, and construct a response that is very much aligned with a certain format that you want. So I think that's a pretty powerful tool.</p><p>[00:42:36] <strong>Alex Volkov:</strong> Thank you, Ronak. Thank you. And I remember JSONformer, and did not know that they use logprobs for that. So here you have it, folks.</p><p>[00:42:43] There's a new endpoint for your usage that now exposes the token probabilities, so you can use this to build better tools and different types of tools. And yeah, Ronak, would you care to introduce yourself briefly? I will ask again once we record your section, but feel free to introduce yourself.</p><p>[00:43:03] Raunak intro</p><p>[00:43:03] <strong>Raunak Chowdhuri:</strong> Yeah, absolutely. So I'm a senior at MIT, I'm graduating in a couple months. My background's in machine learning, artificial intelligence. I've been doing research in this area for quite a few years now. Yeah, I'm working on some interesting projects that we'll dive into later, but basically building long-term memory for language models.</p><p>[00:43:25] Bard Extensions allow access to Gmail, YouTube, Drive</p><p>[00:43:25] <strong>Alex Volkov:</strong> Awesome, awesome. Thank you.
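For readers, the multiple-choice confidence trick Raunak describes is just exponentiation plus normalization over the log-probabilities the API returns for the answer token. The numbers below are made-up logprobs, not real API output.

```python
import math

def choice_confidence(top_logprobs, choices=("A", "B", "C", "D")):
    """Turn a token -> log-probability mapping (as exposed by completion
    APIs that return logprobs) into a normalized confidence distribution
    over the multiple-choice answers."""
    probs = {c: math.exp(top_logprobs[c]) for c in choices if c in top_logprobs}
    total = sum(probs.values())
    return {c: p / total for c, p in probs.items()}

# Hypothetical logprobs for the first answer token of a question:
example = {"A": -0.22, "B": -2.1, "C": -3.4, "D": -4.0}
confidence = choice_confidence(example)
# "A" gets by far the largest share of the normalized probability mass.
```

The same distribution is what constrained-decoding tools like the JSONformer idea mentioned above exploit: instead of normalizing over answer letters, they zero out every token that would break the target format before sampling.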
Thank you for coming up, and thank you for explaining logprobs as well. All right. So the next thing I want to talk about briefly, really briefly, because it's not that great, is Bard from Google. Before we get to Gemini, before we hear about Google's, like, explosive, GPT-4-combating model, et cetera, right now we have Bard. For some reason we also have Google Assistant, which I'm not sure what's the involvement with LLMs there. But Bard is something that some folks on stage here use, and I was never, like, very, very excited about Bard, for some reason.</p><p>[00:44:00] However, they just released a few updates to Bard, and they say, like, this is the best Bard ever. And it feels very googly, very, like, product-manager-y to me, at least. What they released is something called extensions, right? So if you used Bard before and you haven't touched it in a while, like I haven't, if you go to Bard right now, what you will have is the chance to</p><p>[00:44:22] update it with extensions. Those extensions can access your Gmail, all of it, your Google Drive, all of it, YouTube, and I think some other ones that I'm trying to remember. And the cool thing about this, which I actually like, is the UI. You can do an at sign, like when you mention somebody on Twitter,</p><p>[00:44:38] and then you have access to those extensions. It's a different take on the plugins in ChatGPT, where with ChatGPT plugins you have to be in that mode, it decides for you, blah, blah, blah. So here you can actually say, like, add Gmail, and then ask it questions. It will actually go and do a search in your Gmail account and give you back answers in natural text.</p><p>[00:44:56] So, conceptually pretty cool, right? We all use Gmail, or at least most of us use Gmail, and so to be able to, like, get summaries of the latest emails, blah, blah, blah. So conceptually very cool. Google Docs as well. You can tag Google Docs, you can do Google Drive.
Oh, Google Maps is the other one.</p><p>[00:45:10] So you can actually say, like, hey, what are some of the stuff in San Francisco, Seattle, whatever, and it will give you that. The thing that I was really surprised by is just how bad it is. Honestly, if there are folks in the audience who work on Bard, I apologize. Sometimes we say these things, but there are, like, so many, so many people working on this stuff,</p><p>[00:45:31] and, like, the nights and weekends, they don't see family. So, like, I apologize. Just from a comparison standpoint, in my experience, I was really disappointed that</p><p>[00:45:50] Google, who's this, like, huge company that, like, created Transformers for us, they're not afraid to release something this bad. And what is bad? I mean, specifically, I literally used two of the extensions. One is Gmail, to ask it about my upcoming flight to San Francisco, which I told you guys about. I'm going to be at the AI Engineer event as a media person. And it couldn't find any information about this flight and just gave me flights from the past.</p><p>[00:46:07] I literally asked, give me flights from the future, or, like, give me my upcoming flights, and it gave me flights from the past. It also gave me two trips to the Denver museum, which are not flights. And so, yeah, we know LLMs hallucinate, blah, blah, blah. But if you put your brand behind this, and you're Google, and you put Gmail in there, and you cannot, like, do a basic search, that's upsetting.</p><p>[00:46:30] And so I said, all right, I'll give it another try. I did YouTube. And I asked, hey, what does MKBHD, Marques Brownlee, if you guys don't follow him, he's, like, this great tech reviewer, what does he think about the latest iPhone? And it went to YouTube, and it searched, and it gave me
Marques's videos from last year about the iPhone 14. And I literally took the same string that I pasted into Bard, went to the YouTube interface, pasted it into the YouTube search, and got, like, the latest videos that he had about the iPhone 15.</p><p>[00:46:58] And so I was thinking there, like, why would I ever use this if, like, the first two searches did not work, when this is the whole promise of the thing? So again, not to be negative, I don't love being negative, it's just, like, from a comparison standpoint. I really have to wonder how many folks in Google are trying to rush through the LLM craze.</p><p>[00:47:19] We remember Sundar Pichai saying AI, AI, AI, AI on the stage, like, 48 times, right? And they're shoving AI into everywhere. It's just, for me, it wasn't that useful. So I would love to hear, Tzafrir, I see your hand up, I would love to hear from folks on stage about your experience with Bard and those specific new extension things.</p><p>[00:47:41] <strong>Tzafrir Rehan:</strong> So I don't have much experience with it, actually, for the same reasons that you said. But I want to give the perspective that I think what we're seeing here is Google jumping early to stay in the game. Maybe they didn't expect ChatGPT to go viral that big, that fast. This was developed like a sci-fi technology, and suddenly it's a household item overnight.</p><p>[00:48:09] But if you're talking about Google, and I actually worked at Google for three years, about a decade ago, it's a company that can make very big moves, very slowly. That means Gmail data, Drive data, it's the holiest of holies of privacy.
If you want, as an engineer at Google, to touch that data, to read even a single byte, you need to go through quarters of legal meetings.</p><p>[00:48:41] So the fact that they are going in this direction indicates a course heading, that they took the time to think it through and decide: yes, we are doing this very risky move in terms of privacy and user expectations, because they believe in the value. So let's see where they get to when it's actually fully implemented,</p><p>[00:49:05] because I think what we are seeing right now is a rushed-out version.</p><p>[00:49:09] <strong>Alex Volkov:</strong> I agree. I think that's how it definitely feels, where the basic stuff, like a keyword search, works better than this search, and they're basically hitting the API which they have behind it. It definitely feels rushed: very polished UI-wise, very safe, very, like, protective, very googly, but not super helpful.</p><p>[00:49:27] I think at this point, yeah, I think this is most of the news, unless I'm missing some, so let me look and see in my template that I already drafted for myself, let's see if we have any more things to cover before we move on to the interviews. So yes, one last thing I wanted to find, I'll just find this thing. It's called chain of density.</p><p>[00:49:48] So, I saw this, I think it was a paper first, and then, yeah, I'll share this in the chat. Sorry, not in the chat, in the Jumbotron. I saw somebody release a paper on this, and then I think Harrison from LangChain reposted it and actually put it up on their website with the prompt sharing, where you can play with prompts. It's this new method called chain of density, which is actually really, really good at getting summarizations from ChatGPT and different other places, like Claude as well.</p><p>[00:50:21] And I think it's really cool, because I just posted it on top.
It asks for four summarizations with more and more density, right? So it starts with, like, hey, summarize this text or article, and it says: give me a JSON response with, like, four summarizations. For the second</p><p>[00:50:37] one, it says:</p><p>[00:50:40] extract from the first one that you just gave me the entities that were missing, and give me another summarization with those entities. And then do it again and again. And I think there's, like, some cool prompt magic in there that says something to the tune of: make sure that this is understood on its own, and the person doesn't have to read the article to understand the summarization.</p><p>[00:51:00] I personally have gotten really good summarizations based on this technique, so much so that I've added it to my snippets, where I have different snippets for prompts. If you are doing any type of summarization, definitely check it out. Nisten, I saw your hand briefly up, if you want to comment on this thing.</p><p>[00:51:16] <strong>Nisten Tahiraj:</strong> Yeah. Like, the first person that I knew who got a contract as a prompt engineer actually used this technique a lot last year. And the way he was explaining it was: when you compress an idea and then you extrapolate, that's how creativity happens in general. You compress, you extrapolate out of it, you compress, and then you extrapolate.</p><p>[00:51:36] So it's pretty interesting that someone did this in a much more systematic way. I'm going to check it out.</p><p>[00:51:43] <strong>Alex Volkov:</strong> Chain of density. And I wanted to ping back real quickly on the compressing part, because, yeah, I saw your tweet, and there was a paper about compression as well, and Ilya gave a talk about compression recently.</p><p>[00:51:55] And I wanted to see if you want to talk about that compression part and paper, briefly, and if not, that's also okay.
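For readers, the iterative recipe Alex describes can be sketched as a prompt builder. The wording below is an illustrative paraphrase of the chain-of-density idea, not the exact prompt from the paper or the LangChain hub.

```python
def chain_of_density_prompt(article: str, rounds: int = 4) -> str:
    """Build a chain-of-density style summarization prompt: ask for several
    increasingly entity-dense summaries of the same article, returned as JSON.
    The phrasing is illustrative, not the canonical prompt."""
    return (
        "Article:\n" + article + "\n\n"
        f"You will generate {rounds} increasingly dense summaries of the article.\n"
        f"Repeat the following two steps {rounds} times:\n"
        "1. Identify 1-3 informative entities from the article that are missing "
        "from the previous summary.\n"
        "2. Write a new summary of identical length that covers every entity from "
        "the previous summary plus the missing ones.\n"
        "Each summary must be understandable on its own, without reading the "
        "article.\n"
        f"Answer in JSON: a list of {rounds} objects with keys "
        '"missing_entities" and "denser_summary".'
    )

prompt = chain_of_density_prompt("Some article text...")
```

The "understandable on its own" line is the prompt magic Alex calls out: without it, later summaries tend to lean on the article instead of standing alone.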
We can move on, but I just, like, I think this is also this week.</p><p>[00:52:07] <strong>Yam Peleg:</strong> Yeah, I had some controversial opinions in the last couple of weeks, and as it turns out, there are papers that support them coming up after them.</p><p>[00:52:19] But yeah, I highly, highly suggest reading the compression paper. Basically, what it says, it conveys the idea that what we are actually doing is, I want to say, reversing the process that generates the data.</p><p>[00:52:39] And if you think about it, the process that generates the data is us. So I don't want to say the words that I shouldn't, I got some heat for them, but you can find them in my tweets. It's a really good paper. It's much more scientific, you can say, versus other papers that talk about intelligence, about general intelligence, and it pokes at this idea. I highly recommend reading this paper if you're interested in this part of what we're doing.</p><p>[00:53:13] It doesn't prove anything, because general intelligence is a big thing, but it is interesting: the ideas there are solid, and it's great to see.</p><p>[00:53:24] <strong>Alex Volkov:</strong> Yeah, I heard this multiple times, this comparison or metaphor that intelligence is compression, compressing a lot of ideas down. First of all, it compares to natural language: the ability of us to understand something, to put it into words, that's compression.</p><p>[00:53:39] Obviously, Feynman's quote, where, like, you really understand something if you can explain it to a five-year-old, is also, like, compressing down and being able to explain some stuff. And so I heard this multiple times, and it's great to see that there are now papers that talk about this.
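A toy illustration of the compression view being discussed: a model that assigns probability p to each observed token can, with an arithmetic coder, store the text in about -log2(p) bits per token, so a better predictor is literally a better compressor. The per-token probabilities below are made up.

```python
import math

def compressed_size_bits(token_probs):
    """Shannon code length: summing -log2(p) over the probabilities a model
    assigns to the observed tokens gives (roughly) the number of bits an
    arithmetic coder driven by that model would need for the sequence."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical per-token probabilities two models assign to the SAME 4-token text:
weak_model = [0.10, 0.05, 0.20, 0.10]
strong_model = [0.60, 0.40, 0.70, 0.50]

weak_bits = compressed_size_bits(weak_model)      # ~13.3 bits
strong_bits = compressed_size_bits(strong_model)  # ~3.6 bits
```

This is the mechanism behind the "LLMs beating PNG and JPEG" result mentioned just below: plug the model's next-token distribution into an arithmetic coder and the compressed size is exactly this log-loss.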
And continuous compression, like Nisten said, actually</p><p>[00:53:54] brings out better results. And it's also good to see, on the topic of literal compression, I know it's confusing, there was also another paper worth checking out from this week, where they actually used LLMs and different transformers for actual compression, to compare to, like, PNG or JPEG, et cetera. And I think they saw very interesting compression results as well. I don't remember if I have a tweet for that, but yeah, be on the lookout for multiple types of different compression as we move forward. Thank you. With that, I think we are ready to move on to our guests here on stage and to talk about two exciting things.</p><p>[00:54:30] So, first of all, actually three exciting things. One of them is, Nisten, you had a visit to Geoffrey Hinton's lab that I wanted to hear a brief story about. After that, we're going to talk with Arthur and Xenova about WebGPU, and do, like, a brief interview about running models locally in the browser.</p><p>[00:54:47] And then at the end, we're going to talk about RememberAll with Ronak and his exciting approach to extending context windows. So with that, I'll just give a brief summary of the spaces we had today and some logistics, and then we can get started with the second part of ThursdAI. So again, for everybody in the audience, whether you're just joining in the middle or have joined us from week to week,</p><p>[00:55:08] ThursdAI is about staying up to date together, and giving updates every week so that folks don't have to follow everything, because it's almost impossible. I'm very happy to be joined by multiple folks from different disciplines, folks who can answer stuff and find new things to get excited about in AI,</p><p>[00:55:28] from different fields, every week here on stage. We also have a podcast and newsletter.
If you're here and you're new and just joined us, you can join next week, and you can sign up for the newsletter as well. We stay up to date so you don't have to, that's the motto. And the first part of this is usually updates from last week and breaking news.</p><p>[00:55:46] There's more breaking news, with YouTube something, but I think we'll cover that next time, unless folks here want to read up on it and give us an update at the end. The second part is usually a deep dive into different conversations and guests, and today we have Arthur and we have Ronak to talk about different, very exciting things.</p><p>[00:56:05] And we'll start with Nisten's brief foray into the lab. AKA, yeah, Nisten, give us, like, a few minutes on your excursions.</p><p>[00:56:16] <strong>Nisten Tahiraj:</strong> Well, I've been going as a guest to the Vector Institute for over a year, a year and a half, and this time I went in, and I'd never met Far El in real life.</p><p>[00:56:28] I didn't even know what he looked like. He was just some dude from GitHub. And yeah, so I invited him in, and we were going to work on making the bootable OS that just boots straight into a GGML model and then hopefully gets Chromium running with WebGPU. And essentially, I had made before a tiny 4.7</p><p>[00:56:54] gig ISO that includes an entire Llama 7B model and an entire Linux distro. I use Slackware, that's the smallest, and I've used that for, like, 20 years. And yeah, so we were in the lab, and eventually he's like, let's just try and get the whole thing working. Let's just try and get the mixture of experts.</p><p>[00:57:14] Let's just do it all at once and see where we get stuck. And anyway, I had to call another friend who was an extremely good DevOps engineer
to help. And yeah, anyway, long story short, I couldn't get it to run on the GPU, because there were no instances and I only had an A10 24 gig, and the mixture-of-experts model needs more than that, because it's 32 experts.</p><p>[00:57:39] So I had to run it on the CPU, and that's what we spent the entire day and evening on. And it was really slow, but then we realized, yeah, this is probably, like, the first time someone has effectively run a mixture of experts model on a CPU. And again, you can check out the repo.</p><p>[00:57:58] I made a CPU branch, and it's the V1 branch if you really want to get it to work. But yeah, that was the story. I just met with a random person from Twitter for the first time, who was in their Discord, and yeah, it was fun. And the funniest part was that Colin Raffel happened to be there, who has been teaching about mixture of experts and writing a lot of the papers, and then we looked behind, and he's literally, like, five desks away.</p><p>[00:58:30] And I was just, like, taken aback. Like, oh, holy cow, he's here. And he had no idea who we were or anything. So, yeah, that was fun.</p><p>[00:58:39] <strong>Alex Volkov:</strong> If you don't mind me completing this story from what you told me multiple times, because I think it's way more colorful than you let on. First of all, the Vector lab is the lab of Geoffrey Hinton, the grandfather of AI, right?</p><p>[00:58:52] This is the lab. He's widely considered the person who, like, kickstarted this whole field, basically. Is that the lab? Was he there?</p><p>[00:59:02] <strong>Nisten Tahiraj:</strong> Yeah, yeah, yeah. Ilya Sutskever has been a student there. He wasn't there. He's rarely there. He only has, like, one PhD student under his wing this year,</p><p>[00:59:12] so he comes in very rarely. But yeah, Ilya Sutskever was in the smaller lab, before they moved here.
Also, Aidan Gomez, one of the authors of the Transformer paper, still comes there every once in a while. He was there regularly up until Cohere got funded last year. And yeah, this is the lab, and it's pretty funny, because everyone's very, very academic, and we're just straight-up hacking on whatever we can find.</p><p>[00:59:45] <strong>Alex Volkov:</strong> So the second thing that I wanted to cover here is exactly what you built in the lab of Geoffrey Hinton, because he's now very public about the AI doomerism, the different kinds of potential bad things that will happen with AI, and how not to open source, how to regulate. He's very public,</p><p>[01:00:04] he's on every news channel. And here you are, you and Far El, working on an ISO, a bootable AI disc that you can literally run offline, that has Llama, an offline LLM. That basically says: even if they regulate, you can just, like, take an old CPU-based machine and run this thing. So you're basically democratizing AI in the lab of the person who's now, like, very, very vocal about, like, stopping it.</p><p>[01:00:27] So that's the second part that I personally very much enjoy.</p><p>[01:00:31] <strong>Nisten Tahiraj:</strong> It's not just that. Also, if you listen further than what the news media shows, it's a lot more complex than that. He wants people to acknowledge that the risks are real and show that they are mitigating them. But at the same time, he's been doing research to do molecularly grown chips,</p><p>[01:00:51] and that architecture at first didn't work. So they're still going full speed ahead. The reason that they went that way was just saying to a lot of the community: just don't act like idiots, just regulate yourselves. That was why they were challenging that.</p><p>[01:01:09] It was a lot more complex than people realize. And the professor there, Colin, he's been a big pusher for democratizing and open sourcing
models in general. And so, yeah, it's a lot more nuanced than what you see in the media. And when you think about it, the safest form of AI that you can have is one that you can just literally unplug and have full control over, so there is nothing safer than that.</p><p>[01:01:40] Otherwise, you're just trusting some incompetent politician with regulatory or whatever legal hacks to control it. So, yeah. It's a lot, I want to say, a lot more nuanced than what you've just seen in media snippets and reactionary Twitter takes.</p><p>[01:01:58] <strong>Alex Volkov:</strong> Yeah, I hear you. And definitely we'll check out the nuances on Geoffrey Hinton on the topic. Very briefly, just something that also happened this week: Yann LeCun, the GOAT, aka the Meta AI chief scientist, went in front of the Senate, I think a couple of days ago. I just pinned the tweet on top that he actually retweeted, and he gave an incredible opening statement, talking about how open sourcing is very important, why they open sourced LLaMA, talking about the fact that they open sourced LLaMA 1 and the sky didn't fall, and all of these things. He also outlined a bunch of the safety protocols that they took into account when they released Llama 2. And I think it's, first of all, very important to have somebody like Yann in front of the Senate, talking to legislators and regulators about regulation, because we see more and more of this.</p><p>[01:02:52] I think you brought up last week that there was another discussion, and Elon Musk was there, and Sundar Pichai was there, everybody was there, talking about AI and how to regulate. And I think it's very important to have voices like Yann LeCun talk about these different things with clarity and safety.</p><p>[01:03:07] And so I definitely
recommend everybody check out his opening statement, because, you know, the doomers, it's very easy to scare people, especially on, like, the engagement-baiting networks like X, et cetera. It's very easy to, like, take something that people don't understand and use it to scare folks. And I think it's very important to have very clear, very credentialed and very, like, understanding people from this world to actually explain that there are benefits, and explain how open source can benefit as well.</p><p>[01:03:36] And I think you also mentioned how excited the open source community was about the Llama 2 release, and I want to believe that we all had, like, a small, tiny part to play in this. So, yeah, we're definitely on Yann's map, sorry, Yann LeCun's map, and it's definitely worth checking this out. I think with that, Nisten, thank you for sharing your story.</p><p>[01:03:52] Tell us more escapades from the Vector lab. And if you get to meet Geoffrey Hinton, tell him about ThursdAI, and also Colin.</p><p>[01:03:59] ​</p><p>[01:03:59] <strong>Alex Volkov:</strong> All right, folks, this actually concludes the two hours that we've allotted for ThursdAI today. I know there are, like, many folks, I see Dave in the audience. What's up, Dave? I see other folks just stepping in, with all the sadness of, I want to keep talking with all of you.</p><p>[01:04:20] There's also now a need to transcribe this and put it into a newsletter, a podcast form. ThursdAI is here every week. We're here literally every week since GPT-4 came out, I think. Did I miss one week on vacation? Yeah, the newsletter came out, but we didn't talk that week. I felt like, oh, I miss my guys,</p><p>[01:04:37] I miss my friends, we need to get up to date together. So we're here every Thursday. There's so much always to talk about.
I just want to highlight how boring this would have been without friends like Nisten and Nova, Arthur, now the new friend of the pod, and some other folks who stepped away, Yam and Far El, like many, many other folks who join this week to week.</p><p>[01:04:58] And help us bring you, the audience, the best AI news roundup possible on X slash Twitter. Now almost six, seven months already into this. This has opened many, many opportunities for many folks on stage, including myself. I'm going to the AI Engineer Conference. As a media person, I'm going to do some spaces from there.</p><p>[01:05:19] If you're at the AI Engineer Conference in a couple of weeks, definitely reach out and we'll talk over there. With that, I want to just say... Without the audience here, this also would be very, very boring. So thank you for joining from week to week. Thank you for listening. Tuning in. Thank you for subscribing.</p><p>[01:05:34] Thank you for sharing with your friends. And thank you for leaving comments as well. And with that, I wanna wish you a happy Thursday. I'm sure there are going to be many, many, many new things releasing just today. But you know, we can only cover so much. With that, thank you folks. Have a nice rest of your Thursday.</p><p>[01:05:49] And we'll meet you here next week. And yeah. Cheers. Have a good one.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-21-openai-dall-e-3-35</link><guid isPermaLink="false">substack:post:137246415</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 22 Sep 2023 04:33:02 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137246415/69d77958e74e11f1cf230aac072b1a95.mp3" length="66568770" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4160</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/137246415/a6ee2969cc4ed0af84ada58d4b734092.jpg"/></item><item><title><![CDATA[📅 ThursdAI - Special interview with Killian Lucas, Author of Open Interpreter (23K Github stars in the first week) 🔥 ]]></title><description><![CDATA[This is a free preview of a paid episode. To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Hey! Welcome to this special ThursdAI Sunday episode. Today I'm excited to share my interview with <a target="_blank" href="https://twitter.com/hellokillian">Killian Lucas</a>, the creator of <a target="_blank" href="https://openinterpreter.com/"><strong>Open Interpreter</strong></a> - an incredible new open source project that lets you run code via AI models like GPT-4 or local models like Llama on your own machine. </p><p><p>Just a quick note: while this episode is provided for free, premium subscribers enjoy the full write-up including my examples of using Open Interpreter, the complete (manually edited) transcript and a video form of the pod for easier viewing, search, highlights and more. 
Here’s a trailer of that in case you consider subscribing</p></p><p>If you haven’t caught up with GPT-4 Code Interpreter yet (now renamed to Advanced Data Analytics), I joined <a target="_blank" href="https://substack.com/profile/5753967-simon-willison">Simon Willison</a> and <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> when it first launched and we had a deep dive about it on <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a>, and even on the day of the release, we were already noticing a major limiting factor: Code Interpreter is amazing, but doesn’t have internet access, and can’t install new packages or use new tools. </p><p>An additional thing we immediately noticed was that the surface area of “what it can do” is vast; given it can write arbitrary code per request, it was very interesting to hear what other folks were using it for, for inspiration and “imagination unlock”.</p><p>I started a hashtag called <a target="_blank" href="https://twitter.com/hashtag/codeinterpretercan?src=hashtag_click">#codeinterpreterCan</a> and have since documented many interesting use cases, like <a target="_blank" href="https://twitter.com/ArghZero/status/1691768136431988783">committing to git</a>, running a vector DB, converting audio & video to different formats, <a target="_blank" href="https://x.com/shivkuma_k/status/1678410077407961088?s=20">plotting wind rose diagrams</a>, <a target="_blank" href="https://x.com/altryne/status/1678504194574110724?s=20">running whisper</a> and so much more. </p><p>I personally have all but switched to Code Interpreter (ADA) as my main chatGPT tab, and it’s currently the reason I’m still paying the 20 bucks! </p><p>Enter Open Interpreter</p><p>Just a week after open sourcing Open Interpreter, it already has over 20,000 stars on GitHub and a huge following. 
You can follow Killian on <a target="_blank" href="https://twitter.com/hellokillian"><strong>Twitter</strong></a> and check out the <a target="_blank" href="https://github.com/KillianLucas/open-interpreter/"><strong>Open Interpreter GitHub repo</strong></a> to learn more. </p><p>Installing is as easy as pip install open-interpreter. (but do make sure to install and run it inside a venv or a conda env, trust me!) </p><p>And then, you just… ask for stuff! (and sometimes ask again, as you’ll see in the usage video below)</p><p>Specifically, as highlighted in the incredible launch video, if you’re using a mac, Open Interpreter can write and <strong>run AppleScript</strong>, which can run and control most of the native apps and settings on your mac. </p><p>Here’s a quick example I recorded while writing this post up, where I ask Open Interpreter to switch the system to Dark mode, then I use it to actually help me extract all the chapters for this interview and cut a trailer together! </p>]]></description><link>https://sub.thursdai.news/p/thursdai-special-interview-with-killian</link><guid isPermaLink="false">substack:post:137110739</guid><dc:creator><![CDATA[Alex Volkov and Killian Lucas]]></dc:creator><pubDate>Sun, 17 Sep 2023 15:11:56 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137110739/ffe671bef18f8adc10afe9ef19da6120.mp3" length="52894673" type="audio/mpeg"/><itunes:author>Alex Volkov and Killian Lucas</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3306</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/137110739/85598389ddd22d905a001c0f90320af6.jpg"/></item><item><title><![CDATA[🔥 ThursdAI Sep 14 - Phi 1.5, Open XTTS 🗣️, Baichuan2 13B, Stable Audio 🎶, Nougat OCR and a personal life update from Alex]]></title><description><![CDATA[This is a free preview of a paid episode. 
To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Hey, welcome to yet another ThursdAI 🫡 </p><p>This episode is special for several reasons, one of which is that I shared a personal life update (you've got to listen to the episode to hear it 😉), but also, this is the first time I took on the mountainous challenge of fixing, editing and “video-fying” (is that a word?) our whole live recording! All 3 hours of it were condensed, sliced, sound-improved (X audio quality is really dogshit) and uploaded for your convenience. Please let me know what you think! </p><p><p>Premium folks get access to the full podcast in audiogram format, and a full transcription with timestamps and speakers, here’s a sneak preview of how that looks, why not subscribe? 😮</p></p><p>TL;DR of all topics covered</p><p>* Open Source LLM</p><p>* Microsoft Phi 1.5 - a tiny model that beats other 7B models (with a twist?) (<a target="_blank" href="https://arxiv.org/pdf/2309.05463.pdf">Paper</a>, <a target="_blank" href="https://huggingface.co/microsoft/phi-1_5">Model</a>)</p><p>* Baichuan 7B / 13B - a bilingual (cn/en) model with a highly crafted approach to training (<a target="_blank" href="https://cdn.baichuan-ai.com/paper/Baichuan2-technical-report.pdf">Paper</a>, <a target="_blank" href="https://github.com/baichuan-inc/Baichuan2">Github</a>) </p><p>* Big Co LLMs + API updates</p><p>* Nothing major this week</p><p>* Voice & Audio</p><p>* Stable Audio 🎶 - A new music generation model from Stability AI. 
(<a target="_blank" href="https://www.stableaudio.com/">Website</a>)</p><p>* Coqui XTTS - an open source multilingual text to speech for training and generating a cloned voice (<a target="_blank" href="https://github.com/coqui-ai/tts">Github</a>, <a target="_blank" href="https://huggingface.co/spaces/coqui/xtts">HuggingFace</a>)</p><p>* AI Art & Diffusion</p><p>* Würstchen v2 - A new super quick 1024 diffusion model (<a target="_blank" href="https://twitter.com/multimodalart/status/1702017999044006347">Announcement</a>, <a target="_blank" href="https://huggingface.co/spaces/warp-ai/Wuerstchen">Demo</a>, <a target="_blank" href="https://github.com/dome272/Wuerstchen">Github</a>)</p><p>* DiffBIR - Towards Blind Image Restoration with Generative Diffusion Prior (<a target="_blank" href="https://x.com/_akhaliq/status/1701347631727780070?s=20">Announcement</a>, <a target="_blank" href="https://t.co/dcFZuzAsdD">Demo</a>, <a target="_blank" href="https://t.co/tNt7E15PnL">Github</a>)</p><p>* Tools</p><p>* Nougat from Meta - open-source OCR model that accurately scans books with heavy math/scientific notations (<a target="_blank" href="https://facebookresearch.github.io/nougat/">Announcement</a>, <a target="_blank" href="https://github.com/facebookresearch/nougat">Github</a>, <a target="_blank" href="https://arxiv.org/abs/2308.13418">Paper</a>)</p><p>* GPT4All Vulkan from Nomic - Run LLMs on ANY consumer GPUs, not just NVIDIA (<a target="_blank" href="https://twitter.com/nomic_ai/status/1702350105540452645">Announcement</a>)</p><p>* Nisten’s AI ISO disk - <a target="_blank" href="https://twitter.com/nisten/status/1702343936520392757?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1702343936520392757%7Ctwgr%5E2791cbb9f4bc9095066e1a688b8550675fd676a8%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1OyKAVjreewGb%2Fthursdai-phi-15-stable-audio-coqui-tts-mojo-llama-interview">Announcement</a> </p><p></p><p>And here are timestamps and 
chapter/discussion topics for your convenience: </p><p>[00:05:56] Phi 1.5 - 1.3B parameter model that closely matches Falcon & LLaMa 7B</p><p>[00:09:08] Potential Data Contamination with Phi 1.5</p><p>[00:10:11] Data Contamination unconfirmed</p><p>[00:12:59] Tiny models are all the rage lately</p><p>[00:16:23] Synthetic Dataset for Phi</p><p>[00:18:37] Are we going to run out of training data?</p><p>[00:20:31] Breaking News - Nougat - OCR from Meta</p><p>[00:23:12] Nisten - AI ISO disk</p><p>[00:29:08] Baichuan 7B - an immaculate Chinese model</p><p>[00:36:16] Unique Loss Terms</p><p>[00:38:37] Baichuan Bilingual and Multilingual dataset</p><p>[00:39:30] Finetunes of Baichuan</p><p>[00:42:28] Philosophical questions in the dataset</p><p>[00:45:21] Let's think step by step</p><p>[00:48:17] Is breath related text in the original dataset?</p><p>[00:50:27] Counterintuitive prompting for models with no breath</p><p>[00:55:36] Idea spaces</p><p>[00:59:59] Alex - Life update about ThursdAI</p><p>[01:04:30] Stable Audio from Stability AI</p><p>[01:17:23] GPT4ALL Vulkan</p><p>[01:19:37] Coqui.ai releases XTTS - an open source TTS - interview with Josh Meyer</p><p>[01:30:40] Summary</p><p></p><p>Here’s a full video of the pod, and a full transcription, and as always, 🧡 thank you for being a paid subscriber, this really gives me the energy to keep going, get better guests, release dope podcast content, and host 3 hour spaces and then spend 7 hours editing 🔥</p>]]></description><link>https://sub.thursdai.news/p/sep-14</link><guid isPermaLink="false">substack:post:137052400</guid><dc:creator><![CDATA[Alex Volkov]]></dc:creator><pubDate>Fri, 15 Sep 2023 02:35:37 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/137052400/eeb4413c0a04c94b71beb559adb7b220.mp3" length="88026419" type="audio/mpeg"/><itunes:author>Alex Volkov</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>5501</itunes:duration><itunes:image 
href="https://substackcdn.com/feed/podcast/1801228/post/137052400/7ab385bacb7a124712c0008e3f38b334.jpg"/></item><item><title><![CDATA[🔥🎙️ ThursdAI Sunday special - Extending LLaMa to 128K context window (2 orders of magnitude) with YaRN [Interview with authors]]]></title><description><![CDATA[This is a free preview of a paid episode. To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Happy Sunday everyone, I am very excited to bring you this interview with the folks who took LLaMa 2 and made it LLoooooongMa!</p><p>Extending the LLaMa 2 context window from 4,000 to a whopping 128,000 tokens (<a target="_blank" href="https://huggingface.co/conceptofmind/Yarn-Llama-2-13b-128k">Yarn-Llama-2-13b-128k on Hugging Face)</a>, these guys <strong>also</strong> came up with a <a target="_blank" href="https://twitter.com/_akhaliq/status/1698497385230389585">paper</a> called YaRN (Efficient Context Window Extension of Large Language Models) and showed that YaRN not only requires <strong>10x</strong> fewer tokens to create these long contexts, but also <strong>2.5x</strong> fewer training steps! </p><p>And the models generalize, so there’s now no need to collect extremely long sequences (think book-length sequences) for the models to understand those context lengths. </p><p>I have also decided to do something different (which took me half of Sunday, so I can’t promise and am not committing to this format, but for the premium subscribers, you can now watch this interview with running karaoke-style subtitles and improved audio! This will be uploaded to Youtube in a week, but aren’t you glad you subscribed and are getting this first?) 
</p><p>Here’s a teaser preview: </p><p>And here are the chapters for your convenience (the only thing that’s AI generated 😂)</p><p>0:00 - Introduction</p><p>3:08 - Discussion of extending LLaMa 2's context length from 4,000 tokens to 128,000 tokens using the YaRN method</p><p>8:23 - Explanation of rope scaling for positional encodings in transformers</p><p>13:21 - How the rope scaling idea allows for longer context through positional interpolation</p><p>18:51 - Using in-context learning to train models on shorter sequences but still handle long contexts</p><p>25:18 - Sourcing long-form data like books to train 128k token models</p><p>31:21 - Whether future models will natively support longer contexts</p><p>37:33 - New model from Adept with 16k context using rope scaling</p><p>42:46 - Attention is quadratic - need better algorithms to make long context usable</p><p>49:39 - Open source community pushing state of the art alongside big labs</p><p>52:34 - Closing thoughts</p><p><p>As always, the full (manually edited) transcription (and this time a special video version!) is reserved for the premium subscribers. I promise it’ll be worth it, so why not, y’know, skip a cup of coffee from SB and support ThursdAI? 
</p></p>]]></description><link>https://sub.thursdai.news/p/thursdai-sunday-special-extending</link><guid isPermaLink="false">substack:post:136909612</guid><dc:creator><![CDATA[Alex Volkov, Enrico Shippole, Jeffrey Quesnelle, and WallTapecapval]]></dc:creator><pubDate>Sun, 10 Sep 2023 19:51:43 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/136909612/fae6a600d1b9b64125956bc7b65cd75b.mp3" length="52181496" type="audio/mpeg"/><itunes:author>Alex Volkov, Enrico Shippole, Jeffrey Quesnelle, and WallTapecapval</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3261</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/136909612/d86ff1adec2b66aa3fc6c4bc8793c044.jpg"/></item><item><title><![CDATA[ThursdAI Sep 7 - Falcon 180B 🦅 , 🔥 Mojo lang finally here, YaRN scaling interview, Many OSS models & more AI news]]></title><description><![CDATA[<p>Hey ya’ll, welcome to yet another ThursdAI, this is Alex coming at you every ThursdAI, including a live recording this time! </p><p>Which was incredible: we chatted about Falcon 180B, had a great interview at the end with 3 authors of the YaRN scaling paper and LLongMa 128K context, and had 3 breaking news items in the middle! MOJO🔥 has been released, Adept released a LLaMa-comparable OSS model, and friend of the pod @reach_vb showed an open ASR leaderboard on Hugging Face! We also covered an incredible tiny model called StarCoder 1B that was finetuned by a friend of the pod (who joined the space to talk to us about it!) 
</p><p>As always, you can listen to the whole 3 hour long form conversation (raw, unedited) on our <a target="_blank" href="https://zealous.one/@altryne/thursdAI/convos/cnv_2V2PnHbIY8GYEGMCVCiF7mzajRu">Zealous page</a> (and add it to your podcatcher via <a target="_blank" href="https://zealous.one/@altryne/thursdAI/rss">this RSS</a>) and this short-form pod is available on <a target="_blank" href="https://podcasts.apple.com/us/podcast/the-top-ai-news-from-the-past-week-every-thursdai/id1698613329?itsct=podcast_box&#38;itscg=30200&#38;ls=1">Apple</a>, <a target="_blank" href="https://open.spotify.com/show/2J3lqMPD0BUI0bF9KJYKc1?si=b30f7bafc83b4a3f">Spotify</a> and everywhere. </p><p><p>ThursdAI - Hey, if you enjoy these, how about subscribing for real? Would love to do this full time! Every paid subscriber is like a dear friend 🧡</p></p><p>TL;DR of all topics covered</p><p>* Open Source LLM</p><p>* Falcon 180B announced by TIIUAE (<a target="_blank" href="https://falconllm.tii.ae/">Announcement</a>, <a target="_blank" href="https://huggingface.co/spaces/tiiuae/falcon-180b-demo">Demo</a>)</p><p>* YaRN scaling paper - scaling LlaMa to 128K context (<a target="_blank" href="https://twitter.com/_akhaliq/status/1698497385230389585">link</a>)</p><p>* OpenHermes-13B from @teknium1 (<a target="_blank" href="https://twitter.com/Teknium1/status/1699887247196348676">link</a>)</p><p>* Persimmon-8B from Adept.AI (<a target="_blank" href="https://www.adept.ai/blog/persimmon-8b">link</a>)</p><p>* Starcoder-1B-sft from @abacaj (<a target="_blank" href="https://twitter.com/abacaj/status/1699602420882378932?s=20">link</a>) </p><p>* Big Co LLMs + API updates</p><p>* OpenAI first ever Dev conference (<a target="_blank" href="https://openai.com/blog/announcing-openai-devday">link</a>)</p><p>* Claude announces a $20/mo Claude Pro tier (<a target="_blank" href="https://www.anthropic.com/index/claude-pro">link</a>)</p><p>* Modular releases Mojo🔥 with  68,000x improvement over python (<a 
target="_blank" href="https://www.modular.com/blog/mojo-a-journey-to-68-000x-speedup-over-python-part-3">Link</a>)</p><p>* Vision</p><p>* Real time deepfake with FaceFusion (<a target="_blank" href="https://twitter.com/henryruhs/status/1699362697941254629?s=20">link</a>)</p><p>* HeyGen released AI avatars and AI video translation with lipsync (<a target="_blank" href="https://twitter.com/joshua_xu_/status/1687129787267973123">link</a>, <a target="_blank" href="https://twitter.com/HeyGen_Official/status/1699860837761183850">translation announcement</a>)</p><p>* Voice</p><p>* Open ASR (automatic speech recognition) leaderboard from HuggingFace (<a target="_blank" href="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard">link</a>)</p><p>* Tools</p><p>* LangChain Hub (re) <a target="_blank" href="https://twitter.com/hwchase17/status/1699117061782528353">launched </a></p><p>* Open Interpreter (<a target="_blank" href="https://twitter.com/hellokillian/status/1699156860073640038">Announcement</a>, <a target="_blank" href="https://github.com/KillianLucas/open-interpreter/">Github</a>)</p><p></p><p><strong>Open Source LLM</strong></p><p></p><p><strong>🦅 Falcon 180B - The largest open source LLM to date (</strong><a target="_blank" href="https://falconllm.tii.ae/falcon-models.html"><strong>Announcement</strong></a><strong>, </strong><a target="_blank" href="https://huggingface.co/spaces/tiiuae/falcon-180b-demo"><strong>Demo</strong></a><strong>)</strong></p><p>The folks at the “Technology Innovation Institute” have open sourced the huge Falcon 180B, and have put it up on Hugging Face. Having previously open sourced Falcon 40B, the folks from TIIUAE have given us a huge model that beats (base) LLaMa 2 on several evaluations, if just slightly, by a few percentage points. </p><p>It’s huge, was trained on <strong>3.5 trillion tokens</strong>, and weighs above 100GB as a file and requires 400GB for inference. 
</p><p>Some <a target="_blank" href="https://twitter.com/DrJimFan/status/1699459647592403236?s=20">folks</a> were not<a target="_blank" href="https://twitter.com/TheSeaMouse/status/1699502666869563795?s=20"> as impressed</a> with Falcon performance, given its parameter count is 2.5x that of LLaMa 2 (and it likely took longer to train), but its benchmarks are just a few percentage points higher than LLaMa's. It also boasts an embarrassingly low context window of just 2K tokens, and code was just 5% of its dataset, even though we already know that more code in the dataset makes models smarter! </p><p></p><p>Georgi Gerganov is already running this model on <a target="_blank" href="https://twitter.com/ggerganov/status/1699791226780975439">his M2 Ultra</a> because he’s the GOAT, and co-host of ThursdAI spaces, Nisten, was able to run this model with CPU-only and <a target="_blank" href="https://twitter.com/nisten/status/1699815000947233136?s=20">with just 4GB of ram</a> 🤯  We’re waiting for Nisten to post a GitHub repo on how to run this monstrous model on just a CPU, because it’s incredible! </p><p>However, given the Apache2 license and the fine-tuning community's excitement about improving these open models, it’s an incredible feat, and we’re very happy that this was released! </p><p>The complete open sourcing also matters in terms of geopolitics: this model was developed in the UAE, while in the US, the export of A100 GPUs to the Middle East was banned, and folks are talking about regulating foundational models. A model of this size coming out of the United Arab Emirates, for free, is definitely going to add to the discussion of whether to regulate AI, open source, and the fine-tuning of huge models! 
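Running a 100GB+ model with only 4GB of resident RAM sounds impossible until you remember memory-mapping: the OS pages weights in from disk on demand and evicts them when memory is tight. This is a stdlib-only Python sketch of that idea, not Nisten's actual setup (which uses llama.cpp-style tooling), and the file here is a tiny sparse stand-in:

```python
import mmap
import os
import tempfile

# Stand-in "weights file": sparse, so it costs almost nothing on disk,
# but it behaves like a file far bigger than the RAM we want to spend.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.truncate(64 * 1024 * 1024)  # pretend this is a 100GB+ weights file

with open(path, "rb") as f:
    weights = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Slicing touches only the pages for this "layer"; the OS pages them
    # in on demand and is free to evict them afterwards, so resident
    # memory stays bounded no matter how big the file is.
    offset = 10 * 1024 * 1024
    layer = bytes(weights[offset : offset + 4096])
    weights.close()
```

The key point: nothing is read into RAM until a slice is actually touched, which is how a CPU-only setup can stream a model much larger than physical memory.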
</p><p><strong>YaRN scaling</strong> <strong>LLaMa to 128K context window</strong></p><p>Last week, just in time for ThursdAI, we posted about the <a target="_blank" href="https://twitter.com/EnricoShippole/status/1697317625116742119?s=20">release</a> of <strong>Yarn-Llama-2-13b-128k</strong>, a whopping <strong>32x</strong> improvement in context window size on top of the base LLaMa, from the folks at Nous Research (Enrico Shippole and @theemozilla) with the help of EleutherAI.</p><p>This week, they released the <a target="_blank" href="https://twitter.com/_akhaliq/status/1698497385230389585">YaRN: Efficient Context Window Extension of Large Language Models</a> paper, which uses Rotary Position Embeddings to stretch the context windows of transformer attention-based LLMs significantly. </p><p>We had friends of the pod <a target="_blank" href="https://twitter.com/EnricoShippole">Enrico Shippole</a>, <a target="_blank" href="https://twitter.com/theemozilla">theemozilla</a> (Jeff) and <a target="_blank" href="https://twitter.com/bloc97_">Bowen Peng</a> on the Twitter space, and a special interview with them will be released on Sunday. If you’re interested in context window scaling and stretching work, definitely subscribe for that episode, it was incredible! 
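To make the rope-scaling idea concrete, here is a minimal Python sketch of plain linear position interpolation, the baseline trick that YaRN refines (YaRN itself treats different frequency bands differently, so treat this as illustrative, not the paper's exact method):

```python
def rope_angles(position, dim, base=10000.0, context_scale=1.0):
    # Standard RoPE rotates each channel pair by position * theta_i,
    # with theta_i = base ** (-2*i/dim). Linear position interpolation
    # divides the position by the context-stretch factor, so a model
    # trained on 4K tokens sees positions in its familiar range even
    # when you feed it 128K tokens.
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    scaled_pos = position / context_scale
    return [scaled_pos * f for f in inv_freq]

# 4K -> 128K is a 32x stretch: position 131071 lands back near ~4096.
angles_native = rope_angles(4095, dim=128)
angles_stretched = rope_angles(131071, dim=128, context_scale=32.0)
```

The punchline: after interpolation, the last position of a 128K sequence produces rotation angles in the same range the model saw during its 4K-context training.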
</p><p>It’s great to see that their work is already applied in several places, including CodeLLaMa (which was released with 16K - 100K context). The problem now is compute: context windows can be stretched, and the models are able to generalize from smaller datasets, such that future models may be released with a practically unlimited context window, bounded mainly by your hardware memory requirements.</p><p><strong>Persimmon-8B from AdeptAI (</strong><a target="_blank" href="https://www.adept.ai/blog/persimmon-8b"><strong>announcement</strong></a><strong>, </strong><a target="_blank" href="https://github.com/persimmon-ai-labs/adept-inference"><strong>github</strong></a><strong>)</strong></p><p>AdeptAI, the company behind Act-1, a foundational model for AI agents that drives the browser, which has a few co-founders who are original transformers paper authors, has dropped a ThursdAI surprise, a fresh (read: not a LLaMa clone) model!</p><p>They are releasing a completely open source model called Persimmon-8B, with a full Apache 2 license, a 16K context window (using custom RoPE scaling methods) and some interesting inference speedups with C++. </p><p>A very interesting 8B model that can fit on most consumer hardware, with additional tricks and a huge context window, is definitely welcome! </p><p>An additional interesting point: they have 70K unused embeddings for multimodal extensions! Can’t wait to see what that’s about!</p><p><strong>Starcoder-1B-sft - tiny model that’s great at code</strong></p><p>Anton Bacaj (<a target="_blank" href="http://abacaj">@abacaj</a>) has finetuned StarCoder to achieve some incredible results for such a tiny model! Remember the first item, a whopping 180B parameter Falcon? Well, this is just a 1B parameter model, finetuned on a 65K-sample dataset of code, that’s outperforming Falcon, LLaMa2, Palm-2 (and Persimmon) on coding tasks, and it runs on your device, so fast that it’s hard to read! 
</p><p>It boasts an incredible <strong>39% on the HumanEval</strong> task and <strong>31% on MBPP</strong> (Anton reran and updated the MBPP score later), and it can run locally. Friend of the pod <a target="_blank" href="https://substack.com/profile/137613856-xenova">Xenova</a> has already ported this model to <a target="_blank" href="https://huggingface.co/docs/transformers.js/index">transformers.js</a> and it’ll soon run in your browser! </p><p><strong>OpenHermes-13B from @teknium1 (</strong><a target="_blank" href="https://twitter.com/Teknium1/status/1699887247196348676"><strong>link</strong></a><strong>)</strong></p><p>Our friend Teknium1 (whom we interviewed a few weeks ago) releases OpenHermes on top of LLaMa2, but this time it’s a completely open model and datasets, marking the first time that Hermes models have been open!</p><p>OpenHermes was trained on 242,000 entries of primarily GPT-4 generated data, from open datasets across the AI landscape, including: </p><p>* GPTeacher - General Instruct, </p><p>* Roleplay v1, Roleplay v2, and Code Instruct Datasets, by Teknium </p><p>* WizardLM (v1, evol_instruct 70k), by the WizardLM Team/nlpxucan </p><p>* Airoboros GPT-4 (v1.0), by JonDurbin </p><p>* Camel-AI's domain expert datasets, by the Camel-AI Team </p><p>* CodeAlpaca, by Sahil2801 </p><p>* GPT4-LLM and Unnatural Instructions, by Microsoft</p><p>Check it out folks! </p><p><strong>Big Co LLM + API updates</strong></p><p>Modular finally ships Mojo 🔥 (<a target="_blank" href="https://www.modular.com/blog/mojo-its-finally-here">Announcement</a>)</p><p>I <a target="_blank" href="https://twitter.com/altryne/status/1697366346840039911">just knew it</a>, that Mojo would finally ship during a ThursdAI, and in fact, this was a great #BreakingNews moment on twitter spaces!</p><p>Modular, and its co-founder Chris Lattner (author of LLVM, MLIR, Swift and many other things), have finally released their Mojo 🔥 language, for AI. 
</p><p>Mojo 🔥 is like Python++: it includes strong types and full interoperability with the Python ecosystem, is able to run basic vanilla Python, and has so much more in it, but the main thing Modular is claiming is a whopping 68,000x improvement over vanilla Python! </p><p>You didn’t misread this: a 68,000x improvement, when using all the Modular inference compilers, Mojo virtualization tricks and compilation improvements. It’s incredible. </p><p>The beauty of Mojo is that it meets developers where they are and allows them to adopt new features to achieve high performance gradually. By combining the best of dynamic and static languages, Mojo can deliver performance up to 68,000 times faster than Python today. That's quite a leap! If you want to delve deeper into Mojo's origin story, you can find more information in their documentation. But for now, let me highlight a few key benefits that Mojo offers:</p><p>Firstly, Mojo allows you to write everything in one language, merging the usability of Python with the systems programming features that typically require developers to rely on C, C++, or CUDA. This means that both research and deployment teams can work within a common codebase, streamlining the workflow from research to production.</p><p>Secondly, Mojo unlocks Python's performance potential. While Python is widely used, it may not be the best tool for high-performance or specialized hardware tasks. However, Mojo bridges that gap by enabling high performance on CPUs and providing support for exotic accelerators like GPUs and ASICs. With Mojo, you can achieve performance levels on par with C++ and CUDA.</p><p>Thirdly, and this is a big one, Mojo seamlessly integrates with the entire Python ecosystem. You can leverage the extensive library collection available in Python while making use of Mojo's features and performance benefits. 
This means you can easily combine libraries like NumPy and Matplotlib with your Mojo code, talk about flexibility!</p><p>Finally, Mojo allows you to upgrade your AI workloads effortlessly. By tightly integrating with the Modular AI Engine, Mojo empowers you to extend your AI workloads with custom operations. This includes pre-processing and post-processing operations, as well as high-performance mathematical algorithms. You can even integrate kernel fusion, graph rewrites, shape functions, and more. Mojo is all about expanding the possibilities!</p><p>Mojo’s playground has been around since May and I have a <a target="_blank" href="https://twitter.com/altryne/status/1655471344912596997">deep dive here</a>, but you should really watch Chris Lattner talk for over 3 hours on everything from why they chose to be a Python superset to why he thinks the community will pick it up; it’s an incredible watch and will make you excited about Mojo! </p><p><strong>WebGPU ships with support for FP16 in Chromium </strong></p><p>Chrome shipped WebGPU back in April of ’23, after years of development; it allows high performance 3D graphics (and of course, transformers inference) in the browser and on the web! </p><p>However, for model inference, GPU access is not enough; you also need to be able to run <a target="_blank" href="https://twitter.com/xenovacom/status/1699082090757726223?s=20">smaller models</a>. Well, one way to make models smaller is to run them in fp16 format. Essentially, by cutting the precision of the weight numbers in half, we can use much smaller (read: compressed) models with a slight loss in accuracy. </p><p>Friends of the pod Nisten and Xenova (transformers.js author) have given us an update that new <strong>fp16 support</strong> has shipped in Chromium nightly, allowing much, much smaller models to run client-side! 
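To see why fp16 roughly halves model size, here is a tiny stdlib-only Python sketch (illustrative toy numbers, nothing Chromium- or transformers.js-specific):

```python
import struct

# Eight example "weights"; real models have billions of these.
w = [0.1, -1.5, 3.25, 0.007, 2.0, -0.5, 10.0, 0.333]

fp32_bytes = struct.pack(f"{len(w)}f", *w)  # 4 bytes per weight
fp16_bytes = struct.pack(f"{len(w)}e", *w)  # 2 bytes per weight ('e' = IEEE 754 half)

# Half the storage, at the cost of a little precision on round-trip:
w_half = struct.unpack(f"{len(w)}e", fp16_bytes)
max_err = max(abs(a - b) for a, b in zip(w, w_half))
```

Scaled up to billions of parameters, that 2x shrink is exactly what makes a model small enough to download and run in a browser tab.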
</p><p><strong>OpenAI first dev conference</strong> <strong>(</strong><a target="_blank" href="https://openai.com/blog/announcing-openai-devday"><strong>Announcement</strong></a><strong>)</strong></p><p>OpenAI has announced their first developer focused conference, happening in SF on November 6th! </p><p>In person only (with the keynote being streamed to all), and they also said that they won’t do any model announcements like GPT-5 😂</p><p>But we all expect at least a few API updates! </p><p><strong>Vision</strong></p><p>FaceFusion 1.1.0 - a deepfake faceswapper (<a target="_blank" href="https://twitter.com/henryruhs/status/1699362697941254629">Announcement</a>, <a target="_blank" href="https://t.co/E2VOcogGuO">Github</a>) </p><p>We all know deepfakes are here, I mean, don’t we? But did you know that it’s now super easy to face swap your face into an image or a video? </p><p>FaceFusion does just that: an incredibly fast way to deepfake someone’s face into an image or a video with a few clicks. It works on CPU (I couldn’t make it work on GPU but it’s possible) and shows incredible results! </p><p>Want to see Steve Buscemi dance around as Harry Styles? 3 clicks and 10 minutes and you get this 🔥</p><p>Friend of the pod CocktailPeanut has made it <a target="_blank" href="https://twitter.com/cocktailpeanut/status/1696548061277749356?s=20">incredibly easy to install with just 1 click</a> with his <a target="_blank" href="http://pinokio.computer">pinokio.computer </a>app, which I use and love! </p><p>Facefusion also has a webcam mode that is able to deepfake any image onto a webcam stream for a lot of fun on Zoom calls! (which I wasn’t able to test for some reason) </p><p>HeyGen launches their deep AI face creator</p><p>Many of us used 11Labs to clone voices, but what if you could clone a voice AND an image of a person? With just 2 minutes of their recording? 
</p><p>That’s what HeyGen claims to be able to do, and we’ve previously reported that their incredibly realistic AI avatar generation from videos/images + voice really blew us away. </p><p>HeyGen just launched their service and you can sign up and get a few minutes for free; here’s a sample (with the CEO avatar, they couldn’t make my own due to some launch day errors) </p><p>The video you see above is just that: the CEO of HeyGen, thanking you for reading this week’s ThursdAI! </p><p><strong>Voice</strong></p><p><strong>ASR leaderboard + New top ASR model from Nvidia</strong></p><p>I love doing ThursdAI, and one of the things I love most is folks sending me stuff they worked on, and then coming to ThursdAI to chat about it. </p><p>Friend of the pod Vaibhav (VB) Srivastav, who’s an incredible dev rel at HuggingFace, focusing on audio, has shipped a <a target="_blank" href="https://huggingface.co/spaces/hf-audio/open_asr_leaderboard">new Open-ASR</a> (automatic speech recognition) leaderboard on HuggingFace! </p><p>It shows the top ASR models like Whisper and a newcomer, Nvidia FastConformer, which I didn’t even know existed, and it’s now topping Whisper for English speech-to-text tasks! </p><p>HuggingFace leaderboards like these are definitely a boon for the open source industry, as they allow all of us to easily select open source models, but also allow the open source community to start racing towards the top, while we all benefit! 
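</p><p>Leaderboards like this typically rank models by word error rate (WER): the word-level edit distance between a model’s transcript and a reference, divided by the number of reference words. Here’s a minimal sketch (a simplified illustration, not the leaderboard’s actual evaluation harness):</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance, over words instead of characters
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("the cat sat on the mat", "the cat sat on mat")  # one dropped word out of six
```

<p>A perfect transcript scores 0.0 and lower is better, which is what makes a single shared leaderboard so handy for comparing models like Whisper and FastConformer.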
</p><p><strong>Tools</strong></p><p>Open Interpreter (Announcement, <a target="_blank" href="https://github.com/KillianLucas/open-interpreter/">Github</a>)</p><p>One tool that I’ve used this week, and is incredible, is <a target="_blank" href="https://twitter.com/hellokillian/status/1699156860073640038">OpenInterpreter from @heyitskillian</a> </p><p>It’s incredibly easy to install and run, and behaves like OpenAI Code Interpreter (renamed to Advanced Data Analytics) but on your computer; it is able to do things like control your apps, lower volume, edit images/files and tons more</p><p>pip install open-interpreter</p><p>And that’s it! </p><p>Give it a try (and you have to approve each command that it runs) </p><p>It’s a great agent, and hopefully we’ll get Killian to chat with us about it on next ThursdAI!</p><p><strong>LangChain hub has launched (</strong><a target="_blank" href="https://twitter.com/hwchase17/status/1699117061782528353"><strong>link</strong></a><strong>)</strong></p><p>If you’re into LangChain, and even if you aren’t, the weight LangChain carries in the AI engineering industry is undeniable! They have a connector for everything, tons of folks use them, and they have raised a bunch of funding. </p><p>They have just launched their new LangChain Hub and it’s exciting! Many folks are sharing their best prompts on there, and ways to work with LangChain, with upvotes and sharable links! </p><p>Also, worth noting that our friends <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> and Alessio from <a target="_blank" href="https://open.substack.com/pub/swyx">Latent Space</a> have recently released an episode with Harrison on Latent Space, and it’s WELL worth listening to (and reading), as <a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> did a deep dive into LangChain, its naysayers and everything in between! 
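</p><p>Under the hood, a prompt hub boils down to a registry of named, parameterized prompt templates. A toy sketch of the idea (purely illustrative, not LangChain’s actual API):</p>

```python
# A minimal, hypothetical "prompt hub": named templates with placeholders.
HUB = {
    "summarize/v1": "Summarize the following text in {n} bullet points:\n\n{text}",
}

def pull(name: str) -> str:
    """Fetch a shared prompt template by name (locally, in this toy version)."""
    return HUB[name]

# Fill the template's placeholders before sending it to an LLM
prompt = pull("summarize/v1").format(n=3, text="ThursdAI covers weekly AI news.")
```

<p>The real hub adds the social layer on top: versioning, upvotes and sharable links for each template.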
</p><p>Check it out below : </p><p></p><p>Thank you, see you next time (with some incredible personal news I’ll have to share)</p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p></p><p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-sep-7-falcon-180b-mojo-lang</link><guid isPermaLink="false">substack:post:136796766</guid><dc:creator><![CDATA[Alex Volkov, Enrico Shippole, Nisten, yam, and WallTapecapval]]></dc:creator><pubDate>Thu, 07 Sep 2023 23:04:12 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/136796766/aca2ce656f6b1f29f9c063caca06e86b.mp3" length="21109748" type="audio/mpeg"/><itunes:author>Alex Volkov, Enrico Shippole, Nisten, yam, and WallTapecapval</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1759</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/136796766/3b5c123a0490d68c5b4976ab7d25031b.jpg"/></item><item><title><![CDATA[ThursdAI Aug 24 - Seamless Voice Model, LLaMa Code, GPT3.5 FineTune API & IDEFICS vision model from HF]]></title><description><![CDATA[<p>Hey everyone, this week has been incredible (isn’t every week?), and as I’m writing this, I had to pause and go check out <a target="_blank" href="https://twitter.com/altryne/status/1694723574677037170">breaking news about LLama code</a> which was literally released on ThursdAI as I’m writing the summary! 
I think Meta deserves their own section in this ThursdAI update 👏</p><p>A few reminders before we dive in, we now have a website (<a target="_blank" href="http://thursdai.news">thursdai.news</a>) which will have all the links to <a target="_blank" href="https://podcasts.apple.com/us/podcast/the-top-ai-news-from-the-past-week-every-thursdai/id1698613329?itsct=podcast_box&#38;itscg=30200&#38;ls=1">Apple</a>, <a target="_blank" href="https://open.spotify.com/show/2J3lqMPD0BUI0bF9KJYKc1?si=b30f7bafc83b4a3f">Spotify</a>, full recordings with transcripts, and will soon have a calendar you can join to never miss a live space! This whole thing would not have been possible without <a target="_blank" href="https://twitter.com/Yampeleg/status/1694754466287767906">Yam</a>, <a target="_blank" href="https://twitter.com/nisten/status/1694400381722165589?s=20">Nisten</a>, <a target="_blank" href="https://twitter.com/xenovacom/status/1683471704213864449?s=20">Xenova</a>, <a target="_blank" href="https://twitter.com/reach_vb/status/1694020457798815811?s=20">VB</a>, <a target="_blank" href="https://twitter.com/far__el">Far El</a>, <a target="_blank" href="https://twitter.com/ldjconfirmed">LDJ</a> and <a target="_blank" href="https://twitter.com/i/lists/1691246404273123416?s=20">other expert speakers</a> from different modalities who join and share their expertise from week to week, and there’s a <a target="_blank" href="https://twitter.com/i/lists/1691246404273123416?s=20">convenient way to follow all of them</a> now!</p><p></p><p>TL;DR of all topics covered</p><p>* <strong>Voice</strong></p><p>* <strong>Seamless M4T</strong> Model from Meta (<a target="_blank" href="https://seamless.metademolab.com/demo">demo</a>)</p><p>* <strong>Open Source LLM</strong></p><p>* <a target="_blank" href="https://about.fb.com/news/2023/08/code-llama-ai-for-coding/">LLaMa2 - code</a> from Meta</p><p>* <strong>Vision</strong></p><p>* <a target="_blank" 
href="https://huggingface.co/spaces/HuggingFaceM4/idefics_playground">IDEFICS</a> - A multi-modal text + image model from Hugging Face</p><p>* <strong>AI Art & Diffusion</strong></p><p>* 1 year of Stable Diffusion 🎂</p><p>* <a target="_blank" href="https://ideogram.ai/">IdeoGram</a></p><p>* <strong>Big Co LLMs + API updates</strong></p><p>* <a target="_blank" href="https://twitter.com/OfficialLoganK/status/1694728881666920777?s=20">GPT 3.5 Finetuning API</a></p><p>* <strong>AI Tools & Things</strong></p><p>* Cursor IDE</p><p>Voice</p><p>Seamless M4T - A multilingual, multitasking, multimodal voice model.</p><p>To me, the absolute most mindblowing news of this week was Meta open sourcing (not fully, not commercially licensed) <strong>SeamlessM4T</strong></p><p>This is a multilingual model that takes speech (and/or text) and can generate the following:</p><p>* Text</p><p>* Speech</p><p>* Translated Text</p><p>* Translated Speech</p><p>In a single model! For comparison’s sake, it takes a whole pipeline with Whisper and other translators in <a target="_blank" href="https://targum.video">targum.video</a>, not to mention much bigger models, and I don’t even generate speech!</p><p>This incredible news got me giddy and excited so fast, not only because it simplifies and unifies so much of what I do into 1 model, and makes it faster and opens up additional capabilities, but also because I strongly believe in the vision that <strong>Language Barriers should not exist</strong> and that’s why I built Targum.</p><p>Meta apparently also believes in this vision, and gave us an incredible new power unlock that understands <strong>100 languages</strong> and does so multilingually without effort.</p><p></p><p>Language barriers should not exist</p><p></p><p>Definitely check out the discussion in the podcast, where VB from the open source audio team at Hugging Face goes deeper into the exciting implementation details of this model.</p><p>Open Source LLMs</p><p>🔥 
LLaMa Code</p><p>We were patient and we got it! Thank you Yann!</p><p>Meta releases LLaMa Code, a LLaMa fine-tuned on coding tasks, including “in the middle” completion tasks, which is what Copilot does: not just autocompleting code, but taking into account what surrounds the code it needs to generate.</p><p>Available in 7B, 13B and 34B sizes, the largest model beats GPT3.5 on HumanEval, which is a metric for coding tasks. (you can <a target="_blank" href="https://labs.perplexity.ai/?utm_content=first_codellama&#38;s=u&#38;utm_source=twitter&#38;utm_campaign=labs">try it here</a>)</p><p>In an interesting move, they also separately released Python-finetuned versions, for Python code specifically.</p><p>Another incredible thing: it supports a 100K context window of code, which is a LOT of code. However, it’s unlikely to be very useful in open source because of the compute required</p><p>They also give us instruction fine-tuned versions of these models, and recommend using them, since those are finetuned on being helpful to humans rather than just autocompleting code.</p><p>Boasting impressive numbers, this is of course just the beginning; the open source community of finetuners is salivating! This is what they were waiting for, can they finetune these new models to beat GPT-4? 🤔</p><p>Nous update</p><p>Friends of the Pod LDJ and Teknium1 are releasing the latest 70B model in their Nous series 👏</p><p>* <a target="_blank" href="https://twitter.com/ldjconfirmed/status/1694582252783407239">Nous-Puffin-70B</a></p><p>We’re waiting on metrics but it potentially beats chatGPT on a few tasks! Exciting times!</p><p>Vision & Multi Modality</p><p>IDEFICS - a new 80B model from HuggingFace, was released after a year’s effort, and is quite good. 
We love vision multimodality here on ThursdAI, we’ve been covering it since we saw that GPT-4 demo!</p><p>IDEFICS is an effort by Hugging Face to create a foundational model for multimodality, and it is currently the only visual language model of this scale (80 billion parameters) that is available in open access.</p><p>It’s made by fusing the vision transformer CLIP-VIT-H-14 and LLaMa 1; I bet LLaMa 2 is coming soon as well!</p><p>And the best thing: it’s openly available and you can use it in your code with the Hugging Face transformers library!</p><p>It’s not perfect of course, and can hallucinate quite a bit, but it’s quite remarkable that we get these models weekly now, and this is just the start!</p><p>AI Art & Diffusion</p><p>Stable Diffusion is 1 year old</p><p>Has it been a year? wow, for me, personally, Stable Diffusion is what started this whole AI fever dream. SD was the first model I actually ran on my own GPU, the first model I learned how to run and use without relying on APIs. It made me way more comfortable with juggling models, learning what weights were, and well, here we are :) I now host a podcast and have a newsletter and I’m part of a community of folks who do the same, train models, discuss AI engineer topics and teach others!</p><p>Huge thank you to Emad, the Stability AI team, my friends there, and everyone else who worked hard on this.</p><p>Hard to imagine how crazy of a pace we’ve been on since the first SD1.4 release, and how incredibly realistic the images are now compared to what we got then and got excited about!</p><p>🎂</p><p>IdeoGram joins the AI art race</p><p>IdeoGram - a new text-to-image model from ex-Googlers (<a target="_blank" href="https://twitter.com/ideogram_ai/status/1694461000622219435">announcement</a>) is the new kid on the block; not open source (unless I missed it), it boasts significant text capabilities, and really great quality of imagery. 
It also has a remix ability, and is available on the web, unlike… MidJourney!</p><p>Big Co LLMs + API updates</p><p><a target="_blank" href="https://twitter.com/OfficialLoganK/status/1694728881666920777?s=20">Open AI pairs with ScaleAI</a> to let enterprises finetune and run finetuned GPT3.5 models!</p><p>This is an interesting time for OpenAI to dive into fine-tuning, as open source models inch closer and closer to GPT3.5 on several metrics with each week.</p><p>Reminder: if you finetune a GPT3.5 model, you need to provide your own data to OpenAI, but then you also have to pay them for essentially hosting a model just for you, which means it’s not going to be cheap.</p><p>Use as much prompting as humanly possible before you consider doing the above fine-tuning, and you may be able to solve your task much better and cheaper.</p><p>Agents</p><p>The most interesting thing to me in the world of agents actually came from an IDE!</p><p>I installed Cursor, the new AI infused VSCode clone, imported my VSCode settings, and off we went! It can use your own GPT-4 keys if you don’t want to send them your code or pay, it embeds your whole repo for easy import and code understanding, and does so much more, like adding a button to every error in the console to “debug”, and it has a “new AI project” feature, which builds you a template just by typing a few words!</p><p>Our friends Alessio and Swyx have interviewed the founder of Cursor on their podcast, a strong recommendation to check that episode out!</p><p>After using Cursor for just a few days, I don’t want to go back to VSCode and am even considering … maybe pausing my Copilot subscription 🤯</p><p>That’s all for today folks! I wish you all a great week, and we’ll see you in the next ThursdAI 🫡</p><p></p><p>Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public, so feel free to share it with a friend? Let’s get to 1K readers 🔥</p><p></p><p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-24-seamless-voice-model</link><guid isPermaLink="false">substack:post:136389837</guid><dc:creator><![CDATA[Alex Volkov, Xenova, Nisten, and yam]]></dc:creator><pubDate>Fri, 25 Aug 2023 02:45:46 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/136389837/1073c4598c6a2da256321eab3db82054.mp3" length="49001636" type="audio/mpeg"/><itunes:author>Alex Volkov, Xenova, Nisten, and yam</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>4083</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/136389837/9514aa6715a7d0d511740b3906539115.jpg"/></item><item><title><![CDATA[🎙️ThursdAI - LLM Finetuning deep dive, current top OSS LLMs (Platypus 70B, OrctyPus 13B) authors & what to look forward to]]></title><description><![CDATA[This is a free preview of a paid episode. 
To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>Brief outline for your convenience:</p><p>[00:00] Introduction by Alex Volkov</p><p>[06:00] Discussing the Platypus models and data curation process by Ariel, Cole and Nathaniel</p><p>[15:00] Merging Platypus with OpenOrca model by Alignment Labs</p><p>* Combining strengths of Platypus and OpenOrca</p><p>* Achieving state-of-the-art 13B model</p><p>[40:00] Mixture of Experts (MOE) models explanation by Prateek and Far El</p><p>[47:00] Ablation studies on different fine-tuning methods by Teknium</p><p>Full transcript is available for our paid subscribers 👇 Why don’t you become one?</p><p>Here’s a list of folks and models that appear in this episode, please follow all of them on X:</p><p>* ThursdAI cohosts - <a target="_blank" href="https://twitter.com/altryne/">Alex Volkov</a>, <a target="_blank" href="https://twitter.com/Yampeleg/">Yam Peleg</a>, <a target="_blank" href="https://twitter.com/nisten">Nisten Tajiraj</a></p><p>* Garage Baind - <a target="_blank" href="https://twitter.com/ArielNLee">Ariel</a>, <a target="_blank" href="https://twitter.com/ColeJHunter">Cole</a> and <a target="_blank" href="https://twitter.com/natanielruizg">Nataniel</a> (<a target="_blank" href="https://platypus-llm.github.io/">platypus-llm.github.io</a>)</p><p>* Alignment Lab - <a target="_blank" href="https://twitter.com/alignment_lab/">Austin</a>, <a target="_blank" href="https://twitter.com/Teknium1/">Teknium</a> (<a target="_blank" href="https://discord.gg/mhFWVbXUDh">Discord server</a>)</p><p>* SkunkWorks OS - <a target="_blank" href="https://twitter.com/far__el">Far El</a>, <a target="_blank" href="https://twitter.com/prateeky2806">Prateek Yadav</a>, <a target="_blank" href="https://twitter.com/propback_">Alpay Ariak</a> (<a target="_blank" href="https://discord.gg/69qT6MP6yr">Discord server</a>)</p><p>* <a target="_blank" 
href="https://huggingface.co/garage-bAInd/Platypus2-70B-instruct">Platypus2-70B-instruct</a></p><p>* <a target="_blank" href="https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B">Open Orca Platypus 13B</a></p><p>I am recording this on August 18th, which marks the one month birthday of the LLaMa 2 release from Meta. It was the first commercially licensed large language model of its size and quality, and we want to thank the great folks at MetaAI: Yann LeCun, BigZuck and the whole FAIR team. Thank you guys. It's been an incredible month since it was released.</p><p>We saw a Cambrian explosion of open source communities who make this world better, even since LLaMa 1. For example, llama.cpp by Georgi Gerganov is such an incredible example of how the open source community comes together; this one guy, over a weekend, took the open source weights and made them run on CPUs, and much, much faster.</p><p>Mark Zuckerberg even <a target="_blank" href="https://twitter.com/altryne/status/1666949067975561218">talked about this</a>, how amazingly the open source community has adopted LLaMa, and that Meta is also now adopting many of those techniques and developments back to run their own models cheaper and faster. 
And so it's been exactly one month since LLaMa 2 was released.</p><p>And <a target="_blank" href="https://thursdai.substack.com/p/thursdai-aug-3-openai-qwen-7b-beats#details">literally every</a> <a target="_blank" href="https://thursdai.substack.com/p/thursdai-aug-17-ai-vision-platypus#details">ThursdAI</a> <a target="_blank" href="https://thursdai.substack.com/p/thursdai-aug-10-deepfakes-get-real#details">since then</a>, we have covered a new state of the art open source model, all based on LLaMa 2, that topped the open source model charts on Hugging Face.</p><p>Many of these top models were fine tuned by Discord organizations of super smart folks who just like to work together in the open and open source their work.</p><p>Many of whom are great friends of the pod.</p><p>Nous Research, with whom we've had a special episode a <a target="_blank" href="https://thursdai.substack.com/p/thursdai-special-episode-interview#details">couple of weeks back</a>; Teknium1 seems to be part of every org, with Alignment Labs and GarageBaind being the latest folks topping the charts.</p><p>I'm very excited to not only bring you an interview with Alignment Labs and GarageBaind, but also to give you a hint of two additional very exciting efforts that are happening in some of these Discords.</p><p>I also want to highlight how many of those folks do not have data scientist backgrounds. Some of them do, so we had a few PhDs or PhD-studies folks, but some of them studied all this at home with the help of GPT-4. And some of them even connected via the ThursdAI community and space, which I'm personally very happy about.</p><p>So this special episode has two parts. In the first part we're going to talk with Ariel, Cole and Nataniel, currently known as GarageBaind. Get it? bAInd, GarageBaind, because they're doing AI in their garage. 
I love it.</p><p>🔥 They are now holding the record for the best performing open source model, called <a target="_blank" href="https://huggingface.co/garage-bAInd/Platypus2-70B-instruct">Platypus2-70B-Instruct</a>.</p><p>And then, joining them is Austin from Alignment Labs, the authors of OpenOrca, also a top performing model, and we will talk about how they've merged and joined forces and trained the best performing 13B model called <a target="_blank" href="https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B">Open Orca Platypus 13B</a>, or Orctypus 13B</p><p>This 13B parameter model comes very close to the base LLaMa 70B. So, I will say this again: just 1 month after LLaMa 2 was released by the great folks at Meta, we now have a <strong>13 billion</strong> parameter model, which is way smaller and cheaper to run, that comes very close to the performance benchmarks of a way bigger, very expensive to train and run 70B model.</p><p>And I find it incredible. And we've only just started, it's been a month. And so in the second part you will hear about two additional efforts, one run by Far El, Prateek and Alpay from the SkunksWorks OS Discord, which is an effort to bring everyone an open source mixture of experts model, and you'll hear about what mixture of experts is.</p><p>And another effort, run by friend of the pod Teknium, previously a chart topper himself with the Nous Hermes models and many others, to figure out which of the fine tuning methods are the most efficient, fast and cheap to run. You will hear several mentions of LoRAs, which stands for Low Rank Adaptation, basically methods of keeping the huge weights of LLaMa and other models frozen while retraining and fine-tuning some specific parts of them with new data, a method we know from the diffusion world.</p><p>And it's now being applied to the LLM world, showing great promise in how fast, easy, and cheap it is to fine tune these huge models with significantly less hardware costs and time. 
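</p><p>To get an intuition for why LoRA is so cheap, here’s a quick back-of-the-envelope sketch (illustrative numbers, not the exact Platypus recipe): instead of updating a full d×d weight matrix, you train two skinny low-rank matrices next to it.</p>

```python
# Back-of-the-envelope: trainable parameters for one d x d weight matrix.
d = 8192   # hidden size of a large transformer layer (illustrative)
r = 16     # LoRA rank (illustrative; real runs often use 8-64)

full_finetune_params = d * d     # updating the whole matrix W
lora_params = d * r + r * d      # training only A (d x r) and B (r x d), W stays frozen

print(full_finetune_params)      # 67108864
print(lora_params)               # 262144, well under half a percent of the above
ratio = lora_params / full_finetune_params
```

<p>At inference time the learned low-rank update is simply added back onto the frozen weights, which is also why LoRAs are so easy to share and swap.</p><p>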
Specifically, Nataniel Ruiz, the guy who helped Ariel and Cole train Platypus, and the co-author of DreamBooth, StyleDrop and many other diffusion methods, mentioned that it takes around <strong>five hours</strong> on a <strong>single A100 GPU</strong> to fine tune the 13B parameter model. That, if you can find an A100 GPU, is around $10.</p><p>That's incredible.</p><p>I hope you enjoy listening and learning from these great folks, and please don’t forget to check out our website at <a target="_blank" href="https://thursdai.news">thursdai.news</a> for all the links, socials and podcast feeds.</p><p>Brief outline for your convenience:</p><p>[00:00] Introduction by Alex Volkov</p><p>[06:00] Discussing the Platypus models and data curation process by Ariel, Cole and Nathaniel</p><p>[15:00] Merging Platypus with OpenOrca model by Alignment Labs</p><p>* Combining strengths of Platypus and OpenOrca</p><p>* Achieving state-of-the-art 13B model</p><p>[40:00] Mixture of Experts (MOE) models explanation by Prateek and Far El</p><p>[47:00] Ablation studies on different fine-tuning methods by Teknium</p><p>Full transcript is available for our paid subscribers 👇 Why don’t you become one?</p>]]></description><link>https://sub.thursdai.news/p/thursdai-llm-finetuning-deep-dive</link><guid isPermaLink="false">substack:post:136207596</guid><dc:creator><![CDATA[Alex Volkov, Ariel N. Lee, Cole Hunter, Teknium, Autometa, Far El, Prateek Yadav, yam, Nisten, and Nataniel Ruiz]]></dc:creator><pubDate>Sun, 20 Aug 2023 20:05:13 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/136207596/e03e492c5e64277287f3e6918b0fc692.mp3" length="37757911" type="audio/mpeg"/><itunes:author>Alex Volkov, Ariel N. 
Lee, Cole Hunter, Teknium, Autometa, Far El, Prateek Yadav, yam, Nisten, and Nataniel Ruiz</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3146</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/136207596/5bfadc66e1a6059012d736e243f1eafb.jpg"/></item><item><title><![CDATA[ThursdAI Aug 17 - AI Vision, Platypus tops the charts, AI Towns, Self Alignment 📰 and a special interview with Platypus authors!]]></title><description><![CDATA[<p>Hey everyone, this is <a target="_blank" href="https://x.com/intent/follow?screen_name=altryne">Alex Volkov</a>, the host of ThursdAI, welcome to yet another recap of yet another incredibly fast-paced week.</p><p>I want to start with a ThursdAI update: we now have a new website <a target="_blank" href="http://thursdai.news">thursdai.news</a> and a new dedicated twitter account <a target="_blank" href="https://x.com/intent/follow?screen_name=thursdai_pod">@thursdai_pod</a> as we build up the ThursdAI community and brand a bit more.</p><p>As always, a reminder that ThursdAI is a weekly X space, <a target="_blank" href="https://thursdai.substack.com">newsletter</a> and 2! podcasts, short form (<a target="_blank" href="https://podcasts.apple.com/us/podcast/the-top-ai-news-from-the-past-week-every-thursdai/id1698613329?itsct=podcast_box&#38;itscg=30200&#38;ls=1">Apple</a>, <a target="_blank" href="https://open.spotify.com/show/2J3lqMPD0BUI0bF9KJYKc1?si=b30f7bafc83b4a3f">Spotify</a>) and the unedited long-form spaces recordings (<a target="_blank" href="https://zealous.one/@altryne/thursdAI/rss">RSS</a>, <a target="_blank" href="https://zealous.one/@altryne/thursdAI">Zealous page</a>) for those who’d like the nitty gritty details (and are on a long drive somewhere).</p><p>Open Source LLMs & Finetuning</p><p>Honestly, the speed with which LLaMa 2 finetunes are taking over state of the art performance is staggering. 
We literally talk about a new model every week that’s topping the <a target="_blank" href="https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard">LLM Benchmark leaderboard</a>, and it hasn’t even been a month since LLaMa 2's release day 🤯 (<a target="_blank" href="https://about.fb.com/news/2023/07/llama-2/">July 18</a> for those who are counting)</p><p>Enter Platypus 70B (<a target="_blank" href="https://twitter.com/natanielruizg/status/1690048207030493189?s=20">🔗</a>)</p><p>Platypus 70B-instruct is currently the highest ranked open source LLM, alongside other Platypus versions</p><p>We’ve had the great pleasure to chat with new friends of the pod <a target="_blank" href="https://twitter.com/ArielNLee">Ariel Lee</a> and <a target="_blank" href="https://twitter.com/ColeJHunter">Cole Hunter</a> (and long time friend of the pod <a target="_blank" href="https://twitter.com/natanielruizg">Nataniel Ruiz</a>, co-author of DreamBooth and StyleDrop, which we’ve covered before) about this incredible effort to finetune LLaMa 2, the open dataset they curated and released as part of this effort, and how quickly and easily it is possible to train a (smaller) 13B version of Platypus (just 5 hours on a single A100 GPU ~= $6 on Lambda 🤯)</p><p></p><p>We had a great interview with Garage BAIND, the authors of Platypus, and we’ll be posting that on a special Sunday episode of ThursdAI, so make sure you are subscribed to receive it when it drops.</p><p></p><p>Open Orca + Platypus = OrctyPus 13B? 
(<a target="_blank" href="https://twitter.com/alignment_lab/status/1691477139001114625">🔗</a>)</p><p>We’ve told you about OpenOrca just last week, from our friends at <a target="_blank" href="https://twitter.com/alignment_lab">@alignment_lab</a>, and not only is Platypus the best performing 70B model, the open source community also comes through with an incredible merge, collaborating to bring you the best 13B model, which is a merge between OpenOrca and Platypus.</p><p>This 13B model is now <strong>very close</strong> to the original LLaMa 70B in many of the metrics, LESS THAN A MONTH after the initial open source release. It’s quite a remarkable achievement and we salute the whole community for this immense effort 👏 Also, accelerate! 🔥</p><p>Join the SkunksWorks</p><p>Speaking of fast moving things, in addition to the above interview, we had a great conversation with folks from the so-called <a target="_blank" href="https://twitter.com/far__el/status/1690043630331867136?s=20">SkunksWorks OS discord</a>, namely <a target="_blank" href="https://twitter.com/far__el/">Far El</a>, <a target="_blank" href="https://twitter.com/prateeky2806">Prateek Yadav</a>, <a target="_blank" href="https://twitter.com/propback_">Alpay Ariak</a>, <a target="_blank" href="https://twitter.com/Teknium1/">Teknium</a> and Alignment Labs, and our recurring guest hosts <a target="_blank" href="https://twitter.com/Yampeleg/">Yam Peleg</a> and <a target="_blank" href="https://twitter.com/nisten">Nisten</a> covered two very exciting community efforts, all happening within the SkunksWorks Discord.</p><p>The first effort is called MoE, Open Mixture of Experts, which is an open source attempt at replicating the Mixture of Experts architecture, widely credited as the reason GPT-4 is so much better than GPT-3.</p><p>The second effort is called Ablation studies, which is an effort Teknium is leading to understand once and for all the best, cheapest and highest quality way to finetune open source models, 
whether it's QLoRA, LoRAs, or a full finetune.</p><p>If you're interested in any of these, either by helping directly or providing resources such as GPU compute, please join the SkunksWorks Discord. They will show you how to participate, even if you don't have prior finetuning knowledge! And we’ll keep you apprised of the results once they release any updates!</p><p>Big Co LLMs + API updates</p><p>In our Big CO corner, we start with an incredible paper from MetaAI, announcing:</p><p>Self-Alignment w/ Backtranslation method + Humpback LLM - MetaAI</p><p>Summarized briefly (definitely listen to the full episode and <a target="_blank" href="https://twitter.com/Yampeleg/status/1691198756668968960">@yampeleg</a>’s detailed overview of this method), it’s a way for an LLM to be trained in an unsupervised way on high quality datasets it creates for itself, using only a little initial “seed” data from a high quality dataset. Think of it this way: fine-tuning a model requires a lot of “question → response” data in your dataset, and back-translation proposes “response → question” dataset generation, coming up with novel ways of asking “what would a potential instruction be that would make an LLM generate this result?”</p><p>This results in a model that effectively learns to learn better and create its own datasets without humans (well, at least human labelers) in the loop.</p><p>Here are <a target="_blank" href="https://twitter.com/hrishioa/status/1691059349509267456">some more</a> <a target="_blank" href="https://twitter.com/thursdai_pod/status/1691959792485671300?s=20">reading materials</a> on X for reference.</p><p>OpenAI new JS SDK (<a target="_blank" href="https://twitter.com/OfficialLoganK/status/1691875240647758123">X link</a>)</p><p>OpenAI has partnered with StainlessAPI to release a major new version 4 of their TS/JS SDK, with the following incredible DX improvements for AI engineers:</p><p>* Streaming responses for chat & completions</p><p>* Carefully crafted TypeScript 
types</p><p>* Support for ESM, Vercel edge functions, Cloudflare workers, & Deno</p><p>* Better file upload API for Whisper, fine-tune files, & DALL·E images</p><p>* Improved error handling through automatic retries & error classes</p><p>* Increased performance via TCP connection reuse</p><p>* Simpler initialization logic</p><p>The most exciting part for me: it’s now <a target="_blank" href="https://twitter.com/rickyrobinett/status/1691942920012873785?s=20">very easy to get started</a> with AI projects and get streaming on the incredible Cloudflare Workers platform (Targum is part of the first Cloudflare Workers launchpad but is not affiliated, we’re just superfans 🫶)</p><p>Vision & Multi Modality</p><p>There’s been some really cool stuff happening in computer vision and multi-modal AI recently. First up, a new method called 3D Gaussian Splatting shows an incredibly clear and smooth way to generate 3D scenes from just a few images.</p><p>Compared to neural radiance fields (NeRFs), Gaussian splatting produces much smoother results without the grainy voxel artifacts NeRFs often have. However, it achieves this improved quality without sacrificing the speed and performance of NeRFs. 
So Gaussian splatting gives a big boost in realism compared to NeRF renderings, while maintaining real-time speeds and cleaning up those “clouds”.</p><p>Supervision from Roboflow (and Piotr)</p><p>Btw, our own friend of the pod and AI vision expert <a target="_blank" href="https://linktr.ee/skalskip">@skalskiP</a> (who reviewed Gaussian Splatting for us) is also having a crazy ThursdAI week, with their open source computer vision toolkit <a target="_blank" href="https://github.com/roboflow/supervision">SuperVision</a> trending #2 on <a target="_blank" href="https://github.com/roboflow/supervision">Github</a> 👏</p><p>Apple steps up its Vision (not the headset) Transformer game</p><p>Apple has <a target="_blank" href="https://twitter.com/anuragranj/status/1691555005000802455?s=20">open sourced</a> ml-fastvit, their general-purpose Vision Transformer model, which they claim runs at <strong>~1ms on mobile devices</strong>, with code and pre-trained weights available on <a target="_blank" href="https://github.com/apple/ml-fastvit">Github</a> 🔥</p><p>This is great to see from Apple ML teams, not only the open sourcing itself, but also that they are preparing all of us for the world of spatial computers (Vision Pro is coming, remember?) 
and many new Computer Vision heavy apps will be available at those incredible speeds.</p><p>This is also great for on-device inference, running these models in Node / on the edge (as Friend of the pod <a target="_blank" href="https://twitter.com/visheratin/status/1691696255334817814">@visheratin</a> demonstrated with WebAI)</p><p>Additional updates include Nvidia releasing a web playground for NeVa, their MLLM (Multimodal LLM, get used to seeing this term everywhere), which you can play with <a target="_blank" href="https://twitter.com/NVIDIAAIDev/status/1691162782090108928">here</a>, and <a target="_blank" href="https://twitter.com/_akhaliq/status/1691730510408822885">Link-Context learning</a> for MLLMs</p><p>Agents</p><p>OpenAI also announced that <a target="_blank" href="https://twitter.com/DrJimFan/status/1691915578620129380">Global Illumination is joining OpenAI</a>; that team’s CEO created the Instagram Stories algorithm and contributed to the feed, and the team is behind a massive open-world Minecraft clone. Will we see OpenAI release agents into that world? 
We <a target="_blank" href="https://twitter.com/altryne/status/1672688222898618368">know</a> that they are working on agents</p><p>A16Z - AI Town (<a target="_blank" href="https://twitter.com/stuffyokodraws/status/1691179412069445632">🔗</a>)</p><p>Speaking of agents roaming free and interacting, we covered the open sourcing of SmallVille just last week ↴ and now we see a new open source framework from Andreessen Horowitz’s AI division, called AI Town, for letting agents roam and interact with each other.</p><p>AI Town (<a target="_blank" href="https://github.com/a16z-infra/ai-town">Github</a>) is a web framework written in TypeScript, built to run and be customized with different LLMs (even open source ones) in mind, and you can see the AI agents running around in a <a target="_blank" href="https://t.co/D0sYJWnQOq">live demo here</a></p><p></p><p>This ThursdAI was so packed with great information that it’s really worth listening to the whole recording; you can do this on our Zealous page, RSS and on twitter (all those links can always be found on <a target="_blank" href="http://thursdai.news">thursdai.news</a> )</p><p>If you found this valuable, why not join our community and let your friends know? This is a great way to support us, as well as participate in the discussion on social; tag #thursdAI on anything you feel is worthwhile for us to summarize.</p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-17-ai-vision-platypus</link><guid isPermaLink="false">substack:post:136171916</guid><dc:creator><![CDATA[Alex Volkov, Ariel N. 
Lee, Cole Hunter, Autometa, Far El, and yam]]></dc:creator><pubDate>Thu, 17 Aug 2023 22:47:14 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/136171916/e469ccf1498242a6545857e7f70999c4.mp3" length="12172421" type="audio/mpeg"/><itunes:author>Alex Volkov, Ariel N. Lee, Cole Hunter, Autometa, Far El, and yam</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1014</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/136171916/83f66be1a549bebbde5b20dc10a1b9c3.jpg"/></item><item><title><![CDATA[ThursdAI Aug 10 - Deepfakes get real, OSS Embeddings heating up, Wizard 70B tops the charts and more! ]]></title><description><![CDATA[<p>Hey everyone, welcome to yet another ThursdAI update! As always, I’m your host, Alex Volkov, and every week, ThursdAI is a twitter space with a panel of experts, guests and AI enthusiasts who join to get up to date with the incredibly fast pace of AI updates, learn together and listen to subject matter experts on several of the topics. </p><p>Pssst, this podcast is now available on <a target="_blank" href="https://podcasts.apple.com/us/podcast/the-top-ai-news-from-the-past-week-every-thursdai/id1698613329?itsct=podcast_box&#38;itscg=30200&#38;ls=1">Apple</a>, <a target="_blank" href="https://open.spotify.com/show/2J3lqMPD0BUI0bF9KJYKc1?si=2fcf026867014945">Spotify</a> and everywhere using <a target="_blank" href="https://api.substack.com/feed/podcast/1801228.rss">RSS</a>, and a new, long form, raw and uncut, <a target="_blank" href="https://zealous.one/@altryne/thursdAI">full spaces recording</a> podcast is coming soon! </p><p><p>ThursdAI is supported by readers, and I promised my wife I’d ask: if you find this valuable, why not upgrade your subscription so I can keep this going? Get better equipment and produce higher quality shows? 
</p></p><p>I started noticing that our update spaces are split into several themes, and figured I’d start separating the updates into these themes as well; do let me know in the comments if you have feedback, a preference, or specific things to focus on. </p><p>LLMs (Open Source & Proprietary)</p><p>This section will include updates pertaining to Large Language Models, proprietary (GPT4 & Claude) and open source ones, APIs and prompting. </p><p>Claude Instant 1.2 in Anthropic API (<a target="_blank" href="https://twitter.com/AnthropicAI/status/1689303697535414272">source</a>)</p><p>Anthropic has released a new version of Claude Instant, a very fast version of Claude with a 100K context window; it’s a very capable model that’s now better at code tasks and, most of all, very very fast! </p><p>Anthropic is also getting better at giving access to these models, so if you’ve waited in their waitlist for a while, and still don’t have access, DM me (@altryne) and I’ll try to get you API access as a member of the ThursdAI community. </p><p>WizardLM-70B V1.0 tops OSS charts (<a target="_blank" href="https://twitter.com/WizardLM_AI/status/1689270108747976704">source</a>)</p><p>WizardLM 70B from WizardLM is now the top dog in open source AI, featuring the same license as LLaMa and much, much better code performance than base LLaMa 2; it’s now the top-performing code model that also does other LLMy things. 
</p><p>Per <a target="_blank" href="https://thursdai.substack.com/p/thursdai-special-episode-interview#details">friend of the pod</a> and finetuner extraordinaire <a target="_blank" href="https://twitter.com/Teknium1/status/1689278307831881728?s=20">Teknium</a>, this is the best HumanEval (coding benchmark) we’ve seen in a LLaMa-based open source model 🔥</p><p>Also from Teknium btw, a recent evaluation of the <a target="_blank" href="https://github.com/QwenLM/Qwen-7B"><strong>Alibaba Qwen 7B</strong></a> model we talked about <a target="_blank" href="https://thursdai.substack.com/p/thursdai-aug-3-openai-qwen-7b-beats?sd=pf">last ThursdAI</a> actually showed that LLaMa 7B is a bit better; however, Qwen should also be evaluated on tool selection and agent use, and we’re waiting for those metrics to surface and will update! </p><p>Embeddings Embeddings Embeddings</p><p>It seems that in open source embeddings, we’re now getting state of the art models (read: requiring no internet access) every week!</p><p>In just the last <a target="_blank" href="https://twitter.com/osanseviero/status/1689175705060290560?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1689175705060290560%7Ctwgr%5E3ce18616b1da5d3dc0ac1b709fd5d7add2f69c6d%7Ctwcon%5Es1_&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1LyxBqglWdyJN%2Fthursdai-open-source-llms-voice-video-cloning-and-claude-12">few months</a>: Microsoft open-sourced E5, Alibaba open-sourced General Text Embeddings, BAAI open-sourced FlagEmbedding, and Jina open-sourced Jina Embeddings.</p><p>And now we have a new benchmark, MTEB, and a <a target="_blank" href="https://huggingface.co/spaces/mteb/leaderboard"><strong>new leaderboard</strong></a><strong> </strong>from Hugging Face (who else?) so we always know which model is currently leading the pack. With a new winner from this week! 
BGE (<a target="_blank" href="https://huggingface.co/BAAI/bge-large-en">large</a>, base and <a target="_blank" href="https://huggingface.co/BAAI/bge-small-en">small</a> (just 140MB) ) </p><p>Embedding models are very important for many AI applications, RAG (retrieval augmented generation) products, semantic search and vector DBs, and the faster, smaller and more offline-capable they are, the better the whole field of AI tools will get, including much more capable offline agents. 🔥 </p><p>Worth noting that text-embedding-ada-002, the OpenAI embedding API model, is now ranked 13th on the above MTEB leaderboard! </p><p>Open Code Interpreter 👏</p><p>While we’re on the agents topic, we had the privilege to chat with a new friend of the pod, <a target="_blank" href="https://twitter.com/Shroominic/status/1679914728368242688?s=20">Shroominic</a>, who told us about his open source project, <a target="_blank" href="https://github.com/shroominic/codeinterpreter-api">codeinterpreter-api</a>, which is an open source implementation of Code Interpreter. We had a great conversation about this effort, the community push, the ability of this open version to install new packages, access the web, run offline and use multiple open source LLMs to run it, and we expect to hear more as this project develops! </p><p>If you’re not familiar with OpenAI Code Interpreter, we talked about it at length when it first came out, and it’s probably the best “AI Agent” that many folks have access to right now. </p><p></p><p>Deepfakes are upon us! </p><p>I want to show you this video, and you tell me: if you saw this outside an AI newsletter, would you have been able to tell it’s AI generated? 
</p><p>This video was generated automatically by HeyGen when I applied to the waitlist, and then I registered again and tried to get AI <a target="_blank" href="https://twitter.com/joshua_xu_">Joshua</a> to generate an ultra-realistic ThursdAI <a target="_blank" href="https://am8evw00qys.typeform.com/to/wauwjUYP">promo vid</a> haha. </p><p>I’ve played with many tools for AI video generation and never saw anything come close to this quality, and I can’t wait for this to launch! </p><p>While this is a significant update for many folks in terms of how well deepfakes can look (and it is! Just look at it: reflections, HQ, lip movement is perfect, just incredible), this isn’t the only progress data point in this space. </p><p>Play.ht <a target="_blank" href="https://news.play.ht/post/introducing-playht2-0-the-state-of-the-art-generative-voice-ai-model-for-conversational-speech">announced version 2.0</a>, which sounds incredibly natural, increased <a target="_blank" href="https://twitter.com/_akhaliq/status/1689689133961080845">model size 10x</a> and the dataset to more than 1 million hours of speech across multiple languages, accents, speaking styles and emotions, and claims sub-1s latency and the ability to fake your voice with a sample of only… 3 seconds! 🤯</p><p>So have you and your loved ones chosen a code word to authenticate over the phone? Or switched to a verifiable communication style? While those of us with multiple accents don’t yet have to worry, everyone should stop believing any video or voice sample from now on; it’s just inevitable that all of that will be deepfaked, and we should start coming up with ways to authenticate content. </p><p><p>If you made it this far, and any of the above was new/important to you, why not support this pod/newsletter/community? 
If you’d like to sponsor us more directly, please ping me at altryne [at] gmail.com; I’m also open to consulting, and if you’re a great company, Developer Relations positions :) </p></p><p>Finally, we’ve talked for a whopping 2 hours on the spaces, and that whole conversation can be heard on our <a target="_blank" href="https://zealous.one/@altryne/thursdAI">Zealous page</a>, which has transcripts, AudioGrams of key moments, and space summarizations! </p><p></p><p>And the long-form space recordings can be added to your podcatcher separately if you’d prefer the “ThursdAI raw feed”, by using <a target="_blank" href="https://zealous.one/@altryne/thursdAI/rss">this RSS link</a>, and will come as its own podcast very soon! Thanks to our friends <a target="_blank" href="https://zealous.one/?via=alex-volkov">at Zealous</a></p><p></p><p>Thank you, </p><p>Alex Volkov.</p><p>Host <a target="_blank" href="https://open.substack.com/pub/thursdai">ThursdAI - Recaps of the most high signal AI weekly spaces</a>  </p><p>CEO @ Targum.video</p><p>AI Consultant with free slots (<a target="_blank" href="mailto:altryne@gmail.com">Let’s Talk</a>) </p><p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-10-deepfakes-get-real</link><guid isPermaLink="false">substack:post:135910102</guid><dc:creator><![CDATA[Alex Volkov, Nisten, and Xenova]]></dc:creator><pubDate>Thu, 10 Aug 2023 22:31:42 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135910102/034c6764ed905beba5e75a312333b855.mp3" length="15064964" type="audio/mpeg"/><itunes:author>Alex Volkov, Nisten, and Xenova</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>942</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135910102/c2d56c86f6cf7ddc9133976fc73c85d6.jpg"/></item><item><title><![CDATA[ThursdAI Aug 3 - OpenAI, Qwen 7B beats LLaMa, Orca is replicated, and more AI news]]></title><description><![CDATA[<p>Hi, today’s episode is published on a Friday; it’s been a busy week with at least 4 twitter spaces, countless DMs and research! </p><p></p><p>OpenAI announces UX updates</p><p>* Example prompts: No more staring at a blank page! </p><p>* Suggested replies: ChatGPT automatically synthesizes follow-up questions; then you just click a button.</p><p>* GPT-4 by default: When starting a new chat as a Plus user, ChatGPT will remember your previously selected model! </p><p>* Uploading multiple files is now supported in the Code Interpreter beta for all Plus users.</p><p>* Stay logged in: You’ll no longer be logged out every 2 weeks, and if you do, we have a sweet new welcome page! </p><p>* 
Keyboard shortcuts: Work faster with shortcuts; try ⌘ (Ctrl) + / to see the complete list.</p><p><p>ThursdAI - I stay up to date so you don’t have to</p></p><p>Alibaba releases <a target="_blank" href="https://github.com/QwenLM/Qwen-7B">Qwen7b</a></p><p>* <strong>Trained with high-quality pretraining data</strong>. Qwen-7B is pretrained on a self-constructed large-scale high-quality dataset of over 2.2 trillion tokens. The dataset includes plain texts and code, and covers a wide range of domains, including general domain data and professional domain data.</p><p>* <strong>Strong performance</strong>. In comparison with models of similar size, Qwen-7B outperforms the competitors on a series of benchmark datasets evaluating natural language understanding, mathematics, coding, etc.</p><p>* <strong>Better support of languages</strong>. Its new tokenizer, based on a large vocabulary of over 150K tokens, is more efficient than other tokenizers. It is friendly to many languages and helps users further finetune Qwen-7B to extend its understanding of a particular language.</p><p>* <strong>Support of </strong>8K Context<strong> Length</strong>. Both Qwen-7B and Qwen-7B-Chat support a context length of 8K, which allows long-context inputs.</p><p>* <strong>Support of Plugins</strong>. Qwen-7B-Chat is trained with plugin-related alignment data, and thus is capable of using tools, including APIs, models, databases, etc., and of playing the role of an agent.</p><p>This is an impressive jump in open source capabilities, less than a month after the LLaMa 2 release! </p><p></p><p>GTE-large, a new embedding model, outperforms OpenAI ada-002</p><p>If you’ve used any “chat with your documents” app or built one, or have used a vector database, chances are you’ve used OpenAI ada-002, the most common embedding model (it turns text into embeddings for vector similarity search). </p><p>This model is ousted by an OpenSource (nee. 
free) one called <a target="_blank" href="https://huggingface.co/thenlper/gte-large">GTE-large</a> with improvements over ada across most metrics! </p><p>OpenOrca 2 preview </p><p>Our friends from AlignmentLab, including Teknium and LDJ, have discussed the release of OpenOrca 2! If you’re interested in the type of finetuning these guys do, we had a special interview w/ NousResearch on the pod a few weeks ago. </p><p>OpenOrca tops the charts as the best performing 13B model 👏</p><p></p><p>HyperWrite releases a personal assistant</p><p>You know how much we love agents on ThursdAI, and we’re waiting for this field to materialize; I personally am waiting for an agent to summarize all the links and screenshots for this summary, and… we’re not there yet! But we’re coming close, and our friends from HyperWrite have released their <a target="_blank" href="https://app.hyperwriteai.com/personalassistant">browser-controlling agent</a> on ThursdAI. Talk about a full day of releases! </p><p>I absolutely love the marketing trick they used, where one of the examples of how it works is “upvote us on producthunt”, and it actually did work for me, finding out that I had <a target="_blank" href="https://www.producthunt.com/posts/personal-assistant-by-hyperwrite">already upvoted</a></p><p></p><p>Superconductor continues</p><p>I was absolutely worried that I wouldn’t make it to this ThursdAI or wouldn’t know what to talk about because, well, I’ve become a sort of host, <a target="_blank" href="https://twitter.com/i/lists/1684446795731206144">information hub</a> and interviewer of folks about LK-99. Many people around the world seem interested in its properties, replication attempts, and in understanding this new and exciting thing. </p><p>We talked about this briefly, but if it interests you (and I think it absolutely should) please listen to the below recording. 
</p><p><p>ThursdAI - See ya next week; don’t forget to subscribe, and if you are already subscribed and get value, upgrading will help me buy the proper equipment to make this a professional endeavor and pay for the AI tools! 🫡</p></p> <br/><br/>This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-aug-3-openai-qwen-7b-beats</link><guid isPermaLink="false">substack:post:135726693</guid><dc:creator><![CDATA[Alex Volkov, Junaid Dawud, yam, and Roie Schwab Cohen]]></dc:creator><pubDate>Fri, 04 Aug 2023 21:58:30 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135726693/c437aeb2b349119d6a45d5fcadee8432.mp3" length="18760423" type="audio/mpeg"/><itunes:author>Alex Volkov, Junaid Dawud, yam, and Roie Schwab Cohen</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1563</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135726693/8e2d4dfb4f3782b9767f2de6a2eeaab4.jpg"/></item><item><title><![CDATA[🧪 LK99 - The superconductor that can change the world, and the K-drama behind it! ]]></title><description><![CDATA[This is a free preview of a paid episode. To hear more, visit <a href="https://sub.thursdai.news?utm_medium=podcast&#38;utm_campaign=CTA_7">sub.thursdai.news</a><br/><br/><p>First of all, let me address this from the get-go: I’m not a material scientist! I am pretty good at finding information in twitter’s incredibly noisy info stream. 
(hey, this is how I bring you AI updates <a target="_blank" href="https://thursdAI.substack.com">every ThursdAI</a>) </p><p>Since LK-99 is potentially groundbreaking and revolutionary, I’ve compiled a <a target="_blank" href="https://twitter.com/i/lists/1684446795731206144">twitter list</a> of everyone I found credible, interested and a source of new information, and there are now over 1.5K followers of this list alone!</p><p>Since this clearly is interesting to a lot of you, I reached out to a <a target="_blank" href="https://twitter.com/altryne/status/1685148163223719936?s=20">few prominent people</a> on this list and asked them to join a twitter space, to try and stitch together an update on the current state of LK-99, replication attempts, history and lore, as it stands a week after the original papers’ release. </p><p><p>If you found this interesting, you’re the type of person who wants to stay up to date; feel free to subscribe and keep this Substack alive!</p></p><p>First of all, let’s do some level setting. Superconductors are real (we use them in MRI machines, for example), but the currently available superconductors need extremely low temperatures and high pressures to, well… superconduct, and the promise of a room temperature, ambient pressure superconductor is the holy grail of energy use. </p><p>For a breakdown of what superconductors are and what they can mean for the world, I strongly recommend <a target="_blank" href="https://twitter.com/Andercot/status/1666629851305111554?s=20">this thread</a> from Andrew Cote (published presciently a full two weeks before the LK-99 paper) or watch this incredible breakdown: </p><p>July 22nd, the LK-99 arXiv day! 
</p><p>On July 22nd, two papers describing the “world’s first room temperature superconductor” were uploaded to arXiv: </p><p><a target="_blank" href="https://arxiv.org/abs/2307.12008">2307.12008</a> - <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Lee%2C+S">Sukbae Lee</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Kim%2C+J">Ji-Hoon Kim</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Kwon%2C+Y">Young-Wan Kwon</a> (submitted by Kwon)</p><p>and after 2 hours and 20 minutes another paper was uploaded</p><p><a target="_blank" href="https://arxiv.org/abs/2307.12037">2307.12037</a> - <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Lee%2C+S">Sukbae Lee</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Kim%2C+J">Jihoon Kim</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Kim%2C+H">Hyun-Tak Kim</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Im%2C+S">Sungyeon Im</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=An%2C+S">SooMin An</a>, <a target="_blank" href="https://arxiv.org/search/cond-mat?searchtype=author&#38;query=Auh%2C+K+H">Keun Ho Auh</a> (Submitted by Hyuntak Kim)</p><p>You may notice that the first two authors on both papers are <strong>Sukbae Lee</strong> and <strong>Ji-Hoon Kim</strong>; in fact, LK stands for Lee and Kim, and the 99 in the <strong>LK-99</strong> name stands for the year 1999, when they started research on this.</p><p>You may also notice that YW Kwon, who submitted the first paper, is not included on the second one, and in fact is no longer part of the <a target="_blank" href="https://qcentre.co.kr">Quantum Energy Research Institute</a> (aka QCentre), where he was CTO (he’s no longer listed on the site) </p><p>If this 
shakes out, and SC is replicated, there’s definitely going to be a Netflix series on the events that led YW Kwon to release the paper, after he was no longer affiliated with QCentre, with limited information, so let’s try to connect the dots (a LOT of this connecting happened on the ground by <a target="_blank" href="https://twitter.com/sanxiyn">Seo Sanghyeon</a> and his friends, and was translated by me. Their original coverage has a LOT of details and is available in Korean <a target="_blank" href="https://hackmd.io/@sanxiyn/S1hejVXo3">here</a>)</p><p>Let’s go back to the 90s</p><p>On the LinkedIn page of Ji-Hoon Kim (the page turned blank shortly before my writing this), JH Kim showed that he started working on this back in 1999, when they estimated they had a material that contained a “very small amount of superconductivity”, and together with <strong>Sukbae Lee, </strong>in 2018 he established QCentre to complete the work of their Professor Emeritus of Chemistry at Korea University, the late <strong>Choi Dong-Sik</strong> (1943-2017), who apparently first proposed the LK-99 material (following the 1986 bonanza of the discovery of <a target="_blank" href="https://safeswisscloud.com/en/blog/1986-high-temperature-superconductors/">high temperature superconductors</a> by IBM researchers).</p><p>Fast forward to 2017, when a wish expressed in a last will and testament starts everything again </p><p>Professor Choi passed away, and in his will requested follow-up research on ISB theory and LK-99; the Quantum Energy Research Institute was then established by Lee and Kim (LK), and they continued their work on this material. </p><p>In 2018, there’s a potential breakthrough that could have been an accident leading to the discovery of the process behind LK-99? 
</p><p>Here’s a snippet of Seo Sanghyeon explaining this:</p><p>Kwon Young-Wan, the ex-CTO</p><p>Kwon, a Research Professor at Korea University & KIST, is the third author on the first arXiv paper and its submitter; he was previously the CTO, but at the time the paper went to arXiv he had not been affiliated with QCentre for “some months,” according to an interview with Lee. </p><p>He <a target="_blank" href="https://arxiv.org/abs/2307.12008">uploads a paper</a>, names only 3 authors (Lee, Kim and himself) and then surprisingly <a target="_blank" href="https://twitter.com/TeraTom_S/status/1684808086547238912">presents LK-99 research</a> at the <a target="_blank" href="https://www.mml2023.org/">MML2023</a> international conference held in Seoul a few days later. We haven’t yet found a video recording; however, a few reports mention him asking for an interpreter and talking about bringing samples without a demonstration or proper equipment.</p><p>Important to note, that </p><p>Enter Hyun-Tak Kim</p><p><strong>H.T Kim</strong> is probably the most cited and well-known professor in academia among the folks involved. See his <a target="_blank" href="https://research.com/u/hyun-tak-kim">google scholar</a> profile, with a D-index of 43, 261 publications and 11,263 citations. </p><p>He’s a heavy hitter, and is the submitter and listed author of paper number 2, uploaded to arXiv 2 hours and 20 minutes after paper number 1 above. </p><p>In the second paper, he’s listed as the third author (and the submitter to arXiv) and his contribution is acknowledged like so: </p><p>An author, Hyun-Tak Kim (H. T. Kim),’s knowledge on mechanisms of both superconductivity and the metal-insulator (gap-nogap) transition highly contributed to writing the mechanism part. 
The knowledge was acquired over 20 years by processes of performing national projects including project [Grant 2017-0-00830] funded by Institute for Information and Communications Technology Promotion (IITP) in MSIT of Korea government in ETRI. H. T. Kim left ETRI on Nov. of 2022.</p><p>In the first paper H.T. is not acknowledged, and is only mentioned in reference no. 52, to his paper from 2021. </p><p>Ok, enough about the people, Alex! Does the rock levitate? </p><p>In January, QCentre’s YouTube channel uploaded an <a target="_blank" href="https://www.youtube.com/watch?v=EtVjGWpbE7k">unlisted video</a> that showed magnetic properties of LK-99, and another video, with partial levitation, is widely shared on <a target="_blank" href="https://twitter.com/kerryhew/status/1684964052882354176?s=20">social media</a>.</p><p>The partial levitation shown is attributed to the <a target="_blank" href="https://en.wikipedia.org/wiki/Meissner_effect">Meissner Effect</a> and is a supposed proof of room temperature superconductivity. However, these two videos are inconclusive and are not enough for us to take QCentre’s claims at face value. </p><p>The scientific community, having been stung by a recent incident surrounding a supposed room temp superconductor, where the <a target="_blank" href="https://www.nytimes.com/2023/07/26/science/ranga-dias-retraction-physics.html">evidence was apparently falsified</a> (Dias et al.), is not so easily swayed. </p><p>Adding to that, the mess around the multiple papers showing different theories, the lack of peer review or independent replication, the surprise publication, and a rushed follow-up publication all make people wonder: what is going on here? This doesn’t seem like a fabricated attempt. 
</p><p>Summary of replication attempts so far (Sun, Jul 30) </p><p>Given the importance of this discovery and the “relative” triviality of replication (common enough materials, a process that is not extremely complex; but kids, do not try this at home), we can bet that “furnaces in solid-state materials labs around the world have been cooking yesterday and today to try to reproduce” [<a target="_blank" href="https://www.science.org/content/blog-post/breaking-superconductor-news">Science</a> Magazine]</p><p>We have <a target="_blank" href="https://twitter.com/elsa17z/status/1685288876716314624">reports</a> from China that supplies of lead apatite are <a target="_blank" href="https://twitter.com/elsa17z/status/1685523496212865024">running dry</a> as many quietly try to replicate. </p><p>There are additional reports from <a target="_blank" href="https://www.facebook.com/AwanaVPS/posts/3590033751277349">India</a>, where <a target="_blank" href="https://awanavps.webs.com/?fbclid=IwAR0YEtTCarjUDYJlR6qLwXEvimktM7ZmJY94EvIdZ61GrfcVWwzzFjH5cHE">Dr. V.P.S. Awana</a>, the Chief Scientist at CSIR-NPL, and team are trying to replicate, with results expected as early as tomorrow (Monday, Jul 31); he has been emailing with Lee</p><p>In addition to this, we’ve had Andrew McCalip from Varda Space, who has been <a target="_blank" href="https://twitter.com/andrewmccalip/status/1685742024832815104?s=20">live-tweeting</a> and <a target="_blank" href="https://twitter.com/andrewmccalip/status/1685097443455856640">twitch-streaming</a> his “Meissner effect or bust” campaign to reproduce LK-99, while the world watches (Andrew joined the space as well) and provides ideas, materials and an outpouring of support for this gung-ho, almost cowboy effort. </p><p>We’ve also had folks from MIT who claimed that professors who want to remain anonymous and went to MML2023 are also in contact with the team and are trying to test the material.</p><p>Replication failure is … not a failure? 
</p><p>Discussing the replication attempts with experts on stage, we all concluded that there are likely two ways for the world to know whether LK-99 is a superconductor. </p><p>* Replication succeeds and scientists analyze the replicated sample</p><p>* QCentre team provides a sample, and some very smart independent folks put it under a microscope, run a magnetism analysis and a bunch of other measurements, and confirm that it’s a superconductor at room temperature.</p><p>While we wait for either of those, I encourage you to check out the resources, the space recording, and the <a target="_blank" href="https://twitter.com/i/lists/1684446795731206144">list of folks</a> I’ve collected to stay in the loop! </p><p>Here’s a list of relevant links: </p><p>* <a target="_blank" href="https://doi.org/10.6111/JKCGCT.2023.33.2.061">Paper 1 DOI</a></p><p>* <a target="_blank" href="https://arxiv.org/abs/2307.12008">Paper 2 Arxiv</a></p><p>* <a target="_blank" href="https://arxiv.org/abs/2307.12037">Paper 3 Arxiv</a></p><p>* <a target="_blank" href="https://www.newscientist.com/article/2384782-room-temperature-superconductor-breakthrough-met-with-scepticism/">New Scientist Interview</a></p><p>* <a target="_blank" href="https://biz.chosun.com/science-chosun/science/2023/07/27/CYOH5RGWHVDGDAJSJDW5S4SKXE/">ChosunBiz Interview (Korean)</a></p><p>* <a target="_blank" href="https://www.yna.co.kr/view/AKR20230728146700017">Yonhap Interview (Korean)</a></p><p>* <a target="_blank" href="https://twitter.com/i/lists/1684446795731206144">Twitter List</a></p><p>And the list of folks who participated in the space, give them a follow: </p><p>* <a target="_blank" href="https://twitter.com/altryne"><strong>Alex Volkov (@altryne)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/sanxiyn"><strong>Seo Sanghyeon (@sanxiyn)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/8teAPi"><strong>Ate-a-Pi (@8teAPi)</strong></a></p><p>* <a target="_blank"
href="https://twitter.com/andrewmccalip"><strong>Andrew McCalip (@andrewmccalip)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/Andercot"><strong>Andrew Cote (@Andercot)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/radsci"><strong>Ely Rabani (@radsci)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/Robotbeat"><strong>Robotbeat (@Robotbeat)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/marshray"><strong>Marsh Ray (@marshray)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/BenShindel"><strong>Ben (@BenShindel)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/KenCondon1"><strong>Ken Condon (@KenCondon1)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/jesuslares_me"><strong>Jesus (@jesuslares_me)</strong></a></p><p>* <a target="_blank" href="https://twitter.com/DanielleFong"><strong>Danielle Fong (@DanielleFong)</strong></a></p><p></p><p>For your convenience, attached is an AI transcription of the space with speakers and timestamps (may be off by a few minutes) : </p><p>[00:02:40] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Hello. Hello, everyone. There's a lot of you here, and I wanna welcome a shoot for up on stage while we wait for a few more guests, and then we can get started. Thank you so much for taking the time joining us. as you're as interested as all of us in this very exciting, very confusing, very potentially groundbreaking news. So I wanna introduce 2 folks up on stage 2 folks up on stage already, and bringing up another one just now. And hey, Andrew. Hey.</p><p>[00:03:18] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Hey, How are you guys?</p><p>[00:03:23] Ben (<a target="_blank" href="https://twitter.com/BenShindel">@BenShindel</a>):</p><p>Doing well. 
How are you?</p><p>[00:03:27] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): A little bit you know, the palms are a little bit sweaty. This is a insane turnout. Twitter is indeed a public space on because that we have. And, hopefully, spaces or two spaces, whatever they call it now, will hold. And I only invite Sam here to speak as well. Hey, Tobias. How are you?</p><p>[00:03:51] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>I'm good. I'm good. So good to good to, you know, hear from you guys in person, Alex. Thanks for putting the space together.</p><p>[00:04:00] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Thirdly. Andrew, we're gonna introduce Andrew, but many folks who are here already follow you and and follow your work. How how's your evening going, Andrew?</p><p>[00:04:12] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>Lee, this has been a wild ride. Thanks for putting all this together. It's gonna be great to get all the information in one place for the first time. This is my first time experiencing the full volume of the Internet, and just been a a lot of fun to see all the positivity around the progress.</p><p>[00:04:29] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): That's great. So I'll do my best that, you know, Mother think this. I will maybe preface this that I am not a scientist. Many of the terms that we'll hear today in the space I've heard for the first time a couple of days ago. What I am is a Twitter for many, many years, and I have collected a a list of folks who I I personally wanted to follow to kinda see the updates as they roll out, and we've seen many, many things roll out very quick. with a lot of confusion and different replication attempts from different places. And I just compiled the list for myself. 
I started following.</p><p>[00:05:08] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Ate-a-Pi had incredible incredible content diving into the the timeline. I found I I wanna introduce thank you. Am I saying this right? I think you need to hit the the mute button in a mute. If this is your first time talking on Twitter Spaces, let me know if you're able to do that. And if not, we'll try to solve this. And out as I was collecting folks, And I I started seeing that Andrew started doing their replication attempts and even doing Twitch.</p><p>[00:05:46] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Can you hear me?</p><p>[00:05:47] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Can you hear me? We can hear you. Hey, Sam Kim. How are you?</p><p>[00:05:57] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>It it it's the noon in South Korea, and I'm fine.</p><p>[00:06:01] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): the afternoon. Right?</p><p>[00:06:03] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>It's 1. Yes. Yes. It's the 1 PM.</p><p>[00:06:06] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Awesome. And so I was just doing an introduction maybe as you were telling up, you maybe not heard some of it. However, folks in the audience who followed this kind of thread and how we came to be here I have a a thread that I'll post on top here that has all the folks from the Twitter list that I forgot. And San Kyung and his his team is basically the reason for the space. Me and Nathan kind of found Sunqun. Am I saying Sunqun correctly? Is that is that the right way to say this?</p><p>[00:06:41] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>My name is. 
Your your, yeah, your pronunciation is not actually not.</p><p>[00:06:48] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Okay. I'll I'll turn my best to put months at the at the right names. And so we both me and 8 to 5, a a 34 in Saint Kyung, who's in Seoul currently, and definitely speaks the language we don't speak, and so there's a lot of insight and translation. And so, yeah, I guess we'll will get started, so feel free to present yourself, and then talk a little bit about your last few days and how you came around getting in this topic. and then how kinda what you found so far.</p><p>[00:07:28] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>I I didn't really expect to to speak.</p><p>[00:07:30] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): That's okay. That's okay.</p><p>[00:07:32] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>That's put me put me on the spot. Yeah.</p><p>[00:07:34] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): I don't wanna put you on the spot, but give us maybe a brief summary.</p><p>[00:07:44] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Maybe maybe do you do you want me to help Sanyon?</p><p>[00:07:47] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Yes, please. Okay. You you have read my right top, so maybe maybe you can explain what's going on.</p><p>[00:07:57] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Okay. So I'm I'm just gonna I'm just gonna just to preface everything, I I'm writing a work of fiction. So all of you guys are just participating in an experiment. So but I'm trying to keep everything to kinda, like, factual and trying to interpret what what is kind of happening on the ground. Right? 
Shyam is much more factual, and he he has actually been doing a primary source work. So he's been actually digging up the actual Korean language science papers. He's been sitting down with friends They've kinda, you know, summarized and kind of tried to understand what's going on.</p><p>[00:08:36] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And he's really the one that's, you know, put together this that that the you know, the the the mentor, you know, whose name, I think, in some transliterations comes out to TS's chair, some Donsick He the mentor was basically in superconductors in this idea of this kind of 1 dimensional super and he had this theory.</p><p>[00:09:00] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>That so the name is che. che. Oh, sure. Yeah. Yeah. Yeah. He was a a professor in the Korean University's Department of Chemistry.</p><p>[00:09:13] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Yeah. And and so he he had this idea, this theory, and he had graduate students. and one of those graduate students was Lee, and Lee kind of took up the mantle of this this theory. And then they, you know, tied up with Kim, who was an experimentalist.</p><p>[00:09:37] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And then they kinda discovered this ghost of a trace of a material in 1999. And at that point, what happens is having discovered this trace, their paths kind of diverge, and Kim, the experimentalist, goes on to do a masters, not in superconductors. So he does his masters in something else, and then he does the battery materials kind of PhD, and he graduates in 2008.
It's both a theory and synthesis of superconductors. And then he graduates, and then he he goes to work as a science adjunct professor, which we which we just found out. Like, a computer science adjunct professor, and he's there for about, you know, 4, 5 5 years. He doesn't publish. And and I'm guessing at this point, he kinda gets, like, you know, cashiered out of academia completely, and he sets up a consulting firm, basically, Q Center.</p><p>[00:10:50] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And they start taking on consulting work. And and then, again, the timeline is a little bit unclear on whether or not they continue to work on on the on on the product on what they discovered. And what happens then is in 2017, Chey Dongksik passes.</p><p>[00:11:18] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And as he passes, he he gets his former students together, and he asked them to finish off what they started to find this superconducting material that they saw a ghost of a trace of in 1999. And he passes, and they have no money. basically. Song Young has done, again, primary source research, and, you know, the the office space is basically, like, like, a two story building, you know, somewhere in the you know, in in Seoul. It's a very modest kind of office. They don't have much money.</p><p>[00:11:57] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>My guess my guess is that they need Kim. because Kim is the experimentalist, and I'm guessing also that none of the theory works at this point. The only thing that they have to go on is that they actually did find something in 1999. And Kim, I'm guessing, is also quite practical because he didn't do he didn't pursue the superconductors for the PhD. Right? Because he's quite practical, he's like, dude, you get me money. I'll join you. You don't have money. I'm not joining you for your wild goose chase. 
Right?</p><p>[00:12:36] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So Lee goes out and he recruits Kwan. And Kwan is kind of like you know, he's he's a US PhD. He has a research university, you know, position. recruit them, and they get funding. And I think I think Sam Young, you were you were saying that Kwon is the one on the, you know, National Science Foundation of Korea's like you know, list, like, grant. Right? I I think that's what you said.</p><p>[00:13:08] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>So the paper mentions the public grant from South Korea, called the National Research Foundation, which is like the National Science Foundation in the United States. And Kwon is listed as a principal investigator, a PI, then.</p><p>[00:13:25] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Right?</p><p>[00:13:26] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Mhmm.</p><p>[00:13:27] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Yeah. Yeah. That's right. Okay. So he he's the PI. So they recruit him as the PI, and Jade Kim, who is, you know, Lee's partner, basically leaves his very comfortable position as a research director in a hearing aid tech.</p><p>[00:13:44] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Yeah.</p><p>[00:13:44] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Yeah. Yes.</p><p>[00:13:45] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Yes. Yeah. Hearing aid Yeah. I Or the eye test there? Yeah. Yeah. For the ISER tech and in manufacture, the battery is specialized for the hearing aid. code. It is a medical device. They have a different standard from other batteries. 
And company a small business in South Korea, but seems competitive worldwide.</p><p>[00:14:13] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So he leaves his let me let me -- Yeah. Go ahead. Just real quick and to give folks a quick summary. The main paper that we saw the explosion from that was published on July 22nd, so a week and and almost a day we're, like, almost 8 days into this. The three people that you you just said, besides the first professor, Choi or Chair or Troy, as several places write it differently. So the the three people, Sukbae Lee and Ji-Hoon Kim, which is the LK in LK 99, right, Lee and Kim. And the third person you just mentioned is Young Wan, Kwan. Yes.</p><p>[00:14:52] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Those are the the 3 authors on the paper that kind of was published on arXiv out of the blue. 8 days ago. Please continue.</p><p>[00:15:03] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Right. And then so at this at this point, they're in 2017, And, you know, Lee goes out and does the fundraising. He recruits Kwan, who's the research professor, Kwon is basically he's on the paper. He he's he's the principal investigator on the grant, but he's still a professor at university. So he's basically, I'm guessing, like, a day a day in the, you know, in the office at Q Center, very modest place. I think the grant size is pretty small, and they get this ESR machine.</p><p>[00:15:41] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And again, from what I can tell, with the ESR machine, only Kim knows how to use it. Because none of the other people are actually synthetic, you know, synthesis people. They're all like theory guys, Kuan is a physicist. And Kim himself, JH Kim himself, he's looking for something which you have to know what you're looking for, right? Because that's what he says in his LinkedIn. 
He's like, I'm looking for some if you don't know what you're looking for, then forget about it. Right?</p><p>[00:16:19] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>But he he knows what he's looking for, and they refine, they refine, and they refine, and he keeps doing experiments. He keeps refining the experiment, and he goes through, like, a 1000 iterations. And somehow, starting in 2018, somehow, By the middle of 2018, they find it. So that that's a surprising thing for me because they've I I I suspect they they've been working on it you know, before or, you know, Jay and Lee had a breakthrough on their theory, so they knew how to narrow the workspace down. But somehow in at the end of the day, Kim is the one grinding.</p><p>[00:16:58] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Through that 1000 experiments, finally, to get, you know, a sample that works.</p><p>[00:17:03] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>And then they start by -- No. No.</p><p>[00:17:05] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): No.</p><p>[00:17:05] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>No.</p><p>[00:17:05] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): No.</p><p>[00:17:05] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>No. No. No. No. No. No? So so besides the two papers, there is a paper published in April returning query. 
And In their own words, they describe what what prompted their breakthrough in 2018.</p><p>[00:17:27] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>and it said that so so they are putting the material in a quartz tube And because they cooled it too fast, the quartz tube cracked and broke, And the material left after the breaking of the glass had the property they wanted. So so it was an accidental discovery.</p><p>[00:18:02] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So can can you repeat that? Like, they what what happened? They put it in the quartz tube, and the quartz tube broke accidentally?</p><p>[00:18:10] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Yes.</p><p>[00:18:10] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Yes. Yes.</p><p>[00:18:11] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>I see. And and And that what's the breakthrough in 2018? I see. It's what I'm saying.</p><p>[00:18:19] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Yeah. I just wanna confirm what I hear. The breaking of the quartz tube led to the accidental discovery. This is this is the the breakthrough as it's written in the first paper in Korea? Yes. Yes. Okay. So I'll just call ASAP, I'll just give it back for some logistics. Folks, if you look up on on top of the space, there's a few tweets we're pinning. And as we go along, we're gonna add some information on top of this. The 3rd the third we pin from dystopian breaker has a link to the original kind of Korean paper. So please go ahead, Ate-a-Pi.</p><p>[00:18:54] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>So so quick -- Okay. 
point.</p><p>[00:18:56] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Yeah.</p><p>[00:18:56] Ely Rabani (<a target="_blank" href="https://twitter.com/radsci">@radsci</a>):</p><p>Go ahead. Go ahead. This this could be important because, you know, as as soon as you expose it to the atmosphere, you're getting hydration. And hydration, you know, might be harmful, might be helpful. From this, like, little account, it seems like it it it either didn't do anything or was helpful. But, like, knowing what temperature it was at when it broke, and and things like that could could actually be really pertinent.</p><p>[00:19:30] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Yeah. So, absolutely, like so it's not they he does do the 1000 experiments, but the 1000 experiments, whether that gets him there or not, at one point in the experiment, the quartz tube breaks, that gets them there. They get lucky. Right? So they get they get lucky. And then after that, things proceed pretty quick They isolate they isolate it, and then they they get the crystallization. They start working on the papers. They start on the patents, and they start also trying to figure out the chemical vapor deposition process. They seem to have made some way some headway on the chemical vapor deposition process.</p><p>[00:20:06] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And then, you know, sometime around September 2021, something start happening. Kwon takes a position, a sabbatical at, I think, Seoul University at that point. I'm not sure whether that means he's putting more time in the office or not. And then that fast forwards to yeah. Go go ahead, Sunggham.</p><p>[00:20:33] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>No. 
No.</p><p>[00:20:33] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): No.</p><p>[00:20:33] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>You go ahead. Okay. So that fast forwards to about March 2023 when basically the international patent has been filed. And Kuan leaves the team at this time. I'm not sure when Kim comes on board. That's not very clear to me at what point Hyun-Tak comes on board.</p><p>[00:20:57] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So I'm guessing it's after the nature, the nature paper gets dinged in 2020, And and and, you know, the the other thing that strikes me also is that every single person on the team is very aware of every single hoax in superconductors to date. Right? They they they all know the space well, They've seen every single hoax before. They know they know what the hoaxes look like. They know what to look for. They know what diamagnetism is. So I I I don't think yeah.</p><p>[00:21:29] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Go ahead. So the date is So the day before the yesterday, Andrew McCalip posted on his Twitter the translation of the Korean paper at Doctor Lloyd. Is that correct? And can can you so so how did you translate and can Can you say something about it?</p><p>[00:21:59] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Andrew, I think he's Frank to you. So I can just ring to you. You posted a translated paper also. Right?</p><p>[00:22:08] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>Yes. Now that was just a machine translation from Google. 
That was just a very cursory translation.</p><p>[00:22:19] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Okay.</p><p>[00:22:19] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So basically, Kwon exits the team in March, and then you have the kind of papers being released, you know, haphazardly. The next the next point that of them is that they had started releasing the papers as early, like, late last week.</p><p>[00:22:42] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): And and then and then we have -- And by the way, I think it's it's important to highlight by Kwan, the guy who's no longer affiliated with with QCenter. Like, this this sole endeavor a business venture that's funded for for this for this purpose. Kwan is no longer affiliated with that. We've seen Sankyo posted an interview in Korea from Friday where I think both Lee and Kim say that Kwan, the guy who published the first paper, is no longer affiliated.</p><p>[00:23:12] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): There was some speculation that maybe the limit of three people on the paper is because of the Nobel Prize limit of 3 authors. I don't have this confirmed, but this is speculation going around. And it's important to note like, both of them say that the paper was not ready when it was released, and it was released by Kwon, the guy who left the first paper. 2 hours and 20 minutes later, another paper gets released in the in the same archive with, I wanna say, 5 authors, not including Kwon. Right?</p><p>[00:23:48] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So Lee -- Yeah. 
And -- The username is Hyun-Tak Kim, the the college professor from, you know, Virginia, who who pushes the arXiv paper at that point. Yeah.</p><p>[00:24:04] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Hyun-Tak Kim is a big name with an h-index of 45, and If you look at the paper, there is an error message in Korean saying that a bookmark could not be found. It is a natural error message when you did some of the typesetting wrong.</p><p>[00:24:27] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>And you just don't publish the room temperature superconductor paper with the error text that the bookmark cannot be found if you are not in an emergency.</p><p>[00:24:52] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So so it does feel to us at least from the summary so far that the paper that Kwon released has different information than than the second paper, and the second paper feels like it was released in a hurry and included more people that currently work at Q Center, including Hyun-Tak Kim. And Sanghyeon, I wanna ask you this question. You mentioned his h h score or something score. Can can you explain the importance of that score for Hyun-Tak Kim?</p><p>[00:25:20] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>I'll leave the explanation to someone else.</p><p>[00:25:24] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Okay. So so the h score is, you know, because we have a web web savvy audience here. It's kind of like a page rank for, you know, researchers. It shows you how influential how influential the researcher was, and so a higher score means that more people have been citing your paper.</p><p>[00:25:45] Ben (<a target="_blank" href="https://twitter.com/BenShindel">@BenShindel</a>):</p><p>Go ahead, Ben. Yeah. 
More precisely. So, like, an h index of, say, 40 means you have 40 papers that each have 40 citations or more. That's a little tricky to understand. So, like, if I get another paper that has only 30 citations, it won't affect my h index at all. I have to get a 41st paper that has 41 citations to to to make it rise.</p><p>[00:26:07] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So I think it's it's safe to say Hyun-Tak Kim, the guy who submitted the second paper, potentially haphazardly. Correct? Like, we're we're we're saying there's 2 hours after the first one. So likely prompted by these events is a very well cited scientist with a very high kind of confidence score. It's not like a random person off the street who decided that there's now a superconductor at room temperature and, you know, verified it.</p><p>[00:26:41] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>Okay. Sorry for being side tracked, but I just checked the machine translation of the Korean paper posted by Andrew. And on page 5, it clearly said that the quartz tube was destroyed due to internal pressure during rapid cooling of reaction and etcetera. So I think, in fact, nobody really read it carefully. It is it is just there about the quartz tube being destroyed.</p><p>[00:27:19] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Yeah. So I think I think it's yeah. Definitely, like, probably the the rest of us are are are not very close readers. of of that paper.</p><p>[00:27:29] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>So so We can we can continue on after the upload to the archive.</p><p>[00:27:42] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Indeed. So okay. 
So they they they it goes onto the arXiv, and then all of the events of the last week happen you know, I don't think any of us expected any of the events to happen. So we've all just been kind of, like, following along and seeing what happens next. I had no idea that there was a metallics conference in South Korea, and I I definitely had, like, no idea that you know, one of the authors would show up there, and it gets posted on Twitter. And so and then and then Seung Young points it out on the FM Korea Football message board.</p><p>[00:28:20] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>And so we translate, you know, what the audience reaction was in in in a bad translation to get -- So -- -- whatever message was across.</p><p>[00:28:30] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): -- mind let me interject here because this is around the that I found out about this. Alex, frozen coffee. Alex, I forgot his nickname. We invited him here. He posted a a very long Twitter thread that got the attention of the algorithm and then boosted of this room-temperature, ambient-pressure superconductor paper from Korea. I think he only started talking about the first paper, and then after the second paper also came out. And I think at this point, or somewhere around there. Andrew, you found out about this. What what did you first hear about, you know, Twitter drama around LK-99. Right?</p><p>[00:29:08] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): And, Andrew, feel free to at least produce you know, introduce yourself officially, and Varda, and how you're interacting with this.</p><p>[00:29:16] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>Yeah. So I was just cruising the Internet at night, and this came across. I think my my Twitter feed And so I I'm incredibly curious. 
This is something that has been a bit of a hobby for me. And so I was always interested in superconductors, so it caught my attention. I'm a mechanical engineer, so full disclosure, I am not a subject matter expert. I am simply an aerospace engineer that has a lot of curiosity and some assets at his disposal.</p><p>[00:29:50] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>And so reading this paper, what struck me was just the simplicity of the process. And so I realized that I probably had the ability to replicate, with full fidelity, the process that was described in the paper. And within about 30 minutes, I realized I should simply start down this road that Twitter was already picking up at the time.</p><p>[00:30:21] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>There were some conversations going back and forth, and it was the classic scenario where on every superconductor discussion, there is the same conversation that happens over and over again. And this synthesis appeared so simple that it seemed that the most expedient thing was to simply test it physically. And my workplace is very receptive to after-hours projects. I'm known as the guy that has really aggressive hobbies, let's say.</p><p>[00:30:57] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>And so I'm always in the back doing something interesting with materials or automation. So within 30 minutes of reading the paper, I had kicked off orders to various chemical suppliers and reached out to overseas vendors to try to procure a couple of the elements.
And so it was just kind of an offhand comment that I made on Twitter, and then the ball really started rolling, and I realized that everyone wanted to see this made.</p><p>[00:31:32] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>And so it was just supposed to be a fun little project, but I was really overwhelmed by the response. Everyone wanted to see this done. I think there's this incredible curiosity, there's this incredible drive. People wanna see, like, incredible things happen for the human race. And so when something of this magnitude pops up, everyone's motivated to drop everything and investigate. And I think that's where we're at.</p><p>[00:32:08] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): And I think you met the algorithm at the right place, where folks were excited about the future and think this could bring a lot of changes, and you started saying, hey, you know, here's a direct approach, let's try to replicate this. And I wanna just highlight the materials and the process involved in creating this. Some folks say, and please talk about this, that if this had been an attempt at a hoax, it wouldn't be this simple. They wouldn't have released a simple instruction manual, kind of quote-unquote simple, that many labs around the world could replicate given the materials and the right equipment. Right?</p><p>[00:32:48] Ely Rabani (<a target="_blank" href="https://twitter.com/radsci">@radsci</a>):</p><p>So -- Yeah.</p><p>[00:32:48] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So the straightforwardness of this potentially shows some stuff.</p><p>[00:32:51] Ely Rabani (<a target="_blank" href="https://twitter.com/radsci">@radsci</a>):</p><p>So this is a good time for a PSA.
I mean, I know that Andrew is well aware of this, and so are many of the people who've been following it. But in case anybody who's listening isn't: these compounds, in vapor form at any rate, are highly toxic, and you have to know lab safety if you're gonna start trying to experiment with them. You need things like a glove box and, you know, all kinds of PPE, a fume hood, everything else. Taking risks with this kind of thing is just really not worth it.</p><p>[00:33:31] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): I can't stress that enough. Absolutely. Don't try this at home.</p><p>[00:33:36] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>Kids, definitely. Yeah. Absolutely. There was a lot of chatter in the beginning, in the first couple hours, about how this could be replicated in a garage. And, you know, I thought it was interesting. I thought maybe we've got the opportunity to do it safely. We've got all the right equipment. We've got, you know, the millions of dollars of equipment that support our spacecraft business, that allow us to do some of these things safely. And so I thought, Twitter wants to live vicariously through somebody, so why not do this?</p><p>[00:34:12] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>I ended up being in sort of an interesting middle ground, because I'm not in academia, and I'm also not trying to commercialize any part of this tech. I'm really just doing it for fun, because it's incredibly interesting. So I've got no skin in the game except for making this work in a transparent manner,
and then getting the materials into the hands of the experts.</p><p>[00:34:34] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>So I thought, if we can leverage some of our equipment and some of the, you know, very smart people that we have, we can speed this timeline up. I didn't see anybody in the United States being vocal about trying to do a replication. There are so many stories coming out of other parts of the world about all the labs; there must be thousands of furnaces burning right now trying to replicate this. But I wanted to get material into the hands of some local experts in California.</p><p>[00:35:09] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>And so that's really our goal: you know, can we sort of be the face of the Internet, do this experiment in a safe manner, and then help advance the science and be sort of a forcing function for this replication.</p><p>[00:35:27] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So, Andrew, just a small pause before you continue. I want to ask the other Andrew here, Andrew Cote, if you're able to unmute and talk to us, if you're available, about the potential reasons why all of Twitter jumped on this. Andrew Cote, you had a thread on room-temperature superconductors about 2 weeks before this, like, almost a premonition kind of a thread. And could you give us some summary? First of all, feel free to introduce yourself, but also give some summary of what it means if this replicates, what this means for the world.</p><p>[00:36:07] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): Applications, you know, give us, like, some of the excitement of what happens if this is an actual ambient-pressure and room-temperature superconductor? Andrew?
Doesn't look like Andrew is... Oh, hey.</p><p>[00:36:33] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Sorry. My audio cut out for a second. I missed the prompt. Oh, here you are. Sure. Yeah. Thanks. Thanks very much.</p><p>[00:36:44] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): So, folks, I explained your thread about ambient-pressure, room-temperature superconductors that you'd offered, what, 2 weeks before the paper came out. And then suddenly, this dropped. And I wanted you to highlight some of the potential applications of superconductors and give us some of the highlights of what happens if this replicates and is an actual, you know, real thing.</p><p>[00:37:08] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Yeah. Sure. So it's kind of a funny thing. Yeah. I put that thread out there 7 weeks before this story broke. You know, I have worked with this kind of stuff in a few different areas now. Superconducting radio-frequency cavities are standard technology in accelerator physics, a field I used to work in.</p><p>[00:37:31] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Like, my first job in physics was actually in a condensed matter lab, using a scanning tunneling microscope to look at, you know, electronic structures of potential high-temperature superconductors. So this has always been sort of a holy grail of materials science, sort of a holy grail of applied physics. It's one of these materials where the bulk properties come from its quantum mechanical behavior.
And, you know, when quantum mechanics and its effects escape the realm of the very tiny, they can really manifest as magical phenomena at our scale, in the world of the bulk matter, the big stuff.</p><p>[00:38:10] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>So, you know, superconductors are used currently today. They've reached engineering applicability through decades of continuous refinements and improvements. And some of the biggest things that let these materials get used in industrial applications are their ability to superconduct at higher and higher temperatures and, most importantly, to operate at higher and higher background magnetic field strengths. And so the way to think about this is that a superconductor is allowing current to move through it with zero resistance, but it also perfectly expels magnetic fields.</p><p>[00:38:48] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>And there's an operating point of these materials where basically the current density and the temperature and the magnetic field put the bounds, or the performance envelope, on the material. So some superconductors can carry tons of current, but they can't exist in a very high field. And so, you know, those are hard to make useful. You can use them for carrying, like, electricity, which is awesome, but often what you really wanna do is generate very strong magnetic fields. So I think maybe the example most familiar to people here would be, like, an MRI machine. Right?</p><p>[00:39:27] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Magnetic resonance imaging. So the idea there is you're generating a very high-strength field, and magnetic fields are measured in Tesla, for example.
So just for context, you know, 3 Tesla is a pretty strong field, and that's about the strength used in an MRI. So, you know, MRIs use these cryogenically cooled magnets... or, I don't think they're all cryogenically cooled. They're actually often just copper, but they do have cooling. But they generate this high-strength field, and then, you know, it kind of sets all these little protons in your body spinning and dancing, you know, kind of radiating energy.</p><p>[00:40:03] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>And then you have a pickup coil, which is like an antenna, and the antenna is trying to pick up that energy and kinda reconstruct what's going on in your body. And this is how we can get, like, a really high-detail, high-fidelity, three-dimensional image of what's going on inside someone without any invasive surgery. So, like, you know, MRIs are a real kind of amazing breakthrough in medical imaging. Superconductors, if they could work without cryogenics, would really simplify MRIs and make cheaper and more available high-resolution, high-fidelity 3D images of people's bodies.</p><p>[00:40:35] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Not just for making the magnetic fields, but also for picking up the signal emitted by the protons that get put into motion by the field in the first place. So that's, like, one sort of off-the-shelf example. I think another one that's kind of under the radar, that we don't think about: it's not just carrying electricity without resistance, which is useful for long-range energy transmission, that kind of stuff.
But if you look at the national grid, I mean, only 5 to 7 percent of total energy, which is still significant, but, you know, a single-digit percentage, ends up, you know, burning off as waste... You're suddenly muffled.</p><p>[00:41:11] Alex Volkov (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>): I don't think... yeah, your voice suddenly... like your -- Oh, better.</p><p>[00:41:18] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Now it's better. Okay. Sorry about that. Yeah. So I was just gonna say, you know, national-grid-scale energy production. Right? So in transmitting the energy to its endpoint of consumption, there's a bit of waste heat along the way. But what's also important to think about is how that energy is produced. It's produced also using high-strength magnetic fields. And I was looking into this. There's an experiment where these guys used sort of more modern high-temperature superconducting tape to, you know, retrofit a large DC generator, and then it had, like, a 36 percent power improvement, right, which is pretty substantial. That's a serious win.</p><p>[00:41:58] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Yeah. So there's, you know, sort of thousands of places this stuff could be used where it would either greatly improve the performance efficiency, reduce the cost, or increase the accessibility of what we think of as, like, high technology, like MRIs or particle accelerators. But it would also just decrease the cost of basic things like electricity generation and distribution. And that's just the beginning. Right? So, you know, there's a really good analogy here actually with the transistor. You know, for years, scientists, electrical engineers, and physicists had this idea of a transistor.
Right?</p><p>[00:42:35] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>If only we could have some kind of simple, reliable current amplifier, we could design all these wonderful things. We could design all these different kinds of logic functions and so forth. And so there was this search for the transistor. People were searching for something that could do that, and they had anticipated all the places it could be used ahead of time. And it wasn't until Bell Labs... you know, a very kind of funny crossover here. One of the guys that's on the patent for the transistor is John Bardeen. And John Bardeen's actually the only guy to win 2 Nobel Prizes in physics. One was for the transistor. The other was for the theory of superconductivity, right, which is Bardeen-Cooper-Schrieffer theory, BCS.</p><p>[00:43:14] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>So, again, it's one of those things where, you know, physicists, scientists, engineers kinda thought about this for a long time and realized this would be amazing. And there's been this, you know, really complicated random walk through the configuration space of possible materials, right, which is so high-dimensional. There's so many things you can construct. So I'm very optimistic about the field in general. I think one thing to think about with this particular result is that there's so much artisanal craft and mastery that goes into producing these materials in a reliable, consistent way. You know, people don't often recognize that there's a lot of art involved in science too. Right?</p><p>[00:43:52] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>Like, things that are reduced to expert practice and know-how. And so I'd just be cautious on, you know, jumping to conclusions either way on this particular result, on whether it's valid, right now.
But, also, if some labs fail to reproduce it, that doesn't actually rule it out entirely. I think there are scientists that have traveled to Korea to work with the original authors. I'd look closely at that. You know, my internal odds are kind of like a 1-in-6 chance this pans out, and it could be big.</p><p>[00:44:21] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>But that doesn't mean that it's the end of the search or the end of the race. And I'm also optimistic that getting people to understand the massive long-term, large-scale social benefits of this kind of discovery could help direct a lot more basic science research towards this field. You know, I think we spend a lot on things like how to make smartphone cameras better, and not as much as we could on things like high-temperature superconductors. And this is a final example.</p><p>[00:44:48] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>I mean, so right now, you know, I work as an accelerator engineer on a type of magnetic-confinement fusion reactor. The reason the company I work for can exist, and the reason there is this current boom in nuclear fusion, is because we've engineered these high-temperature superconductors to work in higher and higher magnetic fields, at higher and higher temperatures. And the big economic breakthrough there came when we could have these superconductors work at liquid nitrogen temperatures, right, which is 77 Kelvin. And it's a lot cheaper to make liquid nitrogen and run that kind of cryogenics than liquid helium at, like, 4 Kelvin.</p><p>[00:45:24] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>So, you know, we're already reaping some of the benefits of this sort of tech stack maturing over time.
And I think we're really just getting started in terms of, like, the hunt for promising materials. I mean, I'm hoping this results in positive publicity and more effort, more energy, put into the field. I think if this doesn't pan out as the thing, you know, don't give up hope. Right? I mean, this is a long-term game. Science proceeds by starts and stops. There's no fundamental physics here that's impossible. Right? There's no physical principle that says this can't work. Right? This isn't like a momentumless or massless propulsion drive like the EM drive.</p><p>[00:46:04] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>It isn't, like, superluminal neutrinos. Right? Those things kind of break laws of physics. This is very much in the realm of the physically possible. It seems very, you know, in my mind, it seems likely there could be something out there, given the complexity of the state space of electronic structures and given, you know, how large that space of exploration can be. And, yeah, so I think this is a great time to be interested in materials science, to appreciate basic science research, and to educate ourselves on how good the future can be. You know, I think there's a lot of narratives right now in society and culture in general that kinda say, like, you know, we can't solve our way out of our biggest problems today. Right?</p><p>[00:46:43] Andrew Cote (<a target="_blank" href="https://twitter.com/Andercot">@Andercot</a>):</p><p>And I'm very much on the other side of that debate. I think we can. I think it's through efforts like this. I think it's through people like Andrew at Varda that are willing to do stuff in their backyard or their garage or their workplace on their extra time. You know? I mean, this is the let's-build mentality. Right?
And so I think we can build our way out of the world's greatest problems, and it's fundamental scientific advances like this discovery that could kind of pave the way out of them too. So, yeah, overall, very optimistic.</p><p>[00:47:11] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>Andrew? That's incredibly well said. That is an incredibly well-balanced viewpoint. So how would you advise people to absorb the next week of the news cycle? I mean, we're very much in a, you know, "it's over, we're back" type of hype cycle. So how do you advise people to think about the results that they're seeing, knowing that this is a very difficult thing to replicate, and that just because a negative result is shown in a lab, that doesn't mean it's not physically possible.</p><p>[00:47:49] Andrew McCalip (<a target="_blank" href="https://twitter.com/andrewmccalip">@andrewmccalip</a>):</p><p>It's very difficult to prove the negative here. So tell us how we should absorb the news cycle coming up in the next few days.</p><p>[00:47:59] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>So I might say something about that. I think this is basically tacit knowledge transfer, and Kim seems to have been this kind of, like, artisanal, you know, experimentalist. So you need people to actually sit there in the lab with this guy, and he needs to demonstrate to them, and they need to pick it up. And there might be things that he does which he didn't write down. That's, like, my take on it, given that he is the experimentalist, he's the synthesis guy on the team.</p><p>[00:48:38] Ate-a-Pi (<a target="_blank" href="https://twitter.com/8teAPi">@8teAPi</a>):</p><p>Given that the team seems to have been only, like, 5 or 6 people, this guy was maybe the only person in the world who could do this, as of, like, you know, 18 months ago.
I'm guessing that, you know, he managed to transfer some of that to the JungTux team. So I'm guessing that at least one more team on earth has this now. And I'm guessing that this knowledge transfer is now happening to a couple more people. So you need to see this progress maybe 2 or 3 cycles, for, like, a bunch of other people to have learned the skill, and that's when things get interesting.</p><p>[00:49:14] Seo Sanghyeon (<a target="_blank" href="https://twitter.com/sanxiyn">@sanxiyn</a>):</p><p>I mean, you don't really need to replicate this to verify it. The team has the working samples; they can just send the samples to the labs around the world.</p><p>Hey, the rest of the episode is for paid subscribers to ThursdAI. </p><p><p>I encourage you to subscribe or upgrade your subscription to access it; there’s almost 2 more hours of in-depth conversation, stitching together of facts, experts on materials science, physics, and electrical engineering, and MIT folks chiming in. It’s really a great space; around 25K folks have listened to it on Twitter so far. </p></p>]]></description><link>https://sub.thursdai.news/p/lk99-the-superconductor-that-can</link><guid isPermaLink="false">substack:post:135556493</guid><dc:creator><![CDATA[Alex Volkov, Seo Sanghyeon, Prakash, Andrew McCalip, Ben, Andrew Cote, and Danielle Fong]]></dc:creator><pubDate>Sun, 30 Jul 2023 22:09:53 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135556493/9b68ddb9bd1c30360f1da44c7b2f742c.mp3" length="36181668" type="audio/mpeg"/><itunes:author>Alex Volkov, Seo Sanghyeon, Prakash, Andrew McCalip, Ben, Andrew Cote, and Danielle Fong</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>3015</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135556493/2d78008f5fad86b5d65007224c428ce7.jpg"/></item><item><title><![CDATA[🎙️ThursdAI - Jul 27: SDXL1.0, Superconductors?
StackOverflowAI and Frontier Model Forum]]></title><description><![CDATA[<p>⏰ Breaking news, ThursdAI is now on <a target="_blank" href="https://podcasts.apple.com/us/podcast/the-top-ai-news-from-the-past-week-every-thursdai/id1698613329?itsct=podcast_box&#38;itscg=30200&#38;ls=1">Apple Podcasts</a> and in <a target="_blank" href="https://api.substack.com/feed/podcast/1801228.rss">this RSS</a>! So use your favorite pod-catcher to subscribe, or hit this button right here: </p><p>Our friends at <a target="_blank" href="https://zealous.one/?via=alex-volkov">Zealous</a> have provided an incredible platform for us to generate these awesome video podcasts from audio or from Twitter spaces, so if you prefer a more visual format, our <a target="_blank" href="https://zealous.one/?via=alex-volkov">deep thanks to them</a>! </p><p>P.S - You can find the <a target="_blank" href="https://zealous.one/@altryne/thursdAI/convos/cnv_2TAPOlQdP7ZGDxNmfpNNVrRBun6?tab=Transcript">full 2 hour space</a> with speakers on our <a target="_blank" href="https://zealous.one/@altryne/thursdAI/convos/cnv_2TAPOlQdP7ZGDxNmfpNNVrRBun6?tab=Transcript">Zealous page</a> and <a target="_blank" href="https://twitter.com/altryne/status/1684594417582346240?s=20">on Twitter</a></p><p>Here’s a summary of the main things that happened in AI since <a target="_blank" href="https://thursdai.substack.com/p/thursdai-july-20-llama-2-vision-and#details">last ThursdAI</a>: </p><p>🧑‍🎨 Stability.ai releases <a target="_blank" href="https://twitter.com/GozukaraFurkan/status/1684301022146035712?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1684301022146035712%7Ctwgr%5E445e2350ea9f09419211e1256198202ac2f589ef%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YqKDogaNVzxV%2Fthursdai-llama-2-fine-tunes-sdxl-10-agent-updates-more">SDXL1.0</a></p><p>* Generates 1024px x 1024px stunning images</p><p>* High photorealism</p><p>* Supports hands and text</p><p>* Different (simpler?)
prompting required</p><p>* Fine-tunes very well! </p><p>* Supports LoRAs, ControlNet, in-painting and outpainting, and the whole ecosystem built around SD</p><p>* Refiner is a separate piece that adds high-quality detail</p><p>* Available on <a target="_blank" href="https://dreamstudio.ai/">Dreamstudio</a>, <a target="_blank" href="https://github.com/Stability-AI/generative-models">Github</a>, <a target="_blank" href="https://clipdrop.co/stable-diffusion">ClipDrop</a> and <a target="_blank" href="https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0">HuggingFace</a></p><p>* Also available with the incredible ComfyUI and can be used in a <a target="_blank" href="https://twitter.com/camenduru/status/1678649469481385985">free Colab</a>!</p><p>Image credit goes to <a target="_blank" href="https://twitter.com/thibaudz/status/1675936194092232706">Thibaud</a></p><p></p><p>Superconductors on Hugging Face? What? </p><p>Honestly, this has nothing immediate to do with AI updates, but, if it pans out, it’s so revolutionary that it will affect AI also!</p><p>Here’s what we know about LK-99 so far: </p><p>* 2 papers released on arXiv (and <a target="_blank" href="https://huggingface.co/papers/2307.12008">Hugging Face</a>, haha) in the span of several hours</p><p>* <a target="_blank" href="https://arxiv.org/abs/2307.12008">First</a> AND <a target="_blank" href="https://arxiv.org/abs/2307.12037">second</a> paper both make extraordinary claims of solving ambient superconductivity</p><p>* Ambient-pressure and room-temperature superconductive material called <a target="_blank" href="https://en.wikipedia.org/wiki/LK-99"><strong>LK-99</strong></a><strong> </strong></p><p>* Straightforward process with a clear replication manual and fairly common materials</p><p>* Papers lack rigor, potentially due to rushing out or due to fighting for credit for a Nobel Prize </p><p>* The science is potentially sound, and is being “<a target="_blank"
href="https://www.science.org/content/blog-post/breaking-superconductor-news">baked and reproduced in multiple labs</a>” per Science magazine.</p><p>Potential effects of room-temperature superconductivity on AI: </p><p>While many fields (all?) can benefit from the incredible applications of superconductors (think 1000x batteries), the field of AI will benefit as well if the result above replicates.</p><p>* Production of GPUs and CPUs is power-constrained and could benefit</p><p>* GPUs/CPUs themselves are power-constrained while running inference</p><p>* GPT-4 is great but consumes more power (training and inference) than previous models, making it hard to scale</p><p>* Local inference is also power-restricted, so running local models (and local walking robots) could explode with superconductivity </p><p>* Quantum computing is going to have a field day if this is true</p><p>* So will fusion reactors (which need superconductors to keep the plasma in place) </p><p>As we wait for labs to reproduce, I created a <a target="_blank" href="https://twitter.com/i/lists/1684446795731206144">Twitter list of folks</a> who are following closely, feel free to follow along! </p><p>AI agents: protocol, discussion, and state of the field for July 2023</p><p>* Participated in an e2b space with tons of AI builders (Full space and recap coming soon!)
</p><p>* Many touted AI agents as a category and discussed their own frameworks</p><p>* Folks came up and talked about their needs from the <a target="_blank" href="https://github.com/e2b-dev/agent-protocol">agent protocol</a> proposed by e2b</p><p>* Agents need to be able to communicate with other agents/sub-agents</p><p>* Task payloads and artifacts and task completion can be async (think receiving a response email from a colleague) </p><p>* The ability to debug (with time travel) and trace and reproduce an agent run</p><p>* Deployment, running and execution environment issues</p><p>* Reliability of task-completion reporting, and evaluation, is hard</p><p>Frontier Model Forum</p><p>* <a target="_blank" href="https://openai.com/blog/frontier-model-forum?utm_source=bensbites&#38;utm_medium=referral&#38;utm_campaign=frontier-model-forum-google-microsoft-open-ai-anthropic">OpenAI</a>, Anthropic, Google, and <a target="_blank" href="https://blogs.microsoft.com/on-the-issues/2023/07/26/anthropic-google-microsoft-openai-launch-frontier-model-forum/?utm_source=bensbites&#38;utm_medium=referral&#38;utm_campaign=frontier-model-forum-google-microsoft-open-ai-anthropic">Microsoft</a> are forming the Frontier Model Forum to promote safe and responsible frontier AI.</p><p>* The Forum will advance AI safety research, identify best practices, share knowledge on risks, and support using AI for challenges like climate change.</p><p>* Membership is open to organizations developing frontier models that demonstrate safety commitment.</p><p>* The Forum will focus on best practices, AI safety research, and information sharing between companies and governments.</p><p>* Some have expressed concern that this could enable regulatory capture by the “Big LLM” shops, which can use their lobbying power to stop innovation.
</p><p>StackOverflow AI - “The reports of my death have been greatly exaggerated” </p><p>Stack Overflow has been in <a target="_blank" href="https://twitter.com/nixcraft/status/1683754010443157505">the news lately</a>, after a graphic of its decline in traffic went viral. </p><p>They have <a target="_blank" href="https://twitter.com/StackOverflow/status/1684310579735928836">publicly disputed that information</a>, claiming they have moved to a different measurement method and didn’t update the webpage, but then also… announced <a target="_blank" href="https://stackoverflow.blog/2023/07/27/announcing-overflowai/">Overflow AI</a>!</p><p>* AI search and aggregation of answers + ability to follow up in natural language</p><p>* Helps drafting questions</p><p>* AI answers with a summary, and citations with the ability to “extend” and adjust for your coding level</p><p>* VSCode integration! </p><p>* Focusing on “validated and trusted” content</p><p>* Not only for SO code: Stack Overflow for Teams will also embed other sources (like your company Confluence) and will give you attributed answers and tagging abilities on external content</p><p>This has been an insane week in terms of news (👽 anyone?) and superconductors and AI releases! As always, I’m grateful for your attention! Forward this newsletter to 1 friend as a favor to me if you learned something new? Or alternatively, retweet us on Twitter for bigger reach! </p><p>Thank you! See you next ThursdAI (and on Sunday when I release the State Of Agents recap 😅 ) </p><p><p>ThursdAI - Get in on this, and share w/ 1 friend 🫡</p></p> <br/><br/>This is a public episode.
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-jul-27-sdxl10-superconductors</link><guid isPermaLink="false">substack:post:135496262</guid><dc:creator><![CDATA[Alex Volkov, Nisten, Junaid Dawud, and yam]]></dc:creator><pubDate>Thu, 27 Jul 2023 23:05:55 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135496262/6922ade0f5f3932f9d59a3fa1707fa6c.mp3" length="18149503" type="audio/mpeg"/><itunes:author>Alex Volkov, Nisten, Junaid Dawud, and yam</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1134</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135496262/a999ef6d73a4255fd247461c6420a87d.jpg"/></item><item><title><![CDATA[ThursdAI - Special Episode, interview with Nous Research and Enrico Shippole, fine-tuning LLaMa 2, extending its context and more]]></title><description><![CDATA[<p>Hey there, welcome to this special edition of ThursdAI. This episode features an interview with Nous Research, a group of folks who fine-tune open source large language models to make them better. If you are interested in hearing how finetuning an open source model works, dataset preparation, context scaling and more, tune in! </p><p>You will hear from <a target="_blank" href="https://twitter.com/karan4d">Karan</a>, <a target="_blank" href="https://twitter.com/Teknium1">Teknium</a>, <a target="_blank" href="https://twitter.com/Dogesator">LDJ</a> from <a target="_blank" href="https://twitter.com/NousResearch">Nous Research</a> and <a target="_blank" href="https://twitter.com/EnricoShippole">Enrico</a> who worked alongside them. 
</p><p>To clarify, Enrico goes in depth into the method called RoPE Scaling, a clever hack that extends the context length of LLaMa models significantly, and his project LLongMa, an extended version of LLaMa with an 8000-token context window. </p><p>The first voice you will hear is Alex Volkov, the host of ThursdAI, who doesn’t usually have a lisp, but for some reason, during the recording, twitter spaces decided to mute all the S sounds. </p><p>Links and acknowledgments: </p><p>* Nous Research - <a target="_blank" href="https://nousresearch.com/">https://nousresearch.com/</a> (<a target="_blank" href="https://twitter.com/nousresearch">@nousresearch</a>)</p><p>* <a target="_blank" href="https://huggingface.co/NousResearch/Redmond-Puffin-13B">Redmond Puffin 13b</a> - First LLaMa Finetune</p><p>* <a target="_blank" href="https://huggingface.co/conceptofmind/LLongMA-2-7b">LLongMa</a> - LLaMa finetune with 8K context (by Enrico, <a target="_blank" href="https://twitter.com/theemozilla/status/1676597615750701057">emozilla</a> and <a target="_blank" href="https://twitter.com/kaiokendev1">KaioKenDev</a>)</p><p>* <a target="_blank" href="https://huggingface.co/NousResearch/Nous-Hermes-Llama2-13b-GPTQ">Nous-Hermes-Llama2-13b-GPTQ</a> - Hermes Finetune was released after the recording 🎊</p><p><p>Psst, if you like this, why don’t you subscribe? Or if you are subscribed, consider a paid subscription to support #ThursdAI</p></p><p>Show transcription with timestamps: </p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:00:55]</strong> Yeah. That's awesome. 
So I guess with this, maybe, Karan, if you are able to, can you talk about Nous Research and kind of how it started and what are you guys doing, and then we'll dive into the kind of, you know, Hermes and Puffin and the methods and all of it.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:01:16]</strong> Absolutely. Nous Research. I mean, I myself and many others of us are just, like, enthusiasts that were fine tuning models like, you know, GPT-J or GPT 2. And, you know, we all are on Twitter. We're all on Discord, and kind of just found each other and had this same mentality of we wanna make these models. We wanna kinda take the power back from people like OpenAI and Anthropic. We want stuff to be able to run easy for everyone. And a lot of like minds started to show up.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:01:50]</strong> I think that Teknium's addition initially to Nous Research, him kinda showing up, and himself and I working on compiling the Hermes dataset, was really what came to attract people when Hermes came out. I think we just have a really strong and robust, like, data curation thesis in terms of that. And I think that we have just some of the most talented people who have come to join us and just volunteer and work with us on stuff. And I absolutely must say, I can see in the listeners our compute provider, Redmond AI.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:02:30]</strong> And, you know, none of these models would be possible without Redmond's generous sponsorship for us to be able to deliver these things lightning fast, you know, without making us jump through a bunch of hoops, just a total pleasure to work with. 
So I have to shill and say, you know, I highly recommend everyone check out Redmond, because they really make our project possible.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:02:52]</strong> Absolutely. So shout out to Redmond AI and folks give them a follow. They're the only square avatar in the audience. Go check them out. And, Karan, thanks for that. I wanna just do a mic check for Teknium. Teknium. Can you speak now? Can you hear me?</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:03:08]</strong> Yeah. My phone died right when you were introducing me earlier.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:03:10]</strong> Yep. That happens sometimes on Twitter spaces. Welcome, Teknium. So briefly, going back to the question. I don't know if you heard it. Besides the commercial license and kind of the context window, what kind of caught your eye in the LLaMA, at least the base, until you guys started, or have you also, like the other guys, not had a second to play with the base model and dove into fine tuning directly?</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:03:35]</strong> Yeah. The only thing that really caught my eye was the chat model and how horribly RLHF'd it was.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:03:41]</strong> Yeah. I've seen some conversations about that and kind of the pain of RLHF as well. And okay. So now that we've introduced Nous Research, I wanna talk to you guys about what you guys are cooking. Right? We've seen the Hermes model before this, and loved it as one of the, you know, the best fine tunes that I've seen at least, and the most performing ones. 
Could you guys talk about the process to get to the Hermes model, the previous one? And then give us hints about what's coming soon?</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:04:16]</strong> Teknium, you got this one, man.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:04:22]</strong> Yeah. It was basically I saw Alpaca, and I wanted to remake it with GPT 4, and then from there just pretty much exclusively included anything that was GPT 4 only, and that was the beginning of the thesis for that. Going forward, though, we still have a lot of low quality data, I think, in the Hermes data set that can be cleaned out, and then there's a lot of new data sets that have come out that I wanna start merging into there. I also wanna move to something like ChatML or even Vicuna format so that we can do some multi turn stuff. It's not very great at long chat.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:05:03]</strong> Yeah.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:05:03]</strong> Within the Hermes dataset, you know, a lot of it is publicly available stuff that's particularly GPT 4. Of course, Teknium's massive GPTeacher dataset. We also have a bunch of GPT 4 data we had generated that we didn't release necessarily just yet, as well as an instruction set that's particularly focused on tasks like Python, transformers, linguistics, a very small dataset of that. That's inside Hermes that, you know, we don't really talk about much, but figure that we'll put some exposure to right now on the spaces. And yeah.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:05:42]</strong> That's awesome. 
And so the previous Hermes was released on top of LLaMA 1, and for many folks, you know, obviously, they couldn't use this for commercial purposes. And now that this model released, with the models that you guys release, are you thinking about the license of them? And could you talk about, like, the availability of folks using them in a commercial setting now that, you know, the back of it is commercially available.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:06:07]</strong> Mhmm. I think we have Puffin licensed as MIT. I'll have to double-check on our own model. I think that's right, Karan, right, or Tech?</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:06:18]</strong> Yeah. I think so, either that or Apache 2.0. Like, if the base model is commercially usable, you know, the stuff we put out is you're good to go. It's -- Yeah.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:06:29]</strong> So, like, in our announcements, I put in kind of, you know, one of the main things. It's commercially available. The first Nous, as far as I think yeah, I'm pretty sure it's the first commercially available Nous model that's released, and a big differentiator from Hermes is the fact that, like Tech was saying, Hermes is pretty much all single turn data. And it surprisingly can do pretty decent at multiturn conversations when you actually use it. But then Puffin is almost kind of, like, a 180 where it's a vast majority really long context multi turn data.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:07:09]</strong> And oh, I think can you guys hear me so? I can hear. Okay. It's just something's up with that. Okay. Yeah. 
So Puffin is a vast majority multi turn data, GPT 4 specifically, and a lot of it is actually real human conversations with GPT 4 that go on for, some of them, 4K, 6K context, like, even all the way up to the max 8K context length of GPT 4. And then we took those few thousand conversations of real humans interacting with GPT 4. And now after that, I'm not sure if you've heard, but a lot of people probably heard of Camel AI.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:07:46]</strong> So they have the physics, biology, chemistry, and mathematics data set. And then within those, there's a bunch of subtopics that you can carry it through. And I just pretty much spent a good few days curating, just handpicking the right subtopics, like differential geometry, logic problems, optimization problems, a bunch of different GPT 4 examples and responses from those different subtopics. And then I specifically added those in certain ways to the Puffin dataset.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:08:17]</strong> Awesome. So just looking for the audience maybe. The Puffin model, I think the official name is the Redmond Puffin 7B or, sorry, 13B. Yes. This is the model that you guys fine tuned, and one of the first, if maybe not the first, fine tunes of Llama v2 that's now publicly available, like you said, maybe with MIT license, on Hugging Face, and I think you even added the GGML quantized version. Correct? Mhmm. So folks can go and download that and already start playing with this. And so first of all, thank you for contributing to the open source. That's great to see. 
And the speed with which you guys fine tuned this is also great to see.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:08:55]</strong> And maybe now that we've introduced this, maybe this is like repeating a bit. So could you speak about the differences in the data set, in the tasks that you fine tune? Like, what is the actual difference between the Hermes, or the Hermes that's coming out, and the Puffin model? What would people use them for differently? That's the question.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:09:21]</strong> The Puffin model will definitely be better at multi turn stuff. That's for sure. Yeah.</p><p>nisten (<a target="_blank" href="https://twitter.com/nisten">@nisten</a>)<strong>[00:09:28]</strong> So if you want to do anything like OpenAI, I'll paste the link above to the GGML version of it because I'm gonna test it thoroughly, but I really think because you guys have used GPT 4, high quality, multi turn conversations, then this can have actual, like, practical use for whoever else wants to use it, either as, like, something that tells you about the documentation on the site or walks a user through. In other words, this should be better than Hermes for, like, customer service stuff, which is just one example.</p><p>nisten (<a target="_blank" href="https://twitter.com/nisten">@nisten</a>)<strong>[00:10:08]</strong> Anyway, yeah, I'm gonna try. I'll paste the link above.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:10:14]</strong> It's likely better for production use alongside, like, stuff that you have with, like, a retrieval pipeline, like, with LangChain, etcetera. Like, I would believe that without a doubt, you know, or just talking, of course. 
But, you know, even though, you know, with this LIMA technique of small examples we can get, like, a really good model that does really well.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:10:41]</strong> The thing about the Hermes dataset, and just its size and the various types of data and topics that are in there, I think you get a totally different, like, role play or storytelling experience or completion experience with Hermes. Personally, I feel that way.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:11:01]</strong> Awesome.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:11:01]</strong> So and that. Another thing about the Puffin dataset is that it does go up to, like, 8K, and Enrico here has been doing a ton of work on extending Llama's context.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:11:13]</strong> Right. So I wanna give an introduction, then introduce Enrico, and talk about this real quick. Right? LLaMA version 1 was released with, again, 2,000 tokens in the context window. And then many folks, including KaioKenDev and Emozilla, right, and some other folks, I think, were involved in bringing some of the quote-unquote tricks about what eventually ended up being named RoPE scaling, if I'm not mistaken. And we followed this, and we've talked about it on a previous ThursdAI. And Llama V2 was released with 4000 tokens in the context window.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:11:52]</strong> And, you know, we're now so used to kind of Claude and the 16k GPT-3.5 that 4K didn't seem like a lot. 
And then many folks were wondering, and, meanwhile, Enrico was working, whether or not the RoPE scaling method would apply to the next LLaMA, and it looks like it did. And so I wanna introduce Enrico, uh, Enrico Shippole, I hope I'm saying this right. Welcome to the stage. Hopefully, you can unmute and this space works for you. And the second finetune that I saw released was also backed by Nous, the Nous Research one, and this was the extended version, what's called LLongMa.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:12:28]</strong> So Enrico, welcome up to the stage, and feel free to introduce yourself, your affiliation with Nous, and LLongMa with the context window.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:12:38]</strong> Hello. So I'm actually an independent researcher. I'm sponsored by Stability AI, Eleuther AI, and a few other different organizations, including Nous now. Awesome. I work with different people like Tanishq from MedARC, Aran Komatsuzaki, who also is from Eleuther and Duck AI, John Ney from Nomosai. So I have a lot of affiliation with a bunch of different organizations, including Together. We're starting a project right now with them.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:13:13]</strong> That's so great to hear, and so welcome to ThursdAI. And can you talk to us a little bit about kind of the RoPE scaling method and how you were able to, like, fine-tune like this quickly and how the results looked so far? I wasn't able to run this myself. But hopefully, yeah, talk to us about it.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:13:34]</strong> Okay. 
So initially, the thing is I actually was hoping that Emozilla, Bowen, and KaioKenDev would all have been able to make it, because it was kinda like an equal parts effort on, like, all fronts from each of us. Initially, I had trained some PaLM models at 8,000 context length about 4 months ago based on the xPos paper, which did rotary embedding scaling initially. They were one of the first people to do it. They based their methodology off of Ofir Press's ALiBi.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:14:11]</strong> I would imagine that most people are pretty familiar with Ofir Press and his work on the ALiBi positional bias that's been used in a wide range of models now. So Emozilla and I came into contact based off of the work that he had seen me doing with the PaLM models, scaling those to 8000 context length pretraining, not fine tuning. So what we had initially done is basically take a section of C4 and different data sets that had examples that were all over 8000 context length and pretrained on them packed together,</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:14:50]</strong> with a beginning of string and end of string token to help with, like, the attention masking portion of that. After he had seen that, Emozilla actually came into contact with Kaiokendev, I believe Kaiokendev is how you pronounce it. Kaiokendev had also been following Ofir Press's research. He had started working on his own version of scaling the rotary embeddings, I believe based off of both ALiBi and xPos.
And what he found is that by scaling the max positional embeddings and the rotary embedding from something like 2048, which you would initially train with, up to 8000 or 8192, and by applying, like, an interpolation to the encoding by scaling basically the positional index in the rotary embedding, you were able to essentially turn down the frequency window in RoPE by, like, a factor of 0.25.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:16:01]</strong> The scaling depends on the length that you're trying to extrapolate to and the initial context length that the model was trained with. So if you were training with LLaMA 2, which had a context window of 4096, and you wanted to do the linear interpolation positional scaling to something like 8192, then you would use a scaling factor of 0.5. If you were trying to do it from 2048, which is what the original LLaMA was trained with, and you wanted to scale it to 8192, then you would use a scaling factor of 0.25.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:16:39]</strong> So basically, after we had done all of this, Meta had released a paper around the same time that Kaiokendev had released his blog. They both found very similar findings. They had shown in the Meta paper that you only had to fine tune for 1000 steps with the linear positional interpolation scaling to be able to get the benefit of doing a full pretrain at a context window of 8192.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:17:13]</strong> So this is actually a big step because it shows that you no longer need to pre train right off the bat at a longer context length. Then you're able to do the fine tuning on essentially a lower resource, like, computational budget and still be able to get the, like, greater results of the longer context window. 
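The interpolation Enrico describes can be sketched in a few lines. This is a hedged illustration, not the actual LLongMa patch: the function name and head dimension are made up for the example, but the math follows the rule he gives (scaling factor = original context length / target context length).

```python
import numpy as np

def rope_angles(head_dim, max_pos, base=10000.0, scale=1.0):
    """Rotary-embedding angle table. scale < 1.0 is the linear position
    interpolation trick: positions are compressed so a long sequence
    stays inside the angle range the model saw during pretraining."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(max_pos) * scale   # the "two lines of code": t = t * scale
    angles = np.outer(positions, inv_freq)   # shape (max_pos, head_dim // 2)
    return np.cos(angles), np.sin(angles)

# 2048 -> 8192 uses scale 0.25; 4096 -> 8192 would use scale 0.5.
cos_long, sin_long = rope_angles(128, 8192, scale=0.25)
cos_base, sin_base = rope_angles(128, 2048, scale=1.0)
```

With scale 0.25, position 8188 lands on exactly the angles that position 2047 produced in the unscaled table, which is why the fine-tuned model never sees rotations outside its pretraining range.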
I know, just from my work and personal research with many of them, that a lot of the major AI companies had been doing staged scaling of the context window during training.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:17:46]</strong> So when pre training, they would basically separate the initial examples from a dataset into multiple stages.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:17:54]</strong> So anything that is under the window of 2048, you'd separate from the initial dataset, then you take things between 2048 and 4096, then 4096 and 8192, and you would basically chunk the data sets into those different parts. You'd first initially train on the 2048 chunk of the data, then you would train on the data between 2048 and 4096, and then you would do the same thing from 4096 to 8192, or if you want to scale that to 16k or 32k context length. But what we have shown now with both the Meta paper and this thing, you don't even need to go through that extensive pretraining and staged process, you can just go from a context length of 2048 to 8192</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:18:47]</strong> and scale the rotary embeddings by whatever type of factor that you want to use. So like I was saying, if you're going from 2048 to 8192, you'd be using a scaling factor of 0.25. It only needs 2 lines of code to be able to do that. In the LLongMa post, I had provided an example of scaling the rotary embeddings. 
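The staged curriculum he contrasts this with can be sketched as a simple length-bucketing pass. The bucket boundaries and function name here are illustrative assumptions, not anyone's published pipeline:

```python
def stage_buckets(example_lengths, stages=(2048, 4096, 8192)):
    """Split examples into the staged-training chunks described above:
    <=2048 first, then 2049-4096, then 4097-8192, trained in that order.
    The interpolation result means this whole curriculum can be skipped."""
    buckets, lower = {}, 0
    for upper in stages:
        buckets[upper] = [n for n in example_lengths if lower < n <= upper]
        lower = upper
    return buckets
```

For instance, `stage_buckets([100, 3000, 5000, 8192])` groups 100 into the 2048 stage, 3000 into the 4096 stage, and the rest into the 8192 stage; anything longer than the last boundary is dropped.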
The code was written by Emozilla, or Jeff.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:19:15]</strong> After all these experiments, we then came into contact with Bowen, who had worked a lot on the dynamic NTK scaling with Emozilla, and he had also done NTK by parts, which we're currently training a lot of models on. So we have the LLongMa 1 models trained on the OpenLLaMA series, like the suite of those models, that use the linear interpolation scaling.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:19:45]</strong> We now have the Llama 2 models, or the LLongMa 2 suite, which is what we're calling it, again trained on the linear interpolation scaling. And then we have another suite of models coming out very soon that uses the NTK by parts dynamic scaling. That was really specialized by Bowen, so I do not wanna speak on his behalf. It'd probably be good to get him to talk about it in another one of these.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:20:14]</strong> Absolutely. So let's get in touch after this and set it up. So thank you for the very in-depth kind of explanation, because we did cover the kind of the RoPE scaling and how Kaiokendev started this in his blog post, and then how it iterated. So it's great to actually hear from the folks who are doing this. 
Just for the audience, I've attached Enrico's tweet about LLongMA 2, which is currently trained at 8K context length.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:20:47]</strong> And Enrico, you told us that we may see even double that. So could you talk about the next version?</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:20:56]</strong> Okay. So the initial training process of doing this up to a context length of 8192 can be done, basically, with DeepSpeed ZeRO-2 and activation checkpointing. And you're able to fit the model on an A100 80 gigabyte node. Now, we are working on the process of scaling it both to 16k and 32k. This requires a different methodology during training: you either need to use DeepSpeed ZeRO-3 or fully sharded data parallelism.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:21:35]</strong> Both of those are very similar, for people who aren't aware. Basically, you're just sharding the optimizer states and the model states across, like, different nodes. You can also use things like tensor parallelism to help with the scaling as well. 
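The two regimes Enrico contrasts (ZeRO-2 plus activation checkpointing for 8K on one A100 node, ZeRO-3 or FSDP for 16k/32k) map onto DeepSpeed config fragments roughly like this. This is a minimal sketch of DeepSpeed's JSON config schema only; a real run also needs batch size, optimizer, and precision settings:

```python
# Hedged sketches of DeepSpeed-style configs for the two regimes described.
# Only the sharding-related keys are shown.

zero2_8k = {
    "zero_optimization": {"stage": 2},  # shard optimizer states + gradients
    "activation_checkpointing": {"partition_activations": True},
}

zero3_16k = {
    # Stage 3 additionally shards the parameters themselves across workers,
    # which is what makes the 16k/32k fine-tunes fit in memory.
    "zero_optimization": {"stage": 3},
}
```

FSDP in PyTorch implements essentially the same idea as ZeRO-3, which is why Enrico presents them as interchangeable here.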
And then we're going to be basically just adjusting the scaling factor again. We've already collected a large quantity of data at 16k context length, and we're going to be doing the fine tuning to 16k and be releasing those models soon. All of this computing is sponsored by Stability AI.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:22:12]</strong> They've been very generous in helping with a lot of the independent research.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:22:17]</strong> So I wanna shout out Stability AI for not only giving, you know, the world Stable Diffusion, but also participating in this kind of next wave of AI. Many folks kinda coined the Stability AI moment when they released Stable Diffusion, I wanna say 1.4, back then, almost a year ago now, and many folks are saying the same about the Llama 2 release, now that it's commercially open source and folks, you know, for-profit companies, can join. So we definitely wanna shout out Stability for the effort here. And, Enrico, thank you. And, folks, please follow Enrico, and we'll stay tuned.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:22:56]</strong> I wanna ask Karan and Teknium, and other folks from Nous, about the efforts that Enrico was talking about, the longer context windows. How would they kinda interplay with the stuff that you're working on with Hermes, with Puffin? Are the efforts interchangeable? 
Are we gonna see them building on top of each other?</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:23:16]</strong> So I think LDJ can definitely speak to this, but I'd like to happily say that once we did LLongMa 1 on the first Llama generation of models, we already had Puffin 2K, 4K, and 8K for that -- Yeah. -- already prepared and ready. So as the LLongMa models for 13B are released, we will also be doing equivalent Puffin fine tunes, and potentially Hermes fine tunes. We can talk a little bit more about the future of Hermes a little bit later, though.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:23:51]</strong> Yeah. I mean, I was pretty much going to say the same thing, but kind of elaborate on that, about how it was before, when LLongMa V1 and everything. And during the development of LLongMa, there was actually, like, you know, of course, me, Enrico, who is usually just called conceptofmind, and Emozilla. Like, we've all kinda, like, been butting shoulders a lot together and just kinda working closely, you know, in the same Discord and whatnot. And it's like, hey. Like, you know, working on this, like, experimental LLongMa thing. Like, hey. You wanna try, like, fine tuning? And then the plan just kind of ended up being like, okay. Just gonna have this Puffin thing.
Let's get this out ASAP, and then we'll figure out what we're gonna do later.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:24:58]</strong> Yeah. Great. And it's just great to see, you know, how many opportunities like this there are, where with open source the stuff that we're able to now run and iterate on is building on top of each other. It's just incredible, and this is maybe a watershed moment. And I wanna thank all of you for being here. I wanna kind of let the other folks who are usually here on ThursdAI ask a question or two of our Nous visitors. Yam and Nisten, if you have a question for Nous or for Enrico, go ahead.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:25:29]</strong> I know if you have to ask the super deep technical stuff, the audience will, like, it will fly over their heads; I'll be using the DM with LDJ and Enrico. But yeah. Of course, the stuff that we haven't covered and is interesting to Nous, feel free, as it pertains to LLaMA 2; it's gonna be very interesting, I think, for everyone.</p><p>nisten (<a target="_blank" href="https://twitter.com/nisten">@nisten</a>)<strong>[00:25:47]</strong> Just to quickly clarify, you guys fine tuned the plain model. Right? Not the chat one.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:25:55]</strong> Yep. Okay. Yep. The base model. We didn't fine tune the chat one at all.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:26:00]</strong> Actually, to -- Yeah. -- to maybe continue this, sorry for interrupting. Just one sec. To continue this question, there are models that were released by Meta, and you have to, like, register and get the email and everything. 
And then they put some stuff on Hugging Face, and those models were delineated with, like, a dash-HF suffix. Have you guys used the Hugging Face ones or the Meta ones, and do you know the difference? I've heard from somebody that, like, one of them maybe doesn't work as well. Yeah.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:26:30]</strong> The ones on Hugging Face are in FP16 and the original LLaMa 2 models are in BF16, but we tested the difference between the two at Carper, and there's such a negligible difference in their quality that it's irrelevant. We trained on the Hugging Face FP16 ones.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:26:52]</strong> Sorry, Karan, for interrupting. Go ahead.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:26:56]</strong> No. All good.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:26:58]</strong> I totally forgot what I interrupted today. Okay. Nisten, if you have a follow-up question for Karan, feel free. And if not, then Yam, if you have anything that you wanna ask the fine folks from Nous, feel free as well.</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:27:17]</strong> Yeah. Sure. First, thank you for what you're doing, guys. You're really making a difference. There aren't many demos online, so anyone that didn't try Hermes, I highly encourage you to try it. I know why there aren't demos -- demos cost money -- but just try it, okay? And now I've got a question, because from my experience, if you train on the open datasets of Hermes, you get a significantly lower quality model.
Now, I'm fine if you don't release datasets, don't get me wrong.</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:27:54]</strong> I just wanted to ask, is there anything else besides the data that is different? What tips can you give for, I don't know, someone else that wants to train a high-quality model, besides having high-quality data?</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:28:08]</strong> Yeah. The hyperparameters can make a key difference. LDJ knows very well, because we had to do a ton of different tests to find the hyperparameters for the Puffin model. I'm not sure if those are on the model card for Hermes; if they're not, I can put them up. And Karan can probably talk about the Nous datasets that weren't made public.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:28:38]</strong> Yeah. We've got, like, maybe around 50k items of data, versus, like, 300k total instructions there, that are not released. And to be frank with you, about 45k of them are just more GPT-4, like, alpaca-style instructions. The 5,000 or so, the, like, 4,500 of them, compose this dataset we've been working on that, you know, at this point, I'm pretty comfortable talking about. We call it the p-dactyl dataset.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:29:14]</strong> I won't speak on everything that's in it, but essentially -- and I don't know if this is the thing that made the big difference -- it's, like, the one place where we deviate from just using the open datasets and more GPT-4 instructions. It's got some transformers instructions, some linguistics instructions, some calculus 1 instructions, etcetera.
It seems to be pretty good.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:29:41]</strong> Also, Yam, do you have links or anything to the models that tried it with just the makeup of the datasets that were public from Hermes? Because I haven't actually seen that before.</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:29:57]</strong> Again? Can you repeat that?</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:29:58]</strong> Didn't hear. Do you have any links to the models that trained with just the open datasets from Hermes that you could share with me later?</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:30:06]</strong> No, no, it's just from my experiments -- Oh, okay. -- on training. Pretty much following the same idea of: let's take only GPT-4 data from all the open datasets. And the model that you get is different, for sure. And it might be the hyperparameters, you know.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:30:25]</strong> Another thing that we did too is pretty extensive, like, cleaning. We did do deduplication. We removed things like URLs -- like, any response that had a URL in it, we removed, in case it was a hallucinated URL. We had, like, maybe 8 different filtering processes too that might have made our data quality higher.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:30:48]</strong> And the "as an AI language model" responses?</p><p>nisten (<a target="_blank" href="https://twitter.com/nisten">@nisten</a>)<strong>[00:30:51]</strong> For anybody -- What did you say? -- for anybody in the audience: hyperparameters are just like the settings on the oven.
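(The kind of filtering Teknium describes, deduplication, dropping URL-bearing responses, stripping refusal boilerplate, can be sketched in a few lines. This is a minimal illustration only; the record layout and field names are hypothetical, not the actual Hermes pipeline.)

```python
import re

def clean_dataset(examples):
    """Filter a list of {"instruction", "response"} records (hypothetical layout):
    drop responses containing URLs (possible hallucinations), drop refusal
    boilerplate, and deduplicate exact responses."""
    url_re = re.compile(r"https?://\S+")
    seen = set()
    cleaned = []
    for ex in examples:
        resp = ex["response"]
        if url_re.search(resp):
            continue  # any response with a URL is removed, per the discussion above
        if resp.startswith("As an AI language model"):
            continue  # refusal boilerplate
        key = resp.strip().lower()
        if key in seen:
            continue  # exact-duplicate response
        seen.add(key)
        cleaned.append(ex)
    return cleaned
```

The real cleanup reportedly involved around eight distinct filtering passes; this sketch shows only three of the ones mentioned on the show.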
So it looks here like the ingredients were all okay, but Yam messed something up with the settings, and -- Yeah. -- the model came out half-baked.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:31:08]</strong> So we're gonna have to check that out.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:31:10]</strong> I'm personally a big proponent of hyperparameter optimization being underrated right now, like, in -- Yeah. -- the current space. And that's something I've focused on a lot, specifically for things like Puffin, and just trying to help others optimize what they're doing. And even just something like what you just said about the settings for the oven: double the amount of time you're putting something in the oven, and it's not gonna come out twice as good. It's not even gonna come out 10% better. It's gonna come out worse, you know?</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:31:45]</strong> Although it depends, like, what your baseline is for how much time you're putting it in the oven, and all these different variables that are dependent on each other and affect each other. So it's definitely something you kind of have to build an intuition about, to some degree. And on the other end, I really feel like there has to be more investment, more time and energy, into actual tools that make hyperparameter optimization easier for people that are doing these things.</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:32:13]</strong> Yeah. Yeah. And the thing is that the models are really big, so it's really expensive to run them. So you have a trade-off of how much compute you're investing in searching hyperparameters rather than actually using it for training.
But I completely agree. So, one last question, actually.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:32:33]</strong> Actually, one thing before we go on. Something great about the Puffin dataset is that it's just, like, 3,000 or so examples, I believe. And so it makes tuning a lot less expensive, because you can finish the whole training in just a couple of hours. So, like, with Hermes, if we wanted to try full ablations, and dozens of them, it would take weeks to do.</p><p>LDJ (<a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a>)<strong>[00:32:55]</strong> Yeah. Well, to be fair, it's not like it only takes a couple hours on one GPU. We used an A100 80-gigabyte. So. Yeah.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:33:04]</strong> Courtesy of Redmond.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:33:05]</strong> Thank you, Redmond.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:33:08]</strong> Mhmm. I should also probably clarify that when doing the context length extrapolation, we're doing it on 1 billion tokens and 64 80-gigabyte A100s.</p><p>Yam Peleg (<a target="_blank" href="https://twitter.com/Yampeleg">@Yampeleg</a>)<strong>[00:33:20]</strong> Oof. Mhmm.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:33:23]</strong> Yeah, Yam is getting over-excited. Alright, folks. I wanna maybe ask one last thing and then we'll move on to the regular ThursdAI update cadence. But I will say, to the folks from Nous Research and Enrico and some others here: thank you so much for coming up and giving us insight into how this actually happens.
LLaMa 2 just released, you know, a few days ago, and you guys are already pumping out, like, open source fine-tuned models. It's great to see. And just so you know, there's always a stage for you here to come and announce things.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:33:53]</strong> And if you do wanna announce, like, a release or something, maybe just, you know, right now -- Karan and Teknium and some folks -- I would love to hear when the next Hermes is coming.</p><p>karan (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>)<strong>[00:34:06]</strong> Before we say that, I just would like to clarify something about Hermes. So we have the original Hermes dataset on LLaMa 2 as something that we will release, but also a sequel to the Hermes dataset, Hermes 2. There will be a distinction between these two, and you'll see the former come out first and the latter come out after. But as for the release, etcetera, I will absolutely let Teknium take the stage with those final words.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:34:36]</strong> So the training is nearly done. At least, it was about 2.8 epochs out of 3 a few hours ago, so it might be done already. Before I release it, though -- unlike Puffin. We wanted Puffin out, like, the same day that LLaMa 2 came out, so we didn't run any benchmarks on it. And we had to put all the compute we had on Hermes immediately after we were done with that.
So we don't have any compute to do any benchmarks on Puffin until Hermes is done.</p><p>Teknium (e/λ) (<a target="_blank" href="https://twitter.com/Teknium1">@Teknium1</a>)<strong>[00:35:06]</strong> But before I release Hermes, I do wanna do, like, a full range of benchmarks and stuff like that to make sure everything's good, and have a pretty detailed model card. But that should probably only take the rest of tonight at most. So probably tomorrow morning would be when Hermes comes out.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:35:22]</strong> That's it, folks. You heard it here first, and definitely follow Teknium, Karan, Enrico, LDJ, and the rest of the Nous Research folks, and stay tuned. Enrico, go ahead.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:35:34]</strong> Yes. I just wanted to piggyback off of Teknium's comment a little bit. So we did do a pretty extensive evaluation of the LLaMa 2 8K models. We ran perplexity using Gov Report and a couple of other datasets to make sure that the length extrapolation in the context was working properly. We did passkey retrieval. We also did a lot of extensive human evaluation, which took a little bit. I had wanted to get the LLaMa 2 8K models out yesterday, but we decided to push it back one day.</p><p>Enrico Shippole (<a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>)<strong>[00:36:08]</strong> What we were doing is feeding in research papers and seeing if it could pull out relevant pieces of information from anywhere in the context length. And so far, it has been quite successful.
So we're still running more evals, but the ones so far have shown that there's been, like, no performance degradation, no matter what context length you're using with these extended models.</p><p>Alex Volkov - targum.video (<a target="_blank" href="https://twitter.com/altryne">@altryne</a>)<strong>[00:36:32]</strong> That sounds great. And now that LLongMa is out and the next versions are gonna come out as well, I'm sure that some other folks will also contribute to this research and tell you, like, about their own experiences and vibes. So, yeah, I wanna thank you folks. Again, this has been very illuminating, and we're very glad to have you. And, obviously, the stage is yours whenever you want to come here, and we appreciate you. You guys are welcome to stay tuned and chime in on the rest of the updates. And with that, I think, for folks in the audience, we're moving to the next thing.</p><p></p><p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-special-episode-interview</link><guid isPermaLink="false">substack:post:135347804</guid><dc:creator><![CDATA[Alex Volkov, Enrico Shippole, desiderata, and Teknium]]></dc:creator><pubDate>Sun, 23 Jul 2023 17:02:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135347804/201841ccbeeb56ab3e4bd08582fa14ee.mp3" length="35577668" type="audio/mpeg"/><itunes:author>Alex Volkov, Enrico Shippole, desiderata, and Teknium</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>2224</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135347804/6fede48663dc083c06c863d9621f6250.jpg"/></item><item><title><![CDATA[ThursdAI July 20 - LLaMa 2, Vision and multimodality for all, and is GPT-4 getting dumber?]]></title><description><![CDATA[<p><p>ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></p><p>If you’d like to hear the whole 2 hour conversation, <a target="_blank" href="https://twitter.com/altryne/status/1682132940564881408?s=20">here’s the link to twitter spaces</a> we had. And if you’d like to add us to your favorite podcatcher - here’s the <a target="_blank" href="https://api.substack.com/feed/podcast/1801228.rss">RSS link</a> while we’re pending approval from Apple/Spotify</p><p>Happy LLaMa day! Meta open sourced LLaMa v2 with a fully commercial license. 
</p><p>LLaMa 1 was considered the best open source LLM; this one can be used for commercial purposes, unless you have more than 700MM monthly active users (no 🦙 for you, Google!)</p><p>Meta has released the code and weights, and this time around, also a fine-tuned chat version of LLaMa v2 to all, and has put them on HuggingFace. </p><p>There are already (3 days later) at least 2 models that have fine-tuned LLaMa2 that we know of: </p><p>* <a target="_blank" href="https://huggingface.co/NousResearch">@nousresearch</a> have released <a target="_blank" href="https://colab.research.google.com/drive/1A0L8vXljU0my6TokMT5-VSg7f6lYWgvj?usp=sharing">Redmond Puffin 13B </a></p><p>* <a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>, in collaboration with Nous, has released <a target="_blank" href="https://twitter.com/EnricoShippole/status/1682054848584228866">LLongMa</a>, which extends the context window for LLaMa to 8K (and is training a 16K context window LLaMa) </p><p>* I also invited and had the privilege to interview the folks from the @nousresearch group (<a target="_blank" href="https://twitter.com/karan4d">@karan4d</a>, <a target="_blank" href="https://twitter.com/teknium1">@teknium1</a> <a target="_blank" href="https://twitter.com/Dogesator">@Dogesator</a> ) and <a target="_blank" href="https://twitter.com/EnricoShippole">@EnricoShippole</a>, which will be published as a separate episode.</p><p>Many places already let you play with LLaMa2 for free:  </p><p>* https://www.llama2.ai/</p><p>* <a target="_blank" href="https://huggingface.co/chat/">HuggingFace chat</a></p><p>* <a target="_blank" href="https://llama.perplexity.ai">Perplexity LLaMa chat</a></p><p>* nat.dev, replicate and a bunch more! </p><p>One caveat: the new LLaMa is not that great with code (like, at all!), but expect this to change soon!</p><p>We all just went multi-modal! Bing just got eyes!</p><p>I’ve been waiting for this moment, and it’s finally here. 
We all have access to the best vision + text model, the GPT-4 vision model, via Bing! (and also Bard, but… we’ll talk about it) </p><p>Bing chat (which runs GPT-4) has now released an option to upload (or take) a picture and add a text prompt, and the model that responds understands both! It’s not OCR, it’s an actual vision + text model, and the <a target="_blank" href="https://twitter.com/altryne/status/1680325172761616384">results are very impressive</a>! </p><p>I personally took a snap of a food-truck’s side and asked Bing to tell me what they offer; it found the name of the truck, searched it online, found the menu and printed out the menu options for me! </p><p>Google’s Bard also introduced their Google Lens integration, and many folks tried uploading a screenshot and asking it for React code to create that UI, and well… it wasn’t amazing. I believe it’s due to the fact that Bard is using the Google Lens API and was not trained in a multi-modal way like GPT-4 was. </p><p>One caveat: same as with text models, Bing can and will hallucinate stuff that isn’t in the picture, so YMMV, but take this into account. It seems that at the beginning of an image description it will be very precise, but then as the description keeps going, the LLM part kicks in and starts hallucinating. </p><p></p><p>Is GPT-4 getting dumber and lazier? </p><p>Researchers from Stanford and Berkeley (and Matei Zaharia, the CTO of Databricks) have tried to evaluate the vibes and complaints that many folks have been sharing: whether the GPT-4 and GPT-3.5 updates from June had degraded capabilities and performance. </p><p>Here’s the link to <a target="_blank" href="https://arxiv.org/pdf/2307.09009.pdf">that paper</a> and the <a target="_blank" href="https://twitter.com/matei_zaharia/status/1681467961905926144?s=46&#38;t=YWI8HBextIRGxS6sxpjfDg">twitter thread </a> from Matei. 
</p><p>They have evaluated the 0301 and the 0613 versions of both GPT-3.5 and GPT-4 and have concluded that at some tasks, there’s degraded performance in the newer models! Some reported drops as high as 90% → 2.5% 😮</p><p>But is there truth to this? Well, apparently, some of the methodologies in that paper lacked rigor, and the fine folks at <a target="_blank" href="https://open.substack.com/pub/aisnakeoil">AI Snake Oil</a>  ( <a target="_blank" href="https://substack.com/profile/891603-sayash-kapoor">Sayash Kapoor</a> and Arvind Narayanan) have done a great deep dive into that paper and found very interesting things!</p><p>They smartly separate capability degradation from behavior degradation, and note that on the 2 tasks (Math, Coding) where the researchers noted a capability degradation, the methodology was flawed: there isn’t in fact any capability degradation, rather a behavior change and a failure to account for a few examples. </p><p>The most frustrating for me was the code evaluation: the researchers scored both the previous models and the new June-updated models on “code execution” with the same prompt. However, the new models defaulted to wrapping the returned code in ```, i.e. markdown code fences. This could have been easily fixed with some prompting; however, the researchers scored the task based on whether or not the code snippet they get is “instantly executable”, which it obviously isn’t with the ``` in there. </p><p>So, they haven’t actually seen and evaluated the code itself, just whether or not it runs! </p><p>I really appreciate the <a target="_blank" href="https://open.substack.com/pub/aisnakeoil">AI Snake Oil</a> deep dive on this, and recommend you all read it for yourself and make your own opinion; don’t give in to the hype and scare mongering and twitter thinkfluencer takes. 
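To make the markdown-fence issue concrete: before judging whether a generated snippet is “instantly executable”, the grader only needed to strip the ``` wrapper, which is a few lines of code. A minimal sketch (a hypothetical helper, not the paper’s actual harness):

```python
def strip_markdown_fences(text: str) -> str:
    """Remove a surrounding ```python ... ``` wrapper, if present, so the
    snippet itself can be executed. Leaves unfenced text unchanged."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]   # drop the opening fence (``` or ```python)
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines)
```

With a preprocessing step like this, the June models’ answers would have been executable again, which is why the “90% → 2.5%” coding drop reflects a formatting change, not a capability loss.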
</p><p></p><p>News from OpenAI - Custom Instructions + Longer deprecation cycles</p><p>In response to the developers (and the above paper), OpenAI announced an update to the deprecation schedule of the 0301 models (the ones without functions), and they will keep those models alive for a full year now! </p><p>Additionally, OpenAI has released “<a target="_blank" href="https://openai.com/blog/custom-instructions-for-chatgpt">Custom Instructions for ChatGPT</a>”, which allows a ChatGPT user to store custom instructions, information, and a custom prompt that will be saved on OpenAI’s server side and appended to every new session of yours with ChatGPT. </p><p>Think personal details, preferred coding style (you love Ruby and not Python), and other incredible things you can achieve without copy-pasting this into every new session! </p><p>Don’t forget to enable this feature (unless you’re in the UK or EU, where this isn’t available) </p><p></p><p>Thanks for tuning in: whether you’re a newsletter subscriber, twitter space participant, or just someone who stumbled onto this post, if you find this interesting, subscribe and tell your friends!</p><p>“We stay up to date so you don’t have to” is the #ThursdAI motto! 🫡</p><p>In other news this week: </p><p>LangChain has gotten some flak, but they are looking ahead and releasing <a target="_blank" href="https://docs.smith.langchain.com/">LangSmith</a>, an observability framework for your agents, that does NOT require using LangChain! </p><p>It looks super cool, and is very useful to track multiple prompts and tokens across agent runs! And the results are <a target="_blank" href="https://smith.langchain.com/public/92835ead-e1da-45b2-b99d-4ff1a0f6c301/r">share-able</a>, so you can take a look at great runs and share yours with friends! </p><p>Don’t forget to share this with your friends and come back next week 🫡</p><p>— Alex Volkov</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-20-llama-2-vision-and</link><guid isPermaLink="false">substack:post:135311231</guid><dc:creator><![CDATA[Alex Volkov, Nisten, yam, and Roie Schwab Cohen]]></dc:creator><pubDate>Fri, 21 Jul 2023 00:27:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/135311231/ed835f40ba5f7abdf68c89c34c0db51d.mp3" length="14076491" type="audio/mpeg"/><itunes:author>Alex Volkov, Nisten, yam, and Roie Schwab Cohen</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>880</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/135311231/16ff551793020ce90cc07c6976339a42.jpg"/></item><item><title><![CDATA[ThursdAI July 13 - Show recap + Notes]]></title><description><![CDATA[<p>Welcome, friends, to the first episode of the ThursdAI recap. </p><p><p>If you can’t come to the spaces, subscribing is the next best thing. Distilled, most important updates, every week, including testimony and tips and tricks from a panel of experts. Join our community 👇</p></p><p>Every week since the day GPT-4 released, we’ve been meeting in twitter spaces to talk about AI developments, and it slowly but surely created a community that’s thirsty to learn, connect and discuss information. </p><p>Overwhelmed with daily newsletters about tools, folks wanted someone else to do the legwork: prioritize and condense the most important information about what is shaping the future of AI, today! </p><p>Hosted by AI consultant Alex Volkov (available for hire), CEO of Targum.video, this information-packed edition covered groundbreaking new releases like GPT 4.5, Claude 2, and SDXL 1.0. 
We learned how Code Interpreter is pushing boundaries in computer vision, creative writing, and software development. Expert guests dove into the implications of Elon Musk's new XAI startup, the debate around Twitter's data, and pioneering techniques in prompt engineering. If you want to stay on top of the innovations shaping our AI-powered tomorrow, join Alex and the ThursdAI community. </p><p>Since the audio was recorded from a twitter space, it has quite a lot of overlaps (I think it’s due to the export), so sometimes it sounds like folks talk on top of each other, most of all me (Alex); this was not the case. I will have to figure out a fix. </p><p>Topics we covered in the July 13 ThursdAI: </p><p><strong>GPT 4.5/Code Interpreter:</strong></p><p>00:02:37 - 05:55 - General availability of ChatGPT with code interpreter announced. 8k context window, faster than GPT-4.</p><p>05:56 - 08:36 - Code interpreter use cases, uploading files, executing code, skills and techniques.</p><p>08:36 - 10:11 - Uploading large files, executing code, downloading files.</p><p><strong>Claude V2:</strong></p><p>20:11 - 21:25 - Anthropic releases Claude V2, considered #2 after OpenAI.</p><p>21:25 - 23:31 - Claude V2 UI allows uploading files, refreshed UI.</p><p>23:31 - 24:30 - Claude V2 product experience beats GPT-3.5.</p><p>24:31 - 27:25 - Claude V2 fine-tuned on code, 100k context window, trained on longer outputs.</p><p>27:26 - 30:16 - Claude V2 good at comparing essays, creative writing.</p><p>30:17 - 32:57 - Claude V2 allows multiple file uploads to context window.</p><p>32:57 - 39:10 - Claude V2 better at languages than GPT-4.</p><p>39:10 - 40:30 - Claude V2 allows multiple file uploads to context window.</p><p><strong>X.AI:</strong></p><p>46:22 - 49:29 - Elon Musk announces <a target="_blank" href="http://x.ai">X.AI</a> to compete with OpenAI. 
Has access to Twitter data.</p><p>49:30 - 51:26 - Discussion on whether Twitter data is useful for training.</p><p>51:27 - 52:45 - Twitter data can be transformed into other forms.</p><p>52:45 - 58:32 - Twitter spaces could provide useful training data.</p><p>58:33 - 59:26 - Speculation on whether XAI will open source their models.</p><p>59:26 - 61:54 - Twitter data has some advantages over other social media data.</p><p><strong>Stable Diffusion:</strong></p><p>89:41 - 91:17 - Stable Diffusion releases SDXL 1.0 in discord,  plans to open source it.</p><p>91:17 - 92:08 - Stable Diffusion releases <a target="_blank" href="https://clipdrop.co/stable-doodle">Stable Doodle</a>.</p><p>GPT Prompt Engineering:</p><p>61:54 - 64:18 - Intro to Other Side AI and prompt engineering.</p><p>64:18 - 71:50 - GPT Prompt Engineer project explained.</p><p>71:50 - 72:54 - GPT Prompt Engineer results, potential to improve prompts.</p><p>72:54 - 73:41 - Prompts may work better on same model they were generated for.</p><p>73:41 - 77:07 - GPT Prompt Engineer is open source, looking for contributions.</p><p></p><p>Related tweets shared: </p><p><a target="_blank" href="https://twitter.com/altryne/status/1677951313156636672">https://twitter.com/altryne/status/1677951313156636672</a></p><p><a target="_blank" href="https://twitter.com/altryne/status/1677951330462371840">https://twitter.com/altryne/status/1677951330462371840</a></p><p><a target="_blank" href="https://twitter.com/sdand/status/1678476411416498178?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1678476411416498178%7Ctwgr%5Ef0cc918ac68a52d99d66e5f7dc2dbad4fe1851bf%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkggrRgPKj%2Fthursdai-space-code-interpreter-claude-v2-xai-sdxl-more">@Surya - Running GPT2 inside code interpreter</a> </p><p>tomviner - <a target="_blank" 
href="https://twitter.com/tomviner/status/1678559722931122178?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1678559722931122178%7Ctwgr%5Ef0cc918ac68a52d99d66e5f7dc2dbad4fe1851bf%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkggrRgPKj%2Fthursdai-space-code-interpreter-claude-v2-xai-sdxl-more">scraped all the internal knowledge about the env</a></p><p><a target="_blank" href="https://twitter.com/Peter_0_0_g/status/1678907200511565827?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1678907200511565827%7Ctwgr%5Ef0cc918ac68a52d99d66e5f7dc2dbad4fe1851bf%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkggrRgPKj%2Fthursdai-space-code-interpreter-claude-v2-xai-sdxl-more">Peter got all pypi packages and their description</a></p><p><a target="_blank" href="https://substack.com/profile/89230629-swyx">swyx</a> <a target="_blank" href="https://twitter.com/swyx/status/1678944036135260160?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1678944036135260160%7Ctwgr%5Ef0cc918ac68a52d99d66e5f7dc2dbad4fe1851bf%7Ctwcon%5Es1_c10&#38;ref_url=https%3A%2F%2Fspacesdashboard.com%2Fspace%2F1YpKkggrRgPKj%2Fthursdai-space-code-interpreter-claude-v2-xai-sdxl-more">added Claude to to smol menubar</a> (which we also discussed)</p><p>SkalskiP <a target="_blank" href="https://github.com/SkalskiP/awesome-chatgpt-code-interpreter-experiments">awesome code interpreter experiments repo</a></p><p></p><p>See the rest of the tweets shared and listen to the original space here:</p><p>https://spacesdashboard.com/space/1YpKkggrRgPKj/thursdai-space-code-interpreter-claude-v2-xai-sdxl-more</p><p></p><p><strong>Full Transcript:</strong> </p><p>00:02 	(Speaker A)	You. First of all, welcome to Thursday. We stay up to date so you    </p><p>don't have to. There's a panel of experts on top here that discuss   </p><p>everything.                                                          
</p><p>00:11 	(Speaker A)	If we've tried something, we'll talk about it. If we haven't, and somebody in the audience tried that specific new AI stuff, feel free to raise your hand and give us your comment. This is not the space for long debates.</p><p>00:25 	(Speaker A)	We actually had a great place for that yesterday: NISten and Roy from Pine, some other folks. We'll probably do a different one. This should be information-dense for folks, and this will be recorded and likely posted at some point.</p><p>00:38 	(Speaker A)	So no debate, just let's drop an opinion, discuss the new stuff, and kind of continue. And the goal is to stay up to date so you don't have to, in the audience. And I think with that, I will say hi to Al and Janae, and we will get started.</p><p>00:58 	(Speaker B)	Hi everyone, I'm NISten Tahira. I worked on, well, released one of the first doctor chatbots on the market, for Dr. Gupta, and scaled it, and now we're working on getting the therapist bot out once we can pass more testing and get voice to work in a profitable manner, because we don't really have VC. So at the scale of a few hundred thousand users, the API bills matter quite a bit.</p><p>01:31 	(Speaker B)	So, yeah, these spaces have been pretty helpful, because I had some trouble with running a voice transformer, trying to run it in the browser on WebGPU, and then the person that wrote Transformers.js comes in here and just says, oh yeah, that backend is messed up, just try BLAS and such and stuff. So these have been very interesting and technical spaces.</p><p>01:54 	(Speaker A)	Yeah, we need to get Zenova in here. 
Zenova is the guy who NISten was</p><p>referring to. Al Janae, do you want to give a few words of intro and </p><p>say hi and then we'll start? Just briefly, please, because I think we</p><p>need to get going.                                                   </p><p>02:09 	(Speaker C)	Sure. Hi, I'm Janae.                                                 </p><p>02:11 	(Speaker D)	I'm the resident noob, I started messing around with AI at the       </p><p>beginning of.                                                        </p><p>02:16 	(Speaker E)	The year, and I also host the.                                       </p><p>02:18 	(Speaker D)	Denver AI Tinkerers coming up next week.                             </p><p>02:20 	(Speaker A)	And if you're in Colorado area, greater Denver, please join us. It's </p><p>going to be a blast.                                                 </p><p>02:27 	(Speaker F)	Hi, I'm Al Chang. I'm kind of an old school technologist. Just       </p><p>getting started with the AI again and just here to help.             </p><p>02:36 	(Speaker A)	Yeah. All right, folks, so I think we've had a whole space on this.  </p><p>Simon Wilson and me and many, many other folks chimed in. The second </p><p>this was released.                                                   </p><p>02:50 	(Speaker A)	Was that six? Was that Sunday? It's hard to keep track of actual     </p><p>days. Saturday, Saturday, last week, exactly during those spaces, by </p><p>the way, as we were talking, Chad GPT, Logan and everybody else from </p><p>OpenAI announced general availability of Chad GPT with code          </p><p>interpreter. So GPT four with code interpreter.                      </p><p>03:12 	(Speaker A)	And I think we just heard from Matt that even some folks who got     </p><p>access to the slept on it a little bit because it's maybe potentially</p><p>because of its very horrible name that's really hard to type         </p><p>interpreter and get lost in the R's. 
But it's an extremely powerful  </p><p>new superpower that we've got. And we've had the whole space talking </p><p>about use cases that people already had.                             </p><p>03:37 	(Speaker A)	It was like three days into it and since then I bet that many more   </p><p>people tried it. I think Swyx 20,000 listens to that space, plus the </p><p>pod. At least people definitely want to hear new use cases, right?   </p><p>03:53 	(Speaker G)	Yeah, not much else to add about it. I think it's the feature for    </p><p>Switch.                                                              </p><p>03:59 	(Speaker A)	Posted a whole deep dive essay and coined it GPT 4.5 between us      </p><p>friends. And one of the interesting things about it is that we think </p><p>at least that's where we are currently after playing around with     </p><p>this, is that it's a fine tuned model. So they kept training this on </p><p>actually running code and executing code.                            </p><p>04:21 	(Speaker A)	That's what we believe. We don't know, nobody confirmed this and then</p><p>that it's fine tuned from an earlier checkpoint of GBT Four. And so  </p><p>we actually had some folks on spaces talking about that it's less    </p><p>restricted and better like previous times.                           </p><p>04:36 	(Speaker A)	So it's an interest, I think NISten right. We have some folks who    </p><p>tell us they're using code interpreter without the code part. They   </p><p>just stopped the GPT Four just because it's that model.              </p><p>04:48 	(Speaker A)	And I think also they took down the 25 messages per hour restriction </p><p>on code interpreter. I've had like four hour sessions and it stopped </p><p>like I didn't saw complaints.                                        </p><p>05:03 	(Speaker G)	So it's just better.                                                 </p><p>05:06 	(Speaker A)	It's also fast. 
I think it's fast because not many people maybe use  </p><p>this by default and this could be the reason for the speed, but it's </p><p>definitely faster for sure. I think also context window, was it Yam? </p><p>Somebody summarized the context window and they told us the context  </p><p>window for code interpreter is eight k versus the regular GPD for    </p><p>actually that could be also a kick.                                  </p><p>05:29 	(Speaker G)	You mean Yam copied and pasted.                                      </p><p>05:34 	(Speaker A)	I would encourage you and Yam need to kiss in the cup because Yama is</p><p>doing a lot of legwork to take down the stuff that he posted and Yama</p><p>is working on that and it's very visible and you guys need to do     </p><p>there you go, yam, you need to clear the air. However, Pharrell and  </p><p>Gabriel bring you up as well. And we're going to keep talking about  </p><p>code interpreter because that's what we're here to do. NISten and a  </p><p>few other folks and we started cooking with code interpreter.        </p><p>05:59 	(Speaker A)	And by cooking I mean we started stretching the complete boundaries  </p><p>of what's possible there. And I think Simon Willison kick started    </p><p>this with the latent space Pod. So for folks who are not following   </p><p>latent space pod, feel free to follow SWIX, his main account, not    </p><p>this hidden one.                                                     </p><p>05:59 	(Speaker A)	And SWIX reposted the spaces we had simon Wilson was able to run node</p><p>JS and Dino within code interpreter, even though OpenAg didn't allow </p><p>for that by uploading like a binary and asking code interpreter to   </p><p>generate. Simon then promptly said they fine tuned the model away    </p><p>from that and we found ways anyway to ask it to do some stuff. I have</p><p>a thread on how I was able to run a vector DB chroma inside code     </p><p>interpreter.                            
</p><p>06:10 (Speaker A) I ran Whisper.cpp. We saw some folks running GPT-2 inside code interpreter, right? So imagine an LLM, GPT-4, running another one and talking to it. It's like a little brother inside.</p><p>06:10 (Speaker A) I personally love that inception. I don't know if the person who ran GPT-2 is in the audience. Dan, I think, was the nickname, NISten? I don't know.</p><p>07:22 (Speaker A) Surya.</p><p>07:23 (Speaker B) Surya. He also wrote the search PDF plugin for GPT-4 plugins, and he wrote that in like two days, and it's more used than any other enterprise thing, which is pretty hilarious.</p><p>07:36 (Speaker A) We need to get Surya.</p><p>07:38 (Speaker B) Yeah, he just did that as, I'm just going to do a search plugin for PDFs, and it's like the most used.</p><p>07:45 (Speaker A) So dope, pretty amazing. Again, in that space we've talked about having like a living manual, so to speak, for code interpreter use cases, because it's coding, so it covers pretty much everything that we can think of as coders, maybe just in Python, maybe restricted to an environment. And I've been trying to do that with the #codeinterpretercan hashtag, and I encourage all of you, let me pin this to the top of the space, to the jumbotron, if you have an interesting code interpreter thing. And I'll bring up SkalskiP to the stage as well.</p><p>08:03 (Speaker A) And Lantos, so many good friends. If you have a very interesting code interpreter technique or skill or new thing that people can do without coding skills, please tag it with this hashtag so folks can find it. Otherwise I will cover the main three things that code interpreter gave us besides the new model.</p><p>08:42 (Speaker A) One of them is uploading files. And since we've talked, we've noticed that you can upload up to 250 megabyte files, and those can be zips of other files. So we've uploaded like full model weights.</p><p>08:55 (Speaker A) We've uploaded bin files. It's incredible that you can now drag and drop a whole directory and have ChatGPT just know about it and read about it. We've uploaded weights and embeddings.</p><p>09:08 (Speaker A) You can then obviously execute code in a secure environment, which is again incredible, and you can download files, you can ask it to actually generate a download for you, which is also super, super cool. Maybe one last thing I'll say before I give it to the audience for a few more cool use cases. And folks on the stage, please feel free to raise your hand.</p><p>09:21 (Speaker A) I'll get to you in the order that you raise your hand if you have a use case. Some folks built like a built-in memory, a built-in brain, within code interpreter just by saving to a file. That's what I tried to do with my vector DB. And then they download that memory at the end of every session, upload it to the next one, and have like a prompt that reminds ChatGPT to start from that point.</p><p>09:50 (Speaker A) So in addition to the context window, they're also having a separate offloaded, file-persisted memory. So code interpreter, incredible, again.
</p><p>10:00 (Speaker A) Potentially GPT 4.5. And if you haven't played with this, feel free to. If you don't know what to play with, follow the #codeinterpretercan hashtag. And let's get to SkalskiP.</p><p>10:11 (Speaker A) What's up, man?</p><p>10:14 (Speaker H) Hi, hello. Do you hear me?</p><p>10:15 (Speaker A) Yeah, we can hear you fine.</p><p>10:19 (Speaker H) Yeah, I've been playing a lot with code interpreter over the past five days, mostly with computer vision use cases, because that's what I do. I haven't introduced myself. I've been doing computer vision pretty much full time for the past five years, so when I saw that you can input image and video, that was immediately what I was thinking: we need to make it do computer vision. So I went through some low effort tasks.</p><p>10:46 (Speaker H) So I managed to run old school computer vision algorithms, face detection, tracking of objects, stuff like that. But I also managed to exploit it a little bit. So you can add YOLO object detection models to the list of models that were run in code interpreter.</p><p>11:15 (Speaker H) There are some problems with memory management, so I'm not yet fully happy with the result. But yeah, I managed to run it on images and on videos. And the thing that is super cool and kind of underrated right now: false positives. So when the model detects something that shouldn't be detected, you can really use text to ask code interpreter to filter out false detections.
</p><p>11:48 (Speaker H) You can just give it your feeling of why that stuff is happening, or when or where. And it's very good at cleaning the detections, which was kind of mind-blowing for me. And one thing that I noticed that it sucks at: I managed to create an application that counts objects moving in the video when they cross a line.</p><p>11:55 (Speaker H) And I didn't use any off-the-shelf libraries, I just had the detector and said, okay, now draw a line and count objects when they cross the line. It's terrible at that, writing math logic to figure out that something crossed something. We had like a ten or twelve prompt exchange and I basically bailed out on that, forget it. So there are some things that blow my mind, but there are some things that probably don't.</p><p>12:49 (Speaker A) So folks, feel free to follow SkalskiP. And also I just pinned to the top of the space his brand new awesome code interpreter experiments git repo, and there's a list, there's a bunch of use cases there. This could also serve as a de facto manual. So feel free to go there, add PRs, and follow that for updates.</p><p>12:52 (Speaker A) And I want to get to Lantos because he seems to be unmuting. What's up, Lantos?</p><p>13:12 (Speaker H) I was just going to say I can't follow him because he's blocked me.</p><p>13:15 (Speaker C) Sad face.</p><p>13:16 (Speaker H) Oh, no, I noticed that, but I'm not sure why. I will undo that.</p><p>13:20 (Speaker A) All right, I'm the peacemaker in this space. Please kiss and make up, you two as well. Everybody should get along.
</p><p>13:26 (Speaker A) Yay. I want to get to some other folks who came up on stage recently. And Gabriel, welcome, to talk about code interpreter and your use cases.</p><p>13:35 (Speaker A) Janae, if you played with this, I would like to hear two more opinions before we move on to the next incredible thing. Yeah. Oh, you guys are talking, so let's do Gabriel and then Janae. Sorry, I should have been explicit about the order.</p><p>13:54 (Speaker E) No worries. So I just posted a comment on this space about the message cap on a conversation. So even though in the UI it still says 25 messages per 3 hours, if you look at the network request, you can see, and I posted this, it's actually 100 messages per 3 hours now.</p><p>14:12 (Speaker E) And I don't know if they're scaling that up and down as demand increases and decreases, or they're just trying to trick people into conserving their messages, but it's definitely been on 100 for a little while now. You can see the same thing in the network.</p><p>14:32 (Speaker A) Can you confirm the same for the regular mode, or do you think the regular mode is still restricted?</p><p>14:41 (Speaker E) Well, based on just the fact that there's only one message cap, they don't have a message cap per model. So I think it's just consistent across all the GPT-4 models. And that's also my experience in the last, it's been a little while now. It's probably at least a couple of weeks that it's been higher.
</p><p>14:51 (Speaker E) And same thing we discussed, I think, on Saturday about the context window. And you can also see it in the API that the context window is 8K for plugins and code interpreter, and it's 4K for the base GPT-4 model.</p><p>15:16 (Speaker A) That's awesome. Like I said, better in every single way.</p><p>15:22 (Speaker D) Yeah.</p><p>15:23 (Speaker A) Awesome. Thanks.</p><p>15:24 (Speaker E) Yeah. In terms of use cases I can share, I've been digging around a lot in the code interpreter, and I was really trying to hone in on why the packages that are installed there, the Python packages in the environment, why are they there? Some of them seem really random, and some of them make a lot of sense. And they released it saying it's for, basically, data analysis. And a lot of them make sense for that, but some of them are just really wild, like the ML packages.</p><p>15:54 (Speaker A) And Gabriel, folks in the audience, if you look up at the jumbotron where we pin tweets, two tweets before there's a tweet by Peter_0_0_g, who actually printed all the packages and asked GPT-4 to kind of summarize what they do. So if you have no idea about the potential capabilities of what it can do, feel free to pin that tweet for yourself. And then it has a bunch of descriptions of what's possible.</p><p>16:11 (Speaker A) So go ahead, Gabriel.</p><p>16:28 (Speaker E) Yeah, cool. I've done the same kind of thing with just a short, yeah, I got it to do a four-word description for each one. So if you're looking for a really short description of each package, I'll post that tweet. And if you're looking for a long one, I think Peter's is great. And what you can see there is that there are packages for web development, right? There's FastAPI, there's Flask, there's a bunch of other packages for web development.</p><p>16:40 (Speaker E) And besides the fact that there's no network access, which obviously other people using it might be turning on, it was just interesting to me. My perspective is that OpenAI has been using this internally throughout all their teams for development, testing it internally, but probably also using it pretty consistently. They probably have access to the Internet.</p><p>17:14 (Speaker A) Yeah, I'm sure they have access to.</p><p>17:15 (Speaker E) The Internet, and they can install new packages. But I think they also have the ability, instead of uploading files and downloading files, to just mount, I don't think to persist, I think they just mount their local working directory on their computer, right, wherever they're working. So they have their active directory where they have their project, and they just mount that and give the code interpreter access to the whole directory with their whole repo of their project.</p><p>17:48 (Speaker C) Yeah.</p><p>17:48 (Speaker E) And then ChatGPT is just writing code to the working directory and reading from there, and it can explore their whole project. We can do that now by uploading, you can zip your whole project and upload the whole thing zipped and have it unzipped. And then it can kind of explore your whole project. But then once it makes some changes and you want to commit them, you have to ask it to zip the whole thing back, download it and upload it.</p><p>17:48 (Speaker E) And then I think what they're able to do is more of like a kind of pair programming thing, where the developer makes some changes and then ChatGPT makes some changes and they're kind of working together. This is taking it one step further. I don't know if they have this or not, but it would be super.</p><p>18:29 (Speaker A) Cool. That's in the realm of speculation, but I would love to explore this more with you in the next space, because this applies to open source, and somebody already tagged us after the last space and said, hey, I'll build this open source. I would love to pin this to the top of the space. However, I want to move on to the next thing and then move on to other updates.</p><p>18:51 (Speaker A) Sorry to interrupt, but thanks. I think that the collaborative, persistent code superpower will probably, maybe, at some point come to us as well. Plus the internet access is like another ten x. I want to get to SkalskiP and Lantos, and I think we'll move on to Claude.</p><p>19:08 (Speaker A) Thanks, Gabriel.</p><p>19:11 (Speaker H) Yeah, I have a question. I'm not really sure, guys, if you noticed, I was obviously experimenting with PyTorch because I needed it for computer vision. I noticed that the PyTorch version that is installed in the environment is actually precompiled to work with CUDA. So it's a GPU version of PyTorch.
</p><p>19:31 (Speaker H) Even though in the environment you don't have access to a GPU, you only have CPU. So I'm curious, guys, what you think about that. Why is that? Any ideas?</p><p>19:42 (Speaker A) An idea that just comes from what Gabriel just said: likely we're getting the same Kubernetes container. However, the OpenAI folks have like unlimited stuff. They probably also have CUDA, that would make sense, right? Theirs is probably connected to a GPU as well, but that's just an idea. Lantos, I want to get to you and then we'll move on to Claude.</p><p>20:02 (Speaker A) Folks, and folks in the audience, feel free to hit the little write button on the bottom left, looks like a little message, and leave comments through commenting as well. Moving on to Claude V2. Folks in the audience and folks on stage, feel free to hit up the emojis, plus one.</p><p>20:19 (Speaker A) Minus one if you have tried Claude V2, if you like it or you haven't liked it. I'm going to cover this anyway, because somebody, I think Roy from Pine, called me a Claude V2 fanboy yesterday, and I first got offended, and I told him that I'm just a fanboy for 24 hours. Before that I was a code interpreter fanboy, and then I figured out with myself whether or not I am a fanboy of Claude V2.</p><p>20:43 (Speaker A) And yeah, I am, and Swyx told me to relax, and in fact I invited him here to be the wet blanket on the other side of the list. Anthropic, the company that we can definitely consider number two after OpenAI, I think that's fair in terms of quality.
</p><p>21:02 (Speaker A) Have long released Claude versions, and they made some waves when they released Claude with the 100K context window. They have now released Claude V2, and let me paste some Claude, sorry, pin some Claude thingies to the jumbotron. However, Claude V2 released with multiple stuff, and I want to focus on two things. I think we'll cover the UI first, and then we're going to talk about the model itself. UI-wise and product-wise, my hot take, and I'll pin this to the top.</p><p>21:38 (Speaker A) Unfortunately we'll not debate this, but I love you, all of you. It's that as products, Claude V2 right now beats ChatGPT as a product. My mom can go into the two websites and she'll prefer one versus the other one.</p><p>21:51 (Speaker A) Or my friends that don't know AI, that aren't as plugged in as we are. Theirs is free. And I think Claude V2 beats GPT-3.5, which is also free, and the 100K context window, with the model being trained on 200K, unleashes a bunch of use cases that were not possible before.</p><p>22:12 (Speaker A) It just frees you up. You heard SkalskiP just describe the limitations of code interpreter. A bunch of these limitations stem from the 8K context window.</p><p>22:13 (Speaker A) If you print a bunch within the code that you're running, code interpreter sometimes forgets what you guys talked about 20 minutes ago. And the 100K context window also means a long, long conversation history with the model. And I think it's really great.</p><p>22:37 (Speaker A) Not to mention that you can drag and drop full books in there. Those books need to be in like one or two files, and they still don't accept zip files. And I'm planning to release an extension soon that does this for us and unifies them into single files.</p><p>22:51 (Speaker A) So hopefully by next week we'll have some updates. However, once you upload that much, or you can upload like a transcript of a podcast, you can do a bunch of stuff, because Claude V2 is also better trained on code, and we saw a significant jump in, wait, I'm switching to the model, so let me get back to the UI. The UI allows you to upload files.</p><p>23:09 (Speaker A) The UI has a Command-K interface, which I personally love. I hit Command-K on every website and see if they support it. You can just start a new chat real quick.</p><p>23:21 (Speaker A) It doesn't have share, but it's definitely a refreshed and free UI. It's called claude.ai and that's the URL, and if you haven't tried it, definitely try it. Comments about just the product side and the UI side before we move to the model? Anybody play with this? Anybody like it? Anybody love the upload files feature? I would love to hear hands and comments.</p><p>23:42 (Speaker A) Go ahead, Matt.</p><p>23:44 (Speaker D) A bit of a weird thing, but what I've noticed is it's actually quite frustrating if you want to paste text in. It actually, if it's over a certain length, will paste in as a file. Little small thing, hopefully they'll change it, but it is really annoying because then you can't edit it. ChatGPT does do that much better, but I generally agree with you that overall the product experience on Claude is.</p><p>24:03 (Speaker A) Significantly better, the new one, the fresh coat of paint they released for us. I will say that Claude so far was kind of a hidden gem, in that only folks who got access to the API actually got access to their UI, and that UI was very restricted, and folks who have access to the Claude API know what I'm talking about. I think that UI is still around.</p><p>24:22 (Speaker A) It still shows your history. It's like very restrictive. It's not as cool as this, it's not as sleek as this.</p><p>24:27 (Speaker A) So we like claude.ai, definitely a plus. Check it out. Now, let's talk about the model behind this UI, because that model also changed, and several incredible things changed with it.</p><p>24:38 (Speaker A) First of all, they released a new model, same price as the previous one. We love to see this. Please, everybody, including OpenAI, continue giving the same price and cheaper and cheaper down the line.</p><p>24:41 (Speaker A) We love to see this. Second of all, they claim it's been fine-tuned on several things. One of them is code.</p><p>24:54 (Speaker A) And we actually saw a bump in the evaluation called HumanEval, which is a set of questions that OpenAI released, and I think the bump was from like 55% to 78%, which I think beats 3.5 and is not there compared to GPT-4. Correct?</p><p>25:14 (Speaker C) Yeah, and GPT-4 on pass at first, on the first try, not GPT-4 that is allowed to refine and fix it, but on the first trial. Yeah, by a little bit.</p><p>25:33 (Speaker A) So, news to me, and thank you for joining in. The pass numbers is how many times it's able to reflect upon the answers and improve them.
</p><p>25:43 (Speaker C) The pass at one is kind of what I meant. By reflection, GPT-4 is even stronger: if GPT-4 sees the exception, it can come up with a solution. So this is not in the HumanEval test, but if you use GPT-4 this way, you get to ninety-something percent, which I think is more realistic if you think about it. No programmer writes the whole code in one go.</p><p>26:10 (Speaker C) You write it intuitively, fix bugs and so on. And also in code interpreter, you see it. But it is remarkable to see state.</p><p>26:19 (Speaker A) Of the art on the first try, and it's significantly better in code. And I suggest folks who previously tried Claude and weren't impressed to try it as well. An additional crazy thing that they've trained on is the 100K context window, and they've actually trained, they claim, on a 200K context window, so twice as much as the previous round. And we follow this one guy, Ofir Press, the guy behind Self-Ask with Search and the guy behind ALiBi, the ability to extend context windows.</p><p>26:55 (Speaker A) He just defended his PhD, and he talked about context windows, and he was impressed with the way they presented and the way they showed their loss curve. And so this could be, we saw the paper maybe this week, folks saw the paper where the window dips in the middle. There's like less attention in the middle than at the beginning and the end.</p><p>27:03 (Speaker A) And it looks like that's not the case for Claude as well. So I suggest you try the huge context window. And Al, you have your hand raised, and then we'll talk about some other model changes.
</p><p>27:26 	(Speaker F)	Yeah, I would talk a little bit about I used Claude about a month and</p><p>a half ago to win Best Solo Hacker at the Craft Ventures hackathon   </p><p>david Sachs won. Yeah, it had like 200 entries, but it's             </p><p>exceptionally good at creative writing and also like comparing and   </p><p>contrasting. I don't think people have really taken advantage of what</p><p>the context window is capable of doing. It's more than just loading  </p><p>single files in.                                                     </p><p>27:53 	(Speaker F)	So what I did for the project was I loaded these large legislative   </p><p>bills, these like 50 page unreadable bills, and you turned them into </p><p>relatable narratives. So one of the things that Claude can do is you </p><p>can adopt a persona. So a lot of times with summaries, summaries just</p><p>compress the text that you see, but you can tell it to say, write    </p><p>1000 words from a social conservative point of view, or a bus        </p><p>driver's point of view, or a social liberal point of view.           </p><p>28:21 	(Speaker F)	And what that does is it takes all of its knowledge about the outside</p><p>world and gives you not a summary, but it gives you essentially an   </p><p>essay about the practical effects of something like a bill. I've     </p><p>actually been working with the idea of reading a book and having it  </p><p>tell you what I would have learned from this, because that's actually</p><p>probably what you're more interested in. What it can do in terms of  </p><p>comparing and contrasting large essays is exceptional.               </p><p>28:51 	(Speaker F)	So you could have it say, write 2000 words from a social conservative</p><p>point of view, 2000 words from a social liberal point of view, and   </p><p>then have it contrast the essays, which is something that would be   </p><p>very difficult for a human to do. 
So you get to give it multiple files and have it give you a more balanced approach, so you get rid of some of the bias that comes in.</p><p>29:18 	(Speaker A)	My dream, go-to project that I never get to is to create this for Twitter, as like a Chrome extension where I can select a bunch of tweets and then say, remove the bias from this and just give me the debiased version of all of this. Yeah, completely. The cross-referencing ability of Claude, because of this context window, is incredible for many, many use cases.</p><p>29:41 	(Speaker F)	Yeah, I would say that it's not as good as GPT-4 for certain things. But that context window is fantastic. And I would say to a lot of people that are using embeddings and retrieval: you can actually just put the whole thing in the context window and ask questions of it, and then you have a baseline to compare your results against. Most people, if they're chatting to a website or something like that, can actually just put the whole thing in there as opposed to trying to chunk it up and do questions, and you'll see that your results are much better that way.</p><p>29:51 	(Speaker F)	And for most people, that would be good enough.</p><p>30:17 	(Speaker A)	An additional thing that Claude was trained on: they've talked about the output tokens, the number of output tokens Claude is able to generate. I don't know if the same is true of GPT, I haven't seen numbers on GPT-4, but they've said that previous Claude models were focused on shorter outputs, just as they were trained.
And this latest model was trained to output up to 4,000 tokens of output.</p><p>30:47 	(Speaker A)	This is added to the fact that they also fine-tuned it and trained it to output JSON files, complete JSON files, as responses, which we as engineers waited for, and OpenAI gave us functions, kind of, here you go, there's the function interface. And we love the function interface, but the function interface kind of locks us down to the OpenAI ecosystem.</p><p>31:04 	(Speaker A)	And it's great to see another model that's very close to state of the art on HumanEval that is also now fine-tuned to respond in full, intact JSONs. And those JSONs can be 4,000 tokens in length. Any thoughts on these?</p><p>31:28 	(Speaker F)	Yeah, I can confirm on it being able to write large amounts of output. I mean, I was having it write like 2,000, 3,000-word sort of essays and outputs, and it was fine with that.</p><p>31:40 	(Speaker A)	Yes. And I think it's, I'm going to.</p><p>31:45 	(Speaker B)	Stick with GPT-4 myself. But this might be pretty useful for just dumping in an entire code base, given the 100K context window, and then getting some reviews and stuff, and then maybe moving some of the stuff.</p><p>32:02 	(Speaker A)	Once I stop posting statuses and build that Chrome extension where you upload the zip and it flattens it to one file and then uploads it, then we'd be able to do, like, a proper comparison, because code interpreter can take zip files and then extract them.
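</p><p>The zip-flattening step Alex describes (extension names and the flatten_zip helper here are hypothetical, just a sketch of the idea) is straightforward in Python: walk the archive and concatenate every text file into one string, with a header marking each original path.</p>

```python
import io
import zipfile

def flatten_zip(zip_bytes: bytes) -> str:
    """Concatenate every text file in a zip archive into one string,
    with a header line marking each original path."""
    out = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            data = zf.read(name)
            try:
                text = data.decode("utf-8")
            except UnicodeDecodeError:
                continue  # skip binary files
            out.append(f"===== {name} =====\n{text}")
    return "\n\n".join(out)
```

<p>The flattened text can then be uploaded as a single file straight into Claude's context window.</p><p>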
Oh, one difference that I want to note for folks in the audience: GPT-4 with code interpreter allows you to upload zip files, et cetera. We talked about this. It does not load them into the context window, right? So there's like an 8K context window.</p><p>32:30 	(Speaker A)	The files that you upload are not automatically in the context window. The model has to write Python code that actually prints the files. And it usually does, like, the first few lines, hint, hint.</p><p>32:30 	(Speaker A)	The folks in the audience who get my drift. But it doesn't usually read all of it unless you specifically ask it to, and Claude does. So everything you upload to Claude goes directly into the immediate working memory of the context window.</p><p>32:38 	(Speaker A)	And that's a major difference to watch out for and also take care of. Go ahead.</p><p>33:00 	(Speaker C)	I would like to ask everyone, before I say my opinion: what do you think about it in comparison to GPT-4, about the performance? What do you think?</p><p>33:10 	(Speaker A)	I would like comments from folks who actually used both and did the comparison. And before I get to folks, please raise your hand to answer. I want to call out Swyx's smol menu bar, which allows you to actually, Swyx, can you give us like a brief two minutes on the menu bar thing?</p><p>33:28 	(Speaker G)	Yeah, well, you don't have to choose. Just run it all the time on every single chat. So it's a little Electron app that runs in the menu bar.
And I've been maintaining it, and I just added Claude 2 this week.</p><p>33:42 	(Speaker G)	Claude 2 is not super stable yet. Sometimes it will fail to hit the submit button, so you just have to retry submitting manually.</p><p>33:50 	(Speaker G)	But yeah, it's a great way to A/B test models, but then also just amplify every question across between four to five different chat models with the answers. So I've been trying it. It's up to you if you want.</p><p>34:07 	(Speaker A)	To.</p><p>34:10 	(Speaker C)	Find it.</p><p>34:14 	(Speaker A)	With the announcements, if you can. Yeah, awesome. Yeah, basically, you don't have to stop using, you don't have to choose. So I think the last thing that we need to acknowledge about Claude is the multilinguality.</p><p>34:28 	(Speaker A)	So they actually focused on showing how much better the new one is than the previous ones, and they posted BLEU scores. Claude 2 is significantly better at languages than the previous versions. I think, to answer your question, it's close to GPT-4, if not better at some things. Hebrew goes fluently, and usually Hebrew is not that great.</p><p>34:57 	(Speaker A)	Russian and Ukrainian, which I use, also go fluently. And that part is really good with a lot of context, because you sometimes need to do a lot of translation, or at least I need to do a lot of translation.</p><p>35:11 	(Speaker C)	Yeah, multilinguality works great. I was surprised. Absolutely.
What I think, if you just compare the two on the same prompt, the same question, I have a feeling that GPT-4 is slightly better, but I just don't have an example to show you.</p><p>35:31 	(Speaker C)	Okay, here, I don't know, it's a strange situation, but I really wanted to ask you: what did you try, and what worked better here and there?</p><p>35:38 	(Speaker A)	So here's my use case that GPT-4 currently cannot do. Yesterday, Lex Fridman interviewed Israel's Prime Minister Benjamin Netanyahu, in one of the weirdest turns of history this podcast took. And given that I kind of know who Benjamin Netanyahu is from before, I decided to not listen to this; I decided to use the tools that we have at our disposal. So I ran this through Whisper with diarization, so I have, like, a very nice transcript of who's talking.</p><p>36:10 	(Speaker A)	When I had that, I just dumped it as a text file. And I agree with Matt, it's a little bit annoying that Claude turns whatever you paste into, like, a little text file upload, because you can't edit it.</p><p>36:21 	(Speaker A)	However, I uploaded that transcript directly to Claude, and then I asked it to do sentiment analysis and entity extraction. Something that, if I'd asked GPT-4's code interpreter, it would probably write some Python code to do, and Claude just kind of did it. And I haven't seen GPT-4 being able to do this for bigger files.</p><p>36:38 	(Speaker A)	And let me just finish this point.
I continued by saying, hey, because of the new coding abilities of Claude, I asked it, like, hey, print me a Python file that dumps whatever table of topics he mentioned and the sentiment, negative or positive, dump it into a word cloud. That's something the code interpreters can actually do and show you.</p><p>37:03 	(Speaker A)	But I asked it from Claude, because previously Claude was s**t at coding, and it gave me Python files that ran the first time. I didn't have to change anything; there were no bugs. And then it showed me a word cloud of everything that was mentioned by Bibi in that podcast, and it all took maybe seven minutes.</p><p>37:11 	(Speaker A)	And I don't know if, for bigger context windows, GPT-4 can currently do this. Go ahead, Al.</p><p>37:28 	(Speaker F)	Yeah, I've actually been putting a lot of transcripts for podcasts in there, and because it sees so much about the speakers and it knows about the speakers, you can actually have them continue a discussion about things that they didn't actually discuss. Yeah, so you can have it say, okay, well, what are some topics they disagreed on, and then some things that they didn't cover? Tangentially, you can just have it give you another two minutes of interview, and it does a pretty reasonable job, especially with public figures that it actually has a lot of background on. So it's pretty interesting.</p><p>38:01 	(Speaker A)	And not to mention free. GPT-4 needs a $20 a month payment, and Claude is free.</p><p>38:08 	(Speaker F)	That's a good point, too.
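</p><p>The word-cloud pipeline Alex describes, topics plus sentiment dumped into a cloud, boils down to turning extracted (topic, sentiment) pairs into frequency weights. A minimal stdlib sketch of that counting step (the topic_weights helper and the sample rows are illustrative, not the code Claude actually produced; rendering would use something like the wordcloud library):</p>

```python
from collections import Counter

def topic_weights(rows):
    """Turn (topic, sentiment) pairs into word-cloud weights:
    frequency counts, split into positive and negative counters."""
    pos, neg = Counter(), Counter()
    for topic, sentiment in rows:
        (pos if sentiment == "positive" else neg)[topic.lower()] += 1
    return pos, neg

# rows as a model might emit them from a transcript (illustrative data)
rows = [
    ("economy", "positive"),
    ("economy", "positive"),
    ("judicial reform", "negative"),
]
pos, neg = topic_weights(rows)
```

<p>A word-cloud renderer would then scale each term's font size by these counts.</p><p>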
For those of you that have eval keys, you'll notice that they're actually not charging you for them, so you can go on as long as you want. The limitation is that you can only do one request per organization. So if it's just a single person, they only charge you basically when you start deploying for commercial purposes.</p><p>38:21 	(Speaker F)	So that's something that people may not have realized.</p><p>38:32 	(Speaker A)	So I think we've covered everything, right? Trained on 200K context, which they can enable tomorrow for us, and we'll get like two X. It's going to be insane. There is some stuff that they have in Claude, at Anthropic, called Constitutional AI. So they have a mix of RLHF and Constitutional AI, so they're working on their model to actually be more helpful, but also more safe and less jailbreakable.</p><p>38:57 	(Speaker A)	They talked at length about this. We talked about HumanEval being better, and same price, and free playground. I think we've covered most of it.</p><p>39:03 	(Speaker A)	So anything else about Claude that we haven't covered, feel free to raise your hand and tell us, and if not, I think we can move on. What do you guys think?</p><p>39:17 	(Speaker G)	I'll mention briefly, did you talk about the multiple file uploads?</p><p>39:21 	(Speaker A)	No, go ahead.</p><p>39:24 	(Speaker G)	So I think it's just an interesting difference between code interpreter and Claude. Code interpreter, you can only upload one file, right? But it can be a zip file with multiple files inside.
So it's de facto multiple files, but then you can only run code on that. Whereas what Claude here is doing is something slightly different, which to me is interesting: you can upload multiple files, it just reads the files straight into the context, and it's using that 100K context to synthesize answers.</p><p>39:24 	(Speaker G)	So you can do, for example, PDF A and PDF B, and give me a comparison between the two of them, or synthesize knowledge across them. And I think that is something that code interpreter cannot do, because code interpreter will only run code across files. So I think that's noteworthy.</p><p>40:15 	(Speaker G)	It's Claude genuinely coming up with one new thing that is not copying ChatGPT, and good for them.</p><p>40:23 	(Speaker A)	Yeah. And unfortunately, no zip allowed. But we're going to fix this with an extension and hopefully talk about this next week. I want to say hi to Weather Report.</p><p>40:33 	(Speaker A)	Feel free to chime in. Sorry you raised your hand to come up before. So if you have a comment about code interpreter, we've moved past it, but if you have a comment about Claude, feel free to tell us. What's up, Weather Report?</p><p>40:46 	(Speaker A)	Actually, I had only one thing about code interpreter: in the previous space I talked about a hypothesis I had about code interpreter, which.</p><p>40:56 	(Speaker B)	Is to use it as a huddle, because it's recorded.</p><p>40:59 	(Speaker A)	We'll move on, and let's talk about code interpreter next time.
I think that some folks are saying that their audio is glitching, and so they're not able to, and I want to see, I think Joseph has a comment about code interpreter. Joseph Polak. We'll give him a second to log in, and then I think we'll move on to other updates, because we have many other things to talk about.</p><p>41:29 	(Speaker A)	What's up, Joseph? Welcome to the stage.</p><p>41:31 	(Speaker G)	Hi there, folks.</p><p>41:33 	(Speaker A)	Thanks for taking my question. I didn't even know all about that code interpreter stuff with the file.</p><p>41:40 	(Speaker G)	So I'm really happy to have heard it. About Claude, though.</p><p>41:46 	(Speaker A)	For Claude. Well, I'm still on the waitlist. First of all, it's free now. You can access it right now.</p><p>41:53 	(Speaker A)	claude.ai. There's no waitlist anymore, unless you don't live in the States, and then you'll have to get a VPN. Okay, I'll definitely check that out.</p><p>42:03 	(Speaker A)	My question was about using Claude, and actually code interpreter, through the API. Do you think that's ever going to exist, or is it coming? So, Claude's API, I think that's waitlisted. I have talked with the Claude folks, and they said the waitlist is now going faster.</p><p>42:24 	(Speaker A)	So they are ready to get more people in. I think because of the new safety updates, they're less afraid. So definitely apply for the waitlist on Claude's site.</p><p>42:35 	(Speaker A)	Code interpreter is not available via API, and we've seen some folks who hacked it together with, like, I think a browser plugin that proxies something.
Swyx, I don't know if you remember the unofficial, quote unquote, code interpreter API and how to access it, but it's not available in the official OpenAI APIs as of yet. We haven't seen them.</p><p>42:56 	(Speaker G)	No. For the record, there's no unofficial code interpreter API. There's the browser-side thing that we are trying, but nobody's made any.</p><p>43:07 	(Speaker D)	Adapter for it yet.</p><p>43:08 	(Speaker G)	I think you can, if you want, using Puppeteer.</p><p>43:12 	(Speaker A)	I would not recommend it, definitely. If anything, there were some folks that tagged us, and I need to go and find this, who are working on, like, an open source version of code interpreter that uses LLMs and stuff. And that one will likely be the way forward if you do want something programmatic that has code interpreter capabilities. Go ahead, NISten.</p><p>43:35 	(Speaker B)	There's also Chatbot UI on GitHub. So yeah, for the other people that are hacking something together, I'll wait until there is something public, because then.</p><p>43:45 	(Speaker D)	We don't know everything.</p><p>43:47 	(Speaker G)	Open source is going to be worse, because you are missing the model.</p><p>43:51 	(Speaker A)	Yeah, because we think that it's fine-tuned on actually knowing how to run code, right? That's kind of the highlight that we got from the last space. We think it's smarter because of that.
</p><p>44:01 	(Speaker A)	And one of the main things, again, sorry, going back to code interpreter just real quick: it is able to then fix itself and ask itself, oh, oops, I made a mistake, let me try again. Matt, I saw you unmute yourself.</p><p>44:13 	(Speaker A)	Feel free to go ahead.</p><p>44:16 	(Speaker D)	Well, yeah, just a quick thing. So from what I know, OpenAI will be offering fine-tuning relatively soon. So at that point, you theoretically could go and fine-tune your own code-interpreter-like model, even if they don't offer it.</p><p>44:31 	(Speaker A)	You can also theoretically, not that we would recommend it, but theoretically, right now you could start distilling some stuff from code interpreter by asking it questions: generate code and store it to a file, ask it to download, and then, quote unquote, generate the data set. Not that you should, but you can theoretically as well, so that when it's time to fine-tune, you have some data set.</p><p>44:52 	(Speaker D)	Yeah, theoretically. I don't know if ShareGPT currently supports those types of conversations, but if it does, I'm sure that's going to happen really soon.</p><p>45:00 	(Speaker G)	I don't think it's maintained, because ChatGPT itself, well, I don't want to speak for ShareGPT, I know Steven, but I can help you move the conversation back to Claude.</p><p>45:11 	(Speaker A)	Yes, please. Let's move back to Claude. Thank you.</p><p>45:14 	(Speaker G)	So, just between us, how many people are listening to this chat anyway? I think it's like 60 people. Email support@anthropic.com for the Claude API.
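</p><p>The distillation idea Alex sketches, saving each code interpreter question and the generated code so you have a data set when fine-tuning opens up, would amount to appending prompt/completion records to a JSONL file. A hypothetical sketch (the record shape and the append_example helper are assumptions, not any official fine-tuning format):</p>

```python
import json

def append_example(path: str, prompt: str, completion: str) -> None:
    """Append one prompt/completion pair to a JSONL data set file."""
    record = {"prompt": prompt, "completion": completion}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# e.g. after each code interpreter exchange:
# append_example("distilled.jsonl", question, generated_code)
```

<p>Whether such pairs are actually usable later depends on the format the fine-tuning API eventually expects.</p><p>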
</p><p>45:26 	(Speaker A)	Yes, email them, state your use case, and they'll likely get you in, and you can use Swyx's menu bar to actually kind of run them in parallel with the megaprompt feature. Megaprompt, super prompt, what is it called? I think Swyx dropped it. There's like one prompt that you type, and then it all goes to all the models. I want to recognize some folks in the audience.</p><p>45:50 	(Speaker A)	Hey, feel free to chime in if you.</p><p>45:52 	(Speaker D)	Want to come up.</p><p>45:52 	(Speaker A)	Obviously, I saw some others in the audience. Max AI. Welcome, Dexter. There's a bunch of folks who are usually here, and it's great to see, and I think we're moving on to a very spicy one.</p><p>46:06 	(Speaker A)	What do you guys think about xAI? So I'm pasting the summary of the people. Elon Musk and a bunch of other folks have announced xAI, essentially their answer to OpenAI.</p><p>46:22 	(Speaker A)	We've all seen Elon kind of talk about safety, and talk about helping OpenAI and it not being open since then. He talked about TruthGPT at some point. And finally they announced xAI as we were talking.</p><p>46:37 	(Speaker A)	By the way, I have a notification from xAI: they're going to have a Spaces tomorrow to go deeper into xAI. But so far there's not a lot of detail. There are some details about the folks who work there.</p><p>46:50 	(Speaker A)	So they have folks who wrote the Adam optimizer. There are other folks. Thoughts about xAI before we get to hear what they do? Obviously, there's no product yet.
</p><p>46:59 	(Speaker A)	I don't think they've started training. The one thing that I will say is that they will have premium access to Twitter, obviously, because Twitter is now rebranded x.com. After closing down the APIs and closing down the scraping for Twitter, xAI will now have a data set that's insane to train on.</p><p>47:21 	(Speaker A)	And we wish them, quote unquote, good luck. I would love to hear from folks on stage: what do you think about the announcement, the direction, the people? And we're going to wait for tomorrow to actually hear them talk.</p><p>47:24 	(Speaker A)	I know NISten, you have some ideas, if you want to share to get started.</p><p>47:40 	(Speaker B)	Well, this is more of an old-lady babushka opinion that's just talking about stuff. I found it interesting that they went from, what was it, BasedGPT, to straight taking on GPT-4 and this entire competition, to doing something more noble, like dedicating it to be better at math and discovering new things in physics. So the way I see that, that's pretty noble. But at the same time, I feel like that's a result of having problems hiring in order to be competitive with the other ones.</p><p>48:26 	(Speaker B)	So, yeah, this will be interesting. But the way I see the whole setup right now is, as the kids say, it's pretty mid, in my opinion.</p><p>48:39 	(Speaker A)	As the kids say. I will say that we will see tomorrow from their space.
They're probably going to use Elon's clout to maybe try to hire, and it's probably harder now to hire, because everybody knows how quickly people are getting fired, and it's not, like, super fun to work for X. But we're in for a nice ride, because they do have access to the cross-pollination from Tesla as well, right? So if they have big questions, Tesla does have a few good folks still, even after Andrej Karpathy left, and so they'd be able to ask them for assistance.</p><p>49:20 	(Speaker A)	There's obviously the whole Dojo thing in play, which, I don't think we have time to talk about Dojo, and it's not new, but there could be something there. Gabriel, you wanted to come up? Maybe you have. Yeah, go ahead.</p><p>49:33 	(Speaker A)	Gabriel.</p><p>49:34 	(Speaker E)	Yeah, I was just going to say about xAI, I mean, you mentioned Twitter's data, and I'd be interested in hearing other people on the stage's opinions on this, because recently there's been a lot of work done on quality of data over quantity of data. And of course, Elon also has a ton of GPUs; reportedly, he's bought tens of thousands of GPUs. So that's definitely important in building these big models.</p><p>49:58 	(Speaker E)	But I'd be interested in hearing from people on the stage if they think Twitter's data, and the kind of data that Twitter has, is actually going to be really powerful for training good models.</p><p>50:11 	(Speaker A)	Anybody want to take this?</p><p>50:13 	(Speaker F)	Yeah, I'll take a little of it. One of the things that Twitter has that other people don't is that people are actually debating issues.
</p><p>So I think that's one of the reasons why he's really focused on the idea of Twitter being a source of truth and being sort of unrestricted, so that you're not just following, like, one thread; you watch the narratives being debated, and he has access to all that.</p><p>50:35 	(Speaker A)	Data and Community Notes. And it's really hard to scrape. Like, I don't think it's API-able at all. It's not super simple to scrape at all.</p><p>50:42 	(Speaker A)	I want to get Yam, but I think Matt wanted to unmute and go first, and then Yam. If Matt, you still want to chime in, and then Yam.</p><p>50:53 	(Speaker D)	Yeah, I mean, nothing too much to add here. I think the Community Notes are very interesting as a way to sort of, like, reduce hallucinations. I think one of the things that they're going to want to do heavily is invest in sort of filtering that data set, because there's a lot of great stuff on Twitter, and there's a lot of crap on Twitter.</p><p>51:07 	(Speaker A)	A lot of, yeah.</p><p>51:09 	(Speaker D)	And the more of that that seeps in, the worse the model is going to perform. Obviously, scale is important, but data quality is incredibly, incredibly important, and scale kind of doesn't negate bad data quality. So I think if they do one thing right, it's going to have to be getting the sort of filtering of the data set down. But they do have a ton of incredibly high quality data.</p><p>51:27 	(Speaker A)	Yes, I think Yam was next, and then we have a few folks who wanted to come in. I think Far El wanted to come up. So Yam, and then Far El, and then Gabriel.
</p><p>51:37 	(Speaker C)	I just want to say, of course, if you just take Twitter data and start training your model, you can expect it to be average Twitter, which is not what you want. What you can do, which is a gold mine, is to transform this data, or just rephrase it into other forms. And this makes the data a gold mine, because Twitter does have very high quality content here and there. Absolutely.</p><p>52:05 	(Speaker C)	If you can, transform it and rephrase it into a different form. If you want an example, take the paper Textbooks Are All You Need. Basically, they just take data and make it into a tutorial, make it into a textbook, like perfect, clean and everything.</p><p>52:22 	(Speaker C)	It is very easy to do, and you don't need a powerful model to do that. You don't need ChatGPT. You can do it with a small model.</p><p>52:30 	(Speaker C)	Off the record, I'm currently doing it myself in a large model I'm training. It doesn't matter anyway. It's a gold mine.</p><p>52:43 	(Speaker C)	What I'm saying is, it's a gold mine.</p><p>52:45 	(Speaker D)	About Twitter.</p><p>52:46 	(Speaker A)	An additional thing, before I get to Far El and then Gabriel. NISten and I talked about this yesterday at length in our late-night line cook space, the one that's not going to be scheduled; if you guys are on, feel free to join that one.</p><p>53:00 	(Speaker A)	Twitter Spaces is also a gold mine. Transcribing Twitter Spaces and seeing all the reaction emojis that they have in real time.
Like the space that Elon ran with RFK Jr., for example. If you know who in the audience are actual people instead of bots, and you're able to get, like, emoji reactions in real time, that's a definite, definite, very high signal kind of training set that they have and almost nobody else has.</p><p>53:25 	(Speaker A)	Far El, you are next, I think, and then Gabriel. Yeah.</p><p>53:30 	(Speaker D)	I wonder what the relation is, and how useful the Twitter data will be for their goal of building a sort of math reasoning machine, right? Also, do we know if they're open source, as in truly open source, or not?</p><p>53:49 	(Speaker A)	No, we don't know yet. Hopefully tomorrow we'll be able to answer questions. However, we've seen Elon take Twitter's algorithm to open source, and now he's, like, boasting this as a comparative competitive advantage versus something like Threads. He's saying, like, hey, open source.</p><p>54:07 	(Speaker A)	If you go to Threads, you're under Zuck's influence algorithm. So there is definitely an attempt to open source from their side, but we don't know anything beyond that. Gabriel.</p><p>54:17 	(Speaker A)	And then Johnny.</p><p>54:20 	(Speaker C)	Yeah.</p><p>54:22 	(Speaker E)	First of all, I think it's funny that Elon's s**t posting is polluting his data set. I would say that.</p><p>54:34 	(Speaker A)	By the way, if there's anybody with the option to detect s**t posting, it's them, right?
They're going to be able to build a model to understand: this is a s**t post, and this is somebody who made an effort to give us clean information. But sorry, go ahead.</p><p>54:49 	(Speaker E)	Yeah, that's exactly the point I was going to make, that Elon was on this crusade before he bought Twitter. And this is kind of why he got forced into buying Twitter, because he was going after the bots, and he made a big deal about the bots. And I think they spent a lot of resources on figuring out what's good content and what's bot content. And another thing is that we each are kind of experiencing a different Twitter, right? Because we're within, whether it's ML Twitter or Israel-based Twitter, there are many different communities, and Twitter is very good at segmenting those communities and figuring out which content belongs to what community.</p><p>54:55 	(Speaker E)	And they'll have the ability, I think, to segment this data and train many different models that are good at different things, because they're in a literature community or an ML community or an MMA community or whatever.</p><p>55:37 	(Speaker A)	I actually saw a map of like 5 million, 7 million tweets all embedded in Nomic AI's Atlas. I don't know if you guys follow Nomic; they just recently announced like a $17 million Series A, by the way. So kudos to Nomic, good friends, Andre and the GPT4All team. And they have an embedded map, from before the API was shut down, that they were able to siphon, et cetera.</p><p>56:00 	(Speaker A)	And Gabriel, what you're saying is actually visible in the embedding map. You can actually see those tweets, and then different areas of the political Twitter.
There was a journalist Twitter, until all of the journalists started leaving. There's a bunch of different pockets of Twitter that we don't get exposed to, not to mention the different languages.</p><p>56:20 	(Speaker A)	There's a whole Japanese Twitter that's like insane, and people go super, super hard. And translating is easy.</p><p>56:26 	(Speaker A)	We talked about Claude being able to translate. So they have a bunch of very interesting data. And I think Zuck is also going after that data with Threads.</p><p>56:31 	(Speaker A)	And I think this is the reason why we'll see Threads getting continued work, and we'll see a lot of investment from their side. But compared to Threads, and we talked about this yesterday, Twitter has back history, a lot of historical data that they can train on. Threads is fairly new.</p><p>56:54 	(Speaker A)	So definitely a bunch of interesting data sets. Johnny and then Lentil. Hey.</p><p>57:00 	(Speaker H)	So one thing I think about, when I think about the data from Twitter, that is potentially lacking in some of the other data sets, is colloquial language. Because what Twitter has that Facebook doesn't have, and a lot of other things don't have, especially from what you're talking about, like historic data, is the way that people actually interact with each other. You know what I mean?</p><p>57:26 	(Speaker A)	Not only that, how it evolved as well, right? Throughout, exactly.</p><p>57:35 	(Speaker H)	To be honest, I think the data sets from earlier are probably better and stronger, because it's just gotten out of hand.
But I agree with, I'm not sure if it was Yam or who, what was said about the filtering. Because, all right, this is a black box, it's not open source. Elon has not been shy about his kind of response to what he perceives as wokeism and all of that stuff. I'll be super curious.</p><p>57:36 	(Speaker H)	I mean, there's a big team on this, but I will be super curious to see what that bears out in the actual model. Because, God, there's equal parts or more parts disinformation on Twitter than there is information. So if we're talking about a source of truth, that rings some alarm bells for me personally.</p><p>58:21 	(Speaker H)	So those are just my thoughts.</p><p>58:29 	(Speaker A)	Yeah. Thanks, Johnny. Lentil, go ahead. And then Gabriel.</p><p>58:33 	(Speaker A)	Let's finish on Gabriel and then we'll move on to the next topic.</p><p>58:36 	(Speaker H)	Cool.</p><p>58:37 	(Speaker A)	Yes.</p><p>58:37 	(Speaker H)	So I think it's going to be hugely bullish for this data. And from the perspective of relating idea space and people, and the relations between those, I think that's probably going to be more of a gold mine of information than the conversation, because you can build so much from that. Like dating, this is just one example, like a dating thing. Or finding people, finding brain power, compute. That's going to be huge.</p><p>58:40 	(Speaker H)	And to touch on the open sourceness of the data, I think not open sourcing it at some point is going to be hugely politically bad for Elon to do.</p><p>59:23 	(Speaker H)	That's my thoughts on that.</p><p>59:24 	(Speaker A)	Awesome. Thanks, Lentil. Gabriel, let's wrap up, and then, Matt, we're going to talk about some interesting stuff.</p><p>59:31 	(Speaker E)	Yeah, just on the kind of data. I think for those of us who ran the early versions of Llama, before they got fine tuned in all kinds of ways, and you run it, especially the smaller models, you put in a prompt and it spits out some generic Facebook type of content. It sounds like a Facebook post of a 15 year old or something like that. That shows what you get when you use all this kind of unfiltered data.</p><p>59:59 	(Speaker E)	But I think the interesting thing is that Llama was then fine tuned in many different ways, and some really powerful models are built on top of it. So I think in some sense almost any data is valuable in the pretraining stages, and maybe you need really high quality for the fine tuning. But I think that big volume might be really useful, maybe not the most economical.</p><p>60:21 	(Speaker A)	So I want to wrap up with why they potentially have a leg up versus not a leg up. We definitely know that Twitter was used to train other models that we currently use. We know this for a fact. This was the reason why Elon and Sam Altman, who used to be friends, are no longer friends, s**t posting about them.</p><p>60:40 	(Speaker A)	And the current models we use do use this data set, but it's old for them. It's no longer recent and relevant.
</p><p>60:40 	(Speaker A)	And we know for a fact that Twitter is significantly biased, and probably the best place in the world for uncovering news as it happens, before the bias sets in, before the narrative sets in, before folks get their marching orders from MSNBC or from the other side on how to think about things. Twitter is really good at talking about issues as they arise, the second they arise. And I think that on its own is going to teach the models a very great deal.</p><p>61:16 	(Speaker A)	Naval Ravikant, if you guys follow Naval, he always said Twitter makes him a better writer. So we definitely know also that tweets, in short form, condense information better. And if their model trains on that, obviously taking all the precautions we talked about before, bots and s**t posting, et cetera, if they're able to actually get this into the model, likely their model will be more up to date and more fine tuned to reactions.</p><p>61:20 	(Speaker A)	So with that, I want to close. We'll see about xAI. It's definitely exciting, right? We're potentially getting another big one, potentially an open source one.</p><p>61:20 	(Speaker A)	So we'll see. I'm going to wrap up this update, and I think the next one I want to move on to. Matt, let me know if you're still around, if you want to cover.</p><p>61:20 	(Speaker A)	So we have Matt, who introduced himself in the beginning. So I'll let you do this quickly again, and then we're going to talk about the project whose GitHub stars are rising, which I think is super cool.
And I invite you to give us a little bit of an interview about this.</p><p>62:16 	(Speaker A)	Go ahead, Matt.</p><p>62:17 	(Speaker D)	Yeah, sure. So I'll try to summarize it a bit better than the last time; a lot of practice. Very long story short: co-founder and CEO of OthersideAI, creator of HyperWrite, and a number of other things. Basically, we've been around for a number of years now.</p><p>62:30 	(Speaker D)	We're one of the first companies in the space working with LLMs. The goal always has been to build a personal assistant that scales to everybody, just like a real human personal assistant, but at scale, way cheaper, digital. The tech wasn't there at the beginning, so we built other products to sort of learn and gather resources, whether that's users, revenue, a bunch of other things, so that we can do what we do today.</p><p>62:50 	(Speaker D)	Today we are actually building that personal assistant. So an AI that can operate a computer, any software, to do what a human can do on pretty much anything.</p><p>62:53 	(Speaker D)	So it'll help you with your tasks. It's very simple. Today it's a Chrome extension that lets you control Chrome just by talking to it.</p><p>62:53 	(Speaker D)	So you could say, go order me a pizza, or go send this person an email, or go filter my email, or anything else. It works okay today. The idea is that over time it's going to get a lot better, a lot cheaper, a lot faster, to the point where six months from now, a year from now, it might actually be as good as, if not better than, a human on many tasks.
But that being said, while I work on this, I also like to learn about getting the most out of these technologies, because they're so fast moving and you really have to stay on top of it to be effective, or you...</p><p>63:34 	(Speaker A)	Can come every week and stay up to date with us together. But yeah, go ahead.</p><p>63:40 	(Speaker D)	Exactly. I mean, a lot of what I do to learn, really, is just build things that I find interesting, and I find that often, even if I'm not expecting it, a lot of those learnings do translate to stuff we're doing at OthersideAI. So this sort of just came out of that. Happy to dive into the project, or if you want to sort of...</p><p>63:56 	(Speaker A)	Stop me. Let's pause here for a second, and I'll just tell folks that I pinned Matt's tweet from a couple of days ago with the introduction. Since then you got a few thousand stars, I think, on GitHub, and we're going to talk about the GPT Prompt Engineer project, the different reasons why Matt and folks wrote this, and what it's here to serve. So maybe give us an introduction to GPT Prompt Engineer, what made you come up with this, and how it works. Yeah, go deep, man.</p><p>64:29 	(Speaker A)	Sure. Yeah.</p><p>64:30 	(Speaker D)	So forgive the rambling in advance. Essentially, I find prompt engineering so fun. I've been doing it pretty much every day for everything, honestly, to the point of excess, from what I would do for work to having it decide what I'm making for dinner, for years now.
And as I've gone through this process of learning how to use these models, it's become very clear that, especially as these models evolve, there's no best practice for anything.</p><p>64:54 	(Speaker D)	Prompts change, ways to prompt change. Something that works for one task might not work for a very similar task. And the only way to get out of that is to build an intuition for the model and try a lot of things, but that doesn't always work perfectly.</p><p>65:01 	(Speaker D)	And also you don't really know what works and what doesn't, even when you're trying things, right? You have to do it in a very scientific way, but there's no real right answer to anything. It's kind of like alchemy.</p><p>65:18 	(Speaker D)	So I started thinking, I think this was right when GPT-4 came out, I was using GPT-4 pretty often to just ideate prompts. I would say, here's what I'm trying to do.</p><p>65:20 	(Speaker D)	I would say, write a prompt for me, and I would use the ideas from that to help me improve my own prompts. And that actually got a lot of interest. We ended up building something similar to that into the HyperWrite platform. At the time it was really cool, but it really wasn't something that would replace what I do every day, which is really hardcore prompting.</p><p>65:43 	(Speaker D)	Eventually I was just thinking about it, and I think this was on the 4th of July, I was just sitting there kind of thinking, what if we tried it? And I started thinking about how you could design a system that actually comes up with good prompts?
Not just a prompt that does the job, but something that's actually optimal. Because as humans, right, we can only try so many things at once. But the magic of these LLMs is they're creative and they think faster than we do. In the time that I could write half a prompt, an LLM could write 50 or 100.</p><p>65:48 	(Speaker D)	And what if you could leverage that? Because even if the average prompt isn't very good, you're going to luck into one or two that happen to be exceptional for your task. So I started by doing it actually with a classifier. I only released this notebook yesterday, just because it's like a step on the road.</p><p>65:48 	(Speaker D)	And what we ended up using it for was actually something at OthersideAI, where we needed to build a classifier for something with the personal assistant. And I just wasn't getting good performance out of the prompts that I was writing. So I said, f**k it, what if we have the AI try to do this? And I built this so that essentially I describe the task and I give it some test cases, so I'll give it some true/false test cases.</p><p>66:11 	(Speaker D)	Because the classifier was classifying things as true or false. It was like, classify the statement as true or false. And if it was "New York is in America", it would be true.</p><p>66:54 	(Speaker D)	If it was "New York is in Paris", it would be false. And I basically created like ten or twenty of these test cases, I described the task, and I had GPT generate something like, I think, twenty or so prompts.</p><p>66:57 	(Speaker D)	And surprisingly, the quality of them just at first glance was pretty good, right? It was kind of shocking, considering I spent so much time trying to do this manually.
Then what I did was I basically had each of these prompts tested against each of these test cases. And I plotted the success of each, and it turns out some of them actually outperformed what I did.</p><p>66:57 	(Speaker D)	I was kind of shocked, right? Like, you wouldn't expect that, especially doing this for years.</p><p>67:30 	(Speaker A)	Just to recap real quick on this: the GPT-4 generated prompts, I assume that's what you're using, actually performed better than Matt Shumer's prompts. And Matt Shumer is the founder of a prompt company, with a lot of prompt use cases for a long time, from GPT-3 to 4, et cetera. And some of the ones that it came up with performed better than yours.</p><p>67:52 	(Speaker D)	Yeah, it was kind of scary. Some of them performed way worse. But the idea is that you're going to luck into something that is better. Maybe two out of twenty will be better, but they're great.</p><p>68:02 	(Speaker D)	So I was just so fascinated by this, I was like, how do you take this further? Because classification is one thing, but real prompts, where you're actually having it generate text, those are harder. How do you judge that? You could use GPT-4 to judge them, right? If you have two prompts and you have each of them generate something, and they give you their responses and you want to know which is better, you can ask GPT-4. And so I figured we could apply that.
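</p><p>The classifier evaluation described here, generate candidate prompts, score each one against labeled true/false test cases, and rank by accuracy, can be sketched roughly like this. The names and the call_model stub are illustrative, not the actual gpt-prompt-engineer code; a real version would send the prompt and statement to an LLM.</p>

```python
# Sketch of the classifier-style evaluation described above: each candidate
# prompt is scored by how many labeled true/false test cases it gets right.
# `call_model` is a stub standing in for a real LLM API call.

def call_model(prompt: str, statement: str) -> str:
    """Stub: pretend the model answers 'true'/'false' for a statement."""
    # A real implementation would send `prompt` + `statement` to an LLM.
    return "true" if "America" in statement else "false"

def score_prompt(prompt: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases where the model's answer matches the label."""
    hits = sum(call_model(prompt, stmt) == label for stmt, label in test_cases)
    return hits / len(test_cases)

def rank_prompts(prompts: list[str], test_cases) -> list[tuple[str, float]]:
    """Return (prompt, accuracy) pairs sorted best-first."""
    return sorted(((p, score_prompt(p, test_cases)) for p in prompts),
                  key=lambda pair: pair[1], reverse=True)

test_cases = [
    ("New York is in America", "true"),
    ("New York is in Paris", "false"),
]
candidates = ["Classify the statement as true or false.",
              "Answer true/false: is the statement correct?"]
ranked = rank_prompts(candidates, test_cases)
```

<p>With a ground-truth label per case there is no need for a judge model; plain accuracy is enough to plot the success of each prompt.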
</p><p>68:29 	(Speaker D)	Turns out there are some issues with that, and there are some papers written about this, where essentially it'll favor the one that's on the bottom. So you just do it twice, flip the order, and see if one wins. I took that approach and combined it with an ELO-style tournament, where essentially you have each of them go head to head, one on one, and each of them gets its ELO score bumped up or down based on whether it wins, loses or draws.</p><p>68:53 	(Speaker A)	Can you give two sentences on ELO scores as a concept? Yeah.</p><p>68:57 	(Speaker D)	I'm actually not super familiar with it. Funny enough, I had GPT write the code for that part. But basically, think of it like a ranking system in chess or a video game, where you have two people competing, and the one that wins gets their score increased by x, the one that loses gets their score decreased by x.</p><p>69:18 	(Speaker D)	And it's also weighted based on the previous scores. So if somebody with a high score beats somebody with a very low score, their score won't increase that much, because they're very likely to win. So it's a weighting system to help figure out what's best, instead of just getting a clear cut "yes, this is right" or "no, this isn't", which you can do with classifiers, because there is a right and a wrong ground truth answer.
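</p><p>The order-flipped judging and the ELO update described here can be sketched as follows. The judge function is a stand-in for the real GPT-4 comparison call (here it simply prefers the longer output); the update formula is the standard Elo one.</p>

```python
# Sketch of the ELO-style tournament described above. `judge` is a stub for
# the GPT-4 comparison call; to cancel position bias the comparison is run
# twice with the order flipped, and only a consistent verdict counts.

def judge(output_a: str, output_b: str) -> str:
    """Stub judge: prefers the longer output. A real version asks GPT-4."""
    return "A" if len(output_a) > len(output_b) else "B"

def debiased_compare(a: str, b: str) -> float:
    """Score for A: 1 win, 0 loss, 0.5 draw (verdict flipped with order)."""
    first = judge(a, b)                 # A shown first
    second = judge(b, a)                # order flipped: A shown second
    if first == "A" and second == "B":  # A wins in both orderings
        return 1.0
    if first == "B" and second == "A":  # B wins in both orderings
        return 0.0
    return 0.5                          # inconsistent verdicts: call it a draw

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo: expected score from the rating gap, K-scaled step."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

ratings = {"prompt_short": 1200.0, "prompt_long": 1200.0}
outputs = {"prompt_short": "ok", "prompt_long": "a much more detailed answer"}
s = debiased_compare(outputs["prompt_short"], outputs["prompt_long"])
ratings["prompt_short"], ratings["prompt_long"] = elo_update(
    ratings["prompt_short"], ratings["prompt_long"], s)
```

<p>The weighting Matt mentions falls out of the expected-score term: beating a much lower-rated prompt moves the winner's rating only slightly.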
</p><p>69:39 	(Speaker D)	I just had each prompt generate for a test case, and the opposite prompt, the competition prompt, would generate for that test case. So it was a little bit complex, and they would have the model judge which one was better. And it's expensive, right? It might cost like $20 in GPT calls to get to an answer. But it turns out, at the end, the prompts again were just kind of blowing me away.</p><p>70:04 	(Speaker D)	Awesome creativity in them. Like the words it used, the trigger words. It didn't do what I would do, and in a really good way.</p><p>70:10 	(Speaker D)	And it also opened up my eyes to new ways of prompting that I never would have thought of and that just aren't standard. And that's kind of the magic of all this. I think that this abstracts away the atomic level of prompts, right? You talk about a prompt as a prompt in and of itself, and then a system built around the prompts, with many prompts working together.</p><p>70:31 	(Speaker D)	This makes it so that you don't have to guess about: do I have the best prompt for this single atomic part of our system? Where the magic really comes in, then, is how do you string these amazing, individually AI-crafted prompts together to make something that actually works really well.</p><p>70:46 	(Speaker A)	And how you robustly build the evaluation system, right? Because the classifier is a simple example of evaluating, because maybe you know the answer, et cetera. But how do you actually scale up the evaluation system such that this could potentially run in loops and then generate the best of the best prompts for a task?
</p><p>71:03 	(Speaker D)	Exactly.</p><p>71:03 	(Speaker A)	That's also a very interesting piece. How do you think about evaluation going forward?</p><p>71:08 	(Speaker D)	Yeah, so I think it's sort of like that, where you could have this thing run in the loop three times, take the three winners, and then have GPT read those winners, right, and be like: here are prompts that worked really, really well, here are the test cases where they failed; now I want you to write new prompts that take what's good about these but also mitigate the failure cases, and generate a whole new set of prompts. Sort of like evolution. It really doesn't have to stop at one point in time after the first run.</p><p>71:37 	(Speaker D)	It's like, let's learn from what these amazing ones still did wrong, and continue to make this better and better. Obviously, this relies on a relatively large test set. I'm also experimenting with ways where you can have the test set autogenerate, but that's a little bit finicky.</p><p>71:50 	(Speaker D)	But I do think that sort of evolution could lead to some really exceptional prompts. And what I found was, even on the first run, I was seeing it outperform myself. For example, there was a classifier we were using GPT-4 with logit bias to do, because it was such a hard challenge, and we were getting some like 90% accuracy.</p><p>71:50 	(Speaker D)	I had it write these prompts with GPT-4, but then I had it run them using GPT-3.5, and it got 96%.
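</p><p>The evolutionary loop outlined here, keep the winners, collect their failures, and regenerate, might look like this in outline. This is a sketch, not the project's code: generate_variants and score_prompt are stubs standing in for the GPT calls.</p>

```python
# Outline of the evolution loop described above: keep the top prompts,
# collect the test cases they still fail, and ask the model for new
# candidates informed by those failures. Both helpers are stubs; a real
# version would call GPT-4 for generation and run real evaluations.

def score_prompt(prompt: str) -> float:
    """Stub fitness: rewards prompts that ask for step-by-step reasoning."""
    return 1.0 if "step by step" in prompt else 0.5

def generate_variants(parents: list[str], failures: list[str]) -> list[str]:
    """Stub for the 'write new prompts that fix these failures' GPT call."""
    return [p + " Think step by step." for p in parents]

def evolve(seed_prompts: list[str], generations: int = 3, keep: int = 2) -> str:
    population = list(seed_prompts)
    for _ in range(generations):
        ranked = sorted(population, key=score_prompt, reverse=True)
        winners = ranked[:keep]
        # In a real run, these would be the test cases the winners failed.
        failures = ["cases the winners still got wrong"]
        population = winners + generate_variants(winners, failures)
    return max(population, key=score_prompt)

best = evolve(["Classify the statement.", "Is this true or false?"])
```

<p>As Matt notes, the loop only helps if the test set is large enough that "failure cases" are meaningful signal rather than noise.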
</p><p>72:19 	(Speaker A)	We've talked about this pattern before, where you can outsource the hard work to GPT-4, but then, once you get really good at prompting, GPT-3.5 is actually very decent at many things, and it's way faster, cheaper, and has a 16K context now that you can use. And so we've seen this pattern with many folks: if you don't need the full power of GPT-4, HumanEval for coding, et cetera, you can go to GPT-3.5 and get very far along, especially as you're getting better prompts. And now, Matt, you have like a recursive crafter helper guy that's here. And my next question for you is, have you used anything else? So you mentioned GPT-3.5, where you run the prompts. Have you tried them on different models, like Claude maybe, or the open source Llama ones?</p><p>73:07 	(Speaker D)	I actually haven't, just because I wanted to see if this worked. It was sort of just an interesting thing for me, and my time is really focused on OthersideAI and the personal assistant. But it wouldn't be hard to get Claude in. I suspect Claude prompts would perform better on Claude, and OpenAI prompts would perform better on OpenAI, just because the models respond to prompts very differently.</p><p>73:18 	(Speaker D)	Claude is sort of like a more emotional thinker. OpenAI is more of a logical thinker. It's a very simple, not perfect analogy, but I suspect you'd want to stick within the...</p><p>73:36 	(Speaker A)	Ecosystems, maybe. Not to mention Inflection's Pi, which is like a whole different beast.</p><p>73:41 	(Speaker D)	Yeah, that's an interesting one.
</p><p>73:44 	(Speaker A)	We discussed Pi a couple of times and I've seen some reactions, but maybe at the end of this, if we have time. Matt, one question I have for you on this, and then I think we'll move on: where can folks find more of this work? Is it open source? Are you looking for contributions, if you are? And yeah, just give us a wrap-up of this project.</p><p>74:07 	(Speaker D)	Yeah, so you can find it on GitHub. It's called GPT Prompt Engineer. Currently there are two notebooks; it's all done in Jupyter notebook format, so it's pretty easy to edit. One is for the classification system, the other is for the generation system.</p><p>74:20 	(Speaker D)	We're honestly at a point where it works well, so it's like, what do you build around it? One thing that's missing is that the classification version only supports true and false labels, but it's not hard to use tiktoken, or whatever it is, to allow it to support arbitrary labels like happy, sad, angry, whatever. That's probably a 20-minute addition that, if somebody goes in and does it, opens up a whole new set of use cases. The evolution idea that I mentioned before, right, taking the best prompts and then saying, here's where it went wrong on these test cases, and then throwing it back to GPT and having it generate more and rerunning it, that's interesting.</p><p>74:45 	(Speaker D)	The ability to use Claude would be awesome, if anybody wants to add that. I could even see it evaluating each prompt on each model, right? Because right now we only generate with GPT-4, we only evaluate with GPT-3.5.</p><p>75:19 	(Speaker D)	But imagine if you generate half of them with GPT-4 and half of them with Claude, and then you evaluate each prompt on GPT-4, GPT-3.5 and Claude.</p><p>75:27 	(Speaker D)	And you can see the latency and success rates for each, along with scores. I think all that would be super interesting. I'm also just open to ideas.</p><p>75:40 	(Speaker D)	I'm not really supporting this at all, so if anybody wants to take it and run with it, I am all for that. Also, a shameless plug, or a thing that we're looking for, just because I have an audience here: we at OthersideAI and HyperWrite are really looking for somebody to help on back end, hopefully with security expertise. And then also, if anybody is experienced in training machine learning models, I would love some help there, because we're doing a lot of LLM training.</p><p>75:55 	(Speaker A)	So just a quick thing, and also to add that now, with the prompt engineer that's automated, the results of this would likely generate a great data set that you can use to continue fine tuning, especially as GPT-4 fine tuning is coming soon. So Matt, definitely store everything you generate, with the ELO score and everything, from the GPT Prompt Engineer runs, and maybe there's going to be a path forward to actually fine tuning a prompting model, which could be, exactly. Well, yeah, exactly.
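</p><p>The cross-model matrix imagined here, generate prompts with several models, then score every prompt on every model while tracking latency, could be sketched like this. The model clients are stubs standing in for real GPT-4, GPT-3.5 and Claude calls, and the quality numbers are made up for illustration.</p>

```python
import time

# Sketch of the cross-model evaluation matrix described above: run each
# prompt on each model, recording a score and a latency for every cell.
# The clients below are stubs returning (output, quality) pairs.

MODELS = {
    "gpt-4":   lambda prompt: ("good answer", 0.9),
    "gpt-3.5": lambda prompt: ("ok answer", 0.7),
    "claude":  lambda prompt: ("fine answer", 0.8),
}

def evaluate_matrix(prompts: list[str]) -> dict[tuple[str, str], dict]:
    """Run every prompt on every model; return score and latency per cell."""
    results = {}
    for prompt in prompts:
        for name, client in MODELS.items():
            start = time.perf_counter()
            _output, score = client(prompt)
            latency = time.perf_counter() - start
            results[(prompt, name)] = {"score": score, "latency": latency}
    return results

matrix = evaluate_matrix(["Summarize this email."])
best_cell = max(matrix, key=lambda cell: matrix[cell]["score"])
```

<p>From such a matrix you could pick, per task, the cheapest model whose score clears a threshold, which is exactly the GPT-4-to-GPT-3.5 handoff pattern discussed above.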
</p><p>76:28 	(Speaker D)	Imagine taking a prompt, and taking one that has a slightly higher score, and fine tuning a model to take the initial prompt and output the one that has a higher score. And you can do that evolutionarily and continue to get better prompts, in theory.</p><p>76:40 	(Speaker A)	Awesome. So folks, if you want to work in a cool place, HyperWrite, hit Matt up, and also check out GPT Prompt Engineer on GitHub. Thanks for coming. Feel free to stay and continue commenting and talking with us as we go through a bunch of other updates that we have.</p><p>76:57 	(Speaker A)	Just a quick check with NISten, who promised me to follow Twitter and see if anything new comes up, breaking news as we talk. I haven't seen anything besides the space of xAI.</p><p>77:04 	(Speaker A)	I will draw people's attention to the last pinned tweet, from Dr. Jim Fan, that talks about the context length dip. Matt, you also touched on this context length dip. It's basically a paper, I think.</p><p>77:22 	(Speaker A)	From Stanford, I'm not sure, that figured out that even longer context windows have a dip in the middle. Which means that at the beginning of the prompt and at the end of the prompt, the model pays more attention to what you actually asked it to do, or the details that you provide; in the middle there's like a dip.</p><p>77:39 	(Speaker A)	And this was also released this week. However, the one thing I said previously I will repeat here: Claude, and some folks who know about context windows way more than me, they say that Claude is actually really good at this, without the dip.</p><p>77:54 	(Speaker D)	Yeah.
It's an interesting paper. I feel like it's sort of saying, hey, if you train on marketing copy, then it's going to be worse at coding, obviously. Right?</p><p>78:03 	(Speaker D)	We do a lot of long context stuff at Otherside. That's actually what I'm focused on right now, training really long context massive models. And if you train it on data where there's context in the middle that matters, it is going to be good at that.</p><p>78:16 	(Speaker A)	Interesting. So what you're saying — I think I've seen this kind of opinion before as well — is that it's just the outcome of the data that was fed in. And for blog posts and other places, people want to hook your attention in the beginning and then kind of finish strong. Basically, you're saying that this is potentially an outcome of that, and not necessarily the tech behind it.</p><p>78:38 	(Speaker D)	Yeah, I believe so. I mean, who knows, maybe I'm wrong, but from my experience, right — that's why I was giving that analogy before — if you train it to do one thing and then you're asking it to do another, it's not going to do that other thing as well. And I'm guessing the data set that they did this evaluation on was something that didn't have a ton of information in the middle. Part of the reason that so few of the language model companies have super long context length models, and why it was such a big deal that Anthropic did, is because a lot of the challenge in training them isn't actually in training them, it's in the data.</p><p>79:08 	(Speaker D)	Obviously, inference becomes a challenge. It's the cost and the overhead there. But the data to sort of do this is really sparse.</p><p>79:10 	(Speaker D)	It's not very available, right?
So that's, I think, part of it, right? There's not just like a standard data set that has super long context length, that has information in the middle.</p><p>79:25 	(Speaker D)	We do, actually. We've been building one at Otherside, and that's sort of given me some of the ideas that I'm spouting here. But my guess is that, for Anthropic, part of the reason theirs works is because they focused on the data. The data is really important.</p><p>79:38 	(Speaker A)	Right.</p><p>79:39 	(Speaker D)	I will say, with their model, it's just fine-tuning.</p><p>79:41 	(Speaker A)	Yeah. I will say, when I got access to Claude's window, I did a bunch of tests with my Twitter data. I just pasted a bunch of JSON with Twitter numbers, Twitter ID numbers. And the smaller model, the not-100K one, gave me back results that actually didn't invent those numbers.</p><p>79:57 	(Speaker A)	The 100K model got lost in the middle and started inventing those numbers. I literally saw this difference between the longer-context one and the previous one, and I thought it's because it loses some context in the middle. And I need to retry this on the new ones, because with the new ones, they claim this doesn't happen.</p><p>80:01 	(Speaker A)	I want to go to Al, and yeah, one of you, I think, raised your hand first to talk about the context length dip and that paper, if you have read it, if you have thoughts, and if you have noticed this as well.</p><p>80:29 	(Speaker F)	I just had a quick question for Matt about the differences that he found in prompting between, say, Claude and GPT-4.
I noticed the prompts aren't really reusable, and maybe you could speak to that in the general case.</p><p>80:42 	(Speaker A)	Yeah, let's end with maybe this question and move on to the other updates that we have. Go ahead, Matt.</p><p>80:48 	(Speaker D)	Yeah, sure. So it's like talking to two people with two different personalities, right? They're both people, but they respond differently to the different ways you're sort of prompting them, if you will. Claude is sort of more emotional, I guess, where OpenAI is sort of more logical.</p><p>81:03 	(Speaker D)	And it's hard to pin that down to any one thing, and it's hard to give you techniques based on that because, again, every use case is very different. But very clearly, you have to prompt them differently. I think also, talking about the idea of fine-tuning a prompting model, what would be very interesting is fine-tuning a model that takes an OpenAI prompt and converts it to the idealized version of a Claude prompt, and vice versa. I mean, I think that could be very powerful, because there are ways to sort of intuit your way there.</p><p>81:29 	(Speaker D)	It's just hard to distill into a set of rules. One thing I found actually quite interesting with Claude 2 is that it is insanely resistant to jailbreak attacks. Still, I was able to get it to do it.</p><p>81:44 	(Speaker D)	Turns out the stupidest method worked.
It was sort of like modifying that DAN prompt that's been going around Reddit, but the more nuanced, complex methods that typically work with OpenAI, they didn't. So I think the model is just qualitatively different.</p><p>81:56 	(Speaker D)	I think it's going to take some time to fully explore it and understand why and how. Still super early days.</p><p>82:06 	(Speaker A)	I love the fact that all of us are getting an intuition about different models and how to approach them, right? And that's — swyx was here before — like a specialization of what I think he talked about as an AI engineer. We're starting to understand the differences between those, down to the fine little things that you can say.</p><p>82:11 	(Speaker A)	And I think it will be very interesting if you have a model that's trained to actually convert them, or translate them between the models, to work the same. I have an idea, for how not to get locked into the GPT-4 ecosystem with the functions: I have an idea of wrapping the GPT-4 API package with something</p><p>82:47 	(Speaker A)	that will actually kind of print the functions into the context, because Claude now has a huge context window. And then try to see whether or not Claude is able — without additional tech, without additional changes to the API — to replicate the outputs of how GPT with functions would do. And that's going to be an idea I'll be testing, hopefully, and talk about next week.</p><p>83:08 	(Speaker A)	Thanks, Matt.
</p><p>83:10 	(Speaker C)	There has been a thing today — maybe yesterday, but anyway, today — there has been a model that generates prompts. By the way: by giving the data, you generate the prompt. I've written about it today on Twitter. It is so powerful, it is such a cool method, that you can take whatever you have — like, I don't know, scientific papers — and generate instructions for them.</p><p>83:32 	(Speaker C)	Now you can fine-tune a model that generates scientific papers. You've got jokes? Now you can train a model that becomes funny.</p><p>83:35 	(Speaker C)	You can generate the instructions, convert whatever you want into instructions. It's amazing. One more thing about the dip in the middle thing.</p><p>83:51 	(Speaker C)	I don't know why it happens. I have no idea how OpenAI trained their models. But if you think about it, many instructions are: a paragraph, and before the paragraph you tell the model, please summarize the following; or, on the contrary, a paragraph, and at the end, "what was that?" or something.</p><p>84:10 	(Speaker C)	So it makes a lot of sense that a model pays a lot of attention to the beginning and the end, because of this. And on the same note, it's very easy to fix. So I wouldn't just point fingers.</p><p>84:21 	(Speaker C)	It's good that they pointed it out, but I think it's, like, I don't know, a couple of minutes of training. OpenAI could, like, fine-tune for a minute and fix it.</p><p>84:28 	(Speaker A)	I just want to ask Yam. Yam,
the tweet that I just pinned — sorry, the tweet that I just pinned on top — this was the one that you talked about, the instruction generation and the prompt generation?</p><p>84:38 	(Speaker C)	Yeah.</p><p>84:39 	(Speaker A)	Awesome. So folks, definitely feel free to check this out. I haven't seen this. Do you want to give a couple more words about that one? It looks like you wrote, like, a very deep dive.</p><p>84:44 	(Speaker A)	What's the model, like 11B, 3B?</p><p>84:54 	(Speaker C)	Sure. Two models — but put into the models whatever you want. Okay, let's go back. You've got a data set of something — emails from your company, for example — and you want a model that will help you write emails.</p><p>85:01 	(Speaker C)	Okay, you can start thinking about how to train this model, or you can use this and now generate a text that basically says, help me write the following email to this following person, something something, and then the actual email. And all of a sudden, you have a data set to train a model — or to fine-tune, or whatever — that is extremely tuned to this. So I think it's a very cool technique.</p><p>85:40 	(Speaker C)	It's very powerful, has a lot of potential. And the trick, in simple words, is training the model what not to say. That's the missing piece here, the trick that they added.</p><p>85:51 	(Speaker C)	They took instructions and outputs that do not fit — just a different, random output from the data — and trained with a different loss: that the model should not say this, because this input with that instruction does not result in this output. That's it.
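</p><p>The trick described here — pairing instructions with deliberately mismatched outputs and training on them with a different loss — could be sketched like this. This is a toy illustration, not the paper's actual implementation; pair_loss is a hypothetical stand-in for a per-token unlikelihood objective:</p>

```python
import math
import random

def build_pairs(dataset):
    """From (instruction, output) pairs, keep the matched pairs as
    positives, and pair each instruction with a random *wrong* output
    drawn from the same data as a negative."""
    outputs = [out for _, out in dataset]
    positives = [(ins, out, +1) for ins, out in dataset]
    negatives = []
    for ins, out in dataset:
        wrong = random.choice([o for o in outputs if o != out])
        negatives.append((ins, wrong, -1))
    return positives + negatives

def pair_loss(p_output_given_instruction, label):
    """Ordinary likelihood loss for matched pairs; for mismatched pairs,
    an 'unlikelihood' term that pushes their probability down."""
    if label == +1:
        return -math.log(p_output_given_instruction)
    return -math.log(1.0 - p_output_given_instruction)

pairs = build_pairs([
    ("Summarize this paper.", "A short summary..."),
    ("Tell me a joke.", "Why did the chicken..."),
])
```

<p>The point of the negative pairs is exactly what's described above: the model learns that a given input with a given instruction should not produce an arbitrary other output from the data.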
</p><p>86:11 	(Speaker C)	That's the trick. And it works perfectly, and it's really cool.</p><p>86:17 	(Speaker A)	Awesome. I have some folks who want to come up and ask questions. I think we're almost there in terms of the updates. I will just briefly run through some updates.</p><p>86:18 	(Speaker A)	I don't even have time to go and look for the threads, but if you're not following llama.cpp, follow Georgi. He is one of the greats that we have in these spaces. I think he is single-handedly in charge of so many folks trying to get a MacBook, because it's incredible how much performance they've been able to squeeze out of Llama, and it's competitive.</p><p>86:49 	(Speaker A)	And many people just, like, quantize their models — basically make them smaller — to run on this GGML platform that they have. The recent news that I have from over there — there's like two pieces of news. Last week, for those of us who were here last week, we talked about CFG</p><p>86:58 	(Speaker A)	— I forgot something; I forgot — the guidance scale. We talked about the CFG parameter moving over from the diffusion models that we know.</p><p>87:17 	(Speaker A)	Like, in Stable Diffusion, you can define how close to your prompt the model should generate the image. Somebody — I think in a discussion — said, hey, can we have this control of CFG in our LLM generation? CFG is the classifier-free guidance scale, something like that.</p><p>87:37 	(Speaker A)	And they did it. The chad Georgi added this to llama.cpp. And so now you can actually kind of pass a CFG control and fine-tune.
</p><p>87:48 	(Speaker A)	It's almost like a running fine-tune, to an extent. You can tell the model to be closer to, or farther away from, the prompt that you have. Contrast this with the stuff that we have on the GPT-4 API, which is temperature.</p><p>88:01 	(Speaker A)	And I think, Matt, you mentioned something too — logit bias, something like that, right? Where you can ask it not to say certain things. So contrasting with CFG, it's like a different beast; we now have a different control. And GGML just merged this into their platform.</p><p>88:18 	(Speaker A)	Definitely worth checking out. And the second thing is — I need to find the tweet — yesterday, Georgi was like, oh yeah, by the way, here's the 48% inference speed improvement that somebody just merged in.</p><p>88:30 	(Speaker A)	Have you guys played and tried this? For the 33-billion-parameter Llama model, somebody just merged in a ~50% increase in inference speed, just like that. And I find this incredible, because GGML already runs so much stuff on Raspberry Pis or whatever, iPhones — and now somebody's like, oh yeah, here's a 50% increase in inference speed.</p><p>88:41 	(Speaker A)	And then, I think Nisten was here before — he was talking about how GGML runs on the iPhone, because iPhones, even from three years ago, have the same neural engine chip as, like, the latest Macs or some such, and this performance boost on GGML also applies to iPhones as well. So, incredible stuff. And as we hear every week, we keep seeing leaps, incredible leaps, in speed and performance.</p><p>89:15 	(Speaker A)	Definitely worth checking out GGML and the fine folks that work on this stuff.
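</p><p>The two controls being contrasted here can be sketched in a few lines of Python — classifier-free guidance interpolating between prompt-conditioned and unconditioned next-token logits, versus an OpenAI-style logit_bias that just adds fixed offsets to chosen tokens. The numbers are made up for illustration, not from any real model:</p>

```python
def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance on next-token logits: scale 1.0 gives the
    ordinary prompt-conditioned distribution; larger values push generation
    closer to the prompt, smaller values push it away."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

def apply_logit_bias(logits, bias):
    """OpenAI-style logit_bias: add fixed offsets to chosen token ids,
    e.g. a large negative value to effectively forbid a token."""
    return [logit + bias.get(i, 0.0) for i, logit in enumerate(logits)]

# Toy four-token vocabulary.
cond = [2.0, 0.5, -1.0, 0.0]      # logits given the full prompt
uncond = [0.5, 0.5, 0.5, 0.5]     # logits with the prompt dropped

guided = cfg_logits(cond, uncond, guidance_scale=2.0)   # [3.5, 0.5, -2.5, -0.5]
banned = apply_logit_bias(cond, {2: -100.0})            # token 2 effectively forbidden
```

<p>This shows why they are different beasts: CFG reshapes the whole distribution relative to the prompt, while logit bias only nudges specific tokens.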
GGML contributors, folks who use llama.cpp, feel free to hop up and raise your hand and give us more updates from that realm. Junaid —</p><p>89:28 	(Speaker A)	you're a regular at these spaces, but sometimes as a guest as well. Other than that, I think we'll move on to some more updates, and then we'll take questions. No? Cool.</p><p>89:41 	(Speaker A)	So the next update that I have is from the diffusion side, which we sometimes cover. We don't cover it often, but we do cover it from time to time. So, two things from Stability, Stable Diffusion.</p><p>89:46 	(Speaker A)	We talked about SDXL, the new XL model that can generate 1024×1024 images. We talked last week about the 0.9 weights dropping.</p><p>90:01 	(Speaker A)	SDXL 1.0 is now available in the Stable Diffusion Discord. If you've played with Midjourney before and then you looked at Stable Diffusion, it's like, it's not that great.</p><p>90:05 	(Speaker A)	Stable Diffusion's SDXL 1.0 is really impressive. And besides being really impressive, they plan to release it open source. So we're going to see a bunch of folks fine-tune LoRAs and specific versions for specific things.</p><p>90:16 	(Speaker A)	And I think it's, like, incredible. If you want to play with those models and you haven't yet, go to the Stable Diffusion Discord and hit up that bot, and then let us know how incredibly different that is. And we're waiting for the SDXL 1.</p><p>90:47 	(Speaker A)	0 weights to drop. And I will mention this every day until the year mark: it's been less than a year since Stable Diffusion.
</p><p>90:57 	(Speaker A)	It's been less than a year. I remember — I think it was August '22 when they actually dropped the full open source model. Less than a year.</p><p>91:12 	(Speaker A)	And we've seen just such incredible progress. So, like Matt said before, it's really hard to keep up, but it's also really hard to internalize just how far we've come, with those incredible leaps and changes every week. And again, to just plug this ThursdAI space:</p><p>91:21 	(Speaker A)	this is why we're here, every ThursdAI, talking about anything and everything that's changed and updated. And the other thing that I want to — I see Art in the audience, by the way.</p><p>91:28 	(Speaker A)	If you've played with SDXL, feel free to raise your hand to come up. The other thing that they released — I don't know if you guys are familiar with ClipDrop. So Stability bought ClipDrop as a company and started implementing that interface, compared to their Dream Studio interface.</p><p>91:49 	(Speaker A)	So ClipDrop is like a way simpler interface, and today they released something called Stable Doodle. Stable Doodle is — I don't know if folks in the audience remember this meme — how to draw an owl.</p><p>91:51 	(Speaker A)	Step one, draw a circle. Step two, draw some eyes. And step three is, like, draw the rest of the f*****g owl.</p><p>92:06 	(Speaker A)	And then you have, like, a beautiful owl painting at the end of this. This is now the go-to test for how the doodle models work. And I pinned my attempt at this, but definitely check out the ClipDrop Doodle thing.
</p><p>It's really fun to play with. So those are, like, the updates from the diffusion world.</p><p>92:10 	(Speaker D)	Hey, real quick. I was just looking at the repository for ComfyUI, and then I saw that — I don't know how to say his name — Scousekip is in here. So I just wanted to come on and say, like, hey, this is incredible.</p><p>92:24 	(Speaker D)	This is what we've been talking about for months now, right? This node-based character codex, if you will — there's just infinite possibilities. I just want to listen, but thanks</p><p>92:35 	(Speaker A)	for bringing me up.</p><p>92:36 	(Speaker D)	This is really cool, man. I was just — thanks for bringing up ComfyUI.</p><p>92:42 	(Speaker A)	I feel guilty at not being up to date on every single possible thing. I know it's impossible. I really try, and ComfyUI has been on my list to try, but then Claude was released and Code Interpreter was released. ComfyUI seems like the thing we want, man.</p><p>92:42 	(Speaker A)	I think Stability, when they tried to bring up Dream Studio, talked about, like, a node-based thing where you can pipe models into other models, you can add filters, et cetera. ComfyUI, for folks who have tested it out, it looks like that's it. And I definitely want to agree with Art.</p><p>93:16 	(Speaker A)	It's something to watch and maybe try, because Automatic1111, even though it's, like, super advanced and has been there from the beginning, since Stable Diffusion, it's just like a s**t show of a UX. Just, like, horrible, horrible. I'm sorry, guys.
</p><p>93:30 	(Speaker A)	I've built a web UI before, like Automatic1111's. It's really hard to get Gradio to do as much as you want. It's really hard to maintain a good UX product with many, many people contributing, with many, many things changing under your feet.</p><p>93:45 	(Speaker A)	So it's really not their fault, but it's a s**t show to get started with. And ComfyUI seems like a fresh, clean start. So definitely, if you're playing with this, test it out and let us know.</p><p>93:55 	(Speaker A)	Max, you have your hand raised, and you've played with SDXL. Give us some of your thoughts.</p><p>94:01 	(Speaker I)	Yeah, I have played through the website, in Dream Studio. I'm lately working with a company that makes toys for kids. They want to start incorporating AI. And one of my concerns, as we're working with them — okay, we want to generate images for kids; something that is probably going to freak them out — is two things that diffusion models have been lacking.</p><p>94:27 	(Speaker I)	One is the ability of painting complicated shapes, or intricate shapes like hands. SDXL is not better at it.</p><p>94:40 	(Speaker I)	Another one is this concept named concept bleeding, which is that diffusion models tend to mix objects that are similar in shape or form. It's not good at that, either. Now, I was reading the paper from Stability — or the report. They claim they are outperforming Midjourney in five of seven categories. Now, that's Midjourney 5.1, right?</p><p>95:12 	(Speaker A)	Just to make sure —
Midjourney has since released a new version, because we're at an insane pace, but yeah, they've compared to Midjourney 5.1. Yeah.</p><p>95:20 	(Speaker I)	Well, now, this is an internal report released by Stability. It's a paper; it might have some credibility, I don't know. I like the results. It's very close to Midjourney, but I think it is still one or two steps behind, in my opinion.</p><p>95:36 	(Speaker I)	What is different is what you have mentioned, Alex. Once they release the weights and we can see LoRAs for this, I'm expecting to see the results that we can get, because probably that is what is going to position this model, like, a step above Midjourney. But not yet. This is my opinion.</p><p>95:58 	(Speaker A)	Yeah, definitely. And thanks for that. And I love folks coming up and sharing their opinion about these things. I will say, on the top —</p><p>96:05 	(Speaker A)	thanks, Max. Or, I guess — I know you go by a new name, but I'm not sure if I can, if I should.</p><p>96:10 	(Speaker I)	Yeah, totally, totally, go ahead. I'm Juan, Spanish, living in Mexico, and I like these things.</p><p>96:17 	(Speaker A)	We appreciate you coming up here. On the topic of UIs that we've mentioned: somebody — or some folks — released Pinocchio. They call it the AI browser. And I want to highlight this because I want to give you practical tips. Junaid, I think, is coming in with some breaking news.</p><p>96:28 	(Speaker A)	I don't know if Junaid wants to come up, or can, but if you can, feel free to come up and tell us — there's some news from Bard.
Until we    </p><p>talk about Bard, the topic of UIs for those things, and you guys know</p><p>we're mostly focused on the LLM side and the Engineer side. Less than</p><p>there's a fusion, but we sometimes have love for both the above tool </p><p>that you can download and not deal with the terminal, not deal with  </p><p>the bunch of stuff, unifies all of them.                             </p><p>97:08 	(Speaker A)	It's really nice. Check out the Nokio AI browser. I think it's open  </p><p>source.                                                              </p><p>97:12 	(Speaker A)	You download this once, it's cross platform, Mac, PC, et cetera, and </p><p>then you're able to download Llama CPP, and then you're able to also </p><p>download table diffusion. And then fairly quickly, without knowing   </p><p>how to code, without going through the terminal, without installing  </p><p>packages, folks here know that installing the packages is like a     </p><p>whole pain we all share and we all hate without doing all of that.   </p><p>That's the promise that they have, you are able to pipe Llama outputs</p><p>into stable diffusion.                                               </p><p>97:38 	(Speaker A)	So Yam previously mentioned kind of the model that can do, and Yam   </p><p>and Method are talking about a method of generating prompts for LLMs,</p><p>but also we know that there's models prompts to actually generate    </p><p>prompts for diffusions and they're trained on different and fine     </p><p>tuned on different ways to generate diffusion prompts. Right, and    </p><p>this Pinocchio browser is actually allowing you to run like an and   </p><p>then pipe the output into stabilization model and then see the output</p><p>of that. I think it's incredible that this exists and is             </p><p>downloadable.                                                        </p><p>98:07 	(Speaker A)	I haven't tried this yet. 
If you in the audience, or somebody on stage, have tried Pinocchio, please raise your hand. I want to bring you up to talk about Pinocchio and your experience with it.</p><p>98:19 	(Speaker A)	And if we haven't, I want to bring this to our attention so that next week we're able to talk about it. This is added to my list of things, like ComfyUI, that I haven't tried yet.</p><p>98:29 	(Speaker A)	Anybody used Pinocchio yet? No? Cool. I wanted to get Cocktail Peanut, the guy who wrote it.</p><p>98:36 	(Speaker A)	If you're in the audience, feel free to raise your hand. I don't think you are, but feel free to follow the thread. He goes fairly deep.</p><p>98:44 	(Speaker A)	And feel free to use and try Pinocchio by next week, and then come up next week and talk about the differences between this and running Automatic1111. All right, folks, thanks everyone for coming to another ThursdAI space.</p><p>98:58 	(Speaker A)	Hope this has been helpful for a bunch of you. We tried a few new things here. We tried to give updates, but also deep dive into a conversation with Matt, and it looks, from the reactions here, like maybe this is worth putting down on paper and sending out as an email, for those of you who want to maybe sign up for this and don't have the time to listen to two-hour spaces. So I'll definitely try, at least, to do that.</p><p>99:19 	(Speaker A)	I want to thank a few folks on stage who have joined consistently and provide a lot of signal. Yam — follow Yam. He has great insights into models and training and different things. Al in the audience —
</p><p>Thanks always for coming up.</p><p>99:33 	(Speaker A)	Junaid is running the Denver meetup, and if you're in the Denver area, feel free to join us next week. Thanks for coming — haven't seen you in a while, buddy.</p><p>99:45 	(Speaker A)	Juan — sorry — yeah, Juan, great. Maxi and Lentos have recently been joining us.</p><p>99:51 	(Speaker A)	It's been great. We have some more folks in the audience who are regulars, and we invite you to also be regulars and come up and talk on ThursdAI. I will say this one thing: tag me in anything that's new.</p><p>100:01	(Speaker A)	I would love that. And help promote the message for other folks: if you did like the space, this also really helps more folks get to it.</p><p>100:01	(Speaker A)	For those folks whose questions I didn't get to, I apologize. I'm trying to keep this as a balance of a high-signal thing versus letting everybody ask questions as well.</p><p>100:22	(Speaker A)	Last thing I'll say is about myself: I consult a little bit. I stay up to date so you don't have to — that's my tagline.</p><p>100:29	(Speaker A)	If you're at a company that needs consultancy from somebody who's up to date on everything, I try to be that guy. Feel free to tap me in the DMs. And, yeah, ThursdAI folks, keep tagging us in everything that's new; we're going to try to cover it next week. With that,</p><p>100:34	(Speaker A)	I thank all of you. Thanks for coming. Thanks for giving us two and a half hours of your attention.
</p><p>100:34	(Speaker A)	I really appreciate it. Attention is sparse and very important, and I really thank everybody who gave us, like, two and a half hours. Thank you, folks.</p><p>101:00	(Speaker A)	Hey, Alex, we really appreciate you.</p><p>101:04	(Speaker B)	Thanks, Alex.</p><p>101:05	(Speaker H)	Thanks for doing a good space and keeping us on track, actually.</p><p>101:09	(Speaker A)	Yeah, thank you.</p><p>101:10	(Speaker D)	Yeah, Alex, definitely want to kind of</p><p>101:13	(Speaker A)	give our thanks to you as well</p><p>101:15	(Speaker E)	for curating an awesome space.</p><p>101:17	(Speaker D)	I think I'm definitely not the only one that gets a lot of good signal out of this. And I know a lot of hard work goes into keeping yourself up to</p><p>101:27	(Speaker A)	date so that you can share it</p><p>101:28	(Speaker E)	with all of us.</p><p>101:29	(Speaker D)	So, just on my own behalf, thank you. And I'm sure that is echoed by</p><p>101:34	(Speaker E)	a lot of people on stage and in the audience.</p><p>101:36	(Speaker A)	Humble, man. Thank you. I appreciate you. Thank you, folks. Have a nice Thursday, and see you next week.</p> <br/><br/>This is a public episode. 
If you'd like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://sub.thursdai.news/subscribe?utm_medium=podcast&#38;utm_campaign=CTA_2">sub.thursdai.news/subscribe</a>]]></description><link>https://sub.thursdai.news/p/thursdai-july-12-show-recap-notes</link><guid isPermaLink="false">substack:post:134776168</guid><dc:creator><![CDATA[Alex Volkov, swyx (Shawn), and Junaid Dawud]]></dc:creator><pubDate>Fri, 14 Jul 2023 02:33:41 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/134776168/a707863241124ad5b30b3e67a7224375.mp3" length="97972283" type="audio/mpeg"/><itunes:author>Alex Volkov, swyx (Shawn), and Junaid Dawud</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>6123</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/1801228/post/134776168/bf135b37b39d59a20fb5e8c965960b3c.jpg"/></item></channel></rss>