<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"><channel><title><![CDATA[TheAIRuntime.com]]></title><description><![CDATA[Free Field Guide to Production AI - https://substack.com/home/post/p-200392331
Production AI engineering, learned from those who ship <br/><br/><a href="https://theairuntime.com?utm_medium=podcast">theairuntime.com</a>]]></description><link>https://theairuntime.com/podcast</link><generator>Substack</generator><lastBuildDate>Sun, 07 Jun 2026 20:06:53 GMT</lastBuildDate><atom:link href="https://api.substack.com/feed/podcast/8325250.rss" rel="self" type="application/rss+xml"/><author><![CDATA[TheAIRuntime.com]]></author><copyright><![CDATA[Kranthi Manchikanti]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[theairuntime@substack.com]]></webMaster><itunes:new-feed-url>https://api.substack.com/feed/podcast/8325250.rss</itunes:new-feed-url><itunes:author>TheAIRuntime.com</itunes:author><itunes:subtitle>Free Field Guide to Production AI - https://substack.com/home/post/p-200392331
Production AI engineering, learned from those who ship</itunes:subtitle><itunes:type>episodic</itunes:type><itunes:owner><itunes:name>TheAIRuntime.com</itunes:name><itunes:email>theairuntime@substack.com</itunes:email></itunes:owner><itunes:explicit>No</itunes:explicit><itunes:category text="Technology"/><itunes:category text="Business"/><itunes:image href="https://substackcdn.com/feed/podcast/8325250/baae721b68fd55f77e936573095d07dd.jpg"/><item><title><![CDATA[Two Ways to Shrink an AI Model. Only One Keeps the Output.]]></title><description><![CDATA[<p><em>If your inference bill is climbing or you are running out of GPU memory, you have two ways to make a model smaller. Quantization cuts the most bytes but changes the model’s outputs, which is a problem for anything regulated or already validated. Lossless compression cuts about 30% of the bytes by re-packing the wasted space in BF16 weights, and the outputs come back bit-for-bit identical. The </em><a target="_blank" href="https://arxiv.org/abs/2504.11651"><em>DFloat11 research</em></a><em> confirms the 30% figure with zero accuracy change, and </em><a target="_blank" href="https://arxiv.org/html/2411.05239v2"><em>ZipNN</em></a><em> reports similar savings. The 30% is a fixed ceiling, not a knob, so treat it as a free one-time discount for BF16 workloads that are memory-bound and cannot tolerate changed output. ISIRO Runtime is one commercial product built on this technique, with vendor-reported numbers worth testing rather than trusting. Before you quantize anything, run a bit-exact diff on a compiled model and measure whether your decode path is actually memory-bound.</em></p> <br/><br/>This is a public episode. 
If you would like to discuss this with other subscribers or get access to bonus episodes, visit <a href="https://theairuntime.com?utm_medium=podcast&#38;utm_campaign=CTA_1">theairuntime.com</a>]]></description><link>https://theairuntime.com/p/two-ways-to-shrink-an-ai-model-only-49b</link><guid isPermaLink="false">substack:post:200517884</guid><dc:creator><![CDATA[The AI Runtime]]></dc:creator><pubDate>Sun, 07 Jun 2026 11:48:00 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200517884/387b0d42bde6adfffe9691cd4a7d5ef8.mp3" length="17865394" type="audio/mpeg"/><itunes:author>The AI Runtime</itunes:author><itunes:explicit>No</itunes:explicit><itunes:duration>1117</itunes:duration><itunes:image href="https://substackcdn.com/feed/podcast/8325250/post/200517884/880035721da36c6080d1f00d1c5137d8.jpg"/></item></channel></rss>