Home AI Lab: Source Signals One-Pager

The main risk for local runtimes is now interface compatibility

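[MED] This week brought no single model breakthrough. Instead, Ollama, llama.cpp, MLX, LM Studio, and Unsloth all touched structured output, reasoning, tool interfaces, GGUF/quantization, and hardware thermal limits at the same time. For a home AI lab, the main risk has shifted from "the model won't run" to "the interface semantics quietly changed after an upgrade."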

LOCAL

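llama.cpp keeps pushing its CLI and server toward a routable, tool-capable local runtime.

[HIGH] b9927 changes the CLI to an HTTP-based implementation with support for remote servers, router mode, and model aliases; b9957 then improves the server tools, removes apply_diff, and adds the tools_io/tools_io_basic abstractions.

Implication for local serving: if llama.cpp acts as an agent fallback or a multi-model routing node, the regression suite must cover tool schemas, edit behavior, and routing aliases, not just time to first token and tokens/s. URLs: https://github.com/ggml-org/llama.cpp/releases/tag/b9927 https://github.com/ggml-org/llama.cpp/releases/tag/b9957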
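One way to put tool schemas and routing aliases into a regression set is sketched below. It is a minimal illustration, not llama.cpp's API: the tool names, argument specs, alias table, and the shape of the parsed call (`name` plus `arguments`) are all hypothetical and would come from your own agent configuration.

```python
from typing import Any

# Hypothetical tool declarations: tool name -> {argument name: expected type}.
DECLARED_TOOLS: dict[str, dict[str, type]] = {
    "read_file": {"path": str},
    "write_file": {"path": str, "content": str},
}

# Hypothetical router aliases: alias -> model the alias should resolve to.
EXPECTED_ALIASES: dict[str, str] = {"fast": "small-q4", "careful": "large-q8"}


def check_tool_call(call: dict[str, Any]) -> list[str]:
    """Return contract violations for one already-parsed tool call."""
    name = call.get("name")
    spec = DECLARED_TOOLS.get(name)
    if spec is None:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    problems: list[str] = []
    for param, expected in spec.items():
        if param not in args:
            problems.append(f"{name}: missing argument {param!r}")
        elif not isinstance(args[param], expected):
            problems.append(f"{name}: argument {param!r} is not {expected.__name__}")
    # Arguments the declaration does not know about are also a contract change.
    for extra in sorted(set(args) - set(spec)):
        problems.append(f"{name}: unexpected argument {extra!r}")
    return problems


def check_alias(alias: str, served_model: str) -> list[str]:
    """Return a violation if an alias resolved to a model other than the pinned one."""
    expected = EXPECTED_ALIASES.get(alias)
    if expected is None:
        return [f"unknown alias: {alias!r}"]
    if served_model != expected:
        return [f"alias {alias!r} served {served_model!r}, expected {expected!r}"]
    return []
```

Running both checks against recorded responses before and after a runtime upgrade turns a removed or renamed tool into an explicit failure rather than a silent behavior change.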

AGENT

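Ollama 0.31.2 fixes the edge conditions that decide whether it can be wired into an agent reliably.

[HIGH] This stable release enables Flash Attention on older NVIDIA GPUs with compute capability 6.x, lets iGPUs offload vision models when memory allows, and fixes structured output with thinking disabled, GGUF model creation, and loading from non-UTF-8 paths; it also updates the MLX and llama.cpp engines.

Implication for local serving: Ollama is this week's best candidate for a unified entry point under a small canary rollout, but structured output, vision-model memory limits, and the older-GPU path should pass real tasks first. URL: https://github.com/ollama/ollama/releases/tag/v0.31.2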

SIGNAL

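MLX 0.32.0 is about quantization and cross-backend correctness, not a speed promise that can be extrapolated directly.

[HIGH] The release includes a fix for FP quantized qmm_naive K-tail dispatch, keeps GGUF input validation enabled in release builds, fixes a JACCL barrier/all-reduce race, fixes Q4_1 GGUF loading, and makes several kernel-correctness and compile-cache changes for Metal and CUDA.

Implication for local serving: judge an Apple Silicon worker upgrade by whether quantized models load reliably, long-context output is correct, and the distributed path reproduces, not by short-prompt speed tests alone. URL: https://github.com/ml-explore/mlx/releases/tag/v0.32.0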
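A correctness-first upgrade check can be as simple as comparing token IDs from the pinned version and the candidate version on the same prompts. The sketch below assumes greedy (deterministic) decoding and that the token sequences have already been collected; the function names and the prompt IDs are illustrative and not part of MLX.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Index of the first differing token, or None if the sequences are identical."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None


def regression_report(
    runs: dict[str, tuple[Sequence[int], Sequence[int]]]
) -> dict[str, int]:
    """Map each prompt id whose outputs differ to the index of first divergence."""
    report: dict[str, int] = {}
    for prompt_id, (baseline, candidate) in runs.items():
        idx = first_divergence(baseline, candidate)
        if idx is not None:
            report[prompt_id] = idx
    return report
```

An early divergence on a long-context prompt is a stronger reason to keep the worker pinned than a small change in short-prompt speed.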

DESKTOP

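LM Studio 0.4.19 turns the Engine Protocol on by default and fixes the streaming semantics of reasoning content.

[HIGH] According to the official changelog, this version adds /reasoning to lms chat, makes the LM Studio Engine Protocol stable and enabled by default, and fixes reasoning being replayed as ordinary content in /v1/responses or the chat UI.

Implication for local serving: the desktop app is no longer just a GUI. If it exposes an API to local agents or household members, compatibility between the reasoning channel and the ordinary response belongs in the API contract tests. URL: https://lmstudio.ai/changelog/lmstudio-v0.4.19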
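A contract test for channel separation might look like the sketch below. The message shape is an assumption: it treats a response as a dict with a `content` field and a separate `reasoning` field, and it treats `<think>` tags as one possible leak marker. Neither is taken from the LM Studio changelog, so adjust both to whatever your endpoint actually returns.

```python
# Assumed delimiters that some models emit around reasoning text.
THINK_MARKERS = ("<think>", "</think>")


def reasoning_leaks(message: dict) -> list[str]:
    """Return reasons why reasoning appears to have leaked into ordinary content."""
    content = message.get("content") or ""
    reasoning = (message.get("reasoning") or "").strip()
    problems: list[str] = []
    for marker in THINK_MARKERS:
        if marker in content:
            problems.append(f"content contains reasoning marker {marker}")
    # A verbatim copy of the reasoning text inside content means it was replayed.
    if reasoning and reasoning in content:
        problems.append("reasoning text replayed inside content")
    return problems
```

The same check can run against every entry point that serves the same model, so a difference between them shows up as a failing case rather than as odd agent output.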

DESKTOP

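Unsloth connects post-training local distribution to NVFP4, GGUF, MLX, and tool calling, but it is still in beta.

[HIGH] v0.1.481-beta announces that Studio can export NVFP4, FP8, and imatrix GGUF after training, provides llama-swap API capability, and adds tool calling and healing for MLX/safetensors; the release also claims MoE training can be 3-5x faster.

Implication for local serving: this is a candidate "training to local serving" pipeline, not a default production path; every exported artifact must first pass the same GGUF/MLX/tool-call eval suite. URL: https://github.com/unslothai/unsloth/releases/tag/v0.1.481-beta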
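The "same eval suite for every artifact" rule can be expressed as a promotion gate. The sketch below is illustrative only: the `EvalResult` fields, the quality score, and the default `max_quality_drop` of 0.02 are hypothetical placeholders for whatever your own eval harness reports and whatever threshold you preset.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    loaded: bool             # did the artifact load in the local engine?
    tool_calls: int          # tool calls attempted in the eval suite
    tool_call_failures: int  # tool calls that failed validation
    quality: float           # score from the shared eval suite, higher is better

    @property
    def tool_failure_rate(self) -> float:
        # With no tool calls recorded, treat the artifact as unverified (worst case).
        return self.tool_call_failures / self.tool_calls if self.tool_calls else 1.0


def may_promote(candidate: EvalResult, stable: EvalResult,
                max_quality_drop: float = 0.02) -> bool:
    """True only if the candidate loads, is no worse on tools, and stays within the drop budget."""
    if not candidate.loaded:
        return False
    if candidate.tool_failure_rate > stable.tool_failure_rate:
        return False
    return stable.quality - candidate.quality <= max_quality_drop
```

An artifact that fails the gate stays on an experimental branch of the model registry instead of replacing the stable one.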

SIGNAL

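The FlashAttention 4 beta fills in both the AMD RDNA and NVIDIA SM120 paths, and hardware fragmentation keeps deepening.

[HIGH] fa4-v4.0.0.beta21 enables the backward pass on AMD ROCm RDNA and adopts the CK unified workspace, fixes compile-time parameter handling on NVIDIA SM120, and adds SM120 Pack-GQA and SplitKV fallback.

Implication for local serving: GPU workers should not treat "supports Flash Attention" as a binary switch; prefill, decode, long-context behavior, and fallback need to be recorded separately per GPU architecture. URL: https://github.com/Dao-AILab/flash-attention/releases/tag/fa4-v4.0.0.beta21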
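Recording results per architecture, rather than as a single "Flash Attention: yes/no" flag, could be sketched as follows. The record fields and architecture labels are hypothetical, and any numbers you put in are your own measurements; nothing here comes from the flash-attention release notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionRun:
    arch: str             # your own label for the GPU architecture
    phase: str            # "prefill" or "decode"
    context_len: int      # tokens of context in this run
    tokens_per_s: float   # measured throughput
    used_fallback: bool   # True if the run did not take the fast attention path


def fallback_by_arch(runs: list[AttentionRun]) -> dict[str, list[tuple[str, int]]]:
    """Group the (phase, context length) pairs that hit a fallback, keyed by architecture."""
    out: dict[str, list[tuple[str, int]]] = {}
    for run in runs:
        if run.used_fallback:
            out.setdefault(run.arch, []).append((run.phase, run.context_len))
    return out
```

A table like this makes it visible when one architecture only falls back at long context, which a single boolean would hide.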

SIGNAL

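The hardware-serving signal shifts from "what to buy" to "sustained thermal stability."

[HIGH] Patrick Kennedy of ServeTheHome published an ASUS AI server thermal lab tour this week; the article focuses explicitly on rigorous thermal testing for server performance and durability.

Implication for local serving: acceptance testing for a home rack or workstation should not stop at peak benchmark numbers; long prefill, concurrent decoding, ambient temperature, and fan curves need to enter the service SLO. URL: https://www.servethehome.com/asus-thermal-lab-tour-2026-testing-ai-servers/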
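A sustained-load acceptance check can compare the start and end of a run instead of looking at the peak. The sketch below assumes you already collect periodic samples during a long load test; the `Sample` fields, the window size, and the 5% threshold are illustrative choices, not values from the ServeTheHome article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    temp_c: float        # GPU or CPU temperature in degrees Celsius
    clock_mhz: float     # core clock
    tokens_per_s: float  # serving throughput at that moment


def relative_drift(values: list[float], window: int) -> float:
    """Relative change between the mean of the first and the last `window` values."""
    head = mean(values[:window])
    tail = mean(values[-window:])
    return (tail - head) / head


def drifted_metrics(samples: list[Sample], window: int = 3,
                    threshold: float = 0.05) -> list[str]:
    """Names of metrics whose start-to-end drift exceeds the threshold."""
    metrics = {
        "temp_c": [s.temp_c for s in samples],
        "clock_mhz": [s.clock_mhz for s in samples],
        "tokens_per_s": [s.tokens_per_s for s in samples],
    }
    return [name for name, vals in metrics.items()
            if abs(relative_drift(vals, window)) > threshold]
```

An empty result after a long run supports relaxing the check to a periodic one; any listed metric is a reason to keep it as a blocking step.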

Falsification conditions

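- [MED] If the three entry points (llama.cpp, Ollama, LM Studio) show reasoning mixed into ordinary text, inconsistent JSON schemas, dropped tool-call arguments, or differing edit-tool behavior under the same model and request, this week's judgment that "interface contracts come before model upgrades" still holds, but any full rollout should stop immediately. URLs: https://github.com/ggml-org/llama.cpp/releases/tag/b9957 https://github.com/ollama/ollama/releases/tag/v0.31.2 https://lmstudio.ai/changelog/lmstudio-v0.4.19
- [MED] If MLX 0.32.0 does not improve stability in the Q4_1, current primary GGUF, and long-context regressions, or shows output differences or memory regressions, Mac workers stay pinned to the version currently in production. URL: https://github.com/ml-explore/mlx/releases/tag/v0.32.0
- [MED] If the NVFP4/FP8/GGUF artifacts exported by the Unsloth beta fail to load reliably in the local engine, have a higher tool-call failure rate than the stable artifacts, or drop in quality evals by more than the preset threshold, the pipeline stays experimental and does not enter the default branch of the model registry. URL: https://github.com/unslothai/unsloth/releases/tag/v0.1.481-beta
- [LOW] If temperature, clock, tokens/s, and error rate show no significant drift after 30 minutes of sustained load, thermal-stability acceptance can be downgraded to a monthly check rather than a blocking condition on every runtime upgrade. URL: https://www.servethehome.com/asus-thermal-lab-tour-2026-testing-ai-servers/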

Public source excerpts

ggml-org / GitHub Releases / 2026-07-09

ggml-org / GitHub Releases / 2026-07-11

Ollama / GitHub Releases / 2026-07-07

MLX Team / GitHub Releases / 2026-07-08

LM Studio / Changelog / 2026-07-07

Unsloth / GitHub Releases / 2026-07-07

Dao-AILab / GitHub Releases / 2026-07-08

Patrick Kennedy / ServeTheHome / 2026-07-11
