Daily Briefing

May 12, 2026
2026-05-11
19 articles

The Qwen 3.6 35B A3B hype is real!!!

My personal test for small local LLM intelligence is to check whether a model has any ability to understand the code that I write for my own academic research.

  • My research is on some pretty niche topics and I doubt that anything like it is substantively present in the training sets for LLMs.
  • A few months ago, small local models' ability to understand my code was nominal at best, with Devstral Small 2 being the top performer.
  • However, several small open-weight models now have methods of accommodating fairly long contexts (gated delta net, hybrid Mamba2, sliding window attention), which makes them far smarter.
  • I can now feed a model an entire academic paper along with accompanying code and ask it to use the paper to work out what the code is doing.
  • I just spent a couple of days experimenting with Qwen 3.6 35B A3B, Qwen 3.6 27B, Gemma 4 26B A4B, and Nemotron 3 Nano.
  • All of them were able to comprehend my code significantly better than any small local model could a few months ago.
Notable Quotes & Details

Software developer, AI engineer

ExLlamaV3 Major Updates!

Turboderp has been on an absolute tear recently, in the endless battle to cram new llamas into smaller, faster boxes.

  • We started off last month with the release of Gemma 4 support, and continued with improved caching efficiency.
  • DFlash support came 2 weeks ago with these impressive results:

      Category                   Baseline     N-gram/suffix        DFlash
      Agentic, code              55.98 t/s    89.58 t/s (1.60x)    140.61 t/s (2.51x)
      Agentic, curl              54.03 t/s    74.62 t/s (1.38x)    125.94 t/s (2.33x)
      Coding                     59.21 t/s    75.34 t/s (1.27x)    177.67 t/s (3.00x)
      Creative                   59.10 t/s    67.26 t/s (1.13x)    89.19 t/s (1.50x)
      Creative (reasoning)       59.03 t/s    64.25 t/s (1.09x)    93.54 t/s (1.58x)
      Translation                58.11 t/s    55.39 t/s (0.95x)    75.73 t/s (1.30x)
      Translation (reasoning)    58.08 t/s    80.21 t/s (1.38x)    119.43 t/s (2.06x)

  • More model optimization last week, with these improvements:

      Model                      3090¹    4090¹    5090¹    6000 Pro¹    5090²    6000 Pro²
      Qwen3.5-35B-A3B 4.00bpw    5.3%     5.8%     8.6%     10.3%        21.0%    23.5%
      Qwen3.5-27B 4.00bpw        0.0%     1.9%     8.1%     11.7%        13.1%    15.0%
      Trinity-Nano 4.15bpw       29.5%    48.6%    52.3%    52.9%        70.5%    72.4%
      Gemma4-26B-A4B 4.10bpw     3.1%     2.9%     7.8%     9.6%         16.4%    19.2%
      Gemma4-31B 4.00bpw         4.0%     4.9%     10.0%    8.0%         16.0%    12.0%

  • DFlash model quantization and more bugfixes + efficiency in the last 2 days, and more work on the dev branch already!
  • Come say hi at the exllama discord.
  • submitted by /u/Unstable_Llama
Notable Quotes & Details

Software developer, AI engineer

PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server

Hey folks, just a heads-up for anyone running Qwen3.6 through llama-server.

  • I ran into an issue where the preserve_thinking parameter wasn't working as expected, even though I had it explicitly enabled in my models.ini config.
  • After some digging, I found that extra spaces in the JSON string are breaking the parser for this specific parameter in my build.
  • ❌ Does NOT work: chat-template-kwargs = { "preserve_thinking": true }
  • ✅ Works: chat-template-kwargs = {"preserve_thinking": true}
  • How to test it: the easiest way to verify that it's working is to send this prompt: "think of a number from 1 to 100, don't tell me what it is, I'm going to guess it". Then check the reasoning/thinking output to verify that the "hidden" number stays consistent across your guesses.
  • If it changes, your template kwargs are likely being parsed incorrectly.
  • My env: llama-server v9102 (7d442abf5) | RTX 4090
  • This might be a minor parsing quirk in how llama-server handles JSON in the ini file, but it's definitely worth checking.
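One way to avoid hand-typing the value is to generate it. The sketch below re-serializes a JSON string so that it has no padding inside the braces, matching the form reported as working above. The helper name `compact_kwargs` is hypothetical, and whether this exact spacing is accepted by other llama-server builds is an assumption; the post only reports behavior for one build.

```python
import json

def compact_kwargs(raw: str) -> str:
    """Re-serialize a JSON object without padding inside the braces.

    Assumption: the form '{"key": value}' (space after the colon only)
    is what the parser accepts, as reported in the post.
    """
    parsed = json.loads(raw)  # raises ValueError if the JSON is invalid
    return json.dumps(parsed, separators=(",", ": "))

spaced = '{ "preserve_thinking": true }'
print("chat-template-kwargs = " + compact_kwargs(spaced))
# prints: chat-template-kwargs = {"preserve_thinking": true}
```

Parsing first also catches genuinely malformed JSON before it reaches the ini file.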
Notable Quotes & Details

AI researcher, academic

Any news (or hope) of Qwen-3.6 14B and 9B distills for local coding?

As the title suggests.

  • I'm already testing Qwen-3.5 9B (with some success, and a few challenges) on a new work laptop that has an RTX 1000 with 6GB VRAM (I know it seems like a joke in this day and age).
  • I am using it with `pi` as the terminal coding harness.
  • With Qwen-3.5 9B I've encountered some (relatively infrequent) issues:
      • How it handles directories / folders: more than once, strangely, I got a deeply nested folder structure for the final code/test artefacts.
      • It recognized a test run as a failure when it was actually a success.
  • The same prompts used with gemini-2.5-flash and gemini-2.5-flash-lite don't show such issues, which suggests that the issue is not with `pi`.
  • I've read some reports of `pi` sometimes struggling with Qwen-3.5 tool-calling, and that is apparently fixed in Qwen-3.6.
  • So I'm wondering whether anyone has heard if 9B or 14B distillations of the Qwen-3.6-27B dense model might also be released, which would enable use on smaller GPUs.
Notable Quotes & Details

Software developer, AI engineer

Markdown browser for LLMs

An introduction to TextWeb, a tool that helps AI agents reason about web pages by rendering them as Markdown.

  • Developed TextWeb, a Markdown web renderer for AI agents.
  • Renders web pages as Markdown instead of expensive screenshots, so an LLM can reason over them natively.
  • Supports JavaScript execution and interactive element annotations, and provides a CLI and an MCP server.
  • Allows an LLM to navigate web pages, scroll, enter text, click buttons, and more.
  • Linked to the llama.cpp web UI and based on chrisrobison/textweb's text grid renderer.
Notable Quotes & Details

AI developer, LLM researcher

Running my agents in a VPS

How to run AI coding agents asynchronously and independently on a VPS, with notes from the author's setup experience.

  • Experiment with independent task completion by running coding agents such as Claude Code, Codex, and Overlay in fully asynchronous mode on a VPS.
  • Use the `--dangerously-skip-permissions` flag for full autonomy of the agent.
  • Run the agent in an environment completely isolated from personal computers.
  • Uses a Hetzner VPS, with user settings for each agent and collaboration through a Git workflow.
  • Install required dependencies such as git, tmux, and curl.
Notable Quotes & Details
  • Hetzner

AI agent developer, system administrator, DevOps engineer

Notes: Content incomplete

I jailbroke my old Kindle to install KOReader - but there's a better way to extend its life

Jailbreaking and installing KOReader are suggested as ways to extend the life of old Kindle devices, along with alternatives that remain usable after Amazon ends support.

  • Jailbreaking your Kindle is a way to modify your device by bypassing Amazon's software restrictions.
  • Amazon will stop providing technical support for older Kindle models on May 20.
  • Kindle and Fire tablets released before 2013 no longer receive software support.
  • You can extend the usability of your older Kindle with alternative software, such as KOReader.
  • ZDNet's recommendations are based on extensive testing, research and comparison shopping.
Notable Quotes & Details
  • 2026-05-20
  • 2013

Kindle users, e-reader users, technology enthusiasts

Notes: Content incomplete

I tested whether Gemini, ChatGPT, and Claude can analyze videos - this one wins

Results of comparative tests on the ability of major AI models such as Gemini, ChatGPT, and Claude to analyze and understand video content.

  • Gemini can watch YouTube, MP4, and MOV files and has excellent video analysis capabilities.
  • Claude cannot process video directly yet.
  • ChatGPT needs help from Codex for more in-depth video work.
  • AI models understand text and images well, but their understanding of video is different.
  • Video tests used YouTube videos, DJI Neo 2 drone motion tests, and original MOV files before uploading to YouTube.
  • A comparative test of ChatGPT Plus and Gemini Pro is also mentioned.
Notable Quotes & Details
  • $20-per-month ChatGPT Plus plan
  • $20-per-month Gemini Pro

AI researcher, general user, technology journalist

Notes: Content incomplete

Samsung Galaxy Z Flip 7 vs. Motorola Razr Ultra: I've used both, and this phone is my pick

I compare the Samsung Galaxy Z Flip 7 and the Motorola Razr Ultra smartphones and offer my recommendation.

  • ZDNET's recommendations are based on extensive testing, research, comparison shopping and consumer review data.
  • Motorola has unveiled the 2026 Razr series, and the Razr Ultra is the new flagship model.
  • The article rates the $450 Samsung phone above competing models from Google and OnePlus.
  • ZDNET's editorial content is not influenced by advertisers and aims to provide readers with accurate information.
Notable Quotes & Details
  • $450 Samsung phone

General consumers, those planning to purchase smartphones

I stopped using a smart plug with these 5 common household devices - here's why

Here are five common appliances to avoid when using smart plugs and why.

  • Smart plugs are useful for some appliances, such as lamps, chargers, and fans, but they are not suitable for all devices.
  • Smart plugs should be avoided on devices that generate heat, have compressors, or exceed 1,500 watts.
  • Smart plugs are rated for a maximum current of 15A and should not be used with larger appliances that exceed this.
  • Improper use may lead to tripped circuit breakers, plug damage, and fire hazards.
  • If the smart plug emits a burnt smell or shows signs of deformation, you should stop using it.
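The 15A and 1,500 W limits above can be checked with the relation I = P / V. The sketch below is a rough illustration only: the 120 V mains voltage, the function name `plug_load_ok`, and the example device wattages are assumptions, not figures from the article, and the check says nothing about the article's separate advice on heat-generating or compressor-driven devices.

```python
def plug_load_ok(watts: float, volts: float = 120.0,
                 max_amps: float = 15.0, max_watts: float = 1500.0) -> bool:
    """Return True if a load stays within both the current and wattage limits.

    Assumption: 120 V mains. The 15 A and 1,500 W limits come from the article.
    """
    amps = watts / volts  # I = P / V
    return amps <= max_amps and watts <= max_watts

# Hypothetical device ratings, for illustration only.
for name, watts in [("lamp", 60), ("fan", 75), ("space heater", 1800)]:
    verdict = "ok" if plug_load_ok(watts) else "avoid"
    print(f"{name}: {watts / 120.0:.2f} A -> {verdict}")
```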
Notable Quotes & Details
  • 15A
  • 1,500W

General consumers, smart home users

Why Mastering EVM Is Essential for Next-Generation Wireless Systems

Covers the importance and understanding of Error Vector Magnitude (EVM), a key metric for measuring modulation accuracy in next-generation wireless systems.

  • EVM is a key metric for quantifying modulation accuracy in Wi-Fi, LTE, and 5G NR systems.
  • EVM is calculated from the distance between the ideal constellation point and the measured point, and is expressed as a percentage or in decibels.
  • Higher modulation orders increase throughput, but require higher accuracy when transmitting and receiving signals.
  • Causes of EVM degradation can be divided into four main categories: amplitude effects, phase effects, I/Q imperfections, and configuration issues.
  • By visually inspecting the constellation diagram, you can diagnose root causes of EVM degradation such as phase noise, amplifier compression, and noise.
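As a rough illustration of the calculation described above, the sketch below computes RMS EVM for a handful of symbols. It assumes one common convention, normalizing the error power by the average power of the ideal points; standards differ in the reference they use, and the function name `evm` and the sample QPSK data are made up for the example.

```python
import math

def evm(measured, ideal):
    """Return (EVM in percent, EVM in dB) for paired complex symbols.

    Assumption: error power is normalized by the mean power of the
    ideal constellation points (RMS normalization).
    """
    error_power = sum(abs(m - i) ** 2 for m, i in zip(measured, ideal))
    reference_power = sum(abs(i) ** 2 for i in ideal)
    ratio = math.sqrt(error_power / reference_power)
    evm_db = 20.0 * math.log10(ratio) if ratio > 0 else float("-inf")
    return 100.0 * ratio, evm_db

# Ideal QPSK points and measurements shifted by a constant 0.1 offset.
ideal = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
measured = [p + 0.1 for p in ideal]
percent, db = evm(measured, ideal)
print(f"EVM = {percent:.2f} % ({db:.2f} dB)")
# prints: EVM = 7.07 % (-23.01 dB)
```

The percent and dB figures carry the same information: dB is 20·log10 of the ratio, so a lower (more negative) dB value means better modulation accuracy.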
Notable Quotes & Details

Wireless communications engineer, researcher, student

Article: Local-First AI Inference: A Cloud Architecture Pattern for Cost-Effective Document Processing

We introduce a local-first AI inference architecture pattern for cost-effective document processing in cloud AI systems.

  • Local-first AI inference pattern reduces Azure OpenAI costs by 75% by processing 70-80% of documents locally without API calls.
  • This pattern reduces cloud AI calls through confidence-based routing.
  • A composite scoring function that includes spatial, anchor, format, and context criteria outperforms single-criteria approaches.
  • Model upgrades should be evaluated against task-specific validation sets rather than vendor benchmarks.
  • The three-stage architecture (local deterministic, cloud AI, human review) limits the error rate and compensates for the shortcomings of cloud-only or local-only approaches.
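The confidence-based routing described above might look roughly like the following. This is a sketch under stated assumptions, not the article's implementation: the weights, both thresholds, and all names (`FieldScores`, `composite_score`, `route_local`, `route_cloud`) are hypothetical; only the four scoring criteria and the three stages come from the summary.

```python
from dataclasses import dataclass

@dataclass
class FieldScores:
    """Per-document scores in [0, 1] for the four criteria named in the article."""
    spatial: float
    anchor: float
    format: float
    context: float

# Hypothetical weights and thresholds; a real system would tune these
# against a task-specific validation set.
WEIGHTS = {"spatial": 0.3, "anchor": 0.3, "format": 0.2, "context": 0.2}
LOCAL_THRESHOLD = 0.85
CLOUD_THRESHOLD = 0.60

def composite_score(s: FieldScores) -> float:
    """Weighted sum of the four criteria."""
    return (WEIGHTS["spatial"] * s.spatial + WEIGHTS["anchor"] * s.anchor
            + WEIGHTS["format"] * s.format + WEIGHTS["context"] * s.context)

def route_local(s: FieldScores) -> str:
    """Stage 1: keep confident local extractions, send the rest to cloud AI."""
    return "local" if composite_score(s) >= LOCAL_THRESHOLD else "cloud"

def route_cloud(cloud_confidence: float) -> str:
    """Stages 2-3: accept confident cloud results, escalate the rest to a human."""
    return "accept" if cloud_confidence >= CLOUD_THRESHOLD else "human_review"

print(route_local(FieldScores(0.9, 0.9, 0.9, 0.9)))  # local
print(route_local(FieldScores(0.5, 0.5, 0.5, 0.5)))  # cloud
print(route_cloud(0.4))                              # human_review
```

The cost saving in such a design comes from the first branch: every document that clears the local threshold never triggers an API call.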
Notable Quotes & Details
  • 70-80%
  • 75%
  • 55%
  • GPT-5+
  • GPT-4.1
  • 89%
  • 98%
  • 400-file validation set

AI/ML engineer, cloud architect, document processing system developer

Netflix Introduces ‘Model Lifecycle Graph’ to Scale Enterprise Machine Learning

Netflix introduced a graph-based architecture called 'Model Lifecycle Graph' that maps relationships between datasets, models, and functions to manage enterprise-scale machine learning systems.

  • Netflix presented the 'Model Lifecycle Graph' to address the growing management complexity that comes with expanding ML systems.
  • This graph-based system considers ML assets and their relationships as core infrastructure, modeling dependencies between datasets, features, models, evaluations, and workflows.
  • This allows you to trace the lineage of your ML assets, determine the impact of changes on downstream systems, find reusable ML assets, and review how your models are constructed.
  • Netflix engineers explain that the graph structure is suitable for ML system modeling because ML assets do not exist independently but are connected to each other.
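To make the lineage and impact-analysis idea concrete, here is a toy sketch of a dependency graph with downstream and upstream traversal. It is not Netflix's system: the asset names, the adjacency-list representation, and the functions `downstream` and `upstream` are all invented for illustration.

```python
from collections import deque

# Hypothetical assets; an edge A -> B means "B depends on A".
EDGES = {
    "dataset:views": ["feature:watch_time"],
    "feature:watch_time": ["model:ranker_v1"],
    "model:ranker_v1": ["evaluation:offline_ab", "workflow:nightly_retrain"],
}

def downstream(graph, start):
    """Breadth-first list of every asset affected by a change to `start`."""
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

def upstream(graph, target):
    """Lineage of `target`: walk the same edges in reverse."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return downstream(reverse, target)

print(downstream(EDGES, "dataset:views"))
print(upstream(EDGES, "model:ranker_v1"))
```

The first call answers "what breaks if this dataset changes?", the second "what was this model built from?", which are the two questions the summary attributes to the graph.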
Notable Quotes & Details

Machine learning engineer, data scientist, ML platform developer

Your Purple Team Isn't Purple — It's Just Red and Blue in the Same Room

In cybersecurity, attackers keep getting faster and defense teams cannot keep pace; the article argues that 'purple team' operations, effective in theory, are rarely carried out properly in practice.

  • The average time from the disclosure of a CVE to actual exploitation has drastically decreased from 56 days in 2024 to 10 hours as of 2026.
  • The defense team's response time has also become faster, but it cannot keep up with the attacker's speed.
  • 'Purple Team' is a concept that strengthens security posture through cooperation between the Red Team (attack simulation) and Blue Team (defense), but in reality, it is not operated effectively due to lack of communication, long meetings, and document work.
  • Human bottlenecks (unread messages, manual tasks, waiting for approval, etc.) are the main cause of slow response times for defense teams.
Notable Quotes & Details
  • Average CVE-exploit time in 2024: 56 days
  • Average CVE-exploit time in 2025: 23 days
  • Average CVE-exploit time in 2026: 10 hours (based on 3,532 CVE-exploit pairs)

Cybersecurity expert, CISO, security team manager

Mythos records ‘16 hours’ in METR autonomy evaluation… “Breaking through the limits of measurement”

According to an evaluation by METR, a non-profit AI research institute, an early version of 'Claude Mythos Preview' completed, with a 50% success rate, tasks that would take a human expert about 16 hours, showing that AI autonomy is advancing fast enough to exceed the limits of measurement.

  • METR evaluates AI models based on ‘task-completion time horizon,’ a new indicator that measures AI’s autonomous performance ability.
  • 'Claude Mythos Preview' completed, with a 50% success rate, tasks that would take a human more than 16 hours, surpassing the 11 hours and 59 minutes recorded by 'Claude Opus 4.6' last February.
  • METR found that it is difficult to reliably measure time horizons of more than 16 hours with its current evaluation task set, suggesting that the autonomous capabilities of modern AI models are approaching or exceeding the upper limits of existing evaluation systems.
  • Since the introduction of the AI agent concept, the task time horizon of AI models has been doubling every three months, with improvements in reliability, error recovery, and tool use considered key factors.
  • METR warns that within five years, AI has the potential to automate many of the software tasks that would otherwise take a month for humans, but that increased AI autonomy could lead to increased risks in case of malfunction or malicious use.
Notable Quotes & Details
  • Claude Mythos Preview: approximately 16 hours of work for a human expert (50% success rate)
  • Claude Opus 4.6: task-completion time horizon of 11 hours 59 minutes
  • Of the 228 tasks prepared by METR, only 5 took more than 16 hours for a human to complete.
  • Task time horizon of AI models since 2024: doubling every 3 months

AI researchers, AI developers, policy makers, technology trend analysts

"I will catch you steadily"...'ChatGPT's speaking habits that irritate Chinese people'

When chatting in Chinese, ChatGPT repeats overly emotional and unnatural expressions such as "I will catch you steadily", which wears on Chinese users. The article analyzes this as a 'mode collapse' phenomenon, caused by the model's failure to pick up cultural nuance and by Western-centric training data.

  • ChatGPT repeatedly uses excessively emotional and unnatural expressions such as “我会稳稳地接住你” (“I will catch you steadily”) in Chinese conversations.
  • Another repeated expression, “砍一刀” (please lower the price), appears to have been learned from advertising slogans on Chinese e-commerce platforms.
  • This phenomenon is explained as 'mode collapse', in which large language models (LLMs) over-learn and then repeat specific expressions or writing styles; failure to acquire cultural nuance and Western-centric training data are cited as the causes.
  • “接住你,” a direct translation of the English expression “I’ve got you,” belongs to Chinese psychological counseling culture, and it sounds awkward when used indiscriminately in general conversation.
  • Recently, similar expressions have appeared in other AI models such as Claude and DeepSeek, which is attributed to the use of similar datasets or to knowledge distillation between models.
Notable Quotes & Details
  • Wired reported on the 7th (local time)

AI researchers, language model developers, general users, cultural critics

The reason why ‘Hermes Agent’ beat OpenClaw and ranked first globally

‘Hermes Agent’, developed by Nous Research, ranked first on the OpenRouter global daily app and agent rankings on the 10th (local time), beating the existing powerhouse ‘OpenClaw’.

  • Currently, Hermes Agent processes approximately 224 billion tokens per day, surpassing OpenClaw at 186 billion tokens, and has become the most actively used open source AI agent.
  • This change in rankings means more than just traffic competition.
  • After OpenClaw founder Peter Steinberger joined OpenAI in February of this year, OpenClaw was moved to an independent open source foundation.
  • OpenAI is participating as a sponsor, but market observers say the open source agent ecosystem has entered a new phase.
  • The competition between the two projects is not just a comparison of features, but also leads to differences in AI agent design philosophy.
Notable Quotes & Details
  • 2026-25253
  • 2026-7113

Software developer, AI engineer

OpenAI discloses the risk of ‘reasoning manipulation’ during GPT-5 training… “AI may deceive humans”

It was later confirmed that OpenAI unintentionally used the chain of thought (CoT), the model's thinking process, as an evaluation standard while training some GPT-5 series models using reinforcement learning (RL).

  • The finding matters because a model can learn to shape its reasoning process to match what the reward system favors.
  • OpenAI announced on the 7th (local time) that while inspecting the newly introduced automatic detection system, it discovered that CoT evaluation was accidentally included in the learning process of some public models.
  • CoT refers to the reasoning process that AI develops internally to solve problems.
  • OpenAI has emphasized that monitoring this inference process is very important to detect model malfunctions, risky behavior, or alignment problems.
  • However, concerns have also been raised that if the CoT itself becomes a target of the RL reward system, the model may learn reasoning ‘for show’ that differs from its actual thinking.
Notable Quotes & Details
  • 0.6%
  • 3.8%

Software developer, AI engineer

AI has become a ‘double-edged sword’... Palantir criticizes AI as ‘slop’

Palantir, which is considered a representative AI beneficiary, is faced with the ‘paradox of AI.’

  • In response, the company's management went on the defensive, criticizing AI models as sloppy.
  • According to the Wall Street Journal (WSJ), Palantir executives emphasized during an earnings conference call on the 4th (local time) that the models of major AI research institutes are "slop" that is too crude and unreliable to be integrated into large enterprise systems.
  • The word, which means garbage, was repeated 17 times.
  • CEO Alex Karp said that companies considering adopting AI "should look at all kinds of crappy AI companies," but "most of them end up coming back to Palantir."
  • Palantir sells software to centralize, manage and analyze large volumes of data to help government agencies and private companies make decisions such as supply chain planning or choosing where to drop weapons.
Notable Quotes & Details
  • 1600%
  • 20%
  • 137%
  • 45%

Business leaders, investors, and AI industry insiders

Jooojub
System S/W engineer
    © 2026. jooojub. All right reserved.