Hacker News | past | comments | ask | show | jobs | submit | nvtop's comments

I tried to love mc, but its ergonomics felt slightly off. Maybe it's just hard to rewire my Norton / Volkov Commander / FAR Manager muscle memory, I don't know

I ended up on a Linux fork of Far Manager, which works beautifully: https://github.com/elfmz/far2l


While I use command line tools on Linux daily, using a console-based tool that pretends to be a GUI tool is a bridge too far, so I prefer GUI-based dual pane file managers like Double Commander or Krusader.

Of course mc and far can be used over an SSH connection, so they have their advantages too...




This video has a live coding part which implements a masked diffusion generation process: https://www.youtube.com/watch?v=oot4O9wMohw


Despite the name, diffusion LMs have little to do with image diffusion and are much closer to BERT and good old masked language modeling. Recall how BERT is trained:

1. Take a full sentence ("the cat sat on the mat").

2. Replace 15% of the tokens with a [MASK] token ("the cat [MASK] on [MASK] mat").

3. Make the Transformer predict the tokens at the masked positions. It does this in parallel, in a single inference step.

Now, diffusion LMs take this idea further. BERT can recover 15% of masked tokens ("noise"), but why stop there? Let's train a model to recover texts with 30%, 50%, 90%, even 100% of the tokens masked.

Once you've trained that, you generate from scratch by feeding the model all [MASK]s. It will produce mostly gibberish, but you can take some tokens (say, 10%) at random positions and declare them generated ("final"). Then you run another inference step, this time with the input being 90% masks and 10% "final" tokens. Again, you mark 10% of the new tokens as final. Continue, and in 10 steps you'll have generated the whole sequence. This is the core idea behind diffusion language models.
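That loop can be sketched in a few lines of toy Python. Everything here is illustrative: `dummy_model` is a stand-in for a trained Transformer (which would return a probability distribution per position, not random picks), and the 10%-per-step schedule is just one choice.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog"]

def dummy_model(tokens):
    # Stand-in for the real model: propose a token for every masked position.
    return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def diffusion_generate(length=10, steps=10, seed=0):
    random.seed(seed)
    tokens = [MASK] * length
    final = [False] * length
    per_step = max(1, length // steps)   # how many tokens to freeze per step
    while not all(final):
        proposal = dummy_model(tokens)
        # Freeze a few randomly chosen still-masked positions as "final".
        masked = [i for i, done in enumerate(final) if not done]
        for i in random.sample(masked, min(per_step, len(masked))):
            tokens[i] = proposal[i]
            final[i] = True
    return tokens
```

With a real model, each `dummy_model` call would be one parallel inference step over the whole sequence, so `steps` directly controls the speed/quality tradeoff.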

Of course, there are some optimizations in the real world. If you need to generate a really long text (over ~200 tokens), you're better off splitting it into chunks and fully generating the first chunk in parallel before moving on to the next one. This semi-autoregressive generation is what Block Diffusion does.

You can also be smart about exactly which tokens you consider generated, and what percentage. At earlier stages, when it's mostly noise, you can take more; at the final stages you can run more iterations and take fewer tokens.

All in all, diffusion LMs are still iterative, but the number of steps is much lower than in autoregressive models. A nice property is that you can choose how many steps to take, trading quality for speed.

In the extreme, you can even generate just the leftmost masked token at each step, effectively turning the diffusion LM into a traditional causal language model.


Great explanation. I think I've seen that text diffusion models can "edit" while running inference. In other words, a "final" token isn't necessarily final and could change, until at some later iteration the model decides it truly is. How does that work?


Correct, diffusion LMs can edit their intermediate predictions, so "final" tokens aren't necessarily final. This is an exciting property because it lets the model correct errors in what's been generated so far -- something GPT-like models can't do.

This editing relies on a property of Transformer encoders: they predict token probabilities for __every__ position in the sequence, not just the [MASK]s. So when you input the three-token sentence `[MASK] cat barks`, the Transformer produces a probability distribution over the vocabulary for each of the three positions, for free.

Now you can come up with many ways to decide whether to edit a token or keep it as is. In the simplest case, take a new token if its probability is higher than the original's by some margin. In our example, say the model returns the probability of the token "cat" at the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use it in subsequent iterations.

Actual heuristics are slightly more complicated, but the base idea is this.
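A minimal sketch of that margin heuristic, with hand-written probability tables standing in for the model's per-position output (the margin value is arbitrary):

```python
def maybe_edit(tokens, probs, margin=0.2):
    """Re-decide each position: swap in the model's top choice if it
    beats the current token's probability by at least `margin`.
    `probs` is one dict per position mapping token -> probability."""
    edited = []
    for tok, dist in zip(tokens, probs):
        best_tok, best_p = max(dist.items(), key=lambda kv: kv[1])
        if best_tok != tok and best_p > dist.get(tok, 0.0) + margin:
            edited.append(best_tok)
        else:
            edited.append(tok)
    return edited

# The example from above: p_2("cat") = 0.3, p_2("dog") = 0.6
tokens = ["[MASK]", "cat", "barks"]
probs = [
    {"the": 0.9},                # fills the mask
    {"cat": 0.3, "dog": 0.6},    # edits "cat" -> "dog"
    {"barks": 0.8},              # top choice equals current token: kept
]
print(maybe_edit(tokens, probs))  # -> ['the', 'dog', 'barks']
```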

P.S. In order to teach the LM not to just copy unmasked input tokens but to try to find a better replacement, the training objective should include replacing some % of input tokens with other, random tokens. Now part of the input is masked and part is corrupted, so the model can't blindly assume that all input tokens are here to stay.
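That corruption step can be sketched like this (the fractions and the `[MASK]` string are illustrative, not taken from any particular paper):

```python
import random

MASK = "[MASK]"

def corrupt_for_training(tokens, vocab, mask_frac=0.5, corrupt_frac=0.1, seed=0):
    """Build one training example: mask `mask_frac` of the positions and
    replace another `corrupt_frac` with random wrong tokens, so the model
    learns that unmasked input may also need fixing. The target is the
    clean sequence at every position."""
    rng = random.Random(seed)
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    n_mask = int(len(tokens) * mask_frac)
    n_corrupt = int(len(tokens) * corrupt_frac)
    inp = list(tokens)
    for i in idx[:n_mask]:
        inp[i] = MASK
    for i in idx[n_mask:n_mask + n_corrupt]:
        # Guaranteed-wrong replacement drawn from the rest of the vocab.
        inp[i] = rng.choice([t for t in vocab if t != tokens[i]])
    return inp, tokens  # model input, prediction target
```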


> say the model returns the probability of the token "cat" at the second position as p_2("cat") = 0.3, while p_2("dog") = 0.6. We may want to replace "cat" with "dog" and use it in subsequent iterations.

Might one speed/quality tradeoff be a tree "search" for better outcomes by branching on logit choices? If a diffusion model is so much faster overall than AR, I might not mind hunting or backtracking for the best probabilities overall.


But what about the dependency graph between symbols in the program? All those symbols have strong constraints around them, which is the program's design.

The same issue comes up in image diffusion. When you ask for a portrait, some details come out wrong, because the face has constraints (which you learn about as an artist). Patterns and probability won't help you.


You assume that for small steps (i.e., taking some noisy code and slightly denoising it) you can make an independence assumption (all tokens conditionally independent, given the current state).

Once you chain many steps you get a very flexible distribution that can model all the interdependencies.

A stats person could probably provide more nuance, but one interesting connection I've seen: there is some sense in which diffusion generalizes autoregression, because you don't have to pick an ordering when you factorize the dependency graph.

(Put otherwise, for some definitions of diffusion you can show autoregression to be a special case.)


There's a reason we have formal verification as the highest guarantee for software. To have complete assurance of what a program can and cannot do, the semantics of each of its components need to be known. Recursively.

A minor change in one token can change the meaning of the whole program. Programming is just trying to enforce semantics on instructions (how well that is done is software engineering's realm).

An algorithm like merge sort is just semantic constraints, which is why most books use their own notation: the code itself does not really matter.

At most, LLMs and diffusion can be regarded as fancy searches. But what you actually want is semantics, which is why you can design lots of stuff on paper. We still do it in the code editor because feedback is nice, and libraries' documentation (if it exists) lies about their semantics. And we read code because there's nothing more complete about semantics than the code itself.


Fascinating, and great explanation.

What about insert and delete operations, however? Isn't there a risk of having too few tokens to properly finish the code between the "final" tokens?


Can you have a hybrid model that can do autoregression and diffusion? It doesn't seem like there is something that would fundamentally prevent this. A model with diffusion CoT for rapid "thought" generation, and then autoregression for the answer on the output.


You can absolutely do it, and I think it's a nice idea to try.


I'm curious how the speed is achieved, if this is the technique used. I'd generally expect this "masked language model" technique to be far slower, since the full vocabulary projection needs to be computed at every iteration.

I always thought the eventual technique would be some form of diffusion in continuous space, then decoding into the discrete tokens.

Also I'm guessing this is a "best guess" of how Gemini Diffusion is done?


Thanks. Best explanation of text diffusion.


Whoa man, thanks.

This is a great explanation.


Thank you for the explanation!


I'm also very skeptical of the significance of this "aha moment". Even if they didn't include chain-of-thought data in the base model's training set (unlikely), there is still plenty of it on the modern Internet: the 800k reasoning steps OpenAI released publicly, GitHub repositories, examples in CoT papers... It's definitely not a novel concept that the model somehow discovered on its own.


The whole point of NPU-enabled devices is to run models locally, so that your data never leaves your device. This is a huge privacy win.


They're trying to have it both ways, and it's not clear to me as a consumer what is local and what is cloud. (As a developer, I can tell they're doing a few things locally, like OCR and webcam background blur on the NPU, but they are not running ChatGPT on a laptop anytime soon.)


Although the line can get fuzzy when they want to ship a feature that's too big to run locally. Android has run into that: some of the AI features run locally, some run on Google's servers, and some might run locally or on Google's servers depending on which device you happen to have.


The whole point is making consumers pay the cost of running LLMs (both in hardware and power), not protecting your privacy; they will still get your data to train better models.


The whole point of enshittification is that companies don't need your data but they take it anyway.


I use tmux when SSH'ing to remote boxes, but when working locally I find native terminal panes and tabs to be a better experience. Does tmux provide anything extra to what wezterm/kitty/iterm2 do?


- a tmux session persists on the remote machine, whereas with direct SSH a disconnect loses what you were doing

- a tmux session can be used by multiple people

- with tmux a single command “tmux attach -t someusefulname” restores the layout and all of the commands used, saving a bunch of time and opportunity for error

- tmux has an API, so you can spawn a fully loaded session from scratch with code, rather than by manually doing things or by having a pre-made session (this is soooo underrated, especially if you make it configurable)

Honestly there comes a point where you just redesign the software and have it run in a more automated fashion anyway, but for the odd job that you have to run from time to time it’s very handy to have tmux as a persistent, shareable, configurable scratch space.
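The "spawn a fully loaded session from code" point boils down to issuing a handful of tmux commands. A hedged sketch in Python (the session name and the commands run in each pane are placeholders; `new-session`, `split-window`, and `send-keys` are real tmux subcommands):

```python
import subprocess

def workspace_commands(session="scratch"):
    """tmux invocations for a two-pane workspace. The pane commands
    (htop, tail) are just examples -- substitute your own."""
    return [
        ["tmux", "new-session", "-d", "-s", session],      # detached session
        ["tmux", "split-window", "-h", "-t", session],     # side-by-side panes
        ["tmux", "send-keys", "-t", f"{session}.0", "htop", "Enter"],
        ["tmux", "send-keys", "-t", f"{session}.1",
         "tail -f /var/log/syslog", "Enter"],
    ]

def spawn_workspace(session="scratch"):
    for cmd in workspace_commands(session):
        subprocess.run(cmd, check=True)
```

Afterwards, `tmux attach -t scratch` drops you into the fully set-up layout.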


You can use the tmux API or a plugin like tmuxp[1] to load a pre-configured session (including running arbitrary shell commands and setting layout).

Using this you could automatically spawn your ssh connections as nested tmux sessions. Obviously this brings some complications, namely that you need to set different prefix keys (or double press prefix) and any non-prefix hotkeys always get sent to your local session. Personally I just configure remote tmux to have only prefixed keybinds and never make any complicated layouts on remote sessions so that having to press <inner-prefix> + <key> doesn't get annoying.

[1]: https://github.com/tmux-python/tmuxp


You can use the same hotkeys to switch between windows and panes that you use on the SSH box :=)


That’s true, but I do much of my work on remote boxes so it’s a less relevant comparison for me.


A lot. POS taggers used to be linear classifiers + features. In 2018 they switched to BERT and similar encoder-only models. In 2023, POS tagging is largely irrelevant: it was used as part of a larger pipeline, but now you can do everything end-to-end with better accuracy by fine-tuning a sufficiently large pretrained model (an LLM, or an encoder-decoder like T5).

