> My assumption based on the latency claims in paper.
The latency claims are based on the merged version, where the modifications are merged into the model weights. Hence there is no latency cost, since the final model has the same shape as the original.
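
To make the "same shape" point concrete, here is a minimal NumPy sketch of merging a low-rank update into a frozen weight. The sizes and the names `W0`, `A`, `B`, and `W_merged` are illustrative assumptions, not taken from the paper's code, and any scaling factor on the update is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2  # hypothetical layer sizes and rank

W0 = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
B = rng.standard_normal((d_out, r))      # low-rank factors of the update
A = rng.standard_normal((r, d_in))
x = rng.standard_normal(d_in)

# Unmerged: the base path plus a separate low-rank path.
y_unmerged = W0 @ x + B @ (A @ x)

# Merged: fold the update into the weight once, ahead of inference.
W_merged = W0 + B @ A
y_merged = W_merged @ x
```

`W_merged` has the same shape as `W0`, so inference is a single matmul, exactly as with the original model.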
> having that W0 forward process once with n distinct BAx paths (for distinct fine tunings!) would address that, no?
The tl;dr is that this works, but it is more expensive. Not ridiculously more expensive, but certainly more expensive than processing a few additional tokens with prefix/prompt tuning.
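
A rough sketch of the unmerged, multi-adapter setup described in the quote: one shared `W0` matmul for the whole batch, plus a separate low-rank path per request. This is NumPy with made-up sizes; the names and the one-adapter-per-row batching scheme are my assumptions, not something the paper specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_out, d_in, r = 4, 8, 6, 2  # hypothetical: n requests, each with its own adapter

W0 = rng.standard_normal((d_out, d_in))  # shared frozen weight
A = rng.standard_normal((n, r, d_in))    # one (A, B) pair per fine-tuning
B = rng.standard_normal((n, d_out, r))
X = rng.standard_normal((n, d_in))       # row i is the input routed to adapter i

# Shared path: a single matmul covers the whole batch.
base = X @ W0.T                          # shape (n, d_out)

# Per-request path: B_i @ (A_i @ x_i), done as two batched contractions.
Ax = np.einsum("nri,ni->nr", A, X)       # shape (n, r)
delta = np.einsum("nor,nr->no", B, Ax)   # shape (n, d_out)

Y = base + delta

# Reference: merge each adapter separately and apply it to its own input.
Y_ref = np.stack([(W0 + B[i] @ A[i]) @ X[i] for i in range(n)])
```

`Y` matches the per-adapter merged result, but each request now pays for roughly `r * (d_in + d_out)` extra multiply-adds per adapted layer, on top of the shared `W0` matmul. That is the added cost compared with the merged model.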