That's great news, but one would think that since they're behind Stable Diffusio...

GaggiX · on April 19, 2023

>try 10 trillion or 100 trillion tokens

Computation is not free and data is not infinite.

youssefabdelm · on April 19, 2023

You could've said the same to OpenAI when they were scaling GPT from 1 billion to 175 billion parameters. We're all grateful they didn't follow that line of thought.

But Stability does have access to a pretty big cluster, so it's not paying for cloud compute (I assume) and the cost will be lower. And data of course is not infinite... I never stated that.

But considering 3.7 million videos are uploaded to YouTube every day, 2 million scientific articles are published every year, yada yada... that argument falls apart.

At the very least implement spiral development... 1 trillion... 3 trillion... (oh it seems to be getting WAY better! There seems to be a STEP CHANGE!)... 5 trillion... (holy shit this really works, let's keep going)

dragonwriter · on April 19, 2023

The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, in terms of quality, prior vetting, and being in a standard format for ingestion, in your first trillion tokens of training data.
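
For anyone who wants to sanity-check that ballpark, here is a rough sketch of the arithmetic. The word count (about 783,000 words for the KJV) and the ratio of about 1.3 tokens per English word are assumptions, not measured values; the real ratio depends on the tokenizer.

```python
# Back-of-the-envelope check of the "million KJV bibles" figure.
# Assumptions (not from the thread):
#   - the KJV is roughly 783,000 words long
#   - a subword tokenizer yields roughly 1.3 tokens per English word
KJV_WORDS = 783_000
TOKENS_PER_WORD = 1.3


def bibles_for_tokens(n_tokens, words=KJV_WORDS, tokens_per_word=TOKENS_PER_WORD):
    """Return how many KJV-sized texts are needed to supply n_tokens tokens."""
    tokens_per_bible = words * tokens_per_word
    return n_tokens / tokens_per_bible


extra = bibles_for_tokens(1e12)
print(f"{extra:,.0f} KJV-sized texts per extra trillion tokens")
```

Under those assumptions the result lands close to one million, so the ballpark holds up to within the error of the inputs.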

taneq · on April 19, 2023

There’s a difference between telling someone they’re wasting their time with their current project, and asking them why they didn’t spend 6x - 60x as much budget on an already expensive project.

youssefabdelm · on April 20, 2023

They're loaded, and we know scaling works, they'd massively benefit... both in marketing and profit.

Although it is open source to be fair.

dragonwriter · on April 19, 2023

> Like... try 10 trillion or 100 trillion tokens (although that may be absurd, I never did the calculation)

But where’s the corpus supposed to come from?

Taek · on April 20, 2023

Nobody knows where to find 10 trillion tokens of good data. Publicly available / data without a license seems to cap at around 1.5 trillion tokens total. The internet isn't as big as you thought! (Or at least, all the good stuff is behind a walled garden, which I think we did know)