Similar architectures have been available for plenty of time! 256 bits at once with multiple execution units is a lot of compute power and has been the standard for a decade. Not to mention SSE.
SSE and AVX instructions are optimised primarily for 3D graphics, such as multiplying a vector of 4 floating-point numbers by a 4x4 matrix. There are a handful of additional instructions optimised for doing things to pixels... and that's about it.
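To make the graphics case concrete, here is a rough sketch of that vector-by-matrix multiply in plain C. It uses no intrinsics; the function name `mat4_mul_vec4` and the column-major layout are my own assumptions for illustration. The inner loop over 4 lanes is the part that lines up with one 128-bit SSE register of 4 floats.

```c
#include <assert.h>

/* out = M * v, with M stored column-major as 16 floats
   (hypothetical layout: column c occupies m[c*4 .. c*4+3]). */
void mat4_mul_vec4(const float m[16], const float v[4], float out[4]) {
    /* out is zeroed before v is read, so the two must not alias. */
    assert(out != v);

    for (int lane = 0; lane < 4; lane++)
        out[lane] = 0.0f;

    /* For each column: scale it by one element of v and accumulate.
       The 4-lane inner loop is what a 4-wide multiply/add covers. */
    for (int col = 0; col < 4; col++)
        for (int lane = 0; lane < 4; lane++)
            out[lane] += m[col * 4 + lane] * v[col];
}
```

Every lane does the same thing with no conditions, which is why this shape of code fits 4-wide registers so comfortably.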
AVX-512 is designed to work more like what a GPU does internally, and provides a much richer set of instructions. It enables fine-grained masking and shuffles, without which many simple types of code are either impossible to vectorise, or much more complex... and slower. This is why auto-vectorisation with SSE and AVX is only applied to some simple loops, and provides marginal benefits outside of those scenarios.
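A rough sketch of what per-lane masking means, emulated in portable C rather than with real intrinsics. The helper names (`compare_lt_zero`, `masked_set`, `clamp_negatives_masked`) are made up for illustration; the only thing taken from the hardware is the idea that a compare produces one mask bit per lane and a later operation touches only the lanes whose bit is set. This particular loop is simple enough that older instruction sets can handle it too, so treat it as a picture of the model, not a benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 16 /* a 512-bit register holds 16 lanes of 32 bits */

/* Scalar reference: a loop with a data-dependent branch. */
void clamp_negatives_scalar(int32_t *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (a[i] < 0)
            a[i] = 0;
}

/* Emulated compare-into-mask: bit i is set when lane i is negative. */
uint16_t compare_lt_zero(const int32_t *lanes) {
    uint16_t mask = 0;
    for (int i = 0; i < LANES; i++)
        if (lanes[i] < 0)
            mask |= (uint16_t)(1u << i);
    return mask;
}

/* Emulated masked write: only lanes whose mask bit is set are changed. */
void masked_set(int32_t *lanes, uint16_t mask, int32_t value) {
    for (int i = 0; i < LANES; i++)
        if ((mask >> i) & 1u)
            lanes[i] = value;
}

/* Same result as the scalar loop, but the branch has become a mask. */
void clamp_negatives_masked(int32_t *a, size_t n) {
    assert(a != NULL || n == 0);

    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        uint16_t mask = compare_lt_zero(a + i);
        masked_set(a + i, mask, 0);
    }
    /* Leftover elements: handled one at a time in this sketch. */
    for (; i < n; i++)
        if (a[i] < 0)
            a[i] = 0;
}
```

The point is that the `if` inside the loop turns into data (the mask) instead of control flow, so all 16 lanes can go through the same instruction stream.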