The past decade has seen the rise of GPGPU: we're leveraging the tremendous computational power of graphics cards to accelerate computationally intensive applications such as machine learning, video compression and sorting. Unfortunately, GPGPU has been somewhat slow to gain adoption. There are multiple issues involved: the need for special GPU/numerical programming languages, complex drivers, vendor-specific differences, and the overhead of shuffling data in and out of a separate memory hierarchy.
I was recently reading a blog post claiming that matrix multiplication (GEMM) is the most expensive operation in deep learning, taking up to 95% of the execution time. This got me thinking that maybe GPGPUs are simply not ideal for most applications. Maybe future CPUs should include numerical coprocessors instead. Sure, we already have SIMD, but the way it's implemented on x86 CPUs is awkward and relatively inefficient: you're left to deal with multithreading, prefetching, and SIMD registers of small, fixed sizes yourself. Every few years, Intel adds new instructions with new SIMD register sizes, rendering your code outdated (yuck). To do SIMD well, you basically have to write or generate code specialized for a specific CPU model, and even then, it's just not that fast.
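To make the awkwardness concrete, here's a minimal dot product written with AVX intrinsics in C. The register width (8 floats) is baked directly into the code, so a port to AVX-512 means rewriting it; this is just an illustrative sketch, and it assumes the array length is a multiple of 8 and that you compile with AVX enabled (e.g. -mavx on gcc/clang):

```c
#include <immintrin.h>
#include <stddef.h>

/* Dot product using 256-bit AVX registers. The lane count (8 floats)
   and the horizontal-sum dance at the end are both specific to this
   register size; AVX-512 or plain SSE would need different code. */
float dot_avx(const float *a, const float *b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    /* Horizontal sum of the 8 accumulator lanes. */
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}
```

Note that none of this expresses what we actually mean ("multiply these two vectors elementwise and sum the result"); it's all bookkeeping for one particular register file.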
I believe that the Cray-1, announced in 1975, had the right idea: you write a small loop (a kernel) and let the processor handle the memory traffic and the looping itself. What I'm thinking of is essentially a CPU core optimized to run parallel-for instructions over a reduced numerical instruction set. Imagine a specialized core that shares the same memory space as the other CPU cores and handles prefetching, parallelization and low-level optimizations according to its own capabilities, without you needing to change your code. Imagine not needing a driver or threads to make use of this core. Imagine how fast matrix multiplication or cross products could be with native hardware support.
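Here's a rough sketch of what the programming model could look like. Everything in it is invented for illustration (the pfor_saxpy name, the kernel signature); the point is only that the programmer writes the per-element kernel once, and the numerical core decides how to vectorize, prefetch and parallelize it:

```c
#include <stddef.h>

/* Hypothetical programming model: the programmer supplies just the
   per-element kernel, here one scalar step of SAXPY (y[i] = a*x[i] + y[i]).
   No register widths, no lane counts, no threads. */
static inline float saxpy_kernel(float a, float x, float y)
{
    return a * x + y;
}

/* Imagined parallel-for over the kernel. The scalar loop below is just
   the reference semantics; on the imagined core, this whole function
   would map to a single hardware-managed parallel-for instruction, with
   the core choosing vector width, prefetch distance and parallelism. */
void pfor_saxpy(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = saxpy_kernel(a, x[i], y[i]);
}
```

The same binary would then run at full speed on any implementation of the core, wide or narrow, which is exactly what the fixed-width x86 SIMD model can't give you.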