Maybe we ought to have Numerical Coprocessors?

May 25, 2015

The past decade has seen the rise of GPGPUs. We’re leveraging the tremendous computational power of graphics cards to accelerate computationally intensive applications such as machine learning, video compression and sorting. Unfortunately, GPGPU is somewhat slow to gain adoption. There are multiple issues involved, such as the need for special GPU/numerical programming languages, complex drivers, vendor-specific differences, and the overhead of having to shuffle data in and out of a separate memory hierarchy.

I was recently reading a blog post claiming that matrix multiplication (GEMM) is the most expensive operation in deep learning, taking up to 95% of the execution time. This got me thinking that maybe GPGPUs are simply not ideal for most applications. Maybe future CPUs should begin to include numerical coprocessors. Sure, we already have SIMD, but the way it’s implemented on x86 CPUs is awkward and relatively inefficient, forcing you to deal with multithreading, prefetching, and SIMD registers of small fixed sizes. Every few years, Intel adds support for new instructions with new SIMD register sizes, rendering your code outdated (yuck). To do SIMD well, you basically write or generate code specialized for a specific CPU model, and even then, it’s just not that fast.
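To make the cost of GEMM concrete, here is a minimal Python sketch of a naive matrix multiply. The names `gemm_naive` and `multiply_add_count` are made up for this illustration and do not come from any library. Multiplying an m×k matrix by a k×n matrix takes m·n·k multiply-adds, which is why this single loop nest can dominate the runtime.

```python
def gemm_naive(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            # The innermost loop is a chain of multiply-adds (a dot product).
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def multiply_add_count(m, k, n):
    """Number of multiply-add steps the naive loop nest above performs."""
    return m * n * k
```

For two 1000×1000 matrices, `multiply_add_count(1000, 1000, 1000)` is a billion multiply-adds, all with the same regular access pattern, which is the kind of work dedicated numerical hardware could plausibly take over.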

I believe that the Cray 1, which came out in 1975, had the right idea. You write a small loop (kernel) and let the CPU handle the memory traffic and looping. What I’m thinking of is essentially a CPU core which is optimized to run parallel-for instructions on a reduced numerical instruction set. Imagine having a specialized CPU core that shares the same memory space as other CPU cores and can handle prefetching, parallelization and low-level optimizations according to its capabilities, without you needing to change your code. Imagine not needing a driver or threads to make use of this core. Imagine how fast matrix multiplication or cross product could be if you had native hardware support for it.
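As a rough illustration of the programming model described above, here is a small simulation in Python. It is not a real instruction set: `parallel_for`, `hardware_width` and `saxpy` are hypothetical names invented for this sketch, and `hardware_width` merely stands in for however many loop iterations an imagined core could process per step. The point is that the kernel is written once and the result does not depend on the width.

```python
def parallel_for(n, kernel, hardware_width):
    """Run kernel(i) for every i in range(n) and return the number of steps taken.

    hardware_width (must be >= 1) is an assumption of this sketch: it models
    how many iterations the imagined core handles per step. The kernel never
    sees it.
    """
    steps = 0
    for start in range(0, n, hardware_width):
        stop = min(start + hardware_width, n)
        # Real hardware would run these iterations together;
        # this simulation simply runs them one after another.
        for i in range(start, stop):
            kernel(i)
        steps += 1
    return steps


def saxpy(alpha, x, y, hardware_width):
    """Compute alpha * x[i] + y[i] for each i using the parallel-for sketch."""
    out = [0.0] * len(x)

    def kernel(i):
        out[i] = alpha * x[i] + y[i]

    steps = parallel_for(len(x), kernel, hardware_width)
    return out, steps
```

Calling `saxpy(2.0, x, y, 4)` and `saxpy(2.0, x, y, 8)` on the same inputs gives the same output list; only the step count differs. That is the property the post is asking for: the code stays the same while the hardware picks the width.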


10 Comments

Zeev Tarantov

OpenCL?

- Maxime Chevalier-Boisvert
  
  OpenCL is higher level than what I have in mind, which is an instruction set extension designed for parallel calculations, with some hardware-independent features (the hardware adapts the specifics of the computation based on its capabilities).
  
  - Zeev Tarantov
    
    You’d have to implement a useful algorithm using this proposed instruction set and describe two different implementations of the instruction set for people to understand you, IMO.
    
Rob Fowler

Why not use a GPU? Common, relatively easy-to-use libraries already exist (https://developer.nvidia.com/cuBLAS). Not just GEMM, which is accessible as an API in the library, but the whole of BLAS. They also added support for BLAS across multiple GPUs over a year ago.
Somewhat like ye olde 80287: if you need the numerical power, you can just pop to Fry’s and pick up a card to offload it.

- Maxime Chevalier-Boisvert
  
  The overhead of CPU-GPU communication is quite high. If you wanted to do operations on matrices of moderate size, it would likely not be worthwhile to pay the cost of shuffling the data over and launching these operations on GPUs. However, SSE’s SIMD still feels underpowered. I think there could be a middle ground between those two options, and a better integration of numerical computations and general-purpose computations.
  
  - A. Miloradovsky
    
    Regarding overhead reduction:
    Heterogeneous System Architecture (HSA) is a computing architecture created to address exactly these concerns, namely the efficiency of CPU and GPU integration.
    There are processors that use this scheme (the CPU and GPU are packed into a single IC and share all resources, including memory access, etc.).
    AMD calls them Accelerated Processing Units (APUs), and they have existed for several years.
    Look at AMD’s site for developers; there is a lot of information on this topic.
    Another implementation of BLAS for GPUs (or APUs) is clMath (open source).
    
    - Maxime Chevalier-Boisvert
      
      APUs are going in the direction of what I was proposing, but it’s still an awkward integration of what used to be a graphics card. You still need to go through the OS (system calls) to access this. You need drivers and libraries. That makes it inefficient. There is overhead that could be shaved off. I still believe that there must be a better way to integrate things.
A. Miloradovsky

CPUs have fused multiply-add instructions to accelerate the computation of linear combinations (linear algebra): matrix multiplication, dot products, etc. (see the FMA instruction set).
GPUs are great piece of hardware, exactly for “numerical” computations, especially those with GCN architecture, the problem is in their programming, not in the hardware itself.
The main problem with tools like OpenCL/OpenGL SL is that they use “two-level” or “just-in-time” translation: the library is just an interface to another translator, for programs that will be compiled only after the main program starts…
But the problem is not the JIT architecture itself; it is the lack of a way to change the “shader language” to whatever I want.
There are attempts to address this problem by defining an API that describes the kernels/shaders in terms of the language itself (not embedded strings). Mantle is one example, but it is Windows-only; there is also Vulkan, but it is at a very early stage of its development.
Anyway, I hope that in the not-too-distant future Vulkan will make it quite easy to create a GPU backend for any programming language; right now, however, you’d have to dig through all the hardware documentation for the specific GPU to do that.
(And this is what I’m considering doing, if only I have the time…)
Finally, about vendor specificity: AFAIK there is only one modern GPU vendor that offers all this documentation…
Model specificity is another problem; it can, and has to, be solved just as it is for CPUs.
In short: we ought to have not numerical coprocessors (yet another piece of hardware) but rather a better API (than OpenCL) for programming GPUs, one which is a free standard and supported by Mesa et al.

- A. Miloradovsky
  
  s/great piece/great pieces/
  
  P.S. By the way, look for the Vulkan Overview (PDF by Khronos Group) and SPIR-V: I guess this is what you had in mind.
  
S Kumar

Efficient dataflow is probably what you are looking for. Look at the TRIPS project. For a short period of time it was more efficient than von Neumann-style processors for DGEMM. However, it is encumbered by the fact that it tries to run general-purpose code.

A recent research project from U. Wisc, to be presented at ISCA 2015, will show how to integrate a dataflow substrate into a conventional CPU (http://pages.cs.wisc.edu/~vinay/pubs/isca-hybrid-arch.pdf).
