
The Compute Unified Device Architecture (CUDA) enables NVIDIA graphics processing units (GPUs) to be used for massively parallel, general-purpose computation.
The NVBLAS library is a GPU-accelerated library that implements BLAS (Basic Linear Algebra Subprograms). It can accelerate most BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs present in the system when the characteristics of a call make it likely to run faster on a GPU.