Improve GPU implementation

While the performance of the CUDA implementation is already reasonable, several further optimisations could be made. For example, the sass::cuda::scatter() kernel could exploit block-level shared memory to reduce the number of global memory accesses. The performance of the reduction kernels should likewise be investigated further. Specifically, the sass::cuda::store_atfinal kernel currently does not use any form of partial reduction and hence calls atomicAdd relatively often, which could lead to memory contention.
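
To illustrate the kind of partial reduction meant here, the following is a minimal, generic sketch and not the actual sass::cuda::store_atfinal code, which is not reproduced in this section. The names block_reduce_sum and reduce_on_device, the float data type, and the launch configuration are all hypothetical. Each thread first accumulates a private sum, the block then combines these sums in shared memory, and only one thread per block calls atomicAdd, so the number of atomic operations on the output drops from one per element to one per block.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (illustration only): adds the sum of in[0..n) to *out.
// Assumes blockDim.x is a power of two and that the launch provides
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void block_reduce_sum(const float* in, float* out, int n)
{
    extern __shared__ float partial[];

    // Step 1: grid-stride loop, each thread accumulates a private sum.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        sum += in[i];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Step 2: tree reduction of the per-thread sums in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) {
            partial[threadIdx.x] += partial[threadIdx.x + s];
        }
        __syncthreads();
    }

    // Step 3: a single atomicAdd per block instead of one per element.
    if (threadIdx.x == 0) {
        atomicAdd(out, partial[0]);
    }
}

// Hypothetical host helper. Error checking is omitted for brevity.
float reduce_on_device(const std::vector<float>& host_in)
{
    const int n = static_cast<int>(host_in.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256;  // power of two, as the kernel assumes
    const int blocks = 64;    // arbitrary choice for this sketch
    block_reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

With this launch configuration the output location receives 64 atomic updates regardless of the input size. Whether this pattern carries over to store_atfinal depends on how that kernel distributes its writes, so the benefit would have to be measured.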