A simple and easy-to-use library for GPU computing in Fortran, providing transparent access to GPU acceleration through a clean Fortran interface.
Documentation
Online API Documentation - Complete API reference generated with Doxygen
Overview
Simple GPU is a library designed to simplify GPU computing in Fortran applications. It provides:
- Dual implementation: CPU-only version using standard BLAS, and GPU-accelerated version using NVIDIA cuBLAS or AMD hipBLAS
- Transparent interface: Same Fortran API for both CPU and GPU versions
- Memory management: Easy GPU memory allocation and data transfer
- BLAS operations: Common BLAS operations (GEMM, GEMV, DOT, GEAM) for both single and double precision
- Stream support: Asynchronous operations through CUDA or HIP streams
Features
Memory Management
gpu_allocate: Allocate memory on GPU (or CPU for CPU version)
gpu_deallocate: Free allocated memory
gpu_upload: Transfer data from CPU to GPU
gpu_download: Transfer data from GPU to CPU
gpu_copy: Copy data between GPU memory regions
Device Management
gpu_ndevices: Query number of available GPU devices
gpu_set_device: Select active GPU device
gpu_get_memory: Query GPU memory status
BLAS Operations
All BLAS operations have variants that accept 64-bit integers for dimensions. These variants have a _64 suffix (e.g., gpu_ddot_64, gpu_dgemm_64).
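A 64-bit call mirrors its 32-bit counterpart. A sketch (assuming the `_64` variants keep the same argument order, with 64-bit dimensions and increments, and that the handle and device arrays x, y are already set up as in the examples below):

```fortran
integer(8) :: n
double precision :: result
n = 3000000000_8   ! a dimension too large for a 32-bit integer
call gpu_ddot_64(handle, n, x%f(1), 1_8, y%f(1), 1_8, result)
```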
Level 1: Vector operations
gpu_sdot, gpu_ddot: Dot product (single/double precision)
Level 2: Matrix-vector operations
gpu_sgemv, gpu_dgemv: Matrix-vector multiplication
Level 3: Matrix-matrix operations
gpu_sgemm, gpu_dgemm: Matrix-matrix multiplication
gpu_sgeam, gpu_dgeam: Matrix addition/transposition
Advanced Features
- Streams: Create and manage CUDA or HIP streams for asynchronous execution
- BLAS handles: Manage cuBLAS or hipBLAS library handles
- Stream synchronization: Control execution flow
Installation
Prerequisites
For CPU version (required):
- C compiler (gcc, clang, etc.)
- Fortran compiler (gfortran, ifort, etc.)
- BLAS library (OpenBLAS, Intel MKL, or reference BLAS)
- Autotools (autoconf, automake, libtool)
For NVIDIA GPU version (optional):
- NVIDIA CUDA Toolkit (with nvcc compiler)
- NVIDIA cuBLAS library
- CUDA-capable GPU
For AMD GPU version (optional):
- AMD ROCm platform
- hipBLAS library
- ROCm-capable GPU
Building from Source
- Generate configure script (if building from git):
- Configure the build:
The configure script automatically detects whether CUDA or ROCm is available and enables the corresponding GPU library if possible.
Configuration options:
--disable-nvidia: Disable NVIDIA GPU library even if CUDA is available
--with-cuda=DIR: Specify CUDA installation directory (default: /usr/local/cuda)
--disable-amd: Disable AMD GPU library even if ROCm is available
--with-rocm=DIR: Specify ROCm installation directory (default: auto-detect)
--with-blas=LIB: Specify BLAS library to use
Example configurations:
# CPU version only
./configure --disable-nvidia --disable-amd
# Specify CUDA location
./configure --with-cuda=/opt/cuda
# Specify ROCm location
./configure --with-rocm=/opt/rocm
# Use specific BLAS library
./configure --with-blas="-lmkl_rt"
- Build the libraries:
- Run tests (optional):
- Install:
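The steps above follow the standard Autotools flow; a typical full sequence (a sketch, assuming a fresh git checkout and the default installation prefix) looks like:

```shell
# Generate the configure script (git checkouts only)
autoreconf -i
# Configure the build (auto-detects CUDA/ROCm)
./configure
# Build the libraries
make
# Run the test suite (optional)
make check
# Install (may require elevated privileges)
sudo make install
```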
Library Versions
Simple GPU provides three shared libraries:
- libsimple_gpu-cpu.so: CPU-only version
- Uses standard BLAS library
- No GPU required
- Useful for development and testing on systems without GPUs
- libsimple_gpu-nvidia.so: NVIDIA GPU version (if CUDA is available)
- Uses NVIDIA cuBLAS library
- Requires CUDA-capable GPU
- Provides GPU acceleration for supported operations
- libsimple_gpu-amd.so: AMD GPU version (if ROCm is available)
- Uses AMD hipBLAS library
- Requires ROCm-capable GPU
- Provides GPU acceleration for supported operations
All libraries provide the same Fortran interface, allowing seamless switching between CPU and GPU implementations.
Usage
Including the Library in Your Code
Important: To use the Simple GPU library in your Fortran project, you must create a file with the .F90 extension (uppercase) that includes the library header using the C preprocessor:
#include <simple_gpu.F90>
Why Use .F90 Instead of .f90?
The .F90 extension (uppercase) is critical because:
- Preprocessor support: Fortran compilers run the C preprocessor only on files with uppercase extensions (.F90, .F, .FOR). The lowercase .f90 extension bypasses the preprocessor, so the #include directive won't work.
- Cross-compiler compatibility: The #include <simple_gpu.F90> directive lets the C preprocessor find the header file in a default location (via the CPATH environment variable). This means:
- When you update the library, no changes are needed in your code
- The library can be compiled once with one compiler (e.g., gcc/gfortran) and used with any other Fortran compiler (ifort, nvfortran, etc.)
- No Fortran .mod files are distributed; .mod files are notoriously incompatible between different compilers
- Simplified distribution: By avoiding compiler-specific .mod files, simple_gpu maintains maximum portability across different Fortran compiler ecosystems.
Example of a minimal program:
#include <simple_gpu.F90>
program my_program
implicit none
end program my_program
Data Types
The library provides multidimensional array types for both single and double precision:
gpu_double1: 1-dimensional array of double precision values
gpu_double2: 2-dimensional array of double precision values
gpu_double3: 3-dimensional array of double precision values
gpu_double4, gpu_double5, gpu_double6: 4, 5, and 6-dimensional arrays
Similarly for single precision:
gpu_real1 through gpu_real6
Each type contains:
c: C pointer to GPU memory
f: Fortran pointer for accessing data (e.g., f(:) for 1D, f(:,:) for 2D)
The gpu_allocate function is overloaded and automatically accepts the appropriate number of dimensions:
call gpu_allocate(x, n)       ! 1D array: x is gpu_double1 or gpu_real1
call gpu_allocate(b, l, m, n) ! 3D array: b is gpu_double3 or gpu_real3
Basic Example with 1D Arrays
#include <simple_gpu.F90>
program example
  implicit none
  type(gpu_blas) :: handle
  type(gpu_double1) :: x, y
  double precision, allocatable :: x_h(:), y_h(:)
  double precision :: result
  integer :: n, i

  n = 1000
  allocate(x_h(n), y_h(n))
  do i = 1, n
    x_h(i) = dble(i)
    y_h(i) = dble(i) * 2.0d0
  end do

  ! Allocate GPU memory and upload the host data
  call gpu_allocate(x, n)
  call gpu_allocate(y, n)
  call gpu_upload(x_h, x)
  call gpu_upload(y_h, y)

  ! Create a BLAS handle, compute the dot product, destroy the handle
  call gpu_blas_create(handle)
  call gpu_ddot(handle, n, x%f(1), 1, y%f(1), 1, result)
  call gpu_blas_destroy(handle)

  print *, 'Dot product result:', result

  call gpu_deallocate(x)
  call gpu_deallocate(y)
  deallocate(x_h, y_h)
end program example
Example with 2D Arrays
#include <simple_gpu.F90>
program example_2d
  implicit none
  type(gpu_blas) :: handle
  type(gpu_double2) :: a, b, c
  double precision, allocatable :: a_h(:,:), b_h(:,:), c_h(:,:)
  double precision :: alpha, beta
  integer :: m, n, i, j

  m = 100
  n = 200
  allocate(a_h(m,n), b_h(m,n), c_h(m,n))
  do j = 1, n
    do i = 1, m
      a_h(i,j) = dble(i + j)
      b_h(i,j) = dble(i * j)
    end do
  end do

  ! Allocate GPU memory and upload the host data
  call gpu_allocate(a, m, n)
  call gpu_allocate(b, m, n)
  call gpu_allocate(c, m, n)
  call gpu_upload(a_h, a)
  call gpu_upload(b_h, b)

  ! Compute C = alpha*A + beta*B with GEAM
  alpha = 1.5d0
  beta = 0.5d0
  call gpu_blas_create(handle)
  call gpu_dgeam(handle, 'N', 'N', m, n, &
                 alpha, a%f(1,1), m, beta, b%f(1,1), m, &
                 c%f(1,1), m)
  call gpu_blas_destroy(handle)

  ! Download the result back to the host
  call gpu_download(c, c_h)
  print *, 'Result at (1,1):', c_h(1,1)

  call gpu_deallocate(a)
  call gpu_deallocate(b)
  call gpu_deallocate(c)
  deallocate(a_h, b_h, c_h)
end program example_2d
Important Note about BLAS Function Arguments:
When calling BLAS functions, always pass the first element of the array (e.g., x%f(1) for 1D arrays or a%f(1,1) for 2D arrays); otherwise you may encounter type errors:
call gpu_ddot(handle, n, x%f(1), 1, y%f(1), 1, result) ! correct
call gpu_ddot(handle, n, x, 1, y, 1, result)           ! incorrect: passes the derived type, not the data
Technical Note: The library wrappers use Fortran's c_loc() intrinsic to obtain the memory address of the array element. By passing x%f(1), you're providing the first element as a scalar with the target attribute, which c_loc() then converts to the appropriate C pointer for the underlying BLAS routines.
Using Different Library Versions
To use a specific library version, link your application against the desired library:
# CPU version
gfortran -o myapp myapp.F90 -lsimple_gpu-cpu
# GPU version (NVIDIA)
gfortran -o myapp myapp.F90 -lsimple_gpu-nvidia
# GPU version (AMD)
gfortran -o myapp myapp.F90 -lsimple_gpu-amd
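Because the header is resolved through the preprocessor's search path, you may also need to tell the compiler where simple_gpu.F90 was installed. A sketch, assuming the default /usr/local prefix:

```shell
# Let the preprocessor find simple_gpu.F90 (see the CPATH note above)
export CPATH=/usr/local/include:$CPATH
gfortran -o myapp myapp.F90 -lsimple_gpu-cpu
```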
Testing
The library includes comprehensive unit tests that compare CPU and GPU implementations to ensure correctness.
Run tests with:
For verbose test output:
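With a standard Automake test harness (an assumption based on the Autotools build), the usual commands are:

```shell
# Run the test suite
make check
# Verbose test output (VERBOSE=1 is the standard Automake switch)
make check VERBOSE=1
```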
API Reference
For detailed API documentation, see the comments in:
Performance Notes
- For small problem sizes, CPU version may be faster due to GPU overhead
- GPU version shows significant speedup for larger matrices/vectors
- Use streams for asynchronous operations to overlap computation and data transfer
- Keep data on GPU between operations to minimize transfer overhead
Troubleshooting
CUDA not detected during configuration
If CUDA is installed but not detected:
./configure --with-cuda=/path/to/cuda
BLAS library not found
Specify BLAS library explicitly:
./configure --with-blas="-lopenblas"
Runtime errors with GPU version
- Ensure NVIDIA drivers are properly installed (for NVIDIA GPUs)
- Ensure ROCm drivers are properly installed (for AMD GPUs)
- Check that CUDA libraries are in your library path:
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Check that ROCm libraries are in your library path:
export LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH
- Verify GPU is accessible with nvidia-smi (NVIDIA) or rocm-smi (AMD)
Contributing
Contributions are welcome! Please ensure:
- Code follows existing style and conventions
- All tests pass before submitting
- New features include appropriate tests
Adding BLAS Functions
The library currently implements a core set of BLAS operations (DOT, GEMV, GEMM, GEAM). If you need additional BLAS functions (such as AXPY, SCAL, TRSM, SYRK, etc.), contributions are highly encouraged! Adding new BLAS routines follows the established pattern in the codebase and helps make the library more complete and useful for the community.
Building Documentation
The project uses Doxygen to generate API documentation from source code comments.
Prerequisites
- Doxygen (version 1.9 or later)
- Graphviz (for generating diagrams)
Generate Documentation Locally
# Install dependencies (Ubuntu/Debian)
sudo apt-get install doxygen graphviz
# Generate documentation
doxygen Doxyfile
# View documentation
# Open docs/html/index.html in your web browser
The documentation is automatically built and published to GitHub Pages when changes are pushed to the main branch.
License
See [LICENSE](LICENSE) file for details.