
pplmx/nova

CUDA Parallel Algorithms Library


A production-ready CUDA parallel algorithms library built on a four-layer architecture (Layers 0–3), designed for education, extensibility, and production use.

Architecture

Four-Layer Design

┌─────────────────────────────────────────────────────────────┐
│  Layer 3: High-Level API (STL-style)                        │
│  cuda::reduce(), cuda::sort()                              │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 2: Algorithm Wrappers                                │
│  cuda::algo::reduce_sum(), memory management               │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 1: Device Kernels                                   │
│  Pure __global__ kernels, no memory allocation             │
└─────────────────────────────────────────────────────────────┘
                              ▲
┌─────────────────────────────────────────────────────────────┐
│  Layer 0: Memory Foundation                                │
│  Buffer<T>, unique_ptr<T>, MemoryPool, Allocator concepts   │
└─────────────────────────────────────────────────────────────┘

Directory Structure

include/cuda/
├── memory/               # Layer 0: Memory Foundation
│   ├── buffer.h         # cuda::memory::Buffer<T>
│   ├── unique_ptr.h     # cuda::memory::unique_ptr<T>
│   ├── memory_pool.h    # MemoryPool for allocation
│   └── allocator.h      # Allocator concepts
├── device/              # Layer 1: Device Kernels
│   ├── reduce_kernels.h
│   ├── scan_kernels.h
│   └── device_utils.h   # CUDA_CHECK, warp_reduce
├── algo/                 # Layer 2: Algorithm Wrappers
│   ├── reduce.h
│   ├── scan.h
│   └── sort.h
└── api/                  # Layer 3: High-Level API
    ├── device_vector.h   # STL-style device container
    ├── stream.h          # Stream and Event wrappers
    └── config.h          # Algorithm configuration objects

include/
├── image/               # Image processing
│   ├── types.h
│   ├── brightness.h
│   ├── gaussian_blur.h
│   ├── sobel_edge.h
│   └── morphology.h
├── parallel/            # Parallel primitives
│   ├── scan.h
│   ├── sort.h
│   └── histogram.h
├── matrix/              # Matrix operations
│   ├── add.h
│   ├── mult.h
│   └── ops.h
└── convolution/         # Convolution
    └── conv2d.h

src/
├── memory/               # Layer 0 implementations
├── cuda/
│   ├── device/           # Layer 1 implementations
│   └── algo/             # Layer 2 implementations
├── image/
├── parallel/
├── matrix/
└── convolution/

Layer Responsibilities

| Layer   | Namespace      | Purpose                          | Dependencies  |
|---------|----------------|----------------------------------|---------------|
| Layer 0 | `cuda::memory` | Memory allocation, RAII, pooling | CUDA runtime  |
| Layer 1 | `cuda::device` | Pure device kernels              | Layer 0       |
| Layer 2 | `cuda::algo`   | Memory management, algorithms    | Layers 0, 1   |
| Layer 3 | `cuda::api`    | STL-style containers             | Layers 0, 1, 2 |
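The dependency rule reads bottom-up: a Layer 2 wrapper owns memory (Layer 0's job) and drives a pure computational kernel (Layer 1's job). The separation can be sketched with a minimal host-side mock — this illustrates the assumed structure only, not the library's actual code:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

namespace mock {

// Layer 1 analogue: a "kernel" that only computes. It receives raw
// pointers and never allocates or frees memory itself.
int reduce_kernel(const int* data, std::size_t n) {
    return std::accumulate(data, data + n, 0);
}

// Layer 2 analogue: owns the staging buffer (the Layer 0 role), then
// hands raw pointers down to the kernel.
int reduce_sum(const std::vector<int>& host_input) {
    std::vector<int> staging(host_input);  // stands in for a device Buffer<T>
    return reduce_kernel(staging.data(), staging.size());
}

}  // namespace mock
```

Keeping allocation out of Layer 1 is what lets the same kernels be reused by different memory strategies (plain buffers, pools) without modification.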

Quick Start

Build

git clone https://github.com/pplmx/nova.git
cd nova
make build

Run Demo

make run

Run Tests

make test          # Run all tests
make test-unit     # Run algorithm tests

Usage Examples

Layer 0: Memory Foundation

#include <vector>

#include "cuda/memory/buffer.h"
#include "cuda/memory/memory_pool.h"

// Host-side input to upload
std::vector<int> host_data(1024, 1);

// RAII memory management: device memory is released automatically
cuda::memory::Buffer<int> buf(1024);
buf.copy_from(host_data.data(), 1024);

// Memory pool for efficient repeated allocations
cuda::memory::MemoryPool pool({.block_size = 1 << 20});
auto buf2 = pool.allocate(1024);
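The RAII idea behind `Buffer<T>` can be sketched with a host-memory analogue. This is an illustrative mock, not the library's implementation; a real device buffer would swap `new[]`/`delete[]` for `cudaMalloc`/`cudaFree` and `memcpy` for `cudaMemcpy`:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Minimal RAII buffer analogue: owns a fixed-size allocation and
// releases it automatically when the object goes out of scope.
template <typename T>
class Buffer {
public:
    explicit Buffer(std::size_t n) : n_(n), data_(new T[n]) {}
    void copy_from(const T* src, std::size_t n) {
        std::memcpy(data_.get(), src, n * sizeof(T));
    }
    T* data() { return data_.get(); }
    std::size_t size() const { return n_; }
private:
    std::size_t n_;
    std::unique_ptr<T[]> data_;  // frees the allocation in ~Buffer()
};
```

The payoff is exception safety: no code path can leak the allocation, because cleanup is tied to scope rather than to an explicit free call.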

Layer 2: Algorithm API

#include "cuda/algo/reduce.h"

// d_input: device pointer holding N ints; the Layer 2 wrappers manage
// any temporary device memory internally
int sum = cuda::algo::reduce_sum(d_input, N);
int max = cuda::algo::reduce_max(d_input, N);
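Semantically, these wrappers compute ordinary reductions over the input. A serial CPU reference (useful for validating GPU results in tests; not the library's code) looks like:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Serial references for the Layer 2 reductions. On the GPU, the same
// results are produced in parallel by a tree reduction across blocks.
int reduce_sum_ref(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}

int reduce_max_ref(const std::vector<int>& v) {
    return *std::max_element(v.begin(), v.end());  // requires non-empty input
}
```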

Layer 3: High-Level API

#include "cuda/api/device_vector.h"
#include "cuda/api/stream.h"
#include "cuda/api/config.h"

// DeviceVector - STL-style container
cuda::api::DeviceVector<int> d_vec(N);
d_vec.copy_from(input);  // input: host data with N elements
int sum = cuda::algo::reduce_sum(d_vec.data(), d_vec.size());

// Stream - RAII async operations
cuda::api::Stream stream;
stream.synchronize();

// Config - algorithm configuration
auto config = cuda::api::ReduceConfig::optimized_config();
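Under the high-level API, a reduction ultimately runs as a pairwise tree reduction: each round halves the number of active elements, and a GPU executes each round as one synchronized parallel step. A serial simulation of that shape (illustrative only):

```cpp
#include <cstddef>
#include <vector>

// Pairwise tree reduction. Each round folds the upper half of the
// active range into the lower half; after log2(n) rounds, v[0] holds
// the total. On a GPU, every iteration of the inner loop runs as one
// parallel step by a separate thread.
int tree_reduce_sum(std::vector<int> v) {
    std::size_t n = v.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i < n / 2; ++i)
            v[i] += v[i + half];  // pair element i with element i + half
        n = half;
    }
    return v.empty() ? 0 : v[0];
}
```

The `(n + 1) / 2` stride handles odd-length rounds: the middle element simply carries over to the next round unpaired.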

Modules

| Module         | Files                                              | Description            |
|----------------|----------------------------------------------------|------------------------|
| `cuda::memory` | Buffer, unique_ptr, MemoryPool, Allocator          | Memory management      |
| `cuda::device` | device_utils, reduce_kernels                       | Pure CUDA kernels      |
| `cuda::algo`   | reduce wrappers, device_buffer                     | Algorithm orchestration |
| `cuda::api`    | DeviceVector, Stream, Event, Config                | High-level API         |
| `image`        | types, brightness, gaussian_blur, sobel, morphology | Image processing      |
| `parallel`     | scan, sort, histogram                              | Parallel primitives    |
| `matrix`       | add, mult, ops                                     | Matrix operations      |
| `convolution`  | conv2d                                             | 2D convolution         |
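The `parallel` module's scan is an inclusive prefix sum: `out[i] = in[0] + … + in[i]`. A serial reference (handy as a test oracle for the GPU version, which would typically use a work-efficient up-sweep/down-sweep; this is not the library's code):

```cpp
#include <cstddef>
#include <vector>

// Serial inclusive scan (prefix sum): out[i] is the sum of in[0..i].
std::vector<int> inclusive_scan_ref(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```

Scan is the workhorse primitive here: radix sort and histogram compaction are both commonly built on top of it.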

Testing

81 tests across 13 test suites, all passing:

| Test Suite       | Tests |
|------------------|-------|
| ReduceTest       | 11    |
| ScanTest         | 10    |
| SortTest         | 7     |
| OddEvenSortTest  | 3     |
| MatrixMultTest   | 7     |
| MatrixOpsTest    | 16    |
| ImageBufferTest  | 5     |
| GaussianBlurTest | 7     |
| SobelTest        | 7     |
| BrightnessTest   | 10    |
| TestPatternsTest | 14    |

Development

Makefile Targets

| Target       | Description                 |
|--------------|-----------------------------|
| `make build` | Configure and build project |
| `make run`   | Run benchmark demo          |
| `make test`  | Run all tests (81 tests)    |
| `make clean` | Clean build artifacts       |

Requirements

  • CUDA Toolkit 12+
  • CMake 3.25+
  • C++20 compatible compiler
  • CUDA-capable GPU

License

Licensed under either of the two licenses shipped with the repository (Apache-2.0 is one of them; see the Contribution section below), at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

See CONTRIBUTING.md.
