    OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect · 25072a47
    Tobias Burnus authored
    When copying a 2D or 3D rectangular memory block, performance is
    better when using CUDA's cuMemcpy2D/cuMemcpy3D than when copying the
    data one contiguous row at a time. This commit makes libgomp use them.
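    As a rough, host-only illustration (not the plugin code): a 2D,
    row-major omp_target_memcpy_rect request can be described entirely by
    byte offsets, pitches, a width in bytes and a row count, which is the
    kind of description cuMemcpy2D accepts in a single call. The struct
    pitched_copy_2d and the helpers rect_to_pitched and pitched_copy_host
    are hypothetical; the per-row loop stands for the copying that one
    driver call replaces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side descriptor; the fields echo CUDA_MEMCPY2D
   (srcXInBytes, srcY, srcPitch, WidthInBytes, Height), but this is
   not the CUDA type.  */
struct pitched_copy_2d
{
  size_t src_x_bytes, src_y, src_pitch;
  size_t dst_x_bytes, dst_y, dst_pitch;
  size_t width_bytes, height;
};

/* Map a 2D omp_target_memcpy_rect-style request (row-major: index 0 is
   the row, index 1 the column) onto one pitched-copy description.  */
static struct pitched_copy_2d
rect_to_pitched (size_t elem_size, const size_t volume[2],
                 const size_t dst_offsets[2], const size_t src_offsets[2],
                 const size_t dst_dims[2], const size_t src_dims[2])
{
  struct pitched_copy_2d c;
  c.src_x_bytes = src_offsets[1] * elem_size;
  c.src_y = src_offsets[0];
  c.src_pitch = src_dims[1] * elem_size;  /* bytes per source row */
  c.dst_x_bytes = dst_offsets[1] * elem_size;
  c.dst_y = dst_offsets[0];
  c.dst_pitch = dst_dims[1] * elem_size;  /* bytes per destination row */
  c.width_bytes = volume[1] * elem_size;
  c.height = volume[0];
  return c;
}

/* Host-only emulation: one memcpy per row.  A driver call such as
   cuMemcpy2D receives the whole descriptor at once instead.  */
static void
pitched_copy_host (void *dst, const void *src, const struct pitched_copy_2d *c)
{
  assert (c->src_x_bytes + c->width_bytes <= c->src_pitch
          && c->dst_x_bytes + c->width_bytes <= c->dst_pitch);
  for (size_t row = 0; row < c->height; row++)
    memcpy ((char *) dst + (c->dst_y + row) * c->dst_pitch + c->dst_x_bytes,
            (const char *) src + (c->src_y + row) * c->src_pitch
              + c->src_x_bytes,
            c->width_bytes);
}
```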
    
    Additionally, it permits device-to-device copies, using a temporary
    buffer on the host if necessary.
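    A minimal sketch of such a host-staging fallback, assuming two
    hypothetical per-device hooks (dev2host_fn, host2dev_fn) rather than
    libgomp's actual plugin interface. fake_dev2host and fake_host2dev
    simulate devices with plain host memory so the sketch runs anywhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-device copy hooks (not libgomp's plugin API);
   each returns true on success.  */
typedef bool (*dev2host_fn) (int dev, void *host_dst, const void *dev_src,
                             size_t n);
typedef bool (*host2dev_fn) (int dev, void *dev_dst, const void *host_src,
                             size_t n);

/* Copy N bytes from SRC on SRC_DEV to DST on DST_DEV by staging them
   in a temporary host buffer.  */
static bool
copy_dev_to_dev_via_host (int dst_dev, void *dst, host2dev_fn to_dev,
                          int src_dev, const void *src, dev2host_fn from_dev,
                          size_t n)
{
  assert (to_dev != NULL && from_dev != NULL);
  if (n == 0)
    return true;
  void *tmp = malloc (n);
  if (tmp == NULL)
    return false;
  /* Device -> host, then host -> device; stop at the first failure.  */
  bool ok = from_dev (src_dev, tmp, src, n) && to_dev (dst_dev, dst, tmp, n);
  free (tmp);
  return ok;
}

/* Fake "devices" backed by host memory, for trying the sketch out.  */
static bool
fake_dev2host (int dev, void *host_dst, const void *dev_src, size_t n)
{
  (void) dev;
  memcpy (host_dst, dev_src, n);
  return true;
}

static bool
fake_host2dev (int dev, void *dev_dst, const void *host_src, size_t n)
{
  (void) dev;
  memcpy (dev_dst, host_src, n);
  return true;
}
```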
    
    include/ChangeLog:
    
    	* cuda/cuda.h (CUresult): Add CUDA_ERROR_NOT_INITIALIZED,
    	CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_INVALID_HANDLE.
    	(CUarray, CUmemorytype, CUDA_MEMCPY2D, CUDA_MEMCPY3D,
    	CUDA_MEMCPY3D_PEER): New typedefs.
    	(cuMemcpy2D, cuMemcpy2DAsync, cuMemcpy2DUnaligned,
    	cuMemcpy3D, cuMemcpy3DAsync, cuMemcpy3DPeer,
    	cuMemcpy3DPeerAsync): New prototypes.
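    The CUDA_MEMCPY3D descriptor extends the 2D pitched layout with a
    slice index and a per-slice row count. The host-only sketch below
    models just that address arithmetic; copy3d_desc, offset3d and
    copy3d_host are hypothetical, and only the field names are modelled on
    CUDA_MEMCPY3D (srcXInBytes, srcY, srcZ, srcPitch, srcHeight,
    WidthInBytes, Height, Depth).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only stand-in; only the fields needed for the
   address arithmetic are modelled.  */
struct copy3d_desc
{
  size_t src_x_bytes, src_y, src_z, src_pitch, src_height;
  size_t dst_x_bytes, dst_y, dst_z, dst_pitch, dst_height;
  size_t width_bytes, height, depth;
};

/* Byte offset of column X_BYTES in row Y of slice Z, where each row is
   PITCH bytes and each slice holds HEIGHT rows.  */
static size_t
offset3d (size_t x_bytes, size_t y, size_t z, size_t pitch, size_t height)
{
  return (z * height + y) * pitch + x_bytes;
}

/* Emulate the 3D copy on the host, one row at a time.  */
static void
copy3d_host (void *dst, const void *src, const struct copy3d_desc *d)
{
  assert (d->src_x_bytes + d->width_bytes <= d->src_pitch
          && d->dst_x_bytes + d->width_bytes <= d->dst_pitch);
  for (size_t z = 0; z < d->depth; z++)
    for (size_t y = 0; y < d->height; y++)
      memcpy ((char *) dst
                + offset3d (d->dst_x_bytes, d->dst_y + y, d->dst_z + z,
                            d->dst_pitch, d->dst_height),
              (const char *) src
                + offset3d (d->src_x_bytes, d->src_y + y, d->src_z + z,
                            d->src_pitch, d->src_height),
              d->width_bytes);
}
```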
    
    libgomp/ChangeLog:
    
    	* libgomp-plugin.h (GOMP_OFFLOAD_memcpy2d,
    	GOMP_OFFLOAD_memcpy3d): New prototypes.
    	* libgomp.h (struct gomp_device_descr): Add memcpy2d_func
    	and memcpy3d_func.
    	* libgomp.texi (nvptx): Document when cuMemcpy2D/cuMemcpy3D are used.
    	* oacc-host.c (memcpy2d_func, memcpy3d_func): Init with NULL.
    	* plugin/cuda-lib.def (cuMemcpy2D, cuMemcpy2DUnaligned,
    	cuMemcpy3D): Invoke via CUDA_ONE_CALL.
    	* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
    	GOMP_OFFLOAD_memcpy3d): New.
    	* target.c (omp_target_memcpy_rect_worker):
    	(omp_target_memcpy_rect_check, omp_target_memcpy_rect_copy):
    	Permit all device-to-device copies; invoke the new plugin functions
    	for 2D and 3D copying when available.
    	(gomp_load_plugin_for_device): DLSYM the new plugin functions.
    	* testsuite/libgomp.c/target-12.c: Fix dimension bug.
    	* testsuite/libgomp.fortran/target-12.f90: Likewise.
    	* testsuite/libgomp.fortran/target-memcpy-rect-1.f90: New test.
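    The target.c and DLSYM entries describe optional plugin entry points
    that are used when present. A hedged sketch of that dispatch pattern:
    device_descr_sketch and memcpy2d_hook are hypothetical simplifications
    (the real GOMP_OFFLOAD_memcpy2d prototype differs), and the fallback
    copies row by row when the plugin provides no 2D routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified hook signature; not libgomp's prototype.  */
typedef bool (*memcpy2d_hook) (void *dst, size_t dst_pitch,
                               const void *src, size_t src_pitch,
                               size_t width_bytes, size_t height);

struct device_descr_sketch
{
  /* NULL when the plugin does not export the optional entry point,
     e.g. because the symbol lookup found nothing.  */
  memcpy2d_hook memcpy2d_func;
};

/* Prefer the plugin's 2D copy; otherwise fall back to one memcpy per
   row.  */
static bool
copy_rect_2d (const struct device_descr_sketch *dev, void *dst,
              size_t dst_pitch, const void *src, size_t src_pitch,
              size_t width_bytes, size_t height)
{
  assert (width_bytes <= dst_pitch && width_bytes <= src_pitch);
  if (dev->memcpy2d_func != NULL)
    return dev->memcpy2d_func (dst, dst_pitch, src, src_pitch,
                               width_bytes, height);
  for (size_t row = 0; row < height; row++)
    memcpy ((char *) dst + row * dst_pitch,
            (const char *) src + row * src_pitch, width_bytes);
  return true;
}
```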