**Question**

I am working on an N-body problem requiring a large amount of shared memory. Basically, there are N independent tasks, each one using 4 double variables, and a single task is executed by a thread. For the sake of speed, I have been using shared memory for these variables (given that registers are also being used by the threads). Since the number N of tasks is not known at compile time, the shared memory is allocated dynamically.

The dimensions of the grid and the shared memory are computed depending on N and the block size:

```cpp
const size_t BLOCK_SIZE = 512;
const size_t GRID_SIZE = (N % BLOCK_SIZE) ? N / BLOCK_SIZE + 1 : N / BLOCK_SIZE;
const size_t SHARED_MEM_SIZE = BLOCK_SIZE * 4 * sizeof(double);
```

Then the kernel is launched using these three variables:

```cpp
kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...);
```

For small N, this works fine and the kernel executes without error. But if N exceeds 1500, the kernel launch fails, with the following messages appearing multiple times:

```
========= Invalid __global__ write of size 8
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaLaunch.
```

As far as I understand, this is due to an attempt to write out of the bounds of the allocated shared memory. It occurs when, in the kernel, global memory is being copied into shared memory:

```cpp
__global__ void kernel_function(const size_t N, double *pN, ...)
{
    unsigned int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < N) {
        // ... copy from global memory into dynamic shared memory ...
    }
}
```

The failure appears only for N > 1500, hence when the overall amount of shared memory used by the grid exceeds 48 kB (1500 * 4 * sizeof(double) = 1500 * 32 = 48000).

If I have understood correctly how CUDA works, the cumulated amount of shared memory that the grid uses is not limited to 48 kB; that is only the limit of shared memory usable by a single thread block, and this limit is the same regardless of the grid and block sizes. This error makes no sense to me, since the cumulated amount of shared memory should only affect the way the grid is scheduled among the streaming multiprocessors (and moreover the GPU device has 15 SMs at its disposal).

**Answer**

The amount of shared memory you are allocating dynamically here:

```cpp
kernel_function<<<GRID_SIZE, BLOCK_SIZE, SHARED_MEM_SIZE>>>(N, ...);
```

is the amount *per threadblock*, and that amount is limited to 48KB (which is 49152, not 48000). So if you attempt to allocate more than 48KB there, you should get an error if you are checking for it.

However, we can draw two conclusions from the reported failure:

```
========= Invalid __global__ write of size 8
```

It is a *write*, and it is to *global* memory: the failure has to do with out-of-bounds indexing into global memory, not shared memory. (So it cannot be occurring on a read from global memory to populate shared memory, as your conjecture suggests.)

So in general I think your conclusions are incorrect, and you probably need to do more debugging, rather than arriving at conclusions about shared memory.