Найти простаивающий графический процессор на компьютере с несколькими графическими процессорами

Question

Найти простаивающий графический процессор на компьютере с несколькими графическими процессорами

У меня есть следующий код, работающий под Cuda (Windows 10, VS 2015).

//Code for running on one or 2 gpu's
const unsigned __int64 MemOutputSize = (1i64 << 25)*64; //2GB
int deviceCount;
cudaGetDeviceCount(&deviceCount);
unsigned long long* dBuffer[2];
for (int dev = 0; dev < deviceCount; dev++) {
cudaMalloc(&dBuffer[dev], MemOutputSize);
cudaMemset(dBuffer[dev], 0, MemOutputSize);
}
for (int i=0; i < (1024*1024*1024); i++) {
int dev = i % deviceCount;
cudaSetDevice(dev);
runKernel<<<NUM_BLOCK, NUM_THREADS>>>(i, dBuffer[dev]);
}
//Copy data from GPU buffers to main mem
//Merge buffers into one.

Неважно, какой графический процессор выполняет какую часть кода.

Обычно я запускаю код на самой быстрой видеокарте. Но там являются два GPU, так что я могу запустить код на обоих, удвоив мою скорость.
Однако когда я запускаю код, самый медленный GPU работает на 100%, а самый быстрый работает на 67%, что означает, что я получаю ускорение только 2 x 67% = 133%. Я хочу, чтобы оба GPU были все время ближе к 100% использованию.

Я получаю эти проценты от Диспетчер задач (выберите вкладку Performance, compute_0). И нет, FireFox или другие графические процессоры, использующие процы, не работают.

Есть ли способ выбрать в данный момент неиспользуемый графический процессор в цикле for?
Это позволило бы мне выбирать быстрый графический процессор в 60% случаев и медленный в 40% случаев, увеличивая скорость до 158%.

Что касается тега cuda, меня интересуют только карты NVidia (на самом деле, двойной GTX 980).

Полный (рабочий) код здесь:

// System includes
#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include <conio.h>
#include <chrono>
#include <ctime>
#include <iostream>
#include <fstream>
#include <windows.h>// CUDA runtime
#include <cuda_runtime.h>

// helper functions & utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>#define NUM_BLOCKS    2048  //2^11
#define NUM_THREADS   128   //2^7   49-11-7 = 31

const unsigned __int64 MemOutputSize = (1i64 << 25) * 64;  // (__int64)(2 * 1024 * 1024 * 1024);
//const unsigned __int64 MemOutputSize = 0_40000000; //(1 * 1024 * 1024 * 1024);/************************************************************************************************************/
/*************************             Build the lookup table           *************************************/
/*************************                                              *************************************/
/************************************************************************************************************/

__global__ static void SevenToFive(const unsigned __int64 input, void* doutput) {

const unsigned int NRegs = 16;

unsigned __int64 Y[NRegs];
// A cell looks like this:
//   BCD   123
//   AxA   405
//   BCD   678
// we're using half-adder logic to store the 1, 2 and 4's count in 3 bitplanes.
const unsigned __int64 MaskR2 =  0x0303030303030303; //Keep the rightmost 2 rows.
const unsigned __int64 MaskR1 =  0x0101010101010101; //Keep the rightmost row.
const unsigned __int64 MaskL6 = (0xFFFFFFFFFFFFFFFF & (~MaskR2));
const unsigned __int64 MaskL7 = (0xFFFFFFFFFFFFFFFF & (~MaskR1));
//const unsigned __int64 AllOn =   0xFFFFFFFFFFFFFFFF;
const unsigned __int64 Mask5x5 = 0x007C7C7C7C7C0000;  //Both masks use the Q layout, because the input is transformed
const unsigned __int64 Mask3x3 = 0x0000383838000000;  //to Q in the Y[0] register.
//const unsigned __int64 Mask6x4 = 0x00007E7E7E7E0000;  //Both masks use the Q layout, because the input is transformed
//const unsigned __int64 Mask4x2 = 0x0000003C3C000000;  //to Q in the Y[0] register.

Y[14] = threadIdx.x;//*/   Y[14] = 127;  /*debug*/  7 bits
Y[13] = blockIdx.x;//*/  Y[13] = 2047; /*debug*/ 11 bits

//__int64 input2 = -1;
//Put 7x7 input into Y15.
//The top line (least significant) and left most line will be empty (lsb of every line).
//when doing non-overlapping or's always use xor to make any errors stand out.
//3+7+7+7+7 = 31 bits
Y[15] = (input & 7) ^ (((input >> 3) & 127) << 4) ^ (((input >> 10) & 127) << 12) ^ (((input >> 17) & 127) << 20) ^ (((input >> 24) & 127) << 28); //^ (((input >> 31) & 7) << 36);
//Y[15] = (input2 & 7) ^ (((input2 >> 3) & 127) << 4) ^ (((input2 >> 10) & 127) << 12) ^ (((input2 >> 17) & 127) << 20) ^ (((input2 >> 24) & 127) << 28); //^ (((input2 >> 31) & 7) << 36);
Y[15] = Y[15] << (20);
//Y[15] = 0;
//31+7 = 38 bits
Y[15] = Y[15] ^ Y[14]; //threadIdx.x;
//38 + 7 + 4 = 49 bits. This makes a total of 2^49
Y[15] = Y[15] ^ ((Y[13] & 127) << 8) ^ (((Y[13] >> 7) & 15) << 16); //blockIdx.x//Y[15] = 0x070702000 >> 1; //Test with a glider traveling south
//Y[15] = 0x01c0000; //Test with a blinker
//Y[15] = 0x00c0800; //Test with a preblock

//Y[15] = AllOn;//*input;
Y[1] = 0;
Y[4] = 0;
Y[3] = 0;
///*debug*/Y[15] = 0x7F007F087F007F;//#######;0000000,#######,000#000,#######,0000000,####### : 4 lines with a little ward in the middle.
//Y[04] = (Y[15] << 7) & 0x8080808080808080;
//Y[15] = (Y[15] >> 1) & 0x7F7F7F7F7F7F7F7F;
//Y[3] = (Y[3] >> (64 - 16));           //vpsrldq xmm3,xmm3,16-4             //keep the bottom 2 rows of NW & shift them to the top
Y[6] = (Y[1] >> (64 - 8));            //vpsrldq xmm6, xmm1, 16 - 2                //N5           keep the bottom 1 rows of N & shift them to the top.
Y[1] = (Y[1] >> (64 - 16));           //vpsrldq xmm1,xmm1,16-4              //N3           keep the bottom 2 rows of N & shift them to the top.
Y[2] = ((Y[4] >> 6) & MaskR2);        //vpsrlw xmm2,xmm4,14              //W6           keep the 2 rightmost columns of W
//Y[3] = ((Y[3] >> 6) & MaskR2);        //vpsrlw xmm3,xmm3,14              //NW1          keep the 2 rightmost columns of NW
Y[5] = (Y[15] << 16);                 //vpslldq xmm5,xmm15,4              //main3        remove the bottom 2 rows from main
Y[7] = (Y[15] << 8);                  //vpslldq xmm7,xmm15,2              //main5        remove the bottom 1 row from main
/*D3*/Y[14] = (Y[1] ^ Y[5]);              //vpxor xmm14, xmm1, xmm5              //***** ymm14 3 - D    2 rows N +14 rows main
/*A5*/Y[13] = (Y[7] ^ Y[6]);              //vpxor xmm13, xmm7, xmm6        //***** ymm13 5 - A'   1 row N  +15 rows main
//We are now done with N, ymm1 and ymm6
Y[1] = ((Y[2] >> 1) & MaskR1);        //vpsrlw xmm1,xmm2,1                                         //W7           remove an extra column from W
Y[7] = ((Y[15] << 1) & MaskL7);       //                                      //main7        Shift main right
Y[8] = ((Y[13] << 1) & MaskL7);       //                                      //main0+N0     Shift main+N1 right
Y[9] = ((Y[14] << 1) & MaskL7);       //                                      //main2+N2     Shift mainn+N2 right
/*C7*/Y[12] = (Y[7] ^ Y[1]);              //                                         //***** ymm12 7 - C Main7+W7
Y[7] = ((Y[7] << 1) & MaskL7);        //                                      //main6       Shift main right
/*B6*/Y[11] = (Y[7] ^ Y[2]);              //                                         //***** ymm11 6 - B' Main6+W6
Y[10] = (Y[11] << 8);                 //                                        //main4+W4    Shift Main6W6 down
Y[7] = (Y[3] >> 8);                   //                                        //NW4         Shift NW1 up (only one row)
Y[6] = ((Y[6] << 2) & MaskL6);        //                                      //N4          Shift N3 right
Y[10] = (Y[10] ^ Y[7]);               //                                       //main4+W4+NW4
/*A4*/Y[10] = (Y[10] ^ Y[6]);             //                                         //***** ymm10 4 - A
Y[1] = (Y[1] << 8);                   //                                        //W0          Shift W7 down 1 row
Y[7] = ((Y[7] >> 1) & MaskR1);        //                                      //NW0         Shift NW4 left (keep only 1 pixel)
Y[0] = (Y[8] ^ Y[1]);                 //                                       //main0+N0+W0
/*X0*/Y[0] = (Y[0] ^ Y[7]);               //                                        //***** ymm0 0 - x
Y[1] = (Y[2] << 16);                                                          //W1          Shift W down 2 rows
Y[8] = ((Y[9] << 1) & MaskL6);                                              //main1+N1    Shift Main2N2 right 1 column
Y[8] = (Y[8] ^ Y[1]);                                                        //main1+N1+W1 Combine with W
/*B18*/Y[8] = (Y[8] ^ Y[3]);                                                        //**** ymm8 1 - B  Combine with the original NW
Y[7] = ((Y[1] >> 1) & MaskR1);                                              //W2          Shift W1 left 1 column
Y[5] = ((Y[3] >> 1) & MaskR1);                                              //NW2         Shift the original NW left 1 column
Y[1] = (Y[7] ^ Y[5]);                                                        //W2+NW2      combine w2 & NW2
/*C2*/Y[9] = (Y[1] ^ Y[9]);                                                        //**** ymm9 2 - C' main2+N2+W2+NW2
//Count the 1's & 2's
Y[1] = (Y[12] ^ Y[9]);       //1's count of c
Y[2] = (Y[12] & Y[9]);       //2's count of c
Y[3] = (Y[10] ^ Y[13]);       //1's count of a
Y[4] = (Y[10] & Y[13]);       //2's count of a
Y[5] = (Y[8] ^ Y[11]);       //1's count of b
Y[6] = (Y[8] & Y[11]);       //2's count of b
Y[7] = (Y[14] ^ Y[15]);     //1's count of d
Y[8] = (Y[14] & Y[15]);     //2's count of d
//Add the 1's together
Y[10] = (Y[1] & Y[3]);       //2's count of CA
Y[1] = (Y[1] ^ Y[3]);       //combined ones of CA
Y[12] = (Y[5] & Y[7]);       //2's count of BD
Y[5] = (Y[5] ^ Y[7]);        //combined ones of BD
Y[14] = (Y[1] & Y[5]);       //2's count of CABD
Y[1] = (Y[1] ^ Y[5]);        //final count of the 1's
//now we need to add all the 2's together.
Y[3] = (Y[2] & Y[4]);        //4's count of ca
Y[2] = (Y[2] ^ Y[4]);        //2's count of ca
Y[5] = (Y[6] & Y[8]);        //4's count of bd
Y[6] = (Y[6] ^ Y[8]);        //2's count of bd
Y[7] = (Y[10] & Y[12]);      //4's count of CABD
Y[8] = (Y[10] ^ Y[12]);      //2's count of CABD
Y[9] = (Y[2] & Y[6]);        //4's count of cabd
Y[4] = (Y[2] ^ Y[6]);        //2's count of cabd
Y[11] = (Y[8] & Y[14]);      //4's count of CABD+abcd
Y[12] = (Y[8] ^ Y[14]);      //2's count of CABD+abcd
//add all 4's
Y[15] = (Y[3] | Y[5]);        //Saturated add of the 4's
Y[13] = (Y[7] | Y[9]);
Y[14] = (Y[11] | Y[15]);
//add the 2's
Y[2] = (Y[12] ^ Y[4]);
//final add
Y[4] = (Y[14] | Y[13]);
//now we have all the counts.
Y[14] = (Y[0] & Y[2]);       //All 2's stay the same
Y[3] = (Y[2] & Y[1]);       //Y[3] hold's the 3 neighbors; i.e. the new births
Y[14] = (Y[14] | Y[3]);       //The same + births = new pattern
Y[15] = (Y[14] & (~Y[4]));   //but subtract the 4+ neighbors
//Now extract the 5x5 resulting block as well as the 3x3 input block
Y[6] = (Y[15] & Mask5x5);  //get the output 5x5
Y[4] = (Y[0] & Mask3x3);   //and the input 3x3
//Translate the 5x5 block into a linear number.
//Mask5x5 = 0x003E3E3E3E3E0000;  //Both masks use the Q layout, because the input is transformed
//Mask3x3 = 0x00001C1C1C000000;  //to Q in the Y[0] register.
//      ----1------            ------2------                 -------- 3------                          ----- 4------                      ------ 6-----
Y[5] = ((Y[6] & 0x7C0000) >> (10 + 8)) | ((Y[6] & 0x7C000000) >> (10 + 16 - 5)) | ((Y[6] & 0x7C00000000) >> (10 + 24 - 10)) | ((Y[6] & 0x7C0000000000) >> (10 + 32 - 15)) | ((Y[6] & 0x7C000000000000) >> (10 + 40 - 20));
Y[3] =                                   ((Y[4] & 0x38000000) >> (11 + 16))     | ((Y[4] & 0x3800000000) >> (11 + 24 - 3))  | ((Y[4] & 0x380000000000) >> (11 + 32 - 6));
//                        Mask6x4 = 0x00007E7E7E7E0000;  //Both masks use the Q layout, because the input is transformed
//                        Mask4x2 = 0x0000003C3C000000;  //to Q in the Y[0] register.
//Y[5] = ((Y[6] & 0x7E0000) >> (9 + 8)) | ((Y[6] & 0x7E000000) >> (9 + 16 - 6)) | ((Y[6] & 0x7E00000000) >> (9 + 24 - 12)) | ((Y[6] & 0x7E0000000000) >> (9 + 32 - 18));
//Y[3] = ((Y[4] & 0x3C000000) >> (18 + 8)) | ((Y[4] & 0x3C00000000) >> (18 + 16 - 4));
//Y[15] is the output block where the data must be stored.
//Every block = 256 / 8 = 64 bytes =
Y[5] = Y[5] * 64;
//Y[3] is the inner 2x8 block, holding 8 bits, a number from 0 - 255. The upper 3 bits denote the dword to store the data in. The lower 5 bits are a shift mask denoting which bit to flip.
//
Y[4] = (1i64 << (Y[3] & 31i64)); //or mask.
Y[3] = (Y[3] >> 5) * 4; //dword offset  //8*4 = 32 bits
Y[5] = Y[5] | Y[3];
//** remove this line!
//Y[5] = Y[5] & ((MemOutputSize)-1); //mask at 512MB, because we don't have more memory.
//** remove the above line !
Y[5] = Y[5] + (unsigned __int64)doutput;
/*debug*/atomicOr((unsigned int *)Y[5], (unsigned int)Y[4]);
}
void printDevProp(cudaDeviceProp devProp)
{
printf("%s\n", devProp.name);
printf("Major revision number:         %d\n", devProp.major);
printf("Minor revision number:         %d\n", devProp.minor);
printf("Total global memory:           %zu", devProp.totalGlobalMem);
printf(" bytes\n");
printf("Number of multiprocessors:     %d\n", devProp.multiProcessorCount);
printf("Total amount of shared memory per block: %zu\n", devProp.sharedMemPerBlock);
printf("Total registers per block:     %d\n", devProp.regsPerBlock);
printf("Warp size:                     %d\n", devProp.warpSize);
printf("Maximum memory pitch:          %zu\n", devProp.memPitch);
printf("Total amount of constant memory:         %zu\n", devProp.totalConstMem);
return;
}

unsigned long long getTotalSystemMemory()
{
MEMORYSTATUSEX status;
status.dwLength = sizeof(status);
GlobalMemoryStatusEx(&status);
return status.ullTotalPhys;
}#define filename "lookuptable5to3_doublecheckA.bin"
// Start the main CUDA Sample here
int main(int argc, char **argv)
{
printf("CUDA Lookup table 5x5->3x3 dual GPU version\n");
int deviceCount;
cudaGetDeviceCount(&deviceCount);
printf("Device count is %i, ", deviceCount);
printf("Available RAM = %lliGiB\n", (getTotalSystemMemory() >> 30i64));
if (deviceCount > 2) { deviceCount = 2; }
// This will pick the best possible CUDA capable device
int dev1 = findCudaDevice(argc, (const char **)argv);
cudaDeviceProp dp;
checkCudaErrors(cudaGetDeviceProperties(&dp, dev1));
printDevProp(dp);

//float *dinput = NULL;
unsigned long long int* dGPUoutput[2];
for (int dev = 0; dev < deviceCount; dev++) {
dGPUoutput[dev] = NULL;
}
//clock_t *dtimer = NULL;

//clock_t timer[NUM_BLOCKS * 2];
//float input[NUM_THREADS * 2];

//for (int i = 0; i < NUM_THREADS * 2; i++)
//{
//    input[i] = (float)i;
//}

//unsigned __int64 a = 0xFFFFFFFFFFFFFFFF;
//unsigned __int64 b;

//SevenToFive(&a, &b);

//checkCudaErrors(cudaMalloc((void **)&dinput, sizeof(float) * NUM_THREADS * 2));//checkCudaErrors(cudaMemcpy(dinput, input, sizeof(float) * NUM_THREADS * 2, cudaMemcpyHostToDevice));

//timedReduction<<<NUM_BLOCKS, NUM_THREADS, sizeof(float) * 2 *NUM_THREADS>>>(dinput, doutput, dtimer);
char* dest[2];
char* outputdest;
outputdest = (char*)malloc(sizeof(char) * MemOutputSize);
if (outputdest == NULL) { printf("Out of memory"); getch(); exit(EXIT_FAILURE); }
//test write
//printf("test write of data\n");
//std::ofstream outputFile;
//outputFile.open(filename, std::ofstream::out | std::ofstream::trunc | std::ofstream::binary);
///*debug*/outputFile.write(dest, MemOutputSize);
//outputFile.close();
printf("Start computing\n");
//DebugSevenToFive((unsigned long long int)-1, NULL); //just a place to check if needed.
//getch();
for (int dev = 0; dev < deviceCount; dev++) {
cudaSetDevice(dev);
/*debug*/checkCudaErrors(cudaMalloc((void **)&dGPUoutput[dev], sizeof(char) * MemOutputSize));
/*debug*/checkCudaErrors(cudaMemset(dGPUoutput[dev], 0, sizeof(char) * MemOutputSize));
dest[dev] = (char*)malloc(sizeof(char) * MemOutputSize);
if (dest[dev] == NULL) { printf("Out of memory"); getch(); exit(EXIT_FAILURE); }
}
auto t_start = std::chrono::high_resolution_clock::now();
//getch();
//we need to repeat this 2^31 times.  31=5+6+20
try {
for (int k = 0; k < 64; k++) {  //2^6
printf("%i of 64 ", k + 1);
for (int j = 0; j < 32; j++) {  //2^5
printf(".");
for (int q = 0; q < 1024; q++) {
printf("%4i\b\b\b\b", 1023 - q);
for (int i = 0; i < (/*1024 **/ 1024); i++) {  //2^20
//__global__ static void SevenToFiveCount(const unsigned __int64 input, void* doutput)
///*debug*/SevenToFiveCount << <NUM_BLOCKS, NUM_THREADS >> > (i + (q * 1024) + (j * 1024 * 1024) + (k * 1024 * 1024 * 32), doutput);  //256K, need to run this code 2GB times.
int dev = i % deviceCount;
cudaSetDevice(dev);

SevenToFive<<<NUM_BLOCKS, NUM_THREADS>>> (i + (q * 1024) + (j * 1024 * 1024) + (k * 1024 * 1024 * 32), dGPUoutput[dev]);  //256K, need to run this code 2GB times.
}
}
}
printf("-");
for (int dev = 0; dev < deviceCount; dev++) {
cudaSetDevice(dev);
checkCudaErrors(cudaDeviceSynchronize());
/*debug*/checkCudaErrors(cudaMemcpy(dest[dev], dGPUoutput[dev], sizeof(char) * MemOutputSize, cudaMemcpyDeviceToHost));
}
for (__int64 i = 0; i < MemOutputSize * sizeof(char); i++) {
outputdest[i] = dest[0][i];
for (int dev = 1; dev < deviceCount; dev++) {
outputdest[i] |= dest[dev][i];
}
}
std::ofstream outputFile;
outputFile.open(filename, std::ofstream::out | std::ofstream::trunc | std::ofstream::binary);
outputFile.write(outputdest, MemOutputSize);
outputFile.close();
printf("W  \n");
}
cudaDeviceSynchronize();
printf("\nDone computing\n");auto t_end = std::chrono::high_resolution_clock::now();
std::cout << "Millisecs used " << std::chrono::duration<double, std::milli>(t_end - t_start).count() << "ms\n";

int check = 0;
for (int i = 0; i < (1024 * 1024 * 1024 / 4); i++) {
check |= outputdest[i];
}
if (check == 0) { printf("Error: nothing happened"); getch(); for (;;) {} }
else { printf("all ok"); getch(); }

//checkCudaErrors(cudaFree(dinput));
for (int dev = 0; dev < deviceCount; dev++) {
cudaFree(dGPUoutput[dev]);
}

free(outputdest);
outputdest = NULL;
for (int dev = 0; dev < deviceCount; dev++) {
free(dest[dev]);
dest[dev] = NULL;
}getch();
return EXIT_SUCCESS;
}

catch (const std::exception& e) //catch all exceptions
{
printf("Oops, an error happened. Here are the details:\n");
std::cout << e.what() << std::endl;
printf("\nPress any key\n");
getch();
}
return EXIT_FAILURE;
}

-1

c++cuda multi-gpu

Решение

Другие решения

Других решений пока нет …

Источник

Accepted Answer

(это не вписывается в комментарий, извините)

Подготовьте решатель, укажите в качестве переменных мгновенную или среднюю производительность всех графических процессоров за последние несколько секунд. Минимизируйте общее время на N ядер или максимизируйте 1 ядро / среднее значение времени.

Если ядра идентичны, должно быть только 1 минимальное значение, например, 90 ядер для gtx 1080 и 10 ядер для gtx1050 в секунду. Если есть много разных ядер, то они могут стать другими переменными для решателя.

Решатель включает в себя математику, которую вы можете сделать, возможно. Я добавлю только неразрешенную версию:

дать им свои собственные очереди ядер
пусть они получают ядра по одному (заканчивает, получает новое)
проверьте размер этих очередей.

Пустая очередь = 1,0 производительность.
Полная очередь = 0.0 производительность

продолжить оптимизацию, «производя» ядро до 1 очереди за раз (может быть, фиксированное время или быстро, пока одна из очередей не будет заполнена), с вероятностью, равной производительности, вероятно, нормализованной версией, которая составляет 0,1 0,1, равную 0,5 0,5. Установите максимальный размер каждой очереди от 50 до 100 для простой точности выполнения.

В CUDA вы можете назначить несколько «потоков», чтобы вы могли сделать это с несколькими очередями на один GPU, асинхронно друг к другу, чтобы еще больше увеличить использование GPU. Важно проверить, чтобы значения производительности очередей для каждого графического процессора соответствовали его реальной производительности.

Поскольку вы запускаете миллионы ядер, у вас могут быть очереди с тысячами ядер для лучшего измерения производительности. Всего 5 максимальных размеров очереди могут быть не очень хорошими, чтобы разделить несколько ядер.

Может быть, вы можете сделать максимальный размер очереди, адаптируясь к производительности подключенного графического процессора. Начиная всего с 5 максимальных размеров, увеличиваясь до тысяч на самых быстрых графических процессорах, так что производительность можно измерить как 0,5444533 вместо 0,5.

Когда последнее ядро ставится в очередь, все оставшееся количество элементов в очередях должно быть равно или близко к их собственным характеристикам графического процессора и должно завершиться в одно и то же время.

(Непроверенные)

-1