DEV Community

Viveknanda

OpenCL performance on the RTX 4090, measured with clpeak

Platform: NVIDIA CUDA
  Device: NVIDIA GeForce RTX 4090
    Driver version  : 550.127.05 (Linux x64)
    Compute units   : 128
    Clock frequency : 2520 MHz

    Global memory bandwidth (GBPS)
      float   : 873.20
      float2  : 901.24
      float4  : 917.89
      float8  : 928.70
      float16 : 938.94

    Single-precision compute (GFLOPS)
      float   : 84761.26
      float2  : 80760.14
      float4  : 80512.55
      float8  : 79900.18
      float16 : 79513.42

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 1398.84
      double2  : 1397.85
      double4  : 1394.48
      double8  : 1387.83
      double16 : 1374.64

    Integer compute (GIOPS)
      int   : 44124.49
      int2  : 44080.14
      int4  : 43970.14
      int8  : 44089.10
      int16 : 44104.19

    Integer compute Fast 24bit (GIOPS)
      int   : 44067.89
      int2  : 44081.56
      int4  : 44038.71
      int8  : 43851.83
      int16 : 43369.82

    Integer char (8bit) compute (GIOPS)
      char   : 38655.31
      char2  : 38334.73
      char4  : 37103.88
      char8  : 30839.88
      char16 : 28388.27

    Integer short (16bit) compute (GIOPS)
      short   : 36869.31
      short2  : 35287.81
      short4  : 36894.71
      short8  : 32896.40
      short16 : 28145.07

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 10.68
      enqueueReadBuffer               : 15.51
      enqueueWriteBuffer non-blocking : 10.08
      enqueueReadBuffer non-blocking  : 13.46
      enqueueMapBuffer(for read)      : 19.79
        memcpy from mapped ptr        : 11.54
      enqueueUnmap(after write)       : 25.13
        memcpy to mapped ptr          : 11.41

    Kernel launch latency : 4.06 us
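
To put the single-precision figure in context, here is a rough sanity check against a theoretical peak derived from the compute unit count and clock that clpeak reports. It is a minimal sketch, not part of clpeak: the 128 FP32 lanes per compute unit and the convention of counting one fused multiply-add as 2 FLOPs are assumptions brought in from the published RTX 4090 specifications, and the function and variable names are made up for illustration.

```python
# Figures copied from the clpeak output above.
COMPUTE_UNITS = 128
CLOCK_MHZ = 2520
MEASURED_FP32_GFLOPS = 84761.26  # "float" row, single-precision compute
MEASURED_FP64_GFLOPS = 1398.84   # "double" row, double-precision compute

# Assumptions (not reported by clpeak): 128 FP32 lanes per compute unit,
# and one fused multiply-add counted as 2 floating-point operations.
FP32_LANES_PER_CU = 128
FLOPS_PER_FMA = 2


def theoretical_fp32_gflops(compute_units, clock_mhz, lanes=FP32_LANES_PER_CU):
    """Peak GFLOPS = units * lanes * FLOPs per FMA * clock in GHz."""
    return compute_units * lanes * FLOPS_PER_FMA * clock_mhz / 1000.0


peak = theoretical_fp32_gflops(COMPUTE_UNITS, CLOCK_MHZ)
ratio = MEASURED_FP32_GFLOPS / peak
fp32_to_fp64 = MEASURED_FP32_GFLOPS / MEASURED_FP64_GFLOPS

print(f"theoretical FP32 peak at reported clock: {peak:.2f} GFLOPS")
print(f"measured / theoretical: {ratio:.3f}")
print(f"measured FP32 : FP64 ratio: {fp32_to_fp64:.1f}")
```

Under these assumptions the measured float result comes out slightly above the peak computed from the reported 2520 MHz. One plausible reading is that the card was running above the reported clock during the run, but the output alone does not confirm that.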