System Timeline
GPU Compute Utilization: 87%
Unified Memory Usage: 94.2 / 128 GB
Memory Bandwidth (LPDDR5x): 218 GB/s
Inference Throughput: 308 tok/s
GPU Processes
5 active

| Process | GPU % | Memory | Tok/s |
|---|---|---|---|
| sglang.server (PID 4821 · INTELLECT-3-MoE) | 72% | 52.1 GB | 308 |
| python: train.py (PID 5102 · llama-finetune run-4) | 12% | 38.4 GB | — |
| cudf.pandas (PID 5340 · metric query) | 3% | 2.8 GB | — |
| jupyter-lab (PID 3201 · idle) | 0% | 0.4 GB | — |
| DGX Dashboard (PID 1820 · system) | 0% | 0.1 GB | — |

CUDA Kernel Timeline
sglang · INTELLECT-3
SM Compute: mha_fwd, gemm, silu, mha_fwd, gemm, rmsnorm, mha_fwd, gemm
Tensor Core: fp4_mma (6 segments)
Memory: D2D, KV$, D2D, KV$, D2D, KV$, D2D, KV$
Copy Engine: H2D, D2H
Decode: sample (5 segments)
Legend: Compute, Attention, MemCopy, NCCL, Idle