90% Tensor Core Efficiency on an H100 Cluster

Authors: Ziyan Ishani, Veer Guda, Arav Shah

Designations: Executive Director of Research, Director of Projects, Director of Marketing

Date: March-April 2026

Collaboration with: Voltage Park

Combining Student Research and Industry Infrastructure

Over six weeks, the AI @ Georgia Tech research team conducted a comprehensive benchmark of Voltage Park’s on-demand GPU infrastructure. This project was an incredible opportunity for our research division, letting us combine student-led research with real-world enterprise goals. We tested a bare-metal cluster of four nodes, each equipped with 8x NVIDIA H100 80GB HBM3 SXM5 GPUs. The nodes were connected via 3.2 Tbps InfiniBand at 400 Gbps per port. This gave us the perfect sandbox to push performance boundaries and gain valuable benchmarking experience.

The Need for an Independent Baseline

Most of the time, on-demand cloud platforms lock their firmware stacks to make sure all users get a uniform experience. Because of this, cloud environments often prioritize deployment consistency and stability over raw performance tuning. We had a simple goal: establish a clean baseline and see if a stock, unconfigured deployment could still hit top-tier efficiency metrics without any custom BIOS tuning or firmware modifications.

To properly test the system, we ran reproducible, industry-standard tests. Our testing suite covered:

HPC compute workloads like HPCG, HPL (FP64), and HPL-MxP (FP16/FP8) using official NVIDIA containers.
Memory bandwidth testing, including CPU STREAM and NVBench.
Collective communications, evaluating NCCL AllReduce, AllGather, and Broadcast.
Machine learning training and inference using MLPerf BERT-Large and MLPerf BERT-99.

Pushing Performance Boundaries

Our results were outstanding, showing that top-tier performance is accessible out of the box, without complex, proprietary configurations. The most important metric came from our Mixed Precision Tensor Core Benchmark (HPL-MxP):

With the HPL-MxP benchmark, we clocked an 89.9% FP8 tensor core efficiency.
The hardware achieved a 3,554.6 TFLOPS LU peak, telling us that the H100 tensor cores were operating close to their rated capacity for AI training precision.

In addition, the cluster performed strongly on memory-bandwidth-bound workloads.

During HPCG testing, the GPU was an incredible 255x faster than the CPU.
We recorded a device-to-device HBM3 bandwidth of 2,903 GB/s, or 86.7% of the theoretical peak.
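The bandwidth efficiency above is simply the measured figure divided by the theoretical peak. The sketch below reproduces that arithmetic. The peak value is an assumption on our part for illustration: the post does not state it, and we use the 3,350 GB/s commonly quoted for the H100 SXM5 80GB. The function and constant names are hypothetical and do not come from the benchmark repository.

```python
# Assumption: theoretical HBM3 peak for one H100 SXM5 80GB, in GB/s.
# This number is NOT stated in the post; it is the commonly quoted spec value.
ASSUMED_HBM3_PEAK_GBPS = 3350.0

# Measured device-to-device bandwidth reported above, in GB/s.
MEASURED_D2D_GBPS = 2903.0


def efficiency_pct(measured: float, peak: float) -> float:
    """Return measured throughput as a percentage of the theoretical peak."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * measured / peak


if __name__ == "__main__":
    pct = efficiency_pct(MEASURED_D2D_GBPS, ASSUMED_HBM3_PEAK_GBPS)
    print(f"HBM3 bandwidth efficiency: {pct:.1f}%")  # prints 86.7%
```

The same helper applies to any measured-versus-peak comparison, as long as both values use the same units.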

Real-World Impact: Training and Inference

Beyond raw benchmarks, we wanted to ensure that the cluster was production-ready for the demanding tasks that our AI@GT project teams tackle daily.

When we ran MLPerf BERT-Large training across all 32 GPUs, we saw an average throughput of 4,937 sequences/second and a time to convergence of 154.86 minutes. This represented a 79.4% multi-node scaling efficiency across the four nodes. To test inference, we ran the MLPerf BERT-99 Offline scenario on a single 8x H100 node, logging over 3,330 queries per second (QPS) with an accuracy (F1) of 90.9. Scaled linearly to four nodes, this projects to over 13,000 QPS across the entire cluster, suggesting that the infrastructure is highly viable for real-time NLP services and large-scale model pretraining.
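The projection and the scaling figure in this section can be worked through in a few lines. This is a minimal sketch under stated assumptions: we assume scaling efficiency is defined as cluster throughput divided by (node count times single-node throughput), which the post does not spell out, and the single-node training baseline it derives is implied by that definition rather than a reported measurement. All function names are hypothetical.

```python
NODES = 4
SINGLE_NODE_QPS = 3330.0          # reported BERT-99 Offline result, one node
CLUSTER_TRAIN_SEQ_PER_S = 4937.0  # reported BERT-Large training throughput
REPORTED_SCALING_PCT = 79.4       # reported multi-node scaling efficiency


def project_linear(per_node: float, nodes: int) -> float:
    """Ideal-case projection: every node contributes the same throughput."""
    return per_node * nodes


def scaling_efficiency_pct(cluster: float, single_node: float, nodes: int) -> float:
    """Assumed definition: actual cluster throughput vs. ideal linear scaling."""
    return 100.0 * cluster / (single_node * nodes)


def implied_single_node(cluster: float, efficiency_pct: float, nodes: int) -> float:
    """Invert the assumed definition to recover the single-node baseline."""
    return cluster / (nodes * efficiency_pct / 100.0)


if __name__ == "__main__":
    # Linear inference projection for the whole cluster.
    print(project_linear(SINGLE_NODE_QPS, NODES))  # prints 13320.0

    # Single-node training throughput implied by the assumed definition.
    baseline = implied_single_node(CLUSTER_TRAIN_SEQ_PER_S, REPORTED_SCALING_PCT, NODES)
    print(f"implied single-node baseline: {baseline:.0f} sequences/second")
```

Offline inference batches are independent across nodes, so a linear projection is a reasonable first estimate there, while training pays communication costs that show up as the gap between the ideal and the reported 79.4%.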

Powered By Voltage Park

This entire project wouldn’t have been possible without Voltage Park granting us access to their high-performance H100 deployment. In validating that nearly 90% tensor core efficiency is achievable on a standard cloud setup, we saw that high-performance AI is more accessible than ever before.

We believe that good research is reproducible, transparent, and open to all. We have made all our test scripts, configuration files, and raw output logs open source. You can examine the exact methodologies we used and review the raw data for yourself in our GitHub repository at https://github.com/aigatech/vpbenchmarking, or read our full report at AI GT x Voltage Park Report.