Tensor Cores

Tensor Cores are mixed-precision matrix multiply-accumulate accelerators developed by NVIDIA^[1]. They first shipped in 2017 in the data center GPU Tesla V100 (Volta generation). In the GeForce and Quadro lines, they first appeared in the RTX series (Turing generation).

Overview

GPUs are processing units dedicated to graphics with high parallel computing capability. That parallelism attracted attention, and GPUs are now used as parallel computers in many fields, including HPC (GPGPU). As their uses broadened, it became clear that some workloads, deep learning among them, need high arithmetic throughput more than high numerical precision. Tensor Cores are the matrix multiply-accumulate accelerator that NVIDIA, maker of the NVIDIA GeForce series and many other GPUs, developed to meet this need^[1]^[2].

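Tensor Cores are specialized for mixed-precision fused multiply-add (FMA) on matrices: a single instruction performs a low-precision matrix product plus an addition into a high-precision accumulator. For example, when multiply-accumulating FP32 matrices $A, B, C$, the matrix product $BC$ is computed in FP16 and the result is accumulated into $A$ in FP32, giving $A + BC$.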
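The mixed-precision FMA described above can be emulated in software. The following is a minimal sketch in pure Python, not a model of the actual hardware: the helper names (`to_fp16`, `to_fp32`, `mixed_precision_fma`) are hypothetical, matrices are plain nested lists, and rounding the accumulator to FP32 after every addition is an assumption, since the internal summation order and rounding of a real Tensor Core are not described here.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_fma(a, b, c):
    """Return A + B @ C with FP16 multiply inputs and an FP32 accumulator."""
    n, k, m = len(b), len(c), len(c[0])
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # The accumulator starts from the FP32 matrix A.
            acc = to_fp32(a[i][j])
            for p in range(k):
                # Multiply inputs are rounded to FP16 (low precision).
                prod = to_fp16(b[i][p]) * to_fp16(c[p][j])
                # Assumption: round to FP32 after each accumulation step.
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d
```

Calling `mixed_precision_fma(a, b, c)` with small matrices and comparing against an ordinary double-precision `A + BC` shows where the FP16 rounding of the inputs enters the result. Inputs outside the FP16 range raise `OverflowError` in `struct.pack`.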

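Because dedicated circuitry (the Tensor Core) carries this out, a single instruction handles a large number of operations. The matrix product is low precision, so its computational cost is small, while the sum is high precision and fused, so the accumulation adds little rounding error of its own. For example, on the NVIDIA A100 GPU plain FP32 peaks at 19.5 TFLOPS, whereas FP16 low-precision products on Tensor Cores reach 312 TFLOPS, that is, 16 times the throughput. In this way Tensor Cores serve as an accelerator for mixed-precision matrix FMA.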

Support

Tensor Cores come in generations, which differ in speed and in the precisions they support.

Table. Tensor Core generations and supported precisions
	Multiply precision								Accumulate precision
Generation (architecture)	FP64	TF32	FP16	BF16	FP8	INT8	INT4	INT1	FP64	FP32	FP16	INT32
1 (Volta) ^[3]	-	-	✔	-	-	-	-	-	-	✔	✔	-
2 (Turing) ^[4]	-	-	✔	-	-	✔	✔	✔	-	✔	✔	✔
3 (Ampere) ^[5]	✔	✔	✔	✔	-	✔	✔	✔	✔	✔	✔	✔
4 (Hopper) ^[6]^[7]	✔	✔	✔	✔	✔	✔	-	-	✔	✔	✔	✔

TensorFloat-32

TensorFloat-32 (TF32) is one of the mixed-precision FMA modes of Tensor Cores^[8]^[9].

Tensor Cores accelerate low-precision matrix products and cannot speed up an FP32 FMA as is. In TF32 mode, FP32 inputs are cast internally to 19 bits, their matrix product is computed at high speed on the Tensor Cores, and the result is finally added to an FP32 accumulator. In other words, TensorFloat-32 is a fast FMA mode for FP32 data that uses reduced precision internally^[10].
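The cast-multiply-accumulate sequence of TF32 mode can be illustrated with a small pure-Python sketch. The function names (`to_tf32`, `tf32_dot`) are hypothetical, and the cast simply truncates the low 13 mantissa bits of the FP32 encoding; this is a simplifying assumption, and real hardware may round differently. Inputs are assumed to be finite.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Cast to FP32, then keep the sign, 8 exponent bits and top 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Assumption: truncate (round toward zero) by clearing the low 13 bits.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def tf32_dot(xs, ys, acc=0.0):
    """Dot product with TF32 multiply inputs and an FP32 accumulator."""
    acc = to_fp32(acc)
    for x, y in zip(xs, ys):
        # Inputs are reduced to 19 bits just before the multiplication.
        prod = to_tf32(x) * to_tf32(y)
        # The running sum stays in FP32.
        acc = to_fp32(acc + prod)
    return acc
```

Because only the multiply inputs pass through `to_tf32`, the caller supplies and receives ordinary FP32-range values, which mirrors the point that storage and the accumulator remain FP32.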

Low-precision matrix products inevitably incur some computational error. The 16-bit formats used previously (FP16 and BF16) keep the harm to a minimum through mixed-precision computation, and the automatic mixed precision (AMP) features of frameworks made them usable with minimal code changes. In some cases, however, the error significantly affected inference accuracy, and code changes, however small, were still required.

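TF32 uses a 19-bit representation made up of 1 sign bit, 8 exponent bits (the same as BF16), and 10 mantissa bits (the same as FP16), which reduces the precision loss further compared with 16-bit formats. Because the cast happens inside the Tensor Core and the result is accumulated in FP32, from the outside it looks as if an FP32 FMA were executed unchanged, so no code changes are needed. TF32 is therefore a mode that yields several times the arithmetic throughput with minimal precision loss and no code changes. Since only the precision inside the core changes, it does not reduce data volume (for example, GPU memory use). The speedup exists only because Tensor Cores can process the 19-bit representation quickly; TF32 is just one of the specialized functions or modes of Tensor Cores^[9].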
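The bit layout can be compared with the neighbouring formats in a few lines of Python. This is an illustrative sketch: the `FloatFormat` class and `FORMATS` list are hypothetical names, the TF32 widths come from the text above, and the FP32, FP16 and BF16 widths are the standard ones for those formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exponent_bits: int
    mantissa_bits: int  # explicitly stored fraction bits

    @property
    def total_bits(self) -> int:
        # One sign bit plus exponent and mantissa fields.
        return 1 + self.exponent_bits + self.mantissa_bits

    @property
    def epsilon(self) -> float:
        """Spacing between 1.0 and the next representable value."""
        return 2.0 ** -self.mantissa_bits


FORMATS = [
    FloatFormat("FP32", 8, 23),
    FloatFormat("TF32", 8, 10),
    FloatFormat("FP16", 5, 10),
    FloatFormat("BF16", 8, 7),
]

for fmt in FORMATS:
    print(f"{fmt.name}: {fmt.total_bits} bits, "
          f"exponent {fmt.exponent_bits}, mantissa {fmt.mantissa_bits}, "
          f"epsilon {fmt.epsilon:.3e}")
```

The listing makes the trade-off visible: TF32 shares its exponent width, and hence its dynamic range, with FP32 and BF16, and shares its mantissa width, and hence its relative precision, with FP16.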

Footnotes

[1] "Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy." NVIDIA. NVIDIA Tensor Cores. Retrieved 2023-05-16.
[2] "Tensor Cores are specialized high-performance compute cores for matrix math operations that provide groundbreaking performance for AI and HPC applications." NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[3] "FP16 precision introduced on the Volta Tensor Core ... Volta GPU ... Results are accumulated into FP32 for mixed precision training or FP16 for inference." NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[4] "the INT8, INT4 and binary 1-bit precisions added in the Turing Tensor Core" NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[5] "In addition to FP16 precision ... INT8, INT4 and binary 1-bit precisions ... the A100 Tensor Core adds support for TF32, BF16 and FP64 formats" NVIDIA. NVIDIA A100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[6] "NVIDIA H100 GPU ... New fourth-generation Tensor Cores" NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[7] "FP8, FP16, BF16, TF32, FP64, and INT8 MMA data types are supported." NVIDIA. NVIDIA H100 Tensor Core GPU Architecture. Retrieved 2023-05-16.
[8] "TF32 is a new compute mode added to Tensor Cores in the Ampere generation of GPU architecture." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.
[9] "TF32 is only exposed as a Tensor Core operation mode, not a type." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.
[10] "All storage in memory and other operations remain completely in FP32, only convolutions and matrix-multiplications convert their inputs to TF32 right before multiplication." NVIDIA Developer. Accelerating AI Training with NVIDIA TF32 Tensor Cores. Retrieved 2023-06-18.


