[TODO] Develop the memory_reserved operator and optimize performance #122

ccssu · 2023-03-27T02:22:29Z

- Use profiling tools to optimize performance (targets: the GLM project at https://github.com/Oneflow-Inc/libai/tree/main/projects/GLM and the one-yolov5 classification model)
- Learn how oneflow manages memory

Getting started with the profiling tools

one-yolov5 project

Project URL: https://github.com/Oneflow-Inc/one-yolov5
Dataset path: @oneflow-25:/data/home/fengwen/imagenette160
Weights path: @oneflow-25:/data/home/fengwen/weight_v1_2_0

If running nsys fails with the following error:

The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.

To fix this, comment out the check_git_status() line in train.py.

GLM project

Project URL: https://github.com/Oneflow-Inc/libai/tree/main/projects/GLM
Weights paths:
@oneflow-25:/data/home/xiezipeng/glm-10b-chinese
@oneflow-25:/data/home/xiezipeng/glm-10b

ccssu · 2023-03-27T02:23:21Z

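Getting started with nsys using NVTX annotations

NVTX (NVIDIA Tools Extension) lets developers annotate their code with custom markers and ranges, which can then be visualized in profilers such as NVIDIA Nsight Systems (nsys). These annotations help developers understand the performance characteristics of their code and identify areas to optimize.

NVTX tutorial: https://nvtx.readthedocs.io/en/latest/index.html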
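
As a complement to the decorator and context-manager form, the nvtx Python package also exposes explicit push_range / pop_range calls. The sketch below wraps them in a small helper that degrades to a no-op when nvtx is not installed, so an annotated script still runs on machines without the package. The names profiled_range and train_step are hypothetical and only serve as illustration.

```python
import contextlib

try:
    import nvtx  # pip install nvtx
except ImportError:  # fall back to no-ops so the script still runs
    nvtx = None


@contextlib.contextmanager
def profiled_range(message, color="blue"):
    """Open an NVTX range named `message`; do nothing if nvtx is missing."""
    if nvtx is None:
        yield
        return
    nvtx.push_range(message, color=color)
    try:
        yield
    finally:
        # Always close the range, even if the wrapped code raises.
        nvtx.pop_range()


def train_step(batch):
    # Hypothetical stand-in for a real training step.
    with profiled_range("forward", color="green"):
        total = sum(batch)
    with profiled_range("postprocess", color="red"):
        return total / len(batch)


print(train_step([1, 2, 3]))
```

Nested ranges show up as stacked bars in the nsys timeline, so wrapping coarse phases first and refining later tends to keep the trace readable.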

Python Demo

import numpy as np
import cupy as cp
import nvtx

@nvtx.annotate("fft function", color="blue")
def fast_fft(input_array):
    with nvtx.annotate("Copy input array to GPU and CuPy", color="red"):
        gpu_array = cp.array(input_array)
    with nvtx.annotate("GPU FFT operation", color="yellow"):
        result = cp.fft.fft(gpu_array)
    with nvtx.annotate("Copy back to CPU and Numpy", color="green"):
        cpu_result = cp.asnumpy(result)
    return cpu_result

for i in range(5):
    print(fast_fft(np.random.random(10)))

Launch command:

nsys profile python3 demo.py
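
For repeated profiling runs it can help to build the nsys command line in one place. The following is a minimal sketch, not part of the original issue: build_nsys_command and run_profile are hypothetical helpers, and the chosen flags (--trace, --output, --force-overwrite) are one reasonable selection of nsys profile options rather than a required set.

```python
import shlex
import shutil
import subprocess


def build_nsys_command(script, output="report1", traces=("cuda", "nvtx")):
    """Return the argument list for profiling `script` with nsys."""
    return [
        "nsys", "profile",
        "--trace=" + ",".join(traces),  # collect CUDA API calls and NVTX ranges
        "--output=" + output,           # base name of the generated report
        "--force-overwrite=true",       # replace an existing report of that name
        "python3", script,
    ]


def run_profile(script):
    """Run the profiler only when nsys is actually on PATH."""
    if shutil.which("nsys") is None:
        raise RuntimeError("nsys not found on PATH")
    return subprocess.run(build_nsys_command(script), check=False)


# Print the command so it can be inspected or copied into a shell.
print(shlex.join(build_nsys_command("demo.py")))
```

Restricting --trace to cuda and nvtx keeps the report small; other trace sources can be added to the tuple when they are needed.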

nsys report file for the screenshot above: report1.zip

C++ Demo

#include <cuda_runtime.h>
#include "nvToolsExt.h"
#include <iostream>

// CUDA kernel for the vector add
__global__ void vectorAdd(const float *A, float *C, int N) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if(i < N) {
        C[i] = A[i] + 1.0f;
    }
}

// Launch the CUDA kernel
void launch_kernel(const float *A, float *C, int N) {
    nvtxRangePushA("_FUNCTION_"); // open the "_FUNCTION_" NVTX range
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    for(int i = 0; i < 4; i++) {
        nvtxRangePushA("vectorAdd"); // open the "vectorAdd" NVTX range
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(A, C, N);
        nvtxRangePop(); // close the "vectorAdd" range
    }
    nvtxRangePop(); // close the "_FUNCTION_" range
}

int main() {
    const int N = 100;
    float *A, *C;
    cudaMallocManaged(&A, N * sizeof(float));
    cudaMallocManaged(&C, N * sizeof(float));
    for(int i = 0; i < N; i++) {
        A[i] = static_cast<float>(i);
        C[i] = 0.0f;
    }
    std::cout << "Launching kernel..." << std::endl;
    launch_kernel(A, C, N);
    cudaFree(A);
    cudaFree(C);
    return 0;
}
// End of program


hhhfccz · 2023-03-27T05:43:45Z

For the memory_reserved operator, the current plan is not to call the CUDA API directly. That requires changes to the BinAllocator part of oneflow. I have bundled this work with lazy_init and will open a PR this week.
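
To illustrate why the value can come from the allocator rather than from a CUDA API call, here is a toy model of a caching allocator. It is only a sketch under stated assumptions: it assumes memory_reserved is meant to report the bytes the allocator currently holds from the device (similar in spirit to torch.cuda.memory_reserved), it ignores fragmentation, and ToyCachingAllocator is a hypothetical class, not oneflow's BinAllocator.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator; NOT oneflow's BinAllocator."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size  # granularity of requests to the "device"
        self.reserved = 0             # bytes obtained from the device so far
        self.allocated = 0            # bytes currently handed out to tensors

    def allocate(self, nbytes):
        # Grow the reserved pool in whole chunks only when cached space runs out.
        free = self.reserved - self.allocated
        if nbytes > free:
            missing = nbytes - free
            chunks = -(-missing // self.chunk_size)  # ceiling division
            self.reserved += chunks * self.chunk_size
        self.allocated += nbytes
        return nbytes

    def free(self, nbytes):
        # Freed bytes go back to the cache, so `reserved` does not shrink.
        self.allocated -= nbytes

    def memory_reserved(self):
        # Answered from the allocator's own counters, no driver query needed.
        return self.reserved


alloc = ToyCachingAllocator()
alloc.allocate(1500)   # reserves two 1024-byte chunks
alloc.free(1500)       # memory returns to the cache
alloc.allocate(100)    # served from the cache, nothing new is reserved
print(alloc.memory_reserved(), alloc.allocated)  # prints "2048 100"
```

Because the allocator already tracks every chunk it requests, reading its counter reflects the framework's own view of memory, whereas a device-wide query would also include memory used by other processes.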

ccssu added the Guide label Mar 27, 2023
