I want to create a new pytorch base image in production-images so that I can use the latest Huggingface server, which lists torch 2.3.0 as a requirement. This will also allow us to use the latest ROCm version, as there is a build of torch 2.3.0 for ROCm 6.0 at https://download.pytorch.org/whl/rocm6.0/torch/
Description
Details
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T362670 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU
Resolved | | isarantopoulos | T365166 Update Pytorch base image to 2.3.0
Event Timeline
Change #1032725 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/docker-images/production-images@master] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0)
Change #1032777 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0
Unfortunately, the pytorch package seems to get bigger with each release. The same goes for ROCm.
Pytorch version | ROCm version | Raw image size (GB) | Compressed image size (GB)
---|---|---|---
2.1.2 | 5.7 | 10.2 | 3.28
2.3.0 | 5.7 | 13.9 | 4.29
2.3.0 | 6.0 | 15.9 | 4.86
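To put the growth in the table above into relative terms, here is a minimal Python sketch that compares each image against the 2.1.2 / ROCm 5.7 baseline. The sizes are copied from the table; the names `SIZES_GB`, `growth_percent` and `growth_table` are made up for this illustration.

```python
# Sizes in GB copied from the table above:
# (pytorch version, ROCm version) -> (raw size, compressed size)
SIZES_GB = {
    ("2.1.2", "5.7"): (10.2, 3.28),
    ("2.3.0", "5.7"): (13.9, 4.29),
    ("2.3.0", "6.0"): (15.9, 4.86),
}
BASELINE = ("2.1.2", "5.7")


def growth_percent(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def growth_table(sizes=SIZES_GB, baseline=BASELINE):
    """Return {(torch, rocm): (raw growth %, compressed growth %)} vs. the baseline."""
    base_raw, base_comp = sizes[baseline]
    return {
        key: (
            round(growth_percent(raw, base_raw), 1),
            round(growth_percent(comp, base_comp), 1),
        )
        for key, (raw, comp) in sizes.items()
    }


if __name__ == "__main__":
    for (torch_v, rocm_v), (raw_pct, comp_pct) in growth_table().items():
        print(f"torch {torch_v} / ROCm {rocm_v}: raw {raw_pct:+.1f}%, compressed {comp_pct:+.1f}%")
```

With these numbers the 2.3.0 / ROCm 6.0 image comes out roughly 56% larger raw and 48% larger compressed than the 2.1.2 / ROCm 5.7 one.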
We need a pytorch version of at least 2.3.0 for huggingfaceserver, which leaves us with ROCm 5.7 and 6.0 as the only options for now (from the pre-built pytorch ROCm binaries available).
Images seem to become more bloated, so I am exploring the option of installing pytorch-rocm with the --no-dependencies option and handling dependencies manually, either in the production images repo or on the inference services side. It is a long shot, but I think it is worth trying, at least to rule it out if it can't be done.
Whether this approach is feasible or not will depend on:
- the need to include all pytorch dependencies: perhaps some of the dependencies in the list are not needed.
- the upgrade process: whether upgrading the requirements manually is too much of a burden in terms of complexity.
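One way to judge the first point is to look at what the installed torch distribution actually declares as requirements. The sketch below is only an illustration, not what the image build does: it reads the requirements from the installed package metadata with the standard library `importlib.metadata`, and splits them into unconditional ones and ones that only apply to an optional extra. The helper names and the sample requirement strings in the usage part are hypothetical.

```python
from importlib import metadata


def declared_requirements(dist_name: str) -> list[str]:
    """Requirement strings declared by an installed distribution ([] if absent)."""
    try:
        return list(metadata.requires(dist_name) or [])
    except metadata.PackageNotFoundError:
        return []


def split_optional(requirements):
    """Separate unconditional requirements from those gated on an extra."""
    core, optional = [], []
    for req in requirements:
        # Everything after ';' is the environment marker, if any.
        marker = req.partition(";")[2]
        (optional if "extra ==" in marker else core).append(req)
    return core, optional


if __name__ == "__main__":
    # Hypothetical sample input; on a real image use declared_requirements("torch").
    sample = ["filelock", "sympy", 'opt-einsum>=3.3; extra == "opt-einsum"']
    core, optional = split_optional(declared_requirements("torch") or sample)
    print("install manually:", core)
    print("candidates to skip:", optional)
```

Requirements gated on an extra are never installed by default, so only the unconditional list would need to be maintained by hand if the package were installed with --no-dependencies.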
As it turns out the above approach won't cut it. Even without the dependencies the compressed image with pytorch 2.3.0 and rocm 6.0 is 4.36GB.
This is the list of packages under /opt/lib/site-packages
functorch torch torch-2.3.0+rocm6.0.dist-info torchgen
It also seems that torch-ROCm by itself is ~12GB, so it is indeed getting bigger and bigger:
somebody@2b71fb785583:/opt/lib/python$ du -hs /opt/lib/python/site-packages/torch/lib/* | sort -h | tail
240M /opt/lib/python/site-packages/torch/lib/librccl.so
466M /opt/lib/python/site-packages/torch/lib/libtorch_cpu.so
643M /opt/lib/python/site-packages/torch/lib/libmagma.so
806M /opt/lib/python/site-packages/torch/lib/librocblas.so
892M /opt/lib/python/site-packages/torch/lib/libMIOpen.so
1.2G /opt/lib/python/site-packages/torch/lib/librocsparse.so
1.3G /opt/lib/python/site-packages/torch/lib/libtorch_hip.so
1.5G /opt/lib/python/site-packages/torch/lib/hipblaslt
1.5G /opt/lib/python/site-packages/torch/lib/librocsolver.so
2.5G /opt/lib/python/site-packages/torch/lib/rocblas
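To check how much of the ~12GB the ten largest entries account for, the `du -hs` output above can be summed with a small Python sketch. It assumes `du -h` suffixes are powers of 1024 (K, M, G, T); the names `parse_size`, `total_bytes` and `DU_OUTPUT` are invented for this example, and the listing is copied from the output above with the common path prefix shortened to `torch/lib/`.

```python
# Multipliers for the human-readable suffixes printed by `du -h`.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

DU_OUTPUT = """\
240M torch/lib/librccl.so
466M torch/lib/libtorch_cpu.so
643M torch/lib/libmagma.so
806M torch/lib/librocblas.so
892M torch/lib/libMIOpen.so
1.2G torch/lib/librocsparse.so
1.3G torch/lib/libtorch_hip.so
1.5G torch/lib/hipblaslt
1.5G torch/lib/librocsolver.so
2.5G torch/lib/rocblas
"""


def parse_size(text: str) -> float:
    """Convert a `du -h` size such as '643M' or '1.2G' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return float(text[:-1]) * UNITS[text[-1].upper()]
    return float(text)


def total_bytes(du_output: str) -> float:
    """Sum the size column of `du -h` style lines ('<size> <path>')."""
    total = 0.0
    for line in du_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            total += parse_size(parts[0])
    return total


if __name__ == "__main__":
    print(f"top 10 entries: {total_bytes(DU_OUTPUT) / 1024**3:.2f} GiB")
```

The ten entries add up to just under 11 GiB, so the ROCm shared libraries bundled under torch/lib make up nearly all of the package size.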
The entry torch-2.3.0+rocm6.0.dist-info doesn't take much space (<10MB) and holds metadata.
Change #1032725 merged by Klausman:
[operations/docker-images/production-images@master] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0)
Change #1034975 had a related patch set uploaded (by Ilias Sarantopoulos; author: Ilias Sarantopoulos):
[operations/docker-images/production-images@master] fix: remove --cache-dir from pytorch image
We had forgotten the .pip dir inside the docker image, which increased its size by more than 2GB (the size of the downloaded packages, since torch is really big by itself even when compressed).
The new image is now 13.5GB, and 2.5GB when compressed, which allows us to publish it in our docker registry.
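A leftover cache directory like this can be spotted by ranking the directories in the image filesystem by size. The following is a minimal Python sketch of that check, not the tooling used for the actual fix; `dir_size_bytes` and `largest_subdirs` are hypothetical helper names, and the `/opt` path in the usage part is an assumption to adapt.

```python
import os


def dir_size_bytes(path: str) -> int:
    """Total size in bytes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(root: str, top: int = 5):
    """Return up to `top` (size_bytes, name) pairs for direct subdirectories, largest first."""
    entries = []
    with os.scandir(root) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                entries.append((dir_size_bytes(entry.path), entry.name))
    return sorted(entries, reverse=True)[:top]


if __name__ == "__main__":
    # Assumed location; point this at wherever the image installs its packages.
    for size, name in largest_subdirs("/opt"):
        print(f"{size / 1024**3:6.2f} GiB  {name}")
```

Run inside a container started from the image, a hidden directory such as .pip near the top of this list is a hint that build-time downloads were left behind in a layer.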
Change #1034975 merged by Klausman:
[operations/docker-images/production-images@master] fix: remove --cache-dir from pytorch image
# build-production-images --select '*pytorch23*'
== Step 0: scanning /srv/images/production-images/images ==
Will build the following images:
* docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 1: building images ==
* Built image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Step 2: publishing ==
Successfully published image docker-registry.discovery.wmnet/amd-pytorch23:2.3.0rocm6.0-1
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/istio ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
== Step 0: scanning /srv/images/production-images/cert-manager ==
Will build the following images:
== Step 1: building images ==
== Step 2: publishing ==
== Build done! ==
You can see the logs at ./docker-pkg-build.log
#
and:
$ docker pull docker-registry.wikimedia.org/amd-pytorch23
Using default tag: latest
latest: Pulling from amd-pytorch23
9e94c62ce5a2: Already exists
7bd1fb5b4955: Already exists
253ad1301e1a: Pull complete
e32d8a205d5b: Pull complete
Digest: sha256:cff85430a98674eae970e9f0a30531388b1deb5c229d77c2f7711a8f3b4b89df
Status: Downloaded newer image for docker-registry.wikimedia.org/amd-pytorch23:latest
docker-registry.wikimedia.org/amd-pytorch23:latest
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker-registry.wikimedia.org/amd-pytorch23 latest 54fa55e17951 28 minutes ago 13.5GB
[...]
$
Change #1032777 merged by jenkins-bot:
[machinelearning/liftwing/inference-services@main] huggingface: upgrade kserve to 0.13-rc0