(Translated by https://www.hiragana.jp/)
Enable kernel & memcpy overlapping in IVF index building by abc99lr · Pull Request #230 · rapidsai/cuvs · GitHub
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable kernel & memcpy overlapping in IVF index building #230

Merged
merged 10 commits into from
Jul 31, 2024

Conversation

abc99lr
Copy link
Contributor

@abc99lr abc99lr commented Jul 17, 2024

Currently, in IVF index building (both IVF-Flat and IVF-PQ), large dataset is usually in pageable host memory or mmap-ed file. In both case, after the cluster centers are trained, the entire dataset needs to be copied twice to the GPU -- one for assigning vectors to clusters, the other for copying vectors to the corresponding clusters. Both copies are done using batch_load_iterator in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current batch_load_iterator implementation doesn't support kernel and memcopy overlapping. This PR adds support on prefetching with cudaMemcpyAsync on pageable memory. We achieve kernel copy overlapping by launching kernel first following by the prefetching of the next chunk.

We benchmarked the change on L40S. The results show 3%-21% speedup on index building, without impacting the search recall (about 1-2%, similar to run-to-run variance).

algo dataset model with prefetching (s) without prefetching (s) speedup
IVF-PQ deep-100M d64b5n50K 97.3547 100.36 1.03
IVF-PQ wiki-all-10M d64-nlist16K 14.9763 18.1602 1.21
IVF-Flat deep-100M nlist50K 78.8188 81.4461 1.03

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

@abc99lr abc99lr requested review from a team as code owners July 17, 2024 20:54
Copy link

copy-pr-bot bot commented Jul 17, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 17, 2024
@cjnolet
Copy link
Member

cjnolet commented Jul 17, 2024

/ok to test

@abc99lr
Copy link
Contributor Author

abc99lr commented Jul 18, 2024

@tfeher , do you have any suggestions on how we should expose the overlapping feature to users? Currently, user need to create a stream pool with one stream in it (separate from the main stream), so we could achieve kernel copy overlapping. But this model is not intuitive. Supporting overlapping in ANN benchmark is not difficult, but for end-user, do we want to add a flag to our API (specifically extend)? Or do you have other suggestions?

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rui for the PR, this is an important improvement for the build time of IVF methods! I In general it looks good, but there are a few small things that we shall clarify.

cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh Outdated Show resolved Hide resolved
cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh Outdated Show resolved Hide resolved
cpp/src/neighbors/detail/ann_utils.cuh Show resolved Hide resolved
cpp/src/neighbors/detail/ann_utils.cuh Show resolved Hide resolved
@tfeher tfeher requested a review from achirkin July 19, 2024 21:09
@tfeher
Copy link
Contributor

tfeher commented Jul 19, 2024

currently, user need to create a stream pool [...] so we could achieve kernel copy overlapping. But this model is not intuitive. [...] but for end-user, do we want to add a flag to our API?

This is a good question. In other words, to enable overlapping of clustering and H2D copies, we have to do the following:

handle = DeviceResources(n_streams=1)
index = ivf_pq.build(build_params, dataset, resources=handle)

The question is, whether we would like to have an extra build option like overlap_h2d_copies, and spawn a extra stream in the background if that is set:

build_params.overlap_hd2_copies = True
index = ivf_pq.build(build_params, dataset)

I would prefer that we use the first option, if we have the resources (stream pool) then we use that automatically. But we would need to update the docstring of build, so that the user is aware that we need a handle with a stream pool for the fastest build. What do you think @cjnolet?

@abc99lr
Copy link
Contributor Author

abc99lr commented Jul 22, 2024

currently, user need to create a stream pool [...] so we could achieve kernel copy overlapping. But this model is not intuitive. [...] but for end-user, do we want to add a flag to our API?

This is a good question. In other words, to enable overlapping of clustering and H2D copies, we have to do the following:

handle = DeviceResources(n_streams=1)
index = ivf_pq.build(build_params, dataset, resources=handle)

The question is, whether we would like to have an extra build option like overlap_h2d_copies, and spawn a extra stream in the background if that is set:

build_params.overlap_hd2_copies = True
index = ivf_pq.build(build_params, dataset)

I would prefer that we use the first option, if we have the resources (stream pool) then we use that automatically. But we would need to update the docstring of build, so that the user is aware that we need a handle with a stream pool for the fastest build. What do you think @cjnolet?

Hi @cjnolet , could you comment on the two choices Tamas summarized here?

@abc99lr abc99lr changed the title [WIP] Enable kernel & memcpy overlapping in IVF index building Enable kernel & memcpy overlapping in IVF index building Jul 26, 2024
@abc99lr abc99lr requested a review from tfeher July 27, 2024 00:29
@abc99lr
Copy link
Contributor Author

abc99lr commented Jul 27, 2024

Re-requesting reviews.

Added documentations. Exposed the overlapping option to users. Users need to setup a stream pool in resource and pass that to the build or extend APIs.

@cjnolet
Copy link
Member

cjnolet commented Jul 29, 2024

@abc99lr just to add my response here quickly- I do agree with Tamas' suggestion that we should utilize the infrastructure already available in the handle to inform the user-configurable behavior of algorithms as much as possible.

That being said, the expectations of these configurations do need to be communicated more clearly to the users through the docs, I +1 to improving the docstring.

@abc99lr
Copy link
Contributor Author

abc99lr commented Jul 29, 2024

Thanks for the review @cjnolet . Could you let me know if there is anything specific we should add to the docstring? I have added more documentations in my changes last Friday.

@cjnolet
Copy link
Member

cjnolet commented Jul 30, 2024

@abc99lr the doc description itself looks okay, but, but it would be helpful to users to add a line or two to the usage example below the description that sets a stream pool on the handle. I would also add a small code comment to the snippet that says something like “create a cuda stream pool to …”.

Thanks again for the contribution!

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @abc99lr for the updates! It looks good, let's just add a small improvement to the docstring code examples.

cpp/include/cuvs/neighbors/ivf_flat.hpp Outdated Show resolved Hide resolved
cpp/include/cuvs/neighbors/ivf_flat.hpp Show resolved Hide resolved
@cjnolet
Copy link
Member

cjnolet commented Jul 30, 2024

/ok to test

@abc99lr abc99lr requested a review from tfeher July 30, 2024 21:20
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rui for the update, the PR looks good to me!

@tfeher
Copy link
Contributor

tfeher commented Jul 31, 2024

/ok to test

@cjnolet
Copy link
Member

cjnolet commented Jul 31, 2024

@abc99lr there's some errors in the style checker. Can you please install and use pre-commit to guarantee that each commit will pass the style checker? It takes only a few seconds to install and it's automatic each time you commit. Instructions here.

@abc99lr
Copy link
Contributor Author

abc99lr commented Jul 31, 2024

@abc99lr there's some errors in the style checker. Can you please install and use pre-commit to guarantee that each commit will pass the style checker? It takes only a few seconds to install and it's automatic each time you commit. Instructions here.

Thanks, my pre-commit was installed in my previous conda env. I had to rebuild an env so forgot to add it. Thanks for the instruction. I think we should be good now.

@cjnolet
Copy link
Member

cjnolet commented Jul 31, 2024

/ok to test

@tfeher
Copy link
Contributor

tfeher commented Jul 31, 2024

/merge

@rapids-bot rapids-bot bot merged commit eb4d38e into rapidsai:branch-24.08 Jul 31, 2024
54 checks passed
divyegala pushed a commit to divyegala/cuvs that referenced this pull request Aug 7, 2024
Currently, in IVF index building (both IVF-Flat and IVF-PQ), large dataset is usually in pageable host memory or mmap-ed file. In both case, after the cluster centers are trained, the entire dataset needs to be copied twice to the GPU -- one for assigning vectors to clusters, the other for copying vectors to the corresponding clusters. Both copies are done using `batch_load_iterator` in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current `batch_load_iterator` implementation doesn't support kernel and memcopy overlapping. This PR adds support on prefetching with `cudaMemcpyAsync` on pageable memory. We achieve kernel copy overlapping by launching kernel first following by the prefetching of the next chunk. 

We benchmarked the change on L40S. The results show 3%-21% speedup on index building, without impacting the search recall (about 1-2%, similar to run-to-run variance). 
algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup
-- | -- | -- | -- | -- | --
IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03
IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21
IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

Authors:
  - Rui Lan (https://github.com/abc99lr)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#230
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cpp improvement Improves an existing functionality non-breaking Introduces a non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

3 participants