Enable kernel & memcpy overlapping in IVF index building #230

abc99lr · 2024-07-17T20:54:21Z

Currently, in IVF index building (both IVF-Flat and IVF-PQ), large dataset is usually in pageable host memory or mmap-ed file. In both case, after the cluster centers are trained, the entire dataset needs to be copied twice to the GPU -- one for assigning vectors to clusters, the other for copying vectors to the corresponding clusters. Both copies are done using batch_load_iterator in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current batch_load_iterator implementation doesn't support kernel and memcopy overlapping. This PR adds support on prefetching with cudaMemcpyAsync on pageable memory. We achieve kernel copy overlapping by launching kernel first following by the prefetching of the next chunk.

We benchmarked the change on L40S. The results show 3%-21% speedup on index building, without impacting the search recall (about 1-2%, similar to run-to-run variance).

algo	dataset	model	with prefetching (s)	without prefetching (s)	speedup
IVF-PQ	deep-100M	d64b5n50K	97.3547	100.36	1.03
IVF-PQ	wiki-all-10M	d64-nlist16K	14.9763	18.1602	1.21
IVF-Flat	deep-100M	nlist50K	78.8188	81.4461	1.03

This PR is related to the issue submitted to RAFT: rapidsai/raft#2106

copy-pr-bot · 2024-07-17T20:54:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

cjnolet · 2024-07-17T23:38:07Z

/ok to test

abc99lr · 2024-07-18T23:20:05Z

@tfeher , do you have any suggestions on how we should expose the overlapping feature to users? Currently, user need to create a stream pool with one stream in it (separate from the main stream), so we could achieve kernel copy overlapping. But this model is not intuitive. Supporting overlapping in ANN benchmark is not difficult, but for end-user, do we want to add a flag to our API (specifically extend)? Or do you have other suggestions?

tfeher

Thanks Rui for the PR, this is an important improvement for the build time of IVF methods! I In general it looks good, but there are a few small things that we shall clarify.

cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh

cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh

cpp/src/neighbors/detail/ann_utils.cuh

tfeher · 2024-07-19T21:30:18Z

currently, user need to create a stream pool [...] so we could achieve kernel copy overlapping. But this model is not intuitive. [...] but for end-user, do we want to add a flag to our API?

This is a good question. In other words, to enable overlapping of clustering and H2D copies, we have to do the following:

handle = DeviceResources(n_streams=1)
index = ivf_pq.build(build_params, dataset, resources=handle)

The question is, whether we would like to have an extra build option like overlap_h2d_copies, and spawn a extra stream in the background if that is set:

build_params.overlap_hd2_copies = True
index = ivf_pq.build(build_params, dataset)

I would prefer that we use the first option, if we have the resources (stream pool) then we use that automatically. But we would need to update the docstring of build, so that the user is aware that we need a handle with a stream pool for the fastest build. What do you think @cjnolet?

abc99lr · 2024-07-22T21:33:47Z

currently, user need to create a stream pool [...] so we could achieve kernel copy overlapping. But this model is not intuitive. [...] but for end-user, do we want to add a flag to our API?

This is a good question. In other words, to enable overlapping of clustering and H2D copies, we have to do the following:
handle = DeviceResources(n_streams=1)
index = ivf_pq.build(build_params, dataset, resources=handle)
The question is, whether we would like to have an extra build option like overlap_h2d_copies, and spawn a extra stream in the background if that is set:
build_params.overlap_hd2_copies = True
index = ivf_pq.build(build_params, dataset)
I would prefer that we use the first option, if we have the resources (stream pool) then we use that automatically. But we would need to update the docstring of build, so that the user is aware that we need a handle with a stream pool for the fastest build. What do you think @cjnolet?

Hi @cjnolet , could you comment on the two choices Tamas summarized here?

…prefetch or not.

abc99lr · 2024-07-27T00:30:40Z

Re-requesting reviews.

Added documentations. Exposed the overlapping option to users. Users need to setup a stream pool in resource and pass that to the build or extend APIs.

cjnolet · 2024-07-29T20:24:03Z

@abc99lr just to add my response here quickly- I do agree with Tamas' suggestion that we should utilize the infrastructure already available in the handle to inform the user-configurable behavior of algorithms as much as possible.

That being said, the expectations of these configurations do need to be communicated more clearly to the users through the docs, I +1 to improving the docstring.

abc99lr · 2024-07-29T21:22:14Z

Thanks for the review @cjnolet . Could you let me know if there is anything specific we should add to the docstring? I have added more documentations in my changes last Friday.

cjnolet · 2024-07-30T14:18:58Z

@abc99lr the doc description itself looks okay, but, but it would be helpful to users to add a line or two to the usage example below the description that sets a stream pool on the handle. I would also add a small code comment to the snippet that says something like “create a cuda stream pool to …”.

Thanks again for the contribution!

tfeher

Thanks @abc99lr for the updates! It looks good, let's just add a small improvement to the docstring code examples.

cpp/include/cuvs/neighbors/ivf_flat.hpp

cjnolet · 2024-07-30T19:23:24Z

/ok to test

tfeher

Thanks Rui for the update, the PR looks good to me!

tfeher · 2024-07-31T13:17:07Z

/ok to test

cjnolet · 2024-07-31T14:28:55Z

@abc99lr there's some errors in the style checker. Can you please install and use pre-commit to guarantee that each commit will pass the style checker? It takes only a few seconds to install and it's automatic each time you commit. Instructions here.

abc99lr · 2024-07-31T16:06:20Z

@abc99lr there's some errors in the style checker. Can you please install and use pre-commit to guarantee that each commit will pass the style checker? It takes only a few seconds to install and it's automatic each time you commit. Instructions here.

Thanks, my pre-commit was installed in my previous conda env. I had to rebuild an env so forgot to add it. Thanks for the instruction. I think we should be good now.

cjnolet · 2024-07-31T16:20:07Z

/ok to test

tfeher · 2024-07-31T23:11:06Z

/merge

Currently, in IVF index building (both IVF-Flat and IVF-PQ), large dataset is usually in pageable host memory or mmap-ed file. In both case, after the cluster centers are trained, the entire dataset needs to be copied twice to the GPU -- one for assigning vectors to clusters, the other for copying vectors to the corresponding clusters. Both copies are done using `batch_load_iterator` in a chunk-by-chunk fashion. Since the source buffer is in pageable memory, the current `batch_load_iterator` implementation doesn't support kernel and memcopy overlapping. This PR adds support on prefetching with `cudaMemcpyAsync` on pageable memory. We achieve kernel copy overlapping by launching kernel first following by the prefetching of the next chunk. We benchmarked the change on L40S. The results show 3%-21% speedup on index building, without impacting the search recall (about 1-2%, similar to run-to-run variance). algo | dataset | model | with prefetching (s) | without prefetching (s) | speedup -- | -- | -- | -- | -- | -- IVF-PQ | deep-100M | d64b5n50K | 97.3547 | 100.36 | 1.03 IVF-PQ | wiki-all-10M | d64-nlist16K | 14.9763 | 18.1602 | 1.21 IVF-Flat | deep-100M | nlist50K | 78.8188 | 81.4461 | 1.03 This PR is related to the issue submitted to RAFT: rapidsai/raft#2106 Authors: - Rui Lan (https://github.com/abc99lr) - Corey J. Nolet (https://github.com/cjnolet) Approvers: - Tamas Bela Feher (https://github.com/tfeher) URL: rapidsai#230

abc99lr requested review from a team as code owners July 17, 2024 20:54

github-actions bot added cpp CMake labels Jul 17, 2024

cjnolet assigned abc99lr Jul 17, 2024

cjnolet added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Jul 17, 2024

tfeher requested changes Jul 19, 2024

View reviewed changes

cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh Outdated Show resolved Hide resolved

cpp/src/neighbors/ivf_pq/ivf_pq_build.cuh Outdated Show resolved Hide resolved

cpp/src/neighbors/detail/ann_utils.cuh Show resolved Hide resolved

cpp/src/neighbors/detail/ann_utils.cuh Show resolved Hide resolved

tfeher requested a review from achirkin July 19, 2024 21:09

abc99lr added 3 commits July 26, 2024 16:50

Add prefetch options to batch load iterator.

0ee4266

Remove multi-threaded prefetching.

e8bac0d

Fix no resource found error.

8b7a116

abc99lr force-pushed the enable-ann-bench-overlap branch from 08dde39 to 8b7a116 Compare July 26, 2024 16:52

github-actions bot removed the CMake label Jul 26, 2024

Add tests in ANN; multiple changes to enable prefetch.

e699f05

abc99lr changed the title ~~[WIP] Enable kernel & memcpy overlapping in IVF index building~~ Enable kernel & memcpy overlapping in IVF index building Jul 26, 2024

More documentations and change the way for detemining whether enable …

39dd1d4

…prefetch or not.

abc99lr requested a review from tfeher July 27, 2024 00:29

tfeher requested changes Jul 30, 2024

View reviewed changes

cpp/include/cuvs/neighbors/ivf_flat.hpp Outdated Show resolved Hide resolved

cpp/include/cuvs/neighbors/ivf_flat.hpp Show resolved Hide resolved

Merge branch 'branch-24.08' into enable-ann-bench-overlap

11068d0

Add more documentations/examples.

c84f718

abc99lr requested a review from tfeher July 30, 2024 21:20

tfeher approved these changes Jul 31, 2024

View reviewed changes

Merge branch 'branch-24.08' into enable-ann-bench-overlap

7c74ab2

abc99lr added 2 commits July 31, 2024 16:02

Fix styles.

3528fd3

Fix styles.

0197b84

rapids-bot bot merged commit eb4d38e into rapidsai:branch-24.08 Jul 31, 2024
54 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable kernel & memcpy overlapping in IVF index building #230

Enable kernel & memcpy overlapping in IVF index building #230

abc99lr commented Jul 17, 2024 •

edited

Loading

copy-pr-bot bot commented Jul 17, 2024

cjnolet commented Jul 17, 2024

abc99lr commented Jul 18, 2024 •

edited

Loading

tfeher left a comment

tfeher commented Jul 19, 2024

abc99lr commented Jul 22, 2024

abc99lr commented Jul 27, 2024

cjnolet commented Jul 29, 2024

abc99lr commented Jul 29, 2024

cjnolet commented Jul 30, 2024

tfeher left a comment

cjnolet commented Jul 30, 2024

tfeher left a comment

tfeher commented Jul 31, 2024

cjnolet commented Jul 31, 2024

abc99lr commented Jul 31, 2024

cjnolet commented Jul 31, 2024

tfeher commented Jul 31, 2024

Enable kernel & memcpy overlapping in IVF index building #230

Enable kernel & memcpy overlapping in IVF index building #230

Conversation

abc99lr commented Jul 17, 2024 • edited Loading

copy-pr-bot bot commented Jul 17, 2024

cjnolet commented Jul 17, 2024

abc99lr commented Jul 18, 2024 • edited Loading

tfeher left a comment

Choose a reason for hiding this comment

tfeher commented Jul 19, 2024

abc99lr commented Jul 22, 2024

abc99lr commented Jul 27, 2024

cjnolet commented Jul 29, 2024

abc99lr commented Jul 29, 2024

cjnolet commented Jul 30, 2024

tfeher left a comment

Choose a reason for hiding this comment

cjnolet commented Jul 30, 2024

tfeher left a comment

Choose a reason for hiding this comment

tfeher commented Jul 31, 2024

cjnolet commented Jul 31, 2024

abc99lr commented Jul 31, 2024

cjnolet commented Jul 31, 2024

tfeher commented Jul 31, 2024

abc99lr commented Jul 17, 2024 •

edited

Loading

abc99lr commented Jul 18, 2024 •

edited

Loading