Find Any Part in 3D

Ziqi Ma Yisong Yue Georgia Gkioxari
California Institute of Technology

Abstract

Why don’t we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1755 $\times$ more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve $\mathbf{260\%}$ improvement in mIoU and boost speed by 6 $\times$ to 300 $\times$ . Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts. Project website: https://ziqi-ma.github.io/find3dsite/

Figure 1: Find3D is the first general-category 3D model that can segment any part of any object with any text query. We achieve this by building a scalable Data Engine powered by 2D foundation models – SAM & Gemini – that automatically annotates 3D assets from the web. Using the labeled data, Find3D trains a transformer-based point cloud model with a contrastive training recipe. Our method works on diverse 3D objects and parts, e.g. the easel, the imaginary animal, and the ceiling fan.

1 Introduction

Is it possible to build foundation models in 3D? For text and image modalities, we have seen that strong, general models come from internet-scale training data. In the absence of such large-scale 3D datasets, can we attempt to replicate this success in 3D?

In this paper, we provide an answer to this question. We show that when you tackle the data challenge, you can get a strong model with a simple, general training recipe. This approach not only unlocks generalization to unseen objects with a $\mathbf{260\%}$ improvement in mIoU, but even outperforms prior methods on the datasets they train on.

Prior works, heavy on dataset-specific customization, suffer from poor generalization because existing datasets are small and homogeneous. For example, ShapeNet-Part [27] contains only 16 categories and makes assumptions such as “all chairs face right”. Evaluation on such limited datasets implicitly encourages dataset-specific customization, which is not the path towards generalization. Moreover, many prior works use pipelines that are computationally expensive and cannot scale to larger training sets.

In this paper, our goal is to: 1) establish a scalable data engine that can generate useful labels for any number of 3D assets; and 2) show that having a large-scale training set enables strong generalization with a simple training recipe, without any customizations for specific datasets such as per-category prompt and viewpoint search [33], category-specific finetuning [32, 21], multi-pass inference customization with predefined part ranking logic [32], or slow test-time pipelines [20]. Conceptually, our findings mirror those in other domains such as text, where using general training recipes at scale leads to powerful and general foundation models.

Concretely, as shown in Fig. 1, we enable scaling in 3D by building a data engine that automatically annotates synthetic 3D assets on the internet, yielding $\mathbf{2.1}$ million part annotations of 761 object categories. Our dataset contains $124615$ unique part types, which is over $\mathbf{1775\times}$ the number of unique part types in existing datasets combined (ShapeNet-Part [27] and PartNet-E [12] contain 71 unique part types combined). To leverage such large-scale data, we devise a contrastive training objective to handle part hierarchy and ambiguity. Our model takes in a point cloud and predicts a queryable semantic feature for every point. The features are in the latent embedding space of a CLIP-like [15] model, so that they can be queried with any free-form text by calculating pointwise cosine similarities with the query embedding.

This approach yields a model that can segment any part of any object, with any text query. We highlight the following contributions:

•

We develop a data engine that labels 3D object parts from large-scale internet data to train a general-category model without the need for human annotation. Our data engine creatively combines existing vision and language foundation models.
•

We build the first model for 3D segmentation that is simultaneously open-world, cross-category, part-level and feed-forward. We achieve $\mathbf{260\%}$ improvement in mIoU and 6 $\times$ to over 300 $\times$ the inference speed compared to existing methods.
•

We release a benchmark for evaluating open-world 3D part segmentation for diverse objects, with 5 $\times$ more unique part types than the largest existing benchmark.

Refer to caption — Figure 2: The Data Engine. (a) We render Objaverse assets into multiple views and pass each rendering to SAM with gridpoint prompts for segmentation. For each mask, we query Gemini for the corresponding part name, which gives us (mask, text) pairs. We embed the part name into the latent embedding space of a vision and language foundation model such as SigLIP. We back-project mask pixels to obtain the points associated with each label embedding, yielding (points, text embedding) pairs. (b) Example annotations by the data engine.

2 Related Work

Closed-world 3D segmentation. 3D segmentation has been studied primarily in a closed world and with a coarse granularity that cannot go below whole objects. In specific settings such as indoor scenes or self-driving, state-of-the-art models are starting to achieve better generalization by training on multiple datasets, such as Mask3D [18] and the PointTransformer series [30, 23, 22]. However, these models are still domain-specific, and can only segment whole objects rather than parts. Part-level segmentation is less studied. Early efforts started with the ShapeNet-Part dataset [27] (16 object classes, $\leq 6$ parts per object). PartNet-E introduces articulated objects but is still limited to only 45 categories. Due to the limited number of categories and shared orientations (e.g., chairs all facing right), state-of-the-art part-level models [14, 9] cannot generalize well. Our work tackles both the challenges of generalization and granularity – our model is part-level, and can segment any object part in an open-world setting.

3D aggregation methods based on 2D renderings. With the progress of vision language models in 2D image understanding, some works directly assemble these models to obtain an “aggregated” 3D understanding without training a 3D model. An exemplary aggregation method uses multiview renderings of 3D scenes or objects, obtains their features in 2D based on models like CLIP [15], SAM [6], or GLIP [7], and combines them in 3D based on projection geometry. On the whole object level, such methods include OpenMask3D [20]. On the part level, such methods include PointCLIP [29], PointCLIPV2 [33], PartSLIP [8] and PartSLIP++ [32] for point clouds, and SATR [1] for meshes. These models lack 3D geometry information and suffer from inconsistency across views. Furthermore, these methods are slow because they perform many inferences and the aggregation logic at test time. Our method, which predicts in 3D with a single inference, is significantly faster. Our method also achieves stronger performance and better robustness to pose changes by leveraging 3D geometry information.

Test-time optimization. Test-time optimization methods combine features from 2D models with a 3D representation, such as NeRF or Gaussian Splatting. At test time, these methods optimize the 3D representation with the 2D-sourced features attached. LERF [3], Distilled Feature Field [19], and Garfield [4] are based on radiance fields. Feature3DGS [31] is based on Gaussian splatting. These methods need to be optimized per scene (or per object), which can be slow (several minutes). Moreoever, their part-level capabilities have not been well-studied. Our method, feed-forward in nature, provides much faster inference with better performance.

Distillation methods. Distillation methods train 3D models using 2D annotations. Generalization is a key limitation in prior works – distillation is usually performed per dataset, even per category. OpenScene [13], a whole-object segmentation model for indoor scenes, is distilled per dataset. For part segmentation, PartDistill [21] is distilled per category. Such models cannot perform inference zero-shot on unseen object classes, which is critical in real-world use cases. Our approach can be considered a distillation method that tackles the challenge of zero-shot generalization.

3 Method

We propose a method, Find3D, to locate any object part in 3D based on a free-form language description, such as “the wheel of a car”. As shown in Fig. 1 (panel b), we design a model that takes in a point cloud and outputs a queryable semantic feature for every point. This semantic feature is in the latent embedding space of a pre-trained CLIP-like [15] model, such as SigLIP [28]. For any text query, we embed the query using the same model and calculate its cosine similarity with each point’s feature. This yields a pointwise similarity score that reflects the confidence of the part being located at that point. This score can be used to segment the object or localize specific parts.

Formally, given a point cloud $C=\{\mathbf{p_{1}},...{\mathbf{p_{n}}}\}$ with color and normals, for any point $\mathbf{p_{i}}=(x,y,z,n_{x},n_{y},n_{z},r,g,b)\in\mathbb{R}^{9}$ , we want to find a semantic feature $f_{i}\in\mathbb{R}^{d}$ which belongs in the same latent embedding space as a CLIP-like model, e.g., SigLIP. At inference time, for any text $s$ , we can get its SigLIP embedding $T(s)$ and compute its cosine similarity with $f_{i}$ , $cos(T(s),f_{i})$ . For segmentation, Find3D assigns each point to the text query with the highest cosine similarity, and assigns “no label” if all queries yield negative similarity scores.

3.1 Data Engine

Obtaining large-scale 3D annotations for generic object categories with human-in-the-loop pipelines is onerous. We develop a scalable data engine that leverages annotations from 2D foundation models and geometrically unprojects them to 3D.

As illustrated in Fig. 2, our data engine leverages SAM [6] and Gemini [17] to annotate 3D assets from Objaverse [2]. Since Objaverse assets do not have a fixed orientation, and Gemini provides higher-quality labels to objects seen in familiar orientations, we first prompt Gemini to select the best orientation based on 10 renderings (from different camera angles) of an object in each orientation. For the chosen orientation, we pass all renderings to SAM with grid point prompts. We discard masks that are too small (less than 350 pixels out of a 500 $\times$ 500 image), too large (greater than 20% of all pixels), or with low confidence from SAM. We overlay each mask on the original image and ask Gemini to name the shaded part. Prompts are detailed in the appendix. Masks with the same label are merged. This process generates labeled (mask, text) pairs. We map each mask to a set of points in the point cloud based on projection geometry. To make the point features queryable by language, we align point features to the language embedding space of a pretrained model, such as SigLIP. We embed the label texts and use the text features as supervision.

The data engine processes 36044 Objaverse objects under LVIS categories selected by [11, 10]. Each part can be annotated differently from different views, denoting various aspects of part, such as location (e.g., “bottom”), material (e.g., “snowball”), and function (e.g., “body”). Labels also have different levels of granularity. For example, in Fig. 2, one granularity is individual snowballs, and another granularity is the whole snowman. The diversity of our labels helps the model handle the inherent ambiguity in segmentation. Our data engine annotates 30K objects from 761 unique categories with 2.1 million parts in total. Our annotations contain 124615 unique part types, which is over $\mathbf{1775\times}$ the number of unique part types in existing datasets combined (ShapeNet-Part and PartNet-E contain 71 unique part types in total). Fig. 2 panel b shows some example annotations covering a wide range of part types and object geometries. We provide more annotation examples by our data engine in the appendix.

3.2 Open-World 3D Part Model

Architecture. Find3D adopts the PT3 [22] architecture that treats point clouds as sequences, as illustrated in Fig. 3. To align the point features into the latent embedding space of SigLIP, we append a lightweight 4-layer MLP to the last layer of the transformer. This returns a 768-dimension feature per point. Our model contains 46.2 million parameters.

Training. Leveraging the diverse annotations from our data engine requires some care. We cannot define a direct pointwise loss because: 1) The same point can have multiple labels that denote various aspects of a part such as location, material, and function. Some labels may also be incorrect; and 2) Many points are unlabeled - as shown in Fig. 3 (right), each mask only labels points visible from one camera view, and thus parts are likely to be labeled partially.

The challenge of partial labels can be resolved if the model can map features based on 3D geometry: points on the same ball should share similar features, and if we align the features of some points on that ball correctly to the text embedding, the other points’ features should also be aligned. The challenge of multiple labels can be resolved by the contrastive formulation: each point’s feature is encouraged to be close to the embeddings of all its labels, which allows for flexible text queries at inference time. As illustrated in Fig. 3 (right), we define the contrastive pairing as follows: for each label, the ground truth is the SigLIP embedding of the text. The predicted value is the average feature of all points that correspond to the label. This pooling can also be regarded as a way to “denoise” the labels – while an individual point might be affected by conflicting or incorrect labels, it is unlikely that all points are subjected to the same error.

Formally, our data engine provides (points, text embedding) labels, which we denote as $(C_{i},T(\text{label}_{i}))$ where $C_{i}$ is a subset of the point cloud that this label corresponds to, and $T(\text{label}_{i})$ is the label embedding. We denote the pooled feature from the labeled points as $f(C_{i})$ , where $f$ is our model. We define the contrastive loss as follows:

l_{i}=-\log\frac{\exp(f(C_{i})\cdot T(\text{label}_{i}))}{\sum_{j=1}^{|% \mathcal{B}|}\exp(f(C_{i})\cdot T(\text{label}_{j}))}

(1)

where $\mathcal{B}$ denotes all labels of all objects in a batch. For training, we use a batch size of 64 objects, corresponding to $\sim 3000$ positive pairs per batch.

To achieve generalization, in addition to training on diverse data provided by the data engine, we also apply data augmentations, including random rotation (implemented as sequential random rotation along all three axes), scaling, flipping, jittering, chromatic auto contrast, chromatic translation, and chromatic jitter. These augmentations help avoid over-reliance on object poses and color, and nudges the model to take up 3D geometric cues. We perform a 90:10 train-validation split on the 27552 objects provided by the data engine, and train with the Adam optimizer [5] with a cosine annealing learning rate schedule, starting at 0.0003 and ending at 0.00005 over 80 epochs.

4 A General Open-World 3D Part Benchmark

Existing 3D part segmentation benchmarks only contain a small number of categories with a fixed set of parts, and are limited to narrow domains. ShapeNet-Part [27] contains 16 object categories with 41 unique part types, and the domain is limited to CAD models with canonical orientations (e.g., all chairs face right). PartNet-E [12, 24] contains 45 categories with 40 unique part types (e.g., “button” is a common part across categories), and is restricted to simple home objects such as bottles and doors. As shown in Fig. 4, we introduce a new human-annotated benchmark featuring a diverse range of objects, shapes, parts, and poses. We source our data from Objaverse [2]. Our benchmark contains 132 object categories, 450 total parts and 207 unique part types, over $\mathbf{5\times}$ that of existing benchmarks, as shown in Fig. 4. We hope this benchmark can advance 3D part segmentation towards more variable, “in-the-wild” scenarios. The benchmark is divided into two sets:

Objaverse-General contains 100 objects with 350 parts from 100 diverse object categories, such as gondola, slide, lamppost, easel, penguin. These objects are in random orientations. We hold out 50 out of the 100 categories from training, in order to evaluate out-of-distribution generalization to novel object types. The holdout categories are termed Unseen-Categories in Table 2.

ShapeNetPart-V2 contains 32 objects from the same 16 object categories in ShapeNet-Part [27]. Inspired by ImageNetV2 [16], we create this benchmark to evaluate generalization for models trained on ShapeNet-Part.

5 Experiments

Method	Time	Open-world	Cross-category	Part-level	Feed-forward
Find3D	000.9s	✓	✓	✓	✓
PointCLIPV2	005.4s	✓	✓	✓	✗
PartSLIP++	174.3s	✓	✗^∗	✓	✗
OpenMask3D	296.5s	✓	✓	✗	✗
PartDistill^†	000.7s (+348s)	✗	✗	✓	✓
PointNext	001.4s	✗	✓	✓	✓

Table 1: Properties and inference time of Find3D and baselines. Find3D is the only method that is open-world, cross-category, part-level and feed-forward. By “feed-forward”, we mean a method that performs direct 3D inference, without relying on a pipeline of multiview rendering, multiple 2D model inferences, and backprojection. Our method is

\mathbf{6\times}

\mathbf{300\times}

faster than other open-world models, and on par with closed-world models.

\ast

PartSLIP++ finetunes a model per category.

\dagger

PartDistill performs distillation for each new category (348s) and then does inference (0.7s). It is not open-world as the part names need to be defined prior to distillation. It only releases source code for two categories, and our reported speed is averaged across them.

As summarized in Table 1, Find3D is the first method that is simultaneously, open-world, cross-category, part-level and feed-forward. Find3D not only shows strong zero-shot generalization, but also outperforms existing methods on their own domain. Our experiments show:

•

Find3D achieves strong performance on diverse objects, with $\mathbf{260\%}$ improvement in mIoU from existing methods. Find3D exhibits strong out-of-distribution generalization, whereas baseline methods perform poorly on datasets they are not trained on, as shown qualitatively in Fig. 5 and quantitatively in Tab. 2, Tab. 3, Tab. 4.
•

Find3D is robust to variations such as query prompt rephrasing, object rotation, and domain shift, whereas baselines are sensitive to these changes. This is shown in Fig. 7, Tab. 3, Tab. 4.
•

Find3D is the most efficient open-world method with 6x to 300x speed improvement, as shown in Tab. 1.

5.1 Experimental Settings

Benchmarks. In addition to our proposed benchmark (Sec. 4), we also evaluate on two commonly used datasets for 3D part segmentation: ShapeNet-Part [27] (16 object categories) and PartNet-E [12] (45 object categories). For both datasets, we evaluate on their test set both in the canonical pose and in a randomly rotated (around all axes) pose, which correspond to the Canonical and Rotated columns in Tab. 2, Tab. 3.

mIoU (%)	Objaverse-General				ShapeNet-Part
	Seen Categories		Unseen Categories		Canonical		Rotated		ShapeNetPart-V2
	{part} of a {object}	{part}	{part} of a {object}	{part}	{part} of a {object}	{part}	{part} of a {object}	{part}	{part} of a {object}	{part}
Find3D (ours)	33.78	34.10	26.21	27.41	28.39	24.09	29.64	23.71	42.15	30.02
PointCLIPV2	$9.81$	$11.27$	$10.27$	$11.09$	16.91	20.22	16.88	18.19	$15.14$	$17.11$
PartSLIP++	$2.69$	$15.03$	$0.57$	$10.43$	$1.43$	$6.46$	$0.94$	$6.03$	$1.54$	$11.62$
OpenMask3D	$11.81$	$11.93$	$7.01$	$10.31$	$8.94$	$10.37$	$6.75$	$14.56$	$15.87$	$13.77$

Table 2: Performance comparison with open-world methods on Objaverse-General and ShapeNet-Part. Shaded cells mean the method is trained on the same dataset (expected higher than white cells); white cells mean zero-shot evaluation. Find3D performs best on Objaverse-General, with

\mathbf{260\%}

improvement in mIoU on unseen categories where all methods are evaluated zero-shot. On ShapeNet-Part, Find3D’s zero-shot performance even exceeds PointCLIPV2 which is trained on the this dataset. We show results evaluated with 2 common query prompts: “{part} of a {object}” and “{part}” for all methods.

Metric. We report class-average intersection-over-union (mIoU) as our metric, which is the mean IoU for all labeled parts per object, averaged across all object categories.

Competing Methods

Open-world Baselines: PointCLIPV2 [33] is an open-world 2D-to-3D pipeline involving multiple invocations of CLIP [15]. It uses top-k prompts ( $k=1400\times n_{\text{parts}}$ per object) selected on the test set of ShapeNet-Part. PartSLIP++ [32] is a detection-based pipeline involving invocations of GLIP [7] and a custom algorithm for finding superpoints. It finetunes a separate model for each category in PartNet-E. We evaluate its zero-shot checkpoint for fairness of comparison. OpenMask3D [20] is an open-vocabulary, 2D-to-3D pipeline trained on scenes.

PointCLIPV2 and OpenMask3D are dense methods that assign a label to every point. We provide the text query “other” as an option for no label on benchmarks that contain unlabeled points.

Closed-world Baselines: PointNeXt [14] is a state-of-the-art closed-world point cloud segmentation model trained on ShapeNet-Part. Due to its closed vocabulary, it cannot be evaluated on other datasets. PartDistill [21] is a category-specific 2D-to-3D distillation method, which is open-world prior to distillation but closed-world at inference time. It cannot be evaluated on unseen object categories due to the category-specific nature of distillation. The code and data for this method are not fully released (only two categories are released). Since we cannot reproduce the approach, we show numbers claimed in the paper.

Because PartSLIP++ and OpenMask3D are slow (up to 5 minutes per object), they are infeasible to evaluate on the full test sets (evaluating OpenMask3D on PartNet-E test set would take 628 hours). For fair evaluations of all methods, we create smaller subsets of 160 objects (10 objects/category $\times$ 16 categories) for ShapeNet-Part and 225 objects (5 objects/category $\times$ 45 categories) for PartNet-E. For methods that are efficient to evaluate, we additionally report performance on the full test sets in the appendix. We observe the same rankings and similar results on the subsets and full sets.

5.2 Experimental Results

Results on Objaverse-General and ShapeNet-Part. Tab. 2 reports the mIoU of Find3D and open-world baselines on Objaverse-General and ShapeNet-Part. Find3D shows the strongest performance, with $\mathbf{260\%}$ improvement in mIoU compared to the best baseline, PointCLIPV2, when both are evaluated zero-shot out of distribution (Objaverse-General– Unseen Categories). Additionally, even when evaluated zero-shot, Find3D outperforms PointCLIPV2 on ShapeNet-Part, the dataset it is trained on.

Qualitative results. As seen in Fig. 5, Find3D consistently outputs reasonable segmentations, while other methods struggle. PartSLIP++ is trained on PartNet-E with sparse part annotations, and thus tends to output “no label” overly often. OpenMask3D struggles with the part-level granularity, and it usually only picks one part, or at most two parts, to represent the whole object. We additionally show examples in Objaverse-General and PartObjaverse-Tiny [26] for our method on the right. Find3D not only works with diverse objects and parts, but also generalizes to real-world iPhone photos (converted to point clouds with Trellis [25], an off-the-shelf image-to-3D model), despite only being trained on synthetic assets.

Results on PartNet-E.

mIoU(%)	Canonical Orientation		Rotated
	{part} of a {object}	{part}	{part} of a {object}	{part}
Find3D (ours)	16.86	16.38	17.62	17.16
PartSLIP++	05.12	32.71	03.87	23.03
PointCLIPV2	11.28	09.70	10.32	10.22
OpenMask3D	12.54	11.24	11.93	11.67

Table 3: Comparison of open-world methods on PartNet-E. Shaded cells mean the method is trained on the same dataset (expected higher than white cells); white cells mean zero-shot evaluation. We evaluate with 2 prompt formats: “{part} of a {object}” and “{part}”. PartSLIP++ achieves good performance with the “{part}” prompt, but its performance drops

\mathbf{84\%}

when we vary the query prompt. This dataset is challenging for our method due to the sparsity of labels and the presence of small parts that are not geometrically or colorfully prominent (e.g., buttons on a surface with the same color). Nevertheless, our method is more robust to rotation and prompt variation, and clearly outperforms the other baselines that are also evaluated zero-shot.

Tab. 3 compares open-world methods on PartNet-E. PartSLIP++ is trained on this dataset while all other methods, including Find3D, are evaluated zero-shot, so the results in this table favor PartSLIP++. We evaluate on two common prompts: “part of a object” (such as “leg of a chair”), and “part name” (“leg”). We see that PartSLIP++’s performance decreases greatly when we vary the query prompt, up to a $83\%$ drop. Our method, evaluated zero-shot, is more robust and outperforms PartSLIP++ with the “part of a object” prompt. It also outperforms other zero-shot baselines under all evaluation configurations.

Efficiency. As shown in Tab. 1, Find3D only takes 0.9 seconds for inference, which is 6 $\times$ to 300 $\times$ faster than open-world baselines and on par with closed-world models. Inference time is the average per-object inference time on the PartNet-E subset evaluated on an A100.

Comparing with closed-world methods on ShapeNet-Part and ShapeNetPart-V2. Tab. 4 compares Find3D zero-shot with closed-world methods that are trained on ShapeNet-Part, which greatly favors the closed-world methods. PointNeXt is the leading closed-world method for this dataset, and PartDistill trains one model for each object category of ShapeNet-Part. We additionally evaluate the methods’ generalization capability on our ShapeNetPart-V2 benchmark, similar to ImageNetV2 [16]. We see a $\mathbf{64\%}$ drop of PointNeXt. Even though PointNeXt is still in-domain and Find3D is evaluated out-of-distribution, Find3D shows a $\mathbf{1.5\times}$ advantage. PartDistill is not reproducible and thus cannot be evaluated on ShapeNetPart-V2.

mIoU (%)	Trained on	ShapeNet-Part	ShapeNetPart-V2
Find3D	Our data engine	$28.39$	$42.15$
PointNeXt	ShapeNet-Part	80.44	$28.70$
PartDistill^†	ShapeNet-Part	63.9	N/A

Table 4: Performance comparison with closed-world methods. Shaded cells mean the method is trained on the same dataset (expected higher than white cells); white cells mean zero-shot evaluation. PointNeXt, a state-of-the-art closed-world model, is trained on ShapeNet-Part, but its performance drops significantly on ShapeNetPart-V2. Our approach, which is trained on a domain different from either ShapeNet-Part and ShapeNetPart-V2, demonstrates a stronger out-of-domain performance (

1.5\times

better on ShapeNetPart-V2, +13.2%).

\dagger

PartDistill trains a model per-category on ShapeNet-Part. It does not release training source-code or checkpoints (apart from two categories), thus cannot be evaluated on ShapeNetPart-V2.

5.2.1 Scaling Analysis

Data scaling is critical, as shown by the scaling analysis in Fig. 6. This finding highlights the importance of our data engine approach, which enables scaling in 3D. We vary training object categories (x-axis) ranging from 16 categories (ShapeNet-Part dataset size), 45 categories (PartNet-E dataset size), all the way to 761 (our setting). We report zero-shot mIoU on Objaverse-Unseen Categories (y-axis). We observe a strong scaling trend which is consistent with findings in many other data domains.

5.2.2 Quantifying and Comparing Robustness

Fig. 7 evaluates the robustness of our method under different query text prompt, object orientation, and data domain, i.e. the data source of similarly-looking objects.

Robustness to query prompt. PointCLIPV2 performs an extensive top-k prompt search on the test set: they iteratively optimize the prompt for each part (iteratively searching over 700 prompts per part and looping over all parts twice) and select the best prompt, i.e. iterate over $k=1400\times n_{\text{part}}$ prompts and pick the one with the best test performance. For a fair comparison, we perform the same top-k search for our method. We also evaluate on two common prompts: “part of a object” (such as “leg of a chair”), and “part name” (“leg”). As shown in Fig. 7, with a change of prompt from top-k to “part of a object”, PointCLIPV2’s performance drops from 48.47 to 17.42 ( $64\%$ decrease), whereas our method exhibits more robust performance.

Robustness to object orientation. We apply a random rotation by sampling three angles from $-\pi$ to $\pi$ and applying rotations along each of the X, Y, Z axis sequentially. PointCLIPV2’s performance drops $46\%$ whereas our method does not drop but even increases $3\%$ .

Robustness to domain. We constructed ShapeNetPart-V2, a benchmark with objects from the same categories as ShapeNet-Part, but sourced from Objaverse assets. With this domain shift, PointCLIPV2’s performance drops from 48.47 to 21.18 by $56\%$ , whereas our method stays robust with a $20\%$ increase.

Comparison with other methods on other datasets in Tab. 2 and Tab. 3 show similar trends.

5.2.3 Flexibility of text queries

Find3D supports various query types that might occur in-the-wild. As shown in Fig. 8, Find3D can locate hands via different query types – either by the body part “hand” or by the clothing “gloves”. The teddy bear example demonstrates flexibility in query granularity – one can query with “limbs”, a combination of arms and legs, or with “arms” and “legs” separately. For ease of visualization, the scores are min-clipped at 0.

5.2.4 Failure Modes

We observe some limitations of Find3D: 1) Our model voxel-samples point clouds at the 0.02 resolution (after normalization). Fine-grained parts that are not geometrically prominent, such as bottons on a surface, are difficult for a point-cloud-only model like ours. 2) Because the model is trained to be rotational-equivariant, it tends to make symmetric predictions where all symmetric parts have the same label. Fig. 9 demonstrates an example from the PartNet-E dataset. These limitations point to the complementary nature of the 2D and 3D modalities. While lacking in 3D geometry, the 2D modalities can better convey detailed appearance. Combining the image and the point cloud modality is a future direction.

6 Discussions and Conclusions

We present the first scaling study for 3D part segmentation. Key to our approach is a data engine that automatically annotates 3D assets from the internet, which allows us to train the first zero-shot generalist model for open-world 3D part segmentation on any object. Our method not only shows strong generalization, but even outperforms prior methods on the datasets they train on, despite being zero-shot. We show that training object diversity is critical with a scaling analysis. We will release our code, benchmark and model checkpoints. We hope that by providing a diverse benchmark and the first demonstration of open-world 3D part segmentation at scale, we can encourage the community to shift away from customizations for small-scale datasets towards scale and generalization.

7 Acknowledgments

We would like to thank Ilona Demler, Raphi Kang, and Jiacheng Liu for feedback on the paper draft. We also thank Jiacheng Liu for help with the project demo. Ziqi Ma is supported by the Kortschak scholarship. This project is funded in part by NSF #1918655, William H. Hurt Scholars Program, Powell Foundation, Google, and Amazon.

References

Abdelreheem et al. [2023] Ahmed Abdelreheem, Ivan Skorokhodov, Maks Ovsjanikov, and Peter Wonka. Satr: Zero-shot semantic segmentation of 3d shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15166–15179, 2023.
Deitke et al. [2023] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
Kerr et al. [2023] Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19729–19739, 2023.
Kim et al. [2024] Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024.
Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
Li et al. [2022] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.
Liu et al. [2023] Minghua Liu, Yinhao Zhu, Hong Cai, Shizhong Han, Zhan Ling, Fatih Porikli, and Hao Su. Partslip: Low-shot part segmentation for 3d point clouds via pretrained image-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21736–21746, 2023.
Loizou et al. [2023] Marios Loizou, Siddhant Garg, Dmitry Petrov, Melinos Averkiou, and Evangelos Kalogerakis. Cross-shape attention for part segmentation of 3d point clouds. In Computer Graphics Forum, page e14909. Wiley Online Library, 2023.
[10] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d objaverse subset uids. https://github.com/xxlong0/Wonder3D/blob/main/data_lists/lvis_uids_filter_by_vertex.json. Accessed: 2024-11-01.
Long et al. [2024] Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9970–9980, 2024.
Mo et al. [2019] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 909–918, 2019.
Peng et al. [2023] Songyou Peng, Kyle Genova, Chiyu Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser, et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 815–824, 2023.
Qian et al. [2024] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Abed Al Kader Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: revisiting pointnet++ with improved training and scaling strategies. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2024.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Recht et al. [2019] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019.
Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
Schult et al. [2023] Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d: Mask transformer for 3d semantic instance segmentation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 8216–8223. IEEE, 2023.
Shen et al. [2023] William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In 7th Annual Conference on Robot Learning, 2023.
Takmaz et al. [2023] Ayça Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. OpenMask3D: Open-Vocabulary 3D Instance Segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
Umam et al. [2024] Ardian Umam, Cheng-Kun Yang, Min-Hung Chen, Jen-Hui Chuang, and Yen-Yu Lin. Partdistill: 3d shape part segmentation by vision-language model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3470–3479, 2024.
Wu et al. [2024a] Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4840–4851, 2024a.
Wu et al. [2024b] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: grouped vector attention and partition-based pooling. In Proceedings of the 36th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2024b.
Xiang et al. [2020] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, Hao Zhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan, He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, and Hao Su. SAPIEN: A simulated part-based interactive environment. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
Xiang et al. [2024] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3d latents for scalable and versatile 3d generation. arXiv preprint arXiv:2412.01506, 2024.
Yang et al. [2024] Yunhan Yang, Yukun Huang, Yuan-Chen Guo, Liangjun Lu, Xiaoyang Wu, Edmund Y. Lam, Yan-Pei Cao, and Xihui Liu. Sampart3d: Segment any part in 3d objects, 2024.
Yi et al. [2016] Li Yi, Vladimir G Kim, Duygu Ceylan, I-Chao Shen, Mengyan Yan, Hao Su, Cewu Lu, Qixing Huang, Alla Sheffer, and Leonidas Guibas. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (ToG), 35(6):1–12, 2016.
Zhai et al. [2023] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
Zhang et al. [2022] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8552–8562, 2022.
Zhao et al. [2021] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021.
Zhou et al. [2024] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024.
Zhou et al. [2023] Yuchen Zhou, Jiayuan Gu, Xuanlin Li, Minghua Liu, Yunhao Fang, and Hao Su. Partslip++: Enhancing low-shot 3d part segmentation via multi-view instance segmentation and maximum likelihood estimation. arXiv preprint arXiv:2312.03015, 2023.
Zhu et al. [2023] Xiangyang Zhu, Renrui Zhang, Bowei He, Ziyu Guo, Ziyao Zeng, Zipeng Qin, Shanghang Zhang, and Peng Gao. Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2639–2650, 2023.

\thetitle

Supplementary Material

A Additional data engine annotation examples

We provide additional examples of our data engine annotations, both for high-quality examples in Fig. 10 and lower-quality (but still useful) examples in Fig. 11. Upon manual inspection of 50 randomly sampled objects, we observe $76\%$ high-quality examples. The annotations cover diverse objects and descriptions. For example, the body of a fire extinguisher is referred to both as “body” and “cylinder” from different views. The lower-quality examples are still useful for training – they might not have pronounced parts, or contain partial masks (e.g., the baguette example), but the supervision signal still pushes the point features close to the correct semantic embedding (e.g. bread-related concepts). The low-quality cartoon frog contains both correct and incorrect masks. When learning from millions of such labels, the incorrect labels can be “smoothed out” because it’s unlikely that many frogs’ bellies are all incorrectly labeled as “bowtie”.

A Additional qualitative examples of Find3D

We provide additional qualitative results of Find3D in Fig. 12 and Fig. 13. Fig. 12 shows predictions on Objaverse-General from 4 views for each object. Fig. 13 shows predictions on PartObjaverse-Tiny [yang2024sampart3d] and iPhone photos (reconstructed to 3D via off-the-shelf single-image reconstruction method, Trellis [25])). Find3D can segment diverse objects and parts, and can generalize to real-world objects, despite being trained on synthetic data.

A Experiments

A.1 Additional Results

In Tab. 2, Tab. 3, Tab. 4 of the main paper, in order to evaluate all methods on the exact same data, we had to report results on subsets of ShapeNet-Part and PartNet-E because methods like PartSLIP++ and OpenMask3D are slow and infeasible to evaluate on the full test sets (e.g., OpenMask3D would take 628 hours on PartNet-E). Here we provide full-set results for methods that are feasible for full-set evaluation in Tab. 6 and Tab. 7. The ranking of methods on the full sets and the subsets are the same.

ShapeNet-Part. Tab. 5 compares all methods with various prompts, orientations, and data sources (ShapeNet-Part vs. ShapeNetPart-V2, a benchmark of the same object classes as ShapeNet-Part but sourced from Objaverse that we constructed, similar to ImageNetV2 [16]). PointCLIPV2 is trained on this dataset, and other methods are evaluated zero-shot. Find3D performs the best in 8 out of 9 configurations, despite being zero-shot. While Tab. 5 reports metrics on the subset of ShapeNet-Part so that all methods can be evaluated strictly on the same dataset, for methods that are fast enough to evaluate on the full test set (Find3D and PointCLIPV2), we also report the full-set evaluation results in Tab. 6. The full-set metrics are very close to the subset metrics. On the full set, we also see that Find3D performs better in 5 out of 6 settings.

On both the full set and the subset, Find3D, despite being zero-shot on this dataset, is the best-performing method in all configurations except for one—the canonical orientation with test-time top-k prompt searching. In this setting, PointCLIPV2, a method trained on this dataset and designed with test-time prompt searching in mind, performs slightly better. We note that this searching takes over an hour on an A100, which is unrealistic to perform in real applications. Our method is not designed for test-time prompt searching but clearly outperforms all baselines when doing direct inference.

PartNet-E. Tab. 7 shows results on PartNet-E, both on the subset (for all methods) and on the full set (for methods that are fast enough to evaluate on the full set). PartSLIP++, trained on this dataset, achieves the highest performance with the “{part}” prompts, yet is very sensitive to prompt variation. We note that PartSLIP++ also releases category-specific checkpoints, but we use the cross-category checkpoint for fairness of comparison. This dataset is more challenging for our method because many objects contain small parts that are not geometrically or colorfully prominent, such as buttons on a surface with the same color. Nevertheless, we see our method to be more robust to rotation and prompt variation, and clearly outperforms the other baselines that are not trained on this dataset. Furthermore, PartSLIP++ is a slow 2D-3D aggregation method, taking up to 3 minutes per object. Our method is over 30 $\times$ faster.

A Data engine prompts

Fig. 14 shows the prompt we use to obtain object orientations from Gemini. For a given orientation, we render the object in 10 different views, and pass the prompt along with 10 renderings to Gemini. We calculate the percentage of “yes” answers and choose the orientation with the highest “yes” percentage. Fig. 14 also provides some example objects with answers from Gemini. Fig. 15 shows the prompt we use to obtain part names from Gemini, along with some examples.

mIoU(%)	Canonical Orientation			Rotated			Objaverse-ShapeNetPart
	top-k	{part} of a {object}	{part}	top-k	{part} of a {object}	{part}	top-k	{part} of a {object}	{part}
PointCLIPV2	48.666	16.912	20.215	26.111	16.878	18.193	21.177	15.136	17.110
PartSLIP++	–	1.432	6.460	–	0.937	6.034	–	1.542	11.622
OpenMask3D	–	8.938	10.373	–	6.748	14.556	–	15.870	13.768
Find3D (Ours)	43.613	28.386	24.085	43.781	29.637	23.712	50.002	42.151	30.018

Table 5: Detailed results on ShapeNet-Part subset. Shaded cells mean the method is trained on the same dataset (expected higher than white cells), and white cells mean zero-shot evaluation. We evaluate different orientations, query prompts, and data domains (ShapeNet-Part vs. ShapeNetPart-V2). We evaluate on 3 types of prompts: “{part} of a {object}”, “{part}”, and top-k. Top-k prompt reproduces the PointCLIPV2 paper, which runs an iterative search over

1400\times n_{\text{parts}}

prompts per object category to choose the best query text prompts. For fairness of comparison, we follow the same procedure to get top-k prompt metrics, although our method is not designed with prompt searching in mind, and it is not realistic to conduct this slow (

>1

hour on A100) searching process at inference time. Our method, despite being zero-shot on this dataset, has the best performance in 8 out of 9 configurations—all configurations except for the canonical orientation with top-k prompt searching.

mIoU(%)	Canonical Orientation			Rotated
	top-k	{part} of a {object}	{part}	top-k	{part} of a {object}	{part}
PointCLIPV2	48.472	17.471	20.157	26.337	17.034	18.021
Find3D (Ours)	41.517	28.532	23.569	42.734	29.966	23.794

Table 6: Detailed results on ShapeNet-Part full test set. Shaded cells mean the method is trained on the same dataset (expected higher than white cells), and white cells mean zero-shot evaluation. PartSLIP2 and OpenMask3D are too slow and thus infeasible to evaluate on the full test set. The metrics are very close to the subset results in the previous table. Our method, despite being zero-shot on this dataset, has the best performance in 5 out of 6 configurations—all configurations except for the canonical orientation with top-k prompt searching. This searching process takes over an hour on an A100 and our method is not designed for test-time prompt searching.

mIoU(%)	Canonical Orientation				Rotated
	Full		Subset		Full		Subset
	{part} of a {object}	{part}	{part} of a {object}	{part}	{part} of a {object}	{part}	{part} of a {object}	{part}
PointCLIPV2	11.619	9.647	11.275	9.700	10.943	10.261	10.317	10.216
PartSLIP++	–	–	5.123	32.705	–	–	3.866	23.033
OpenMask3D	–	–	12.538	11.242	–	–	11.933	11.673
Find3D (Ours)	17.143	16.211	16.861	16.384	17.703	16.819	17.620	17.164

Table 7: Detailed results on PartNet-E test set. Shaded cells mean the method is trained on the same dataset (expected higher than white cells), and white cells mean zero-shot evaluation. Cells with “-” denote that the method is too slow to be evaluated on the full test set. We evaluate with 2 types of prompts: “{part} of a {object}” and “{part}”. PartSLIP++ achieves the highest performance with the “{part}” prompts, yet the performance drops

84\%

when we vary the query prompt. This dataset is more challenging for our method due to the sparsity of labels and the presence of small parts that are not geometrically or colorfully prominent (e.g., buttons on a surface with the same color). Nevertheless, our method is more robust to rotation and prompt variation, and clearly outperforms the other baselines not trained on this dataset.