Extensible Data Skipping

Ta-Shma, Paula; Khazma, Guy; Lushi, Gal; Feder, Oshrit

arXiv:2009.08150 (cs)

[Submitted on 17 Sep 2020 (v1), last revised 15 Nov 2020 (this version, v2)]

Abstract: Data skipping reduces I/O for SQL queries by skipping over irrelevant data objects (files) based on their metadata. We extend this notion by allowing developers to define their own data skipping metadata types and indexes using a flexible API. Our framework is the first to natively support data skipping for arbitrary data types (e.g. geospatial, logs) and queries with User Defined Functions (UDFs). We integrated our framework with Apache Spark and it is now deployed across multiple products/services at IBM. We present our extensible data skipping APIs, discuss index design, and implement various metadata indexes, requiring only around 30 lines of additional code per index. In particular, we implement data skipping for a third-party library with geospatial UDFs and demonstrate speedups of two orders of magnitude. Our centralized metadata approach provides a 3.6x speedup even when compared to queries that are rewritten to exploit Parquet min/max metadata. We demonstrate that extensible data skipping is applicable to a broad class of applications, where user-defined indexes achieve significant speedups and cost savings with very low development cost.
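The core idea of the abstract can be illustrated with a small sketch. The Python below is not the framework's API and is not taken from the paper; every name in it (`MinMax`, `build_min_max`, `may_contain_greater_than`, `collect_metadata`, `select_objects`, the file names and the `temp` column) is hypothetical. It assumes a simple min/max index: metadata is collected once per data object, and at query time an object is read only if its metadata cannot rule out a match.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names here are hypothetical; this is not the API described in the paper.


@dataclass
class MinMax:
    """Per-object metadata: smallest and largest value of one column."""
    lo: float
    hi: float


def build_min_max(column: str) -> Callable[[List[dict]], MinMax]:
    """Return an index builder that summarizes one column of a data object."""
    def build(rows: List[dict]) -> MinMax:
        values = [row[column] for row in rows]
        return MinMax(min(values), max(values))
    return build


def may_contain_greater_than(meta: MinMax, threshold: float) -> bool:
    """Return False only when no row in the object can satisfy col > threshold."""
    return meta.hi > threshold


def collect_metadata(objects: Dict[str, List[dict]], builder) -> Dict[str, MinMax]:
    """Indexing step: compute metadata once for every data object."""
    return {name: builder(rows) for name, rows in objects.items()}


def select_objects(metadata: Dict[str, MinMax], may_match) -> List[str]:
    """Query step: keep only objects whose metadata cannot rule out a match."""
    return [name for name, meta in metadata.items() if may_match(meta)]


# Three illustrative data objects (files), each holding a few rows.
objects = {
    "part-0.parquet": [{"temp": 12.0}, {"temp": 18.5}],
    "part-1.parquet": [{"temp": 25.0}, {"temp": 31.2}],
    "part-2.parquet": [{"temp": 5.1}, {"temp": 9.9}],
}

metadata = collect_metadata(objects, build_min_max("temp"))

# Query: SELECT ... WHERE temp > 20. Only part-1 can contain matching rows.
relevant = select_objects(metadata, lambda m: may_contain_greater_than(m, 20.0))
print(relevant)  # ['part-1.parquet']
```

In this sketch, supporting a new index type means supplying a different builder and a different metadata check, which is the kind of extension point the abstract describes. The check must never return False for an object that holds a matching row, since a skipped object is never read.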

Subjects: Databases (cs.DB)
Cite as: arXiv:2009.08150 [cs.DB] (or arXiv:2009.08150v2 [cs.DB] for this version)
https://doi.org/10.48550/arXiv.2009.08150

Submission history

From: Guy Khazma
[v1] Thu, 17 Sep 2020 08:34:51 UTC (1,196 KB)
[v2] Sun, 15 Nov 2020 14:16:02 UTC (1,135 KB)
