Open technologies - Open data
Datasets are the foundation of AI models, and their content holds the key to a better understanding of how models work and how to make them more useful, efficient, and safe. We’re committed to creating and sharing open datasets that can help move the field forward.
For a full list of our available datasets, visit us on Hugging Face.
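For readers who prefer to browse programmatically, here is a minimal sketch using the huggingface_hub client; it assumes our datasets are published under the allenai organization on the Hub.

```python
from huggingface_hub import HfApi

# List dataset repositories published under the allenai organization on the Hub.
# The organization name "allenai" is an assumption; adjust it to browse another namespace.
api = HfApi()
for dataset in api.list_datasets(author="allenai", limit=20):
    print(dataset.id)
```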
Featured dataset - Dolma
Dolma is a pretraining dataset built from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
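As a quick way to peek at the data, a minimal sketch using the Hugging Face datasets library is shown below; the repo id allenai/dolma, the text field name, and plain streaming access are assumptions that may vary across Dolma releases.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it in full; Dolma is far too large
# to materialize locally for a quick look. Depending on the release, a config
# name or trust_remote_code=True may also be required.
dolma = load_dataset("allenai/dolma", split="train", streaming=True)

for doc in dolma.take(3):
    # The "text" field is an assumption about the document schema.
    print(doc.get("text", "")[:200])
```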
The WildChat Dataset is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.
1,616 diverse NLP tasks spanning 76 distinct task types, along with expert-written instructions, used to measure how well NLP models generalize to a variety of unseen tasks when given clear guidance.
A framework that helps language models improve their ability to follow natural language instructions by using a model's own generations to create a large collection of instructional data.
A large corpus of structured full text for English-language open-access academic papers. It is the largest publicly available collection of machine-readable academic text, comprising over 10M documents, and aims to facilitate research and development of text-mining tools for academic text.
A collection of over 200M paper titles, abstracts, citations, and other metadata of open-access papers from the Semantic Scholar Academic Graph.
A challenge dataset of questions that are trivial for humans (>95% accuracy) but that state-of-the-art models struggle with (<48% accuracy), created through a collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers.
WinoGrande is a collection of 44K problems inspired by the Winograd Schema Challenge but adjusted to improve both scale and robustness against dataset-specific bias. Each problem is formulated as a fill-in-the-blank task with two options, and choosing the correct option for a given sentence requires commonsense reasoning.
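To make the task format concrete, here is an illustrative sketch; the example sentence is invented rather than drawn from the dataset, and the field names (sentence, option1, option2, answer) assume the layout commonly used for WinoGrande on the Hugging Face Hub.

```python
# An invented item in the WinoGrande fill-in-the-blank format: a sentence with a
# blank ("_") and two candidate fillers, only one of which fits.
item = {
    "sentence": "The trophy would not fit in the suitcase because the _ was too small.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "2",  # "1" or "2", indicating the correct option
}

def fill_blank(item: dict, option: str) -> str:
    """Substitute the chosen option into the blank."""
    return item["sentence"].replace("_", item[f"option{option}"])

print(fill_blank(item, item["answer"]))
# The trophy would not fit in the suitcase because the suitcase was too small.
```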
137K instruction-following demonstrations for 54 scientific literature understanding tasks. The tasks cover five essential scientific literature categories and span five domains.
Instruction data for writing paragraph-level answers to NLP research questions grounded in multiple documents, collected via 234 interactive sessions of NLP experts instructing different language models and culminating in 1.2K interaction turns.
2.1K LLM-generated hierarchical organizations of medical studies on 472 research topics, with expert-provided corrections for a subset of 100 topics. This data can be used to assess and improve LLM-based tools to assist literature review.
1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales to support the development of scientific claim verification systems. It has been used in shared tasks such as SCIVER and retrieval benchmarks such as BEIR.
5.4K extremely short (<30 words) expert-written summaries of 3.2K scientific papers, used to develop models for single-document summarization and to build the initial version of the TLDR feature on Semantic Scholar.
7,787 genuine grade-school-level, multiple-choice science questions partitioned into a Challenge Set and an Easy Set, along with a corpus of over 14 million science sentences relevant to the task. The dataset is offered as a challenge to the machine reasoning community.
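For a quick look at the question format, the sketch below loads the Challenge Set with the Hugging Face datasets library; the repo id allenai/ai2_arc, the config name ARC-Challenge, and the field names are assumptions about the Hub layout rather than guaranteed identifiers.

```python
from datasets import load_dataset

# Load the ARC Challenge Set; pass "ARC-Easy" instead for the Easy Set.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

question = arc[0]
print(question["question"])
for label, text in zip(question["choices"]["label"], question["choices"]["text"]):
    print(f"  ({label}) {text}")
print("answer:", question["answerKey"])
```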
A QA dataset that tests comprehensive understanding of paragraphs. In this crowdsourced, adversarially created, 96K-question benchmark, a system must resolve multiple references in a question, map them onto a paragraph, and perform discrete operations over them (such as addition, counting, or sorting).
5K information-seeking questions over 1.5K scientific papers. Each question is asked by an expert researcher and answered by a different expert researcher using supporting evidence from the paper's full text. Qasper has been included in long-context benchmarks such as SCROLLS.
20K biomedical literature review summaries synthesizing information from over 470K studies. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and it is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.
3,386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic content. Most figures in scientific papers lack alt text, which harms accessibility; this dataset can be used to build tools for understanding and describing figures, helping increase the prevalence of alt text.