Search | arXiv e-print repository

Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli

Abstract: Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in… ▽ More Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines. △ Less

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2402.14158 [pdf, other]

TOOLVERIFIER: Generalization to New Tools via Self-Verification

Authors: Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo Shang, Jane Dwivedi-Yu

Abstract: Teaching language models to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguish… ▽ More Teaching language models to use tools is an important milestone towards building general assistants, but remains an open problem. While there has been significant progress on learning to use specific tools via fine-tuning, language models still struggle with learning how to robustly use new tools from only a few demonstrations. In this work we introduce a self-verification method which distinguishes between close candidates by self-asking contrastive questions during (1) tool selection; and (2) parameter generation. We construct synthetic, high-quality, self-generated data for this goal using Llama-2 70B, which we intend to release publicly. Extensive experiments on 4 tasks from the ToolBench benchmark, consisting of 17 unseen tools, demonstrate an average improvement of 22% over few-shot baselines, even in scenarios where the distinctions between candidate tools are finely nuanced. △ Less

Submitted 13 March, 2024; v1 submitted 21 February, 2024; originally announced February 2024.

arXiv:2401.08281 [pdf, other]

The Faiss library

Authors: Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, Hervé Jégou

Abstract: Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and trans… ▽ More Vector databases typically manage large collections of embedding vectors. Currently, AI applications are growing rapidly, and so is the number of embeddings that need to be stored and indexed. The Faiss library is dedicated to vector similarity search, a core functionality of vector databases. Faiss is a toolkit of indexing methods and related primitives used to search, cluster, compress and transform vectors. This paper describes the trade-off space of vector search and the design principles of Faiss in terms of structure, approach to optimization and interfacing. We benchmark key features of the library and discuss a few selected applications to highlight its broad applicability. △ Less

Submitted 6 September, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2310.10638 [pdf, other]

In-context Pretraining: Language Modeling Beyond Document Boundaries

Authors: Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Gergely Szilvasy, Rich James, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis

Abstract: Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next d… ▽ More Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%). △ Less

Submitted 24 June, 2024; v1 submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.01352 [pdf, other]

RA-DIT: Retrieval-Augmented Dual Instruction Tuning

Authors: Xi Victoria Lin, Xilun Chen, Mingda Chen, Weijia Shi, Maria Lomeli, Rich James, Pedro Rodriguez, Jacob Kahn, Gergely Szilvasy, Mike Lewis, Luke Zettlemoyer, Scott Yih

Abstract: Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction… ▽ More Retrieval-augmented language models (RALMs) improve performance by accessing long-tail and up-to-date knowledge from external data stores, but are challenging to build. Existing approaches require either expensive retrieval-specific modifications to LM pre-training or use post-hoc integration of the data store that leads to suboptimal performance. We introduce Retrieval-Augmented Dual Instruction Tuning (RA-DIT), a lightweight fine-tuning methodology that provides a third option by retrofitting any LLM with retrieval capabilities. Our approach operates in two distinct fine-tuning steps: (1) one updates a pre-trained LM to better use retrieved information, while (2) the other updates the retriever to return more relevant results, as preferred by the LM. By fine-tuning over tasks that require both knowledge utilization and contextual awareness, we demonstrate that each stage yields significant performance improvements, and using both leads to additional gains. Our best model, RA-DIT 65B, achieves state-of-the-art performance across a range of knowledge-intensive zero- and few-shot learning benchmarks, significantly outperforming existing in-context RALM approaches by up to +8.9% in 0-shot setting and +1.4% in 5-shot setting on average. △ Less

Submitted 6 May, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: v4: ICLR 2024 camera-ready version

arXiv:2302.07842 [pdf, ps, other]

Augmented Language Models: a Survey

Authors: Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom

Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demo… ▽ More This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demonstrations. While adhering to a standard missing tokens prediction objective, such augmented LMs can use various, possibly non-parametric external modules to expand their context processing ability, thus departing from the pure language modeling paradigm. We therefore refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks and even outperforming most regular LMs on several benchmarks. In this work, after reviewing current advance in ALMs, we conclude that this new research direction has the potential to address common limitations of traditional LMs such as interpretability, consistency, and scalability issues. △ Less

Submitted 15 February, 2023; originally announced February 2023.

arXiv:2302.04761 [pdf, other]

Toolformer: Language Models Can Teach Themselves to Use Tools

Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom

Abstract: Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of… ▽ More Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q\&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities. △ Less

Submitted 9 February, 2023; originally announced February 2023.

arXiv:2209.13331 [pdf, other]

EditEval: An Instruction-Based Benchmark for Text Improvements

Authors: Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni

Abstract: Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform th… ▽ More Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform these skills and the ability to edit remains sparse. This work presents EditEval: An instruction-based, benchmark and evaluation suite that leverages high-quality existing and new datasets for automatic evaluation of editing capabilities such as making text more cohesive and paraphrasing. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA, particularly when neutralizing and updating information. Our analysis also shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models. Through the release of this benchmark and a publicly available leaderboard challenge, we hope to unlock future research in developing models capable of iterative and more controllable editing. △ Less

Submitted 27 September, 2022; originally announced September 2022.

arXiv:2208.03299 [pdf, other]

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Authors: Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave

Abstract: Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is uncl… ▽ More Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is unclear whether they work in few-shot settings. In this work we present Atlas, a carefully designed and pre-trained retrieval augmented language model able to learn knowledge intensive tasks with very few training examples. We perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and study the impact of the content of the document index, showing that it can easily be updated. Notably, Atlas reaches over 42% accuracy on Natural Questions using only 64 examples, outperforming a 540B parameters model by 3% despite having 50x fewer parameters. △ Less

Submitted 16 November, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

arXiv:2207.06220 [pdf, other]

Improving Wikipedia Verifiability with AI

Authors: Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, Sebastian Riedel

Abstract: Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not supp… ▽ More Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not support a given claim or become obsolete once the original source is updated or deleted. Hence, maintaining and improving the quality of Wikipedia references is an important challenge and there is a pressing need for better tools to assist humans in this effort. Here, we show that the process of improving references can be tackled with the help of artificial intelligence (AI). We develop a neural network based system, called Side, to identify Wikipedia citations that are unlikely to support their claims, and subsequently recommend better ones from the web. We train this model on existing Wikipedia references, therefore learning from the contributions and combined wisdom of thousands of Wikipedia editors. Using crowd-sourcing, we observe that for the top 10% most likely citations to be tagged as unverifiable by our system, humans prefer our system's suggested alternatives compared to the originally cited reference 70% of the time. To validate the applicability of our system, we built a demo to engage with the English-speaking Wikipedia community and find that Side's first citation recommendation collects over 60% more preferences than existing Wikipedia citations for the same top 10% most likely unverifiable claims according to Side. Our results indicate that an AI-based system could be used, in tandem with humans, to improve the verifiability of Wikipedia. More generally, we hope that our work can be used to assist fact checking efforts and increase the general trustworthiness of information online. △ Less

Submitted 8 July, 2022; originally announced July 2022.

arXiv:2112.03186 [pdf, other]

Statistical implications of relaxing the homogeneous mixing assumption in time series Susceptible-Infectious-Removed models

Authors: Luis D. J. Martinez Lomeli, Michelle N. Ngo, Jon Wakefield, Babak Shahbaba, Vladimir N. Minin

Abstract: Infectious disease epidemiologists routinely fit stochastic epidemic models to time series data to elucidate infectious disease dynamics, evaluate interventions, and forecast epidemic trajectories. To improve computational tractability, many approximate stochastic models have been proposed. In this paper, we focus on one class of such approximations -- time series Susceptible-Infectious-Removed (T… ▽ More Infectious disease epidemiologists routinely fit stochastic epidemic models to time series data to elucidate infectious disease dynamics, evaluate interventions, and forecast epidemic trajectories. To improve computational tractability, many approximate stochastic models have been proposed. In this paper, we focus on one class of such approximations -- time series Susceptible-Infectious-Removed (TSIR) models. Infectious disease modeling often starts with a homogeneous mixing assumption, postulating that the rate of disease transmission is proportional to a product of the numbers of susceptible and infectious individuals. One popular way to relax this assumption proceeds by raising the number of susceptible and/or infectious individuals to some positive powers. We show that when this technique is used within the TSIR models they cannot be interpreted as approximate SIR models, which has important implications for TSIR-based statistical inference. Our simulation study shows that TSIR-based estimates of infection and mixing rates are systematically biased in the presence of non-homogeneous mixing, suggesting that caution is needed when interpreting TSIR model parameter estimates when this class of models relaxes the homogeneous mixing assumption. △ Less

Submitted 6 December, 2021; originally announced December 2021.

Comments: 16 pages of the main text, 6 figures and 3 tables in the main text

arXiv:2004.09065 [pdf, other]

Optimal Experimental Design for Mathematical Models of Hematopoiesis

Authors: Luis Martinez Lomeli, Abdon Iniguez, Babak Shahbaba, John S Lowengrub, Vladimir Minin

Abstract: The hematopoietic system has a highly regulated and complex structure in which cells are organized to successfully create and maintain new blood cells. Feedback regulation is crucial to tightly control this system, but the specific mechanisms by which control is exerted are not completely understood. In this work, we aim to uncover the underlying mechanisms in hematopoiesis by conducting perturbat… ▽ More The hematopoietic system has a highly regulated and complex structure in which cells are organized to successfully create and maintain new blood cells. Feedback regulation is crucial to tightly control this system, but the specific mechanisms by which control is exerted are not completely understood. In this work, we aim to uncover the underlying mechanisms in hematopoiesis by conducting perturbation experiments, where animal subjects are exposed to an external agent in order to observe the system response and evolution. Developing a proper experimental design for these studies is an extremely challenging task. To address this issue, we have developed a novel Bayesian framework for optimal design of perturbation experiments. We model the numbers of hematopoietic stem and progenitor cells in mice that are exposed to a low dose of radiation. We use a differential equations model that accounts for feedback and feedforward regulation. A significant obstacle is that the experimental data are not longitudinal, rather each data point corresponds to a different animal. This model is embedded in a hierarchical framework with latent variables that capture unobserved cellular population levels. We select the optimum design based on the amount of information gain, measured by the Kullback-Leibler divergence between the probability distributions before and after observing the data. We evaluate our approach using synthetic and experimental data. We show that a proper design can lead to better estimates of model parameters even with relatively few subjects. Additionally, we demonstrate that the model parameters show a wide range of sensitivities to design options. Our method should allow scientists to find the optimal design by focusing on their specific parameters of interest and provide insight to hematopoiesis. Our approach can be extended to more complex models where latent components are used. △ Less

Submitted 30 June, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

arXiv:2001.05895 [pdf, other]

Masking schemes for universal marginalisers

Authors: Divya Gautam, Maria Lomeli, Kostis Gourgoulias, Daniel H. Thompson, Saurabh Johri

Abstract: We consider the effect of structure-agnostic and structure-dependent masking schemes when training a universal marginaliser (arXiv:1711.00695) in order to learn conditional distributions of the form $P(x_i |\mathbf x_{\mathbf b})$, where $x_i$ is a given random variable and $\mathbf x_{\mathbf b}$ is some arbitrary subset of all random variables of the generative model of interest. In other words,… ▽ More We consider the effect of structure-agnostic and structure-dependent masking schemes when training a universal marginaliser (arXiv:1711.00695) in order to learn conditional distributions of the form $P(x_i |\mathbf x_{\mathbf b})$, where $x_i$ is a given random variable and $\mathbf x_{\mathbf b}$ is some arbitrary subset of all random variables of the generative model of interest. In other words, we mimic the self-supervised training of a denoising autoencoder, where a dataset of unlabelled data is used as partially observed input and the neural approximator is optimised to minimise reconstruction loss. We focus on studying the underlying process of the partially observed data---how good is the neural approximator at learning all conditional distributions when the observation process at prediction time differs from the masking process during training? We compare networks trained with different masking schemes in terms of their predictive performance and generalisation properties. △ Less

Submitted 16 January, 2020; originally announced January 2020.

Comments: To be published in Proceedings of the 2nd Symposium on Advances in Approximate Bayesian Inference, 2019

arXiv:1910.07474 [pdf, other]

Universal Marginaliser for Deep Amortised Inference for Probabilistic Programs

Authors: Robert Walecki, Kostis Gourgoulias, Adam Baker, Chris Hart, Chris Lucas, Max Zwiessele, Albert Buchard, Maria Lomeli, Yura Perov, Saurabh Johri

Abstract: Probabilistic programming languages (PPLs) are powerful modelling tools which allow to formalise our knowledge about the world and reason about its inherent uncertainty. Inference methods used in PPL can be computationally costly due to significant time burden and/or storage requirements; or they can lack theoretical guarantees of convergence and accuracy when applied to large scale graphical mode… ▽ More Probabilistic programming languages (PPLs) are powerful modelling tools which allow to formalise our knowledge about the world and reason about its inherent uncertainty. Inference methods used in PPL can be computationally costly due to significant time burden and/or storage requirements; or they can lack theoretical guarantees of convergence and accuracy when applied to large scale graphical models. To this end, we present the Universal Marginaliser (UM), a novel method for amortised inference, in PPL. We show how combining samples drawn from the original probabilistic program prior with an appropriate augmentation method allows us to train one neural network to approximate any of the corresponding conditional marginal distributions, with any separation into latent and observed variables, and thus amortise the cost of inference. Finally, we benchmark the method on multiple probabilistic programs, in Pyro, with different model structure. △ Less

Submitted 16 October, 2019; originally announced October 2019.

arXiv:1910.05692 [pdf, other]

Deep Markov Chain Monte Carlo

Authors: Babak Shahbaba, Luis Martinez Lomeli, Tian Chen, Shiwei Lan

Abstract: We propose a new computationally efficient sampling scheme for Bayesian inference involving high dimensional probability distributions. Our method maps the original parameter space into a low-dimensional latent space, explores the latent space to generate samples, and maps these samples back to the original space for inference. While our method can be used in conjunction with any dimension reducti… ▽ More We propose a new computationally efficient sampling scheme for Bayesian inference involving high dimensional probability distributions. Our method maps the original parameter space into a low-dimensional latent space, explores the latent space to generate samples, and maps these samples back to the original space for inference. While our method can be used in conjunction with any dimension reduction technique to obtain the latent space, and any standard sampling algorithm to explore the low-dimensional space, here we specifically use a combination of auto-encoders (for dimensionality reduction) and Hamiltonian Monte Carlo (HMC, for sampling). To this end, we first run an HMC to generate some initial samples from the original parameter space, and then use these samples to train an auto-encoder. Next, starting with an initial state, we use the encoding part of the autoencoder to map the initial state to a point in the low-dimensional latent space. Using another HMC, this point is then treated as an initial state in the latent space to generate a new state, which is then mapped to the original space using the decoding part of the auto-encoder. The resulting point can be treated as a Metropolis-Hasting (MH) proposal, which is either accepted or rejected. While the induced dynamics in the parameter space is no longer Hamiltonian, it remains time reversible, and the Markov chain could still converge to the canonical distribution using a volume correction term. Dropping the volume correction step results in convergence to an approximate but reasonably accurate distribution. The empirical results based on several high-dimensional problems show that our method could substantially reduce the computational cost of Bayesian inference. △ Less

Submitted 13 October, 2019; originally announced October 2019.

arXiv:1811.04727 [pdf, other]

Universal Marginalizer for Amortised Inference and Embedding of Generative Models

Authors: Robert Walecki, Albert Buchard, Kostis Gourgoulias, Chris Hart, Maria Lomeli, A. K. W. Navarro, Max Zwiessele, Yura Perov, Saurabh Johri

Abstract: Probabilistic graphical models are powerful tools which allow us to formalise our knowledge about the world and reason about its inherent uncertainty. There exist a considerable number of methods for performing inference in probabilistic graphical models; however, they can be computationally costly due to significant time burden and/or storage requirements; or they lack theoretical guarantees of c… ▽ More Probabilistic graphical models are powerful tools which allow us to formalise our knowledge about the world and reason about its inherent uncertainty. There exist a considerable number of methods for performing inference in probabilistic graphical models; however, they can be computationally costly due to significant time burden and/or storage requirements; or they lack theoretical guarantees of convergence and accuracy when applied to large scale graphical models. To this end, we propose the Universal Marginaliser Importance Sampler (UM-IS) -- a hybrid inference scheme that combines the flexibility of a deep neural network trained on samples from the model and inherits the asymptotic guarantees of importance sampling. We show how combining samples drawn from the graphical model with an appropriate masking function allows us to train a single neural network to approximate any of the corresponding conditional marginal distributions, and thus amortise the cost of inference. We also show that the graph embeddings can be applied for tasks such as: clustering, classification and interpretation of relationships between the nodes. Finally, we benchmark the method on a large graph (>1000 nodes), showing that UM-IS outperforms sampling-based methods by a large margin while being computationally efficient. △ Less

Submitted 12 November, 2018; originally announced November 2018.

arXiv:1808.00385 [pdf, ps, other]

A Note on the Maximum Rectilinear Crossing Number of Spiders

Authors: Joshua Fallon, Kirsten Hogenson, Lauren Keough, Mario Lomelí, Marcus Schaefer, Pablo Soberón

Abstract: The maximum rectilinear crossing number of a graph $G$ is the maximum number of crossings in a good straight-line drawing of $G$ in the plane. In a good drawing any two edges intersect in at most one point (counting endpoints), no three edges have an interior point in common, and edges do not contain vertices in their interior. A spider is a subdivision of $K_{1,k}$. We provide both upper and lowe… ▽ More The maximum rectilinear crossing number of a graph $G$ is the maximum number of crossings in a good straight-line drawing of $G$ in the plane. In a good drawing any two edges intersect in at most one point (counting endpoints), no three edges have an interior point in common, and edges do not contain vertices in their interior. A spider is a subdivision of $K_{1,k}$. We provide both upper and lower bounds for the maximum rectilinear crossing number of spiders. While there are not many results on the maximum rectilinear crossing numbers of infinite families of graphs, our methods can be used to find the exact maximum rectilinear crossing number of $K_{1,k}$ where each edge is subdivided exactly once. This is a first step towards calculating the maximum rectilinear crossing number of arbitrary trees. △ Less

Submitted 20 August, 2021; v1 submitted 1 August, 2018; originally announced August 2018.

Comments: 8 pages

MSC Class: 05C10; 05C05; 68R10

arXiv:1807.00400 [pdf, other]

Antithetic and Monte Carlo kernel estimators for partial rankings

Authors: Maria Lomeli, Mark Rowland, Arthur Gretton, Zoubin Ghahramani

Abstract: In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complet… ▽ More In the modern age, rankings data is ubiquitous and it is useful for a variety of applications such as recommender systems, multi-object tracking and preference learning. However, most rankings data encountered in the real world is incomplete, which prevents the direct application of existing modelling tools for complete rankings. Our contribution is a novel way to extend kernel methods for complete rankings to partial rankings, via consistent Monte Carlo estimators for Gram matrices: matrices of kernel values between pairs of observations. We also present a novel variance reduction scheme based on an antithetic variate construction between permutations to obtain an improved estimator for the Mallows kernel. The corresponding antithetic kernel estimator has lower variance and we demonstrate empirically that it has a better performance in a variety of Machine Learning tasks. Both kernel estimators are based on extending kernel mean embeddings to the embedding of a set of full rankings consistent with an observed partial ranking. They form a computationally tractable alternative to previous approaches for partial rankings data. An overview of the existing kernels and metrics for permutations is also provided. △ Less

Submitted 25 July, 2018; v1 submitted 1 July, 2018; originally announced July 2018.

arXiv:1706.03779 [pdf, other]

General Latent Feature Models for Heterogeneous Datasets

Authors: Isabel Valera, Melanie F. Pradier, Maria Lomeli, Zoubin Ghahramani

Abstract: Latent feature modeling allows capturing the latent structure responsible for generating the observed properties of a set of objects. It is often used to make predictions either for new values of interest or missing information in the original data, as well as to perform data exploratory analysis. However, although there is an extensive literature on latent feature models for homogeneous datasets,… ▽ More Latent feature modeling allows capturing the latent structure responsible for generating the observed properties of a set of objects. It is often used to make predictions either for new values of interest or missing information in the original data, as well as to perform data exploratory analysis. However, although there is an extensive literature on latent feature models for homogeneous datasets, where all the attributes that describe each object are of the same (continuous or discrete) nature, there is a lack of work on latent feature modeling for heterogeneous databases. In this paper, we introduce a general Bayesian nonparametric latent feature model suitable for heterogeneous datasets, where the attributes describing each object can be either discrete, continuous or mixed variables. The proposed model presents several important properties. First, it accounts for heterogeneous data while keeping the properties of conjugate models, which allow us to infer the model in linear time with respect to the number of objects and attributes. Second, its Bayesian nonparametric nature allows us to automatically infer the model complexity from the data, i.e., the number of features necessary to capture the latent structure in the data. Third, the latent features in the model are binary-valued variables, easing the interpretability of the obtained latent features in data exploratory analysis. We show the flexibility of the proposed model by solving both prediction and data analysis tasks on several real-world datasets. Moreover, a software package of the GLFM is publicly available for other researcher to use and improve it. △ Less

Submitted 8 March, 2018; v1 submitted 12 June, 2017; originally announced June 2017.

Comments: Software library available at https://github.com/ivaleraM/GLFM

arXiv:1702.08781 [pdf, other]

General Bayesian inference schemes in infinite mixture models

Authors: Maria Lomeli

Abstract: Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks… ▽ More Bayesian statistical models allow us to formalise our knowledge about the world and reason about our uncertainty, but there is a need for better procedures to accurately encode its complexity. One way to do so is through compositional models, which are formed by combining blocks consisting of simpler models. One can increase the complexity of the compositional model by either stacking more blocks or by using a not-so-simple model as a building block. This thesis is an example of the latter. One first aim is to expand the choice of Bayesian nonparametric (BNP) blocks for constructing tractable compositional models. So far, most of the models that have a Bayesian nonparametric component use a Dirichlet Process or a Pitman-Yor process because of the availability of tractable and compact representations. This thesis shows how to overcome certain intractabilities in order to obtain analogous compact representations for the class of Poisson-Kingman priors which includes the Dirichlet and Pitman-Yor processes. A major impediment to the widespread use of Bayesian nonparametric building blocks is that inference is often costly, intractable or difficult to carry out. This is an active research area since dealing with the model's infinite dimensional component forbids the direct use of standard simulation-based methods. The main contribution of this thesis is a variety of inference schemes that tackle this problem: Markov chain Monte Carlo and Sequential Monte Carlo methods, which are exact inference schemes since they target the true posterior. The contributions of this thesis, in a larger context, provide general purpose exact inference schemes in the flavour or probabilistic programming: the user is able to choose from a variety of models, focusing only on the modelling part. Indeed, if the wide enough class of Poisson-Kingman priors is used as one of our blocks, this objective is achieved. △ Less

Submitted 28 February, 2017; originally announced February 2017.

Comments: Doctoral dissertation, University College London

arXiv:1509.07376 [pdf, other]

A hybrid sampler for Poisson-Kingman mixture models

Authors: Maria Lomeli, Stefano Favaro, Yee Whye Teh

Abstract: This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requir… ▽ More This paper concerns the introduction of a new Markov Chain Monte Carlo scheme for posterior sampling in Bayesian nonparametric mixture models with priors that belong to the general Poisson-Kingman class. We present a novel compact way of representing the infinite dimensional component of the model such that while explicitly representing this infinite component it has less memory and storage requirements than previous MCMC schemes. We describe comparative simulation results demonstrating the efficacy of the proposed MCMC algorithm against existing marginal and conditional MCMC samplers. △ Less

Submitted 24 September, 2015; originally announced September 2015.

Journal ref: NIPS 2015

arXiv:1411.1699

The $2nd-$convex hull of every optimal rectilinear drawing of $K_{n}$ is a triangle

Authors: J. Leaños, M. Lomeli, M. Ramírez-Ibáñez, L. M. Rivera-Martínez

Abstract: A rectilinear drawing of a graph $G$ is optimal if it has the smallest number of crossings among all rectilinear drawings of $G$. In this paper it is shown that for $n\geq 8$, the second convex hull of every optimal rectilinear drawing of the complete graph $K_n$ is a triangle. A rectilinear drawing of a graph $G$ is optimal if it has the smallest number of crossings among all rectilinear drawings of $G$. In this paper it is shown that for $n\geq 8$, the second convex hull of every optimal rectilinear drawing of the complete graph $K_n$ is a triangle. △ Less

Submitted 14 June, 2016; v1 submitted 2 November, 2014; originally announced November 2014.

Comments: This paper has been withdrawn by the authors due to a crucial error in the proof of the main result

MSC Class: 05C10; 68R10

arXiv:1407.4211 [pdf, other]

doi 10.1080/10618600.2015.1110526

A marginal sampler for $σしぐま$-Stable Poisson-Kingman mixture models

Authors: María Lomelí, Stefano Favaro, Yee Whye Teh

Abstract: We investigate the class of $σしぐま$-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs which encompasses most of the the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process and the normalized generalized G… ▽ More We investigate the class of $σしぐま$-stable Poisson-Kingman random probability measures (RPMs) in the context of Bayesian nonparametric mixture modeling. This is a large class of discrete RPMs which encompasses most of the the popular discrete RPMs used in Bayesian nonparametrics, such as the Dirichlet process, Pitman-Yor process, the normalized inverse Gaussian process and the normalized generalized Gamma process. We show how certain sampling properties and marginal characterizations of $σしぐま$-stable Poisson-Kingman RPMs can be usefully exploited for devising a Markov chain Monte Carlo (MCMC) algorithm for making inference in Bayesian nonparametric mixture modeling. Specifically, we introduce a novel and efficient MCMC sampling scheme in an augmented space that has a fixed number of auxiliary variables per iteration. We apply our sampling scheme for a density estimation and clustering tasks with unidimensional and multidimensional datasets, and we compare it against competing sampling schemes. △ Less

Submitted 24 September, 2015; v1 submitted 16 July, 2014; originally announced July 2014.

Comments: New algorithmic performance comparisons were added

Showing 1–23 of 23 results for author: Lomeli, M