Search | arXiv e-print repository

INDUS: Effective and Efficient Language Models for Scientific Applications

Authors: Bishwaranjan Bhattacharjee, Aashka Trivedi, Masayasu Muraoka, Muthukumaran Ramasubramanian, Takuma Udagawa, Iksha Gurung, Rong Zhang, Bharath Dandala, Rahul Ramachandran, Manil Maskey, Kaylin Bugbee, Mike Little, Elizabeth Fancher, Lauren Sanders, Sylvain Costes, Sergi Blanco-Cuaresma, Kelly Lockhart, Thomas Allen, Felix Grezes, Megan Ansdell, Alberto Accomazzi, Yousef El-Kurdi, Davis Wertheimer, Birgit Pfitzmann, Cesar Berrospi Ramis, et al. (9 additional authors not shown)

Abstract: Large language models (LLMs) trained on general domain corpora showed remarkable results on natural language processing (NLP) tasks. However, previous research demonstrated that LLMs trained using domain-focused corpora perform better on specialized tasks. Inspired by this pivotal insight, we developed INDUS, a comprehensive suite of LLMs tailored for the Earth science, biology, physics, heliophysics, planetary sciences and astrophysics domains and trained using curated scientific corpora drawn from diverse data sources. The suite of models includes: (1) an encoder model trained using domain-specific vocabulary and corpora to address natural language understanding tasks, (2) a contrastive-learning-based general text embedding model trained using a diverse set of datasets drawn from multiple sources to address information retrieval tasks and (3) smaller versions of these models created using knowledge distillation techniques to address applications which have latency or resource constraints. We also created three new scientific benchmark datasets, namely CLIMATE-CHANGE-NER (entity-recognition), NASA-QA (extractive QA) and NASA-IR (IR), to accelerate research in these multi-disciplinary fields. Finally, we show that our models outperform both general-purpose encoders (RoBERTa) and existing domain-specific encoders (SciBERT) on these new tasks as well as existing benchmark tasks in the domains of interest.

Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2312.14211 [pdf, ps, other]

Experimenting with Large Language Models and vector embeddings in NASA SciX

Authors: Sergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi, Michael J. Kurtz, Edwin A. Henneken, Kelly E. Lockhart, Felix Grezes, Thomas Allen, Golnaz Shapurian, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Daniel Chivvis, Fernanda de Macedo Alves, Jean-Claude Paquin, Jennifer Bartlett, Mugdha Polimera, Stephanie Jarmak

Abstract: Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think outside the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.

Submitted 21 December, 2023; originally announced December 2023.

Comments: To appear in the proceedings of the 33rd annual international Astronomical Data Analysis Software & Systems (ADASS XXXIII)

arXiv:2312.07743 [pdf, other]

doi 10.1145/3447818.3460373

FULL-W2V: Fully Exploiting Data Reuse for W2V on GPU-Accelerated Systems

Authors: Thomas Randall, Tyler Allen, Rong Ge

Abstract: Word2Vec remains one of the highly impactful innovations in the field of Natural Language Processing (NLP) that represents latent grammatical and syntactical information in human text with dense vectors in a low dimension. Word2Vec has high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated technologies to explore parallelism and improve memory system performance, they struggle to effectively gain throughput on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior works on GPUs, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce access to low memory levels and improve temporal locality. FULL-W2V is capable of reducing accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state-of-the-art by 5.72X on V100 cards with the same embedding quality.
In-depth analysis indicates that the reduction of memory accesses through register and shared memory caching and high-throughput shared memory reduction leads to a significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.

Submitted 12 December, 2023; originally announced December 2023.

Comments: 12 pages, 7 figures, 7 tables, the definitive version of this work is published in the Proceedings of the ACM International Conference on Supercomputing 2021, available at https://doi.org/10.1145/3447818.3460373

ACM Class: I.2.7; D.1.3; G.4

Journal ref: Proceedings of the ACM International Conference on Supercomputing (2021) 455-466

arXiv:2312.01180 [pdf, other]

A Comparative Analysis of Text-to-Image Generative AI Models in Scientific Contexts: A Case Study on Nuclear Power

Authors: Veda Joynt, Jacob Cooper, Naman Bhargava, Katie Vu, O Hwang Kwon, Todd R. Allen, Aditi Verma, Majdi I. Radaideh

Abstract: In this work, we propose and assess the potential of generative artificial intelligence (AI) to generate public engagement around potential clean energy sources. Such an application could increase energy literacy -- an awareness of low-carbon energy sources among the public -- thereby leading to increased participation in decision-making about the future of energy systems. We explore the use of generative AI to communicate technical information about low-carbon energy sources to the general public, specifically in the realm of nuclear energy. We explored 20 AI-powered text-to-image generators and compared their individual performances on general and scientific nuclear-related prompts. Of these models, DALL-E, DreamStudio, and Craiyon demonstrated promising performance in generating relevant images from general-level text related to nuclear topics. However, these models fall short in three crucial ways: (1) they fail to accurately represent technical details of energy systems; (2) they reproduce existing biases surrounding gender and work in the energy sector; and (3) they fail to accurately represent indigenous landscapes -- which have historically been sites of resource extraction and waste deposition for energy industries. This work is performed to motivate the development of specialized generative tools and their captions to improve energy literacy and effectively engage the public with low-carbon energy sources.

Submitted 2 December, 2023; originally announced December 2023.

Comments: 26 pages, 11 figures, 9 tables, submitted for review

arXiv:2302.03976 [pdf, other]

Parma: Confidential Containers via Attested Execution Policies

Authors: Matthew A. Johnson, Stavros Volos, Ken Gordon, Sean T. Allen, Christoph M. Wintersteiger, Sylvan Clebsch, John Starks, Manuel Costa

Abstract: Container-based technologies empower cloud tenants to develop highly portable software and deploy services in the cloud at a rapid pace. Cloud privacy, meanwhile, is important as a large number of container deployments operate on privacy-sensitive data, but challenging due to the increasing frequency and sophistication of attacks. State-of-the-art confidential container-based designs leverage process-based trusted execution environments (TEEs), but face security and compatibility issues that limit their practical deployment. We propose Parma, an architecture that provides lift-and-shift deployment of unmodified containers while providing strong security protection against a powerful attacker who controls the untrusted host and hypervisor. Parma leverages VM-level isolation to execute a container group within a unique VM-based TEE. Besides container integrity and user data confidentiality and integrity, Parma also offers container attestation and execution integrity based on an attested execution policy. Parma execution policies provide an inductive proof over all future states of the container group. This proof, which is established during initialization, forms a root of trust that can be used for secure operations within the container group without requiring any modifications of the containerized workflow itself (aside from the inclusion of the execution policy).
We evaluate Parma on AMD SEV-SNP processors by running a diverse set of workloads demonstrating that workflows exhibit 0-26% additional overhead in performance over running outside the enclave, with a mean 13% overhead on SPEC2017, while requiring no modifications to their program code. Adding execution policies introduces less than 1% additional overhead. Furthermore, we have deployed Parma as the underlying technology driving Confidential Containers on Azure Container Instances.

Submitted 7 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: 12 pages, 6 figures, 2 tables

arXiv:2212.00744 [pdf, ps, other]

Improving astroBERT using Semantic Textual Similarity

Authors: Felix Grezes, Thomas Allen, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Kelly E. Lockhart, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Pavlos Protopapas

Abstract: The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we: - announce the first public release of the astroBERT language model; - show how astroBERT improves over existing public language models on astrophysics specific tasks; - and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.

Submitted 29 November, 2022; originally announced December 2022.

arXiv:2106.02118 [pdf]

doi 10.1101/2021.06.04.21258316

A Prospective Observational Study to Investigate Performance of a Chest X-ray Artificial Intelligence Diagnostic Support Tool Across 12 U.S. Hospitals

Authors: Ju Sun, Le Peng, Taihui Li, Dyah Adila, Zach Zaiman, Genevieve B. Melton, Nicholas Ingraham, Eric Murray, Daniel Boley, Sean Switzer, John L. Burns, Kun Huang, Tadashi Allen, Scott D. Steenburg, Judy Wawira Gichoya, Erich Kummerfeld, Christopher Tignanelli

Abstract: Importance: An artificial intelligence (AI)-based model to predict COVID-19 likelihood from chest x-ray (CXR) findings can serve as an important adjunct to accelerate immediate clinical decision making and improve clinical decision making. Despite significant efforts, many limitations and biases exist in previously developed AI diagnostic models for COVID-19. Utilizing a large set of local and international CXR images, we developed an AI model with high performance on temporal and external validation. Conclusions and Relevance: AI-based diagnostic tools may serve as an adjunct, but not replacement, for clinical decision support of COVID-19 diagnosis, which largely hinges on exposure history, signs, and symptoms. While AI-based tools have not yet reached full diagnostic potential in COVID-19, they may still offer valuable information to clinicians taken into consideration along with clinical signs and symptoms.

Submitted 6 June, 2021; v1 submitted 3 June, 2021; originally announced June 2021.

Comments: Check out the medRxiv version at https://doi.org/10.1101/2021.06.04.21258316 for updates

arXiv:2105.11538 [pdf]

doi 10.1108/JSBED-08-2015-0110

The power of reciprocal knowledge sharing relationships for startup success

Authors: T. J. Allen, P. Gloor, A. Fronzetti Colladon, S. L. Woerner, O. Raz

Abstract: Purpose: The purpose of this paper is to examine the innovative capabilities of biotech start-ups in relation to geographic proximity and knowledge sharing interaction in the R&D network of a major high-tech cluster. Design-methodology-approach: This study compares longitudinal informal communication networks of researchers at biotech start-ups with company patent applications in subsequent years. For a year, senior R&D staff members from over 70 biotech firms located in the Boston biotech cluster were polled and communication information about interaction with peers, universities and big pharmaceutical companies was collected, as well as their geolocation tags. Findings: Location influences the amount of communication between firms, but not their innovation success. Rather, what matters is communication intensity and recollection by others. In particular, there is evidence that rotating leadership - changing between a more active and passive communication style - is a predictor of innovative performance. Practical implications: Expensive real-estate investments can be replaced by maintaining social ties. A more dynamic communication style and more diverse social ties are beneficial to innovation.
Originality-value: Compared to earlier work that has shown a connection between location, network and firm performance, this paper offers a more differentiated view; including a novel measure of communication style, using a unique data set and providing new insights for firms who want to shape their communication patterns to improve innovation, independently of their location.

Submitted 20 May, 2021; originally announced May 2021.

ACM Class: J.4

Journal ref: Journal of Small Business and Enterprise Development 23(3), 636-651 (2016)

arXiv:2009.12849 [pdf, other]

A highly scalable Met Office NERC Cloud model

Authors: Nick Brown, Michèle Weiland, Adrian Hill, Ben Shipway, Chris Maynard, Thomas Allen, Mike Rezny

Abstract: Large Eddy Simulation is a critical modelling tool for scientists investigating atmospheric flows, turbulence and cloud microphysics. Within the UK, the principal LES model used by the atmospheric research community is the Met Office Large Eddy Model (LEM). The LEM was originally developed in the late 1980s using computational techniques and assumptions of the time, which means that it does not scale beyond 512 cores. In this paper we present the Met Office NERC Cloud model, MONC, which is a re-write of the existing LEM. We discuss the software engineering and architectural decisions made in order to develop a flexible, extensible model which the community can easily customise for their own needs. The scalability of MONC is evaluated, along with numerous additional customisations made to further improve performance at large core counts. The result of this work is a model which delivers to the community significant new scientific modelling capability that takes advantage of the current and future generation HPC machines.

Submitted 27 September, 2020; originally announced September 2020.

arXiv:1301.2313 [pdf]

Bayesian Error-Bars for Belief Net Inference

Authors: Tim Van Allen, Russell Greiner, Peter Hooper

Abstract: A Bayesian Belief Network (BN) is a model of a joint distribution over a set of n variables, with a DAG structure to represent the immediate dependencies between the variables, and a set of parameters (aka CPTables) to represent the local conditional probabilities of a node, given each assignment to its parents. In many situations, these parameters are themselves random variables - this may reflect the uncertainty of the domain expert, or may come from a training sample used to estimate the parameter values. The distribution over these "CPtable variables" induces a distribution over the response the BN will return to any "What is Pr(H | E)?" query. This paper investigates the variance of this response, showing first that it is asymptotically normal, then providing its mean and asymptotical variance. We then present an effective general algorithm for computing this variance, which has the same complexity as simply computing the (mean value of) the response itself - i.e., O(n 2^w), where n is the number of variables and w is the effective tree width. Finally, we provide empirical evidence that this algorithm, which incorporates assumptions and approximations, works effectively in practice, given only small samples.

Submitted 10 January, 2013; originally announced January 2013.

Comments: Appears in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI2001)

Report number: UAI-P-2001-PG-522-529

arXiv:1105.1383 [pdf, other]

Topological Considerations for Tuning and Fingering Stringed Instruments

Authors: Terry Allen, Camille Goudeseune

Abstract: We present a formal language for assigning pitches to strings for fingered multi-string instruments, particularly the six-string guitar. Given the instrument's tuning (the strings' open pitches) and the compass of the fingers of the hand stopping the strings, the formalism yields a framework for simultaneously optimizing three things: the mapping of pitches to strings, the choice of instrument tuning, and the key of the composition. Final optimization relies on heuristics idiomatic to the tuning, the particular musical style, and the performer's proficiency.

Submitted 6 May, 2011; originally announced May 2011.

Comments: 8 pages, 3 figures

MSC Class: 14P10

ACM Class: F.4.0; H.5.5; G.2.3

Showing 1–11 of 11 results for author: Allen, T