-
Axion Dark Matter eXperiment around 3.3 μeV with Dine-Fischler-Srednicki-Zhitnitsky Discovery Ability
Authors:
C. Bartram,
C. Boutan,
T. Braine,
J. H. Buckley,
T. J. Caligiure,
G. Carosi,
A. S. Chou,
C. Cisneros,
John Clarke,
E. J. Daw,
N. Du,
L. D. Duffy,
T. A. Dyson,
C. Gaikwad,
J. R. Gleason,
C. Goodman,
M. Goryachev,
M. Guzzetti,
C. Hanretty,
E. Hartman,
A. T. Hipp,
J. Hoffman,
M. Hollister,
R. Khatiwada,
S. Knirck,
et al. (24 additional authors not shown)
Abstract:
We report the results of a QCD axion dark matter search with discovery ability for Dine-Fischler-Srednicki-Zhitnitsky (DFSZ) axions using an axion haloscope. Sub-Kelvin noise temperatures are reached with an ultra-low-noise Josephson parametric amplifier cooled by a dilution refrigerator. This work excludes (with a 90% confidence level) DFSZ axions with masses between 3.27 and 3.34 μeV, assuming a standard halo model with a local energy density of 0.45 GeV/cc composed entirely of axions.
Submitted 27 August, 2024;
originally announced August 2024.
-
RTF-Q: Unsupervised domain adaptation based retraining-free quantization network
Authors:
Nanyang Du,
Chen Tang,
Yuan Meng,
Zhi Wang
Abstract:
Performing unsupervised domain adaptation on resource-constrained edge devices is a significant task. Although existing research allows edge devices to use subnets with different computational budgets for inference, they often require expensive pre-training and do not consider the issues of parameter precision redundancy in the model, which is not conducive to the deployment of the model on edge devices. In this paper, we introduce a ReTraining-Free Quantized (RTF-Q) network based on unsupervised domain adaptation, featuring quantized subnets of varying computational costs that can operate on devices with dynamically changing computation budgets. Our network has three switchable dimensions: width (number of channels), input resolution, and quantization bit-width. Specifically, we choose subnet dimensions that have minimal impact on network performance and then directly load the official weight files without requiring expensive and time-consuming pre-training on Imagenet-1K. To further reduce the network's computational load and memory usage, we use quantization-aware training, reducing the BitOPs of full-precision networks by at least 1/16. We propose a training method called SandwichQ for multiple quantization bit widths, which can efficiently train multiple quantization subnets. By training in multiple quantization bit-width spaces simultaneously and using the proposed SandwichQ rule, we achieve better network performance compared to using a single quantization bit-width alone. Experimental results show that our method achieves classification accuracy comparable to SOTA methods on various UDA tasks, significantly reducing network size and computational overhead. Code will be available at https://github.com/dunanyang/RTF-Q.
Submitted 11 August, 2024;
originally announced August 2024.
-
Apple Intelligence Foundation Language Models
Authors:
Tom Gunter,
Zirui Wang,
Chong Wang,
Ruoming Pang,
Andy Narayanan,
Aonan Zhang,
Bowen Zhang,
Chen Chen,
Chung-Cheng Chiu,
David Qiu,
Deepak Gopinath,
Dian Ang Yap,
Dong Yin,
Feng Nan,
Floris Weers,
Guoli Yin,
Haoshuo Huang,
Jianyu Wang,
Jiarui Lu,
John Peebles,
Ke Ye,
Mark Lee,
Nan Du,
Qibin Chen,
Quentin Keunebroek,
et al. (130 additional authors not shown)
Abstract:
We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development.
Submitted 29 July, 2024;
originally announced July 2024.
-
Deep State-Space Generative Model For Correlated Time-to-Event Predictions
Authors:
Yuan Xue,
Denny Zhou,
Nan Du,
Andrew M. Dai,
Zhen Xu,
Kun Zhang,
Claire Cui
Abstract:
Capturing the inter-dependencies among multiple types of clinically-critical events is critical not only to accurate future event prediction, but also to better treatment planning. In this work, we propose a deep latent state-space generative model to capture the interactions among different types of correlated clinical events (e.g., kidney failure, mortality) by explicitly modeling the temporal dynamics of patients' latent states. Based on these learned patient states, we further develop a new general discrete-time formulation of the hazard rate function to estimate the survival distribution of patients with significantly improved accuracy. Extensive evaluations over real EMR data show that our proposed model compares favorably to various state-of-the-art baselines. Furthermore, our method also uncovers meaningful insights about the latent correlations among mortality and different types of organ failures.
Submitted 27 July, 2024;
originally announced July 2024.
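The discrete-time hazard formulation mentioned in the abstract above links per-interval hazard rates to a survival curve. The following is a minimal sketch of that generic relation, S(k) = (1 - h_1)(1 - h_2)...(1 - h_k), assuming each hazard is the probability of the event in interval k given survival up to k. The function name and the example hazards are hypothetical and are not taken from the paper, whose model learns the hazards from latent patient states.

```python
from typing import List


def survival_from_hazards(hazards: List[float]) -> List[float]:
    """Return the survival probability after each interval.

    S(k) is the running product of (1 - h_j) over all intervals j <= k.
    """
    survival = []
    alive = 1.0  # probability of having survived so far
    for h in hazards:
        if not 0.0 <= h <= 1.0:
            raise ValueError("each hazard must lie in [0, 1]")
        alive *= 1.0 - h
        survival.append(alive)
    return survival


# Hypothetical per-interval hazards, for illustration only.
curve = survival_from_hazards([0.1, 0.2, 0.5])
```

The curve is non-increasing by construction, which is one reason hazard parameterizations are convenient for survival estimation.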
-
Learning to Select the Best Forecasting Tasks for Clinical Outcome Prediction
Authors:
Yuan Xue,
Nan Du,
Anne Mottram,
Martin Seneviratne,
Andrew M. Dai
Abstract:
We propose to meta-learn a self-supervised patient trajectory forecast learning rule by meta-training on a meta-objective that directly optimizes the utility of the patient representation over the subsequent clinical outcome prediction. This meta-objective directly targets the usefulness of a representation generated from unlabeled clinical measurement forecast for later supervised tasks.
The meta-learned can then be directly used in target risk prediction, and the limited available samples can be used to further fine-tune the model. The effectiveness of our approach is tested on a real open-source patient EHR dataset, MIMIC-III. We demonstrate that our attention-based patient state representation approach achieves much better performance for predicting target risk with low resources compared with both direct supervised learning and pretraining with all-observation trajectory forecast.
Submitted 27 July, 2024;
originally announced July 2024.
-
History-Aware Planning for Risk-free Autonomous Navigation on Unknown Uneven Terrain
Authors:
Yinchuan Wang,
Nianfei Du,
Yongsen Qin,
Xiang Zhang,
Rui Song,
Chaoqun Wang
Abstract:
It is challenging for the mobile robot to achieve autonomous and mapless navigation in the unknown environment with uneven terrain. In this study, we present a layered and systematic pipeline. At the local level, we maintain a tree structure that is dynamically extended with the navigation. This structure unifies the planning with the terrain identification. Besides, it contributes to explicitly identifying the hazardous areas on uneven terrain. In particular, certain nodes of the tree are consistently kept to form a sparse graph at the global level, which records the history of the exploration. A series of subgoals that can be obtained in the tree and the graph are utilized for leading the navigation. To determine a subgoal, we develop an evaluation method whose input elements can be efficiently obtained on the layered structure. We conduct both simulation and real-world experiments to evaluate the developed method and its key modules. The experimental results demonstrate the effectiveness and efficiency of our method. The robot can travel through the unknown uneven region safely and reach the target rapidly without a preconstructed map.
Submitted 3 June, 2024;
originally announced June 2024.
-
Inducing ferroelectricity in NH$_4$I and NH$_4$Br via partial replacement of protons by deuterons
Authors:
Miao Miao Zhao,
Lei Meng,
Yi Yang Xu,
Na Du,
Fei Yen
Abstract:
While all of the polymorphs of NH$_4$I and NH$_4$Br are non-polar, a reversible electric polarization is established in the ordered $γ$ phases of (NH$_4$)$_{0.73}$(ND$_4$)$_{0.27}$I and (NH$_4$)$_{0.84}$(ND$_4$)$_{0.16}$Br (where D is $^2$H) via $dc$ electric fields. The presence of two groups of orbital magnetic moments appears to be responsible for the asymmetric lattice distortions. Our findings provide an alternative pathway for hydrogen-based materials to potentially add a ferroelectric functionality.
Submitted 24 May, 2024;
originally announced May 2024.
-
Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training
Authors:
Xianzhi Du,
Tom Gunter,
Xiang Kong,
Mark Lee,
Zirui Wang,
Aonan Zhang,
Nan Du,
Ruoming Pang
Abstract:
Mixture-of-Experts (MoE) enjoys performance gains by increasing model capacity while keeping computation cost constant. When comparing MoE to dense models, prior work typically adopts the following setting: 1) use FLOPs or activated parameters as a measure of model complexity; 2) train all models to the same number of tokens. We argue that this setting favors MoE as FLOPs and activated parameters do not accurately measure the communication overhead in sparse layers, leading to a larger actual training budget for MoE. In this work, we revisit the settings by adopting step time as a more accurate measure of model complexity, and by determining the total compute budget under the Chinchilla compute-optimal settings. To efficiently run MoE on modern accelerators, we adopt a 3D sharding method that keeps the dense-to-MoE step time increase within a healthy range. We evaluate MoE and dense LLMs on a set of nine 0-shot and two 1-shot English tasks, as well as MMLU 5-shot and GSM8K 8-shot across three model scales at 6.4B, 12.6B, and 29.6B. Experimental results show that even under these settings, MoE models consistently outperform dense LLMs on the speed-accuracy trade-off curve with meaningful gaps. Our full model implementation and sharding strategy have been released at https://github.com/apple/axlearn.
Submitted 28 June, 2024; v1 submitted 23 May, 2024;
originally announced May 2024.
-
Knowledge Graph Reasoning with Self-supervised Reinforcement Learning
Authors:
Ying Ma,
Owen Burns,
Mingqiu Wang,
Gang Li,
Nan Du,
Laurent El Shafey,
Liqiang Wang,
Izhak Shafran,
Hagen Soltau
Abstract:
Reinforcement learning (RL) is an effective method of finding reasoning pathways in incomplete knowledge graphs (KGs). To overcome the challenges of a large action space, a self-supervised pre-training method is proposed to warm up the policy network before the RL training stage. To alleviate the distributional mismatch issue in general self-supervised RL (SSRL), in our supervised learning (SL) stage, the agent selects actions based on the policy network and learns from generated labels; this self-generation of labels is the intuition behind the name self-supervised. With this training framework, the information density of our SL objective is increased and the agent is prevented from getting stuck with the early rewarded paths. Our self-supervised RL (SSRL) method improves the performance of RL by pairing it with the wide coverage achieved by SL during pretraining, since the breadth of the SL objective makes it infeasible to train an agent with that alone. We show that our SSRL model meets or exceeds current state-of-the-art results on all Hits@k and mean reciprocal rank (MRR) metrics on four large benchmark KG datasets. This SSRL method can be used as a plug-in for any RL architecture for a KGR task. We adopt two RL architectures, i.e., MINERVA and MultiHopKG, as our baseline RL models and experimentally show that our SSRL model consistently outperforms both baselines on all four of these KG reasoning tasks. Full code for the paper is available at https://github.com/owenonline/Knowledge-Graph-Reasoning-with-Self-supervised-Reinforcement-Learning.
Submitted 22 May, 2024;
originally announced May 2024.
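The Hits@k and mean reciprocal rank (MRR) metrics named in the abstract above are standard ranking measures. The sketch below shows their usual definitions, assuming each query yields the 1-based rank of the correct entity. The function names and the example ranks are illustrative only and do not come from the paper or its repository.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose correct answer appears in the top k (1-based ranks)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (1-based ranks)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the correct entity for four queries.
example_ranks = [1, 3, 2, 10]
h1 = hits_at_k(example_ranks, 1)
h3 = hits_at_k(example_ranks, 3)
mrr = mean_reciprocal_rank(example_ranks)
```

Hits@k rewards only whether the answer falls inside the cutoff, while MRR also credits how high it ranks, so the two are typically reported together.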
-
Self-playing Adversarial Language Game Enhances LLM Reasoning
Authors:
Pengyu Cheng,
Tianhao Hu,
Han Xu,
Zhisong Zhang,
Yong Dai,
Lei Han,
Nan Du
Abstract:
We explore the self-play training procedure of large language models (LLMs) in a two-player adversarial language game called Adversarial Taboo. In this game, an attacker and a defender communicate around a target word only visible to the attacker. The attacker aims to induce the defender to speak the target word unconsciously, while the defender tries to infer the target word from the attacker's utterances. To win the game, both players should have sufficient knowledge about the target word and high-level reasoning ability to infer and express in this information-reserved conversation. Hence, we are curious about whether LLMs' reasoning ability can be further enhanced by self-play in this adversarial language game (SPAG). With this goal, we select several open-source LLMs and let each act as the attacker and play with a copy of itself as the defender on an extensive range of target words. Through reinforcement learning on the game outcomes, we observe that the LLMs' performances uniformly improve on a broad range of reasoning benchmarks. Furthermore, iteratively adopting this self-play process can continuously promote LLMs' reasoning abilities. The code is at https://github.com/Linear95/SPAG.
Submitted 23 May, 2024; v1 submitted 16 April, 2024;
originally announced April 2024.
-
Determining the chemical composition of diamagnetic mixed solids via measurements of the magnetic susceptibility
Authors:
Miao Miao Zhao,
Yang Yang,
Na Du,
Yu Ying Zhu,
Peng Ren,
Fei Yen
Abstract:
Mixed solid compounds are employed in a vast array of applications so an accurate determination of their chemical compositions is of crucial importance. All current characterization methods require specially-treated samples so the availability of a more practical method with similar accuracy should alleviate the quantification process. In this work, we show how the doping concentration $δ$ (or isotope concentration) of a mixed solid compound in powdered form, where both parent compounds are diamagnetic, can be obtained from the measurement of the mass magnetization. We exploit the additive nature of the molar magnetic susceptibility $χ_{Mol}$ and molar mass to construct two equations with the same two unknowns in the $χ_{Mol}$ vs. $δ$ space to simultaneously solve $χ_{Mol}$ and $δ$ of a mixed solid. Eight examples are provided to show the wide applicability of this method: NH$_{4(1-δ)}$D$_{4δ}$Br (where D = $^2$H), NH$_4$I$_{1-δ}$Br$_δ$, (NH$_4$H$_2$)$_{1-δ}$(ND$_4$D$_2$)$_δ$PO$_4$, C$_{48}$H$_{22+6δ}$Br$_{6(1-δ)}$O$_{32}$Zr$_6$, [creatine]$_{1-δ}$[$_D$-glucose]$_δ$, [$_L$-glutamic acid]$_{1-δ}$[$_L$-leucine]$_δ$, [terephthalic acid]$_{1-δ}$[trimesic acid]$_δ$ and [p-terphenyl]$_{1-δ}$[triphenylphosphine]$_δ$. Experimental errors of ~1.2% were obtained for $δ$ from average sample masses of 16.6 mg in powdered form rendering the presented approach an attractive choice for characterizing the ratios of mixed solids.
Submitted 2 April, 2024;
originally announced April 2024.
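The two-equation idea in the abstract above can be illustrated with a short calculation. This is a sketch under stated assumptions, not the authors' procedure: it assumes the molar susceptibility and the molar mass are both linear in $δ$ between the two parent compounds A and B, and that the measurement yields a mass susceptibility chi_g, so that the molar susceptibility equals chi_g times the molar mass. All names and numbers are hypothetical.

```python
def solve_delta(chi_g: float, chi_a: float, chi_b: float,
                m_a: float, m_b: float) -> float:
    """Solve for the concentration delta of parent compound B.

    Assumed relations (both additive in delta):
        chi_mol(delta) = (1 - delta) * chi_a + delta * chi_b
        M(delta)       = (1 - delta) * m_a   + delta * m_b
    Measurement constraint: chi_mol(delta) = chi_g * M(delta).
    """
    denominator = (chi_b - chi_a) - chi_g * (m_b - m_a)
    if denominator == 0.0:
        raise ValueError("the two lines do not intersect at a unique delta")
    return (chi_g * m_a - chi_a) / denominator


# Hypothetical parent-compound values (molar mass in g/mol,
# molar susceptibility in cm^3/mol), for illustration only.
M_A, M_B = 100.0, 120.0
CHI_A, CHI_B = -50e-6, -70e-6

# Build a synthetic "measurement" for a known delta, then recover it.
true_delta = 0.25
molar_mass = (1 - true_delta) * M_A + true_delta * M_B
chi_mol = (1 - true_delta) * CHI_A + true_delta * CHI_B
chi_g_measured = chi_mol / molar_mass

recovered_delta = solve_delta(chi_g_measured, CHI_A, CHI_B, M_A, M_B)
```

Once the recovered value of $δ$ is known, the molar susceptibility of the mixture follows from either assumed relation.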
-
Human Detection in Realistic Through-the-Wall Environments using Raw Radar ADC Data and Parametric Neural Networks
Authors:
Wei Wang,
Naike Du,
Yuchao Guo,
Chao Sun,
Jingyang Liu,
Rencheng Song,
Xiuzhu Ye
Abstract:
The radar signal processing algorithm is one of the core components in through-wall radar human detection technology. Traditional algorithms (e.g., DFT and matched filtering) struggle to adaptively handle low signal-to-noise ratio echo signals in challenging and dynamic real-world through-wall application environments, which becomes a major bottleneck in the system. In this paper, we introduce an end-to-end through-wall radar human detection network (TWP-CNN), which takes raw radar Analog-to-Digital Converter (ADC) signals without any preprocessing as input. We replace the conventional radar signal processing flow with the proposed DFT-based adaptive feature extraction (DAFE) module. This module employs learnable parameterized 3D complex convolution layers to extract superior feature representations from ADC signals, which is beyond the limitation of traditional preprocessing methods. Additionally, by embedding phase information from radar data within the network and employing multi-task learning, a more accurate detection is achieved. Finally, due to the absence of through-wall radar datasets containing raw ADC data, we gathered a realistic through-wall (RTW) dataset using our in-house developed through-wall radar system. We trained and validated our proposed method on this dataset to confirm its effectiveness and superiority in real through-wall detection scenarios.
Submitted 20 March, 2024;
originally announced March 2024.
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Authors:
Brandon McKinzie,
Zhe Gan,
Jean-Philippe Fauconnier,
Sam Dodge,
Bowen Zhang,
Philipp Dufter,
Dhruti Shah,
Xianzhi Du,
Futang Peng,
Floris Weers,
Anton Belyi,
Haotian Zhang,
Karanjeet Singh,
Doug Kang,
Ankur Jain,
Hongyu Hè,
Max Schwarzer,
Tom Gunter,
Xiang Kong,
Aonan Zhang,
Jianyu Wang,
Chong Wang,
Nan Du,
Tao Lei,
Sam Wiseman,
et al. (7 additional authors not shown)
Abstract:
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, including both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
Submitted 18 April, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models
Authors:
Anchun Gui,
Jian Li,
Yong Dai,
Nan Du,
Han Xiao
Abstract:
Tool-augmented large language models (LLMs) are attracting widespread attention for accessing up-to-date knowledge and alleviating hallucination issues. Nowadays, advanced closed-source LLMs (e.g., ChatGPT) have demonstrated surprising tool-usage capabilities through prompting and in-context learning techniques. To empower the capabilities of open-source LLMs (e.g., LLaMA) in manipulating tools, current efforts focus on either template-driven or token-triggered tool-usage. However, the former hampers LLMs' flexibility to address diverse user queries due to constrained tool interactions, while the latter limits the generalizability when engaging with new tools, since tool-usage learning is based on task- and tool-specific datasets. To alleviate these concerns, in this paper, we propose a decision-aware and generalizable tool-usage framework (DEER). Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline, thereby inspiring the decision-making awareness of LLMs under diverse scenarios. Meanwhile, we propose a novel tool sampling strategy to enhance the generalizability of LLMs over unseen tools. Extensive experiments demonstrate that our proposed DEER is effective and significantly outperforms baselines across various datasets.
Submitted 28 August, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Improving Explainable Object-induced Model through Uncertainty for Automated Vehicles
Authors:
Shihong Ling,
Yue Wan,
Xiaowei Jia,
Na Du
Abstract:
The rapid evolution of automated vehicles (AVs) has the potential to provide safer, more efficient, and comfortable travel options. However, these systems face challenges regarding reliability in complex driving scenarios. Recent explainable AV architectures neglect crucial information related to inherent uncertainties while providing explanations for actions. To overcome such challenges, our study builds upon the "object-induced" model approach that prioritizes the role of objects in scenes for decision-making and integrates uncertainty assessment into the decision-making process using an evidential deep learning paradigm with a Beta prior. Additionally, we explore several advanced training strategies guided by uncertainty, including uncertainty-guided data reweighting and augmentation. Leveraging the BDD-OIA dataset, our findings underscore that the model, through these enhancements, not only offers a clearer comprehension of AV decisions and their underlying reasoning but also surpasses existing baselines across a broad range of scenarios.
Submitted 23 February, 2024;
originally announced February 2024.
-
Are Large Language Models Good Prompt Optimizers?
Authors:
Ruotian Ma,
Xiaolei Wang,
Xin Zhou,
Jian Li,
Nan Du,
Tao Gui,
Qi Zhang,
Xuanjing Huang
Abstract:
LLM-based Automatic Prompt Optimization, which typically utilizes LLMs as Prompt Optimizers to self-reflect and refine prompts, has shown promising performance in recent studies. Despite the success, the underlying mechanism of this approach remains unexplored, and the true effectiveness of LLMs as Prompt Optimizers requires further validation. In this work, we conducted a comprehensive study to uncover the actual mechanism of LLM-based Prompt Optimization. Our findings reveal that the LLM optimizers struggle to identify the true causes of errors during reflection, tending to be biased by their own prior knowledge rather than genuinely reflecting on the errors. Furthermore, even when the reflection is semantically valid, the LLM optimizers often fail to generate appropriate prompts for the target models with a single prompt refinement step, partly due to the unpredictable behaviors of the target models. Based on the observations, we introduce a new "Automatic Behavior Optimization" paradigm, which directly optimizes the target model's behavior in a more controllable manner. We hope our study can inspire new directions for automatic prompt optimization development.
Submitted 3 February, 2024;
originally announced February 2024.
-
Improving the Imaging Performance of Microwave Imaging Systems by Exploiting Virtual Antennas
Authors:
Xinhui Zhang,
Naike Du,
Jing Wang,
Andrea Massa,
Xiuzhu Ye
Abstract:
Starting from the observation that the correlation coefficient defined by the scattered field data tested by two adjacent antennas decreases with the noise, it turns out that the imaging performance can be improved by adding non-redundant scattered field information through more measuring antennas. However, adding more measuring antennas faces practical challenges such as the limited antenna space, high experimental expenses, and a prolonged data collection time. Therefore, the frequency-domain zero-padding (FDZP) interpolation method is proposed to acquire scattered field data on more virtual antennas. To process the data, a linear inversion algorithm based on the modified Born approximation (MBA) and the nonlinear subspace-based optimization method (SOM) are used to image scatterers of moderate and high contrasts, respectively. The effectiveness and the reliability of the proposed approach are then assessed against synthetic data, semi-experimental data from full-wave simulation software, and experimental data.
Submitted 5 January, 2024; v1 submitted 29 December, 2023;
originally announced December 2023.
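Frequency-domain zero-padding, as named in the abstract above, is a general interpolation technique: take the discrete Fourier transform of the measured samples, insert zeros at the high-frequency end of the spectrum, and transform back onto a denser grid. The sketch below is a generic pure-Python version, assuming the antennas sample a periodic quantity uniformly (for example, evenly spaced on a circle) and that the number of real antennas is odd, which avoids splitting a Nyquist bin. The function names and the test field are hypothetical and this is not the authors' implementation.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex]) -> List[complex]:
    """Plain O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def fdzp_interpolate(samples: Sequence[complex], m: int) -> List[complex]:
    """Interpolate N periodic samples onto M >= N points by spectral zero-padding."""
    n = len(samples)
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd number of samples")
    if m < n:
        raise ValueError("m must be at least the number of samples")
    spectrum = dft(samples)
    half = (n - 1) // 2
    padded = [0j] * m
    for k in range(half + 1):       # zero and positive frequencies
        padded[k] = spectrum[k]
    for j in range(1, half + 1):    # negative frequencies
        padded[m - j] = spectrum[n - j]
    # Inverse DFT of length m; dividing by n (not m) preserves amplitudes.
    return [sum(padded[k] * cmath.exp(2j * math.pi * k * i / m)
                for k in range(m)) / n
            for i in range(m)]


# Hypothetical field sampled at 5 real antennas, interpolated to 10 positions.
field = [math.cos(2 * math.pi * j / 5) for j in range(5)]
dense = fdzp_interpolate(field, 10)
```

With 10 output points, every second value coincides with an original sample and the values in between play the role of the virtual antennas.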
-
Axion Dark Matter eXperiment: Run 1A Analysis Details
Authors:
C. Boutan,
B. H. LaRoque,
E. Lentz,
N. S. Oblath,
M. S. Taubman,
J. Tedeschi,
J. Yang,
A. M. Jones,
T. Braine,
N. Crisosto,
L. J Rosenberg,
G. Rybka,
D. Will,
D. Zhang,
S. Kimes,
R. Ottens,
C. Bartram,
D. Bowring,
R. Cervantes,
A. S. Chou,
S. Knirck,
D. V. Mitchell,
A. Sonnenschein,
W. Wester,
R. Khatiwada,
et al. (28 additional authors not shown)
Abstract:
The ADMX collaboration gathered data for its Run 1A axion dark matter search from January to June 2017, scanning with an axion haloscope over the frequency range 645-680 MHz (2.66-2.81 μeV in axion mass) at DFSZ sensitivity. The resulting axion search found no axion-like signals comprising all the dark matter in the form of a virialized galactic halo over the entire frequency range, implying lower bound exclusion limits at or below DFSZ coupling at the 90% confidence level. This paper presents expanded details of the axion search analysis of Run 1A, including review of relevant experimental systems, data-taking operations, preparation and interpretation of raw data, axion search methodology, candidate handling, and final axion limits.
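The frequency and mass ranges quoted in this abstract are related by the standard relation m c^2 = h f. The sketch below shows that conversion; the function names are illustrative and do not come from the paper, and small differences in the last digit can arise from rounding.

```python
# Planck constant in eV*s, derived from the exact SI values of h and e.
PLANCK_EV_S = 4.135667696e-15


def frequency_mhz_to_mass_uev(frequency_mhz):
    """Convert a cavity frequency in MHz to the matching axion mass in micro-eV."""
    frequency_hz = frequency_mhz * 1e6
    mass_ev = PLANCK_EV_S * frequency_hz  # m c^2 = h f
    return mass_ev * 1e6


def mass_uev_to_frequency_mhz(mass_uev):
    """Convert an axion mass in micro-eV to the matching frequency in MHz."""
    mass_ev = mass_uev * 1e-6
    return mass_ev / PLANCK_EV_S / 1e6


if __name__ == "__main__":
    for f_mhz in (645.0, 680.0):
        print(f"{f_mhz:.0f} MHz -> {frequency_mhz_to_mass_uev(f_mhz):.3f} micro-eV")
```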
Submitted 27 December, 2023;
originally announced December 2023.
-
On Diversified Preferences of Large Language Model Alignment
Authors:
Dun Zeng,
Yong Dai,
Pengyu Cheng,
Longyue Wang,
Tianhao Hu,
Wanshun Chen,
Nan Du,
Zenglin Xu
Abstract:
Aligning large language models (LLMs) with human preferences has been recognized as the key to improving LLMs' interaction quality. However, in this pluralistic world, human preferences can be diversified due to annotators' different tastes, which hinders the effectiveness of LLM alignment methods. This paper presents the first quantitative analysis of commonly used human feedback datasets to investigate the impact of diversified preferences on reward modeling. Our analysis reveals a correlation between the calibration performance of reward models (RMs) and the alignment performance of LLMs. We find that diversified preference data negatively affect the calibration performance of RMs on human-shared preferences, such as Harmless&Helpful, thereby impairing the alignment performance of LLMs. To address the ineffectiveness, we propose a novel Multi-Objective Reward learning method (MORE) to enhance the calibration performance of RMs on shared preferences. We validate our findings by experiments on three models and five human preference datasets. Our method significantly improves the prediction calibration of RMs, leading to better alignment of the Alpaca-7B model with Harmless&Helpful preferences. Furthermore, the connection between reward calibration and preference alignment performance suggests that calibration error can be adopted as a key metric for evaluating RMs. The open-source code and data are available at https://github.com/dunzeng/MORE.
Submitted 17 April, 2024; v1 submitted 12 December, 2023;
originally announced December 2023.
-
Non-iterative Methods in Inhomogeneous Background Inverse Scattering Imaging Problem Assisted by Swin Transformer Network
Authors:
Naike Du,
Tiantian Yin,
Jing Wang,
Rencheng Song,
Kuiwen Xu,
Bingyuan Liang,
Sheng Sun,
Xiuzhu Ye
Abstract:
A deep learning-assisted inversion method is proposed to solve the inhomogeneous background imaging problem. Three non-iterative methods, namely the distorted-Born (DB) major current coefficients method, the DB modified Born approximation method, and the DB connection method, are introduced to address the inhomogeneous background inverse scattering problem. These methods retain the multiple scattering information by utilizing the major current obtained through singular value decomposition of the Green's function and the scattered field, without resorting to optimization techniques. As a result, the proposed methods offer improved reconstruction resolution and accuracy for unknown objects embedded in inhomogeneous backgrounds, surpassing the backpropagation scheme (BPS) and Born approximation (BA) method that disregard the multiple scattering effect. To further enhance the resolution and accuracy of the reconstruction, a Shifted-Window (Swin) transformer network is employed for capturing super-resolution information in the images. The attention mechanism incorporated in the shifted window facilitates global interactions between objects, thereby enhancing the performance of the inhomogeneous background imaging algorithm while reducing computational complexity. Moreover, an adaptive training method is proposed to enhance the generalization ability of the network. The effectiveness of the proposed methods is demonstrated through both synthetic data and experimental data. Notably, super-resolution imaging is achieved with quasi real-time speed, indicating promising application potential for the proposed algorithms.
Submitted 11 December, 2023;
originally announced December 2023.
-
Power-balanced Memristive Cryptographic Implementation Against Side Channel Attacks
Authors:
Ziang Chen,
Li-Wei Chen,
Xianyue Zhao,
Kefeng Li,
Heidemarie Schmidt,
Ilia Polian,
Nan Du
Abstract:
Memristors, as emerging nano-devices, offer promising performance and exhibit rich electrical dynamic behavior. Having already found success in applications such as neuromorphic and in-memory computing, they are now being explored for cryptographic implementations. In this study, we present a novel power-balanced hiding strategy utilizing memristor groups to conceal power consumption in cryptographic logic circuits. Our approach ensures consistent power costs of all 16 logic gates in the Complementary-Resistive-Switching-with-Reading (CRS-R) logic family during writing and reading cycles regardless of Logic Input Variable (LIV) values. By constructing hiding groups, we enable an effective power balance in each gate hiding group. Furthermore, experimental validation of our strategy includes the implementation of a cryptographic construction, xor4SBox, using NOR gates. The circuit constructions with and without the hiding strategy undergo t-test analysis, confirming the significant improvement achieved with our approach. Our work presents a substantial advancement in power-balanced hiding methods, offering enhanced security and efficiency in logic circuits.
Submitted 2 December, 2023;
originally announced December 2023.
-
Learning to Skip for Language Modeling
Authors:
Dewen Zeng,
Nan Du,
Tao Wang,
Yuanzhong Xu,
Tao Lei,
Zhifeng Chen,
Claire Cui
Abstract:
Overparameterized large-scale language models show impressive generalization performance in in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the complexity or importance of the input data. We argue that in language model pretraining, a variable amount of computation should be assigned to different tokens, and this can be efficiently achieved via a simple routing mechanism. Unlike conventional early stopping techniques, where tokens can exit early only at early layers, we propose a more general method that dynamically skips the execution of a layer (or module) for any input token with a binary router. In our extensive evaluation across 24 NLP tasks, we demonstrate that the proposed method can significantly improve the 1-shot performance compared to other competitive baselines at only a mild extra cost for inference.
Submitted 26 November, 2023;
originally announced November 2023.
-
Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game
Authors:
Pengyu Cheng,
Yifan Yang,
Jian Li,
Yong Dai,
Tianhao Hu,
Peixin Cao,
Nan Du,
Xiaolong Li
Abstract:
Human preference alignment is essential to improve the interaction quality of large language models (LLMs). Existing alignment methods depend on manually annotated preference data to guide the LLM optimization directions. However, continuously updating LLMs for alignment raises a distribution gap between model-generated samples and human-annotated responses, hindering training effectiveness. To mitigate this issue, previous methods require additional preference annotation on newly generated samples to adapt to the shifted distribution, which consumes a large amount of annotation resources. Targeting more efficient human preference optimization, we propose an Adversarial Preference Optimization (APO) framework, in which the LLM and the reward model update alternately via a min-max game. Through adversarial training, the reward model can adapt to the shifted generation distribution of the LLM without any additional annotation. With comprehensive experiments, we find the proposed adversarial training framework further enhances existing alignment baselines in terms of LLM helpfulness and harmlessness. The code is at https://github.com/Linear95/APO.
Submitted 3 June, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Non-Virialized Axion Search Sensitive to Doppler Effects in the Milky Way Halo
Authors:
C. Bartram,
T. Braine,
R. Cervantes,
N. Crisosto,
N. Du,
C. Goodman,
M. Guzzetti,
C. Hanretty,
S. Lee,
G. Leum,
L. J. Rosenberg,
G. Rybka,
J. Sinnis,
D. Zhang,
M. H. Awida,
D. Bowring,
A. S. Chou,
M. Hollister,
S. Knirck,
A. Sonnenschein,
W. Wester,
R. Khatiwada,
J. Brodsky,
G. Carosi,
L. D. Duffy
, et al. (31 additional authors not shown)
Abstract:
The Axion Dark Matter eXperiment (ADMX) has previously excluded Dine-Fischler-Srednicki-Zhitnitsky (DFSZ) axions between 680-790 MHz under the assumption that the dark matter is described by the isothermal halo model. However, the precise nature of the velocity distribution of dark matter is still unknown, and alternative models have been proposed. We report the results of a non-virialized axion search over the mass range 2.81-3.31 μeV, corresponding to the frequency range 680-800 MHz. This analysis marks the most sensitive search for non-virialized axions sensitive to Doppler effects in the Milky Way Halo to date. Accounting for frequency shifts due to the detector's motion through the Galaxy, we exclude cold flow relic axions with a velocity dispersion of order 10^-7 c with 95% confidence.
Submitted 13 November, 2023;
originally announced November 2023.
-
TDPP: Two-Dimensional Permutation-Based Protection of Memristive Deep Neural Networks
Authors:
Minhui Zou,
Zhenhua Zhu,
Tzofnat Greenberg-Toledo,
Orian Leitersdorf,
Jiang Li,
Junlong Zhou,
Yu Wang,
Nan Du,
Shahar Kvatinsky
Abstract:
The execution of deep neural network (DNN) algorithms suffers from significant bottlenecks due to the separation of the processing and memory units in traditional computer systems. Emerging memristive computing systems introduce an in situ approach that overcomes this bottleneck. The non-volatility of memristive devices, however, may expose the DNN weights stored in memristive crossbars to potential theft attacks. Therefore, this paper proposes a two-dimensional permutation-based protection (TDPP) method that thwarts such attacks. We first introduce the underlying concept that motivates the TDPP method: permuting both the rows and columns of the DNN weight matrices. This contrasts with previous methods, which focused solely on permuting a single dimension of the weight matrices, either the rows or columns. While it's possible for an adversary to access the matrix values, the original arrangement of rows and columns in the matrices remains concealed. As a result, the extracted DNN model from the accessed matrix values would fail to operate correctly. We consider two different memristive computing systems (designed for layer-by-layer and layer-parallel processing, respectively) and demonstrate the design of the TDPP method that could be embedded into the two systems. Finally, we present a security analysis. Our experiments demonstrate that TDPP can achieve comparable effectiveness to prior approaches, with a high level of security when appropriately parameterized. In addition, TDPP is more scalable than previous methods and results in reduced area and power overheads. The area and power are reduced by, respectively, 1218× and 2815× for the layer-by-layer system and by 178× and 203× for the layer-parallel system compared to prior works.
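The core idea of permuting both rows and columns of a weight matrix can be sketched in software as follows. This is only a conceptual illustration under simple assumptions (a weight matrix stored as nested lists, secret permutations held as index lists); the helper names are hypothetical, and the paper's hardware design for memristive crossbars is not reproduced here.

```python
def permute_2d(weights, row_perm, col_perm):
    """Return a matrix whose entry (i, j) is weights[row_perm[i]][col_perm[j]]."""
    return [[weights[r][c] for c in col_perm] for r in row_perm]


def invert_permutation(perm):
    """Return the inverse permutation, so that applying both restores the order."""
    inverse = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inverse[old_pos] = new_pos
    return inverse


if __name__ == "__main__":
    weights = [[1, 2, 3], [4, 5, 6]]
    row_perm = [1, 0]        # secret row order (hypothetical key)
    col_perm = [2, 0, 1]     # secret column order (hypothetical key)

    stored = permute_2d(weights, row_perm, col_perm)
    recovered = permute_2d(
        stored, invert_permutation(row_perm), invert_permutation(col_perm)
    )
    print(stored)
    print(recovered == weights)
```

Someone who reads only `stored` sees the correct values but not their arrangement; recovering the original layout requires both permutations.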
Submitted 10 October, 2023;
originally announced October 2023.
-
Everyone Deserves A Reward: Learning Customized Human Preferences
Authors:
Pengyu Cheng,
Jiawen Xie,
Ke Bai,
Yong Dai,
Nan Du
Abstract:
Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diversified human preferences with respect to different religions, politics, cultures, etc. Moreover, each individual can have their own unique preferences on various topics. Neglecting the diversity of human preferences, current human feedback alignment methods only consider a general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme, then empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve the general preferring ability while training the customized RMs, especially general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.
Submitted 15 September, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers
Authors:
Jiawen Xie,
Pengyu Cheng,
Xiao Liang,
Yong Dai,
Nan Du
Abstract:
Although dominant in natural language processing, transformer-based models remain challenged by the task of long-sequence processing, because the computational cost of self-attention operations in transformers swells quadratically with the input sequence length. To alleviate the complexity of long-sequence processing, we propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences, while the computation and memory costs grow only linearly with the input sequence length. More specifically, our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps, and finally selects the most representative hidden states from the encoder for the decoding process. To extract inter-chunk semantic information, we align the start and end token embeddings among chunks in each encoding transformer block. To learn an effective hidden selection policy, we design a dual updating scheme inspired by reinforcement learning, which regards the decoders of transformers as environments, and the downstream performance metrics as the rewards to evaluate the hidden selection actions. Our empirical results on real-world long-text summarization and reading comprehension tasks demonstrate effective improvements compared to prior long-sequence processing baselines.
Submitted 5 July, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
Brainformers: Trading Simplicity for Efficiency
Authors:
Yanqi Zhou,
Nan Du,
Yanping Huang,
Daiyi Peng,
Chang Lan,
Da Huang,
Siamak Shakeri,
David So,
Andrew Dai,
Yifeng Lu,
Zhifeng Chen,
Quoc Le,
Claire Cui,
James Laudon,
Jeff Dean
Abstract:
Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network. Here we investigate this design choice and find that more complex blocks that have different permutations of layer primitives can be more efficient. Using this insight, we develop a complex block, named Brainformer, that consists of a diverse set of layers such as sparsely gated feed-forward layers, dense feed-forward layers, attention layers, and various forms of layer normalization and activation functions. Brainformer consistently outperforms the state-of-the-art dense and sparse Transformers, in terms of both quality and efficiency. A Brainformer model with 8 billion activated parameters per token demonstrates 2x faster training convergence and 5x faster step time compared to its GLaM counterpart. In downstream task evaluation, Brainformer also demonstrates a 3% higher SuperGLUE score with fine-tuning compared to GLaM with a similar number of activated parameters. Finally, Brainformer largely outperforms a Primer dense model derived with NAS with similar computation per token on few-shot evaluations.
Submitted 25 April, 2024; v1 submitted 29 May, 2023;
originally announced June 2023.
-
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Authors:
Sheng Shen,
Le Hou,
Yanqi Zhou,
Nan Du,
Shayne Longpre,
Jason Wei,
Hyung Won Chung,
Barret Zoph,
William Fedus,
Xinyun Chen,
Tu Vu,
Yuexin Wu,
Wuyang Chen,
Albert Webson,
Yunxuan Li,
Vincent Zhao,
Hongkun Yu,
Kurt Keutzer,
Trevor Darrell,
Denny Zhou
Abstract:
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
Submitted 5 July, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Lifelong Language Pretraining with Distribution-Specialized Experts
Authors:
Wuyang Chen,
Yanqi Zhou,
Nan Du,
Yanping Huang,
James Laudon,
Zhifeng Chen,
Claire Cui
Abstract:
Pretraining on a large-scale corpus has become a standard method to build general language models (LMs). Adapting a model to new data distributions targeting different downstream tasks poses significant challenges. Naive fine-tuning may incur catastrophic forgetting when the over-parameterized LMs overfit the new data but fail to preserve the pretrained features. Lifelong learning (LLL) aims to enable information systems to learn from a continuous data stream across time. However, most prior work modifies the training recipe assuming a static fixed network architecture. We find that additional model capacity and proper regularization are key elements to achieving strong LLL performance. Thus, we propose Lifelong-MoE, an extensible MoE (Mixture-of-Experts) architecture that dynamically adds model capacity via adding experts with regularized pretraining. Our results show that by only introducing a limited number of extra experts while keeping the computation cost constant, our model can steadily adapt to data distribution shifts while preserving the previous knowledge. Compared to existing lifelong learning approaches, Lifelong-MoE achieves better few-shot performance on 19 downstream NLP tasks.
Submitted 20 May, 2023;
originally announced May 2023.
-
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Authors:
Sang Michael Xie,
Hieu Pham,
Xuanyi Dong,
Nan Du,
Hanxiao Liu,
Yifeng Lu,
Percy Liang,
Quoc V. Le,
Tengyu Ma,
Adams Wei Yu
Abstract:
The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
Submitted 20 November, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
PaLM 2 Technical Report
Authors:
Rohan Anil,
Andrew M. Dai,
Orhan Firat,
Melvin Johnson,
Dmitry Lepikhin,
Alexandre Passos,
Siamak Shakeri,
Emanuel Taropa,
Paige Bailey,
Zhifeng Chen,
Eric Chu,
Jonathan H. Clark,
Laurent El Shafey,
Yanping Huang,
Kathy Meier-Hellstern,
Gaurav Mishra,
Erica Moreira,
Mark Omernick,
Kevin Robinson,
Sebastian Ruder,
Yi Tay,
Kefan Xiao,
Yuanzhong Xu,
Yujing Zhang,
Gustavo Hernandez Abrego
, et al. (103 additional authors not shown)
Abstract:
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Submitted 13 September, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Authors:
Tao Lei,
Junwen Bai,
Siddhartha Brahma,
Joshua Ainslie,
Kenton Lee,
Yanqi Zhou,
Nan Du,
Vincent Y. Zhao,
Yuexin Wu,
Bo Li,
Yu Zhang,
Ming-Wei Chang
Abstract:
We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.
Submitted 26 November, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
Low Frequency (100-600 MHz) Searches with Axion Cavity Haloscopes
Authors:
S. Chakrabarty,
J. R. Gleason,
Y. Han,
A. T. Hipp,
M. Solano,
P. Sikivie,
N. S. Sullivan,
D. B. Tanner,
M. Goryachev,
E. Hartman,
B. T. McAllister,
A. Quiskamp,
C. Thomson,
M. E. Tobar,
M. H. Awida,
A. S. Chou,
M. Hollister,
S. Knirck,
A. Sonnenschein,
W. Wester,
T. Braine,
M. Guzzetti,
C. Hanretty,
G. Leum,
L. J. Rosenberg
, et al. (22 additional authors not shown)
Abstract:
We investigate reentrant and dielectric loaded cavities for the purpose of extending the range of axion cavity haloscopes to lower masses, below the range where the Axion Dark Matter eXperiment (ADMX) has already searched. Reentrant and dielectric loaded cavities were simulated numerically to calculate and optimize their form factors and quality factors. A prototype reentrant cavity was built and its measured properties were compared with the simulations. We estimate the sensitivity of axion dark matter searches using reentrant and dielectric loaded cavities inserted in the existing ADMX magnet at the University of Washington and a large magnet being installed at Fermilab.
Submitted 28 March, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Search for a dark-matter induced Cosmic Axion Background with ADMX
Authors:
ADMX Collaboration,
T. Nitta,
T. Braine,
N. Du,
M. Guzzetti,
C. Hanretty,
G. Leum,
L. J Rosenberg,
G. Rybka,
J. Sinnis,
John Clarke,
I. Siddiqi,
M. H. Awida,
A. S. Chou,
M. Hollister,
S. Knirck,
A. Sonnenschein,
W. Wester,
J. R. Gleason,
A. T. Hipp,
P. Sikivie,
N. S. Sullivan,
D. B. Tanner,
R. Khatiwada,
G. Carosi
, et al. (23 additional authors not shown)
Abstract:
We report the first result of a direct search for a Cosmic ${\it axion}$ Background (C$a$B) - a relativistic background of axions that is not dark matter - performed with the axion haloscope, the Axion Dark Matter eXperiment (ADMX). Conventional haloscope analyses search for a signal with a narrow bandwidth, as predicted for dark matter, whereas the C$a$B will be broad. We introduce a novel analysis strategy, which searches for a C$a$B induced daily modulation in the power measured by the haloscope. Using this, we repurpose data collected to search for dark matter to set a limit on the axion photon coupling of a C$a$B originating from dark matter cascade decay via a mediator in the 800-995 MHz frequency range. We find that the present sensitivity is limited by fluctuations in the cavity readout as the instrument scans across dark matter masses. Nevertheless, we suggest that these challenges can be surmounted using superconducting qubits as single photon counters, and allow ADMX to operate as a telescope searching for axions emerging from the decay of dark matter. The daily modulation analysis technique we introduce can be deployed for various broadband RF signals, such as other forms of a C$a$B or even high-frequency gravitational waves.
Submitted 3 October, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Massively Multilingual Shallow Fusion with Large Language Models
Authors:
Ke Hu,
Tara N. Sainath,
Bo Li,
Nan Du,
Yanping Huang,
Andrew M. Dai,
Yu Zhang,
Rodrigo Cabrera,
Zhifeng Chen,
Trevor Strohman
Abstract:
While large language models (LLM) have made impressive progress in natural language processing, it remains unclear how to utilize them in improving automatic speech recognition (ASR). In this work, we propose to train a single multilingual language model (LM) for shallow fusion in multiple languages. We push the limits of the multilingual LM to cover up to 84 languages by scaling up using a mixture-of-experts LLM, i.e., generalist language model (GLaM). When the number of experts increases, GLaM dynamically selects only two at each decoding step to keep the inference computation roughly constant. We then apply GLaM to a multilingual shallow fusion task based on a state-of-the-art end-to-end model. Compared to a dense LM of similar computation during inference, GLaM reduces the WER of an English long-tail test set by 4.4% relative. In a multilingual shallow fusion task, GLaM improves 41 out of 50 languages with an average relative WER reduction of 3.85%, and a maximum reduction of 10%. Compared to the baseline model, GLaM achieves an average WER reduction of 5.53% over 43 languages.
Submitted 17 February, 2023;
originally announced February 2023.
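The two-expert selection mentioned in the abstract above can be illustrated with a minimal sketch. The function name `top2_gate`, the example logits, and the renormalization of the two selected gate weights are assumptions made for illustration; the abstract only states that two experts are selected at each decoding step.

```python
import math

def top2_gate(expert_logits):
    """Pick the two highest-scoring experts for one token.

    expert_logits: hypothetical router outputs, one value per expert.
    Returns [(expert_index, weight), (expert_index, weight)], with the
    two weights renormalized to sum to 1 (an assumption of this sketch).
    """
    # Numerically stable softmax over the experts.
    m = max(expert_logits)
    exps = [math.exp(x - m) for x in expert_logits]
    total = sum(exps)
    probs = [x / total for x in exps]

    # Keep only the two most probable experts, so per-token compute
    # stays roughly constant no matter how many experts exist.
    top2 = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:2]
    norm = sum(probs[e] for e in top2)
    return [(e, probs[e] / norm) for e in top2]

gates = top2_gate([0.5, 2.0, -1.0, 1.0])
for expert, weight in gates:
    print(f"expert {expert}: weight {weight:.3f}")
```

Adding more experts lengthens the logit list but leaves the number of experts evaluated per token at two, which is the property the abstract relies on.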
-
Model Based Reinforcement Learning with Non-Gaussian Environment Dynamics and its Application to Portfolio Optimization
Authors:
Huifang Huang,
Ting Gao,
Pengbo Li,
Jin Guo,
Peng Zhang,
Nan Du
Abstract:
With the fast development of quantitative portfolio optimization in financial engineering, many AI-based algorithmic trading strategies have demonstrated promising results, among which reinforcement learning is beginning to show competitive advantages. However, the environment of real financial markets is complex and hard to simulate fully, given abrupt transitions, unpredictable hidden causal factors, heavy-tailed properties, and so on. Thus, in this paper, we first adopt heavy-tail-preserving normalizing flows to simulate the high-dimensional joint probability of the complex trading environment and develop a model-based reinforcement learning framework to better understand the intrinsic mechanisms of quantitative online trading. Second, we experiment with various stocks from three different financial markets (Dow, NASDAQ and S&P) and show that, among these three markets, Dow achieves the best performance on various evaluation metrics under our back-testing system. In particular, our proposed method is able to mitigate the impact of unpredictable financial market crises during the COVID-19 pandemic period, resulting in a lower maximum drawdown. Third, we explore explanations of our RL algorithm: (1) we utilize the pattern causality method to study the interactive relation among different stocks in the environment; (2) we analyze the dynamic loss and actor loss to ensure the convergence of our strategies; (3) by visualizing high-dimensional state transition data comparisons from real and virtual buffers with t-SNE, we uncover some effective patterns of better portfolio optimization strategies; (4) we utilize eigenvalue analysis to study the convergence properties of the environment's model.
Submitted 9 March, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Review of security techniques for memristor computing systems
Authors:
Minhui Zou,
Nan Du,
Shahar Kvatinsky
Abstract:
Neural network (NN) algorithms have become the dominant tool in visual object recognition, natural language processing, and robotics. To enhance the computational efficiency of these algorithms, in comparison to the traditional von Neumann computing architectures, researchers have been focusing on memristor computing systems. A major drawback when using memristor computing systems today is that, in the artificial intelligence (AI) era, well-trained NN models are intellectual property and, when loaded in the memristor computing systems, face theft threats, especially when running in edge devices. An adversary may steal the well-trained NN models through advanced attacks such as learning attacks and side-channel analysis. In this paper, we review different security techniques for protecting memristor computing systems. Two threat models are described based on their assumptions regarding the adversary's capabilities: a black-box (BB) model and a white-box (WB) model. We categorize the existing security techniques into five classes in the context of these threat models: thwarting learning attacks (BB), thwarting side-channel attacks (BB), NN model encryption (WB), NN weight transformation (WB), and fingerprint embedding (WB). We also present a cross-comparison of the limitations of the security techniques. This paper could serve as an aid when designing secure memristor computing systems.
Submitted 19 December, 2022;
originally announced December 2022.
-
ReAct: Synergizing Reasoning and Acting in Language Models
Authors:
Shunyu Yao,
Jeffrey Zhao,
Dian Yu,
Nan Du,
Izhak Shafran,
Karthik Narasimhan,
Yuan Cao
Abstract:
While large language models (LLMs) have demonstrated impressive capabilities across tasks in language understanding and interactive decision making, their abilities for reasoning (e.g. chain-of-thought prompting) and acting (e.g. action plan generation) have primarily been studied as separate topics. In this paper, we explore the use of LLMs to generate both reasoning traces and task-specific actions in an interleaved manner, allowing for greater synergy between the two: reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources, such as knowledge bases or environments, to gather additional information. We apply our approach, named ReAct, to a diverse set of language and decision making tasks and demonstrate its effectiveness over state-of-the-art baselines, as well as improved human interpretability and trustworthiness over methods without reasoning or acting components. Concretely, on question answering (HotpotQA) and fact verification (Fever), ReAct overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API, and generates human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. On two interactive decision making benchmarks (ALFWorld and WebShop), ReAct outperforms imitation and reinforcement learning methods by an absolute success rate of 34% and 10% respectively, while being prompted with only one or two in-context examples. Project site with code: https://react-lm.github.io
Submitted 9 March, 2023; v1 submitted 5 October, 2022;
originally announced October 2022.
-
Physics inspired compact modelling of BiFeO$_3$ based memristors for hardware security applications
Authors:
Sahitya Yarragolla,
Nan Du,
Torben Hemke,
Xianyue Zhao,
Ziang Chen,
Ilia Polian,
Thomas Mussenbrock
Abstract:
With the advent of the Internet of Things, nanoelectronic devices or memristors have been the subject of significant interest for use as new hardware security primitives. Among the several available memristors, BiFeO$_3$ (BFO)-based electroforming-free memristors have attracted considerable attention due to their excellent properties, such as long retention time, self-rectification, intrinsic stochasticity, and fast switching. They have been actively investigated for use in physical unclonable function (PUF) key storage modules, artificial synapses in neural networks, nonvolatile resistive switches, and reconfigurable logic applications. In this work, we present a physics-inspired 1D compact model of a BFO memristor to understand its implementation for such applications (mainly PUFs) and perform circuit simulations. The resistive switching based on electric field-driven vacancy migration and the intrinsic stochastic behaviour of the BFO memristor are modelled using the cloud-in-a-cell scheme. The experimental current-voltage characteristics of the BFO memristor are successfully reproduced. The responses of the BFO memristor to changes in electrical properties, environmental properties (such as temperature), and stress are analyzed and found to be consistent with experimental results.
Submitted 7 October, 2022;
originally announced October 2022.
-
Multi-mode Analysis of Surface Losses in a Superconducting Microwave Resonator in High Magnetic Fields
Authors:
T. Braine,
G. Rybka,
A. A. Baker,
J. Brodsky,
G. Carosi,
N. Du,
N. Woollett,
S. Knirck,
M. Jones
Abstract:
This paper reports on a surface impedance measurement of a niobium titanium superconducting radio frequency (SRF) cavity in a magnetic field (up to $10\,{\rm T}$). A novel method is employed to decompose the surface resistance contributions of the cylindrical cavity end caps and walls using measurements from multiple $TM$ cavity modes. The results confirm that quality factor degradation of a NbTi SRF cavity in a high magnetic field is primarily from surfaces perpendicular to the field (the cavity end caps), while parallel surface resistances (the walls) remain relatively constant. This result is encouraging for applications needing high Q cavities in strong magnetic fields, such as the Axion Dark Matter eXperiment (ADMX), because it opens the possibility of hybrid SRF cavity construction to replace conventional copper cavities.
Submitted 24 August, 2022;
originally announced August 2022.
-
Explainable Anomaly Detection for Industrial Control System Cybersecurity
Authors:
Do Thu Ha,
Nguyen Xuan Hoang,
Nguyen Viet Hoang,
Nguyen Huu Du,
Truong Thu Huong,
Kim Phuc Tran
Abstract:
Industrial Control Systems (ICSs) are becoming increasingly important in managing the operation of many critical systems in smart manufacturing, such as power stations, water supply systems, and manufacturing sites. While massive digital data can be a driving force for system performance, data security has raised serious concerns. Anomaly detection, therefore, is essential for preventing network security intrusions and system attacks. Many AI-based anomaly detection methods have been proposed and achieve high detection performance; however, they remain a "black box" that is hard to interpret. In this study, we suggest using Explainable Artificial Intelligence to make the results of an LSTM-based Autoencoder-OCSVM learning model for anomaly detection in ICS more interpretable and reliable. We demonstrate the performance of our proposed method on a well-known SCADA dataset.
Submitted 4 May, 2022;
originally announced May 2022.
-
PaLM: Scaling Language Modeling with Pathways
Authors:
Aakanksha Chowdhery,
Sharan Narang,
Jacob Devlin,
Maarten Bosma,
Gaurav Mishra,
Adam Roberts,
Paul Barham,
Hyung Won Chung,
Charles Sutton,
Sebastian Gehrmann,
Parker Schuh,
Kensen Shi,
Sasha Tsvyashchenko,
Joshua Maynez,
Abhishek Rao,
Parker Barnes,
Yi Tay,
Noam Shazeer,
Vinodkumar Prabhakaran,
Emily Reif,
Nan Du,
Ben Hutchinson,
Reiner Pope,
James Bradbury,
Jacob Austin
, et al. (42 additional authors not shown)
Abstract:
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis of bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and potential mitigation strategies.
Submitted 5 October, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
-
Axion Dark Matter
Authors:
C. B. Adams,
N. Aggarwal,
A. Agrawal,
R. Balafendiev,
C. Bartram,
M. Baryakhtar,
H. Bekker,
P. Belov,
K. K. Berggren,
A. Berlin,
C. Boutan,
D. Bowring,
D. Budker,
A. Caldwell,
P. Carenza,
G. Carosi,
R. Cervantes,
S. S. Chakrabarty,
S. Chaudhuri,
T. Y. Chen,
S. Cheong,
A. Chou,
R. T. Co,
J. Conrad,
D. Croon
, et al. (130 additional authors not shown)
Abstract:
Axions are well-motivated dark matter candidates with simple cosmological production mechanisms. They were originally introduced to solve the strong CP problem, but also arise in a wide range of extensions to the Standard Model. This Snowmass white paper summarizes axion phenomenology and outlines next-generation laboratory experiments proposed to detect axion dark matter. There are vibrant synergies with astrophysical searches and advances in instrumentation including quantum-enabled readout, high-Q resonators and cavities and large high-field magnets. This white paper outlines a clear roadmap to discovery, and shows that the US is well-positioned to be at the forefront of the search for axion dark matter in the coming decade.
Submitted 29 March, 2023; v1 submitted 28 March, 2022;
originally announced March 2022.
-
New Horizons: Scalar and Vector Ultralight Dark Matter
Authors:
D. Antypas,
A. Banerjee,
C. Bartram,
M. Baryakhtar,
J. Betz,
J. J. Bollinger,
C. Boutan,
D. Bowring,
D. Budker,
D. Carney,
G. Carosi,
S. Chaudhuri,
S. Cheong,
A. Chou,
M. D. Chowdhury,
R. T. Co,
J. R. Crespo López-Urrutia,
M. Demarteau,
N. DePorzio,
A. V. Derbin,
T. Deshpande,
M. D. Chowdhury,
L. Di Luzio,
A. Diaz-Morcillo,
J. M. Doyle
, et al. (104 additional authors not shown)
Abstract:
The last decade has seen unprecedented effort in dark matter model building at all mass scales coupled with the design of numerous new detection strategies. Transformative advances in quantum technologies have led to a plethora of new high-precision quantum sensors and dark matter detection strategies for ultralight ($<10\,$eV) bosonic dark matter that can be described by an oscillating classical, largely coherent field. This white paper focuses on searches for wavelike scalar and vector dark matter candidates.
Submitted 28 March, 2022;
originally announced March 2022.
-
Multi-task unscented Kalman inversion (MUKI): a derivative-free joint inversion framework and its application to joint inversion of geophysical data
Authors:
Longlong Wang,
Yun Chen,
Youshan Liu,
Nanqiao Du,
Wei Li,
Junliu Suwen
Abstract:
In geophysical joint inversion, gradient-based and Bayesian Markov Chain Monte Carlo (MCMC) sampling-based methods are widely used owing to their fast convergence or global optimality, respectively. However, these methods either require the computation of gradients and easily fall into local optima, or take a long time to carry out millions of forward calculations in a huge sampling space. In contrast to these two methods, and taking advantage of the recently developed unscented Kalman method in computational mathematics, we extend an iterative gradient-free Bayesian joint inversion framework, i.e., Multi-task unscented Kalman inversion (MUKI). In this new framework, information from various observations is incorporated, the model is iteratively updated in a derivative-free way, and a Gaussian approximation to the posterior distribution of the model parameters is obtained. We apply MUKI to the joint inversion of receiver functions and surface wave dispersion, which is well-established and widely used to construct the crustal and upper mantle structure of the earth. Tests on synthetic and real data demonstrate that MUKI can recover the model more efficiently than the gradient-based method and the Markov Chain Monte Carlo method, and that it is a promising approach for geophysical joint inversion problems.
Submitted 3 August, 2022; v1 submitted 19 February, 2022;
originally announced February 2022.
-
Mixture-of-Experts with Expert Choice Routing
Authors:
Yanqi Zhou,
Tao Lei,
Hanxiao Liu,
Nan Du,
Yanping Huang,
Vincent Zhao,
Andrew Dai,
Zhifeng Chen,
Quoc Le,
James Laudon
Abstract:
Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
Submitted 13 October, 2022; v1 submitted 18 February, 2022;
originally announced February 2022.
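The expert choice idea in the abstract above (experts select their top-k tokens, rather than tokens selecting experts) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `expert_choice_route`, the example logits, and passing the per-expert bucket size directly as `capacity` are all assumptions.

```python
import math

def softmax(row):
    # Numerically stable softmax over one token's expert logits.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def expert_choice_route(logits, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    logits[t][e]: hypothetical router affinity of token t for expert e.
    Returns routes[e] = [(token_index, gate_weight), ...] of fixed length.
    """
    scores = [softmax(row) for row in logits]
    num_tokens, num_experts = len(logits), len(logits[0])
    routes = []
    for e in range(num_experts):
        # Rank all tokens by their score for this expert, then keep the top ones.
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        routes.append([(t, scores[t][e]) for t in ranked[:capacity]])
    return routes

# Four tokens, three experts, bucket size of two tokens per expert.
logits = [
    [3.0, 2.5, 0.0],
    [0.0, 0.1, 0.2],
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 1.5],
]
routes = expert_choice_route(logits, capacity=2)

# Count how many experts ended up processing each token.
experts_per_token = [
    sum(1 for bucket in routes for tok, _ in bucket if tok == t)
    for t in range(len(logits))
]
print(experts_per_token)
```

Every expert receives exactly `capacity` tokens, so load is balanced by construction, while the number of experts per token varies (here tokens 0 and 3 are each chosen by two experts and tokens 1 and 2 by one).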
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Authors:
Barret Zoph,
Irwan Bello,
Sameer Kumar,
Nan Du,
Yanping Huang,
Jeff Dean,
Noam Shazeer,
William Fedus
Abstract:
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
Submitted 29 April, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
Stochastic epidemic SIR models with hidden states
Authors:
Nguyen Du,
Alexandru Hening,
Nhu Nguyen,
George Yin
Abstract:
This paper analyzes realistic SIR models that take stochasticity into account. The proposed systems are applicable to most incidence rates that are used in the literature, including the bilinear incidence rate, the Beddington-DeAngelis incidence rate, and a Holling type II functional response. Given that many diseases can lead to asymptomatic infections, we look at a system of stochastic differential equations that also includes a class of hidden state individuals, for which the infection status is unknown. We assume that the direct observation of the percentage of hidden state individuals that are infected, $α(t)$, is not given and only a noise-corrupted observation process is available. Using nonlinear filtering techniques in conjunction with an invasion type analysis (or analysis using Lyapunov exponents from the dynamical system point of view), this paper proves that the long-term behavior of the disease is governed by a threshold $λ\in \mathbb{R}$ that depends on the model parameters. It turns out that if $λ<0$, the number $I(t)$ of infected individuals converges to zero exponentially fast, i.e., extinction occurs. In contrast, if $λ>0$, the infection is endemic and the system is permanent. We showcase our results by applying them to specific illuminating examples. Numerical simulations are also given to illustrate our results.
Submitted 17 January, 2022;
originally announced January 2022.
-
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Authors:
Nan Du,
Yanping Huang,
Andrew M. Dai,
Simon Tong,
Dmitry Lepikhin,
Yuanzhong Xu,
Maxim Krikun,
Yanqi Zhou,
Adams Wei Yu,
Orhan Firat,
Barret Zoph,
Liam Fedus,
Maarten Bosma,
Zongwei Zhou,
Tao Wang,
Yu Emma Wang,
Kellie Webster,
Marie Pellat,
Kevin Robinson,
Kathleen Meier-Hellstern,
Toju Duke,
Lucas Dixon,
Kun Zhang,
Quoc V Le,
Yonghui Wu
, et al. (2 additional authors not shown)
Abstract:
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
Submitted 1 August, 2022; v1 submitted 13 December, 2021;
originally announced December 2021.