-
Can Go AIs be adversarially robust?
Authors:
Tom Tseng,
Euan McLean,
Kellin Pelrine,
Tony T. Wang,
Adam Gleave
Abstract:
Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect against previously discovered attacks. Unfortunately, we also find that none of these defenses are able to withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai.
Submitted 18 June, 2024;
originally announced June 2024.
-
Exploiting Novel GPT-4 APIs
Authors:
Kellin Pelrine,
Mohammad Taufeeque,
Michał Zając,
Euan McLean,
Adam Gleave
Abstract:
Language model attacks typically assume one of two extreme threat models: full white-box access to model weights, or black-box access limited to a text generation API. However, real-world APIs are often more flexible than just text generation: these APIs expose "gray-box" access leading to new threat vectors. To explore this, we red-team three new functionalities exposed in the GPT-4 APIs: fine-tuning, function calling and knowledge retrieval. We find that fine-tuning a model on as few as 15 harmful examples or 100 benign examples can remove core safeguards from GPT-4, enabling a range of harmful outputs. Furthermore, we find that GPT-4 Assistants readily divulge the function call schema and can be made to execute arbitrary function calls. Finally, we find that knowledge retrieval can be hijacked by injecting instructions into retrieval documents. These vulnerabilities highlight that any additions to the functionality exposed by an API can create new vulnerabilities.
Submitted 4 August, 2024; v1 submitted 21 December, 2023;
originally announced December 2023.
-
Inverse Scaling: When Bigger Isn't Better
Authors:
Ian R. McKenzie,
Alexander Lyzhov,
Michael Pieler,
Alicia Parrish,
Aaron Mueller,
Ameya Prabhu,
Euan McLean,
Aaron Kirtland,
Alexis Ross,
Alisa Liu,
Andrew Gritsevskiy,
Daniel Wurgaft,
Derik Kauffman,
Gabriel Recchia,
Jiacheng Liu,
Joe Cavanagh,
Max Weiss,
Sicong Huang,
The Floating Droid,
Tom Tseng,
Tomasz Korbak,
Xudong Shen,
Yuhui Zhang,
Zhengping Zhou,
Najoung Kim
, et al. (2 additional authors not shown)
Abstract:
Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at https://inversescaling.com/data to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.
Submitted 12 May, 2024; v1 submitted 15 June, 2023;
originally announced June 2023.
-
Language models are better than humans at next-token prediction
Authors:
Buck Shlegeris,
Fabien Roger,
Lawrence Chan,
Euan McLean
Abstract:
Current language models are considered to have sub-human capabilities at natural language tasks like question-answering or writing code. However, language models are not trained to perform well at these tasks; they are trained to accurately predict the next token given previous tokens in tokenized text. It is not clear whether language models are better or worse than humans at next-token prediction. To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity. In both experiments, we find humans to be consistently worse than even relatively small language models like GPT3-Ada at next-token prediction.
Submitted 15 July, 2024; v1 submitted 21 December, 2022;
originally announced December 2022.
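The two metrics named in the abstract above, top-1 accuracy and perplexity, can be illustrated with a minimal sketch. It assumes each prediction is available as a probability distribution over candidate tokens; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def top1_accuracy(predicted_dists, true_tokens):
    """Fraction of positions where the most probable token is the true token."""
    hits = 0
    for dist, token in zip(predicted_dists, true_tokens):
        guess = max(dist, key=dist.get)  # highest-probability candidate
        if guess == token:
            hits += 1
    return hits / len(true_tokens)

def perplexity(predicted_dists, true_tokens):
    """exp of the mean negative log-probability assigned to the true tokens."""
    nll = 0.0
    for dist, token in zip(predicted_dists, true_tokens):
        nll -= math.log(dist[token])
    return math.exp(nll / len(true_tokens))

# Toy data (hypothetical): two positions, three candidate tokens each.
dists = [
    {"the": 0.5, "a": 0.25, "an": 0.25},
    {"cat": 0.25, "dog": 0.5, "fox": 0.25},
]
truth = ["the", "cat"]

acc = top1_accuracy(dists, truth)
ppl = perplexity(dists, truth)
```

Lower perplexity means the predictor placed more probability on the tokens that actually occurred, while top-1 accuracy only checks the single most likely guess.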
-
$B_s\to D_s \ell\nu$ Form Factors for the full $q^2$ range from Lattice QCD with non-perturbatively normalized currents
Authors:
E. McLean,
C. T. H. Davies,
J. Koponen,
A. T. Lytle
Abstract:
We present a lattice QCD determination of the $B_s \to D_s \ell\nu$ scalar and vector form factors over the full physical range of momentum transfer. The result is derived from correlation functions computed using the Highly Improved Staggered Quark (HISQ) formalism, on the second generation MILC gluon ensembles accounting for up, down, strange and charm contributions from the sea. We calculate correlation functions for three lattice spacing values and an array of unphysically light $b$-quark masses, and extrapolate to the physical value. Using the HISQ formalism for all quarks means that the lattice current coupling to the $W$ can be renormalized non-perturbatively, giving a result free from perturbative matching errors for the first time. Our results are in agreement with, and more accurate than, previous determinations of these form factors. From the form factors we also determine the ratio of branching fractions that is sensitive to violation of lepton universality: $R(D_s) = \mathcal{B}(B_s\to D_s \tau\nu_\tau)/\mathcal{B}(B_s\to D_s \ell\nu_{\ell})$, where $\ell$ is an electron or a muon. We find $R(D_s) = 0.2987(46)$, which is also more accurate than previous lattice QCD results. Combined with a future measurement of $R(D_s)$, this could supply a new test of the Standard Model. We also compare the dependence on heavy quark mass of our form factors to expectations from Heavy Quark Effective Theory.
Submitted 14 June, 2019; v1 submitted 3 June, 2019;
originally announced June 2019.
-
Lattice QCD form factor for $B_s\to D_s^* l\nu$ at zero recoil with non-perturbative current renormalisation
Authors:
E. McLean,
C. T. H. Davies,
A. T. Lytle,
J. Koponen
Abstract:
We present details of a lattice QCD calculation of the $B_s\to D_s^*$ axial form factor at zero recoil using the Highly Improved Staggered Quark (HISQ) formalism on the second generation MILC gluon ensembles that include up, down, strange and charm quarks in the sea. Using the HISQ action for all valence quarks means that the lattice axial vector current that couples to the $W$ can be renormalized fully non-perturbatively, giving a result free of the perturbative matching errors that previous lattice QCD calculations have had. We calculate correlation functions at three values of the lattice spacing, and multiple `$b$'-quark masses, for physical $c$ and $s$. The functional dependence on the $b$-quark mass can be determined and compared to Heavy Quark Effective Theory expectations, and a result for the form factor obtained at the physical value of the $b$-quark mass. We find $\mathcal{F}^{B_s\to D_s^*}(1) = h^s_{A_1}(1) = 0.9020(96)_{\text{stat}}(90)_{\text{sys}}$. This is in agreement with earlier lattice QCD results, which use NRQCD $b$ quarks, with a total uncertainty reduced by more than a factor of two. We discuss implications of this result for the $B\to D^*$ axial form factor at zero recoil and for determinations of $V_{cb}$.
Submitted 31 May, 2019; v1 submitted 3 April, 2019;
originally announced April 2019.
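The result quoted above, $0.9020(96)_{\text{stat}}(90)_{\text{sys}}$, uses parenthesised-uncertainty notation, where the digits in parentheses apply to the last decimal places of the central value. A minimal sketch of reading that notation follows. Combining the two errors in quadrature is a common convention assumed here for illustration; the abstract does not state how the total uncertainty was formed, and the helper name is hypothetical.

```python
import math

def paren_to_abs(central: str, digits: str) -> float:
    """Convert parenthesised uncertainty digits to an absolute uncertainty.

    Example: central "0.9020" with digits "96" means 0.0096.
    """
    decimals = len(central.split(".")[1])  # number of decimal places quoted
    return int(digits) * 10 ** (-decimals)

central = "0.9020"
stat = paren_to_abs(central, "96")   # statistical uncertainty
syst = paren_to_abs(central, "90")   # systematic uncertainty

# Assumption: the two errors are independent, so they add in quadrature.
total = math.hypot(stat, syst)       # about 0.0132
```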
-
$B_s\to D_s^{(*)}l\nu$ Form Factors with Heavy HISQ Quarks
Authors:
E. McLean,
C. T. H. Davies,
A. T. Lytle,
J. Koponen
Abstract:
We present progress on an ongoing calculation of the $B_s\to D_s^{(*)} l \nu$ form factors calculated on the $n_f=2+1+1$ MILC ensembles and using the Highly Improved Staggered Quark action for all valence quarks. We perform the calculation at a range of $b$ quark masses (and lattice spacings) so that we can extrapolate to the physical $b$-quark mass.
Submitted 15 January, 2019;
originally announced January 2019.
-
The $B_{(s)} \to D_{(s)}l\nu$ Decay with Highly Improved Staggered Quarks and NRQCD
Authors:
Euan McLean,
Christine T. H. Davies,
Brian Colquhoun,
Andrew Lytle
Abstract:
We report on progress of a lattice QCD calculation of the $B\to Dl\nu$ and $B_s\to D_s l\nu$ semileptonic form factors. We use a relativistic staggered action (HISQ) for light and charm quarks, and an improved non-relativistic (NRQCD) action for bottom, on the second generation MILC ensembles.
Submitted 9 November, 2017;
originally announced November 2017.