-
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Authors:
Andrew M. Bean,
Simi Hellsten,
Harry Mayne,
Jabez Magomere,
Ethan A. Chi,
Ryan Chi,
Scott A. Hale,
Hannah Rose Kirk
Abstract:
In this paper, we present the LingOly benchmark, a novel benchmark for advanced reasoning abilities in large language models. Using challenging Linguistic Olympiad puzzles, we evaluate (i) capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages, and (ii) abilities to follow complex task instructions. The LingOly benchmark covers more than 90 mostly low-resource languages, minimising issues of data contamination, and contains 1,133 problems across 6 formats and 5 levels of human difficulty. We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation. Scores from 11 state-of-the-art LLMs demonstrate the benchmark to be challenging, and models perform poorly on the higher difficulty problems. On harder problems, even the top model only achieved 38.7% accuracy, 24.7% improvement over the no-context baseline. Large closed models typically outperform open models, and in general, the higher resource the language, the better the scores. These results indicate, in absence of memorisation, true multi-step out-of-domain reasoning remains a challenge for current language models.
Submitted 11 June, 2024; v1 submitted 10 June, 2024;
originally announced June 2024.
-
The AI Community Building the Future? A Quantitative Analysis of Development Activity on Hugging Face Hub
Authors:
Cailean Osborne,
Jennifer Ding,
Hannah Rose Kirk
Abstract:
Open model developers have emerged as key actors in the political economy of artificial intelligence (AI), but we still have a limited understanding of collaborative practices in the open AI ecosystem. This paper responds to this gap with a three-part quantitative analysis of development activity on the Hugging Face (HF) Hub, a popular platform for building, sharing, and demonstrating models. First, various types of activity across 348,181 model, 65,761 dataset, and 156,642 space repositories exhibit right-skewed distributions. Activity is extremely imbalanced between repositories; for example, over 70% of models have 0 downloads, while 1% account for 99% of downloads. Furthermore, licenses matter: there are statistically significant differences in collaboration patterns in model repositories with permissive, restrictive, and no licenses. Second, we analyse a snapshot of the social network structure of collaboration in model repositories, finding that the community has a core-periphery structure, with a core of prolific developers and a majority of isolate developers (89%). Upon removing the isolate developers from the network, collaboration is characterised by high reciprocity regardless of developers' network positions. Third, we examine model adoption through the lens of model usage in spaces, finding that a minority of models, developed by a handful of companies, are widely used on the HF Hub. Overall, activity on the HF Hub is characterised by Pareto distributions, congruent with OSS development patterns on platforms like GitHub. We conclude with recommendations for researchers, companies, and policymakers to advance our understanding of open AI development.
Submitted 5 June, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
Authors:
Hannah Rose Kirk,
Alexander Whitefield,
Paul Röttger,
Andrew Bean,
Katerina Margatina,
Juan Ciro,
Rafael Mosquera,
Max Bartolo,
Adina Williams,
He He,
Bertie Vidgen,
Scott A. Hale
Abstract:
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
Submitted 24 April, 2024;
originally announced April 2024.
-
An ALMA search for substructure and fragmentation in starless cores in Orion B North
Authors:
Samuel Fielder,
Helen Kirk,
Michael Dunham,
Stella Offner
Abstract:
We present Atacama Large Millimeter/submillimeter Array (ALMA) Cycle 3 observations of 73 starless and protostellar cores in the Orion B North molecular cloud. We detect a total of 34 continuum sources at 106 GHz, and after comparisons with other data, 4 of these sources appear to be starless. Three of the four sources are located near groupings of protostellar sources, while one source is an isolated detection. We use synthetic observations of a simulation modeling a collapsing turbulent, magnetized core to compute the expected number of starless cores that should be detectable with our ALMA observations and find at least two (1.52) starless core should be detectable, consistent with our data. We run a simple virial analysis of the cores to put the Orion B North observations into context with similar previous ALMA surveys of cores in Chamaeleon I and Ophiuchus. We conclude that the Chamaeleon I starless core population is characteristically less bounded than the other two populations, along with external pressure contributions dominating the binding energy of the cores. These differences may explain why the Chamaeleon I cores do not follow turbulent model predictions, while the Ophiuchus and Orion B North cores are consistent with the model.
Submitted 21 April, 2024;
originally announced April 2024.
-
Introducing v0.5 of the AI Safety Benchmark from MLCommons
Authors:
Bertie Vidgen,
Adarsh Agrawal,
Ahmed M. Ahmed,
Victor Akinwande,
Namir Al-Nuaimi,
Najla Alfaraj,
Elie Alhajjar,
Lora Aroyo,
Trupti Bavalatti,
Max Bartolo,
Borhane Blili-Hamelin,
Kurt Bollacker,
Rishi Bommasani,
Marisa Ferrara Boston,
Siméon Campos,
Kal Chakra,
Canyu Chen,
Cody Coleman,
Zacharie Delpierre Coudert,
Leon Derczynski,
Debojyoti Dutta,
Ian Eisenberg,
James Ezick,
Heather Frase,
Brian Fuller
, et al. (75 additional authors not shown)
Abstract:
This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English), and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which 7 have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform, and downloadable tool, called ModelBench that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned language models; (7) a test specification for the benchmark.
Submitted 13 May, 2024; v1 submitted 18 April, 2024;
originally announced April 2024.
-
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation
Authors:
Jessica Quaye,
Alicia Parrish,
Oana Inel,
Charvi Rastogi,
Hannah Rose Kirk,
Minsuk Kahng,
Erin van Liemt,
Max Bartolo,
Jess Tsang,
Justin White,
Nathan Clement,
Rafael Mosquera,
Juan Ciro,
Vijay Janapa Reddi,
Lora Aroyo
Abstract:
With the rise of text-to-image (T2I) generative AI models reaching wide audiences, it is critical to evaluate model robustness against non-obvious attacks to mitigate the generation of offensive images. By focusing on ``implicitly adversarial'' prompts (those that trigger T2I models to generate unsafe images for non-obvious reasons), we isolate a set of difficult safety issues that human creativity is well-suited to uncover. To this end, we built the Adversarial Nibbler Challenge, a red-teaming methodology for crowdsourcing a diverse set of implicitly adversarial prompts. We have assembled a suite of state-of-the-art T2I models, employed a simple user interface to identify and annotate harms, and engaged diverse populations to capture long-tail safety issues that may be overlooked in standard testing. The challenge is run in consecutive rounds to enable a sustained discovery and analysis of safety pitfalls in T2I models.
In this paper, we present an in-depth account of our methodology, a systematic study of novel attack strategies and discussion of safety failures revealed by challenge participants. We also release a companion visualization tool for easy exploration and derivation of insights from the dataset. The first challenge round resulted in over 10k prompt-image pairs with machine annotations for safety. A subset of 1.5k samples contains rich human annotations of harm types and attack styles. We find that 14% of images that humans consider harmful are mislabeled as ``safe'' by machines. We have identified new attack strategies that highlight the complexity of ensuring T2I model robustness. Our findings emphasize the necessity of continual auditing and adaptation as new vulnerabilities emerge. We are confident that this work will enable proactive, iterative safety assessments and promote responsible development of T2I models.
Submitted 13 May, 2024; v1 submitted 14 February, 2024;
originally announced March 2024.
-
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
Authors:
Paul Röttger,
Valentin Hofmann,
Valentina Pyatkin,
Musashi Hinck,
Hannah Rose Kirk,
Hinrich Schütze,
Dirk Hovy
Abstract:
Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT's multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
Submitted 5 June, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
Cheap Learning: Maximising Performance of Language Models for Social Data Science Using Minimal Data
Authors:
Leonardo Castro-Gonzalez,
Yi-Ling Chung,
Hannah Rose Kirk,
John Francis,
Angus R. Williams,
Pica Johansson,
Jonathan Bright
Abstract:
The field of machine learning has recently made significant progress in reducing the requirements for labelled training data when building new models. These `cheaper' learning techniques hold significant potential for the social sciences, where development of large labelled training datasets is often a significant practical impediment to the use of machine learning for analytical tasks. In this article we review three `cheap' techniques that have developed in recent years: weak supervision, transfer learning and prompt engineering. For the latter, we also review the particular case of zero-shot prompting of large language models. For each technique we provide a guide of how it works and demonstrate its application across six different realistic social science applications (two different tasks paired with three different dataset makeups). We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost. Our results are accompanied by a code repository to make it easy for others to duplicate our work and use it in their own research. Overall, our article is intended to stimulate further uptake of these techniques in the social sciences.
Submitted 22 January, 2024;
originally announced January 2024.
-
SimpleSafetyTests: a Test Suite for Identifying Critical Safety Risks in Large Language Models
Authors:
Bertie Vidgen,
Nino Scherrer,
Hannah Rose Kirk,
Rebecca Qian,
Anand Kannappan,
Scott A. Hale,
Paul Röttger
Abstract:
The past year has seen rapid acceleration in the development of large language models (LLMs). However, without proper steering and safeguards, LLMs will readily follow malicious instructions, provide unsafe advice, and generate toxic content. We introduce SimpleSafetyTests (SST) as a new test suite for rapidly and systematically identifying such critical safety risks. The test suite comprises 100 test prompts across five harm areas that LLMs, for the vast majority of applications, should refuse to comply with. We test 11 open-access and open-source LLMs and four closed-source LLMs, and find critical safety weaknesses. While some of the models do not give a single unsafe response, most give unsafe responses to more than 20% of the prompts, with over 50% unsafe responses in the extreme. Prepending a safety-emphasising system prompt substantially reduces the occurrence of unsafe responses, but does not completely stop them from happening. Trained annotators labelled every model response to SST (n = 3,000). We use these annotations to evaluate five AI safety filters (which assess whether a models' response is unsafe given a prompt) as a way of automatically evaluating models' performance on SST. The filters' performance varies considerably. There are also differences across the five harm areas, and on the unsafe versus safe responses. The widely-used Perspective API has 72% accuracy and a newly-created zero-shot prompt to OpenAI's GPT-4 performs best with 89% accuracy. Content Warning: This paper contains prompts and responses that relate to child abuse, suicide, self-harm and eating disorders, scams and fraud, illegal items, and physical harm.
Submitted 16 February, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Mode Selection and Target Classification in Cognitive Radar Networks
Authors:
William W. Howard,
Samuel R. Shebert,
Benjamin H. Kirk,
R. Michael Buehrer
Abstract:
Cognitive Radar Networks were proposed by Simon Haykin in 2006 to address problems with large legacy radar implementations - primarily, single-point vulnerabilities and lack of adaptability. This work proposes to leverage the adaptability of cognitive radar networks to trade between active radar observation, which uses high power and risks interception, and passive signal parameter estimation, which uses target emissions to gain side information and lower the power necessary to accurately track multiple targets. The goal of the network is to learn over many target tracks both the characteristics of the targets as well as the optimal action choices for each type of target. In order to select between the available actions, we utilize a multi-armed bandit model, using current class information as prior information. When the active radar action is selected, the node estimates the physical behavior of targets through the radar emissions. When the passive action is selected, the node estimates the radio behavior of targets through passive sensing. Over many target tracks, the network collects the observed behavior of targets and forms clusters of similarly-behaved targets. In this way, the network meta-learns the target class distributions while learning the optimal mode selections for each target class.
Submitted 25 October, 2023;
originally announced October 2023.
-
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
Authors:
Hannah Rose Kirk,
Andrew M. Bean,
Bertie Vidgen,
Paul Röttger,
Scott A. Hale
Abstract:
Human feedback is increasingly used to steer the behaviours of Large Language Models (LLMs). However, it is unclear how to collect and incorporate feedback in a way that is efficient, effective and unbiased, especially for highly subjective human preferences and values. In this paper, we survey existing approaches for learning from human feedback, drawing on 95 papers primarily from the ACL and arXiv repositories. First, we summarise the past, pre-LLM trends for integrating human feedback into language models. Second, we give an overview of present techniques and practices, as well as the motivations for using feedback; conceptual frameworks for defining values and preferences; and how feedback is collected and from whom. Finally, we encourage a better future of feedback learning in LLMs by raising five unresolved conceptual and practical challenges.
Submitted 11 October, 2023;
originally announced October 2023.
-
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
Authors:
Hannah Rose Kirk,
Bertie Vidgen,
Paul Röttger,
Scott A. Hale
Abstract:
In this paper, we address the concept of "alignment" in large language models (LLMs) through the lens of post-structuralist socio-political theory, specifically examining its parallels to empty signifiers. To establish a shared vocabulary around how abstract concepts of alignment are operationalised in empirical datasets, we propose a framework that demarcates: 1) which dimensions of model behaviour are considered important, then 2) how meanings and definitions are ascribed to these dimensions, and by whom. We situate existing empirical literature and provide guidance on deciding which paradigm to follow. Through this framework, we aim to foster a culture of transparency and critical evaluation, aiding the community in navigating the complexities of aligning LLMs with human populations.
Submitted 15 November, 2023; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Casteist but Not Racist? Quantifying Disparities in Large Language Model Bias between India and the West
Authors:
Khyati Khandelwal,
Manuel Tonneau,
Andrew M. Bean,
Hannah Rose Kirk,
Scott A. Hale
Abstract:
Large Language Models (LLMs), now used daily by millions of users, can encode societal biases, exposing their users to representational harms. A large body of scholarship on LLM bias exists but it predominantly adopts a Western-centric frame and attends comparatively less to bias levels and potential harms in the Global South. In this paper, we quantify stereotypical bias in popular LLMs according to an Indian-centric frame and compare bias levels between the Indian and Western contexts. To do this, we develop a novel dataset which we call Indian-BhED (Indian Bias Evaluation Dataset), containing stereotypical and anti-stereotypical examples for caste and religion contexts. We find that the majority of LLMs tested are strongly biased towards stereotypes in the Indian context, especially as compared to the Western context. We finally investigate Instruction Prompting as a simple intervention to mitigate such bias and find that it significantly reduces both stereotypical and anti-stereotypical biases in the majority of cases for GPT-3.5. The findings of this work highlight the need for including more diverse voices when evaluating LLMs.
Submitted 15 September, 2023;
originally announced September 2023.
-
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
Authors:
Paul Röttger,
Hannah Rose Kirk,
Bertie Vidgen,
Giuseppe Attanasio,
Federico Bianchi,
Dirk Hovy
Abstract:
Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This risk motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse to comply with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a systematic way. XSTest comprises 250 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with, and 200 unsafe prompts as contrasts that models, for most applications, should refuse. We describe XSTest's creation and composition, and then use the test suite to highlight systematic failure modes in state-of-the-art language models as well as more general challenges in building safer language models.
Submitted 1 April, 2024; v1 submitted 2 August, 2023;
originally announced August 2023.
-
DoDo Learning: DOmain-DemOgraphic Transfer in Language Models for Detecting Abuse Targeted at Public Figures
Authors:
Angus R. Williams,
Hannah Rose Kirk,
Liam Burke,
Yi-Ling Chung,
Ivan Debono,
Pica Johansson,
Francesca Stevens,
Jonathan Bright,
Scott A. Hale
Abstract:
Public figures receive a disproportionate amount of abuse on social media, impacting their active participation in public life. Automated systems can identify abuse at scale but labelling training data is expensive, complex and potentially harmful. So, it is desirable that systems are efficient and generalisable, handling both shared and specific aspects of online abuse. We explore the dynamics of cross-group text classification in order to understand how well classifiers trained on one domain or demographic can transfer to others, with a view to building more generalisable abuse classifiers. We fine-tune language models to classify tweets targeted at public figures across DOmains (sport and politics) and DemOgraphics (women and men) using our novel DODO dataset, containing 28,000 labelled entries, split equally across four domain-demographic pairs. We find that (i) small amounts of diverse data are hugely beneficial to generalisation and model adaptation; (ii) models transfer more easily across demographics but models trained on cross-domain data are more generalisable; (iii) some groups contribute more to generalisability than others; and (iv) dataset similarity is a signal of transferability.
Submitted 25 April, 2024; v1 submitted 31 July, 2023;
originally announced July 2023.
-
Alignment of dense molecular core morphology and velocity gradients with ambient magnetic fields
Authors:
A. Pandhi,
R. K. Friesen,
L. Fissel,
J. E. Pineda,
P. Caselli,
M. C-Y. Chen,
J. Di Francesco,
A. Ginsburg,
H. Kirk,
P. C. Myers,
S. S. R. Offner,
A. Punanova,
F. Quan,
E. Redaelli,
E. Rosolowsky,
S. Scibelli,
Y. M. Seo,
Y. Shirley
Abstract:
Studies of dense core morphologies and their orientations with respect to gas flows and the local magnetic field have been limited to only a small sample of cores with spectroscopic data. Leveraging the Green Bank Ammonia Survey alongside existing sub-millimeter continuum observations and Planck dust polarization, we produce a cross-matched catalogue of 399 dense cores with estimates of core morphology, size, mass, specific angular momentum, and magnetic field orientation. Of the 399 cores, 329 exhibit 2D $\mathrm{v}_\mathrm{LSR}$ maps that are well fit with a linear gradient, consistent with rotation projected on the sky. We find a best-fit specific angular momentum and core size relationship of $J/M \propto R^{1.82 \pm 0.10}$, suggesting that core velocity gradients originate from a combination of solid body rotation and turbulent motions. Most cores have no preferred orientation between the axis of core elongation, velocity gradient direction, and the ambient magnetic field orientation, favouring a triaxial and weakly magnetized origin. We find, however, strong evidence for a preferred anti-alignment between the core elongation axis and magnetic field for protostellar cores, revealing a change in orientation from starless and prestellar populations that may result from gravitational contraction in a magnetically-regulated (but not dominant) environment. We also find marginal evidence for anti-alignment between the core velocity gradient and magnetic field orientation in the L1228 and L1251 regions of Cepheus, suggesting a preferred orientation with respect to magnetic fields may be more prevalent in regions with locally ordered fields.
Submitted 24 July, 2023;
originally announced July 2023.
-
VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution
Authors:
Siobhan Mackenzie Hall,
Fernanda Gonçalves Abrantes,
Hanwen Zhu,
Grace Sodunke,
Aleksandar Shtedritski,
Hannah Rose Kirk
Abstract:
We introduce VisoGender, a novel dataset for benchmarking gender bias in vision-language models. We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas, where each image is associated with a caption containing a pronoun relationship of subjects and objects in the scene. VisoGender is balanced by gender representation in professional roles, supporting bias evaluation in two ways: i) resolution bias, where we evaluate the difference between pronoun resolution accuracies for image subjects with gender presentations perceived as masculine versus feminine by human annotators and ii) retrieval bias, where we compare ratios of professionals perceived to have masculine and feminine gender presentations retrieved for a gender-neutral search query. We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes. While the direction and magnitude of gender bias depends on the task and the model being evaluated, captioning models are generally less biased than Vision-Language Encoders. Dataset and code are available at https://github.com/oxai/visogender
Submitted 12 December, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Balancing the Picture: Debiasing Vision-Language Datasets with Synthetic Contrast Sets
Authors:
Brandon Smith,
Miguel Farinha,
Siobhan Mackenzie Hall,
Hannah Rose Kirk,
Aleksandar Shtedritski,
Max Bain
Abstract:
Vision-language models are growing in popularity and public visibility to generate, edit, and caption images at scale; but their outputs can perpetuate and amplify societal biases learned during pre-training on uncurated image-text pairs from the internet. Although debiasing methods have been proposed, we argue that these measurements of model bias lack validity due to dataset bias. We demonstrate there are spurious correlations in COCO Captions, the most commonly used dataset for evaluating bias, between background context and the gender of people in-situ. This is problematic because commonly-used bias metrics (such as Bias@K) rely on per-gender base rates. To address this issue, we propose a novel dataset debiasing pipeline to augment the COCO dataset with synthetic, gender-balanced contrast sets, where only the gender of the subject is edited and the background is fixed. However, existing image editing methods have limitations and sometimes produce low-quality images; so, we introduce a method to automatically filter the generated images based on their similarity to real images. Using our balanced synthetic contrast sets, we benchmark bias in multiple CLIP-based models, demonstrating how metrics are skewed by imbalance in the original COCO images. Our results indicate that the proposed approach improves the validity of the evaluation, ultimately contributing to more realistic understanding of bias in vision-language models.
Submitted 24 May, 2023;
originally announced May 2023.
-
Adversarial Nibbler: A Data-Centric Challenge for Improving the Safety of Text-to-Image Models
Authors:
Alicia Parrish,
Hannah Rose Kirk,
Jessica Quaye,
Charvi Rastogi,
Max Bartolo,
Oana Inel,
Juan Ciro,
Rafael Mosquera,
Addison Howard,
Will Cukierski,
D. Sculley,
Vijay Janapa Reddi,
Lora Aroyo
Abstract:
The generative AI revolution in recent years has been spurred by an expansion in compute power and data quantity, which together enable extensive pre-training of powerful text-to-image (T2I) models. With their greater capabilities to generate realistic and creative content, these T2I models like DALL-E, MidJourney, Imagen or Stable Diffusion are reaching ever wider audiences. Any unsafe behaviors inherited from pretraining on uncurated internet-scraped datasets thus have the potential to cause wide-reaching harm, for example, through generated images which are violent, sexually explicit, or contain biased and derogatory stereotypes. Despite this risk of harm, we lack systematic and structured evaluation datasets to scrutinize model behavior, especially adversarial attacks that bypass existing safety filters. A typical bottleneck in safety evaluation is achieving a wide coverage of different types of challenging examples in the evaluation set, i.e., identifying 'unknown unknowns' or long-tail problems. To address this need, we introduce the Adversarial Nibbler challenge. The goal of this challenge is to crowdsource a diverse set of failure modes and reward challenge participants for successfully finding safety vulnerabilities in current state-of-the-art T2I models. Ultimately, we aim to provide greater awareness of these issues and assist developers in improving the future safety and reliability of generative AI models. Adversarial Nibbler is a data-centric challenge, part of the DataPerf challenge suite, organized and supported by Kaggle and MLCommons.
Submitted 22 May, 2023;
originally announced May 2023.
-
Assessing Language Model Deployment with Risk Cards
Authors:
Leon Derczynski,
Hannah Rose Kirk,
Vidhisha Balachandran,
Sachin Kumar,
Yulia Tsvetkov,
M. R. Leiser,
Saif Mohammad
Abstract:
This paper introduces RiskCards, a framework for structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and also more subtle or emergent undesirable tendencies to the generated text. Prior work establishes a wide variety of language model harms to many different actors: existing taxonomies identify categories of harms posed by language models; benchmarks establish automated tests of these harms; and documentation standards for models, tasks and datasets encourage transparent reporting. However, there is no risk-centric framework for documenting the complexity of a landscape in which some risks are shared across models and contexts, while others are specific, and where certain conditions may be required for risks to manifest as harms. RiskCards address this methodological gap by providing a generic framework for assessing the use of a given language model in a given scenario. Each RiskCard makes clear the routes for the risk to manifest harm, their placement in harm taxonomies, and example prompt-output pairs. While RiskCards are designed to be open-source, dynamic and participatory, we present a "starter set" of RiskCards taken from a broad literature survey, each of which details a concrete risk presentation. Language model RiskCards initiate a community knowledge base which permits the mapping of risks and harms to a specific model or its application scenario, ultimately contributing to a better, safer and shared understanding of the risk landscape.
Submitted 31 March, 2023;
originally announced March 2023.
-
Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback
Authors:
Hannah Rose Kirk,
Bertie Vidgen,
Paul Röttger,
Scott A. Hale
Abstract:
Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing. This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. While alignment techniques like reinforcement learning with human feedback (RLHF) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. Different people may legitimately disagree on their preferences for language and conversational norms, as well as on values or ideologies which guide their communication. Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds of a societally-acceptable and safe degree of personalisation. In this paper, we ask how, and in what ways, LLMs should be personalised. First, we review literature on current paradigms for aligning LLMs with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in who we are really aligning to. Second, we present a taxonomy of benefits and risks associated with personalised LLMs, for individuals and society at large. Finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable LLM-behaviours within (supra-)national and organisational bounds.
Submitted 9 March, 2023;
originally announced March 2023.
-
SemEval-2023 Task 10: Explainable Detection of Online Sexism
Authors:
Hannah Rose Kirk,
Wenjie Yin,
Bertie Vidgen,
Paul Röttger
Abstract:
Online sexism is a widespread and harmful phenomenon. Automated tools can assist the detection of sexism at scale. Binary detection, however, disregards the diversity of sexist content, and fails to provide clear explanations for why something is sexist. To address this issue, we introduce SemEval Task 10 on the Explainable Detection of Online Sexism (EDOS). We make three main contributions: i) a novel hierarchical taxonomy of sexist content, which includes granular vectors of sexism to aid explainability; ii) a new dataset of 20,000 social media comments with fine-grained labels, along with larger unlabelled datasets for model adaptation; and iii) baseline models as well as an analysis of the methods, results and errors for participant submissions to our task.
Submitted 8 May, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Auditing large language models: a three-layered approach
Authors:
Jakob Mökander,
Jonas Schuett,
Hannah Rose Kirk,
Luciano Floridi
Abstract:
Large language models (LLMs) represent a major advance in artificial intelligence (AI) research. However, the widespread use of LLMs is also coupled with significant ethical and social challenges. Previous research has pointed towards auditing as a promising governance mechanism to help ensure that AI systems are designed and deployed in ways that are ethical, legal, and technically robust. However, existing auditing procedures fail to address the governance challenges posed by LLMs, which display emergent capabilities and are adaptable to a wide range of downstream tasks. In this article, we address that gap by outlining a novel blueprint for how to audit LLMs. Specifically, we propose a three-layered approach, whereby governance audits (of technology providers that design and disseminate LLMs), model audits (of LLMs after pre-training but prior to their release), and application audits (of applications based on LLMs) complement and inform each other. We show how audits, when conducted in a structured and coordinated manner on all three levels, can be a feasible and effective mechanism for identifying and managing some of the ethical and social risks posed by LLMs. However, it is important to remain realistic about what auditing can reasonably be expected to achieve. Therefore, we discuss the limitations not only of our three-layered approach but also of the prospect of auditing LLMs at all. Ultimately, this article seeks to expand the methodological toolkit available to technology providers and policymakers who wish to analyse and evaluate LLMs from technical, ethical, and legal perspectives.
Submitted 27 June, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
Open Set Wireless Signal Classification: Augmenting Deep Learning with Expert Feature Classifiers
Authors:
Samuel R. Shebert,
Benjamin H. Kirk,
R. Michael Buehrer
Abstract:
In shared spectrum with multiple radio access technologies, wireless standard classification is vital for applications such as dynamic spectrum access (DSA) and wideband spectrum monitoring. However, interfering signals and the presence of unknown classes of signals can diminish classification accuracy. To reduce interference, signals can be isolated in time, frequency, and space, but the isolation process adds distortion that reduces the accuracy of deep learning classifiers. We find that the distortion can be partially mitigated by augmenting the classifier training data with the signal isolation steps. To address unknown signals, we propose an open set hybrid classifier, which combines deep learning and expert feature classifiers to leverage the reliability and explainability of expert feature classifiers and the lower computational complexity of deep learning classifiers. The hybrid classifier reduces the computational complexity by 2 to 7 times on average compared to the expert feature classifiers, while achieving an accuracy of 95% at 15 dB SNR for known signal classes. The hybrid classifier manages to detect unknown classes at nearly 100% accuracy, due to the robustness of the expert feature classifiers.
Submitted 7 February, 2023;
originally announced February 2023.
-
Velocity-Coherent Substructure in TMC-1: Inflow and Fragmentation
Authors:
Simon E. T. Smith,
Rachel Friesen,
Antoine Marchal,
Jaime E. Pineda,
Paola Caselli,
Michael Chun-Yuan Chen,
Spandan Choudhury,
James Di Francesco,
Adam Ginsburg,
Helen Kirk,
Chris Matzner,
Anna Punanova,
Samantha Scibelli,
Yancy Shirley
Abstract:
Filamentary structures have been found nearly ubiquitously in molecular clouds and yet their formation and evolution is still poorly understood. We examine a segment of Taurus Molecular Cloud 1 (TMC-1) that appears as a single, narrow filament in continuum emission from dust. We use the Regularized Optimization for Hyper-Spectral Analysis (ROHSA), a Gaussian decomposition algorithm which enforces spatial coherence when fitting multiple velocity components simultaneously over a data cube. We analyze HC$_5$N (9-8) line emission as part of the Green Bank Ammonia Survey (GAS) and identify three velocity-coherent components with ROHSA. The two brightest components extend the length of the filament, while the third component is fainter and clumpier. The brightest component has a prominent transverse velocity gradient of $2.7 \pm 0.1$ km s$^{-1}$ pc$^{-1}$ that we show to be indicative of gravitationally induced inflow. In the second component, we identify regularly spaced emission peaks along its length. We show that the local minima between pairs of adjacent HC$_5$N peaks line up closely with submillimetre continuum emission peaks, which we argue is evidence for fragmentation along the spine of TMC-1. While coherent velocity components have been described as separate physical structures in other star-forming filaments, we argue that the two bright components identified in HC$_5$N emission in TMC-1 are tracing two layers in one filament: a lower density outer layer whose material is flowing under gravity towards the higher density inner layer of the filament.
Submitted 6 February, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Is More Data Better? Re-thinking the Importance of Efficiency in Abusive Language Detection with Transformers-Based Active Learning
Authors:
Hannah Rose Kirk,
Bertie Vidgen,
Scott A. Hale
Abstract:
Annotating abusive language is expensive, logistically complex and creates a risk of psychological harm. However, most machine learning research has prioritized maximizing effectiveness (i.e., F1 or accuracy score) rather than data efficiency (i.e., minimizing the amount of data that is annotated). In this paper, we use simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach to substantially raise efficiency whilst still maintaining high effectiveness, especially when abusive content is a smaller percentage of the dataset. This approach requires a fraction of labeled data to reach performance equivalent to training over the full dataset.
Submitted 21 September, 2022;
originally announced September 2022.
-
DataPerf: Benchmarks for Data-Centric AI Development
Authors:
Mark Mazumder,
Colby Banbury,
Xiaozhe Yao,
Bojan Karlaš,
William Gaviria Rojas,
Sudnya Diamos,
Greg Diamos,
Lynn He,
Alicia Parrish,
Hannah Rose Kirk,
Jessica Quaye,
Charvi Rastogi,
Douwe Kiela,
David Jurado,
David Kanter,
Rafael Mosquera,
Juan Ciro,
Lora Aroyo,
Bilge Acun,
Lingjiao Chen,
Mehul Smriti Raje,
Max Bartolo,
Sabri Eyuboglu,
Amirata Ghorbani,
Emmett Goodman,
et al. (20 additional authors not shown)
Abstract:
Machine learning research has long focused on models rather than datasets, and prominent datasets are used for common ML tasks without regard to the breadth, difficulty, and faithfulness of the underlying problems. Neglecting the fundamental importance of data has given rise to inaccuracy, bias, and fragility in real-world applications, and research is hindered by saturation across existing dataset benchmarks. In response, we present DataPerf, a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We aim to foster innovation in data-centric AI through competition, comparability, and reproducibility. We enable the ML community to iterate on datasets, instead of just architectures, and we provide an open, online platform with multiple rounds of challenges to support this iterative development. The first iteration of DataPerf contains five benchmarks covering a wide spectrum of data-centric techniques, tasks, and modalities in vision, speech, acquisition, debugging, and diffusion prompting, and we support hosting new contributed benchmarks from the community. The benchmarks, online evaluation platform, and baseline implementations are open source, and the MLCommons Association will maintain DataPerf to ensure long-term benefits to academia and industry.
Submitted 13 October, 2023; v1 submitted 20 July, 2022;
originally announced July 2022.
-
Looking for a Handsome Carpenter! Debiasing GPT-3 Job Advertisements
Authors:
Conrad Borchers,
Dalia Sara Gala,
Benjamin Gilburt,
Eduard Oravkin,
Wilfried Bounsi,
Yuki M. Asano,
Hannah Rose Kirk
Abstract:
The growing capability and availability of generative language models has enabled a wide range of new downstream tasks. Academic research has identified, quantified and mitigated biases present in language models but is rarely tailored to downstream tasks where wider impact on individuals and society can be felt. In this work, we leverage one popular generative language model, GPT-3, with the goal of writing unbiased and realistic job advertisements. We first assess the bias and realism of zero-shot generated advertisements and compare them to real-world advertisements. We then evaluate prompt-engineering and fine-tuning as debiasing methods. We find that prompt-engineering with diversity-encouraging prompts gives no significant improvement to bias, nor realism. Conversely, fine-tuning, especially on unbiased real advertisements, can improve realism and reduce bias.
Submitted 23 May, 2022;
originally announced May 2022.
-
Handling and Presenting Harmful Text in NLP Research
Authors:
Hannah Rose Kirk,
Abeba Birhane,
Bertie Vidgen,
Leon Derczynski
Abstract:
Text data can pose a risk of harm. However, the risks are not fully understood, and how to handle, present, and discuss harmful text in a safe way remains an unresolved issue in the NLP community. We provide an analytical framework categorising harms on three axes: (1) the harm type (e.g., misinformation, hate speech or racial stereotypes); (2) whether a harm is "sought" as a feature of the research design if explicitly studying harmful content (e.g., training a hate speech classifier), versus "unsought" if harmful content is encountered when working on unrelated problems (e.g., language generation or part-of-speech tagging); and (3) who it affects, from people (mis)represented in the data to those handling the data and those publishing on the data. We provide advice for practitioners, with concrete steps for mitigating harm in research and in publication. To assist implementation we introduce HarmCheck, a documentation standard for handling and presenting harmful text in research.
Submitted 24 February, 2023; v1 submitted 29 April, 2022;
originally announced April 2022.
-
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning
Authors:
Hugo Berg,
Siobhan Mackenzie Hall,
Yash Bhalgat,
Wonsuk Yang,
Hannah Rose Kirk,
Aleksandar Shtedritski,
Max Bain
Abstract:
Vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddings to text queries that are jointly trained with adversarial debiasing and a contrastive loss reduces various bias measures with minimal degradation to the image-text representation.
Submitted 25 October, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate
Authors:
Hannah Rose Kirk,
Bertram Vidgen,
Paul Röttger,
Tristan Thrush,
Scott A. Hale
Abstract:
Detecting online hate is a complex task, and low-performing models have harmful consequences when used for sensitive applications such as content moderation. Emoji-based hate is an emerging challenge for automated detection. We present HatemojiCheck, a test suite of 3,930 short-form statements that allows us to evaluate performance on hateful language expressed with emoji. Using the test suite, we expose weaknesses in existing hate detection models. To address these weaknesses, we create the HatemojiBuild dataset using a human-and-model-in-the-loop approach. Models built with these 5,912 adversarial examples perform substantially better at detecting emoji-based hate, while retaining strong performance on text-only hate. Both HatemojiCheck and HatemojiBuild are made publicly available. See our Github Repository (https://github.com/HannahKirk/Hatemoji). HatemojiCheck, HatemojiBuild, and the final Hatemoji Model are also available on HuggingFace (https://huggingface.co/datasets/HannahRoseKirk/).
Submitted 6 May, 2022; v1 submitted 12 August, 2021;
originally announced August 2021.
-
Are massive dense clumps truly sub-virial? A new analysis using Gould Belt ammonia data
Authors:
Ayushi Singh,
Christopher D. Matzner,
Rachel K. Friesen,
Peter G. Martin,
Jaime E. Pineda,
Erik W. Rosolowsky,
Felipe Alves,
Ana Chacón-Tanarro,
Hope How-Huan Chen,
Michael Chun-Yuan Chen,
Spandan Choudhury,
James Di Francesco,
Jared Keown,
Helen Kirk,
Anna Punanova,
Youngmin Seo,
Yancy Shirley,
Adam Ginsburg,
Stella S. R. Offner,
Héctor G. Arce,
Paola Caselli,
Alyssa A. Goodman,
Philip C. Myers,
Elena Redaelli
Abstract:
Dynamical studies of dense structures within molecular clouds often conclude that the most massive clumps contain too little kinetic energy for virial equilibrium, unless they are magnetized to an unexpected degree. This raises questions about how such a state might arise, and how it might persist long enough to represent the population of massive clumps. In an effort to re-examine the origins of this conclusion, we use ammonia line data from the Green Bank Ammonia Survey and Planck-calibrated dust emission data from Herschel to estimate the masses and kinetic and gravitational energies for dense clumps in the Gould Belt clouds. We show that several types of systematic error can enhance the appearance of low kinetic-to-gravitational energy ratios: insufficient removal of foreground and background material; ignoring the kinetic energy associated with velocity differences across a resolved cloud; and over-correcting for stratification when evaluating the gravitational energy. Using an analysis designed to avoid these errors, we find that the most massive Gould Belt clumps harbor virial motions, rather than sub-virial ones. As a byproduct, we present a catalog of masses, energies, and virial energy ratios for 85 Gould Belt clumps.
Submitted 11 August, 2021;
originally announced August 2021.
-
The JCMT Transient Survey: Four Year Summary of Monitoring the Submillimeter Variability of Protostars
Authors:
Yong-Hee Lee,
Doug Johnstone,
Jeong-Eun Lee,
Gregory Herczeg,
Steve Mairs,
Carlos Contreras-Peña,
Jennifer Hatchell,
Tim Naylor,
Graham S. Bell,
Tyler L. Bourke,
Colton Broughton,
Logan Francis,
Aashish Gupta,
Daniel Harsono,
Sheng-Yuan Liu,
Geumsook Park,
Spencer Plovie,
Gerald H. Moriarty-Schieven,
Aleks Scholz,
Tanvi Sharma,
Paula Stella Teixeira,
Yao-Te Wang,
Yuri Aikawa,
Geoffrey C. Bower,
Huei-Ru Vivien Chen
, et al. (27 additional authors not shown)
Abstract:
We present the four-year survey results of monthly submillimeter monitoring of eight nearby ($<500$ pc) star-forming regions by the JCMT Transient Survey. We apply the Lomb-Scargle Periodogram technique to search for and characterize variability on 295 submillimeter peaks brighter than 0.14 Jy beam$^{-1}$, including 22 disk sources (Class II), 83 protostars (Class 0/I), and 190 starless sources. We uncover 18 secular variables, all of them protostars. No single-epoch burst or drop events and no inherently stochastic sources are observed. We classify the secular variables by their timescales into three groups: Periodic, Curved, and Linear. For the Curved and Periodic cases, the detectable fractional amplitude, with respect to mean peak brightness, is $\sim4$ % for sources brighter than $\sim0.5$ Jy beam$^{-1}$. Limiting our sample to only these bright sources, the observed variable fraction is 37 % (16 out of 43). Considering source evolution, we find a similar fraction of bright variables for both Class 0 and Class I. Using an empirically motivated conversion from submillimeter variability to variation in mass accretion rate, six sources (7 % of our full sample) are predicted to have years-long accretion events during which the excess mass accreted reaches more than 40 % above the total quiescently accreted mass: two previously known eruptive Class I sources, V1647 Ori and EC 53 (V371 Ser), and four Class 0 sources, HOPS 356, HOPS 373, HOPS 383, and West 40. Considering the full protostellar ensemble, the importance of episodic accretion on timescales of a few years is negligible, accounting for only a few percent of the assembled mass. However, given that this accretion is dominated by events of order the observing time window, it remains uncertain whether the importance of episodic events will continue to rise with decades-long monitoring.
Submitted 22 July, 2021;
originally announced July 2021.
-
Memes in the Wild: Assessing the Generalizability of the Hateful Memes Challenge Dataset
Authors:
Hannah Rose Kirk,
Yennie Jun,
Paulius Rauba,
Gal Wachtel,
Ruining Li,
Xingjian Bai,
Noah Broestl,
Martin Doff-Sotta,
Aleksandar Shtedritski,
Yuki M. Asano
Abstract:
Hateful memes pose a unique challenge for current machine learning systems because their message is derived from both text and visual modalities. To this effect, Facebook released the Hateful Memes Challenge, a dataset of memes with pre-extracted text captions, but it is unclear whether these synthetic examples generalize to `memes in the wild'. In this paper, we collect hateful and non-hateful memes from Pinterest to evaluate the out-of-sample performance of models pre-trained on the Facebook dataset. We find that memes in the wild differ in two key aspects: 1) Captions must be extracted via OCR, injecting noise and diminishing performance of multimodal models, and 2) Memes are more diverse than `traditional memes', including screenshots of conversations or text on a plain background. This paper thus serves as a reality check for the current benchmark of hateful meme detection and its applicability for detecting real world hate.
Submitted 9 July, 2021;
originally announced July 2021.
-
The JCMT Gould Belt Survey: radiative heating by OB stars
Authors:
Damian Rumble,
Jennifer Hatchell,
Helen Kirk,
Kate Pattle
Abstract:
Radiative feedback can influence subsequent star formation. We quantify the heating from OB stars in the local star-forming regions in the JCMT Gould Belt survey. Dust temperatures are calculated from 450/850 micron flux ratios from SCUBA-2 observations at the JCMT assuming a fixed dust opacity spectral index $β=1.8$. Mean dust temperatures are calculated for each submillimetre clump along with projected distances from the main OB star in the region. Temperature vs. distance is fit with a simple model of dust heating by the OB star radiation plus the interstellar radiation field and dust cooling through optically thin radiation. Classifying the heating sources by spectral type, O-type stars produce the greatest clump average temperature rises and largest heating extent, with temperatures over 40 K and significant heating out to at least 2.4 pc. Early-type B stars (B4 and above) produce temperatures of over 20 K and significant heating over 0.4 pc. Late-type B stars show a marginal heating effect within 0.2 pc. For a given projected distance, there is a significant scatter in clump temperatures that is due to local heating by other luminous stars in the region, projection effects, or shadowing effects. Even in these local, `low-mass' star-forming regions, radiative feedback is having an effect on parsec scales, with 24% of the clumps heated to at least 3 K above the 15 K base temperature expected from heating by only the interstellar radiation field, and a mean dust temperature for heated clumps of 24 K.
Submitted 7 May, 2021;
originally announced May 2021.
-
Transition from Coherent Cores to Surrounding Cloud in L1688
Authors:
Spandan Choudhury,
Jaime E. Pineda,
Paola Caselli,
Stella S. R. Offner,
Erik Rosolowsky,
Rachel K. Friesen,
Elena Redaelli,
Ana Chacón-Tanarro,
Yancy Shirley,
Anna Punanova,
Helen Kirk
Abstract:
Stars form in cold dense cores showing subsonic velocity dispersions. The parental molecular clouds display higher temperatures and supersonic velocity dispersions. The transition from core to cloud has been observed in velocity dispersion, but temperature and abundance variations are unknown. We aim to study the transition from cores to ambient cloud in temperature and velocity dispersion using a single tracer.
We use NH3 (1,1) and (2,2) maps in L1688 from the Green Bank Ammonia Survey, smoothed to 1', and determine the physical properties from fits. We identify the coherent cores and study the changes in temperature and velocity dispersion from cores to the surrounding cloud. We obtain a kinetic temperature map tracing the extended cloud, improving from previous maps tracing mostly the cores. The cloud is 4-6 K warmer than the cores, and shows a larger velocity dispersion (diff. = 0.15-0.25 km/s). Comparing to Herschel-based measurements, we find that cores show kinetic temperature $\approx$1.8 K lower than the dust temperature; while the gas temperature is higher than the dust temperature in the cloud. We find an average p-NH3 fractional abundance (with respect to H2) of $(4.2\pm0.2) \times 10^{-9}$ towards the coherent cores, and $(1.4\pm0.1) \times 10^{-9}$ outside the core boundaries. Using stacked spectra, we detect two components, one narrow and one broad, towards cores and their neighbourhoods. We find the turbulence in the narrow component to be correlated to the size of the structure (Pearson-r=0.54). With these unresolved regional measurements, we obtain a turbulence-size relation of $σ_{v,NT}\propto r^{0.5}$, similar to previous findings using multiple tracers.
We discover that the subsonic component extends up to 0.15 pc beyond the typical coherent boundaries, unveiling larger extents of the coherent cores and showing gradual transition to coherence over ~0.2 pc.
Submitted 12 February, 2021;
originally announced February 2021.
-
Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
Authors:
Hannah Kirk,
Yennie Jun,
Haider Iqbal,
Elias Benussi,
Filippo Volpin,
Frederic A. Dreyer,
Aleksandar Shtedritski,
Yuki M. Asano
Abstract:
The capabilities of natural language models trained on large-scale data have increased immensely over the past few years. Open source libraries such as HuggingFace have made these models easily available and accessible. While prior research has identified biases in large language models, this paper considers biases contained in the most popular versions of these models when applied `out-of-the-box' for downstream tasks. We focus on generative language models as they are well-suited for extracting biases inherited from training data. Specifically, we conduct an in-depth analysis of GPT-2, which is the most downloaded text generation model on HuggingFace, with over half a million downloads per month. We assess biases related to occupational associations for different protected categories by intersecting gender with religion, sexuality, ethnicity, political affiliation, and continental name origin. Using a template-based data collection pipeline, we collect 396K sentence completions made by GPT-2 and find: (i) The machine-predicted jobs are less diverse and more stereotypical for women than for men, especially for intersections; (ii) Intersectional interactions are highly relevant for occupational associations, which we quantify by fitting 262 logistic models; (iii) For most occupations, GPT-2 reflects the skewed gender and ethnicity distribution found in US Labor Bureau data, and even pulls the societally-skewed distribution towards gender parity in cases where its predictions deviate from real labor market observations. This raises the normative question of what language models should learn - whether they should reflect or correct for existing inequalities.
Submitted 27 October, 2021; v1 submitted 8 February, 2021;
originally announced February 2021.
-
Ubiquitous $\rm NH_3$ supersonic component in L1688 coherent cores
Authors:
Spandan Choudhury,
Jaime E. Pineda,
Paola Caselli,
Adam Ginsburg,
Stella S. R. Offner,
Erik Rosolowsky,
Rachel K. Friesen,
Felipe O. Alves,
Ana Chacón-Tanarro,
Anna Punanova,
Elena Redaelli,
Helen Kirk,
Philip C. Myers,
Peter G. Martin,
Yancy Shirley,
Michael Chun-Yuan Chen,
Alyssa A. Goodman,
James Di Francesco
Abstract:
Context : Star formation takes place in cold dense cores in molecular clouds. Earlier observations have found that dense cores exhibit subsonic non-thermal velocity dispersions. In contrast, CO observations show that the ambient large-scale cloud is warmer and has supersonic velocity dispersions. Aims : We aim to study the ammonia ($\rm NH_3$) molecular line profiles with exquisite sensitivity towards the coherent cores in L1688 in order to study their kinematical properties in unprecedented detail. Methods : We used $\rm NH_3$ (1,1) and (2,2) data from the first data release (DR1) in the Green Bank Ammonia Survey (GAS). We first smoothed the data to a larger beam of 1' to obtain substantially more extended maps of velocity dispersion and kinetic temperature, compared to the DR1 maps. We then identified the coherent cores in the cloud and analysed the averaged line profiles towards the cores. Results : For the first time, we detected a faint (mean $\rm NH_3$(1,1) peak brightness $<$0.25 K in $T_{MB}$), supersonic component towards all the coherent cores in L1688. We fitted two components, one broad and one narrow, and derived the kinetic temperature and velocity dispersion of each component. The broad components towards all cores have supersonic linewidths ($\mathcal{M}_S \ge 1$). This component biases the estimate of the narrow dense core component's velocity dispersion by $\approx$28% and the kinetic temperature by $\approx$10%, on average, as compared to the results from single-component fits. Conclusions : Neglecting this ubiquitous presence of a broad component towards all coherent cores causes the typical single-component fit to overestimate the temperature and velocity dispersion. This affects the derived detailed physical structure and stability of the cores estimated from $\rm NH_3$ observations.
Submitted 20 July, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Relative Alignment between Dense Molecular Cores and Ambient Magnetic Field: The Synergy of Numerical Models and Observations
Authors:
Che-Yu Chen,
Erica A. Behrens,
Jasmin E. Washington,
Laura M. Fissel,
Rachel K. Friesen,
Zhi-Yun Li,
Jaime E. Pineda,
Adam Ginsburg,
Helen Kirk,
Samantha Scibelli,
Felipe Alves,
Elena Redaelli,
Paola Caselli,
Anna Punanova,
James Di Francesco,
Erik Rosolowsky,
Stella S. R. Offner,
Peter G. Martin,
Ana Chacón-Tanarro,
Hope H. -H. Chen,
Michael C. -Y. Chen,
Jared Keown,
Youngmin Seo,
Yancy Shirley,
Hector G. Arce
, et al. (4 additional authors not shown)
Abstract:
The role played by the magnetic field during star formation is an important topic in astrophysics. We investigate the correlation between the orientation of star-forming cores (as defined by the core major axes) and ambient magnetic field directions in 1) a 3D MHD simulation, 2) synthetic observations generated from the simulation at different viewing angles, and 3) observations of nearby molecular clouds. We find that the results on relative alignment between cores and background magnetic field in synthetic observations slightly disagree with those measured in fully 3D simulation data, which is partly because cores identified in projected 2D maps tend to coexist within filamentary structures, while 3D cores are generally more rounded. In addition, we examine the progression of the magnetic field from pc- to core-scale in the simulation, which is consistent with the anisotropic core formation model in which gas preferentially flows along the magnetic field toward dense cores. When comparing the observed cores identified from the GBT Ammonia Survey (GAS) and Planck polarization-inferred magnetic field orientations, we find that the relative core-field alignment has a regional dependence among different clouds. More specifically, we find that dense cores in the Taurus molecular cloud tend to align perpendicular to the background magnetic field, while those in Perseus and Ophiuchus tend to have random (Perseus) or slightly parallel (Ophiuchus) orientations with respect to the field. We argue that this feature of relative core-field orientation could be used to probe the relative significance of the magnetic field within the cloud.
Submitted 24 March, 2020;
originally announced March 2020.
-
Opportunities and Outcomes for Postdocs in Canada
Authors:
Henry Ngo,
Helen Kirk,
Toby Brown,
Tyrone E. Woods,
Gwendolyn Eadie,
Samantha Lawler,
Locke Spencer
Abstract:
Currently, postdoctoral fellow (PDF) researchers in Canada face challenges due to the precarious nature of their employment and their overall low compensation and benefits coverage. This report presents three themes, written as statements of need, to support an inclusive and thriving PDF community. These themes are the need for better terms of employment and conditions, the need for access to grants by non-permanent research staff, and the need for a sustainable PDF hiring model that considers the outcomes for the PDFs.
We make six recommendations:
R1. PDFs should be hired and compensated as skilled experts in their areas, not as trainees.
R2. Standard PDF hiring practices should be revised to be more inclusive of different life circumstances.
- R2.1 Allow PDFs the option of part-time employment.
- R2.2 Remove years-since-PhD time limits from PDF jobs.
- R2.3 Financially support PDF hires for relocation and visa expenses.
R3. CASCA should form a committee to advocate for and provide support to astronomy PDFs in Canada.
R4. CASCA should encourage universities to create offices dedicated to their PDFs.
R5. PDFs and other PhD-holding term researchers with a host institution should be able to compete for and win grants to self-fund their own research.
R6. Astronomy in Canada should hire general-purpose continuing support scientist positions instead of term PDFs to fill project or mission-specific requirements.
In short, we ask for prioritization of people over production of papers.
Submitted 23 November, 2019;
originally announced November 2019.
-
A Human Action Descriptor Based on Motion Coordination
Authors:
Pietro Falco,
Matteo Saveriano,
Eka Gibran Hasany,
Nicholas H. Kirk,
Dongheui Lee
Abstract:
In this paper, we present a descriptor for human whole-body actions based on motion coordination. We exploit the principle, well known in neuromechanics, that humans move their joints in a coordinated fashion. Our coordination-based descriptor (CODE) is computed in two main steps. The first step identifies the most informative joints, which characterize the motion. The second step enriches the descriptor by considering minimum and maximum joint velocities and the correlations between the most informative joints. In order to compute the distances between action descriptors, we propose a novel correlation-based similarity measure. The performance of CODE is tested on two public datasets, namely HDM05 and Berkeley MHAD, and compared with state-of-the-art approaches, showing recognition results.
Submitted 20 November, 2019;
originally announced November 2019.
-
The Formation of Stars -- From Filaments to Cores to Protostars and Protoplanetary Disks
Authors:
James Di Francesco,
Helen Kirk,
Doug Johnstone,
Ralph Pudritz,
Shantanu Basu,
Sarah Sadavoy,
Laura Fissel,
Lewis Knee,
Mehrnoosh Tahani,
Rachel Friesen,
Simon Coudé,
Erik Rosolowsky,
Nienke van der Marel,
Michel Fich,
Christine Wilson,
Chris Matzner,
Ruobing Dong,
Brenda Matthews,
Gerald Schieven
Abstract:
Star formation involves the flow of gas and dust within molecular clouds into protostars and young stellar objects (YSOs) due to gravity. Along the way, these flows are shaped significantly by many other mechanisms, including pressure, turbulent motions, magnetic fields, stellar feedback, jets, and angular momentum. How all these mechanisms interact nonlinearly with each other on various length scales leads to the formation and evolution of substructures within clouds, including filaments, clumps, cores, disks, outflows, the protostars/YSOs themselves, and planets. In this white paper, prepared for the 2020 Long Range Plan panel which will recommend Canada's future directions for astronomy, we describe the observational and theoretical leadership in the star formation field that Canada's vibrant community has demonstrated over the past decade. Drawing from this extensive background, we identify five key questions that must be addressed for further progress to be made in understanding star formation in the next decade. Addressing these questions will improve our understanding of the dynamics of the dense gas and the role of the magnetic field in star formation, the optical properties of the dust used to trace mass and magnetic fields, the sources of variability in star-forming objects on short timescales, and the physical processes that specifically promote the clustering of stars. We further highlight key facilities in which Canada should become involved to continue making progress in this field. Single-dish facilities we recommend include LSST, trans-atmospheric far-infrared telescopes like BLAST-TNG and SPICA, and ground-based telescopes like JCMT, GBT, and CCAT-p. Interferometric facilities we recommend include ALMA, ngVLA, and SKA1.
Submitted 9 November, 2019; v1 submitted 5 November, 2019;
originally announced November 2019.
-
The Next Generation Very Large Array
Authors:
James Di Francesco,
Dean Chalmers,
Nolan Denman,
Laura Fissel,
Rachel Friesen,
Bryan Gaensler,
Julie Hlavacek-Larrondo,
Helen Kirk,
Brenda Matthews,
Christopher O'Dea,
Tim Robishaw,
Erik Rosolowsky,
Michael Rupen,
Sarah Sadavoy,
Samar Safi-Harb,
Greg Sivakoff,
Mehrnoosh Tahani,
Nienke van der Marel,
Jacob White,
Christine Wilson
Abstract:
The next generation Very Large Array (ngVLA) is a transformational radio observatory being designed by the U.S. National Radio Astronomy Observatory (NRAO). It will provide order-of-magnitude improvements in sensitivity, resolution, and uv coverage over the current Jansky Very Large Array (VLA) at ~1.2-50 GHz and extend the frequency range up to 70-115 GHz. This document is a white paper written by members of the Canadian community for the 2020 Long Range Plan panel, which will be making recommendations on Canada's future directions in astronomy. Since Canadians have historically been major users of the VLA and have been valued partners with NRAO for ALMA, Canada's participation in ngVLA is welcome. Canadians have been actively involved in ngVLA discussions for the past five years, and have played leadership roles in the ngVLA Science and Technical Advisory Councils. Canadian technologies are also very attractive for the ngVLA, in particular our designs for radio antennas, receivers, correlators, and data archives, and our industrial capacities to realize them. Indeed, the Canadian designs for the ngVLA antennas and correlator/beamformer are presently the baseline models for the project. Given the size of Canada's radio community and earlier use of the VLA (and ALMA), we recommend Canadian participation in the ngVLA at the 7% level. Such participation would be significant enough to allow Canadian leadership in ngVLA's construction and usage. Canada's participation in ngVLA should not preclude its participation in SKA; access to both facilities is necessary to meet Canada's radio astronomy needs. Indeed, ngVLA will fill the gap between those radio frequencies observable with the SKA and ALMA at high sensitivities and resolutions. Canada's partnership in ngVLA will give it access to cutting-edge facilities together covering approximately three orders of magnitude in frequency.
Submitted 4 November, 2019;
originally announced November 2019.
-
Development Plans for the Atacama Large Millimeter/submillimeter Array (ALMA)
Authors:
Christine Wilson,
Scott Chapman,
Ruobing Dong,
James di Francesco,
Laura Fissel,
Doug Johnstone,
Helen Kirk,
Brenda Matthews,
Brian McNamara,
Erik Rosolowsky,
Michael Rupen,
Sarah Sadavoy,
Douglas Scott,
Nienke van der Marel
Abstract:
(abridged) The Atacama Large Millimeter/submillimeter Array (ALMA) was the top-ranked priority for a new ground-based facility in the 2000 Canadian Long Range Plan. Ten years later, at the time of LRP2010, ALMA construction was well underway, with first science observations anticipated for 2011. In the past 8 years, ALMA has proved itself to be a high-impact, high-demand observatory, with record numbers of proposals submitted to the annual calls and large numbers of highly cited scientific papers across fields from protoplanetary disks to high-redshift galaxies and quasars.
The LRP2010 ALMA white paper laid out 8 specific metrics that could be used to judge the success of Canada's participation in ALMA. Among these metrics were publications (number; impact), collaborations (international; multi-wavelength), and student training. To call out one particular metric, Canadians are making excellent use of ALMA in training graduate students and postdocs: as of June 2018, 12 of 23 Canadian first-author papers were led by a graduate student, and a further 4 papers were led by postdocs. All 8 metrics argue for Canada's involvement in ALMA over the past decade to be judged a success. The successful achievement of these wide-ranging goals argues strongly for Canada's continuing participation in ALMA over the next decade and beyond.
Looking forward, our community needs to: (1) maintain Canadian access to ALMA and our competitiveness in using ALMA; (2) preserve full Canadian funding for our share of ALMA operations; (3) identify components of ALMA development in which Canada can play a significant role, including stimulating expertise in submillimetre instrumentation to capitalize on future opportunities; and (4) keep Canadians fully trained and engaged in ALMA, as new capabilities become available, reaching the widest possible community of potential users.
Submitted 9 October, 2019;
originally announced October 2019.
-
KFPA Examinations of Young STellar Object Natal Environments (KEYSTONE): Hierarchical Ammonia Structures in Galactic Giant Molecular Clouds
Authors:
Jared Keown,
James Di Francesco,
Erik Rosolowsky,
Ayushi Singh,
Charles Figura,
Helen Kirk,
L. D. Anderson,
Michael Chun-Yuan Chen,
Davide Elia,
Rachel Friesen,
Adam Ginsburg,
A. Marston,
Stefano Pezzuto,
Eugenio Schisano,
Sylvain Bontemps,
Paola Caselli,
Hong-Li Liu,
Steven Longmore,
Frederique Motte,
Philip C. Myers,
Stella S. R. Offner,
Patricio Sanhueza,
Nicola Schneider,
Ian Stephens,
James Urquhart
, et al. (1 additional authors not shown)
Abstract:
We present initial results from the K-band focal plane array Examinations of Young STellar Object Natal Environments (KEYSTONE) survey, a large project on the 100-m Green Bank Telescope mapping ammonia emission across eleven giant molecular clouds at distances of $0.9-3.0$ kpc (Cygnus X North, Cygnus X South, M16, M17, MonR1, MonR2, NGC2264, NGC7538, Rosette, W3, and W48). This data release includes the NH$_3$ (1,1) and (2,2) maps for each cloud, which are modeled to produce maps of kinetic temperature, centroid velocity, velocity dispersion, and ammonia column density. Median cloud kinetic temperatures range from $11.4\pm2.2$ K in the coldest cloud (MonR1) to $23.0\pm6.5$ K in the warmest cloud (M17). Using dendrograms on the NH$_3$ (1,1) integrated intensity maps, we identify 856 dense gas clumps across the eleven clouds. Depending on the cloud observed, $40-100\%$ of the clumps are aligned spatially with filaments identified in H$_2$ column density maps derived from SED-fitting of dust continuum emission. A virial analysis reveals that 523 of the 835 clumps ($\sim63\%$) with mass estimates are bound by gravity alone. We find no significant difference between the virial parameter distributions for clumps aligned with the dust-continuum filaments and those unaligned with filaments. In some clouds, however, hubs or ridges of dense gas with unusually high mass and low virial parameters are located within a single filament or at the intersection of multiple filaments. These hubs and ridges tend to host water maser emission, multiple 70$μ$m-detected protostars, and have masses and radii above an empirical threshold for forming massive stars.
Submitted 29 August, 2019; v1 submitted 27 August, 2019;
originally announced August 2019.
-
The Green Bank Ammonia Survey: A Virial Analysis of Gould Belt Clouds in Data Release 1
Authors:
Ronan Kerr,
Helen Kirk,
James Di Francesco,
Jared Keown,
Mike Chen,
Erik Rosolowsky,
Stella S. R. Offner,
Rachel Friesen,
Jaime E. Pineda,
Yancy Shirley,
Elena Redaelli,
Paola Caselli,
Anna Punanova,
Youngmin Seo,
Felipe Alves,
Ana Chacón-Tanarro,
Hope How-Huan Chen
Abstract:
We perform a virial analysis of starless dense cores in three nearby star-forming regions: L1688 in Ophiuchus, NGC 1333 in Perseus, and B18 in Taurus. Our analysis takes advantage of comprehensive kinematic information for the dense gas in all of these regions made publicly available through the Green Bank Ammonia Survey Data Release 1, which we use to estimate internal support against collapse. We combine this information with ancillary data used to estimate other important properties of the cores, including continuum data from the James Clerk Maxwell Telescope Gould Belt Survey for core identification, core masses, and core sizes. Additionally, we use Planck- and Herschel-based column density maps for external cloud weight pressure, and Five College Radio Astronomy Observatory $^{13}$CO observations for external turbulent pressure. Our self-consistent analysis suggests that many dense cores in all three star-forming regions are not bound by gravity alone, but rather require additional pressure confinement to remain bound. Unlike a recent, similar study in Orion A, we find that turbulent pressure represents a significant portion of the external pressure budget. Our broad conclusion emphasizing the importance of pressure confinement in dense core evolution, however, agrees with earlier work.
Submitted 8 March, 2019;
originally announced March 2019.
-
Catalogue of High Protostellar Surface Density Regions in Nearby Embedded Clusters
Authors:
Juan Li,
Philip C. Myers,
Helen Kirk,
Robert A. Gutermuth,
Michael M. Dunham,
Riwaj Pokhrel
Abstract:
We analyze high-quality stellar catalogs for 24 young and nearby (within 1 kpc) embedded clusters and present a catalogue of 32 groups which have a high concentration of protostars. The median effective radius of these groups is 0.17 pc. The median protostellar and pre-main-sequence star surface densities are 46 $M_{\odot}$ pc$^{-2}$ and 11 $M_{\odot}$ pc$^{-2}$, respectively. We estimate the age of these groups using a model of constant birthrate and random accretion stopping and find a median value of 0.25 Myr. Some groups in Aquila, Serpens, Corona Australis, and Ophiuchus L1688 show high protostellar surface density and high molecular gas surface density, and seem to be undergoing vigorous star formation. These groups provide an excellent opportunity to study initial conditions of clustered star formation. Comparison of protostellar and pre-main-sequence stellar surface densities reveals continuous low-mass star formation in these groups over several Myr in some clouds. For groups with typical protostellar separations of less than 0.4 pc, we find that these separations agree well with the thermal Jeans fragmentation scale. On the other hand, for groups with typical protostellar separations larger than 0.4 pc, these separations are always larger than the associated Jeans length.
Submitted 3 December, 2018;
originally announced December 2018.
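To give a sense of scale for the thermal Jeans fragmentation length discussed in the abstract above, here is a minimal sketch assuming the textbook form $λ_J = c_s\sqrt{π/(Gρ)}$ with an isothermal sound speed $c_s = \sqrt{k_B T/(μ m_H)}$. The mean molecular weight of 2.33 and the sample temperature and density are assumptions for illustration only; the abstract does not give the exact expression or inputs used by the authors.

```python
import math

# Physical constants in cgs units.
K_B = 1.380649e-16    # Boltzmann constant, erg/K
G_CGS = 6.6743e-8     # gravitational constant, cm^3 g^-1 s^-2
M_H = 1.6735575e-24   # hydrogen atom mass, g
PC_CM = 3.0856776e18  # centimetres per parsec


def thermal_jeans_length_pc(temperature_k, number_density_cm3, mu=2.33):
    """Thermal Jeans length in pc for gas of given temperature and particle density.

    mu is the assumed mean molecular weight per particle.
    """
    sound_speed = math.sqrt(K_B * temperature_k / (mu * M_H))  # cm/s
    mass_density = mu * M_H * number_density_cm3               # g/cm^3
    return sound_speed * math.sqrt(math.pi / (G_CGS * mass_density)) / PC_CM


# Hypothetical dense-gas conditions: 10 K and 1e4 particles per cm^3.
jeans_pc = thermal_jeans_length_pc(10.0, 1.0e4)
```

Because $λ_J \propto ρ^{-1/2}$, quadrupling the density halves the Jeans length, so denser groups would be expected to fragment on smaller separations under this assumption.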
-
Droplets I: Pressure-Dominated Sub-0.1 pc Coherent Structures in L1688 and B18
Authors:
Hope How-Huan Chen,
Jaime E. Pineda,
Alyssa A. Goodman,
Andreas Burkert,
Stella S. R. Offner,
Rachel K. Friesen,
Philip C. Myers,
Felipe Alves,
Hector G. Arce,
Paola Caselli,
Ana Chacon-Tanarro,
Michael Chun-Yuan Chen,
James Di Francesco,
Adam Ginsburg,
Jared Keown,
Helen Kirk,
Peter G. Martin,
Christopher Matzner,
Anna Punanova,
Elena Redaelli,
Erik Rosolowsky,
Samantha Scibelli,
Young Min Seo,
Yancy Shirley,
Ayushi Singh
Abstract:
We present the observation and analysis of newly discovered coherent structures in the L1688 region of Ophiuchus and the B18 region of Taurus. Using data from the Green Bank Ammonia Survey (GAS), we identify regions of high density and near-constant, almost-thermal, velocity dispersion. Eighteen coherent structures are revealed, twelve in L1688 and six in B18, each of which shows a sharp "transition to coherence" in velocity dispersion around its periphery. The identification of these structures provides a chance to study the coherent structures in molecular clouds statistically. The identified coherent structures have a typical radius of 0.04 pc and a typical mass of 0.4 Msun, generally smaller than previously known coherent cores identified by Goodman et al. (1998), Caselli et al. (2002), and Pineda et al. (2010). We call these structures "droplets." We find that unlike previously known coherent cores, these structures are not virially bound by self-gravity and are instead predominantly confined by ambient pressure. The droplets have density profiles shallower than a critical Bonnor-Ebert sphere, and they have a velocity (VLSR) distribution consistent with the dense gas motions traced by NH3 emission. These results point to a potential formation mechanism through pressure compression and turbulent processes in the dense gas. We present a comparison with a magnetohydrodynamic simulation of a star-forming region, and we speculate on the relationship of droplets with larger, gravitationally bound coherent cores, as well as on the role that droplets and other coherent structures play in the star formation process.
Submitted 15 May, 2019; v1 submitted 26 September, 2018;
originally announced September 2018.
-
The JCMT Gould Belt Survey: SCUBA-2 Data-Reduction Methods and Gaussian Source Recovery Analysis
Authors:
Helen Kirk,
Jennifer Hatchell,
Doug Johnstone,
David Berry,
Tim Jenness,
Jane Buckle,
Steve Mairs,
Erik Rosolowsky,
James Di Francesco,
Sarah Sadavoy,
Malcolm Currie,
Hannah Broekhoven-Fiene,
Joseph C. Mottram,
Kate Pattle,
Brenda Matthews,
Lewis B. G. Knee,
Gerald Moriarty-Schieven,
Ana Duarte-Cabral,
Sam Tisi,
Derek Ward-Thompson
Abstract:
The JCMT Gould Belt Survey was one of the first Legacy Surveys with the James Clerk Maxwell Telescope in Hawaii, mapping 47 square degrees of nearby (< 500 pc) molecular clouds in both dust continuum emission at 850 $μ$m and 450 $μ$m, as well as a more-limited area in lines of various CO isotopologues. While molecular clouds and the material that forms stars have structures on many size scales, their larger-scale structures are difficult to observe reliably in the submillimetre regime using ground-based facilities. In this paper, we quantify the extent to which three subsequent data-reduction methods employed by the JCMT GBS accurately recover emission structures of various size scales, in particular, dense cores which are the focus of many GBS science goals. With our current best data-reduction procedure, we expect to recover 100% of structures with Gaussian sigma sizes of $\le$30" and intensity peaks of at least five times the local noise for isolated peaks of emission. The measured sizes and peak fluxes of these compact structures are reliable (within 15% of the input values), but source recovery and reliability both decrease significantly for larger emission structures and for fainter peaks. Additional factors such as source crowding have not been tested in our analysis. The most recent JCMT GBS data release includes pointing corrections, and we demonstrate that these tend to decrease the sizes and increase the peak intensities of compact sources in our dataset, mostly at a low level (several percent), but occasionally with notable improvement.
Submitted 23 August, 2018;
originally announced August 2018.
-
Dense gas kinematics and a narrow filament in the Orion A OMC1 region using NH3
Authors:
Kristina Monsch,
Jaime E. Pineda,
Hauyu Baobab Liu,
Catherine Zucker,
Hope How-Huan Chen,
Kate Pattle,
Stella S. R. Offner,
James Di Francesco,
Adam Ginsburg,
Barbara Ercolano,
Héctor G. Arce,
Rachel Friesen,
Helen Kirk,
Paola Caselli,
Alyssa A. Goodman
Abstract:
We present combined observations of the NH3 (J,K) = (1,1) and (2,2) inversion transitions towards OMC1 in Orion A obtained by the Karl G. Jansky Very Large Array (VLA) and the 100 m Robert C. Byrd Green Bank Telescope (GBT). With an angular resolution of 6" (0.01 pc), these observations reveal with unprecedented detail the complex filamentary structure extending north of the active Orion BN/KL region in a field covering 6' x 7'. We find a 0.012 pc wide filament within OMC1, with an aspect ratio of ~37:1, that was missed in previous studies. Its orientation is directly compared to the relative orientation of the magnetic field from the James Clerk Maxwell Telescope BISTRO survey in Orion A. We find a small deviation of ~11 deg between the mean orientation of the filament and the magnetic field, suggesting that they are almost parallel to one another. The filament's column density is estimated to be 2-3 orders of magnitude larger than that of the filaments studied with Herschel, and the filament is possibly self-gravitating given the low values of turbulence found. We further produce maps of the gas kinematics by forward modeling the hyperfine structure of the NH3 (J,K) = (1,1) and (2,2) lines. The resulting distribution of velocity dispersions peaks at ~0.5 km/s, close to the subsonic regime of the gas. This value is about 0.2 km/s smaller than previously measured in single-dish observations of the same region, suggesting that higher angular and spectral resolution observations will identify even lower velocity dispersions that might reach the subsonic turbulence regime in dense gas filaments.
Submitted 18 December, 2018; v1 submitted 5 June, 2018;
originally announced June 2018.
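The sizes quoted in the OMC1 abstract above can be cross-checked with simple arithmetic. The sketch below uses the small-angle approximation to convert the 6" resolution into a physical scale and multiplies the filament width by its aspect ratio to get an implied length. The distance of 400 pc is an assumed round value for Orion A, not a number taken from the abstract, and the function names are hypothetical.

```python
ARCSEC_PER_RADIAN = 206264.806


def angular_to_physical_pc(angle_arcsec, distance_pc):
    """Small-angle approximation: physical size = angle (in radians) * distance."""
    return angle_arcsec / ARCSEC_PER_RADIAN * distance_pc


def filament_length_pc(width_pc, aspect_ratio):
    """Implied filament length from its width and length-to-width ratio."""
    return width_pc * aspect_ratio


# 6" beam at an ASSUMED distance of 400 pc (not stated in the abstract).
beam_pc = angular_to_physical_pc(6.0, 400.0)

# 0.012 pc width and ~37:1 aspect ratio, both from the abstract.
length_pc = filament_length_pc(0.012, 37)
```

Under the assumed distance the beam comes out near the 0.01 pc quoted in the abstract, and the width and aspect ratio together imply a filament roughly 0.44 pc long.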