-
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs
Authors:
LLM-jp,
Akiko Aizawa,
Eiji Aramaki,
Bowen Chen,
Fei Cheng,
Hiroyuki Deguchi,
Rintaro Enomoto,
Kazuki Fujii,
Kensuke Fukumoto,
Takuya Fukushima,
Namgi Han,
Yuto Harada,
Chikara Hashimoto,
Tatsuya Hiraoka,
Shohei Hisada,
Sosuke Hosokawa,
Lu Jie,
Keisuke Kamata,
Teruhito Kanazawa,
Hiroki Kanezashi,
Hiroshi Kataoka,
Satoru Katsumata,
Daisuke Kawahara,
Seiya Kawano
, et al. (57 additional authors not shown)
Abstract:
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.
Submitted 4 July, 2024;
originally announced July 2024.
-
MoreHopQA: More Than Multi-hop Reasoning
Authors:
Julian Schnitzler,
Xanh Ho,
Jiahao Huang,
Florian Boudin,
Saku Sugawara,
Akiko Aizawa
Abstract:
Most existing multi-hop datasets are extractive answer datasets, where the answers to the questions can be extracted directly from the provided context. This often leads models to use heuristics or shortcuts instead of performing true multi-hop reasoning. In this paper, we propose a new multi-hop dataset, MoreHopQA, which shifts from extractive to generative answers. Our dataset is created by utilizing three existing multi-hop datasets: HotpotQA, 2WikiMultihopQA, and MuSiQue. Instead of relying solely on factual reasoning, we enhance the existing multi-hop questions by adding another layer of questioning that involves one, two, or all three of the following types of reasoning: commonsense, arithmetic, and symbolic. Our dataset is created through a semi-automated process, resulting in a dataset with 1,118 samples that have undergone human verification. We then use our dataset to evaluate five different large language models: Mistral 7B, Gemma 7B, Llama 3 (8B and 70B), and GPT-4. We also design various cases to analyze the reasoning steps in the question-answering process. Our results show that models perform well on initial multi-hop questions but struggle with our extended questions, indicating that our dataset is more challenging than previous ones. Our analysis of question decomposition reveals that although models can correctly answer questions, only a portion - 38.7% for GPT-4 and 33.4% for Llama3-70B - achieve perfect reasoning, where all corresponding sub-questions are answered correctly. Evaluation code and data are available at https://github.com/Alab-NII/morehopqa
Submitted 19 June, 2024;
originally announced June 2024.
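The "perfect reasoning" figure quoted in the abstract above can be illustrated with a minimal sketch. The record layout (`final_correct`, `sub_correct`) and the choice of all samples as the denominator are assumptions made here for illustration; they are not taken from the MoreHopQA evaluation code.

```python
from typing import Dict, List


def perfect_reasoning_rate(records: List[Dict]) -> float:
    """Share of samples whose final answer and every sub-question are correct.

    Each record is a hypothetical dict with two keys:
      "final_correct": bool, whether the full question was answered correctly
      "sub_correct":   list of bool, one flag per sub-question
    Assumption: the denominator is the total number of samples.
    """
    if not records:
        return 0.0
    perfect = sum(
        1 for r in records if r["final_correct"] and all(r["sub_correct"])
    )
    return perfect / len(records)


# Toy data: two of the four samples are answered correctly at every step.
records = [
    {"final_correct": True, "sub_correct": [True, True, True]},
    {"final_correct": True, "sub_correct": [True, False, True]},
    {"final_correct": False, "sub_correct": [True, True, True]},
    {"final_correct": True, "sub_correct": [True, True]},
]
rate = perfect_reasoning_rate(records)
```

The second record shows the case the abstract highlights: the final answer is right, but one intermediate step is wrong, so the sample does not count as perfect reasoning.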
-
What Makes Language Models Good-enough?
Authors:
Daiki Asami,
Saku Sugawara
Abstract:
Psycholinguistic research suggests that humans may build a representation of linguistic input that is 'good-enough' for the task at hand. This study examines what architectural features make language models learn human-like good-enough language processing. We focus on the number of layers and self-attention heads in Transformers. We create a good-enough language processing (GELP) evaluation dataset (7,680 examples), which is designed to test the effects of two plausibility types, eight construction types, and three degrees of memory cost on language processing. To annotate GELP, we first conduct a crowdsourcing experiment whose design follows prior psycholinguistic studies. Our model evaluation against the annotated GELP then reveals that the full model as well as models with fewer layers and/or self-attention heads exhibit a good-enough performance. This result suggests that models with shallower depth and fewer heads can learn good-enough language processing.
Submitted 5 June, 2024;
originally announced June 2024.
-
First homology groups of the Milnor fiber boundary for generic hyperplane arrangements in $\mathbb{C}^{3}$
Authors:
Sakumi Sugawara
Abstract:
We study the Milnor fiber boundary for hyperplane arrangements in $\mathbb{C}^3$. This is one of the examples of non-isolated surface singularities, which are studied by Némethi--Szilárd. In this paper, we compute the first homology group of the Milnor fiber boundary for a generic arrangement and prove it is combinatorially determined. In particular, this gives the affirmative answer to the conjecture of Suciu.
Submitted 1 April, 2024;
originally announced April 2024.
-
Homogeneous quandles with commutative inner automorphism groups
Authors:
Takuya Saito,
Sakumi Sugawara
Abstract:
In this paper, we give a characterization for homogeneous quandles with commutative inner automorphism groups. In particular, it is shown that such a quandle is expressed as an abelian extension of a trivial quandle. Our construction is a generalization of the recent work by Furuki and Tamaru, which gives the construction of disconnected flat quandles.
Submitted 12 March, 2024;
originally announced March 2024.
-
PUAD: Frustratingly Simple Method for Robust Anomaly Detection
Authors:
Shota Sugawara,
Ryuji Imamura
Abstract:
Developing an accurate and fast anomaly detection model is an important task in real-time computer vision applications. There has been much research to develop a single model that detects either structural or logical anomalies, which are inherently distinct. The majority of the existing approaches implicitly assume that the anomaly can be represented by identifying the anomalous location. However, we argue that logical anomalies, such as the wrong number of objects, can not be well-represented by the spatial feature maps and require an alternative approach. In addition, we focused on the possibility of detecting logical anomalies by using an out-of-distribution detection approach on the feature space, which aggregates the spatial information of the feature map. As a demonstration, we propose a method that incorporates a simple out-of-distribution detection method on the feature space against state-of-the-art reconstruction-based approaches. Despite the simplicity of our proposal, our method PUAD (Picturable and Unpicturable Anomaly Detection) achieves state-of-the-art performance on the MVTec LOCO AD dataset.
Submitted 23 February, 2024;
originally announced February 2024.
-
Demonstration of nuclear gamma-ray polarimetry based on a multi-layer CdTe Compton Camera
Authors:
S. Go,
Y. Tsuzuki,
H. Yoneda,
Y. Ichikawa,
T. Ikeda,
N. Imai,
K. Imamura,
M. Niikura,
D. Nishimura,
R. Mizuno,
S. Takeda,
H. Ueno,
S. Watanabe,
T. Y. Saito,
S. Shimoura,
S. Sugawara,
A. Takamine,
T. Takahashi
Abstract:
To detect and track structural changes in atomic nuclei, the systematic study of nuclear levels with firm spin-parity assignments is important. While linear polarization measurements have been applied to determine the electromagnetic character of gamma-ray transitions, the applicable range is strongly limited due to the low efficiency of the detection system. The multi-layer Cadmium-Telluride (CdTe) Compton camera can be a state-of-the-art gamma-ray polarimeter for nuclear spectroscopy, owing to its high position sensitivity and detection efficiency. We demonstrated the capability to operate this detector as a reliable gamma-ray polarimeter by using polarized 847-keV gamma rays produced by the $^{56}\mathrm{Fe}(p, p'\gamma)$ reaction. By combining the experimental data and simulated calculations, the modulation curve for the gamma ray was successfully obtained. A remarkably high polarization sensitivity was achieved, compatible with a reasonable detection efficiency. Based on the obtained results, possible future gamma-ray polarimetry is discussed.
Submitted 14 February, 2024;
originally announced February 2024.
-
PROPRES: Investigating the Projectivity of Presupposition with Various Triggers and Environments
Authors:
Daiki Asami,
Saku Sugawara
Abstract:
What makes a presupposition of an utterance -- information taken for granted by its speaker -- different from other pragmatic inferences such as an entailment is projectivity (e.g., the negative sentence "the boy did not stop shedding tears" presupposes that the boy had shed tears before). The projectivity may vary depending on the combination of presupposition triggers and environments. However, prior natural language understanding studies fail to take it into account as they either use no human baseline or include only negation as an entailment-canceling environment to evaluate models' performance. The current study attempts to reconcile these issues. We introduce a new dataset, projectivity of presupposition (PROPRES), which includes 12k premise-hypothesis pairs crossing six triggers involving some lexical variety with five environments. Our human evaluation reveals that humans exhibit variable projectivity in some cases. However, the model evaluation shows that the best-performing model, DeBERTa, does not fully capture it. Our findings suggest that probing studies on pragmatic inferences should take extra care of the human judgment variability and the combination of linguistic items.
Submitted 14 December, 2023;
originally announced December 2023.
-
Divides with cusps and symmetric links
Authors:
Sakumi Sugawara
Abstract:
A divide with cusps is the image of a proper generic immersion of finitely many intervals and circles into a $2$-disk, in which cusps are allowed. It generalizes the notion of a divide introduced by A'Campo. From a divide with cusps, we can define an associated link in $S^3$. In this paper, we characterize the links in $S^3$ that can be described as the associated link of a divide with cusps. In particular, we prove that every strongly invertible link and every $2$-periodic link can be described as the link of a divide with cusps.
Submitted 4 January, 2024; v1 submitted 1 December, 2023;
originally announced December 2023.
-
Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension
Authors:
Akira Kawabata,
Saku Sugawara
Abstract:
To precisely evaluate a language model's capability for logical reading comprehension, we present a dataset for testing the understanding of the rationale behind critical reasoning. For questions taken from an existing multiple-choice logical reading comprehension dataset, we crowdsource rationale texts that explain why we should select or eliminate answer options, resulting in 3,003 multiple-choice subquestions that are associated with 943 main questions. Experiments on our dataset show that recent large language models (e.g., InstructGPT) struggle to answer the subquestions even if they are able to answer the main questions correctly. We find that the models perform particularly poorly in answering subquestions written for the incorrect options of the main questions, implying that the models have a limited capability for explaining why incorrect alternatives should be eliminated. These results suggest that our dataset encourages further investigation into the critical reasoning ability of language models while focusing on the elimination process of relevant alternatives.
Submitted 30 November, 2023;
originally announced November 2023.
-
Handle decompositions and Kirby diagrams for the complement of plane algebraic curves
Authors:
Sakumi Sugawara
Abstract:
The complements of plane algebraic curves are well studied from topological and algebro-geometric viewpoints. In this paper, we describe explicit handle decompositions and Kirby diagrams for the complements of plane algebraic curves. The method is based on the notion of braid monodromy; we refine this technique to obtain handle decompositions and Kirby diagrams.
Submitted 23 October, 2023; v1 submitted 18 June, 2023;
originally announced June 2023.
-
Probing Physical Reasoning with Counter-Commonsense Context
Authors:
Kazushi Kondo,
Saku Sugawara,
Akiko Aizawa
Abstract:
In this study, we create a CConS (Counter-commonsense Contextual Size comparison) dataset to investigate how physical commonsense affects the contextualized size comparison task; the proposed dataset consists of both contexts that fit physical commonsense and those that do not. This dataset tests the ability of language models to predict the size relationship between objects under various contexts generated from our curated noun list and templates. We measure the ability of several masked language models and generative models. The results show that while large language models can use prepositions such as "in" and "into" in the provided context to infer size relationships, they fail to use verbs and thus make incorrect judgments led by their prior physical commonsense.
Submitted 4 June, 2023;
originally announced June 2023.
-
On Degrees of Freedom in Defining and Testing Natural Language Understanding
Authors:
Saku Sugawara,
Shun Tsugita
Abstract:
Natural language understanding (NLU) studies often exaggerate or underestimate the capabilities of systems, thereby limiting the reproducibility of their findings. These erroneous evaluations can be attributed to the difficulty of defining and testing NLU adequately. In this position paper, we reconsider this challenge by identifying two types of researcher degrees of freedom. We revisit Turing's original interpretation of the Turing test and indicate that an NLU test does not provide an operational definition; it merely provides inductive evidence that the test subject understands the language sufficiently well to meet stakeholder objectives. In other words, stakeholders are free to arbitrarily define NLU through their objectives. To use the test results as inductive evidence, stakeholders must carefully assess if the interpretation of test scores is valid or not. However, designing and using NLU tests involve other degrees of freedom, such as specifying target skills and defining evaluation metrics. As a result, achieving consensus among stakeholders becomes difficult. To resolve this issue, we propose a validity argument, which is a framework comprising a series of validation criteria across test components. By demonstrating that current practices in NLU studies can be associated with those criteria and organizing them into a comprehensive checklist, we prove that the validity argument can serve as a coherent guideline for designing credible test sets and facilitating scientific communication.
Submitted 24 May, 2023;
originally announced May 2023.
-
Analyzing the Effectiveness of the Underlying Reasoning Tasks in Multi-hop Question Answering
Authors:
Xanh Ho,
Anh-Khoa Duong Nguyen,
Saku Sugawara,
Akiko Aizawa
Abstract:
To explain the predicted answers and evaluate the reasoning abilities of models, several studies have utilized underlying reasoning (UR) tasks in multi-hop question answering (QA) datasets. However, it remains an open question as to how effective UR tasks are for the QA task when training models on both tasks in an end-to-end manner. In this study, we address this question by analyzing the effectiveness of UR tasks (including both sentence-level and entity-level tasks) in three aspects: (1) QA performance, (2) reasoning shortcuts, and (3) robustness. While the previous models have not been explicitly trained on an entity-level reasoning prediction task, we build a multi-task model that performs three tasks together: sentence-level supporting facts prediction, entity-level reasoning prediction, and answer prediction. Experimental results on 2WikiMultiHopQA and HotpotQA-small datasets reveal that (1) UR tasks can improve QA performance. Using four debiased datasets that are newly created, we demonstrate that (2) UR tasks are helpful in preventing reasoning shortcuts in the multi-hop QA task. However, we find that (3) UR tasks do not contribute to improving the robustness of the model on adversarial questions, such as sub-questions and inverted questions. We encourage future studies to investigate the effectiveness of entity-level reasoning in the form of natural language questions (e.g., sub-question forms).
Submitted 12 February, 2023;
originally announced February 2023.
-
Cross-Modal Similarity-Based Curriculum Learning for Image Captioning
Authors:
Hongkuan Zhang,
Saku Sugawara,
Akiko Aizawa,
Lei Zhou,
Ryohei Sasano,
Koichi Takeda
Abstract:
Image captioning models require the high-level generalization ability to describe the contents of various images in words. Most existing approaches treat the image-caption pairs equally in their training without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of difficulty. However, their difficulty measurements are either based on domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed to baselines without requiring heuristics or incurring additional training costs. Moreover, the higher model performance on difficult examples and unseen data also demonstrates the generalization ability.
Submitted 14 December, 2022;
originally announced December 2022.
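The idea in the abstract above of ordering training pairs by cross-modal similarity can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the embeddings are hand-written vectors standing in for the output of a pretrained vision-language model, cosine similarity is assumed as the similarity measure, and higher similarity is assumed to mean an easier example. The names `cosine_similarity` and `curriculum_order` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def curriculum_order(
    pairs: List[Tuple[str, Sequence[float], Sequence[float]]]
) -> List[str]:
    """Return pair ids sorted from easiest to hardest.

    Each item is (pair_id, image_embedding, caption_embedding).
    Assumption: a higher image-caption similarity means an easier example.
    """
    scored = [(cosine_similarity(img, cap), pid) for pid, img, cap in pairs]
    scored.sort(key=lambda item: -item[0])  # most similar (easiest) first
    return [pid for _, pid in scored]


# Toy embeddings; a real pipeline would obtain these from a pretrained model.
pairs = [
    ("a", [1.0, 0.0], [1.0, 0.0]),  # identical direction, similarity 1
    ("b", [1.0, 0.0], [0.0, 1.0]),  # orthogonal, similarity 0
    ("c", [1.0, 0.0], [1.0, 1.0]),  # 45 degrees apart, similarity about 0.71
]
order = curriculum_order(pairs)
```

Because the scores come from a frozen pretrained model, such an ordering can be computed once before training, which is consistent with the abstract's claim of no additional training cost.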
-
Which Shortcut Solution Do Question Answering Models Prefer to Learn?
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension tend to learn shortcut solutions rather than the solutions intended by QA datasets. QA models that have learned shortcut solutions can achieve human-level performance in shortcut examples where shortcuts are valid, but these same behaviors degrade generalization potential on anti-shortcut examples where shortcuts are invalid. Various methods have been proposed to mitigate this problem, but they do not fully take the characteristics of shortcuts themselves into account. We assume that the learnability of shortcuts, i.e., how easy it is to learn a shortcut, is useful to mitigate the problem. Thus, we first examine the learnability of the representative shortcuts on extractive and multiple-choice QA datasets. Behavioral tests using biased training sets reveal that shortcuts that exploit answer positions and word-label correlations are preferentially learned for extractive and multiple-choice QA, respectively. We find that the more learnable a shortcut is, the flatter and deeper the loss landscape is around the shortcut solution in the parameter space. We also find that the availability of the preferred shortcuts tends to make the task easier to perform from an information-theoretic viewpoint. Lastly, we experimentally show that the learnability of shortcuts can be utilized to construct an effective QA training set; the more learnable a shortcut is, the smaller the proportion of anti-shortcut examples required to achieve comparable performance on shortcut and anti-shortcut examples. We claim that the learnability of shortcuts should be considered when designing mitigation methods.
Submitted 29 November, 2022;
originally announced November 2022.
-
Penalizing Confident Predictions on Largely Perturbed Inputs Does Not Improve Out-of-Distribution Generalization in Question Answering
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models are shown to be insensitive to large perturbations to inputs; that is, they make correct and confident predictions even when given largely perturbed inputs from which humans can not correctly derive answers. In addition, QA models fail to generalize to other domains and adversarial test sets, while humans maintain high accuracy. Based on these observations, we assume that QA models do not use intended features necessary for human reading but rely on spurious features, causing the lack of generalization ability. Therefore, we attempt to answer the question: If the overconfident predictions of QA models for various types of perturbations are penalized, will the out-of-distribution (OOD) generalization be improved? To prevent models from making confident predictions on perturbed inputs, we first follow existing studies and maximize the entropy of the output probability for perturbed inputs. However, we find that QA models trained to be sensitive to a certain perturbation type are often insensitive to unseen types of perturbations. Thus, we simultaneously maximize the entropy for the four perturbation types (i.e., word- and sentence-level shuffling and deletion) to further close the gap between models and humans. Contrary to our expectations, although models become sensitive to the four types of perturbations, we find that the OOD generalization is not improved. Moreover, the OOD generalization is sometimes degraded after entropy maximization. Making unconfident predictions on largely perturbed inputs per se may be beneficial to gaining human trust. However, our negative results suggest that researchers should pay attention to the side effect of entropy maximization.
Submitted 29 November, 2022;
originally announced November 2022.
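The entropy-maximization objective described in the abstract above can be sketched in a few lines. This is a minimal illustration that assumes a single-label output distribution and a simple weighted sum of the two terms; the function names and the `weight` parameter are hypothetical and do not come from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def combined_loss(
    clean_logits: List[float],
    gold_index: int,
    perturbed_logits: List[float],
    weight: float = 1.0,
) -> float:
    """Cross-entropy on the clean input minus weighted entropy on the perturbed one.

    Minimizing this value rewards high-entropy (unconfident) predictions on
    perturbed inputs. The weighted-sum form is an assumption of this sketch.
    """
    cross_entropy = -math.log(softmax(clean_logits)[gold_index])
    return cross_entropy - weight * entropy(softmax(perturbed_logits))


uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
confident = combined_loss([2.0, 0.0, 0.0], 0, [5.0, 0.0, 0.0])
unconfident = combined_loss([2.0, 0.0, 0.0], 0, [0.0, 0.0, 0.0])
```

With identical clean logits, the loss is lower when the prediction on the perturbed input is uniform than when it is peaked, which is the behavior the entropy term is meant to encourage.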
-
Debiasing Masks: A New Framework for Shortcut Mitigation in NLU
Authors:
Johannes Mario Meissner,
Saku Sugawara,
Akiko Aizawa
Abstract:
Debiasing language models from unwanted behaviors in Natural Language Understanding tasks is a topic with rapidly increasing interest in the NLP community. Spurious statistical correlations in the data allow models to perform shortcuts and avoid uncovering more advanced and desirable linguistic features. A multitude of effective debiasing approaches has been proposed, but flexibility remains a major issue. For the most part, models must be retrained to find a new set of weights with debiased behavior. We propose a new debiasing method in which we identify debiased pruning masks that can be applied to a finetuned model. This enables the selective and conditional application of debiasing behaviors. We assume that bias is caused by a certain subset of weights in the network; our method is, in essence, a mask search to identify and remove biased weights. Our masks show equivalent or superior performance to the standard counterparts, while offering important benefits. Pruning masks can be stored with high efficiency in memory, and it becomes possible to switch among several debiasing behaviors (or revert back to the original biased model) at inference time. Finally, it opens the doors to further research on how biases are acquired by studying the generated masks. For example, we observed that the early layers and attention heads were pruned more aggressively, possibly hinting towards the location in which biases may be encoded.
Submitted 28 October, 2022;
originally announced October 2022.
-
Look to the Right: Mitigating Relative Position Bias in Extractive Question Answering
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Extractive question answering (QA) models tend to exploit spurious correlations to make predictions when a training set has unintended biases. This tendency results in models not being generalizable to examples where the correlations do not hold. Determining the spurious correlations QA models can exploit is crucial in building generalizable QA models in real-world applications; moreover, a method needs to be developed that prevents these models from learning the spurious correlations even when a training set is biased. In this study, we discovered that the relative position of an answer, which is defined as the relative distance from an answer span to the closest question-context overlap word, can be exploited by QA models as superficial cues for making predictions. Specifically, we find that when the relative positions in a training set are biased, the performance on examples with relative positions unseen during training is significantly degraded. To mitigate the performance degradation for unseen relative positions, we propose an ensemble-based debiasing method that does not require prior knowledge about the distribution of relative positions. We demonstrate that the proposed method mitigates the models' reliance on relative positions using the biased and full SQuAD dataset. We hope that this study can help enhance the generalization ability of QA models in real-world applications.
Submitted 26 October, 2022;
originally announced October 2022.
-
How Well Do Multi-hop Reading Comprehension Models Understand Date Information?
Authors:
Xanh Ho,
Saku Sugawara,
Akiko Aizawa
Abstract:
Several multi-hop reading comprehension datasets have been proposed to resolve the issue of reasoning shortcuts by which questions can be answered without performing multi-hop reasoning. However, the ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear. It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems. To evaluate the model precisely in a hierarchical manner, we first propose a dataset, HieraDate, with three probing tasks in addition to the main question: extraction, reasoning, and robustness. Our dataset is created by enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA, focusing on multi-hop questions on date information that involve both comparison and numerical reasoning. We then evaluate the ability of existing models to understand date information. Our experimental results reveal that the multi-hop models do not have the ability to subtract two dates even when they perform well in date comparison and number subtraction tasks. Other results reveal that our probing questions can help to improve the performance of the models (e.g., by +10.3 F1) on the main QA task and our dataset can be used for data augmentation to improve the robustness of the models.
Submitted 11 October, 2022;
originally announced October 2022.
-
Possible Stories: Evaluating Situated Commonsense Reasoning under Multiple Possible Scenarios
Authors:
Mana Ashida,
Saku Sugawara
Abstract:
The possible consequences for the same context may vary depending on the situation we refer to. However, current studies in natural language processing do not focus on situated commonsense reasoning under multiple possible scenarios. This study frames this task by asking multiple questions with the same set of possible endings as candidate answers, given a short story text. Our resulting dataset, Possible Stories, consists of more than 4.5K questions over 1.3K story texts in English. We discover that even current strong pretrained language models struggle to answer the questions consistently, highlighting that the highest accuracy in an unsupervised setting (60.2%) is far behind human accuracy (92.5%). Through a comparison with existing datasets, we observe that the questions in our dataset contain minimal annotation artifacts in the answer options. In addition, our dataset includes examples that require counterfactual reasoning, as well as those requiring readers' reactions and fictional information, suggesting that our dataset can serve as a challenging testbed for future studies on situated commonsense reasoning.
Submitted 16 September, 2022;
originally announced September 2022.
-
$\mathbb{Z}$-local system cohomology of hyperplane arrangements and a Cohen-Dimca-Orlik type theorem
Authors:
Sakumi Sugawara
Abstract:
Local system cohomology groups of the complements of hyperplane arrangements have played an important role in the theory of hypergeometric integrals, the topology of Milnor fibers and covering spaces. One of the important theorems is the vanishing theorem for generic $\mathbb{C}$-local systems which goes back to Aomoto's work. Later, Cohen, Dimca, and Orlik proved a stronger version of the vanishing theorem. In this paper, we prove a Cohen-Dimca-Orlik type theorem for $\mathbb{Z}$-local systems.
Submitted 21 April, 2023; v1 submitted 6 September, 2022;
originally announced September 2022.
-
Betti numbers and torsions in homology groups of double coverings
Authors:
Suguru Ishibashi,
Sakumi Sugawara,
Masahiko Yoshinaga
Abstract:
Papadima and Suciu proved an inequality between the ranks of the cohomology groups of the Aomoto complex with finite field coefficients and the twisted cohomology groups, and conjectured that they are actually equal for certain cases associated with the Milnor fiber of the arrangement. Recently, an arrangement (the icosidodecahedral arrangement) with the following two peculiar properties was found: (i) the strict version of Papadima-Suciu's inequality holds, and (ii) the first integral homology of the Milnor fiber has a non-trivial $2$-torsion. In this paper, we investigate the relationship between these two properties for double covering spaces. We prove that (i) and (ii) are actually equivalent.
Submitted 20 May, 2024; v1 submitted 6 September, 2022;
originally announced September 2022.
-
A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
Authors:
Xanh Ho,
Johannes Mario Meissner,
Saku Sugawara,
Akiko Aizawa
Abstract:
The issue of shortcut learning is widely known in NLP and has been an important research focus in recent years. Unintended correlations in the data enable models to easily solve tasks that were meant to exhibit advanced language understanding and reasoning capabilities. In this survey paper, we focus on the field of machine reading comprehension (MRC), an important task for showcasing high-level language understanding that also suffers from a range of shortcuts. We summarize the available techniques for measuring and mitigating shortcuts and conclude with suggestions for further progress in shortcut research. Importantly, we highlight two concerns for shortcut mitigation in MRC: (1) the lack of public challenge sets, a necessary component for effective and reusable evaluation, and (2) the lack of certain mitigation techniques that are prominent in other areas.
Submitted 6 September, 2023; v1 submitted 5 September, 2022;
originally announced September 2022.
-
What Makes Reading Comprehension Questions Difficult?
Authors:
Saku Sugawara,
Nikita Nangia,
Alex Warstadt,
Samuel R. Bowman
Abstract:
For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select text sources to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. To our surprise, we find that passage source, length, and readability measures do not significantly affect question difficulty. Through our manual annotation of seven reasoning types, we observe several trends between passage sources and reasoning types, e.g., logical reasoning is more often required in questions written for technical passages. These results suggest that when creating a new benchmark dataset, selecting a diverse set of passages can help ensure a diverse range of question types, but that passage difficulty need not be a priority.
Submitted 11 March, 2022;
originally announced March 2022.
-
Hydrothermal activities on C-complex asteroids induced by radioactivity
Authors:
Wataru Fujiya,
Hisato Higashi,
Yuki Hibiya,
Shingo Sugawara,
Akira Yamaguchi,
Makoto Kimura,
Ko Hashizume
Abstract:
C-complex asteroids, rich in carbonaceous materials, are potential sources of Earth's volatile inventories. They are spectrally dark, resembling primitive carbonaceous meteorites, and thus C-complex asteroids are thought to be potential parent bodies of carbonaceous meteorites. However, a substantial number of C-complex asteroids exhibit surface spectra with weaker hydroxyl absorption than water-rich carbonaceous meteorites. Rather, they best correspond to meteorites showing evidence for dehydration, commonly attributed to impact heating. Here, we report an old radiometric age of 4564.7 million years for Ca-carbonates from the Jbilet Winselwan meteorite, which is analogous to dehydrated C-complex asteroids. The carbonates are enclosed by a high-temperature polymorph of Ca-sulfates, suggesting thermal metamorphism at >300°C subsequent to aqueous alteration. This old age indicates the early onset of aqueous alteration and subsequent thermal metamorphism driven by the decay of short-lived radionuclides rather than impact heating. The breakup of original asteroids internally heated by radioactivity should result in asteroid families predominantly consisting of thermally metamorphosed materials. This explains the common occurrence of dehydrated C-complex asteroids.
Submitted 13 January, 2022; v1 submitted 4 January, 2022;
originally announced January 2022.
-
Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension have been demonstrated to exploit unintended dataset biases such as question-context lexical overlap. This hinders QA models from generalizing to under-represented samples such as questions with low lexical overlap. Question generation (QG), a method for augmenting QA datasets, can be a solution for such performance degradation if QG can properly debias QA datasets. However, we discover that recent neural QG models are biased towards generating questions with high lexical overlap, which can amplify the dataset bias. Moreover, our analysis reveals that data augmentation with these QG models frequently impairs the performance on questions with low lexical overlap, while improving that on questions with high lexical overlap. To address this problem, we use a synonym replacement-based approach to augment questions with low lexical overlap. We demonstrate that the proposed data augmentation approach is simple yet effective to mitigate the degradation problem with only 70k synthetic examples. Our data is publicly available at https://github.com/KazutoshiShinoda/Synonym-Replacement.
Submitted 23 September, 2021;
originally announced September 2021.
-
Embracing Ambiguity: Shifting the Training Target of NLI Models
Authors:
Johannes Mario Meissner,
Napat Thumwanit,
Saku Sugawara,
Akiko Aizawa
Abstract:
Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels. While many research works do not pay much attention to this fact, several recent efforts have been made to acknowledge and embrace the existence of ambiguity, such as UNLI and ChaosNLI. In this paper, we explore the option of training directly on the estimated label distribution of the annotators in the NLI task, using a learning loss based on this ambiguity distribution instead of the gold-labels. We prepare AmbiNLI, a trial dataset obtained from readily available sources, and show it is possible to reduce ChaosNLI divergence scores when finetuning on this data, a promising first step towards learning how to capture linguistic ambiguity. Additionally, we show that training on the same amount of data but targeting the ambiguity distribution instead of gold-labels can result in models that achieve higher performance and learn better representations for downstream tasks.
Submitted 5 June, 2021;
originally announced June 2021.
-
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
Authors:
Nikita Nangia,
Saku Sugawara,
Harsh Trivedi,
Alex Warstadt,
Clara Vania,
Samuel R. Bowman
Abstract:
Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data. In contrast, using crowdsourced judgments instead of expert judgments to qualify workers and send feedback does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human-model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
Submitted 1 June, 2021;
originally announced June 2021.
-
Divides with cusps and Kirby diagrams for line arrangements
Authors:
Sakumi Sugawara,
Masahiko Yoshinaga
Abstract:
The complement of a complexified real line arrangement is an affine surface. It is classically known that such a space has a handle decomposition up to $2$-handles. We describe the handle decomposition induced by the Lefschetz hyperplane section theorem for such a space. To describe the Kirby diagram, we introduce the notion of a divide with cusps, which is a generalization of the divide introduced by A'Campo.
Submitted 3 July, 2021; v1 submitted 28 March, 2021;
originally announced March 2021.
-
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Authors:
Xanh Ho,
Anh-Khoa Duong Nguyen,
Saku Sugawara,
Akiko Aizawa
Abstract:
A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question. In this study, we present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data. In our dataset, we introduce the evidence information containing a reasoning path for multi-hop questions. The evidence information has two benefits: (i) providing a comprehensive explanation for predictions and (ii) evaluating the reasoning skills of a model. We carefully design a pipeline and a set of templates when generating a question-answer pair that guarantees the multi-hop steps and the quality of the questions. We also exploit the structured format in Wikidata and use logical rules to create questions that are natural but still require multi-hop reasoning. Through experiments, we demonstrate that our dataset is challenging for multi-hop models and it ensures that multi-hop reasoning is required.
Submitted 12 November, 2020; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair Generation
Authors:
Kazutoshi Shinoda,
Saku Sugawara,
Akiko Aizawa
Abstract:
Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distribution as the challenge sets. However, these methods assume that the distribution of a challenge set is known a priori, making them less applicable to unseen challenge sets. In this study, we focus on question-answer pair generation (QAG) to mitigate this problem. While most existing QAG methods aim to improve the quality of synthetic examples, we conjecture that diversity-promoting QAG can mitigate the sparsity of training sets and lead to better robustness. We present a variational QAG model that generates multiple diverse QA pairs from a paragraph. Our experiments show that our method can improve the accuracy of 12 challenge sets, as well as the in-distribution accuracy. Our code and data are available at https://github.com/KazutoshiShinoda/VQAG.
Submitted 3 June, 2021; v1 submitted 7 April, 2020;
originally announced April 2020.
-
Benchmarking Machine Reading Comprehension: A Psychological Perspective
Authors:
Saku Sugawara,
Pontus Stenetorp,
Akiko Aizawa
Abstract:
Machine reading comprehension (MRC) has received considerable attention as a benchmark for natural language understanding. However, the conventional task design of MRC lacks explainability beyond the model interpretation, i.e., reading comprehension by a model cannot be explained in human terms. To this end, this position paper provides a theoretical basis for the design of MRC datasets based on psychology as well as psychometrics, and summarizes it in terms of the prerequisites for benchmarking MRC. We conclude that future datasets should (i) evaluate the capability of the model for constructing a coherent and grounded representation to understand context-dependent situations and (ii) ensure substantive validity by shortcut-proof questions and explanation as a part of the task design.
Submitted 26 January, 2021; v1 submitted 4 April, 2020;
originally announced April 2020.
-
Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
Authors:
Saku Sugawara,
Pontus Stenetorp,
Kentaro Inui,
Akiko Aizawa
Abstract:
Existing analysis work in machine reading comprehension (MRC) is largely concerned with evaluating the capabilities of systems. However, the capabilities of datasets are not assessed for benchmarking language understanding precisely. We propose a semi-automated, ablation-based methodology for this challenge: by checking whether questions can be solved even after removing features associated with a skill requisite for language understanding, we evaluate to what degree the questions do not require the skill. Experiments on 10 datasets (e.g., CoQA, SQuAD v2.0, and RACE) with a strong baseline model show that, for example, the relative scores of a baseline model provided with content words only and with shuffled sentence words in the context are on average 89.2% and 78.5% of the original score, respectively. These results suggest that most of the questions already answered correctly by the model do not necessarily require grammatical and complex reasoning. For precise benchmarking, MRC datasets will need to take extra care in their design to ensure that questions can correctly evaluate the intended skills.
Submitted 20 November, 2019;
originally announced November 2019.
-
Reverberation Measurements of the Inner Radii of the Dust Tori in Quasars
Authors:
Takeo Minezaki,
Yuzuru Yoshii,
Yukiyasu Kobayashi,
Shota Sugawara,
Yu Sakata,
Keigo Enya,
Shintaro Koshida,
Hiroyuki Tomita,
Masahiro Suganuma,
Tsutomu Aoki,
Bruce A. Peterson
Abstract:
We present the results of a dust-reverberation survey of quasars at redshifts z<0.6. We found a delayed response of the K-band flux variation after the optical flux variation in 25 out of 31 targets, and obtained the lag time between them for 22 targets. Combined with the results for nearby Seyfert galaxies, we provide the largest homogeneous collection of K-band dust-reverberation data for 36 type 1 active galactic nuclei (AGNs). This doubles the sample and includes the most distant AGN and the largest lag so far measured. We estimated the optical luminosity of the AGN component of each target using three different methods: spectral decomposition, the flux-variation-gradient method, and image decomposition. We found a strong correlation between the reverberation radius for the innermost dust torus and the optical luminosity over a range of approximately four orders of magnitude in luminosity, as is already known for Seyfert galaxies. We estimated the luminosity distances of the AGNs based on their dust-reverberation lags, and found that the data in the redshift-distance diagram are consistent with the current standard estimates of the cosmological parameters. We also present the radius-luminosity relations for isotropic luminosity indicators such as the hard X-ray (14-195 keV), [OIV] 25.89 μm, and mid-infrared (12 μm) continuum luminosities, which are applicable to obscured AGNs.
Submitted 8 November, 2019; v1 submitted 19 October, 2019;
originally announced October 2019.
-
What Makes Reading Comprehension Questions Easier?
Authors:
Saku Sugawara,
Kentaro Inui,
Satoshi Sekine,
Akiko Aizawa
Abstract:
A challenge in creating a dataset for machine reading comprehension (MRC) is to collect questions that require a sophisticated understanding of language to answer beyond using superficial cues. In this work, we investigate what makes questions easier across 12 recent MRC datasets with three question styles (answer extraction, description, and multiple choice). We propose to employ simple heuristics to split each dataset into easy and hard subsets and examine the performance of two baseline models for each of the subsets. We then manually annotate questions sampled from each subset with both validity and requisite reasoning skills to investigate which skills explain the difference between easy and hard questions. From this study, we observed that (i) the baseline performances for the hard subsets remarkably degrade compared to those of entire datasets, (ii) hard questions require knowledge inference and multiple-sentence reasoning in comparison with easy questions, and (iii) multiple-choice questions tend to require a broader range of reasoning skills than answer extraction and description questions. These results suggest that one might overestimate recent advances in MRC.
Submitted 28 August, 2018;
originally announced August 2018.
-
Reverberation Measurements of the Inner Radius of the Dust Torus in 17 Seyfert Galaxies
Authors:
S. Koshida,
T. Minezaki,
Y. Yoshii,
Y. Kobayashi,
Y. Sakata,
S. Sugawara,
K. Enya,
M. Suganuma,
H. Tomita,
T. Aoki,
B. A. Peterson
Abstract:
We present the results of a dust reverberation survey for 17 nearby Seyfert 1 galaxies, which provides the largest homogeneous data collection for the radius of the innermost dust torus. A delayed response of the K-band light curve after the V-band light curve was found for all targets, and 49 measurements of lag times between the flux variation of the dust emission in the K band and that of the optical continuum emission in the V band were obtained. The lag times strongly correlated with the optical luminosity in the luminosity range of $M_V = -16$ to $-22$ mag, and the regression analysis was performed to obtain the correlation $\log Δt$ (days) $= -2.11 - 0.2\,M_V$ assuming $Δt \propto L^{0.5}$, which was theoretically expected. We discuss the possible origins of the intrinsic scatter of the dust lag-luminosity correlation, which was estimated to be about 0.13 dex, and we find that the difference of internal extinction and delayed response of changes in lag times to the flux variations could have partly contributed to intrinsic scatter. However, we could not detect any systematic change of the correlation with the subclass of the Seyfert type or the Eddington ratio. Finally, we compare the dust reverberation radius with the near-infrared interferometric radius of the dust torus and the reverberation radius of broad Balmer emission lines. The interferometric radius in the K band was found to be systematically larger than the dust reverberation radius in the same band by about a factor of two, which could be interpreted by the difference between the flux-weighted radius and the response-weighted radius of the innermost dust torus. The reverberation radius of the broad Balmer emission lines was found to be systematically smaller than the dust reverberation radius by about a factor of 4-5, which strongly supports the unified scheme of the Seyfert type of active galactic nuclei. (Abridged)
Submitted 9 June, 2014;
originally announced June 2014.
-
Inter-Band Effects of Magnetic Field on Hall Conductivity in Multi Layered Massless Dirac Fermion System $α$-(BEDT-TTF)$_2$I$_3$
Authors:
N. Tajima,
R. Kato,
S. Sugawara,
Y. Nishio,
K. Kajita
Abstract:
We have discovered a two-dimensional zero-gap material with a layered structure in the organic conductor $α$-(BEDT-TTF)$_2$I$_3$ under high hydrostatic pressure. In contrast to graphene, the electron-hole symmetry is not good except in the vicinity of the Dirac points. Thus, the temperature dependence of the chemical potential, $μ$, plays an important role in the transport in this system. An experimental formula for $μ$ is presented. We succeeded in detecting the inter-band effects of a magnetic field on the Hall conductivity when $μ$ passes the Dirac point.
Submitted 12 January, 2012;
originally announced January 2012.
-
Spin and Valley Splittings in Multilayered Massless Dirac Fermion System
Authors:
N. Tajima,
M. Sato,
S. Sugawara,
R. Kato,
Y. Nishio,
K. Kajita
Abstract:
The inter-layer magnetoresistance in a multilayered massless Dirac fermion system, $α$-(BEDT-TTF)$_2$I$_3$, under hydrostatic pressure was investigated. We succeeded in detecting the zero-mode ($n=0$) Landau level and its spin splitting in the magnetic field normal to the 2D plane. We demonstrated that the effective Coulomb interaction in the magnetic field intensifies the spin splitting of zero-mode Landau carriers. At temperatures below 2 K, magnetic fields above several tesla break the twofold valley degeneracy.
Submitted 28 September, 2010; v1 submitted 20 July, 2010;
originally announced July 2010.
-
Long-Term Optical Continuum Color Variability of Nearby Active Galactic Nuclei
Authors:
Yu Sakata,
Takeo Minezaki,
Yuzuru Yoshii,
Yukiyasu Kobayashi,
Shintaro Koshida,
Tsutomu Aoki,
Keigo Enya,
Hiroyuki Tomita,
Masahiro Suganuma,
Yuka Katsuno Uchimoto,
Shota Sugawara
Abstract:
We examine whether the spectral energy distribution of optical continuum emission of active galactic nuclei (AGNs) changes during flux variation, based on accurate and frequent monitoring observations of 11 nearby Seyfert galaxies and QSOs carried out in the B, V, and I bands for seven years by the MAGNUM telescope. The multi-epoch flux data in any two different bands obtained on the same night show a very tight linear flux-to-flux relationship for all target AGNs. The flux of the host galaxy within the photometric aperture is carefully estimated by surface brightness fitting to available high-resolution HST images and MAGNUM images. The flux of narrow emission lines in the photometric bands is also estimated from available spectroscopic data. We find that the non-variable component of the host galaxy plus narrow emission lines for all target AGNs is located on the fainter extension of the linear regression line of multi-epoch flux data in the flux-to-flux diagram. This result strongly indicates that the spectral shape of AGN continuum emission in the optical region does not systematically change during flux variation. The trend of spectral hardening reported by many studies, in which the optical continuum emission becomes bluer as it becomes brighter, is therefore interpreted as follows: as an AGN brightens, its variable component, which has a nearly constant spectral shape, increasingly dominates over the non-variable component of the host galaxy plus narrow lines, which is usually redder than the AGN continuum emission.
Submitted 28 January, 2010;
originally announced January 2010.
-
Variation of Inner Radius of Dust Torus in NGC4151
Authors:
Shintaro Koshida,
Yuzuru Yoshii,
Yukiyasu Kobayashi,
Takeo Minezaki,
Yu Sakata,
Shota Sugawara,
Keigo Enya,
Masahiro Suganuma,
Hiroyuki Tomita,
Tsutomu Aoki,
Bruce A. Peterson
Abstract:
The long-term optical and near-infrared monitoring observations of the type 1 active galactic nucleus NGC 4151 were carried out for six years from 2001 to 2006 by using the MAGNUM telescope, and a delayed response of flux variations in the $K(2.2μm)$ band to those in the $V(0.55μm)$ band was clearly detected. Based on cross-correlation analysis, we precisely measured a lag time $Δt$ for eight separate periods, and we found that $Δt$ is not constant, changing between 30 and 70 days during the monitoring period. Since $Δt$ is the light travel time from the central energy source out to the surrounding dust torus, this is the first convincing evidence that the inner radius of the dust torus did change in an individual AGN. In order to relate such a change of $Δt$ to a change of AGN luminosity $L$, we presented a method of taking an average of the observed $V$-band fluxes that corresponds to the measured value of $Δt$, and we found that the time-changing track of NGC 4151 in the $Δt$ versus $L$ diagram during the monitoring period deviates from the relation of $Δt \propto L^{0.5}$ expected from dust reverberation. This result, combined with the elapsed time from period to period for which $Δt$ was measured, indicates that the timescale of dust formation is about one year, which should be taken into account as a new constraint in future studies of dust evolution in AGNs.
Submitted 10 July, 2009; v1 submitted 3 July, 2009;
originally announced July 2009.
-
Effects of the Zero-Mode Landau Level on Inter-Layer Magnetoresistance in Multilayer Massless Dirac Fermion Systems
Authors:
N. Tajima,
S. Sugawara,
R. Kato,
Y. Nishio,
K. Kajita
Abstract:
We report on the experimental results of interlayer magnetoresistance in the multilayer massless Dirac fermion system $α$-(BEDT-TTF)$_2$I$_3$ under hydrostatic pressure and its interpretation. We succeeded in detecting the zero-mode Landau level ($n=0$ Landau level) that is expected to appear at the contact points of Dirac cones in the magnetic field normal to the two-dimensional plane. The characteristic feature of zero-mode Landau carriers, including the Zeeman effect, is clearly seen in the interlayer magnetoresistance.
Submitted 30 April, 2009; v1 submitted 4 December, 2008;
originally announced December 2008.