Animate, or Inanimate, That is the Question for Large Language Models

Leonardo Ranaldi, Giulia Pucci and Fabio Massimo Zanzotto
School of Informatics, University of Edinburgh, UK.
Department of Computing Science, University of Aberdeen, UK
Human-Centric ART Group, University of Rome Tor Vergata, Italy
[first_name].[last_name]@uniroma2.it

Abstract

The cognitive essence of humans is deeply intertwined with the concept of animacy, which plays an essential role in shaping their memory, vision, and multi-layered language understanding. Although animacy appears in language via nuanced constraints on verbs and adjectives, it is also learned and refined through extralinguistic information. Similarly, we assume that the LLMs’ limited abilities to understand natural language when processing animacy are motivated by the fact that these models are trained exclusively on text.

Hence, the question this paper aims to answer arises: can LLMs, in their digital wisdom, process animacy in a similar way to what humans would do? We then propose a systematic analysis via prompting approaches. In particular, we probe different LLMs by prompting them using animate, inanimate, usual, and stranger contexts. Results reveal that, although LLMs have been trained predominantly on textual data, they exhibit human-like behavior when faced with typical animate and inanimate entities in alignment with earlier studies. Hence, LLMs can adapt to understand unconventional situations by recognizing oddities as animated without needing to interface with unspoken cognitive triggers humans rely on to break down animations.


1 Introduction

The mnemonic abilities underlying cognitive processing seem to enable animate entities and concepts to be more easily memorized, which highlights the role of the animacy effect in human cognition New et al. (2007); Nairne et al. (2013).

Animacy is manifested via language through the faculty that humans have in using certain verbs or adjectives with animate and inanimate entities and accordingly inferring and reasoning about the mental states, intentions, and reactions of others. This allows them to navigate and understand social interactions. For this reason, using NLP models in increasingly complex social contexts necessitates the same ability to capture these socio-cognitive dynamics.

Current Large Language Models (LLMs) Chowdhery et al. (2022); Touvron et al. (2023); OpenAI (2023), such as the GPTs OpenAI (2023), PaLM Chowdhery et al. (2022), and Llama Touvron et al. (2023), are trained merely on textual data and cannot access non-verbal information, unlike humans. If faced with discerning animacy, they must infer it from its downstream linguistic implications, diverging from humans who also benefit from visual and physical stimuli. Thus, a fundamental question arises: Do LLMs perceive and respond to animacy hooks in language in a way as close to human comprehension as possible?

Figure 1: Large Language Models seem to be as surprised as humans at the thought of experiencing inanimate entities in animate contexts.

This work investigates whether LLMs “behave” as humans do when dealing with animacy. We conduct extensive investigations using LLMs as subjects in psycholinguistic experiments developed for humans, exploring how LLMs respond to violations of the selective constraints associated with animacy in typical and atypical settings. Complementing the foundational work of Warstadt et al. (2020); Spiliopoulou et al. (2022); Hanna et al. (2023), we study the animacy effect via prompting approaches, eliciting different LLMs to interpret scenarios and situations that demand intricate reasoning. We find that, like humans, LLMs strongly prefer sentences adhering to animacy-related constraints. These similarities are not restricted to typical animacy: the behaviour of LLMs encountering atypical animate entities still seems to align with that of humans, both in the surprise at first encounter and in the adaptation that follows, when they exhibit significantly less surprise.

Our findings can be summarized as follows:

• By proposing a systematic analysis based on the prompting of LLMs, we evaluate the animacy effect, extending the results obtained from previous contributions.

• In particular, using psychological tests designed for humans, we observe that the LLMs not only prefer sentences that adhere to animacy constraints, as shown in Warstadt et al. (2020), but are also able to adapt in atypical scenarios just as humans do Nieuwland and Van Berkum (2006).

• Finally, we show that LLMs not only exhibit robust behaviour comparable with humans but also deliver answers that best approximate the placed expectations.

2 Models & Methods

To investigate whether Large Language Models (LLMs) are able to understand and generate language in a way that reflects human expectations, we need to understand whether they are able to best approximate human knowledge of words and their cognitive passages. Hence, using LLMs as subjects (§ 2.1), we study if behaviours can be manifested in human-designed experiments conceived for studying animacy (§ 2.2). We propose a systematic prompt-based approach for LLMs through which we discuss the results in § 2.3. Finally, we outline a general discussion of the findings in § 2.4.

2.1 The Subjects

The animacy effects behind state-of-the-art Large Language Models are analyzed via systematic prompting in three groups of models:

• two subjects from the OpenAI family OpenAI (2023): GPT-3.5 and GPT-4;

• three subjects from the Meta family Touvron et al. (2023): Llama2-chat-7b, -13b, -70b;

• two subjects from the Mistral family Jiang et al. (2023, 2024): Mixtral8x7b and Mistral-7b.

We use both open-source models¹ to make our work more reproducible and closed-source models because they demonstrate outstanding performance in many NLP tasks.

¹ To simplify the discussion, we omit “chat” and “b”. The resulting names are Llama2-7, -13, -70, Mixtral, and Mistral-7.

Finally, as we describe in each experiment, we evaluate the accuracy scores computed via string matching between the final and the target answers (detailed information in Appendices B, C).

2.2 Selected Experimental Settings

To adapt our ‘subjects’ to the experimental settings proposed on humans, we discern between two different types of experiments: (1) typical animacy (§ 2.2.1); (2) atypical animacy (§ 2.2.2). The two different kinds of experiments are needed as typical animacy is more a lexical task, and atypical animacy is more a contextual task from the point of view of LLMs.

2.2.1 Typical Animacy

In typical animacy experiments, subjects are prompted to determine which word in a pair is animate and which is not (e.g., if "frogs" are animated and "mountains" are not). Hence, we use two different settings:

• the Benchmark of Linguistic Minimal Pairs (BLiMP) Warstadt et al. (2020) in § 2.3.1;

• the Benchmark of Sentence Plausibility (BSP) Vega-Mendoza et al. (2021) in § 2.3.2.

Acceptable	Example
Sub-task: Passive
Yes	The glove was noticed by some woman.
No	The glove was noticed by some mouse.
Yes	Galileo is concealed by the woman.
No	Galileo is concealed by the horse.
Sub-task: Transitive
Yes	Beth scares Roger.
No	A carriage scares Roger.
Yes	Tanya admires Melanie.
No	Music admires Melanie.

Table 1: Two examples from the Transitive and Passive datasets. Each is a minimal pair of sentences: one Acceptable (Yes) and one not (No).

In BLiMP, we select two sub-tasks: transitive-animate and passive-animate. Each sub-task has 1,000 pairs of synthetic English sentences that differ by only one or two words (Table 1).

Meanwhile, in BSP, we use sentences containing plausible and implausible words with different nuances. The resource contains 1,500 synthetic sentences in English. Each sentence has a fixed initial part and an interchangeable final part between animated plausible, animated implausible (inherent and non-related), and inanimate implausible (inherent and non-related) words (Table 2). From the point of view of our subjects, that is, LLMs, this psychological experiment is translated into a lexical task.

Sentence: At the club the cocktails are served by the _
Plausibility	Condition	Word
Plausible	Control	barmaid
Implausible	Animate-Related	drunkard
Implausible	Animate-Unrelated	queen
Implausible	Inanimate-Related	tonic
Implausible	Inanimate-Unrelated	dirt

Table 2: Example from Benchmark of Sentence Plausibility. Each sentence has a plausible and four non-plausible words. As proposed by Vega-Mendoza et al. (2021) we use the options as different tasks.

2.2.2 Atypical Animacy

In contrast to § 2.2.1, for investigating if the subjects are able to detect animacy without relying on lexical information of the target word, we employ repetition and contextual study where the inanimate entities are treated as animated entities Nieuwland and Van Berkum (2006). This shifts the focus from the lexical knowledge of the target word to the contextual knowledge. The human experiments are based on N400, a brain response measured by EEG that rises when processing semantically anomalous input.

The repetition study measured participants’ N400 responses while reading cartoon-like stories in which a typically inanimate entity behaved as animate (Table 4). Nieuwland and Van Berkum (2006) found that although initially surprised by the atypically animated entity, participants quickly adapted, producing increasingly lower N400 responses. In contrast, the contextual study takes its measurements only after a contextualization part, since the repetition experiment shows similarities with the work of Caramazza and Shelton (1998). These contexts are given as in Table 3, where participants are asked to read the story with each of the targets in turn.

A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew. The sandwich was [ targets ] and wanted to make sure none of his loved ones were in danger.

targets: delicious, worried

Table 3: Example from translated context story of N400 Nieuwland and Van Berkum (2006).

Hence, these experiments are useful for investigating the following questions: Can LLMs adapt to animated entities at the token level despite being typically inanimate? Or is animate processing limited to a simple type-level understanding? We replicate these with LLMs to answer this question, using their surprise to model N400 responses.

We conduct two different experiments: in the first experiment presented in § 2.3.3, we reproduce the repetition and context as in Nieuwland and Van Berkum (2006); in a second experiment (§ 2.3.4), we analyze the impact of context adaptation as proposed in Boudewyn et al. (2019). For clarity, we introduce the original study and the methods we used for the context adaptation of LLMs. Finally, we report our empirical results and compare them with those of the original study.

( $T_{1}$ ) A granny met the (confectioner-apple pie) at the market with whom she started a pleasant conversation about recipes. ( $T_{2}$ ) The (confectioner-apple pie) confided a secret recipe to the granny. ( $T_{3}$ ) But the granny deceived the (confectioner-apple pie) by making off with the recipe herself. ( $T_{4}$ ) The (confectioner-apple pie) discovered the deception and wanted to reprimand the granny. But the granny pleased the (confectioner-apple pie) with an even better version of the recipe. ( $T_{5}$ ) The (confectioner-apple pie) understood that this was the ultimate recipe and apologized for the misplaced distrust.

Table 4: Example from translated version of N400 Nieuwland and Van Berkum (2006). The first tokens indicate an acceptable example, and the numbers indicate the sentences given as context.

2.3 Experimenting with LLM subjects

2.3.1 Experiment 1: Typical Animacy on BLiMP

Prompt definition

By constructing a series of prompts over the datasets presented in §2.2.1, we test the models’ answers to animacy in situations where the animacy of an instance aligns with its more general type.

Sentence pairs in BLiMP Warstadt et al. (2020) are built as follows: one sentence respects the animacy constraints, and the other violates them. Hence, there is a straightforward way to evaluate an LLM’s ability to pass the animacy test. We ask the models to answer the following prompt:

Choose which example is acceptable between A and B. A) Galileo is concealed by the woman. B) Galileo is concealed by the horse. Answer:

A model answers an example correctly if it chooses the sentence that respects the animacy constraint.

Following this approach, we evaluate the accuracies by performing string matching between the generated answers and the target values on both sub-tasks.
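The prompting and string-matching steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the rule for extracting the chosen letter, and the convention that the target is given as "A" or "B" are all assumptions, since the exact matching procedure is deferred to the appendices.

```python
import re

# Prompt wording taken from the example above; the helper names are hypothetical.
PROMPT_TEMPLATE = (
    "Choose which example is acceptable between A and B. "
    "A) {a} B) {b} Answer:"
)


def build_pair_prompt(sentence_a: str, sentence_b: str) -> str:
    """Fill the two sentences of a minimal pair into the prompt."""
    return PROMPT_TEMPLATE.format(a=sentence_a, b=sentence_b)


def extract_choice(generated: str):
    """Return 'A' or 'B' if the generation starts with that option letter.

    Assumption: the model answers with the letter first, e.g. "A", "(A)" or "A) ...".
    Anything else is treated as no answer (None).
    """
    match = re.match(r"\s*\(?([AB])\b", generated)
    return match.group(1) if match else None


def accuracy(generated_answers, targets) -> float:
    """Percentage of generations whose extracted letter equals the target letter."""
    correct = sum(extract_choice(g) == t for g, t in zip(generated_answers, targets))
    return 100.0 * correct / len(targets)
```

Whether the acceptable sentence is always presented as option A or its position is randomized is not stated in this section, so the sketch leaves that choice to the caller.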

Results

The results of this first set of experiments are intriguing. The OpenAI family performs on par with humans, and the Meta and Mistral families are catching up. Figure 2 shows the results of each model (vertical bars) and the results obtained by humans (horizontal dashed lines), as presented in Warstadt et al. (2020). The accuracy metric is the percentage of examples in which human or artificial subjects preferred the acceptable sentence of the given pair.

Human Transitive Accuracy is reached by GPT-3.5 and topped by GPT-4. This seems to suggest that these models can handle the lexicon to determine typical animacy. Llama lags behind, but it is reaching the Human level.

Similarly, in the passive scenario, GPT-4 performs very close to humans. GPT-3.5, Llama2-70, and Mixtral show comparable or slightly lower performance in the transitive scenario and significantly lower performance in the passive scenario. Finally, the smaller models, i.e., those with fewer parameters, underperform humans with average gaps of 20 points.

This difference between the transitive and passive cases may be due more to differences in setting than to different animacy processing in the two scenarios. It emerges, however, that the composition of the choices is strongly class-related. Indeed, in the passive case, the target word, i.e., the most influential one, is always in the last position. In contrast, the target word is not the last token in the transitive case. Thus, heuristics related to sensitivity to the structure of the choices in the input prompts may be present.

2.3.2 Experiment 2: Typical Animacy on BSP

Prompt definition

In the second experiment, we structure the prompting phase similarly. By constructing a series of prompts over the BSP benchmark (§ 2.2.1), we test the models’ responses to plausible and implausible sentences with animate and inanimate components.

Following Vega-Mendoza et al. (2021), we analyze the models’ answers to the plausibility question on five inputs, each constructed from a sentence completed with a different type of final word. This gives a straightforward way to evaluate our subjects’ ability to pass the plausibility test. We use the following prompt:

Is the following sentence plausible? Answer by choosing (Yes) or (No). Sentence: In ancient Egypt the people were governed by the pyramid. Answer:

A model gets an answer correct if it answers the question with Yes/No in a way that respects plausibility. The accuracy of the LLMs is computed as follows: for the plausible control word, accuracy is the percentage of Yes answers, and for all the implausible words it is the percentage of No answers.
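The per-condition accuracy described above can be sketched as below. The condition labels, the record layout, and the answer-normalization rule are assumptions made for illustration; only the scoring rule (Yes expected for the control word, No for every implausible word) comes from the text.

```python
import re
from collections import defaultdict


def expected_answer(condition: str) -> str:
    """Control words make the sentence plausible; every other condition does not."""
    return "yes" if condition == "control" else "no"


def normalize(generated: str):
    """Map a generation such as "Yes", "(No)" or "No." to "yes"/"no", else None."""
    match = re.match(r"\(?\s*(yes|no)\b", generated.strip().lower())
    return match.group(1) if match else None


def accuracy_by_condition(records):
    """records: iterable of (condition, generated_answer) pairs.

    Returns the percentage of expected answers for each condition.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, generated in records:
        totals[condition] += 1
        hits[condition] += normalize(generated) == expected_answer(condition)
    return {c: 100.0 * hits[c] / totals[c] for c in totals}
```

Generations that contain neither answer are counted as incorrect here, which is one possible convention among several.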

Results

In this second experiment, LLMs of the OpenAI family again behave similarly to humans. Figure 3 shows the accuracy of each model alongside the results obtained by humans, as presented in Vega-Mendoza et al. (2021). When dealing with animate words (see Figure 3), humans and LLMs behave similarly. Humans perform on animate-unrelated words much as they do on control words, but they are less able to recognize that animate-related words make the target sentences implausible. The same happens for all the LLM subjects. GPT-4 performs better than humans while preserving the gap between recognizing the implausibility of sentences built with animate-unrelated and animate-related words. Likewise, when dealing with inanimate words (see the right plot in Figure 3), humans and LLMs behave similarly. Humans recognize the implausibility of inanimate-unrelated words but show a slight decrease in recognizing inanimate-related ones. The same trend holds for all the LLM subjects and, consistently with the other experiments, the OpenAI family performs better than humans.

In humans, these differences in the plausibility of animate and inanimate cases arise from a combination of cognitive factors, as explained by Vega-Mendoza et al. (2021). As in the experiments in § 2.3.1, the GPTs perform comparably to humans and sometimes outperform them. However, even in this task, there is a robust structural component related to fearfulness. The target words, i.e., those that determine the final decision, are always in the last position. Therefore, there may be a heuristic related to sensitivity to the structure of the choices in the input prompts.

2.3.3 Experiment 3: Atypical Animacy - Repetition

Human experiment and its results

The repetition experiment on atypical animacy Nieuwland and Van Berkum (2006) measures the N400 responses of participants who listened to Dutch stories containing either a typically animate entity or an inanimate entity behaving as if it were a human being. The N400 values are measured at three stages: the first ( $T_{1}$ ), the third ( $T_{3}$ ), and the fifth ( $T_{5}$ ) mention of the entity (see, for example, Table 4 with confectioner and apple pie). Nieuwland and Van Berkum (2006) discovered that:

• in the case of animate entities, participants have a moderate N400 response to the first mention ( $T_{1}$ ) and a low response to subsequent mentions ( $T_{3}$ and $T_{5}$ );

• in the case of inanimate entities, participants initially ( $T_{1}$ ) have a high N400 response to the atypically animated entity, and, as the mentions progress ( $T_{3}$ and $T_{5}$ ), their N400 responses come very close to the responses to mentions of the animate entity.

Thus, while humans are initially surprised by the atypically animated entity, they quickly adapt to the situation and no longer find it surprising. Moreover, Nieuwland and Van Berkum (2006) show that these responses derive not from lexical repetition but from context. In the contextual experiment, participants are first given a context, and their N400 responses are estimated only at the end, yielding low values for inanimate entities in atypical contexts.

Prompt definition

To estimate a surprise value analogous to the N400, state-of-the-art studies examine token probability values. However, some of the models used in our study do not provide access to probability values, prompting us to define a series of prompts to query the model about its level of surprise systematically. In particular, for each of the 60 examples, we estimate the surprise of the animate and inanimate entity given by the context at each time-step, denoted as $T_{n}$ , (which, in our case, refers to an input-prompt). For instance, to model the inanimate N400 response at $T_{1}$ in the example from Table 4, we construct the following prompt:

Choose a surprising value from 0 to 30 on the following story: A granny met the apple pie at the market with whom she started a pleasant conversation about recipes. Answer:[num]

Following the time-steps, we introduce additional prompts by contextualizing the preceding story, whether animate or inanimate. For example, for the inanimate scenario:

Given the following story: A granny met the apple pie at the market with whom she started a pleasant conversation about recipes. The apple pie confided to............. Choose a surprising value from 0 to 30 on the following story: The apple pie that this was the ultimate recipe and apologized for the misplaced distrust. Answer:[num]

Hence, we compute the average surprise value of examples containing animate and inanimate entities separately at each time-step.
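The progressive prompting and averaging described above can be sketched as follows. This is an illustrative reading of the two prompt examples, not the authors' pipeline: the helper names, the rule for parsing the numeric answer, and the choice to discard out-of-range values are assumptions.

```python
import re
from statistics import mean

# Question wording taken from the prompt examples above.
QUESTION = (
    "Choose a surprising value from 0 to 30 on the following story: "
    "{target} Answer:"
)


def build_prompt(sentences, step: int) -> str:
    """Prompt for time-step `step` (1-based): earlier sentences become the context."""
    target = sentences[step - 1]
    context = " ".join(sentences[: step - 1])
    question = QUESTION.format(target=target)
    if not context:
        return question
    return f"Given the following story: {context} {question}"


def parse_surprise(generated: str):
    """Return the first integer in the generation if it lies in [0, 30], else None."""
    match = re.search(r"\d+", generated)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 30 else None


def mean_surprise_per_step(ratings):
    """ratings: one {step: value} dict per story; returns {step: mean value}."""
    steps = sorted({s for r in ratings for s in r})
    return {s: mean(r[s] for r in ratings if r.get(s) is not None) for s in steps}
```

Running `mean_surprise_per_step` once over the animate versions of the stories and once over the inanimate versions gives the two curves that are compared at each time-step.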

Results

The LLMs follow the general trends of human N400 responses (Figure 4). As reported above, human N400 responses for animate and inanimate critical words diverge at $T_{1}$ and come closer at $T_{3}$ and $T_{5}$ . LLM subjects behave similarly. At $T_{1}$ , models are surprised by the inanimate entity and far less so by the animate one. At later steps ( $T_{3}$ and $T_{5}$ ), the surprise for inanimate entities decreases until it reaches levels similar to those for animate entities. LLMs seem to adapt, just as humans do. However, the raw results do not show that the models adapt to the same extent as humans.

As in Nieuwland and Van Berkum (2006), we use the Wilcoxon signed-rank test to check the robustness of these results. We observe distinct surprise values at each time-step. As with humans, LLMs show a statistically significant difference between the surprise for animate and inanimate entities, for example at $T_{1}$ (Table 5). However, while humans show no difference at $T_{3}$ , most models still do (p < 0.01); only the largest show none. At $T_{5}$ , differences disappear only in the large models. Although the models can generally approximate the trends of human N400 responses to atypical animacy, only the largest and most capable fully replicate human adaptation.
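For readers unfamiliar with the test, a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test is given below. It assumes that the animate and inanimate surprise values are paired story by story, which this section does not state explicitly, and it uses the large-sample normal approximation without continuity or tie correction, so p-values on small samples are only approximate. A statistics library would normally be preferable.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples x and y.

    Returns (W, p) where W = min(W+, W-) and p comes from the normal
    approximation (no continuity or tie correction). Zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Under the null hypothesis, W has this mean and standard deviation.
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, p
```

Applied per model and per time-step to the paired surprise values, a p-value below the chosen threshold would indicate that animate and inanimate entities are still rated differently at that step.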

Table 5: Statistical significance of the difference between animate and inanimate surprisal, by LLMs response values and time-step

2.3.4 Experiment 4: Atypical Animacy - Context Experiment

Human experiments & results

Nieuwland and Van Berkum (2006) discover that contextual appropriateness seems to neutralize animacy violations; that is, non-appropriate adjectives (such as “worried” in Table 3) do not generate much surprise if the context suggests them. Moreover, context can even make an animacy-violating predicate preferred over an animacy-obeying canonical predicate if the context justifies this.

Prompt definition

Using the examples for the context experiment (as in Table 3), we ask for a surprise value for the animate and inanimate adjectives in each of the 60 stories proposed by Nieuwland and Van Berkum (2006). In particular, we use input-prompt structures similar to the previous ones:

Given the following context: +Context A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew. Baseline Choose a surprising value from 0 to 30 on the following story: The sandwich was delicious and wanted to make sure none of his loved ones were in danger. Answer:[num]

To estimate absolute values, we also ask for baseline surprises, that is, those of the inanimate adjective without the context of the whole story.
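The comparison between baseline and in-context surprise can be sketched as below. The prompt wording follows the example above, while the helper names, the record layout, and the definition of the "drop" as baseline minus in-context surprise are assumptions introduced for illustration.

```python
from statistics import mean

# Question wording taken from the prompt example above.
QUESTION = (
    "Choose a surprising value from 0 to 30 on the following story: "
    "{target} Answer:"
)


def baseline_prompt(target: str) -> str:
    """Ask for the surprise of the target sentence without any story context."""
    return QUESTION.format(target=target)


def context_prompt(context: str, target: str) -> str:
    """Ask for the surprise of the target sentence after showing the story context."""
    return f"Given the following context: {context} {QUESTION.format(target=target)}"


def mean_drop_by_adjective(ratings):
    """ratings: dicts with keys 'adjective_type', 'baseline' and 'context'.

    Returns the mean (baseline - context) surprise drop per adjective type.
    """
    drops = {}
    for row in ratings:
        drops.setdefault(row["adjective_type"], []).append(
            row["baseline"] - row["context"]
        )
    return {kind: mean(values) for kind, values in drops.items()}
```

A larger mean drop for one adjective type means that the context reduced the reported surprise more for that type.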

Results

In this experiment too (Figure 5), LLM subjects behave similarly to human subjects. The animate baseline is larger than the inanimate baseline for all subjects. Even in the baseline, there is conflicting information as to the presence or absence of animacy of the selected subject. Moreover, the drop in surprise with context is more pronounced for inanimate than for animate adjectives. This is in line with the human experiments.

2.4 General discussion

The experiments deliver a coherent message: despite their lack of embodiment and senses, LLMs behave as humans do in animacy understanding. For the two typical animacy tasks (Experiments 1 and 2), LLMs, like humans, can be surprised by both animate and inanimate examples. The intriguing part is that LLMs of the OpenAI family behave very similarly to humans. Moreover, the atypical animacy tasks (Experiments 3 and 4) rule out the possibility that LLMs rely only on lexical information correlated with words. Indeed, in the atypical animacy tasks, subjects are probed on their level of surprise at an inanimate object performing actions as if it were animate. These two experiments track how surprise decreases as subjects are exposed to more context; hence, this is not a lexical task. Nevertheless, as with typical animacy, LLM subjects behave like human subjects: their level of surprise decreases more for inanimate objects than for animate ones. Finally, we complete our analysis by providing evidence of the stability of the assessments on the generations in Appendix A.

3 Related Work

The relationship between animacy and language has long been a subject of interest in cognitive linguistics. Visualized as a spectrum, animacy influences linguistic structures and shapes cognitive interpretations of entities based on their perceived liveliness. There is a long tradition of studying animacy in Natural Language (§ 3.1). At the same time, these explorations began to spread within the NLP community with attractive evidence of the abilities of Language Models in handling linguistic properties (§ 3.2). However, the advent of Large Language Models (LLMs) has revolutionized many previously adopted analytical methods. Therefore, we ask whether current models can handle linguistic properties in the same way as their predecessors and, even more so, whether these models are able to generate answers in line with the expectations placed on them by humans. In an analytical scenario, we adopt a psycholinguistic lens, treating these models as subjects to assess their understanding and processing of animacy.

3.1 The Animacy in Natural Language

Animacy in cognitive processes is described as going along a continuum García et al. (2018). This is shown through a hierarchy of humans, animals, and objects in language. Entities are distinguished based on their position in this ranking through sentence structure or word form Gass (1984). It can be traced at the broad category and the specific instance levels. Similarly, linguistic animacy is not solely grounded in biological factors but also hinges on the speaker’s emotional connection and empathy towards a specific entity Vihman and Nelson (2019). The impact of animacy in language is not uniform across different languages; it can range from explicit markers of animacy to more subtle influences. Such subtleties encompass strict constraints based on animacy Caplan et al. (1994); Buckle et al. (2017) and nuanced grammatical impacts Rosenbach (2008). For instance, sentences more frequently begin with animate entities, even if this results in less conventional structures Ferreira (1994); Fairclough (2008). We analyzed the distinction between humans and inanimate objects and the constraints based on animacy. Such a pronounced differentiation is anticipated to yield more discernible effects in LLMs.

3.2 Large Language Models as Tests Subjects

Previous works present investigations on the linguistic capabilities of Language Models² regarding structural properties, operating through the analysis of the probabilities assigned to given sentences. Such techniques have previously been used to probe LMs’ understanding of constructs such as negation, structural agreement, and in-context priming Sinclair et al. (2022); Jumelet et al. (2024). Recent studies have compared the performance of LLMs with human cognition by utilizing surprisal, the negative log probability of a sequence, as an estimator for cognitive exertion Michaelov et al. (2023); Hanna et al. (2023). LLMs have displayed notable versatility, demonstrating significant correlations with human values Aurnhammer and Frank (2018); Goodkind and Bicknell (2018); Truong et al. (2023).

² Non-large and non-instruction-tuned or further refined.
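For reference, surprisal as described above can be written out as follows. The notation ($S$, $w_t$, $w_{<t}$) is introduced here for illustration and is not taken from the cited works; the decomposition over tokens follows from the chain rule of probability.

```latex
S(w_t) = -\log P(w_t \mid w_{<t}), \qquad
S(w_1, \dots, w_n) = -\log P(w_1, \dots, w_n) = \sum_{t=1}^{n} S(w_t)
```

A token or sequence that the model considers unlikely in its context therefore receives a high surprisal, which is the quantity these studies relate to human processing effort.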

Large Language Models

Compared with smaller Language Models, current Large LMs (GPTs OpenAI (2023), Llamas Touvron et al. (2023), et alia) have demonstrated capabilities in challenging, complex tasks by delivering multi-step reasoned answers Wei et al. (2022, 2023). Refinement techniques that lead the models to best approximate human expectations in their responses are increasingly adopted (e.g., instruction-tuning Ouyang et al. (2022), reinforcement learning from human preferences Christiano et al. (2023)). We analyze whether the emergent capabilities these methods give rise to also appear in psycholinguistic tasks designed for humans to detect animacy. Complementing the foundational works Michaelov et al. (2023); Truong et al. (2023); Spiliopoulou et al. (2022); Buijtelaar and Pezzelle (2023); Hanna et al. (2023), we extend the analyses to include further tests and revisit prompt-based evaluation by eliciting the models to generate responses (§ 3.4). We evaluate different LLMs by employing them as subjects within a psycholinguistic framework, an increasingly adopted methodology in the field. This approach psychoanalyses the LLMs by evaluating their responses to several questions, just as would be done with humans.

3.3 Animacy in Large Language Models

Previous works have investigated the LMs’ abilities in processing animacy. Warstadt et al. (2020) explored the phenomenon within the BLiMP framework. Kauf et al. (2023) investigate within LMs’ overall event knowledge, concluding that models are adept at discerning animacy concerning selective constraints. Our study goes forward exploring different animacy. We introduce a surprise score following Nieuwland and Van Berkum (2006) and Michaelov et al. (2023); Hanna et al. (2023). Then, we focused on transferring the experimental setting into generative models where accessing the internal weights (closed-source LLMs) is often impossible. These efforts aim to study scenarios in which the models efficiently capture the trends manifested by analyzing open LMs. Furthermore, we aim to investigate the correlation between the robustness and predictive accuracy of an LM by evaluating a broad spectrum of LLMs.

3.4 Our Contribution

Complementing the earlier foundational work (see § 3.2), our work goes beyond it as follows:
(i) We propose a systematic prompting pattern and analyze natural language responses as one would for humans. Specifically, we establish a prompting pipeline for estimating the LLMs’ understanding of the acceptability and plausibility of concepts related to animate and inanimate entities. Moreover, we extend to generative models an approach based on a series of progressive in-context prompts to simulate the estimation of the N400 neurological response.
(ii) By placing LLMs in atypical contexts with animated entities, we show similarities to the results of tests performed on humans and to the results from previous contributions.
(iii) Finally, we show that the prompting approaches are affected by only minor bias, which allows fair analogies between results obtained by LLMs and prior findings (Appendix A).

4 Conclusion

Large Language Models (LLMs) reveal capacities to solve repetitive cognitive tasks by exploiting them better and faster than humans. We treat LLMs as subjects in psychological experiments, exploring whether they behave as humans when dealing with the concept of animacy. We show that LLMs behave as humans even when lexical information does not entirely lead the decision. This is astonishing, as humans’ cognitive underpinnings are intricately linked with the concept of animacy. Although the LLMs subjects of our analysis are trained primarily on textual data without the support of extralinguistic information, they show human-like behaviour when exposed to typical entities and adapted to unconventional narratives. The ability to adapt, while remarkable, is sometimes aligned with human fluidity. Hence, it becomes imperative to provide robust analyses of LLMs’ behaviour in intricate social scenarios by studying the ability to emulate human-like processes. While this analysis shows significant results, current models are still grappling with assessing the understanding of social interactions. To truly build models aligned with human minds, they must merge their vast textual knowledge with a deep understanding of human social dynamics.

Limitations & Future Works

In this study, we used a series of behavioral experiments. Because Large Language Models are easy to query on benchmarks, such experiments are suitable for comparing the models with human data. Although the models have shown exciting features, analyzing the causal mechanisms by which they processed the animated sentences is not possible in this setting. Such an analysis would require inspecting the internal weights of the models, which are not always open-source, as in the case of the GPT models OpenAI (2023).

In future developments, we plan to extend the analysis to more languages to assess whether the models respond in the same way in similar scenarios and contexts involving languages beyond English. In addition, it is of interest to us to assess the impact of the in-context prompt, in particular, the degree to which the composition of the prompt may influence causal generations as done for the sycophantic behavioural study in Ranaldi and Pucci (2024). Last but not least, it will be of interest for us to analyze the internal dynamics that support the models’ decisions in order to better understand the neuronal patterns that motivate the generations Mohebbi et al. (2024).

Ethics Statement

In our work, ethical topics were not addressed. The data comes from open-source benchmarks, and statistics on language differences in commonly used pre-training data were obtained from official sources without touching on gender, sex, or race differences.

References

Aurnhammer and Frank (2018) Christoph Aurnhammer and S. Frank. 2018. Comparing gated and simple recurrent neural network architectures as models of human sentence processing. In Annual Meeting of the Cognitive Science Society.
Boudewyn et al. (2019) Megan A. Boudewyn, Adam R. Blalock, Debra L. Long, and Tamara Y. Swaab. 2019. Adaptation to animacy violations during listening comprehension. Cognitive, Affective, & Behavioral Neuroscience, 19(5):1247–1258.
Buckle et al. (2017) Leone Buckle, Elena Lieven, and Anna L. Theakston. 2017. The effects of animacy and syntax on priming: A developmental study. Frontiers in Psychology, 8.
Buijtelaar and Pezzelle (2023) Lars Buijtelaar and Sandro Pezzelle. 2023. A psycholinguistic analysis of BERT’s representations of compounds. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2230–2241, Dubrovnik, Croatia. Association for Computational Linguistics.
Caplan et al. (1994) David Caplan, Nancy Hildebrandt, and Gloria S. Waters. 1994. Interaction of verb selectional restrictions, noun animacy and syntactic form in sentence processing. Language and Cognitive Processes, 9(4):549–585.
Caramazza and Shelton (1998) Alfonso Caramazza and Jennifer R. Shelton. 1998. Domain-specific knowledge systems in the brain: The animate-inanimate distinction. J. Cognitive Neuroscience, 10(1):1–34.
Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways.
Christiano et al. (2023) Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep reinforcement learning from human preferences.
Fairclough (2008) Norman Fairclough. 2008. The language of critical discourse analysis: Reply to Michael Billig. Discourse & Society, 19(6):811–819.
Ferreira (1994) F. Ferreira. 1994. Choice of passive voice is affected by verb type and animacy. Journal of Memory and Language, 33(6):715–736.
García et al. (2018) Marco García García, Beatrice Primus, and Nikolaus P. Himmelmann. 2018. Shifting from animacy to agentivity. Theoretical Linguistics, 44:25 – 39.
Gass (1984) Susan M. Gass. 1984. A review of interlanguage syntax: Language transfer and language universals. Language Learning, 34:115–132.
Goodkind and Bicknell (2018) Adam Goodkind and Klinton Bicknell. 2018. Predictive power of word surprisal for reading times is a linear function of language model quality. In Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018), pages 10–18, Salt Lake City, Utah. Association for Computational Linguistics.
Hanna et al. (2023) Michael Hanna, Yonatan Belinkov, and Sandro Pezzelle. 2023. When language models fall in love: Animacy processing in transformer language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12120–12135, Singapore. Association for Computational Linguistics.
Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b.
Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts.
Jumelet et al. (2024) Jaap Jumelet, Willem Zuidema, and Arabella Sinclair. 2024. Do language models exhibit human-like structural priming effects?
Kauf et al. (2023) Carina Kauf, Anna A. Ivanova, Giulia Rambelli, Emmanuele Chersoni, Jingyuan Selena She, Zawad Chowdhury, Evelina Fedorenko, and Alessandro Lenci. 2023. Event knowledge in large language models: the gap between the impossible and the unlikely.
Michaelov et al. (2023) James A. Michaelov, Seana Coulson, and Benjamin K. Bergen. 2023. Can peanuts fall in love with distributional semantics?
Mohebbi et al. (2024) Hosein Mohebbi, Jaap Jumelet, Michael Hanna, Afra Alishahi, and Willem Zuidema. 2024. Transformer-specific interpretability. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 21–26, St. Julian’s, Malta. Association for Computational Linguistics.
Nairne et al. (2013) James S. Nairne, Joshua E. VanArsdall, Josefa N. S. Pandeirada, Mindi Cogdill, and James M. LeBreton. 2013. Adaptive memory: The mnemonic value of animacy. Psychological Science, 24(10):2099–2105. PMID: 23921770.
New et al. (2007) Joshua New, Leda Cosmides, and John Tooby. 2007. Category-specific attention for animals reflects ancestral priorities, not expertise. Proceedings of the National Academy of Sciences, 104(42):16598–16603.
Nieuwland and Van Berkum (2006) Mante S. Nieuwland and Jos J. A. Van Berkum. 2006. When peanuts fall in love: N400 evidence for the power of discourse. J. Cognitive Neuroscience, 18(7):1098–1111.
OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
Ranaldi and Pucci (2024) Leonardo Ranaldi and Giulia Pucci. 2024. When large language models contradict humans? large language models’ sycophantic behaviour.
Rosenbach (2008) Anette Rosenbach. 2008. Animacy and grammatical variation—findings from english genitive variation. Lingua, 118(2):151–171.
Sinclair et al. (2022) Arabella Sinclair, Jaap Jumelet, Willem Zuidema, and Raquel Fernández. 2022. Structural persistence in language models: Priming as a window into abstract language representations. Transactions of the Association for Computational Linguistics, 10:1031–1050.
Spiliopoulou et al. (2022) Evangelia Spiliopoulou, Artidoro Pagnoni, Yonatan Bisk, and Eduard Hovy. 2022. EvEntS ReaLM: Event reasoning of entity states via language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1982–1997, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models.
Truong et al. (2023) Thinh Hung Truong, Timothy Baldwin, Karin Verspoor, and Trevor Cohn. 2023. Language models are not naysayers: an analysis of language models on negation benchmarks. In Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), pages 101–114, Toronto, Canada. Association for Computational Linguistics.
Vega-Mendoza et al. (2021) Mariana Vega-Mendoza, Martin J. Pickering, and Mante S. Nieuwland. 2021. Concurrent use of animacy and event-knowledge during comprehension: Evidence from event-related potentials. Neuropsychologia, 152:107724.
Vihman and Nelson (2019) Virve-Anneli Vihman and Diane Nelson. 2019. Effects of animacy in grammar and cognition: Introduction to special issue. Open Linguistics, 5(1):260–267.
Warstadt et al. (2020) Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020. BLiMP: The benchmark of linguistic minimal pairs for English. Transactions of the Association for Computational Linguistics, 8:377–392.
Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models.
Wei et al. (2023) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-thought prompting elicits reasoning in large language models.
Zheng et al. (2024) Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. 2024. Large language models are not robust multiple choice selectors.

Appendix A Error Analysis

Although we observed human-like behaviors, as extensively discussed in the previous sections, our results are derived from the generations of the LLM subjects introduced in § 2. To stabilize our analysis, we reported the standard deviations of the results computed over the generations. In the following paragraphs, we describe the evaluation process used for each experiment and discuss the error analysis to provide a clear overview of the robustness of the results obtained.

Multiple-Choice Questions

Experiment 1 and Experiment 2, presented in § 2.2.1, are based on a robust pipeline. In the first case, the generation is closely related to a multiple-choice question task. Consequently, the evaluation used heuristics based on string matching between the target choice and the given answer, as proposed in Wei et al. (2023); Zheng et al. (2024). The same holds for Experiment 2, where the question requires a strict answer, (Yes) or (No). In this second case, we also used a heuristic based on string matching between the target values and the given answer. Hence, the LLMs were prompted to generate well-formed responses. In Appendix D, we show that the total percentage of responses that did not match the defined string-matching heuristics is negligible, which confirms the robustness of the results obtained. In particular, we estimated a maximum misleading response rate of about 0.5% and 0.6% (see Table 7) and 2.5-3% (see Table 8), which does not affect the final results. Examples of generation can be seen in Appendix F and Appendix G.
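The string-matching evaluation described above can be illustrated with a minimal sketch. The regular expressions, the function names (extract_choice, extract_yes_no, score), and the decision to count unmatched generations as errors are our assumptions for illustration; the exact heuristics used in the experiments are not specified in the text.

```python
import re
from typing import Optional, Sequence, Tuple

# Assumed patterns: a choice label followed by ")" for Experiment 1,
# and a standalone "yes"/"no" for Experiment 2.
CHOICE_PATTERN = re.compile(r"\b([AB])\)")
YES_NO_PATTERN = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def extract_choice(answer: str) -> Optional[str]:
    """Return the first choice label ("A" or "B") found in the answer, else None."""
    match = CHOICE_PATTERN.search(answer)
    return match.group(1) if match else None


def extract_yes_no(answer: str) -> Optional[str]:
    """Return "Yes" or "No" for the first strict answer found, else None."""
    match = YES_NO_PATTERN.search(answer)
    return match.group(1).capitalize() if match else None


def score(predictions: Sequence[Optional[str]],
          targets: Sequence[str]) -> Tuple[float, float]:
    """Return (accuracy, unmatched rate); unmatched answers count as errors."""
    total = len(targets)
    correct = sum(p == t for p, t in zip(predictions, targets))
    unmatched = sum(p is None for p in predictions)
    return correct / total, unmatched / total
```

Under these assumptions, the unmatched rate returned by score plays the role of the misleading response rate reported in Appendix D.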

Number Generation

Prompts based on multiple-choice questions or strict answers such as Yes or No are easier to control and analyze. However, in Experiment 3 and Experiment 4, numbers are involved. To manage and control the sensitivity of the prompts in these experiments, we added the keyword "[num]" (see § 2.3.3 and § 2.3.4). Similarly, in order to produce a complete and robust analysis, we estimated the final values by checking whether each output contained a number in digits or not. We used the Python library word2number to convert generated literal numbers into integer values. As displayed in Appendix E, the answers containing literal numbers are a small minority and do not affect the final evaluations. Finally, the [num] keyword seems to have directed the generation correctly, as reported in the examples shown in Table 13 and Table 14.
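The number extraction step can be sketched as follows. To keep the example self-contained, the word2number library is replaced here by a small hypothetical lookup table (SMALL, TENS) that only covers the 0 to 30 scale of the prompts; the function name extract_number and the digits-first strategy are likewise assumptions.

```python
import re
from typing import Optional

# Hypothetical stand-in for word2number, limited to the 0-30 scale of the prompts.
SMALL = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
)}
TENS = {"twenty": 20, "thirty": 30}


def extract_number(answer: str) -> Optional[int]:
    """Return the first number in the answer (digits first, then number words)."""
    digits = re.search(r"\d+", answer)
    if digits:
        return int(digits.group())
    tokens = re.findall(r"[a-z]+", answer.lower())
    for i, token in enumerate(tokens):
        if token in TENS:
            value = TENS[token]
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            # "twenty-four" or "twenty four": add the unit word (1-9) that follows.
            if 1 <= SMALL.get(following, 0) <= 9:
                value += SMALL[following]
            return value
        if token in SMALL:
            return SMALL[token]
    return None
```

A generation for which extract_number returns None corresponds to the bracketed counts in Appendix E. Note that this sketch would misread number words used in ordinary prose (for example "one of the"), so it is only suitable for short, well-formed answers.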

Appendix B Model Versions

Model	Version
Llama-2-7	meta-llama/Llama-2-7b
Llama-2-13	meta-llama/Llama-2-13b
Llama-2-70	meta-llama/Llama-2-70b
Mistral-7	mistralai/Mistral-7B-Instruct-v0.2
Mixtral8x7	TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
GPT-3.5-turbo	OpenAI API (gpt-3.5-turbo-0125)
GPT-4	OpenAI API (gpt-4-1106-preview)

Table 6: List of the versions of the models used in this work, which can be found on huggingface.co. We used the configurations described in Appendix C for the repository of each model (access to these models was verified on 1-8-2024).

Appendix C Model and Hyperparameters

In our experimental setting, we use different LLMs: (i) two models from the GPT family OpenAI (2023): GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4); (ii) three models from the Llama-2 family Touvron et al. (2023): Llama2-7b, Llama2-13b, and Llama2-70b; (iii) two models from the MistralAI family: Mistral-7b and Mixtral Jiang et al. (2024).

In particular, the GPT models are used via API, while for the others, we used 4-bit quantized versions obtained with GPTQ (see detailed versions in Table 6).

As discussed in the limitations, our choices are related to reproducibility and the cost associated with non-open-source models. We use the closed-source APIs and the 4-bit GPTQ quantized versions of the open models on four 48GB NVIDIA RTX A6000 GPUs; all experiments are performed in inference only.

Finally, the generation temperature varies from $\tau=0$ for the GPT models to $\tau=0.5$ for the Llama2 models. We chose these temperatures for (mostly) deterministic outputs, with a maximum token length of 256. The other parameters are left unchanged as recommended by the official resources. We will release the code and the dataset upon acceptance of the paper.
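The generation settings above could be collected in a small configuration object, as in the following sketch. The class name GenerationConfig and the dictionary keys are hypothetical; only the values stated in this appendix are filled in, and no temperature is given for the MistralAI models because the text does not state one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Decoding settings stated in this appendix (the names are illustrative)."""
    temperature: float
    max_tokens: int = 256   # maximum token length reported above
    top_p: float = 1.0      # top-p value reported in the Evaluation paragraph


# Only the two families whose temperature is stated explicitly are listed.
GENERATION_CONFIGS = {
    "gpt": GenerationConfig(temperature=0.0),
    "llama-2": GenerationConfig(temperature=0.5),
}
```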

Evaluation

Finally, as described for each experiment, we evaluate the accuracy scores. We compute the string matching between the final answers and the target values. The top-p parameter is set to 1 in all runs, the prompting temperature lies in [0, 1], and each experiment is repeated three times.
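A minimal sketch of how accuracy could be aggregated over the three repetitions is given below. The function names are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not specify which was reported.

```python
from statistics import mean, stdev
from typing import Optional, Sequence, Tuple


def accuracy(predictions: Sequence[Optional[str]], targets: Sequence[str]) -> float:
    """Fraction of extracted answers that exactly match their target value."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def summarize_runs(run_accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of accuracy over repeated runs."""
    return mean(run_accuracies), stdev(run_accuracies)
```

For example, summarize_runs([0.8, 0.9, 1.0]) gives a mean of about 0.9 and a standard deviation of about 0.1.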

Appendix D Error Analysis: Strict Answers

Type	GPT-4	GPT-3.5	Llama2-70	Llama2-13	Llama2-7	Mixtral	Mistral-7
Transitive	0.1%	0.1%	0.2%	0.3%	0.4%	0.3%	0.4%
Passive	0.1%	0.1%	0.2%	0.3%	0.5%	0.2%	0.5%

Table 7: Percentage, over 1,000 instances for the Transitive and Passive sub-tasks (Section 2.3.1), of generations that do not contain one of the prompted choices. Table 11 shows two examples of outputs.

Type	GPT-4	GPT-3.5	Llama2-70	Llama2-13	Llama2-7	Mixtral	Mistral-7
Animated	0.5%	1%	2.5%	2.5%	3%	3%	3.5%
Unanimated	0.8%	1.5%	2%	2%	2.5%	3%	3%

Table 8: Percentage, over 4,500 instances for the Animated and Unanimated sub-tasks, of generations that do not contain (Yes) or (No), as explained in Section 2.3.2. Such answers are difficult to assess automatically.

Appendix E Error Analysis: Numeric Answers

		GPT-4	GPT-3.5	Llama2-70	Llama2-13	Llama2-7	Mixtral	Mistral-7
$T_{1}$	Animate	2(0)	1(0)	2(0)	7(1)	8(1)	3(0)	4(0)
$T_{1}$	Inanimate	2(0)	4(0)	5(1)	7(1)	8(1)	4(0)	5(1)
$T_{2}$	Animate	0(0)	1(0)	2(0)	3(0)	3(0)	1(0)	3(0)
$T_{2}$	Inanimate	1(0)	4(0)	3(1)	2(1)	8(1)	4(0)	5(1)
$T_{3}$	Animate	0(0)	0(0)	1(0)	1(0)	1(0)	1(0)	1(0)
$T_{3}$	Inanimate	0(0)	0(0)	0(0)	1(0)	2(0)	2(0)	0(0)

Table 9: Number of generations that do not contain numerical values and, in brackets, the number of generations that do not contain words denoting numbers. The total number of instances (sentences) is 60 for each time-step, as introduced in Section 2.3.3.

	GPT-4	GPT-3.5	Llama2-70	Llama2-13	Llama2-7	Mixtral	Mistral-7
Animate	2(0)	3(0)	7(0)	8(1)	8(2)	4(1)	5(1)
Inanimate	1(0)	3(0)	6(1)	8(1)	9(0)	2(0)	5(1)

Table 10: Number of generations that do not contain numerical values and, in brackets, the number of generations that do not contain words denoting numbers. The total number of instances analyzed (sentences) is 60, as introduced in Section 2.3.4.

Appendix F Answers for Experiment 1

Choose which example is acceptable between A and B.
A) Hospitals aren’t biked to by the man.
B) Hospitals aren’t biked to by the mouth.
Answer: The acceptable example is A) Hospitals aren’t biked to by the man. In English, "biking" typically refers to the activity of riding a bicycle. It is logical to say that a man could ride a bicycle to a hospital. However, it doesn’t make sense to say that "the mouth" could bike to a hospital, as the mouth is not capable of such physical activity. Therefore, option A is the appropriate choice because it aligns with the expected action and capability of the subject ("the man").
(answer generated by GPT-3.5-turbo)

Choose which example is acceptable between A and B.
A) Hospitals aren’t biked to by the man.
B) Hospitals aren’t biked to by the mouth.
Answer: The correct answer is B) Hospitals aren’t biked to by the mouth. Explanation: In this sentence, "mouth" is a noun that refers to the opening through which food and air pass when we breathe. It doesn’t make sense to say that hospitals are biked to by the mouth because hospitals are buildings where medical treatment is provided, and they cannot be reached or accessed through the mouth.
(answer generated by Llama-2-13-chat)

Table 11: Examples of answers generated by GPT-3.5 and Llama-2-13-chat. As introduced in Section 2.3.1 and shown in Table 7, the answers contain the choices prompted.

Appendix G Answers for Experiment 2

Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: At the club the cocktails are served by the tonic
Answer: No, the sentence is not plausible. Cocktails are typically served by bartenders or servers, not by tonic, which is a type of mixer. Tonic water is a beverage ingredient used in cocktails, but it does not serve cocktails itself. Therefore, the sentence does not make sense in the context of cocktail service.
(answer generated by GPT-3.5-turbo)

Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: At the club the cocktails are served by the tonic
Answer: Yes, the sentence is plausible. The word "tonic" could refer to a type of alcoholic drink, such as a gin and tonic, which is commonly served at clubs. In this context, the sentence makes sense and is a common way to describe how cocktails are served at a bar or nightclub.
(answer generated by Llama-2-7-chat)

Table 12: Examples of answers generated by GPT-3.5 and Llama-2-7-chat. Also in this case, as described in Section 2.3.2, the prompts are well formed, and the generated outputs show low misleading rates (Table 8).

Appendix H Answers for Experiment 3

Choose a surprising value from 0 to 30 on the following story: A museum visitor asked the map where the small collection of Van Gogh paintings was located.
Answer: 27 (GPT-4)
Answer: 22 (Mixtral8x7)

Given the following story: A museum visitor asked the map where the small collection of Van Gogh paintings was located. The map was willing to take her there, as there was nothing to do. The museum visitor envied the map because he was allowed to see so much beauty every day. But the map laughingly explained that he always took a Game Boy with him when he was bored. The museum visitor teased the map, joking that he had to be an art barbarian to ever get bored in this museum.
Choose a surprising value from 0 to 30 on the following story: The map joked that he had seen enough art for the next ten years.
Answer:[num]
6. (answer generated by GPT-4)
7. Considering the previous answer, the map’s response is less surprising but still slightly unexpected. A surprising value of 7 out of 30 reflects the diminished unexpectedness of the map’s attitude, while still acknowledging the earlier context. (answer generated by Mixtral8x7)

Table 13: This example shows the answers provided by GPT-4 and Mixtral8x7 using the prompt defined in Section 2.3.3. The structure of the input and the keyword ([num]) lead the prompted LLMs to generate a number in digits as output.
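The prompt structure shown in this example can be assembled programmatically, as in the sketch below. The wording of the template is taken from the example; the function name build_surprise_prompt, the line breaks between the parts, and the choice to append the [num] keyword even when no context is given are assumptions.

```python
from typing import Optional, Sequence

QUESTION = "Choose a surprising value from 0 to 30 on the following story:"


def build_surprise_prompt(target: str,
                          context: Optional[Sequence[str]] = None) -> str:
    """Build a surprise-rating prompt, optionally preceded by in-context sentences."""
    parts = []
    if context:
        # Earlier sentences of the story are supplied as in-context information.
        parts.append("Given the following story: " + " ".join(context))
    parts.append(f"{QUESTION} {target}")
    parts.append("Answer:[num]")  # keyword used to elicit a number in digits
    return "\n".join(parts)
```

Calling the function with an empty context reproduces the first prompt of the example, while passing the preceding sentences of the story reproduces the second one.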

Appendix I Answers for Experiment 4

Given the following context: A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew.
Choose a surprising value from 0 to 30 on the following story: The sandwich was delicious and wanted to make sure none of his loved ones were in danger.
Answer:[num]
20. (answer generated by GPT-3.5)
18. (answer generated by Mixtral8x7)
15. (answer generated by Llama-2-70-chat)
24. (answer generated by Llama-2-7-chat and Llama-2-13-chat)

Table 14: This example shows the answers provided by the different probed LLMs using the prompt defined in Section 2.3.4. In this experiment too, the structure of the input and the keyword ([num]) lead the prompted LLMs to generate a number in digits as output.