Animate, or Inanimate, That is the Question for Large Language Models

Leonardo Ranaldi, Giulia Pucci and Fabio Massimo Zanzotto
School of Informatics, University of Edinburgh, UK.
Department of Computing Science, University of Aberdeen, UK
Human-Centric ART Group, University of Rome Tor Vergata, Italy
[first_name].[last_name]@uniroma2.it
Abstract

The cognitive essence of humans is deeply intertwined with the concept of animacy, which plays an essential role in shaping their memory, vision, and multi-layered language understanding. Although animacy appears in language via nuanced constraints on verbs and adjectives, it is also learned and refined through extralinguistic information. Similarly, we assume that the LLMs’ limited abilities to understand natural language when processing animacy are motivated by the fact that these models are trained exclusively on text.

Hence, the question this paper aims to answer arises: can LLMs, in their digital wisdom, process animacy in a similar way to humans? We propose a systematic analysis via prompting approaches. In particular, we probe different LLMs by prompting them with animate, inanimate, usual, and strange contexts. Results reveal that, although LLMs have been trained predominantly on textual data, they exhibit human-like behavior when faced with typical animate and inanimate entities, in alignment with earlier studies. Moreover, LLMs can adapt to unconventional situations by recognizing oddities as animate without needing to interface with the unspoken cognitive triggers humans rely on to process animacy.

1 Introduction

The mnemonic abilities underlying cognitive processing seem to enable animate entities and concepts to be more easily memorized, which highlights the role of the animacy effect in human cognition New et al. (2007); Nairne et al. (2013).

Animacy is manifested via language through the faculty that humans have in using certain verbs or adjectives with animate and inanimate entities and accordingly inferring and reasoning about the mental states, intentions, and reactions of others. This allows them to navigate and understand social interactions. For this reason, using NLP models in increasingly complex social contexts necessitates the same ability to capture these socio-cognitive dynamics.

Current Large Language Models (LLMs) Chowdhery et al. (2022); Touvron et al. (2023); OpenAI (2023), such as the GPTs OpenAI (2023), PaLM Chowdhery et al. (2022), and Llama Touvron et al. (2023), are trained merely on textual data and cannot access non-verbal information, unlike humans. If faced with discerning animacy, they must infer it from its downstream linguistic implications, diverging from humans who also benefit from visual and physical stimuli. Thus, a fundamental question arises: Do LLMs perceive and respond to animacy hooks in language in a way as close to human comprehension as possible?

Figure 1: Large Language Models seem to be as surprised as humans at the thought of experiencing inanimate entities in animate contexts.

This work investigates whether LLMs “behave” as humans when dealing with animacy. We conduct extensive investigations using LLMs as subjects in psycholinguistic experiments developed for humans. Specifically, we explore how LLMs respond to violations of the selective constraints associated with animacy in typical and atypical settings. Complementing the foundational work of Warstadt et al. (2020); Spiliopoulou et al. (2022); Hanna et al. (2023), we study the animacy effect via prompting approaches, eliciting different LLMs to interpret scenarios and situations that demand intricate reasoning. We find that, like humans, LLMs strongly prefer sentences that adhere to animacy-related constraints. These similarities are not restricted to typical animacy: the behaviour of LLMs encountering atypical animate entities also seems to align with that of humans, both in their surprise at a first impression and in their subsequent adaptation, when they exhibit significantly less surprise.

Our findings can be summarized as follows:

  • By proposing a systematic analysis based on the prompting of LLMs, we evaluate the animacy effect, extending the results obtained from previous contributions.

  • In particular, using psychological tests designed for humans, we observe that the LLMs not only prefer sentences that adhere to animacy constraints, as shown in Warstadt et al. (2020), but are also able to adapt in atypical scenarios just as humans do Nieuwland and Van Berkum (2006).

  • Finally, we show that LLMs not only exhibit robust behaviour comparable with that of humans but also deliver answers that closely approximate the expectations placed on them.

2 Models & Methods

To investigate whether Large Language Models (LLMs) are able to understand and generate language in a way that reflects human expectations, we need to understand how closely they approximate human knowledge of words and the associated cognitive processes. Hence, using LLMs as subjects (§ 2.1), we study whether such behaviours emerge in human-designed experiments conceived for studying animacy (§ 2.2). We propose a systematic prompt-based approach for LLMs and discuss its results in § 2.3. Finally, we outline a general discussion of the findings in § 2.4.

2.1 The Subjects

The animacy effects behind state-of-the-art Large Language Models are analyzed via systematic prompting in three groups of models:

  • two subjects from the OpenAI family OpenAI (2023): GPT-3.5 and GPT-4;

  • three subjects from the Meta family Touvron et al. (2023): Llama2-chat-7b, -13b, -70b;

  • two subjects from the Mistral family Jiang et al. (2023, 2024): Mixtral8x7b, Mistral-7b.

We use both open-source models (to simplify the discussion, we omit “chat” and “b” from their names, yielding Llama2-7, -13, -70, Mixtral, and Mistral-7), which make our work more reproducible, and closed-source models, which demonstrate outstanding performance in many NLP tasks.

Finally, as we describe in each experiment, we evaluate the accuracy scores computed via string matching between the final and the target answers (detailed information in Appendices B, C).

2.2 Selected Experimental Settings

To adapt our ‘subjects’ to the experimental settings proposed on humans, we discern between two different types of experiments: (1) typical animacy (§ 2.2.1); (2) atypical animacy (§ 2.2.2). The two different kinds of experiments are needed as typical animacy is more a lexical task, and atypical animacy is more a contextual task from the point of view of LLMs.

2.2.1 Typical Animacy

In typical animacy experiments, subjects are prompted to determine which word in a pair is animate and which is not (e.g., "frogs" are animate and "mountains" are not). We use two different settings:

  • the Benchmark of Linguistic Minimal Pairs (BLiMP) Warstadt et al. (2020) in § 2.3.1;

  • the Benchmark of Sentence Plausibility (BSP) Vega-Mendoza et al. (2021) in § 2.3.2;

Acceptable  Example
Sub-task: Passive
Yes         The glove was noticed by some woman.
No          The glove was noticed by some mouse.
Yes         Galileo is concealed by the woman.
No          Galileo is concealed by the horse.
Sub-task: Transitive
Yes         Beth scares Roger.
No          A carriage scares Roger.
Yes         Tanya admires Melanie.
No          Music admires Melanie.
Table 1: Two examples from the Transitive and Passive datasets. Each is a minimal pair of sentences: one Acceptable (Yes) and one not (No).

In BLiMP, we select two sub-tasks: transitive-animate and passive-animate. Each sub-task has 1,000 pairs of synthetic English sentences that differ by only one or two words (Table 1).

Meanwhile, in BSP, we use sentences containing plausible and implausible words with different nuances. The resource contains 1,500 synthetic sentences in English. Each sentence has a fixed initial part and an interchangeable final word that is either animate plausible, animate implausible (related or unrelated), or inanimate implausible (related or unrelated) (Table 2). From the point of view of our subjects, that is, LLMs, this psychological experiment translates into a lexical task.

Sentence: At the club the cocktails are served by the _
Condition    Type                 Word
Plausible    Control              barmaid
Implausible  Animate-Related      drunkard
Implausible  Animate-Unrelated    queen
Implausible  Inanimate-Related    tonic
Implausible  Inanimate-Unrelated  dirt
Table 2: Example from Benchmark of Sentence Plausibility. Each sentence has a plausible and four non-plausible words. As proposed by Vega-Mendoza et al. (2021) we use the options as different tasks.

2.2.2 Atypical Animacy

In contrast to § 2.2.1, to investigate whether the subjects are able to detect animacy without relying on lexical information about the target word, we employ a repetition study and a contextual study in which inanimate entities are treated as animate ones Nieuwland and Van Berkum (2006). This shifts the focus from lexical knowledge of the target word to contextual knowledge. The human experiments are based on the N400, a brain response measured by EEG that rises when processing semantically anomalous input.

The repetition study measured participants’ N400 responses while reading cartoon-like stories in which a typically inanimate entity behaved as animate (Table 4). Nieuwland and Van Berkum (2006) found that, although initially surprised by the atypically animated entity, participants quickly adapted, producing increasingly lower N400 responses. In contrast, the contextual study takes its measurements only after a contextualizing passage, since the repetition experiment shows similarities with the work of Caramazza and Shelton (1998). These contexts are given as in Table 3, where participants are asked to read the story with one of the two targets.

A girl told a sandwich that an attack was imminent.
The sandwich wailed that his family was in danger.
The girl told the sandwich that public places were
the most dangerous. The sandwich immediately
started calling everyone he knew. The sandwich
was [ targets ] and wanted to make sure none
of his loved ones were in danger
targets: delicious, worried
Table 3: Example from translated context story of N400 Nieuwland and Van Berkum (2006).

Hence, these experiments are useful for investigating the following questions: Can LLMs adapt to animated entities at the token level despite their being typically inanimate? Or is animacy processing limited to a simple type-level understanding? We replicate these experiments with LLMs to answer these questions, using their surprise to model N400 responses.

We conduct two different experiments: in the first experiment presented in § 2.3.3, we reproduce the repetition and context as in Nieuwland and Van Berkum (2006); in a second experiment (§ 2.3.4), we analyze the impact of context adaptation as proposed in Boudewyn et al. (2019). For clarity, we introduce the original study and the methods we used for the context adaptation of LLMs. Finally, we report our empirical results and compare them with those of the original study.

($T_1$) A granny met the (confectioner-apple pie) at the market with whom she started a pleasant conversation
about recipes. ($T_2$) The (confectioner-apple pie) confided a secret recipe to the granny. ($T_3$) But the granny
deceived the (confectioner-apple pie) by making off with the recipe herself. ($T_4$) The (confectioner-apple pie)
discovered the deception and wanted to reprimand the granny. But the granny pleased the
(confectioner-apple pie) with an even better version of the recipe.
($T_5$) The (confectioner-apple pie) understood that this was the ultimate recipe and apologized for the
misplaced distrust.
Table 4: Example from translated version of N400 Nieuwland and Van Berkum (2006). The first tokens indicate an acceptable example, and the numbers indicate the sentences given as context.

2.3 Experimenting with LLM subjects

2.3.1 Experiment 1: Typical Animacy on BLiMP

Prompt definition

By constructing a series of prompts over the datasets presented in §2.2.1, we test the models’ answers to animacy in situations where the animacy of an instance aligns with its more general type.

Sentence pairs in BLiMP Warstadt et al. (2020) are built as follows: one sentence respects the animacy constraints, and the other violates them. Hence, there is a straightforward way to evaluate whether an LLM passes the animacy test. We ask the models to answer the following prompt:

Choose which example is acceptable between A and B. A) Galileo is concealed by the woman. B) Galileo is concealed by the horse. Answer:

A model's answer counts as correct if it chooses the sentence that respects the animacy constraint.

Following this approach, we evaluate accuracy by string matching between the generated answers and the target values on both sub-tasks.
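The prompting and scoring steps described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names (`build_prompt`, `extract_choice`, `accuracy`) are hypothetical, and the rule that the first standalone "A" or "B" in a generation is the model's choice is an assumption about how string matching might be carried out.

```python
import re

# Prompt wording taken from the example above; the placeholders are ours.
PROMPT_TEMPLATE = (
    "Choose which example is acceptable between A and B. "
    "A) {a} B) {b} Answer:"
)


def build_prompt(sentence_a: str, sentence_b: str) -> str:
    """Fill the two candidate sentences of a minimal pair into the prompt."""
    return PROMPT_TEMPLATE.format(a=sentence_a, b=sentence_b)


def extract_choice(generation: str):
    """Return the first standalone 'A' or 'B' in the generation, else None.

    Assumption: the first standalone capital A/B is the model's choice.
    """
    match = re.search(r"\b([AB])\b", generation)
    return match.group(1) if match else None


def accuracy(generations, targets) -> float:
    """Share of generations whose extracted choice equals the target label."""
    hits = sum(extract_choice(g) == t for g, t in zip(generations, targets))
    return hits / len(targets)
```

In practice, the position of the acceptable sentence (A or B) would presumably need to be balanced across pairs so that a positional preference is not mistaken for animacy knowledge.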

Figure 2: Large Language Models performances on animate-transitive and animate-passive sub-tasks of BLiMP benchmark Warstadt et al. (2020).
Results

The results of this first set of experiments are intriguing. The OpenAI family performs on par with humans, and the Meta and Mistral families are catching up. Figure 2 shows the results of each model (vertical bars) and the results obtained by humans (horizontal dashed lines), as presented in Warstadt et al. (2020). The accuracy metric should be read as the percentage of examples in which human or artificial subjects preferred the acceptable sentence of the given pair.

Human accuracy on the transitive sub-task is reached by GPT-3.5 and exceeded by GPT-4. This seems to suggest that these models can handle the lexicon to determine typical animacy. Llama lags behind but approaches the human level.

Similarly, in the passive scenario, GPT-4 performs very close to humans. GPT-3.5, Llama2-70, and Mixtral perform comparably to or slightly below humans in the transitive scenario and significantly below them in the passive scenario. Finally, the smaller models, i.e., those with fewer parameters, underperform humans, with average gaps of 20 points.

This difference between the transitive and passive cases may be due more to differences in setting than to different animacy processing in the two scenarios. However, it emerges that the composition of the choices is strongly class-related. In the passive case, the target word, i.e., the most influential one, is always in the last position, whereas in the transitive case the target word is not the last token. Thus, heuristics related to sensitivity to the structure of the choices in the input prompts may be at play.

Figure 3: Large Language Models performances on the BSP benchmark Vega-Mendoza et al. (2021).

2.3.2 Experiment 2: Typical Animacy on BSP

Prompt definition

In the second experiment, we structure the prompting phase similarly. By constructing a series of prompts over the BSP benchmark (§ 2.2.1), we test the models’ responses to plausible and implausible sentences with animate and inanimate components.

Following Vega-Mendoza et al. (2021), we analyze the models’ answers to the plausibility question on five inputs, each constructed from a sentence completed with a different type of final word. This gives a straightforward way to evaluate whether our subjects pass the plausibility test. We use the following prompt:

Is the following sentence plausible? Answer by choosing (Yes) or (No). Sentence: In ancient Egypt the people were governed by the pyramid. Answer:

An answer counts as correct if the model replies Yes or No in agreement with the plausibility of the sentence. The accuracy of the LLMs is computed as follows: for the plausible control word, accuracy is the percentage of Yes answers, and for all the implausible words, the percentage of No answers.
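The per-condition accuracy just described can be sketched in a few lines. The condition labels follow Table 2; everything else (the `EXPECTED` mapping, the `normalize` helper, and the rule that an answer must begin with Yes or No to be counted) is an assumption made for illustration and not necessarily the authors' scoring code.

```python
import re
from collections import defaultdict

# Only the control word makes the sentence plausible, so "yes" is expected
# there and "no" for every implausible condition (labels as in Table 2).
EXPECTED = {
    "control": "yes",
    "animate-related": "no",
    "animate-unrelated": "no",
    "inanimate-related": "no",
    "inanimate-unrelated": "no",
}


def normalize(answer: str):
    """Map a generation to 'yes', 'no', or None if it starts with neither."""
    match = re.match(r"\(?\s*(yes|no)\b", answer.strip().lower())
    return match.group(1) if match else None


def accuracy_by_condition(records):
    """Compute accuracy per condition from (condition, generation) pairs."""
    hits, totals = defaultdict(int), defaultdict(int)
    for condition, answer in records:
        totals[condition] += 1
        hits[condition] += int(normalize(answer) == EXPECTED[condition])
    return {c: hits[c] / totals[c] for c in totals}
```

Unparsable generations are counted as errors here; excluding them instead would be an equally defensible choice and would change the denominators.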

Results

In this second experiment, LLMs of the OpenAI family again behave similarly to humans. Figure 3 shows the accuracy of each model alongside the results obtained by humans, as presented in Vega-Mendoza et al. (2021). When dealing with animate words (see Figure 3), humans and LLMs behave similarly. Humans perform on animate-unrelated words much as they do on control words, but they are less able to recognize that animate-related words make the target sentences implausible. The same happens for all the LLM subjects. GPT-4 performs better than humans while preserving the gap between recognizing the implausibility of sentences built with animate-unrelated and animate-related words. When dealing with inanimate words (see the right plot in Figure 3), humans and LLMs also behave similarly. Humans recognize the implausibility of inanimate-unrelated words but are slightly worse at recognizing inanimate-related ones. The same trend holds for all the LLM subjects and, consistent with the other experiments, the OpenAI family performs better than humans.

In humans, these differences in plausibility judgments between animate and inanimate cases arise from a combination of cognitive factors, as explained by Vega-Mendoza et al. (2021). As in the experiments in § 2.3.1, GPTs perform comparably to humans and sometimes outperform them. However, even in this task, there is a robust structural component: the target words, i.e., those that determine the final decision, are always in the last position. Therefore, there may be a heuristic related to sensitivity to the structure of the choices in the input prompts.

2.3.3 Experiment 3: Atypical Animacy - Repetition

Human experiment and its results

The repetition experiment on atypical animacy Nieuwland and Van Berkum (2006) measures the N400 responses of participants who listened to Dutch stories containing either a typically animate entity or an inanimate entity behaving as if it were a human being. The N400 values are measured at three stages: the first ($T_1$), the third ($T_3$), and the fifth ($T_5$) mention of the entity (see, for example, Table 4 with confectioner and apple pie). Nieuwland and Van Berkum (2006) discovered that:

  • in the case of animate entities, participants have a moderate N400 response to the first mention ($T_1$) and a low response to subsequent mentions ($T_3$ and $T_5$);

  • in the case of inanimate entities, participants initially ($T_1$) have a high N400 response to the atypically animated entity, and, as the mentions progress ($T_3$ and $T_5$), their N400 responses become very close to the responses to the mentions of the animate entity.

Thus, while humans are initially surprised by the atypically animated entity, they quickly adapt to the situation and no longer find it surprising. Moreover, the authors show that the responses derive not from lexical repetition but from context. In the contextual experiment, they first provide a context and estimate the participants' N400 responses only at the end, obtaining low values for inanimate entities in atypical contexts.

Prompt definition

To estimate a surprise value analogous to the N400, state-of-the-art studies examine token probability values. However, some of the models used in our study do not provide access to probability values, so we define a series of prompts that systematically query the model about its level of surprise. In particular, for each of the 60 examples, we estimate the surprise elicited by the animate and the inanimate entity, given the context, at each time-step $T_n$ (which, in our case, corresponds to an input-prompt). For instance, to model the inanimate N400 response at $T_1$ in the example from Table 4, we construct the following prompt:

Choose a surprising value from 0 to 30 on the following story: A granny met the apple pie at the market with whom she started a pleasant conversation about recipes. Answer:[num]

Following the time-steps, we introduce additional prompts by contextualizing the preceding story, whether animate or inanimate. For example, for the inanimate scenario:

Given the following story: A granny met the apple pie at the market with whom she started a pleasant conversation about recipes. The apple pie confided to............. Choose a surprising value from 0 to 30 on the following story: The apple pie understood that this was the ultimate recipe and apologized for the misplaced distrust. Answer:[num]

Hence, we compute the average surprise value of examples containing animate and inanimate entities separately at each time-step.
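The progressive prompting and averaging procedure can be sketched as below. The prompt wording is copied from the examples above; the function names, the choice to end the prompt at "Answer:", and the rule of taking the first integer in a generation as the surprise value (discarding values outside 0 to 30) are assumptions for illustration only.

```python
import re
from statistics import mean

QUESTION = "Choose a surprising value from 0 to 30 on the following story: "


def build_prompt(sentences, step):
    """Prompt for time-step `step` (1-based): earlier sentences become context."""
    context, target = sentences[: step - 1], sentences[step - 1]
    prefix = f"Given the following story: {' '.join(context)} " if context else ""
    return f"{prefix}{QUESTION}{target} Answer:"


def parse_surprise(generation, low=0, high=30):
    """First integer in the generation, or None if absent or out of range."""
    match = re.search(r"\d+", generation)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def average_surprise(generations):
    """Mean surprise over the parsable answers for one condition and time-step."""
    values = [v for v in map(parse_surprise, generations) if v is not None]
    return mean(values) if values else None
```

Calling `average_surprise` once per condition (animate, inanimate) and per time-step yields one value for each point of a curve like those in Figure 4.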

Results

The LLMs follow the general trends of human N400 responses (Figure 4). As reported above, human N400 responses for animate and inanimate critical words diverge at $T_1$ and come closer at $T_3$ and $T_5$. LLM subjects behave similarly: at $T_1$, models are surprised by the inanimate entity and much less so by the animate one. At later steps ($T_3$ and $T_5$), the surprise for inanimate entities decreases until it reaches levels similar to those for animate entities. LLMs seem to adapt, just as humans do. However, the raw results do not show that the models adapt to the same extent as humans.

Figure 4: Average surprise values provided by the LLMs at input-prompt $T_1$, $T_3$ and finally $T_5$.

We use the Wilcoxon signed-rank test to make the analysis robust, as in Nieuwland and Van Berkum (2006). We observe distinct surprise values at each time step. As with humans, LLMs show a statistically significant difference between the surprise for animate and inanimate entities, for example at $T_1$ (Table 5). However, while there is no difference for humans at $T_3$, there are differences (p < 0.01) in most models; only the largest show none. At $T_5$, differences disappear only in the large models. Although the models can generally approximate the trends of human N400 responses to atypical animacy, only the largest and most capable ones fully replicate human adaptation.
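For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon signed-rank test on paired surprise values (one animate and one inanimate value per story). It uses the normal approximation without tie or continuity corrections, so it is only a rough guide for small samples; it is not the authors' analysis code, and a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be preferred.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test on paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums and p comes from the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, min(p, 1.0)
```

Running the test separately at each time-step, with the per-story inanimate and animate surprise values as `x` and `y`, gives one significance decision per model and time-step, which is the layout Table 5 reports.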

Table 5: Statistical significance of the difference between animate and inanimate surprisal, by LLMs response values and time-step

2.3.4 Experiment 4: Atypical Animacy - Context Experiment

Human experiments & results

Nieuwland and Van Berkum (2006) discover that contextual appropriateness seems to neutralize animacy violations; that is, non-appropriate adjectives (such as “worried” in Table 3) do not generate much surprise if the context suggests them. Moreover, context can even make an animacy-violating predicate preferred over an animacy-obeying canonical predicate if the context justifies this.

Prompt definition

Using the examples from the context experiment (as in Table 3), we ask for a surprise value for the animate and inanimate adjectives in each of the 60 stories proposed by Nieuwland and Van Berkum (2006). In particular, we use input-prompt structures close to the previous ones:

[+Context] Given the following context: A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew.
[Baseline] Choose a surprising value from 0 to 30 on the following story: The sandwich was delicious and wanted to make sure none of his loved ones were in danger. Answer:[num]

To estimate absolute values, we also ask for baseline surprises, that is, those of the inanimate adjective without the context of the whole story.
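The relation between the baseline and the contextualized prompt, and the resulting drop in surprise, can be sketched as follows. The prompt wording follows the example above; the helper names and the definition of the drop as the mean per-story difference (baseline minus contextualized value) are assumptions for illustration.

```python
from statistics import mean

QUESTION = "Choose a surprising value from 0 to 30 on the following story: "


def build_prompts(context, target_sentence):
    """Return (baseline_prompt, contextual_prompt) for one story and target."""
    baseline = f"{QUESTION}{target_sentence} Answer:"
    contextual = f"Given the following context: {context} {baseline}"
    return baseline, contextual


def surprise_drop(baseline_values, context_values):
    """Average reduction in surprise once the story context is supplied.

    Both arguments hold one surprise value (0-30) per story, in the same order.
    """
    if len(baseline_values) != len(context_values):
        raise ValueError("expected one baseline and one contextual value per story")
    return mean(b - c for b, c in zip(baseline_values, context_values))
```

Computing `surprise_drop` separately for the animate and the inanimate adjective allows the two reductions to be compared directly.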

Figure 5: Average surprise values.
Results

In this experiment too (Figure 5), LLM subjects behave similarly to human subjects. The animate baseline is larger than the inanimate baseline for all subjects. Even in the baseline, there is conflicting information as to the presence or absence of animacy in the selected subject. Moreover, the drop in surprise once the context is supplied is larger for inanimate than for animate adjectives. This is in line with the human experiments.

2.4 General discussion

The experiments deliver a coherent message: despite their lack of embodiment and senses, LLMs behave like humans in animacy understanding. In the two typical animacy tasks (Experiments 1 and 2), LLMs, like humans, can be surprised by both animate and inanimate examples. The intriguing part is that LLMs of the OpenAI family behave very similarly to humans. Moreover, the atypical animacy tasks (Experiments 3 and 4) rule out the possibility that LLMs use only lexical information tied to individual words. Indeed, in these tasks, subjects are probed on their level of surprise at an inanimate object performing actions as if it were animate. The two experiments track how surprise decreases when subjects are exposed to more context; hence, this is not a lexical task. Nevertheless, as with typical animacy, LLM subjects behave like human subjects: their level of surprise decreases more for inanimate objects than for animate ones. Finally, we complete our analysis by providing evidence of the stability of the assessments on the generations in Appendix A.

3 Related Work

The relationship between animacy and language has long been a subject of interest in cognitive linguistics. Visualized as a spectrum, animacy influences linguistic structures and shapes cognitive interpretations of entities based on their perceived liveliness. There is a long tradition of studying animacy in natural language (§ 3.1). At the same time, these explorations began to spread within the NLP community, with compelling evidence of the abilities of Language Models in handling linguistic properties (§ 3.2). However, the advent of Large Language Models (LLMs) has revolutionized many previously adopted analytical methods. Therefore, we ask whether current models can handle linguistic properties in the same way as their predecessors and, even more so, whether these models are able to generate answers in line with the expectations placed on them by humans. To this end, we adopt a psycholinguistic lens, treating these models as subjects to assess their understanding and processing of animacy.

3.1 The Animacy in Natural Language

Animacy in cognitive processes is described as going along a continuum García et al. (2018). This is shown through a hierarchy of humans, animals, and objects in language. Entities are distinguished based on their position in this ranking through sentence structure or word form Gass (1984). It can be traced at the broad category and the specific instance levels. Similarly, linguistic animacy is not solely grounded in biological factors but also hinges on the speaker’s emotional connection and empathy towards a specific entity Vihman and Nelson (2019). The impact of animacy in language is not uniform across different languages; it can range from explicit markers of animacy to more subtle influences. Such subtleties encompass strict constraints based on animacy Caplan et al. (1994); Buckle et al. (2017) and nuanced grammatical impacts Rosenbach (2008). For instance, sentences more frequently begin with animate entities, even if this results in less conventional structures Ferreira (1994); Fairclough (2008). We analyzed the distinction between humans and inanimate objects and the constraints based on animacy. Such a pronounced differentiation is anticipated to yield more discernible effects in LLMs.

3.2 Large Language Models as Test Subjects

Previous works investigate the linguistic capabilities of Language Models (non-large and neither instruction-tuned nor further refined) on structural properties, operating through the analysis of the probabilities assigned to given sentences. Such techniques have previously been used to probe LMs’ understanding of constructs such as negation, structural agreement, and in-context priming Sinclair et al. (2022); Jumelet et al. (2024). Recent studies have compared the performance of LLMs with human cognition by utilizing surprisal, the negative log probability of a sequence, as an estimator for cognitive exertion Michaelov et al. (2023); Hanna et al. (2023). LLMs have displayed notable versatility, demonstrating significant correlations with human values Aurnhammer and Frank (2018); Goodkind and Bicknell (2018); Truong et al. (2023).
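As a small illustration of the surprisal measure mentioned above, the sketch below computes the negative log probability of a sequence from per-token probabilities. The function name and the use of natural logarithms (nats) are assumptions; the cited studies may use a different base or token-level rather than sequence-level values.

```python
import math


def surprisal(token_probs):
    """Negative log probability (in nats) of a sequence.

    `token_probs` holds the probability a model assigned to each token
    given its preceding context; the sequence probability is their product.
    """
    return -sum(math.log(p) for p in token_probs)
```

A less probable continuation yields a larger value, which is why surprisal is used as a proxy for processing difficulty.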

Large Language Models

Compared with smaller Language Models, current Large LMs (GPTs OpenAI (2023), Llamas Touvron et al. (2023), et alia) have demonstrated capabilities in complex tasks by delivering multi-step reasoned answers Wei et al. (2022, 2023). Refinement techniques that lead the models to better approximate human expectations in their responses are increasingly adopted (e.g., instruction-tuning Ouyang et al. (2022), reinforcement learning from human preferences Christiano et al. (2023)). We analyze whether the emergent capabilities these methods give rise to also appear in psycholinguistic tasks originally applied to humans for detecting animacy. Complementing the foundational works Michaelov et al. (2023); Truong et al. (2023); Spiliopoulou et al. (2022); Buijtelaar and Pezzelle (2023); Hanna et al. (2023), we extend the analyses to include further tests and revisit prompt-based evaluation by eliciting the models to generate responses (§ 3.4). We evaluate different LLMs by employing them as subjects within a psycholinguistic framework, an increasingly adopted methodology in the field. This approach examines the LLMs by evaluating their responses to several questions, just as would be done with humans.

3.3 Animacy in Large Language Models

Previous works have investigated the LMs’ abilities in processing animacy. Warstadt et al. (2020) explored the phenomenon within the BLiMP framework. Kauf et al. (2023) investigate it within LMs’ overall event knowledge, concluding that models are adept at discerning animacy with respect to selective constraints. Our study goes further by exploring different forms of animacy. We introduce a surprise score following Nieuwland and Van Berkum (2006) and Michaelov et al. (2023); Hanna et al. (2023). We then focus on transferring the experimental setting to generative models where accessing the internal weights is often impossible (closed-source LLMs). These efforts aim to study scenarios in which the models efficiently capture the trends observed when analyzing open LMs. Furthermore, we aim to investigate the correlation between the robustness and predictive accuracy of an LM by evaluating a broad spectrum of LLMs.

3.4 Our Contribution

Complementing the earlier foundational work (see § 3.2), our work goes beyond it in three ways:
(i) We propose a systematic prompting pattern and analyze natural language responses as one would for humans. Specifically, we establish a prompting pipeline for estimating the LLMs’ understanding of the acceptability and plausibility of concepts related to animate and inanimate entities. Moreover, we extend to generative models an approach based on a series of progressive in-context prompts to simulate the estimation of the N400 neurological response.
(ii) By placing LLMs in atypical contexts with animated entities, we show similarities to the results of tests performed on humans and to the results of previous contributions.
(iii) Finally, we show that the prompting approaches are affected only by minor bias, which allows fair analogies between the results obtained by LLMs and prior findings (Appendix A).

4 Conclusion

Large Language Models (LLMs) reveal capacities to solve repetitive cognitive tasks better and faster than humans. We treat LLMs as subjects in psychological experiments, exploring whether they behave as humans do when dealing with the concept of animacy. We show that LLMs behave as humans do even when lexical information does not entirely drive the decision. This is remarkable, as humans’ cognitive underpinnings are intricately linked with the concept of animacy. Although the LLMs under analysis are trained primarily on textual data without the support of extralinguistic information, they show human-like behaviour when exposed to typical entities and adapt to unconventional narratives. This ability to adapt, while remarkable, is sometimes aligned with human fluidity. Hence, it becomes imperative to provide robust analyses of LLMs’ behaviour in intricate social scenarios by studying their ability to emulate human-like processes. While this analysis shows significant results, current models still grapple with understanding social interactions. To build models truly aligned with human minds, their vast textual knowledge must be merged with a deep understanding of human social dynamics.

Limitations & Future Works

In this study, we used a series of behavioral experiments. Because Large Language Models are easy to interact with on benchmarks, such experiments are suitable for comparing the models with human data. Although the models have shown interesting features, this setting does not allow analyzing the causal mechanisms by which they processed the animated sentences. To analyze these mechanisms, it would be necessary to inspect the internal weights of the models, which are not always openly available, as in the case of the GPT models OpenAI (2023).

In future developments, we plan to extend the analysis to more languages to assess whether the models respond in the same way in similar scenarios and contexts involving languages beyond English. In addition, it is of interest to us to assess the impact of the in-context prompt, in particular, the degree to which the composition of the prompt may influence causal generations as done for the sycophantic behavioural study in Ranaldi and Pucci (2024). Last but not least, it will be of interest for us to analyze the internal dynamics that support the models’ decisions in order to better understand the neuronal patterns that motivate the generations Mohebbi et al. (2024).

Ethics Statement

In our work, ethical topics were not addressed. The data comes from open-source benchmarks, and statistics on language differences in commonly used pre-training data were obtained from official sources without touching on gender, sex, or race differences.

References

Appendix A Error Analysis

Although we have observed human-like behaviors, as extensively discussed in the previous sections, the results of our experiments derive from the generations of the patients introduced in § 2. To stabilize our analysis, we have reported the standard deviations of the results computed over the generations. In the following paragraphs, we exemplify the evaluation process used for each experiment and discuss the error analysis to provide a clear overview of the robustness of the results obtained.

Multiple-Choice Questions

Experiment 1 and Experiment 2, presented in § 2.2.1, are based on a robust pipeline. In the first case, the generation is closely related to a multiple-choice question task. Consequently, the evaluation uses heuristics based on string matching between the target choice and the given answer, as proposed in Wei et al. (2023); Zheng et al. (2024). Experiment 2 is similar: the question requires a strict answer, (Yes) or (No). Hence, in this second case, we also used a heuristic based on string matching between the target values and the given answer. In both cases, the LLMs were prompted to generate well-formed responses. In Appendix D, we show that the total percentage of responses that did not match the defined string-matching heuristics is negligible, which confirms the robustness of the results obtained. In particular, we estimated a maximum misleading response rate of about 0.5% and 0.6% (see Table 7) and 2.5-3% (see Table 8), which does not affect the final results. Examples of generations can be seen in Appendix F and Appendix G.
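The string-matching heuristics described above can be illustrated with a minimal sketch. It is not the evaluation code used in this work: the function names (`extract_choice`, `extract_yes_no`, `unparsable_rate`) and the exact regular expressions are assumptions chosen for illustration, and the rule of taking the first matching label is one possible design among several.

```python
import re
from typing import Callable, Iterable, Optional


def extract_choice(answer: str, choices: Iterable[str] = ("A", "B")) -> Optional[str]:
    """Return the first prompted option label (e.g. 'A)') found in the answer."""
    # Hypothetical rule: a label counts only when followed by ')', as in the prompt.
    pattern = r"\b(" + "|".join(re.escape(c) for c in choices) + r")\)"
    match = re.search(pattern, answer)
    return match.group(1) if match else None


def extract_yes_no(answer: str) -> Optional[str]:
    """Return 'Yes' or 'No' for the first such whole word, ignoring case."""
    match = re.search(r"\b(yes|no)\b", answer, flags=re.IGNORECASE)
    if match is None:
        return None
    return "Yes" if match.group(1).lower() == "yes" else "No"


def unparsable_rate(answers: Iterable[str],
                    extractor: Callable[[str], Optional[str]]) -> float:
    """Fraction of generations from which no prompted choice can be extracted."""
    answers = list(answers)
    if not answers:
        raise ValueError("at least one answer is required")
    missing = sum(extractor(a) is None for a in answers)
    return missing / len(answers)
```

An answer for which the extractor returns `None` corresponds to what the text calls a misleading response; under this sketch it would simply be counted as incorrect.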

Number Generation

Prompts based on multiple-choice questions or strict answers such as Yes or No are easier to control and analyze. However, Experiment 3 and Experiment 4 involve numbers. To manage and control the sensitivity of the prompts in these experiments, we added the keyword "[num]" (see § 2.3.3 and § 2.3.4). Similarly, in order to produce a complete and robust analysis, we estimated the final values by carefully checking whether or not the outputs were numerical. We used the Python library word2number to convert generated literal numbers into integer values. As displayed in Appendix E, the answers containing literal numbers are very few and do not affect the final evaluations. Finally, the [num] keyword seems to have directed the generation correctly, as reported in the examples shown in Table 13 and Table 14.
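The number-extraction step can be sketched as follows. This is a hedged illustration, not the code used in this work: the paper reports using word2number for literal numbers, whereas the small lookup table below is a dependency-free stand-in covering only the 0 to 30 range of the surprise score. The name `parse_score` and the choice to prefer digits over number words are assumptions.

```python
import re
from typing import List, Optional

# Stand-in for word2number (assumption): only the words needed for 0..30.
_UNITS: List[str] = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
_WORDS = {word: value for value, word in enumerate(_UNITS)}
_WORDS.update({"twenty": 20, "thirty": 30})


def _literal_to_int(text: str) -> Optional[int]:
    """Convert the first number word (e.g. 'twenty-seven') to an integer."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, token in enumerate(tokens):
        if token in _WORDS:
            value = _WORDS[token]
            # Handle compounds such as 'twenty seven' or 'twenty-seven'.
            if value == 20 and i + 1 < len(tokens) and tokens[i + 1] in _UNITS[1:10]:
                value += _WORDS[tokens[i + 1]]
            return value
    return None


def parse_score(answer: str, low: int = 0, high: int = 30) -> Optional[int]:
    """Return the surprise value in a generation, or None if none is usable."""
    digits = re.search(r"\d+", answer)
    value = int(digits.group()) if digits else _literal_to_int(answer)
    if value is None or not (low <= value <= high):
        return None
    return value
```

A caveat of this simplified fallback is that ordinary words such as "one" in running prose would be read as numbers, so in practice the literal-number path is best reserved for generations that contain no digits at all, as done here.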

Appendix B Model Versions

Model Version
Llama-2-7 meta-llama/Llama-2-7b
Llama-2-13 meta-llama/Llama-2-13b
Llama-2-70 meta-llama/Llama-2-70b
Mistral-7 mistralai/Mistral-7B-Instruct-v0.2
Mixtral8x7 TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
GPT-3.5-turbo OpenAI API (gpt-3.5-turbo-0125)
GPT-4 OpenAI API (gpt-4-1106-preview)
Table 6: List of the versions of the models used in this work, which can be found on huggingface.co. We used the configurations described in Appendix C in the repositories for each model (access to these models was verified on 1-8-2024).

Appendix C Model and Hyperparameters

In our experimental setting, we use different LLMs: (i) two models from the GPT family OpenAI (2023): GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4); (ii) three models from the Llama-2 family Touvron et al. (2023): Llama2-7b, Llama2-13b, and Llama2-70b; (iii) two models from the MistralAI family: Mistral-7b and Mixtral Jiang et al. (2024).

In particular, the GPT models are used via API, while for the others we used versions of the models quantized to 4 bits with GPTQ (see detailed versions in Table 6).

As discussed in the limitations, our choices are related to reproducibility and the cost associated with non-open-source models. We use the closed-source APIs and, for the other models, the 4-bit GPTQ quantized versions on four 48GB NVIDIA RTX A6000 GPUs. All experiments are performed in inference only.

Finally, the generation temperature varies from $\tau=0$ for the GPT models to $\tau=0.5$ for the Llama2 models. We choose these temperatures for (mostly) deterministic outputs, with a maximum token length of 256. The other parameters are left unchanged, as recommended by the official resources. We will release the code and the dataset upon acceptance of the paper.

Evaluation

Finally, as described for each experiment, we evaluate accuracy scores, computed by string matching between the final answers and the target values. The top-p parameter is set to 1 in all runs and the prompting temperature lies in [0, 1]; each experiment is repeated three times.
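The aggregation over repeated runs can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, predictions are assumed to be already extracted from the generations (with `None` for unparsable outputs), and the population standard deviation is used, although the paper does not state which estimator was applied.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


def accuracy(predictions: Sequence[Optional[str]], targets: Sequence[str]) -> float:
    """Exact-match accuracy; an unparsable prediction (None) counts as wrong."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one instance is required")
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def summarize_runs(runs: Sequence[Sequence[Optional[str]]],
                   targets: Sequence[str]) -> Tuple[float, float]:
    """Mean and (population) standard deviation of accuracy across runs."""
    scores: List[float] = [accuracy(run, targets) for run in runs]
    return mean(scores), pstdev(scores)
```

With three repetitions per experiment, `summarize_runs` would receive three lists of predictions and return the pair reported as score and standard deviation.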

Appendix D Error Analysis Strict Answers

Type GPT-4 GPT-3.5 Llama2-70 Llama2-13 Llama2-7 Mixtral Mistral-7
Transitive 0.1% 0.1% 0.2% 0.3% 0.4% 0.3% 0.4%
Passive 0.1% 0.1% 0.2% 0.3% 0.5% 0.2% 0.5%
Table 7: Percentage over 1,000 instances for the Transitive and Passive sub-task (Section 2.3.1) of generations that do not contain one of the prompted choices. Table 12 shows two examples of outputs.
Type GPT-4 GPT-3.5 Llama2-70 Llama2-13 Llama2-7 Mixtral Mistral-7
Animated 0.5% 1% 2.5% 2.5% 3% 3% 3.5%
Unanimated 0.8% 1.5% 2% 2% 2.5% 3% 3%
Table 8: Percentage over 4,500 instances for the Animated and Unanimated sub-task of generations that do not contain (Yes) or (No) as explained in Section 2.3.2. Consequently, it is difficult to assess the answer automatically.

Appendix E Error Analysis Numeric Answers

Time-step Type GPT-4 GPT-3.5 Llama2-70 Llama2-13 Llama2-7 Mixtral Mistral-7
$T_1$ Animate 2(0) 1(0) 2(0) 7(1) 8(1) 3(0) 4(0)
$T_1$ Inanimate 2(0) 4(0) 5(1) 7(1) 8(1) 4(0) 5(1)
$T_2$ Animate 0(0) 1(0) 2(0) 3(0) 3(0) 1(0) 3(0)
$T_2$ Inanimate 1(0) 4(0) 3(1) 2(1) 8(1) 4(0) 5(1)
$T_3$ Animate 0(0) 0(0) 1(0) 1(0) 1(0) 1(0) 1(0)
$T_3$ Inanimate 0(0) 0(0) 0(0) 1(0) 2(0) 2(0) 0(0)
Table 9: Number of generations that do not contain numerical values and, in brackets, the number of generations that do not contain words denoting numbers either. The total instances (sentences) are 60 for each time-step, as introduced in Section 2.3.3.
GPT-4 GPT-3.5 Llama2-70 Llama2-13 Llama2-7 Mixtral Mistral-7
Animate 2(0) 3(0) 7(0) 8(1) 8(2) 4(1) 5(1)
Inanimate 1(0) 3(0) 6(1) 8(1) 9(0) 2(0) 5(1)
Table 10: Number of generations that do not contain numerical values and, in brackets, the number of generations that do not contain words denoting numbers either. The total number of instances analyzed (sentences) is 60, as introduced in Section 2.3.4.

Appendix F Answers for Experiment 1

Choose which example is acceptable between A and B.
A) Hospitals aren’t biked to by the man.
B) Hospitals aren’t biked to by the mouth.
Answer: The acceptable example is A) Hospitals aren’t biked to by the man. In English, "biking" typically refers to the activity of riding a bicycle. It is logical to say that a man could ride a bicycle to a hospital. However, it doesn’t make sense to say that "the mouth" could bike to a hospital, as the mouth is not capable of such physical activity. Therefore, option A is the appropriate choice because it aligns with the expected action and capability of the subject ("the man").
(answer generated by GPT-3.5-turbo)

Choose which example is acceptable between A and B.
A) Hospitals aren’t biked to by the man.
B) Hospitals aren’t biked to by the mouth.
Answer: The correct answer is B) Hospitals aren’t biked to by the mouth. Explanation: In this sentence, "mouth" is a noun that refers to the opening through which food and air pass when we breathe. It doesn’t make sense to say that hospitals are biked to by the mouth because hospitals are buildings where medical treatment is provided, and they cannot be reached or accessed through the mouth.
(answer generated by Llama-2-13-chat)

Appendix G Answers for Experiment 2

Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: At the club the cocktails are served by the tonic
Answer: No, the sentence is not plausible. Cocktails are typically served by bartenders or servers, not by tonic, which is a type of mixer. Tonic water is a beverage ingredient used in cocktails, but it does not serve cocktails itself. Therefore, the sentence does not make sense in the context of cocktail service.
(answer generated by GPT-3.5-turbo)

Is the following sentence plausible? Answer by choosing (Yes) or (No).
Sentence: At the club the cocktails are served by the tonic
Answer: Yes, the sentence is plausible. The word "tonic" could refer to a type of alcoholic drink, such as a gin and tonic, which is commonly served at clubs. In this context, the sentence makes sense and is a common way to describe how cocktails are served at a bar or nightclub.
(answer generated by Llama-2-7-chat)

Table 11: Examples of answers generated by GPT-3.5 and Llama-2-13-chat. As introduced in Section 2.3.1 and shown in Table 8, the answers contain the choices prompted.
Table 12: Examples of answers generated by GPT-3.5 and Llama-2-7-chat. Also in this case, as described in Section 2.3.2, the prompts are well formed, and the outputs are generated with low misleading rates (Table 8).

Appendix H Answers for Experiment 3

Choose a surprising value from 0 to 30 on the following story: A museum visitor asked the map where the small collection of Van Gogh paintings was located.
Answer: 27 (GPT-4)
Answer: 22 (Mixtral8x7)

Given the following story: A museum visitor asked the map where the small collection of Van Gogh paintings was located. The map was willing to take her there, as there was nothing to do. The museum visitor envied the map because he was allowed to see so much beauty every day. But the map laughingly explained that he always took a Game Boy with him when he was bored. The museum visitor teased the map, joking that he had to be an art barbarian to ever get bored in this museum.
Choose a surprising value from 0 to 30 on the following story: The map joked that he had seen enough art for the next ten years.
Answer:[num]
6. (answer generated by GPT-4)
7. Considering the previous answer, the map’s response is less surprising but still slightly unexpected. A surprising value of 7 out of 30 reflects the diminished unexpectedness of the map’s attitude, while still acknowledging the earlier context. (answer generated by Mixtral8x7)

Table 13: This example shows the answers provided by GPT-4 and Mixtral8x7 using the prompt defined in Section 2.3.3. The structure of the input and the keyword ([num]) lead the prompted LLMs to generate a number in digits as output.
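The progressive in-context prompts shown in the example above can be assembled with a short helper. The sketch below only mirrors the wording visible in that example; the function names, the use of line breaks between parts, and the decision to append the [num] keyword at every step are assumptions, since the exact template is defined in Section 2.3.3 and not reproduced here.

```python
from typing import List, Sequence

# Instruction wording copied from the example prompt above.
INSTRUCTION = "Choose a surprising value from 0 to 30 on the following story:"


def build_prompt(context: Sequence[str], target: str, keyword: str = "[num]") -> str:
    """Build one prompt: optional story context, the target sentence, the keyword."""
    parts: List[str] = []
    if context:
        parts.append("Given the following story: " + " ".join(context))
    parts.append(f"{INSTRUCTION} {target}")
    parts.append(f"Answer:{keyword}")
    return "\n".join(parts)


def progressive_prompts(sentences: Sequence[str], steps: Sequence[int]) -> List[str]:
    """One prompt per rated sentence, each using all earlier sentences as context."""
    return [build_prompt(sentences[:i], sentences[i]) for i in steps]
```

For a story of several sentences, `progressive_prompts(story, [0, 2, 5])` would yield three prompts whose context grows at each step, which is the property the progressive setting relies on.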

Appendix I Answers for Experiment 4

Given the following context: A girl told a sandwich that an attack was imminent. The sandwich wailed that his family was in danger. The girl told the sandwich that public places were the most dangerous. The sandwich immediately started calling everyone he knew.
Choose a surprising value from 0 to 30 on the following story: The sandwich was delicious and wanted to make sure none of his loved ones were in danger.
Answer:[num]
20. (answer generated by GPT-3.5)
18. (answer generated by Mixtral8x7)
15. (answer generated by Llama-2-70-chat)
24. (answer generated by Llama-2-7-chat and Llama-2-13-chat)

Table 14: This example shows the answers provided by the different probed LLMs using the prompt defined in Section 2.3.4. Also in this experiment, the structure of the input and the keyword ([num]) lead the prompted LLMs to generate a number in digits as output.