Symbolic Working Memory Enhances Language Models for
Complex Rule Application
Abstract
Large Language Models (LLMs) have shown remarkable reasoning performance but struggle with multi-step deductive reasoning involving a series of rule application steps, especially when rules are presented non-sequentially. Our preliminary analysis shows that while LLMs excel in single-step rule application, their performance drops significantly in multi-step scenarios because of the difficulty of rule grounding, which requires anchoring the applicable rule and supporting facts at each step amidst multiple input rules, facts, and inferred facts. To address this, we propose augmenting LLMs with external working memory and introduce a neurosymbolic framework for rule application. The memory stores facts and rules in both natural language and symbolic forms, enabling precise tracking. Utilizing this memory, our framework iteratively performs symbolic rule grounding and LLM-based rule implementation. The former matches predicates and variables of symbolic rules and facts to ground applicable rules at each step. Experiments indicate our framework’s effectiveness in rule application and its robustness across various steps and settings. Code and data are available at https://github.com/SiyuanWangw/RuleApplication.
Siyuan Wang1, Zhongyu Wei2, Yejin Choi3,4, Xiang Ren1 1University of Southern California, 2Fudan University, 3University of Washington, 4Allen Institute for Artificial Intelligence siyuanwang1997@gmail.com
1 Introduction
Large Language Models (LLMs) OpenAI (2023); Touvron et al. (2023); Team et al. (2023); Wei et al. (2022) have demonstrated impressive performance across diverse reasoning tasks. However, they still face challenges with multi-step deductive reasoning Creswell et al. (2022); Ling et al. (2024); Lee and Hwang (2024), where LLMs are provided with a set of facts and logical rules, and need to derive an answer to the query through a sequence of rule application steps. Specifically, each step of rule application requires applying a specific rule to its supporting facts to deduce new conclusions.


Moreover, LLMs especially struggle when the surface patterns deviate from the sequential ordering of the rules Chen et al. (2024); Berglund et al. (2023).
We conduct a preliminary analysis of LLM performance across various numbers of rule application steps, with rules input either sequentially or non-sequentially with respect to their application order. As shown in Figure 1, we observe three phenomena: (1) LLMs are effective at executing single-step rule application. (2) Their performance declines as the number of rule application steps increases. (3) Performance worsens significantly when rules are presented non-sequentially rather than sequentially, especially in long-term reasoning. Overall, LLMs excel in single-step rule application but face challenges in multi-step rule application, which requires tracking long-term facts and rules and determining the appropriate rule and facts to apply at each step.
Each step of rule application typically consists of two processes: rule grounding and rule implementation. Rule grounding anchors the currently applicable rule with supporting facts from the input, while rule implementation infers new facts based on the identified rule and facts. The aforementioned challenges primarily arise from rule grounding using LLMs. Specifically, complex reasoning involves multiple input facts, rules, and intermediate inferred facts, making it difficult to accurately track long-term rules and facts (especially inferred ones) for each step using LLMs’ internalized reasoning Lanchantin et al. (2024). Additionally, as rules are often provided in a non-sequential order or include irrelevant ones, rule grounding requires referencing back and forth across all rules to identify the applicable one at each step, posing challenges for auto-regressive LLMs Chen et al. (2024).
For precise tracking in multi-step rule application, we propose augmenting LLMs with an external working memory, inspired by humans’ extensive use of memory for intelligence tasks Hardman and Cowan (2016). It explicitly stores an unlimited list of facts and rules, facilitating easy access during rule grounding and the writing of new facts after intermediate rule implementation. In addition, it stores rules and facts in a non-ordered manner, minimizing the influence of the input order on LLMs’ reasoning. We implement this working memory to store rules and facts in both natural language and their symbolic forms (i.e., in Prolog), thus supporting precise symbolic reference.
Building on working memory, we propose a neurosymbolic framework for rule application. This framework uses working memory for symbolic rule grounding and LLMs for rule implementation, leveraging LLMs’ effectiveness in single-step rule application. This combination is more flexible than purely symbolic execution and more precise than fully LLM-driven methods. The workflow begins by writing all input facts and rules into working memory. It then proceeds with multiple steps of rule application, each involving symbolic rule grounding followed by LLM-based rule implementation. Specifically, symbolic rule grounding performs predicate and variable matching within the symbolic forms of facts and rules, checking for conflicts to determine the applicable rule with supporting facts. In rule implementation, LLMs infer new facts based on the grounded rule and facts, and the new inferred facts with their symbolic notations are written into the working memory. This cycle continues until the inferred facts solve the query or a maximum number of steps is reached.
We conduct experiments on four datasets involving multi-step rule application: CLUTRR and ProofWriter for logical reasoning, AR-LSAT for constraint satisfaction and Boxes for object state tracking. Results show that our framework outperforms CoT-based and symbolic-based baselines using GPT-4 and GPT-3.5, and exhibits robustness across various rule application steps and settings.
2 Preliminary
2.1 Problem Definition
We consider reasoning tasks involving deductive rule application in natural language, which take a context and a query as input. The context includes all necessary facts and rules for solving the query, though they may be provided in an order that differs from their application order and may include irrelevant distractors. The model needs to apply specific rules to both the given and intermediate inferred facts to deduce new facts and ultimately output the answer.
2.2 External Working Memory
To enhance LLMs for precise long-term tracking in multi-step rule application, we introduce an external working memory to explicitly store rules and facts, as illustrated in Figure 2.

Working Memory Composition
The working memory consists of three components: a fact base, a rule base and a memory schema. The fact base stores a list of facts from the input context and intermediate reasoning, while the rule base saves a list of input rules. The facts and rules are stored in both natural language and their symbolic forms to support precise symbolic reference and verbalized utilization during multi-step rule application. The memory schema maintains a unified vocabulary of all involved predicates and objects in each instance, avoiding semantic duplication. For example, if “father_of” or “located_in” are in the schema, then “father-in-law_of” or “located_at” will not be included. The symbolic facts and rules in the memory are constituted using these predicates and objects from the schema.
The working memory supports two operations: read and write. The read operation retrieves necessary facts and rules from the memory. The write operation involves adding new rules or facts to the memory, or updating existing facts. The decision to add or update facts depends on whether the context involves fact updating, such as an object’s location changing over time. If new facts conflict with existing ones, updating occurs; otherwise, new facts are added. In contrast, for static information like the kinship relationship between individuals, new inferred facts will never conflict with existing ones, allowing them to be directly added.
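The read and write operations described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Entry`, `WorkingMemory`, `write_fact` and the caller-supplied `conflicts` predicate are all hypothetical, and the rule for deciding when two facts conflict is left to the caller because the text does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str      # natural-language form
    symbolic: str  # symbolic form, e.g. "sister_of(Dolores, Thomas)"


@dataclass
class WorkingMemory:
    facts: list = field(default_factory=list)   # fact base
    rules: list = field(default_factory=list)   # rule base
    schema: list = field(default_factory=list)  # canonical predicates and objects

    def read(self):
        """Return copies of all stored facts and rules."""
        return list(self.facts), list(self.rules)

    def write_rule(self, rule):
        self.rules.append(rule)

    def write_fact(self, fact, conflicts=None):
        """Add a fact, or update the first existing fact that conflicts with it.

        `conflicts(old, new)` is a hypothetical caller-supplied predicate.
        Pass None for static contexts, where facts are only ever added.
        """
        if conflicts is not None:
            for i, old in enumerate(self.facts):
                if conflicts(old, fact):
                    self.facts[i] = fact
                    return "updated"
        self.facts.append(fact)
        return "added"
```

For a dynamic context, `conflicts` could, for example, flag two facts that describe the state of the same object; for static contexts such as kinship, omitting it means every new fact is appended.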
Symbolic Formulation
Facts and rules are symbolically represented using Prolog notations Apt et al. (1997). Specifically, a fact is a predicate expression with several arguments, formatted as predicate(arg1, arg2, …), where args are specific objects. For example, the fact “Dolores is the sister of Thomas.” can be formulated as “sister_of(Dolores, Thomas)”. A rule typically takes the form conclusion:-premises, interpreted as “If premises, then conclusion”. Both the conclusion and premises are composed of atomic facts, where args include both abstract variable symbols like A, B, C and specific objects. For example, “If B is the grandson of A, and C is sister of B, then C is the granddaughter of A” can be represented as granddaughter_of(C, A):-grandson_of(B, A), sister_of(C, B). More examples are in Figure 2.
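As an illustration of this notation, the following sketch parses the two example strings above into simple Python structures. The names `Atom`, `Rule`, `parse_atom`, `parse_rule` and `is_variable` are hypothetical. It also assumes that a single uppercase letter denotes a variable, which follows the paper's examples rather than strict Prolog, where any capitalized token such as Dolores would itself be read as a variable.

```python
import re
from typing import NamedTuple


class Atom(NamedTuple):
    predicate: str
    args: tuple


class Rule(NamedTuple):
    conclusion: Atom
    premises: tuple


# predicate(arg1, arg2, ...) with no nested parentheses
ATOM_RE = re.compile(r"\s*([\w-]+)\(([^()]*)\)\s*")


def parse_atom(text):
    """Parse 'sister_of(Dolores, Thomas)' into an Atom."""
    m = ATOM_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"not an atomic fact: {text!r}")
    args = tuple(a.strip() for a in m.group(2).split(","))
    return Atom(m.group(1), args)


def parse_rule(text):
    """Parse 'conclusion:-premise1, premise2' into a Rule."""
    head, body = text.split(":-", 1)
    premises = tuple(parse_atom(p) for p in re.findall(r"[\w-]+\([^()]*\)", body))
    return Rule(parse_atom(head), premises)


def is_variable(arg):
    # Assumption: variables are single uppercase letters such as A, B, C.
    return len(arg) == 1 and arg.isupper()
```

With these helpers, `parse_rule("granddaughter_of(C, A):-grandson_of(B, A), sister_of(C, B)")` yields one conclusion atom and two premise atoms whose arguments are all variables.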
Memory Schema
A key challenge in managing working memory is ensuring no duplication caused by different expressions conveying the same semantic meaning. This is essential for updating facts and identifying applicable rules based on supporting facts. To address this, we establish a memory schema for maintaining canonical predicates and objects. Symbolic facts and rules are formulated using predicates and objects from this schema.
The schema is dynamically constructed throughout the symbolic formulation process. Initially, the schema is empty. When formulating each fact or rule, the process first looks up whether the existing memory schema can accommodate the necessary predicates and objects to encode that piece of information. If it can, symbolic formulation is conducted directly based on the memory schema. If it cannot, new predicates or objects are created and added to the memory schema, and the symbolic formulation proceeds using these additions. The dynamic construction process of the memory schema can be viewed in Appendix A.
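The lookup-then-extend behaviour described above can be sketched as follows. In the paper this decision is made while LLMs formulate each fact or rule; here a hypothetical `same_meaning` callable stands in for that judgment, and `canonicalize` is an illustrative name, not part of the released code.

```python
def canonicalize(term, schema, same_meaning):
    """Return the canonical schema entry for `term`, extending the schema if needed.

    `schema` is a list of canonical predicates or objects (initially empty).
    `same_meaning(known, term)` is a hypothetical stand-in for the semantic
    check that decides whether an existing entry already covers `term`.
    """
    for known in schema:
        if same_meaning(known, term):
            return known      # reuse the existing entry
    schema.append(term)       # otherwise create a new entry
    return term
```

For instance, with an alias check that treats “located_at” as equivalent to “located_in”, canonicalizing “located_in” first and “located_at” second leaves a single schema entry, “located_in”.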
3 Framework
Complex reasoning often necessitates multi-step rule application amid non-sequential and irrelevant rules and facts. To address this, we propose a two-stage paradigm for each rule application step: rule grounding and rule implementation. Rule grounding anchors the applicable rules and supporting facts at each step. Rule implementation then infers new facts based on the grounded rules and facts.
Following this paradigm, we introduce a working memory-based neurosymbolic framework for rule application. It first initializes the working memory with all facts and rules from the input context. It then iteratively performs multi-step rule application, each step involving symbolic rule grounding based on symbolic formulations of facts and rules, followed by LLM-based rule implementation. This process continues until the query is solved or a maximum number of steps is reached. The detailed workflow is shown in Figure 3.

3.1 Working Memory Initialization
To comprehensively initialize the working memory from the input context, we first decompose the context into multiple sentences. Then we prompt LLMs to list existing facts and rules for each sentence within the context. This involves extracting the natural language expressions and simultaneously parsing their symbolic formulations based on the memory schema. Both the natural language and symbolic representations of all facts and rules are then written into the working memory. Any new predicates and objects beyond the memory schema are also incorporated into the working memory. The detailed prompt can be found in Appendix D.
3.2 Symbolic Rule Grounding
At each step of rule application, we first ground the current applicable rules and corresponding supporting facts from the working memory. We adopt a symbolic predicate and variable matching strategy between facts and rules for precise grounding.
- Predicate Matching checks if the predicates of selected facts match those of the rule’s premises. This exact string matching can be further relaxed using approximate string or model-based semantic matching to accommodate parsing inconsistencies for more flexible grounding.
- Variable Matching verifies whether the arguments of facts can instantiate the variables in rule premises without conflicts (i.e., each variable is instantiated by the same argument), or can match the objects in rule premises.
Detailed examples are illustrated in Figure 4. We observe that the predicates of facts F1 and F2 do not match with rule R, while the arguments of F2 and F4 cannot instantiate the variable B in rule R. After this symbolic rule grounding, rule R is applicable to its supporting facts F2 and F3.
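The two checks can be sketched as a single function that either returns a consistent set of variable bindings or reports a mismatch. This is an illustrative sketch under stated assumptions, not the authors' code: atoms are represented as `(predicate, args)` tuples, variables are assumed to be single uppercase letters, `match_premises` is a hypothetical name, and the person James in the usage note is invented for the example.

```python
def is_variable(arg):
    # Assumption: variables are single uppercase letters such as A, B, C.
    return len(arg) == 1 and arg.isupper()


def match_premises(premises, facts):
    """Match facts[i] against premises[i]; return bindings, or None on failure.

    Each atom is a (predicate, args) tuple, e.g. ("sister_of", ("C", "B")).
    """
    if len(premises) != len(facts):
        return None
    bindings = {}
    for (p_pred, p_args), (f_pred, f_args) in zip(premises, facts):
        # Predicate matching: exact string equality (could be relaxed).
        if p_pred != f_pred or len(p_args) != len(f_args):
            return None
        # Variable matching: each variable must keep one consistent value,
        # and concrete objects in the premise must equal the fact's argument.
        for p_arg, f_arg in zip(p_args, f_args):
            if is_variable(p_arg):
                if bindings.setdefault(p_arg, f_arg) != f_arg:
                    return None
            elif p_arg != f_arg:
                return None
    return bindings
```

Given the premises grandson_of(B, A) and sister_of(C, B), the facts grandson_of(Thomas, James) and sister_of(Dolores, Thomas) bind B to Thomas in both positions and therefore match, whereas replacing the second fact with sister_of(Dolores, Michael) fails because B would need two different values.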

Specifically, we adopt different rule grounding approaches for different task types. For tasks like logical reasoning, where facts have no inherent chronological order and a single fact never involves updating, we adopt exhaustive enumeration for rule grounding. We enumerate all combinations of facts for each rule according to the number of premise facts, and check all rules. We perform both predicate and variable matching, deeming a rule applicable if no conflicts arise with the corresponding facts. Notably, each set of supporting facts for the current step’s applicable rules must include the newly inferred facts from the previous round to avoid repeating rule implementation. For particular constraint satisfaction tasks where all rules need to be satisfied with diverse constraint predicates, we only conduct variable matching to rank the most applicable rule at each step.
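A minimal sketch of exhaustive enumeration for the logical reasoning case is given below. It is an illustration under assumptions rather than the released implementation: rules are `(conclusion, premises)` tuples of `(predicate, args)` atoms, variables are single uppercase letters, and the requirement on newly inferred facts is read here as "at least one supporting fact must be new". Ordered selections (`itertools.permutations`) are used because the position of a fact determines which premise it is matched against.

```python
from itertools import permutations


def is_variable(arg):
    # Assumption: variables are single uppercase letters such as A, B, C.
    return len(arg) == 1 and arg.isupper()


def match(premises, facts):
    """Predicate and variable matching; returns bindings or None."""
    bindings = {}
    for (p_pred, p_args), (f_pred, f_args) in zip(premises, facts):
        if p_pred != f_pred or len(p_args) != len(f_args):
            return None
        for p_arg, f_arg in zip(p_args, f_args):
            if is_variable(p_arg):
                if bindings.setdefault(p_arg, f_arg) != f_arg:
                    return None
            elif p_arg != f_arg:
                return None
    return bindings


def ground_exhaustively(rules, facts, new_facts=None):
    """Return (rule, supporting_facts, bindings) for every applicable rule.

    new_facts=None marks the first round, where every fact is eligible.
    In later rounds, a grounding is kept only if it uses a new fact.
    """
    grounded = []
    for rule in rules:
        _conclusion, premises = rule
        for combo in permutations(facts, len(premises)):
            if new_facts is not None and not any(f in new_facts for f in combo):
                continue
            bindings = match(premises, combo)
            if bindings is not None:
                grounded.append((rule, combo, bindings))
    return grounded
```

For a rule with k premises and n stored facts, this checks n!/(n-k)! ordered selections, so the cost grows quickly with the size of the fact base.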
For tasks like object state tracking, where facts follow an inherent sequential order due to temporal operations, causing single state facts to update over time, we perform rule grounding according to the chronological order of given operations. For the operational fact at each step, we identify the most applicable rule and relevant state facts based on both predicate matching and variable matching.
Most reasoning tasks that involve rule application can be categorized into two main types: static reasoning and dynamic operational decision-making. These tasks can be approached using the above two rule grounding strategies: exhaustive enumeration and chronological grounding.
3.3 LLM-based Rule Implementation
LLMs are effective at single-step rule application. After symbolic rule grounding identifies the applicable rules and corresponding supporting facts from the working memory at each step, we leverage LLMs to implement all applicable rules in parallel. Specifically, we input each rule with its supporting facts and prompt LLMs to infer possible new facts in both natural language and symbolic formulations. The inferred facts are then written into the working memory accordingly. During each step of rule implementation, we also determine whether a newly inferred fact solves the query (for logical reasoning) or check for rule-fact conflicts (for constraint satisfaction).
Final Answer Prediction
If a new fact resolves the query, the iteration ends and we utilize that fact for the final answer. For multi-choice constraint satisfaction, we select the option without conflict as the final answer (or, conversely, the option with conflict for negative questions). For object state tracking, where iteration ends only after all operations, the query can be directly answered by looking up the query object’s state from the working memory. If none of the inferred facts in a step solves the query, the process proceeds to the next iteration. The cycle continues until the query is resolved or a maximum step count is reached. If the query remains unsolved, we employ a backup CoT method to output the final answer. Detailed prompts are provided in Appendix D.
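The iterative cycle for the logical reasoning case can be summarized in a short control-loop sketch. Everything here is hypothetical scaffolding: `ground`, `implement`, `solves` and `backup` are injected callables standing in for symbolic rule grounding, LLM-based rule implementation, the query check and the backup CoT method, and the default `max_steps=5` as well as the early exit when no new fact is inferred are assumptions rather than details given in the text.

```python
def run_rule_application(facts, rules, ground, implement, solves, backup, max_steps=5):
    """Iterate grounding and implementation until the query is solved.

    ground(memory, rules, new_facts) -> iterable of (rule, supporting_facts)
    implement(rule, supporting_facts) -> iterable of inferred facts
    solves(fact) -> True if the fact answers the query
    backup() -> fallback answer when the loop cannot solve the query
    """
    memory = list(facts)   # fact base of the working memory
    new_facts = None       # None: first round, every fact is eligible
    for _ in range(max_steps):
        inferred = []
        for rule, support in ground(memory, rules, new_facts):
            for fact in implement(rule, support):
                if fact in memory or fact in inferred:
                    continue              # skip facts that are already known
                if solves(fact):
                    return fact           # query resolved, stop iterating
                inferred.append(fact)
        if not inferred:
            break                         # assumption: stop when nothing new is inferred
        memory.extend(inferred)           # write new facts into memory
        new_facts = inferred
    return backup()
```

Injecting the grounding and implementation steps as callables keeps the loop independent of how each stage is realized, which mirrors the paper's separation of symbolic grounding from LLM-based implementation.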
Method | CLUTRR (GPT-4) | CLUTRR (GPT-3.5) | ProofWriter (GPT-4) | ProofWriter (GPT-3.5) | AR-LSAT (GPT-4) | AR-LSAT (GPT-3.5) | Boxes (GPT-4) | Boxes (GPT-3.5)
CoT-based Methods
Scratchpad-CoT | 83.83% | 57.02% | 61.33% | 49.67% | 41.25% | 30.00% | 91.85% | 15.60%
SC-CoT | 85.53% | 59.57% | 62.00% | 54.00% | 45.00% | 31.25% | 93.33% | 17.04%
Self-Notes | 74.04% | 55.74% | 62.00% | 52.67% | 47.50% | 23.75% | 92.59% | 18.52%
Symbolic-based Methods
Logic-LM | / | / | 62.33% | 52.00% | 50.00% | 31.25% | / | /
SymbCoT | / | / | 65.67% | 51.33% | 60.00% | 21.25% | / | /
WM-Neurosymbolic | 92.34% | 78.72% | 77.33% | 58.00% | 70.00% | 35.00% | 100% | 34.29%
4 Experiments
4.1 Setup
Datasets
We conduct experiments on four reasoning datasets that involve multiple steps of deductive rule application, including CLUTRR Sinha et al. (2019), ProofWriter Tafjord et al. (2020), AR-LSAT Zhong et al. (2021) and Boxes Kim and Schuster (2023), detailed as follows:
- CLUTRR and ProofWriter are two logical reasoning datasets, involving the application of commonsense and predefined logical rules. For CLUTRR, we select 235 test instances requiring 2-6 steps of rule application. For ProofWriter, we select instances necessitating 3-5 reasoning steps from the open-world assumption subset, totaling 300 instances with balanced labels.
- AR-LSAT is a constraint satisfaction dataset sourced from the Law School Admission Test, and requires applying all conditional rules to find satisfactory solutions. Since multiple instances in the original dataset share the same context, which may bias the evaluation, we select all instances with unique contexts from both the development and test sets, resulting in 80 examples for our evaluation.
- Boxes requires reasoning about objects’ states after multiple operations, where applying inferential rules for these operations can enhance reasoning. We collect all 135 instances involving 6-7 operations to ensure evaluation difficulty. As rules are not provided, we manually curate the corresponding rule for each operation.
Baseline
We compare our framework with two types of baselines: CoT-based methods and symbolic-based methods. The CoT-based methods include: (1) Scratchpad-CoT Nye et al. (2021); Wei et al. (2022) performs chain-of-thought reasoning in a scratchpad manner after the entire input; (2) Self-Consistency CoT (SC-CoT) Wang et al. (2022b) samples three reasoning paths and takes the majority vote as the final prediction. Specifically, for sampling we shuffle the input order for the first three tasks and adopt different temperatures (i.e., 0, 0.5, 1.0) for the last task; (3) Self-Notes Lanchantin et al. (2024) prompts the model to generate multiple internal reasoning notes interleaved with the input. The symbolic-based methods include: (4) Logic-LM Pan et al. (2023) utilizes LLMs to parse natural language problems into symbolic formulations and then performs deterministic inference with symbolic solvers, like the Z3 theorem prover De Moura and Bjørner (2008); and (5) SymbCoT Xu et al. (2024) fully utilizes LLMs to parse language facts and rules into symbolic expressions and solve problems step-by-step by CoT.
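The majority vote used by the SC-CoT baseline can be sketched in a few lines. The function name `majority_vote` is hypothetical, and the tie-breaking behaviour (the answer seen first among equally frequent ones wins) is a property of this sketch, not something the text specifies.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among sampled reasoning paths.

    Ties are broken in favour of the answer that appeared first,
    which is an assumption of this sketch.
    """
    if not answers:
        raise ValueError("at least one sampled answer is required")
    return Counter(answers).most_common(1)[0][0]
```

With three sampled paths, as in the setup above, two agreeing answers are enough to decide the final prediction.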
Our working memory-based neurosymbolic framework, WM-Neurosymbolic, is implemented based on two different backbone LLMs: GPT-4 (gpt-4-turbo-0409 for CLUTRR, ProofWriter and Boxes, gpt-4o for AR-LSAT) and GPT-3.5 (gpt-3.5-turbo-0125). This enables evaluation of its effectiveness with various abilities of symbolic semantic parsing and one-step rule application. We adopt a one-shot prompting strategy for CoT-based baselines, while symbolic-based methods, which require better output format control in sub-procedures, use few-shot prompts with multiple examples. Similarly, WM-Neurosymbolic employs few-shot prompts, but we try to ensure all examples in each prompt belong to a single instance for a fair comparison. We also provide comparisons with multi-shot CoT-based methods in Appendix C.1, according to the maximum number of examples used by our framework in each dataset. More implementation details are available in Appendix B.
4.2 Overall Performance
The overall results are presented in Table 1. For symbolic-based methods, which may fail to return an answer because of symbolic formulation errors, we use Scratchpad-CoT as a backup. We have the following observations:
(1) Our method significantly outperforms all baselines across all datasets, including the extremely challenging AR-LSAT dataset, demonstrating the superiority of our working memory-based neurosymbolic framework.
(2) Our framework is effective on top of different LLM backbones with varying abilities in symbolic parsing and one-step rule application. Specifically, the GPT-3.5-based framework shows significant improvement on formally expressed problems (CLUTRR, Boxes), while GPT-4 excels at more naturalistic problems (ProofWriter, AR-LSAT). This suggests our framework is more effective as backbone LLMs advance.
(3) Compared to previous symbolic-based methods that perform both rule grounding and implementation either symbolically or by LLMs, our framework exhibits improvement, demonstrating flexibility and robustness by disentangling rule grounding and implementation, performed symbolically and through LLMs, respectively.
4.3 Ablation Study
To investigate the effectiveness of different stages in our framework, we conduct an ablation study taking GPT-4 as the backbone on the CLUTRR and ProofWriter datasets (to save computational costs, we select only the ProofWriter instances that require 5 reasoning steps for this analysis). We substitute decomposition-based memory initialization with scratchpad-CoT initialization, symbolic rule grounding with LLM-based grounding, and LLM-based rule implementation with symbolic implementation, respectively. Scratchpad-CoT initialization involves formulating all facts and rules within the entire context at once via scratchpad-CoT. LLM-based grounding prompts LLMs to iteratively determine the applicable rules with associated facts at each step (similar to the SELECTION-INFERENCE method Creswell et al. (2022)). Symbolic implementation is a deterministic process defined by ourselves.
Method | CLUTRR | ProofWriter |
WM-Neurosymbolic | 92.34% | 74.67% |
Scratchpad Initialization | 86.81% | 66.67% |
LLM-based Grounding | 82.98% | 73.33% |
Symbolic Implementation | 90.64% | 52.00% |
Scratchpad-CoT | 83.83% | 53.33% |
As shown in Table 2, all substitutions lead to performance drops, underscoring the effectiveness of our framework design. Compared to scratchpad-CoT initialization, the decomposition-based strategy simplifies fact and rule formulation by breaking down the context into individual sentences, achieving more comprehensive initialization and improved reasoning. LLM-based rule grounding performs even worse than the Scratchpad-CoT baseline on CLUTRR, revealing LLMs’ deficiency in determining rule application order and tracking long-term facts in multi-step reasoning. However, it shows only a slight drop on ProofWriter, because its reasoning involves a single object, reducing complexity for LLMs. Symbolic implementation causes a greater decline in ProofWriter than in CLUTRR, indicating that advanced LLMs are more robust at one-step rule application for more naturalistic, complex problems than symbolic solvers.
4.4 Effectiveness on Open-source LLMs
To showcase the effectiveness of our framework using affordable open-source LLMs, we implement it on Llama-3-8B-Instruct and compare the results with Llama-based CoT baselines on the CLUTRR and ProofWriter datasets. As shown in Table 3, our framework exhibits robust effectiveness on both closed-source and open-source models.
Method | CLUTRR (Llama-3-8B) | ProofWriter (Llama-3-8B)
Scratchpad-CoT | 52.77% | 50.33% |
SC-CoT | 54.47% | 53.67% |
Self-Notes | 51.49% | 52.33% |
WM-Neurosymbolic | 63.40% | 58.67% |
5 Further Analysis


5.1 Varying Rule Application Steps
To evaluate the effectiveness of our framework across different numbers of rule application steps, we report the performance of various GPT-4-based methods on the CLUTRR and ProofWriter datasets, which involve 2-6 steps and 3-5 steps, respectively. As shown in Figure 5, our framework consistently performs the best across all steps. As problem complexity increases with more steps, our advantage remains significant. Moreover, Self-Consistency CoT outperforms the baseline CoT on fewer steps, but this advantage diminishes with more steps due to the increased likelihood of generating discrepancies. This can be mitigated by drawing more samples.
5.2 Different Rule Settings
In real-world questions, rules are presented in various ways as follows. (1) Ordered Rules: rules are arranged in their application order. (2) Shuffled Rules: rules are provided in a random order. (3) Noisy Rules: rules are shuffled and include irrelevant ones. This setup closely aligns with real-world retrieval-based scenarios where logical rules are retrieved from external sources and may contain distractors. We examine these three rule settings using the CLUTRR dataset (focusing on 5-6 rule application steps) and compare our framework to CoT-based baselines on GPT-4. Since self-consistency CoT involves shuffling input order, we do not report its performance. For noisy rules, we manually add two irrelevant rules to each instance as distractors.
Rule Settings | Ordered | Shuffled | Noisy |
Scratchpad-CoT | 66% | 64% | 58% |
Self-Notes | 68% | 54% | 50% |
WM-Neurosymbolic | 74% | 74% | 76% |
Table 4 shows that CoT-based baselines are susceptible to perturbations from rule order and noise, especially the Self-Notes method. In contrast, our framework exhibits robust effectiveness across all rule settings, even with noisy distractors. Notably, our framework outperforms CoT-based baselines even in the ordered rule setting, underscoring its enhanced ability to precisely track facts at each step and iteratively perform multi-step rule application. Moreover, in Appendix C.2 we implement our framework without provided rules, to simulate realistic scenarios where rules are well-established commonsense principles derived from real-world observations but are not explicitly given as input.
5.3 Symbolic Investigation
Symbolic-based methods inevitably lead to execution failures due to syntax or semantic errors during symbolic formulation, even when it is performed by an LLM parser. To mitigate this, our framework decouples the symbolic rule application process into rule grounding, executed symbolically, and rule implementation, performed by LLMs. To illustrate our framework’s flexibility and efficacy, we report its execution success rate and accuracy across all datasets. Specifically, the execution rate denotes the proportion of instances that can be directly solved by our neurosymbolic framework without backup, and accuracy is calculated over these executable instances.
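The two statistics defined above can be computed as in the following sketch. The record format, a list of `(executed, correct)` boolean pairs per instance, and the name `executable_stats` are assumptions made for illustration.

```python
def executable_stats(records):
    """Compute (execution rate, accuracy on executable instances).

    `records` is a list of (executed, correct) pairs, one per instance,
    where `executed` means the instance was solved without the backup method.
    """
    if not records:
        raise ValueError("no records given")
    executed = [correct for ran, correct in records if ran]
    rate = len(executed) / len(records)
    # Accuracy is measured only over instances that executed successfully.
    accuracy = sum(executed) / len(executed) if executed else 0.0
    return rate, accuracy
```

Note that instances handled by the backup method affect only the rate, not the reported accuracy.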
Executable Statistics | GPT-4 Rate | GPT-4 Accuracy | GPT-3.5 Rate | GPT-3.5 Accuracy
CLUTRR | 68.94% | 100.00% | 57.02% | 97.76% |
ProofWriter | 67.00% | 85.57% | 67.67% | 85.22% |
AR-LSAT | 56.25% | 93.33% | 12.50% | 70.00% |
Boxes | 100.00% | 100.00% | 100.00% | 34.29% |
As depicted in Table 5, our framework successfully executes over 50% of instances for all datasets on both GPT-4 and GPT-3.5, except for the complex AR-LSAT dataset on GPT-3.5. Additionally, it achieves high accuracy on executable instances. In contrast, Logic-LM executes fewer than 10% of ProofWriter instances, with 33.75% and 8.75% of AR-LSAT instances executable based on GPT-4 and GPT-3.5, respectively (these figures are obtained from our re-implementation). This demonstrates the flexibility of our rule application framework, combining matching-based grounding with LLM-based implementation for a softer symbolic approach. While SymbCoT achieves 100% execution success, it shows limited accuracy, highlighting the precision our framework gains from symbolic grounding.
5.4 Error Analysis
We further analyze the cases where our framework answers incorrectly and summarize the major error types. (1) Incomplete and inaccurate initialization of the working memory. This primarily occurs when a sentence describes multiple facts or contains coreference, or when an instance expresses predicates with the same meaning inconsistently even with the memory schema. This issue can be mitigated by utilizing more in-context demonstrations, initializing with a sliding window of two sentences, or using softer string matching strategies. (2) Limited number of LLM-based rule implementations. Since there may be multiple applicable rules at each step, we adopt a pruning method to restrict the maximum number of rule implementations at each step to reduce computational costs, which makes it insufficient to answer some instances. This can be improved by running more rule implementation rounds at each step. (3) Inaccurate LLM-based rule implementation, especially for challenging tasks like AR-LSAT. This requires employing backbone LLMs with more advanced reasoning capabilities.
6 Related Work
LLMs with External Memory
LLMs Touvron et al. (2023); Abdin et al. (2024) have demonstrated remarkable performance across tasks, but struggle with complex reasoning that involves memorizing or grounding long-term information from context or interaction history. Beyond extending LLMs’ context length Lee et al. (2024); Lu et al. (2024), recent advancements augment LLMs with external memory. Park et al. (2023); Guo et al. (2023) equip LLM agents with external memory modules to store and reference long-term dialogue history for better interaction. For knowledge-intensive tasks, Yue et al. (2024); Wang et al. (2024b) encode long-form context into memory for retrieval and utilization. However, previous working memory mainly stores natural language or parametric entries, making accurate referencing and updating challenging. Symbolic memory has been proposed to address this issue. ChatDB Hu et al. (2023) uses databases as symbolic memory for precise information recording and processing, but is limited to product inventory. Statler Yoneda et al. (2023) introduces symbolic world memory to maintain robot states for embodied reasoning. Our work leverages external memory to store both natural language and symbolic facts and rules, enabling more precise rule grounding for multi-step rule application.
Rule Application for Reasoning
Rules are well-established principles abstracted from broad real-world observations Wang et al. (2024a); Zhu et al. (2023), or predetermined constraints designed for specific situations Qiu et al. (2023). They serve as a crucial basis for drawing inferences in complex contexts by applying them to known facts to derive new conclusions. For example, logical reasoning Wang et al. (2021); Sun et al. (2023); Chen et al. (2023) involves applying rules to contextual facts to answer queries, with Olausson et al. (2023); Pan et al. (2023) operating in a symbolic manner. Constraint satisfaction Wang et al. (2022a) applies rules to find solutions meeting all restrictions. Complex reasoning requires multi-step deductive rule application, each step involving rule grounding and rule implementation for more faithful reasoning Sanyal et al. (2022); Creswell et al. (2022). We propose to iteratively perform these two processes in a neurosymbolic manner based on working memory.
7 Conclusion
In this paper, we augment LLMs with an external working memory and propose a neurosymbolic framework for multi-step rule application to enhance LLMs’ reasoning capabilities. The memory stores facts and rules in both natural language and symbolic forms, facilitating accurate retrieval during rule application. After writing all input facts and rules into the working memory, the framework iteratively performs symbolic rule grounding based on predicate and variable matching, followed by LLM-based rule implementation. It effectively combines the strengths of both symbolic and LLM methods. Our experiments demonstrate the framework’s superiority over CoT-based and symbolic-based baselines, and show its robustness across various rule application steps and settings. In the future, we will extend our framework to incorporate more backbone LLMs and datasets, especially on more complex and long-term reasoning tasks.
Limitations
Limitation on Experimented Datasets
Due to computational costs, our work mainly experiments with four datasets, focusing on logical reasoning, constraint satisfaction and object state tracking tasks. Future work will include a broader range of tasks and datasets to further validate our framework’s effectiveness.
Limitation on Backbone LLMs
We build our framework upon GPT-4 and GPT-3.5 to demonstrate its effectiveness with backbones of differing ability in symbolic semantic parsing and one-step rule application. We will expand our scope to include more backbone LLMs, including open-source models.
Risk of Environmental Impact
A significant risk associated with our framework is the potential increase in computational costs and environmental burden due to the extensive use of LLM APIs. This impact can be mitigated by adopting advanced open-source models, such as Llama-3-70B, that are more efficient and have less environmental impact.
References
- Abdin et al. (2024) Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219.
- Apt et al. (1997) Krzysztof R Apt et al. 1997. From logic programming to Prolog, volume 362. Prentice Hall London.
- Berglund et al. (2023) Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. 2023. The reversal curse: Llms trained on "a is b" fail to learn "b is a". arXiv preprint arXiv:2309.12288.
- Chen et al. (2023) Meiqi Chen, Yubo Ma, Kaitao Song, Yixin Cao, Yan Zhang, and Dongsheng Li. 2023. Learning to teach large language models logical reasoning. arXiv preprint arXiv:2310.09158.
- Chen et al. (2024) Xinyun Chen, Ryan A Chi, Xuezhi Wang, and Denny Zhou. 2024. Premise order matters in reasoning with large language models. arXiv preprint arXiv:2402.08939.
- Creswell et al. (2022) Antonia Creswell, Murray Shanahan, and Irina Higgins. 2022. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712.
- De Moura and Bjørner (2008) Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer.
- Guo et al. (2023) Jing Guo, Nan Li, Jianchuan Qi, Hang Yang, Ruiqiao Li, Yuzhen Feng, Si Zhang, and Ming Xu. 2023. Empowering working memory for large language model agents. arXiv preprint arXiv:2312.17259.
- Hardman and Cowan (2016) Kyle O Hardman and Nelson Cowan. 2016. Reasoning and memory: People make varied use of the information available in working memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(5):700.
- Hu et al. (2023) Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao. 2023. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901.
- Kim and Schuster (2023) Najoung Kim and Sebastian Schuster. 2023. Entity tracking in language models. arXiv preprint arXiv:2305.02363.
- Lanchantin et al. (2024) Jack Lanchantin, Shubham Toshniwal, Jason Weston, Sainbayar Sukhbaatar, et al. 2024. Learning to reason and memorize with self-notes. Advances in Neural Information Processing Systems, 36.
- Lee and Hwang (2024) Jinu Lee and Wonseok Hwang. 2024. Symba: Symbolic backward chaining for multi-step natural language reasoning. arXiv preprint arXiv:2402.12806.
- Lee et al. (2024) Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. arXiv preprint arXiv:2402.09727.
- Ling et al. (2024) Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, and Hao Su. 2024. Deductive verification of chain-of-thought reasoning. Advances in Neural Information Processing Systems, 36.
- Lu et al. (2024) Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. Longheads: Multi-head attention is secretly a long context processor. arXiv preprint arXiv:2402.10685.
- Nye et al. (2021) Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. 2021. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114.
- Olausson et al. (2023) Theo Olausson, Alex Gu, Ben Lipkin, Cedegao Zhang, Armando Solar-Lezama, Joshua Tenenbaum, and Roger Levy. 2023. LINC: A neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- OpenAI (2023) OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Pan et al. (2023) Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295.
- Park et al. (2023) Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22.
- Qiu et al. (2023) Linlu Qiu, Liwei Jiang, Ximing Lu, Melanie Sclar, Valentina Pyatkin, Chandra Bhagavatula, Bailin Wang, Yoon Kim, Yejin Choi, Nouha Dziri, et al. 2023. Phenomenal yet puzzling: Testing inductive reasoning capabilities of language models with hypothesis refinement. arXiv preprint arXiv:2310.08559.
- Sanyal et al. (2022) Soumya Sanyal, Harman Singh, and Xiang Ren. 2022. Fairr: Faithful and robust deductive reasoning over natural language. arXiv preprint arXiv:2203.10261.
- Sinha et al. (2019) Koustuv Sinha, Shagun Sodhani, Jin Dong, Joelle Pineau, and William L Hamilton. 2019. Clutrr: A diagnostic benchmark for inductive reasoning from text. arXiv preprint arXiv:1908.06177.
- Sun et al. (2023) Hongda Sun, Weikai Xu, Wei Liu, Jian Luan, Bin Wang, Shuo Shang, Ji-Rong Wen, and Rui Yan. 2023. From indeterminacy to determinacy: Augmenting logical reasoning capabilities with large language models. arXiv preprint arXiv:2310.18659.
- Tafjord et al. (2020) Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2020. Proofwriter: Generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048.
- Team et al. (2023) Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Wang et al. (2022a) Siyuan Wang, Zhongkun Liu, Wanjun Zhong, Ming Zhou, Zhongyu Wei, Zhumin Chen, and Nan Duan. 2022a. From lsat: The progress and challenges of complex reasoning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:2201–2216.
- Wang et al. (2024a) Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. 2024a. Can llms reason with rules? logic scaffolding for stress-testing and improving llms. arXiv preprint arXiv:2402.11442.
- Wang et al. (2021) Siyuan Wang, Wanjun Zhong, Duyu Tang, Zhongyu Wei, Zhihao Fan, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. Logic-driven context extension and data augmentation for logical reasoning of text. arXiv preprint arXiv:2105.03659.
- Wang et al. (2024b) Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2024b. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36.
- Wang et al. (2022b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022b. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Xu et al. (2024) Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu. 2024. Faithful logical reasoning via symbolic chain-of-thought. arXiv preprint arXiv:2405.18357.
- Yoneda et al. (2023) Takuma Yoneda, Jiading Fang, Peng Li, Huanyu Zhang, Tianchong Jiang, Shengjie Lin, Ben Picker, David Yunis, Hongyuan Mei, and Matthew R Walter. 2023. Statler: State-maintaining language models for embodied reasoning. arXiv preprint arXiv:2306.17840.
- Yue et al. (2024) Xihang Yue, Linchao Zhu, and Yi Yang. 2024. Fragrel: Exploiting fragment-level relations in the external memory of large language models. arXiv preprint arXiv:2406.03092.
- Zhong et al. (2021) Wanjun Zhong, Siyuan Wang, Duyu Tang, Zenan Xu, Daya Guo, Jiahai Wang, Jian Yin, Ming Zhou, and Nan Duan. 2021. Ar-lsat: Investigating analytical reasoning of text. arXiv preprint arXiv:2104.06598.
- Zhu et al. (2023) Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, and Hanjun Dai. 2023. Large language models can learn rules. arXiv preprint arXiv:2310.07064.
Appendix A Memory Schema Update
An example of the memory schema construction process is illustrated in Figure 6. Before each symbolic formulation, the process first looks up the memory schema to determine whether the predicates and objects it maintains cover the current fact or rule to be formulated. If they do, symbolic formulation is conducted directly based on the memory schema. If they do not, new predicates or objects are created and added to the memory schema, and the symbolic formulation proceeds based on the updated memory schema. The newly formulated facts and rules are then written into the working memory.
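The lookup-then-extend process described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `MemorySchema` class, the `write_fact` helper, and the tuple representation of facts are hypothetical, and the sketch assumes the predicate and arguments have already been parsed (in the framework, the symbolic formulation itself is produced by the LLM).

```python
from dataclasses import dataclass, field


@dataclass
class MemorySchema:
    """Vocabulary of predicates and objects seen so far (hypothetical structure)."""
    predicates: set = field(default_factory=set)
    objects: set = field(default_factory=set)

    def covers(self, predicates, objects):
        """True if every given predicate and object is already in the schema."""
        return set(predicates) <= self.predicates and set(objects) <= self.objects

    def update(self, predicates, objects):
        """Add unseen predicates and objects; return what was newly created."""
        new_predicates = set(predicates) - self.predicates
        new_objects = set(objects) - self.objects
        self.predicates |= new_predicates
        self.objects |= new_objects
        return new_predicates, new_objects


def write_fact(schema, memory, predicate, args):
    """Look up the schema, extend it if needed, then write the fact to memory."""
    if not schema.covers([predicate], args):
        schema.update([predicate], args)
    fact = (predicate, tuple(args))
    if fact not in memory:  # avoid storing duplicate entries
        memory.append(fact)
    return fact


schema = MemorySchema()
memory = []
write_fact(schema, memory, "father", ["James", "Dale"])  # creates predicate and objects
write_fact(schema, memory, "father", ["Dale", "Anna"])   # reuses "father", adds "Anna"
```

Keeping the schema separate from the stored facts means later formulations reuse existing predicate and object names instead of coining near-duplicates, which is what allows exact symbolic matching downstream.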

Appendix B Implementation Details
We implement our framework on two backbone LLMs, GPT-4 (gpt-4-turbo-0409 for CLUTRR, ProofWriter and Boxes; gpt-4o for AR-LSAT) and GPT-3.5 (gpt-3.5-turbo-0125), to test its effectiveness with different capabilities of symbolic semantic parsing and one-step rule application. For a fair comparison, we re-implement all baseline methods using the corresponding LLMs. All CoT-based baselines use the same in-context demonstrations. The generation temperature is set to 0.0 by default. For CLUTRR and ProofWriter, the maximum number of steps in our framework is set to 4, 6, and 8 for instances that actually require 2, 3-4, and 5-6 steps, respectively. For AR-LSAT, the maximum number of steps is set according to the number of rules, and for Boxes, according to the number of operational facts.
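The step budgets above can be written as a small lookup. The sketch below is only illustrative: the function name `max_steps` and its parameters are hypothetical, and for AR-LSAT and Boxes it assumes the budget simply equals the number of rules or operational facts, which the text does not state explicitly.

```python
def max_steps(dataset, actual_steps=None, num_rules=None, num_operational_facts=None):
    """Return the maximum number of rule application steps for one instance."""
    if dataset in ("CLUTRR", "ProofWriter"):
        # Budgets reported in the text: 4, 6, 8 for 2, 3-4, 5-6 actual steps.
        if actual_steps == 2:
            return 4
        if actual_steps in (3, 4):
            return 6
        if actual_steps in (5, 6):
            return 8
        raise ValueError(f"unsupported step count: {actual_steps}")
    if dataset == "AR-LSAT":
        # Assumption: budget equals the number of rules.
        return num_rules
    if dataset == "Boxes":
        # Assumption: budget equals the number of operational facts.
        return num_operational_facts
    raise ValueError(f"unknown dataset: {dataset}")
```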
Appendix C Further Experiments
C.1 Comparison with Multi-shot CoT-based Methods
Since we implement WM-Neurosymbolic using few-shot prompts to better control output formats, we conduct additional experiments to illustrate our framework’s effectiveness even when compared to CoT-based methods with multi-shot demonstrations. Specifically, we set the number of demonstrations in CoT-based methods for each dataset according to the maximum number of examples used by our framework: 2 for CLUTRR and AR-LSAT, and 3 for ProofWriter and Boxes.
Method | CLUTRR (GPT-4) | CLUTRR (GPT-3.5) | ProofWriter (GPT-4) | ProofWriter (GPT-3.5) | AR-LSAT (GPT-4) | AR-LSAT (GPT-3.5) | Boxes (GPT-4) | Boxes (GPT-3.5)
One-shot CoT-based Methods
Scratchpad-CoT | 83.83% | 57.02% | 61.33% | 49.67% | 41.25% | 30.00% | 91.85% | 15.60%
SC-CoT | 85.53% | 59.57% | 62.00% | 54.00% | 45.00% | 31.25% | 93.33% | 17.04%
Self-Notes | 74.04% | 55.74% | 62.00% | 52.67% | 47.50% | 23.75% | 92.59% | 18.52%
Multi-shot CoT-based Methods
Shot Number | 2-shot | 2-shot | 3-shot | 3-shot | 2-shot | 2-shot | 3-shot | 3-shot
Scratchpad-CoT | 86.38% | 59.57% | 64.33% | 48.00% | 52.50% | 17.50% | 97.04% | 22.22%
SC-CoT | 87.23% | 60.85% | 66.33% | 48.33% | 50.00% | 18.75% | 97.78% | 24.44%
Self-Notes | 72.76% | 54.89% | 61.67% | 56.33% | 53.75% | 21.25% | 97.04% | 25.19%
WM-Neurosymbolic | 92.34% | 78.72% | 77.33% | 58.00% | 70.00% | 35.00% | 100% | 34.29%
As shown in Table 6, using more examples in few-shot CoT prompting does not always lead to performance improvement. However, compared to both one-shot and multi-shot CoT-based methods, our framework consistently exhibits enhanced performance.
C.2 Rule Application without Rules Provided
To simulate realistic scenarios where rules are commonsense principles derived from real-world observations but not explicitly provided, we additionally evaluate our framework on the CLUTRR and Boxes datasets without pre-defined rules. Here, our working memory only stores and updates facts. In each step, we select applicable facts (those with overlapping objects) from memory, and ask the LLM to self-generate applicable rules for rule implementation until the query is resolved. As shown in Table 7, compared to the Scratchpad-CoT baseline without provided rules, our framework on top of GPT-4 still shows improvement.
Method | CLUTRR | Boxes
Scratchpad-CoT | 82.13% | 89.63%
WM-Neurosymbolic | 83.83% | 96.30%
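The fact selection used in this rule-free setting, picking facts with overlapping objects, can be sketched as below. The text only states that applicable facts are those with overlapping objects; the pairwise selection, the function name `select_applicable_facts`, and the `(predicate, args)` tuple format are assumptions made for illustration.

```python
from itertools import combinations


def select_applicable_facts(facts):
    """Return pairs of facts that share at least one object.

    Each fact is a (predicate, args) tuple, e.g. ("father", ("James", "Dale")).
    """
    pairs = []
    for first, second in combinations(facts, 2):
        if set(first[1]) & set(second[1]):  # non-empty object overlap
            pairs.append((first, second))
    return pairs


facts = [
    ("father", ("James", "Dale")),
    ("sister", ("Dale", "Anna")),
    ("mother", ("Lucy", "Tom")),
]
candidates = select_applicable_facts(facts)
```

In this example only the first two facts share an object ("Dale"), so they form the single candidate pair that would be handed to the LLM for rule generation and implementation.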
Appendix D Framework Prompts
Tables 8, 9, and 10 show the example prompts for fact initialization, rule initialization, and LLM-based rule implementation in the CLUTRR dataset. Tables 11, 12, and 13 show the example prompts for the ProofWriter dataset. Tables 14, 15, and 16 show the example prompts for the AR-LSAT dataset. Tables 17, 18, and 19 show the example prompts for the Boxes dataset.