<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://royf.org/feed/publications.xml" rel="self" type="application/atom+xml" /><link href="https://royf.org/" rel="alternate" type="text/html" /><updated>2026-03-12T19:33:24+00:00</updated><id>https://royf.org/feed/publications.xml</id><title type="html">Roy Fox | Publications</title><author><name>Roy Fox</name><email>royf@uci.edu</email></author><entry><title type="html">Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions</title><link href="https://royf.org/pub/Kim2025Segmentation/" rel="alternate" type="text/html" title="Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions" /><published>2025-08-07T00:00:00+00:00</published><updated>2025-08-07T00:00:00+00:00</updated><id>https://royf.org/pub/Kim2025Segmentation</id><content type="html" xml:base="https://royf.org/pub/Kim2025Segmentation/"><![CDATA[<p>Model-Based Reinforcement Learning (MBRL) has shown promise in visual control tasks due to its data efficiency. However, training MBRL agents to develop generalizable perception remains challenging, especially amid visual distractions that introduce noise in representation learning. We introduce Segmentation Dreamer (SD), a framework that facilitates representation learning in MBRL by incorporating a novel auxiliary task. Assuming that task-relevant components in images can be easily identified with prior knowledge in a given task, SD uses segmentation masks on image observations to reconstruct only task-relevant regions, reducing representation complexity. SD can leverage either ground-truth masks available in simulation or potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the image reconstruction loss to mitigate misleading learning signals from mask prediction errors. In modified DeepMind Control suite and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work and is especially effective in sparse reward tasks that had been unsolvable by prior work. We also validate its effectiveness in a real-world robotic lane-following task when training with intentional distractions for zero-shot transfer.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Selected Publications" /><category term="Model-based" /><category term="Reinforcement learning" /><summary type="html"><![CDATA[Model-Based Reinforcement Learning (MBRL) has shown promise in visual control tasks due to its data efficiency. However, training MBRL agents to develop generalizable perception remains challenging, especially amid visual distractions that introduce noise in representation learning. We introduce Segmentation Dreamer (SD), a framework that facilitates representation learning in MBRL by incorporating a novel auxiliary task. Assuming that task-relevant components in images can be easily identified with prior knowledge in a given task, SD uses segmentation masks on image observations to reconstruct only task-relevant regions, reducing representation complexity. SD can leverage either ground-truth masks available in simulation or potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the image reconstruction loss to mitigate misleading learning signals from mask prediction errors. In modified DeepMind Control suite and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work and is especially effective in sparse reward tasks that had been unsolvable by prior work. We also validate its effectiveness in a real-world robotic lane-following task when training with intentional distractions for zero-shot transfer.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Kim2025Segmentation.png" /><media:content medium="image" url="https://royf.org/pub/img/Kim2025Segmentation.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">A Variational Neural Network Model of Resource-Rational Reward Encoding in Human Planning</title><link href="https://royf.org/pub/Ying2025Variational/" rel="alternate" type="text/html" title="A Variational Neural Network Model of Resource-Rational Reward Encoding in Human Planning" /><published>2025-07-31T00:00:00+00:00</published><updated>2025-07-31T00:00:00+00:00</updated><id>https://royf.org/pub/Ying2025Variational</id><content type="html" xml:base="https://royf.org/pub/Ying2025Variational/"><![CDATA[<p>Working memory (WM) is essential for planning and decision-making, enabling us to temporarily store and manipulate information about potential future actions and their outcomes. Existing research on WM, however, has primarily considered contexts where stimuli are presented simultaneously and encoded independently. It thus remains unclear how WM dynamically manages information about reward and value during planning, when actions are evaluated sequentially in time and their cumulative values must be integrated to guide choice. To address this gap, we developed an information-theoretic model of WM allocation during planning, implemented using variational recurrent neural networks. In this model, an agent optimizes plan quality while maintaining reward information under WM constraints. To test our model, we designed a task in which participants sequentially observed the rewards available at different future states before executing a sequence of actions, attempting to maximize cumulative rewards. Our results suggest that humans preferentially maintain rewards that are most informative for plan selection, integrating both local and global factors. These findings bridge theories of WM limitations with models of human planning, revealing how cognitive constraints shape decision-making strategies.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Bounded RL" /><category term="Partial observability" /><summary type="html"><![CDATA[Working memory (WM) is essential for planning and decision-making, enabling us to temporarily store and manipulate information about potential future actions and their outcomes. Existing research on WM, however, has primarily considered contexts where stimuli are presented simultaneously and encoded independently. It thus remains unclear how WM dynamically manages information about reward and value during planning, when actions are evaluated sequentially in time and their cumulative values must be integrated to guide choice. To address this gap, we developed an information-theoretic model of WM allocation during planning, implemented using variational recurrent neural networks. In this model, an agent optimizes plan quality while maintaining reward information under WM constraints. To test our model, we designed a task in which participants sequentially observed the rewards available at different future states before executing a sequence of actions, attempting to maximize cumulative rewards. Our results suggest that humans preferentially maintain rewards that are most informative for plan selection, integrating both local and global factors. These findings bridge theories of WM limitations with models of human planning, revealing how cognitive constraints shape decision-making strategies.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Ying2025Variational.png" /><media:content medium="image" url="https://royf.org/pub/img/Ying2025Variational.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Realizable Continuous-Space Shields for Safe Reinforcement Learning</title><link href="https://royf.org/pub/Kim2025Realizable/" rel="alternate" type="text/html" title="Realizable Continuous-Space Shields for Safe Reinforcement Learning" /><published>2025-06-06T00:00:00+00:00</published><updated>2025-06-06T00:00:00+00:00</updated><id>https://royf.org/pub/Kim2025Realizable</id><content type="html" xml:base="https://royf.org/pub/Kim2025Realizable/"><![CDATA[<p>While Deep Reinforcement Learning (DRL) has achieved remarkable success across various domains, it remains vulnerable to occasional catastrophic failures without additional safeguards. An effective solution to prevent these failures is to use a shield that validates and adjusts the agent’s actions to ensure compliance with a provided set of safety specifications. For real-world robotic domains, it is essential to define safety specifications over continuous state and action spaces to accurately account for system dynamics and compute new actions that minimally deviate from the agent’s original decision. In this paper, we present the first shielding approach specifically designed to ensure the satisfaction of safety requirements in continuous state and action spaces, making it suitable for practical robotic applications. Our method builds upon realizability, an essential property that confirms the shield will always be able to generate a safe action for any state in the environment. We formally prove that realizability can be verified for stateful shields, enabling the incorporation of non-Markovian safety requirements, such as loop avoidance. Finally, we demonstrate the effectiveness of our approach in ensuring safety without compromising the policy’s success rate by applying it to a navigation problem and a multi-agent particle environment.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Reinforcement learning" /><category term="Robotics" /><category term="Safety" /><summary type="html"><![CDATA[While Deep Reinforcement Learning (DRL) has achieved remarkable success across various domains, it remains vulnerable to occasional catastrophic failures without additional safeguards. An effective solution to prevent these failures is to use a shield that validates and adjusts the agent’s actions to ensure compliance with a provided set of safety specifications. For real-world robotic domains, it is essential to define safety specifications over continuous state and action spaces to accurately account for system dynamics and compute new actions that minimally deviate from the agent’s original decision. In this paper, we present the first shielding approach specifically designed to ensure the satisfaction of safety requirements in continuous state and action spaces, making it suitable for practical robotic applications. Our method builds upon realizability, an essential property that confirms the shield will always be able to generate a safe action for any state in the environment. We formally prove that realizability can be verified for stateful shields, enabling the incorporation of non-Markovian safety requirements, such as loop avoidance. Finally, we demonstrate the effectiveness of our approach in ensuring safety without compromising the policy’s success rate by applying it to a navigation problem and a multi-agent particle environment.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Kim2025Realizable.png" /><media:content medium="image" url="https://royf.org/pub/img/Kim2025Realizable.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Verification-Guided Shielding for Deep Reinforcement Learning</title><link href="https://royf.org/pub/Corsi2024Verification/" rel="alternate" type="text/html" title="Verification-Guided Shielding for Deep Reinforcement Learning" /><published>2024-08-11T00:00:00+00:00</published><updated>2024-08-11T00:00:00+00:00</updated><id>https://royf.org/pub/Corsi2024Verification</id><content type="html" xml:base="https://royf.org/pub/Corsi2024Verification/"><![CDATA[<p>In recent years, Deep Reinforcement Learning (DRL) has emerged as an effective approach to solving real-world tasks. However, despite their successes, DRL-based policies suffer from poor reliability, which limits their deployment in safety-critical domains. Various methods have been put forth to address this issue by providing formal safety guarantees. Two main approaches include shielding and verification. While shielding ensures the safe behavior of the policy by employing an external online component (i.e., a shield) that overrides potentially dangerous actions, this approach has a significant computational cost as the shield must be invoked at runtime to validate every decision. On the other hand, verification is an offline process that can identify policies that are unsafe, prior to their deployment, yet, without providing alternative actions when such a policy is deemed unsafe. In this work, we present verification-guided shielding — a novel approach that bridges the DRL reliability gap by integrating these two methods. Our approach combines both formal and probabilistic verification tools to partition the input domain into safe and unsafe regions. In addition, we employ clustering and symbolic representation procedures that compress the unsafe regions into a compact representation. This, in turn, allows to temporarily activate the shield solely in (potentially) unsafe regions, in an efficient manner. Our novel approach allows to significantly reduce runtime overhead while still preserving formal safety guarantees. We extensively evaluate our approach on two benchmarks from the robotic navigation domain, as well as provide an in-depth analysis of its scalability and completeness.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Reinforcement learning" /><category term="Robotics" /><category term="Safety" /><summary type="html"><![CDATA[In recent years, Deep Reinforcement Learning (DRL) has emerged as an effective approach to solving real-world tasks. However, despite their successes, DRL-based policies suffer from poor reliability, which limits their deployment in safety-critical domains. Various methods have been put forth to address this issue by providing formal safety guarantees. Two main approaches include shielding and verification. While shielding ensures the safe behavior of the policy by employing an external online component (i.e., a shield) that overrides potentially dangerous actions, this approach has a significant computational cost as the shield must be invoked at runtime to validate every decision. On the other hand, verification is an offline process that can identify policies that are unsafe, prior to their deployment, yet, without providing alternative actions when such a policy is deemed unsafe. In this work, we present verification-guided shielding — a novel approach that bridges the DRL reliability gap by integrating these two methods. Our approach combines both formal and probabilistic verification tools to partition the input domain into safe and unsafe regions. In addition, we employ clustering and symbolic representation procedures that compress the unsafe regions into a compact representation. This, in turn, allows to temporarily activate the shield solely in (potentially) unsafe regions, in an efficient manner. Our novel approach allows to significantly reduce runtime overhead while still preserving formal safety guarantees. We extensively evaluate our approach on two benchmarks from the robotic navigation domain, as well as provide an in-depth analysis of its scalability and completeness.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Corsi2024Verification.png" /><media:content medium="image" url="https://royf.org/pub/img/Corsi2024Verification.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Reinforcement Learning from Delayed Observations via World Models</title><link href="https://royf.org/pub/Karamzade2024Delayed/" rel="alternate" type="text/html" title="Reinforcement Learning from Delayed Observations via World Models" /><published>2024-08-10T00:00:00+00:00</published><updated>2024-08-10T00:00:00+00:00</updated><id>https://royf.org/pub/Karamzade2024Delayed</id><content type="html" xml:base="https://royf.org/pub/Karamzade2024Delayed/"><![CDATA[<p>In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Model-based" /><category term="Partial observability" /><category term="Reinforcement learning" /><summary type="html"><![CDATA[In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Karamzade2024Delayed.png" /><media:content medium="image" url="https://royf.org/pub/img/Karamzade2024Delayed.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distraction</title><link href="https://royf.org/pub/Kim2024Segmentation/" rel="alternate" type="text/html" title="Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distraction" /><published>2024-08-09T00:00:00+00:00</published><updated>2024-08-09T00:00:00+00:00</updated><id>https://royf.org/pub/Kim2024Segmentation</id><content type="html" xml:base="https://royf.org/pub/Kim2024Segmentation/"><![CDATA[<p>Model-Based Reinforcement Learning (MBRL) has been a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to use MBRL to train agents with generalizable perception. Training with visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on Dreamer, a popular MBRL method, we propose a simple yet effective auxiliary task — to reconstruct task-relevant components only. Our method, Segmentation Dreamer (SD), works either with ground-truth masks or by leveraging potentially error-prone segmentation foundation models. In DeepMind Control suite tasks with distraction, SD achieves significantly better sample efficiency and greater final performance than comparable methods. SD is especially helpful in a sparse reward task otherwise unsolvable by prior work, enabling the training of a visually robust agent without the need for extensive reward engineering.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Workshops" /><category term="Model-based" /><category term="Reinforcement learning" /><summary type="html"><![CDATA[Model-Based Reinforcement Learning (MBRL) has been a powerful tool for visual control tasks. Despite improved data efficiency, it remains challenging to use MBRL to train agents with generalizable perception. Training with visual distractions is particularly difficult due to the high variation they introduce to representation learning. Building on Dreamer, a popular MBRL method, we propose a simple yet effective auxiliary task — to reconstruct task-relevant components only. Our method, Segmentation Dreamer (SD), works either with ground-truth masks or by leveraging potentially error-prone segmentation foundation models. In DeepMind Control suite tasks with distraction, SD achieves significantly better sample efficiency and greater final performance than comparable methods. SD is especially helpful in a sparse reward task otherwise unsolvable by prior work, enabling the training of a visually robust agent without the need for extensive reward engineering.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Kim2024Segmentation.png" /><media:content medium="image" url="https://royf.org/pub/img/Kim2024Segmentation.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills</title><link href="https://royf.org/pub/Nottingham2024SSO/" rel="alternate" type="text/html" title="Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills" /><published>2024-07-25T00:00:00+00:00</published><updated>2024-07-25T00:00:00+00:00</updated><id>https://royf.org/pub/Nottingham2024SSO</id><content type="html" xml:base="https://royf.org/pub/Nottingham2024SSO/"><![CDATA[<p>Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO’s ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Selected Publications" /><category term="Natural language" /><category term="Reinforcement learning" /><category term="Structured control" /><summary type="html"><![CDATA[Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO’s ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Nottingham2024SSO.png" /><media:content medium="image" url="https://royf.org/pub/img/Nottingham2024SSO.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Selective Perception: Learning Concise State Descriptions for Language Model Actors</title><link href="https://royf.org/pub/Nottingham2024BLINDER/" rel="alternate" type="text/html" title="Selective Perception: Learning Concise State Descriptions for Language Model Actors" /><published>2024-06-17T00:00:00+00:00</published><updated>2024-06-17T00:00:00+00:00</updated><id>https://royf.org/pub/Nottingham2024BLINDER</id><content type="html" xml:base="https://royf.org/pub/Nottingham2024BLINDER/"><![CDATA[<p>It is increasingly common for large language models (LLMs) to be applied as actors in sequential decision making problems in embodied domains such as robotics and games, due to their general world knowledge and planning abilities. However, LLMs are not natively trained for embodied decision making problems, and expressing complex state spaces in text is non-trivial. Exhaustively describing high-dimensional states leads to prohibitive inference costs and impaired task performance due to distracting or irrelevant information. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features to communicate about a state and which to leave out. In this work, we propose BLINDER (Brief Language INputs for DEcision-making Responses), a method for learning to select concise and helpful sets of state features for LLM actors. BLINDER learns a value function for task-conditioned state descriptions that approximates the likelihood that a state description will result in optimal actions. We evaluate BLINDER on the challenging video game NetHack and a real-world robotic manipulation task. We find that we are able to reduce the length of state descriptions by 87% and 99% on NetHack and robotic manipulation tasks respectively. BLINDER also improves task success rates by 158% and 54% on those same tasks and generalizes to LLM actors of various size and quality.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Selected Publications" /><category term="Natural language" /><category term="Reinforcement learning" /><category term="Robotics" /><summary type="html"><![CDATA[It is increasingly common for large language models (LLMs) to be applied as actors in sequential decision making problems in embodied domains such as robotics and games, due to their general world knowledge and planning abilities. However, LLMs are not natively trained for embodied decision making problems, and expressing complex state spaces in text is non-trivial. Exhaustively describing high-dimensional states leads to prohibitive inference costs and impaired task performance due to distracting or irrelevant information. Previous LLM actors avoid the issue by relying on hand-engineered, task-specific protocols to determine which features to communicate about a state and which to leave out. In this work, we propose BLINDER (Brief Language INputs for DEcision-making Responses), a method for learning to select concise and helpful sets of state features for LLM actors. BLINDER learns a value function for task-conditioned state descriptions that approximates the likelihood that a state description will result in optimal actions. We evaluate BLINDER on the challenging video game NetHack and a real-world robotic manipulation task. We find that we are able to reduce the length of state descriptions by 87% and 99% on NetHack and robotic manipulation tasks respectively. BLINDER also improves task success rates by 158% and 54% on those same tasks and generalizes to LLM actors of various size and quality.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Nottingham2023BLINDER.png" /><media:content medium="image" url="https://royf.org/pub/img/Nottingham2023BLINDER.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Q* Search: Heuristic Search with Deep Q-Networks</title><link href="https://royf.org/pub/Agostinelli2024Qstar/" rel="alternate" type="text/html" title="Q* Search: Heuristic Search with Deep Q-Networks" /><published>2024-06-02T00:00:00+00:00</published><updated>2024-06-02T00:00:00+00:00</updated><id>https://royf.org/pub/Agostinelli2024Qstar</id><content type="html" xml:base="https://royf.org/pub/Agostinelli2024Qstar/"><![CDATA[<p>Efficiently solving problems with large action spaces using A* search has been of importance to the artificial intelligence community for decades. This is because the computation and memory requirements of A* search grow linearly with the size of the action space. This burden becomes even more apparent when A* search uses a heuristic function learned by computationally expensive function approximators, such as deep neural networks. To address this problem, we introduce Q* search, a search algorithm that uses deep Q-networks to guide search in order to take advantage of the fact that the sum of the transition costs and heuristic values of the children of a node can be computed with a single forward pass through a deep Q-network without explicitly generating those children. This significantly reduces computation time and requires only one node to be generated per iteration. We use Q* search on different domains and action spaces, showing that Q* suffers from only a small runtime overhead as the action size increases. In addition, our empirical results show Q* search is up to 129 times faster and generates up to 1288 times fewer nodes than A* search. Finally, although obtaining admissible heuristic functions from deep neural networks is an ongoing area of research, we prove that Q* search is guaranteed to find a shortest path given a heuristic function does not overestimate the sum of the transition cost and cost-to-go of the state.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Workshops" /><category term="Planning" /><category term="Reinforcement learning" /><summary type="html"><![CDATA[Efficiently solving problems with large action spaces using A* search has been of importance to the artificial intelligence community for decades. This is because the computation and memory requirements of A* search grow linearly with the size of the action space. This burden becomes even more apparent when A* search uses a heuristic function learned by computationally expensive function approximators, such as deep neural networks. To address this problem, we introduce Q* search, a search algorithm that uses deep Q-networks to guide search in order to take advantage of the fact that the sum of the transition costs and heuristic values of the children of a node can be computed with a single forward pass through a deep Q-network without explicitly generating those children. This significantly reduces computation time and requires only one node to be generated per iteration. We use Q* search on different domains and action spaces, showing that Q* suffers from only a small runtime overhead as the action size increases. In addition, our empirical results show Q* search is up to 129 times faster and generates up to 1288 times fewer nodes than A* search. Finally, although obtaining admissible heuristic functions from deep neural networks is an ongoing area of research, we prove that Q* search is guaranteed to find a shortest path given a heuristic function does not overestimate the sum of the transition cost and cost-to-go of the state.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/Agostinelli2024Qstar.png" /><media:content medium="image" url="https://royf.org/pub/img/Agostinelli2024Qstar.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games</title><link href="https://royf.org/pub/McAleer2024SPPSRO/" rel="alternate" type="text/html" title="Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games" /><published>2024-05-10T00:00:00+00:00</published><updated>2024-05-10T00:00:00+00:00</updated><id>https://royf.org/pub/McAleer2024SPPSRO</id><content type="html" xml:base="https://royf.org/pub/McAleer2024SPPSRO/"><![CDATA[<p>In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game’s deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar’s Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games.</p>]]></content><author><name>Roy Fox</name><email>royf@uci.edu</email></author><category term="Conferences" /><category term="Selected Publications" /><category term="Multi-agent" /><category term="Partial observability" /><category term="Reinforcement learning" /><summary type="html"><![CDATA[In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game’s deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar’s Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://royf.org/pub/img/McAleer2024SPPSRO.png" /><media:content medium="image" url="https://royf.org/pub/img/McAleer2024SPPSRO.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>