I. Robust Reinforcement Learning (Sequential Decision-Making) against Adversarial Perturbations
Y. Liang, Y. Sun, R. Zheng, and F. Huang, “Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning”, Neural Information Processing Systems (NeurIPS), 2022. Paper Link, Code Link, BibTeX Link & Presentation Link.
Recent studies reveal that a well-trained deep reinforcement learning (RL) policy can be particularly vulnerable to adversarial perturbations on input observations. Therefore, it is crucial to train RL agents that are robust against any attacks with a bounded budget. Existing robust training methods in deep RL either treat correlated steps separately, ignoring the robustness of long-term rewards, or train the agents and an RL-based attacker together, doubling the computational burden and sample complexity of the training process. In this work, we propose a strong and efficient robust training framework for RL, named Worst-case-aware Robust RL (WocaR-RL), which directly estimates and optimizes the worst-case reward of a policy under bounded l_p attacks without requiring extra samples for learning an attacker. Experiments on multiple environments show that WocaR-RL achieves state-of-the-art performance under various strong attacks and obtains significantly higher training efficiency than prior state-of-the-art robust training methods.
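To make the idea concrete, here is a minimal, hypothetical PyTorch sketch (our own illustration, not the released WocaR-RL code; `MLPPolicy`, `q_worst`, `worst_case_value`, and `eps` are assumed names, and plain interval bound propagation stands in for the exact relaxation used in the paper). It bounds which actions an l_inf-bounded observation perturbation could force the agent to take, then reads off a worst-case value estimate without ever training an attacker:

```python
import torch
import torch.nn as nn


class MLPPolicy(nn.Module):
    """A small policy network, used only to make the sketch self-contained."""

    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden)
        self.fc2 = nn.Linear(hidden, n_actions)

    def forward(self, obs):
        return self.fc2(torch.relu(self.fc1(obs)))  # action logits


def ibp_linear(layer, lower, upper):
    """Propagate an interval [lower, upper] through a linear layer."""
    center, radius = (lower + upper) / 2, (upper - lower) / 2
    new_center = layer(center)
    new_radius = radius @ layer.weight.abs().t()
    return new_center - new_radius, new_center + new_radius


def worst_case_value(policy, q_worst, obs, eps):
    """Lower-bound the agent's value when the observation can be perturbed
    anywhere inside an l_inf ball of radius eps (no attacker is trained)."""
    lo, hi = obs - eps, obs + eps                        # input interval
    lo, hi = ibp_linear(policy.fc1, lo, hi)
    lo, hi = torch.relu(lo), torch.relu(hi)              # ReLU is monotone
    logit_lo, logit_hi = ibp_linear(policy.fc2, lo, hi)

    # An action is "reachable" if some perturbation could make it the argmax:
    # its best-case logit must beat the other actions' worst-case logits.
    reachable = logit_hi >= logit_lo.max(dim=-1, keepdim=True).values

    # q_worst is assumed to be a separately learned estimate of the
    # worst-case action values Q_wst(s, a); here it is just a callable.
    q = q_worst(obs)
    masked = torch.where(reachable, q, torch.full_like(q, float("inf")))
    return masked.min(dim=-1).values                     # worst reachable value
```

During training, a term of this form can be weighted against the standard RL objective so that the policy trades a little clean reward for a much better worst-case estimate; the exact losses and bounds in WocaR-RL differ from this sketch.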
Are deep RL agents vulnerable?
We all know deep neural networks are vulnerable to adversarially crafted perturbations, which can even be imperceptible. How about deep RL agents? The answer is that deep RL agents are even more vulnerable to adversarial perturbations. With only tiny perturbations of the observations, an attacker can drive a trained agent to the lowest possible reward in Atari games.
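For intuition, a single FGSM-style gradient step on the observation, constrained to a tiny l_inf budget, is often enough to change the action a trained policy selects. The sketch below is generic and hypothetical (any differentiable policy network; the function name and `eps` are ours):

```python
# Hypothetical sketch: one l_inf-bounded FGSM step on the observation that
# pushes the policy away from the action it would take on the clean input.
import torch
import torch.nn.functional as F

def fgsm_observation_attack(policy, obs, eps=0.01):
    # obs: (batch, obs_dim) observations the agent would normally receive.
    obs = obs.clone().requires_grad_(True)
    logits = policy(obs)
    clean_action = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, clean_action)   # make the clean action less likely
    loss.backward()
    return (obs + eps * obs.grad.sign()).detach()  # perturbed observation
```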
Why are deep RL agents so vulnerable?
The answer is long-term vulnerability. Value/policy networks powered by deep neural networks inherit the networks' vulnerabilities, but, in addition, a seemingly harmless action that yields an acceptable (or even good) immediate reward may cause catastrophic failure (i.e., a destructive cumulative reward/value).
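A toy calculation (our own, not from the paper) makes the gap between immediate and cumulative reward concrete: with a discount factor of 0.99, an action that earns +1 now but traps the agent in a failure state is far worse, in terms of return, than an action that earns nothing now and stays on a safe path.

```python
# Toy illustration: a good immediate reward can hide a disastrous return.
gamma, horizon = 0.99, 100

# Action A: +1 now, then stuck in a failure state paying -10 per step.
return_a = 1 + sum(gamma**t * (-10) for t in range(1, horizon))

# Action B: 0 now, then +1 per step along the safe path.
return_b = 0 + sum(gamma**t * 1 for t in range(1, horizon))

print(f"A: {return_a:.1f}  B: {return_b:.1f}")  # A ≈ -623, B ≈ +62
```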
A myopic attacker may mislead the agent toward the blue directions, but a strong attacker with a long-term view can lead the victim agent along the red path, which yields good rewards at first but ultimately ends in catastrophic failure.
So there is an urgent need to develop RL agents that are robust to adversarial perturbations.
SOTA Adversarial Training in Reinforcement Learning:
[ZCXLBH,’20] enforces consistent policy outputs under similar (perturbed) inputs; a sketch of this regularizer follows this list.
Pros: fast. Cons: the worst-case value is not considered.
[ZCBH,’21] alternately trains the agent and an RL-based attacker.
Pros: considers the worst case. Cons: slow, and doubles the required samples.
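As a rough sketch of the first idea (hypothetical code, not the authors' implementation; the published regularizers bound the policy's change over the whole perturbation set with convex relaxations, whereas this stand-in just samples random perturbations inside the budget):

```python
# Hypothetical sketch of "consistent output under similar inputs": penalize
# divergence between the policy's action distribution on the clean observation
# and on observations perturbed within an l_inf budget eps.
import torch
import torch.nn.functional as F

def consistency_regularizer(policy, obs, eps=0.01, n_samples=8):
    with torch.no_grad():
        clean_probs = F.softmax(policy(obs), dim=-1)
    reg = 0.0
    for _ in range(n_samples):
        delta = torch.empty_like(obs).uniform_(-eps, eps)  # random point in the budget
        noisy_log_probs = F.log_softmax(policy(obs + delta), dim=-1)
        reg = reg + F.kl_div(noisy_log_probs, clean_probs, reduction="batchmean")
    return reg / n_samples

# Training: total loss = standard RL loss + reg_coef * consistency_regularizer(...)
```

WocaR-RL aims to keep the efficiency of the first family while accounting for the worst case that the second family targets.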