Strategic planning & classification

Definitions

Pareto-optimal

A solution to a game is Pareto-optimal if no agent can be made better off without making another agent worse off, i.e. we've reached a solution for which any beneficial change for one agent is detrimental to another.

Pareto-optimality is different from a Nash equilibrium (NE): in a NE no agent can profit by unilaterally changing their own strategy, whereas at a Pareto-optimal solution an agent may still benefit from changing strategy, but only at the expense of some other agent.

Pareto-optimality is therefore often used in contexts where we care about the welfare of all agents jointly, e.g. economics.
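
As a toy illustration (my own, not from any of the papers below), a minimal sketch that checks which payoff profiles in a finite set are Pareto-optimal:

  from typing import List, Tuple

  def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
      """a Pareto-dominates b: at least as good for every agent, strictly better for some."""
      return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

  def pareto_optimal(profiles: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
      """Profiles that are not dominated by any other profile."""
      return [p for p in profiles if not any(dominates(q, p) for q in profiles)]

  # Two agents: (3, 3) and (4, 1) are Pareto-optimal, (2, 2) is dominated by (3, 3).
  print(pareto_optimal([(3, 3), (4, 1), (2, 2)]))  # -> [(3, 3), (4, 1)]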

Idea

  • Machine learning often relies on the assumption that unseen test instances of a classification problem follow the same distribution as observed data
  • This can often break down when used to make decisions about the welfare of strategic individuals
    • Knowing information about the classifier, such individuals may manipulate their attributes in order to obtain a better classification outcome
      • Often referred to as gaming
    • Result: performance of classifier may deteriorate sharply
    • In financial policy-making this problem is widely known as Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
  • Strategic classification is an attempt to address this:
    • Classification is considered as a sequential game between a player named Jury and a player named Contestant
    • Jury designs a classifier
    • Contestant may change input based on Jury's classifier
    • However, Contestant incurs a cost for these changes, according to some cost function
    • Jury's goal:
      • Achieve high classification accuracy wrt. Contestant's original input and some underlying target classification function $h$, assuming Contestant plays best response
    • Contestant's goal:
      • Achieve a favorable classification (i.e. 1 rather than -1) while taking into account the cost of achieving it

Motivation

Examples

  • Traffic predictions affect the traffic

Possible ideas

  • [ ] milli18_social_cost_strat_class suggests that there's a trade-off between the non-strategic equilibrium and the Stackelberg equilibrium
    • E.g. institution might only gain a very small increase in its utility while incurring a large social burden
    • Could imagine constructing a utility function for the institution ("social planner") which makes explicit this trade-off
  • Adding temporal dependence
    • Current approaches do not have any notion of temporal dependence encoded
    • Gaming vs. incentivized improvement could be combined by adding a "short-term" cost function for gaming and a "long-term" cost for incentivized improvement
    • Requires analysis similar to bandits, I would imagine?
    • Could also try to design a classifier which "motivates" the agent to prioritize the long-term rewards through incremental change
  • [ ] Generalize results from dong17_strat_class_from_reveal_prefer to work on convex problems with unbounded domain
    • Maybe possible to exploit some results from nesterov2017random, e.g.
      • Proof of Theorem 2

General notes

  • There are mainly two approaches to strategic adaptation:
    1. "Individuals try to game our classifier; can be construct a classifier which acts robustly even under gamification?"
    2. "Can we design classifiers that incentivize individuals to improve a desired quality?"
  • Sequence of works
    • Agents are represented by feature vectors which can be manipulated at a cost
    • Problem: design optimal learning algo. in the presence of costly strategic manipulations of feature vector
    • Gaming view hardt15_strat_class
      • Manipulations do not change the agent's underlying true label, only transform the features presented to the classifier → disadvantages the classifier
      • Initial work: hardt15_strat_class
        • Considers the problem as a two-player game, with one player being a binary classifier
      • Extensions:
        1. Assume different groups of agents have different costs, and study the disparate impact on their outcomes hu18_dispar_effec_strat_manip,milli18_social_cost_strat_class
        2. Learner does not know the distribution of agents' features or costs but must learn them through revealed preferences dong17_strat_class_from_reveal_prefer
    • Improvement view

Notes for different papers

  • hardt15_strat_class
    • Follows approach (1)
    • Formulates the problem of strategic classification as a sequential two-player game between a Jury and a Contestant:
      • Jury's payoff:

        \begin{equation*}
\mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big]
\end{equation*}
      • Contestant's payoff:

        \begin{equation*}
\mathbb{E}_{x \sim \mathcal{D}} \big[ f \big( \Delta(x) \big) - c \big( x, \Delta(x) \big) \big] 
\end{equation*}
      • where
        • $h$ is true classifier
        • $f$ is Jury's classifier
        • $c$ denotes the cost for the Contestant
        • $\Delta(x)$ is the Contestant's "gaming-mechanism", in the sense that it tries to map $x$ to some positive example, i.e. $f \big( \Delta(x) \big) = 1$, while balancing off the cost $c$
    • When using empirical examples from $\mathcal{D}$ rather than the full distribution $\mathcal{D}$ itself, they obtain a result for separable cost functions, i.e. $c(x, y) = \max \left\{ 0, c_2(y) - c_1(x) \right\}$ for functions $c_1, c_2$ s.t. $c_1(X) \subseteq c_2(X)$
      • With prob $1 - \delta$, their algorithm (Algorithm 1) $\mathcal{A}$ produces a classifier $f: X \to \left\{ -1, 1 \right\}$ so that

        \begin{equation*}
\mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big] \ge \mathrm{OPT}_h(\mathcal{D}, c) - \varepsilon
\end{equation*}

        with $\mathrm{OPT}_h(\mathcal{D}, c)$ denoting the payoff of the optimal strategy-robust classifier $f$ (the strategic maximum)

    • Generalize to arbitrary cost functions
      • This is done by generalizing from a single separable cost function to minima over a set of separable cost functions
      • Then show that all cost functions can be represented this way
      • Drawback: bound scales with the size of the set of separable cost functions used, and thus quickly becomes not-so-useful
    • They show that the problem is NP-hard for the case of general cost functions
  • roth15_watch_learn
  • miller19_strat_class_is_causal_model_disguis
    • Any procedure for designing classifiers that incentivize improvement must inevitably solve a non-trivial causal inference problem
    • Draw the following distinctions:
      • Gaming: interventions that change the classification $\hat{Y}$, but do not change the true label $Y$
      • Incentivizing improvement: inducing agents to intervene on causal features that can change the label $Y$ rather than non-causal features
  • haghtalab2020maximizing
    • Thoughts:
      • Not super-excited about this one. The most significant part of this paper focuses on obtaining “robust” linear threshold mechanisms under a lot of different assumptions, e.g. knowledge of the true classifier. A bit too specific IMO. Though I like the idea of finding approximately optimal mechanisms, where the “approximate” part comes from comparison to the gain in utility.
    • Model is similar to hardt15_strat_class
      • Instead of only considering a binary classifier, generalizes slightly by allowing an arbitrary "scoring mechanism" $g$, e.g.
        • $g: \mathbb{R}^n \to [0, 1]$ representing the probability of the input $x$ being accepted
        • $g: \mathbb{R}^n \to \left\{ 0, 1 \right\}$ representing a binary-classification
        • $g: \mathbb{R}^n \to \mathbb{R}$ representing the payoff an individual receives (larger being better)
      • Similarly to the "robust classifier" of hardt15_strat_class, they want to find some optimal scoring mechanism $g_{\mathrm{opt}} \in \mathcal{G}$
        • E.g. $f$ could indicate the overall health of an individual and $\mathcal{G}$ the set of governmental policies for setting insurance premiums based on observable features of individuals
          • Then $g_{\mathrm{opt}}$ corresponds to the policy that leads to the healthiest society on average
      • Relating to hardt15_strat_class, $f$ is the true classifier (denoted $h$) and $g$ is the Jury's classifier (denoted $f$)
    • Consider only a subset of input-features to be visible, i.e. the mechanism $g$ only sees some projection $P x$ of the true input $x$
    • Here they limit to the (known) true function $f$ being linear, i.e. $f(x) = w_f \cdot x - b_f$
      • For mechanisms $g$ they consider
        • Linear mechanisms $\mathcal{G}_{\mathrm{lin}}$, i.e. $g(x) = w_g \cdot x - b_g$ for $b_g \in \mathbb{R}$ and $w_g \in \Im(P)$ with $\norm{w_g}_2 \le R$ for some $R \in \mathbb{R}^{ + }$
        • Linear threshold (halfspace) mechanisms $\mathcal{G}_{0 - 1}$, i.e. $g(x) = \mathrm{sign}(w_g \cdot x - b_g)$ for $w_g \in \Im(P)$ with $\norm{w_g}_2 = 1$ and $b_g \in \mathbb{R}$
  • dong17_strat_class_from_reveal_prefer
    • Questions:
      • [X] Why can we not have $c_t$ ($d_t$ in paper) be the $\ell_1\text{-norm}$?
        • A: We can! We can have any norm, but we need

          \begin{equation*}
c(x, y) = \frac{1}{r} \norm{x - y}^r
\end{equation*}

          for some $r > 1$, i.e. any norm raised to some power!

        • Need the response $\Delta(x, \beta)$ to be bounded, but the assumptions only restrict $x_t$ to have bounded norm $R_1$, not the response $\Delta(x, \beta)$ ($\hat{x}(\beta)$ in paper)
          • [X] Why do we need $\Delta(x, \beta)$ to be bounded?
            • Thm. 4: sufficient condition for set of sub-gradients to be non-empty
            • Comment in paper: "boundedness of $\argmax_{\hat{x} \in \mathbb{R}^d}  u(\hat{x}, x, \beta)$ is not directly used in the proof of Thm 2, but it excludes the unrealistic situation in which the strategic agent's best response can be arbitrarily far from the origin"
              • [X] Why is this "unrealistic"? This still needs to be weighted with the increase we see in $\left\langle x, \beta \right\rangle$
                • Claim 5 tells us that you end up with the individual / agent always choosing either to not act at all or to move to infinity in the case of $f(z) = \norm{z}$, i.e. if we use the $\ell_1\text{-norm}$.
    • Problem:
      • Considers decisions through time (in contrast to hardt15_strat_class)
      • Cost function $d_t(x, \hat{x}_t)$ is unknown to the learner
      • Restricts the learner to linear classifiers $\beta_t$
      • Agent's utility is then defined

        \begin{equation*}
u_t(\hat{x}, \beta) = \left\langle \beta, \hat{x} \right\rangle - d_t(x_t, \hat{x})
\end{equation*}
        • Similar to others:
          • hardt15_strat_class: uses $f (\hat{x})$ instead of $\left\langle \beta, \hat{x} \right\rangle$ and restricts the form of $d_t$ to be separable
          • haghtalab2020maximizing: uses mechanism $g$ instead of $\left\langle \beta, \hat{x} \right\rangle$
            • Though also restricts the mechanism to be, say, linear, in which case it becomes equivalent to the above
    • Under sufficient conditions on the cost functions $c \big( \hat{x}_t(\beta_t), y_t, \beta_t \big)$ and agent cost $d_t(\hat{x}_t, x_t)$
      • Strategic agent's best response set is

        \begin{equation*}
\underset{\hat{x} \in \mathbb{R}^d}{\argmax}\ u(\hat{x}, x, \beta) = x + w^*(\beta) = x + \partial f^*(\beta)
\end{equation*}
      • Satisfied by examples:
        • Hinge loss:

          \begin{equation*}
c_h \big( \hat{x}_t, y_t, \beta_t \big) = \max \left\{ 0, 1 - y_t \cdot \left\langle \hat{x}_t, \beta_t \right\rangle \right\}
\end{equation*}
        • Logistic loss:

          \begin{equation*}
c_{\text{log}} \big( \hat{x}_t, y_t, \beta_t \big) = \log \Big( 1 + e^{- y_t \cdot \left\langle \hat{x}_t, \beta_t \right\rangle} \Big)
\end{equation*}
    • Focuses on obtaining conditions for when the cost function $c$ for the learner is convex
    • Drawbacks:
      • No high-probability guarantees, but instead we just get bounds on the expectation of the Stackelberg regret
  • perdomo20_perfor_predic
    • Describes conditions sufficient for repeated risk minimization (RRM) to converge
      • RRM refers to the notion of training a classifier, letting it "run" for a while, then retraining it as new data (drawn from the distribution induced by the deployed model) comes in, and so on (see the retraining-loop sketch after this list)
      • This paper describes the conditions under which such a retraining strategy converges
      • RRM is described by

        \begin{equation*}
\theta_{t + 1} = G \big( \mathcal{D}(\textcolor{green}{\theta_t}) \big) := \underset{\textcolor{cyan}{\theta} \in \Theta}{\argmin}\ \mathbb{E}_{Z \sim \mathcal{D}(\textcolor{green}{\theta_t})} \big[ \ell(Z; \textcolor{cyan}{\theta}) \big]
\end{equation*}
  • milli18_social_cost_strat_class
    • I like the presentation in this one
      • Similar presentation as hardt15_strat_class but in a (possibly) slightly more general way
        • E.g. their analysis includes the case of separable cost functions (but wrt. the likelihood rather than the input as in hardt15_strat_class)
    • Brings up the point that strategic classification, as presented in hardt15_strat_class, induces what they refer to as social burden
      • This is basically the expected cost of positive individuals when choosing the strategic version $\tilde{f}$ of the classifier $f$
      • Also considers how this impacts different groups of individuals in two different cases:
        • Different groups have different cost-functions
        • Different groups have different feature-distributions
      • Result: strategic classification worsens the social gap $\mathcal{G}(f)$, i.e. the difference between the social burden in the two groups, compared to non-strategic classification
  • hu18_dispar_effec_strat_manip
    • Results:
      • Stackelberg equilibrium leads to only false negatives on the disadvantaged population and only false positives on the advantaged population
      • Also consider effects of interventions in which a learner / institution subsidizes members of the disadvantaged group, lowering their costs in order to improve classification performance.
    • Mentioned in milli18_social_cost_strat_class: "In concurrent work, [?] also study negative externalities of strategic classification. In their model, they show that the Stackelberg equilibrium leads to only false negative errors on a disadvantaged population and false positives on the advantaged population. Furthermore, they show that providing a cost subsidy for disadvantaged individuals can lead to worse outcomes for everyone."
    • Opinion:
      • Their "paradoxical" result that subsidies can lead to strictly worse overall "benefit" for different groups of individuals is, imo, not that paradoxical.
        • It's a matter of whether or not you're overdoing the subsidies.
        • Using a multiplicative "decrease" of the cost function for the worse-off-group (denoted $B$), they show that if you optimize the factor of subsidies then you can overshoot and cause a strictly worse overall result.
        • Seems very intuitive to me…
      • Don't really like the overall presentation :/
      • Also make very strong assumptions to obtain their results, e.g. linearity of, basically, everything.
      • Also unclear whether or not their assumptions on the cost-functions in the multidimensional case are even valid
        • What happens when $\mathbf{y} \not\ge \mathbf{x}$? E.g. $y_1 < x_1$ but $y_2 > x_2$?
  • liu19_dispar_equil_algor_decis_makin
    • Causal
    • Problem definition:
      • Individual's rational response:
        1. Individual intervenes on the true label $Y$, setting $Y = 1$ if they decide to qualify themselves (at some cost $C$)
        2. Individual receives the reward if and only if $Y = 1$ and $\hat{Y}_{\theta} = 1$, i.e. the individual receives the reward iff the institution decides it is a qualified individual.
          • Whether or not an individual decides to set $Y = 1$ is decided rationally by weighing the expected utility

            \begin{equation*}
w \mathbb{P} \big[ \hat{Y}_{\theta} = 1 \mid Y = 1, A = a \big] - C = w \mathrm{TPR}_a(\theta) - C
\end{equation*}

            and

            \begin{equation*}
w \mathbb{P} \big[ \hat{Y}_{\theta} = 1 \mid Y = 0, A = a \big] = w \mathrm{FPR}_a(\theta)
\end{equation*}

            which boils down to doing $Y = 1$ iff

            \begin{equation*}
w \big( \mathrm{TPR}_a(\theta) - \mathrm{FPR}_a(\theta) \big) > C
\end{equation*}
          • Each group's qualification rate is therefore:

            \begin{equation*}
\pi_a^{\text{br}}(\theta) = \mathbb{P} \big[ Y = 1 \mid A = a \big] = \mathbb{P} \big[ C < w \big( \mathrm{TPR}_a(\theta) - \mathrm{FPR}_a(\theta) \big) \big] = G \Big( w \big( \mathrm{TPR}_a(\theta) - \mathrm{FPR}_a(\theta) \big) \Big)
\end{equation*}
      • Institution's rational response:
        • Expected utility of the institution is

          \begin{equation*}
p_{TP} \mathbb{P} \big[ \hat{Y}_{\theta} = 1, Y = 1 \big] - c_{FP} \mathbb{P} \big[ \hat{Y}_{\theta} = 1, Y = 0 \big] = p_{TP} \sum_{a \in \mathcal{A}}^{} \mathrm{TPR}_a(\theta) \pi_a n_a - \sum_{a \in \mathcal{A}}^{} c_{FP} \mathrm{FPR}_a(\theta) (1 - \pi_a) n_a
\end{equation*}
          • $\theta^{br}(\pi)$ denotes the maximizer of the above (a small sketch of both best responses appears after this list)
  • frazier2014incentivizing
  • zinkevich2003online
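
As mentioned in the perdomo20_perfor_predic notes above, here is a minimal sketch of a repeated risk minimization loop on a toy performative problem; the particular distribution map $\mathcal{D}(\theta)$ (a Gaussian whose mean shifts with the deployed $\theta$) and the squared loss are my own toy choices, not the paper's:

  import numpy as np

  def retrain(theta_t, n_samples=50000, eps=0.3, z0=1.0):
      """One RRM step: deploying theta_t induces D(theta_t) = N(z0 + eps * theta_t, 1),
      and the learner refits theta to minimize the empirical squared loss
      E_{Z ~ D(theta_t)}[(theta - Z)^2], whose minimizer is the sample mean."""
      rng = np.random.default_rng(0)
      z = rng.normal(loc=z0 + eps * theta_t, scale=1.0, size=n_samples)
      return z.mean()

  theta = 0.0
  for t in range(15):
      theta = retrain(theta)
  print(theta)  # converges to roughly z0 / (1 - eps) ~ 1.43, a fixed point of retraining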

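Similarly, for the liu19_dispar_equil_algor_decis_makin notes above, a small sketch of both best responses on a toy instance; the cost CDF $G$ and the per-group $(\mathrm{TPR}, \mathrm{FPR})$ values below are illustrative assumptions, not taken from the paper:

  import math

  # Toy policies: each theta is represented directly by its per-group (TPR, FPR).
  policies = {
      "lenient": {"A": (0.95, 0.40), "B": (0.90, 0.50)},
      "strict":  {"A": (0.70, 0.05), "B": (0.55, 0.10)},
  }
  w, p_TP, c_FP = 1.0, 1.0, 1.0
  n = {"A": 0.5, "B": 0.5}                                  # group sizes
  G = lambda c: 1.0 / (1.0 + math.exp(-5 * (c - 0.3)))      # assumed CDF of the cost C

  def qualification_rates(theta):
      """Individuals' best response: qualify iff C < w * (TPR_a - FPR_a)."""
      return {a: G(w * (tpr - fpr)) for a, (tpr, fpr) in policies[theta].items()}

  def institution_utility(theta, pi):
      """Expected utility p_TP * P[Yhat=1, Y=1] - c_FP * P[Yhat=1, Y=0]."""
      return sum(p_TP * tpr * pi[a] * n[a] - c_FP * fpr * (1 - pi[a]) * n[a]
                 for a, (tpr, fpr) in policies[theta].items())

  pi = qualification_rates("lenient")                        # groups' best response
  theta_br = max(policies, key=lambda th: institution_utility(th, pi))
  print(pi, theta_br)                                        # institution's best response
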
hardt15_strat_class

Notation

  • $A \triangle B$ denotes the symmetric difference between two sets $A$ and $B$, i.e. all elements which are in their union but not in their intersection:

    \begin{equation*}
A \triangle B := (A \cup B) \setminus (A \cap B) = (A \setminus B) \cup (B \setminus A)
\end{equation*}
  • Full-information game refers to the game where both the Jury and the Contestant have knowledge of the underlying distribution $\mathcal{D}$ and the true classifier $h$ at every point
  • Strategic classification game refers to when the true classifier $h$ is unknown and the Jury does not know the underlying distribution $\mathcal{D}$ but can query labeled examples $\big( x, h(x) \big)$ with $x \sim \mathcal{D}$
  • The strategic maximum is the optimal Jury's classifier $f$ in the full-information game:

    \begin{equation*}
\mathrm{OPT}_h(\mathcal{D}, c) = \max_{f: X \to \left\{ -1, 1 \right\}} \mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big]
\end{equation*}

    where $\Delta(x)$ is defined (as above)

    \begin{equation*}
\Delta(x) = \underset{y \in X}{\argmax}\ \Big( f(y) - c(x, y) \Big)
\end{equation*}

    (notice it depends on $f$). Under the assumptions that $c(x, y) \ge 0$ and that $c(x, x) = 0$, we can equivalently consider

    \begin{equation*}
f \big( \Delta(x) \big) = \max \left\{ f(y) : c(x, y) < 2 \right\}
\end{equation*}
  • Classifier based on a separable cost function

    \begin{equation*}
c_i[t](x) := 
\begin{cases}
  \ \ \ 1  & \quad \text{if } c_i(x) \ge t \\
  -1 &  \quad \text{if } c_i(x) < t
\end{cases}
\end{equation*}

    for some $t$.

  • Define the following

    \begin{equation*}
\begin{split}
  \Gamma(f) &:= \big\{ x: \max \left\{ f(y) : c(x, y) < 2 \right\} = 1 \big\} \\
  &= \left\{ x: \big( \exists y \in X \big) \big( f(y) = 1, c(x, y) < 2 \big) \right\} \\
  &= \big\{ x: c_1(x) > \min \left\{ c_2(y) : f(y) = 1 \right\} - 2 \big\}
\end{split}
\end{equation*}
  • For classifier $f$, we define

    \begin{equation*}
f'(y) := 
\begin{cases}
  \ \ \ 1 & \quad \text{if } c_2(y) \ge \min \left\{ c_2(z) : f(z) = 1 \right\} \\
  -1 & \quad \text{otherwise}
\end{cases}
\end{equation*}

    which can also be expressed as $c_2[s]$ with $s = \min \left\{ c_2(z) : f(z) = 1 \right\}$

Introduction

Model

Full-information game

The players are Jury and Contestant. Let

  • $X$ be a fixed population
  • $\mathcal{D}$ a probability distribution over $X$
  • $c: X \times X \to \mathbb{R}_{ + }$ be a fixed cost function
  • $h: X \to \left\{ -1, 1 \right\}$ be the target classifier

Then a full information (2-player) game is defined as follows:

  1. Jury publishes a classifier $f: X \to \left\{ -1, 1 \right\}$. The following information is available to the Jury:
    • cost function $c$
    • true classifier $h$
    • distribution $\mathcal{D}$
  2. Contestant produces a function $\Delta: X \to X$. The following information is available to the Contestant:
    • cost function $c$
    • Jury's classifier $f$
    • true classifier $h$
    • distribution $\mathcal{D}$

The payoffs are:

  1. Jury:

    \begin{equation*}
\mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big]
\end{equation*}
  2. Contestant:

    \begin{equation*}
\mathbb{E}_{x \sim \mathcal{D}} \big[ f \big( \Delta(x) \big) - c \big( x, \Delta(x) \big) \big] 
\end{equation*}
  • $c$ quantifies how much it costs for the Contestant to change the input to the Jury
  • $\Delta$ is the change in the input made by the Contestant
  • Full information game is an example of a Stackelberg competition
    • A Stackelberg competition is a game where the first player (Jury) has the ability to commit to her strategy (a classifier $f$) before the second player (Contestant) responds.
    • Wish to find a Stackelberg equilibrium, that is, a highest-payoff strategy for Jury, assuming best response of Contestant
      • Equivalently, a perfect equilibrium in the corresponding strategic-form game

The Contestant knows the true classifier $h$ AND the Jury's classifier $f$, while the Jury does not know the Contestant's function $\Delta$.

<2020-03-22 Sun 07:17> The Contestant knows the true classifier $h$ AND the Jury's classifier $f$, while the Jury assumes the Contestant chooses the optimal $\Delta$.

  • Best outcome for Jury is of course to find $f$ such that $\Delta(x) \in f^{-1} \big( h(x) \big)$ for all $x \in \supp (\mathcal{D})$
  • Best outcome for Contestant is to find $\Delta$ s.t. $\Delta(x)$ is always classified as a "positive" example i.e. $f \big( \Delta(x) \big) = 1$ (assuming no cost)
    • In general, we'll simply have

      \begin{equation*}
\Delta(x) = \underset{y \in X}{\argmax}\ \Big( f(y) - c(x, y) \Big)
\end{equation*}

      which will then of course maximize the payoff for the Contestant. This motivates the definition of the strategic maximum.

The strategic maximum in a full-information game is defined as

\begin{equation*}
\mathrm{OPT}_h(\mathcal{D}, c) = \max_{f: X \to \left\{ -1, 1 \right\}} \mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big]
\end{equation*}

where $\Delta(x)$ is defined (as above)

\begin{equation*}
\Delta(x) = \underset{y \in X}{\argmax}\ \Big( f(y) - c(x, y) \Big)
\end{equation*}

(notice it depends on $f$)

Suppose $c(x, x) = 0$, i.e. zero-cost for Contestant to not move. Then we have the following characterization

  1. if $f(x) = 1$, then $\Delta(x) = x$
    • We assume that if $x$ is one of the maximizers defining $\Delta$, then we pick $x$ rather than one of the other maximizers
  2. if $f(x) = - 1$, then we let

    \begin{equation*}
y := \underset{y \in X: f(y) = 1}{\argmin} c(x, y)
\end{equation*}

    then

    \begin{equation*}
\Delta(x) =
\begin{cases}
  y & \quad c(x, y) < 2 \\
  x & \quad c(x, y) \ge 2
\end{cases}
\end{equation*}
    • Here we're just trying to find the "nearest" (in cost) point $y$ such that it will be classified as a positive example (see the sketch below)
    • To see this, note that if $c(x, y) \ge 2$ then $f(y) - c(x, y) \le 1 - 2 = -1 = f(x) - c(x, x)$, so staying at $x$ is (weakly) preferred, while if $c(x, y) < 2$ then moving to $y$ yields payoff $1 - c(x, y) > -1$, which is strictly better than staying at $x$
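
To make the best response and the strategic maximum concrete, here is a minimal sketch on a toy finite population (the population, cost function and true classifier are my own toy choices, not from hardt15_strat_class); it computes $\Delta$ for a given $f$ and brute-forces $\mathrm{OPT}_h(\mathcal{D}, c)$ over all classifiers:

  import itertools

  # Toy population: one-dimensional features, uniform distribution.
  X = [0.0, 1.0, 2.0, 3.0, 4.0]
  h = lambda x: 1 if x >= 2.0 else -1          # true classifier
  c = lambda x, y: max(0.0, y - x)             # separable cost: c_2(y) = y, c_1(x) = x

  def best_response(f, x):
      """Contestant's Delta(x): argmax_y f(y) - c(x, y), ties broken in favour of staying."""
      best, best_val = x, f(x) - c(x, x)
      for y in X:
          val = f(y) - c(x, y)
          if val > best_val:
              best, best_val = y, val
      return best

  def jury_payoff(f):
      """P_{x ~ D}[h(x) = f(Delta(x))] under the uniform distribution on X."""
      return sum(h(x) == f(best_response(f, x)) for x in X) / len(X)

  # Brute-force the strategic maximum OPT_h(D, c) over all 2^|X| classifiers.
  best_f, opt = None, -1.0
  for labels in itertools.product([-1, 1], repeat=len(X)):
      table = dict(zip(X, labels))
      f = lambda y, table=table: table[y]
      payoff = jury_payoff(f)
      if payoff > opt:
          best_f, opt = dict(table), payoff
  print("OPT =", opt, "achieved by", best_f)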

Statistical classification game

  • Now neither the Jury nor the Contestant has knowledge of $\mathcal{D}$ and $h$, but instead the Jury can request samples $x \sim \mathcal{D}$ (hence "statistical" in the name)

The players are Jury and Contestant. Let

  • $X$ be a fixed population
  • $\mathcal{D}$ a probability distribution over $X$
  • $c: X \times X \to \mathbb{R}_{ + }$ be a fixed cost function
  • $h: X \to \left\{ -1, 1 \right\}$ be the target classifier

A statistical classification game is then defined as follows:

  1. Jury can request labeled examples of the form $\big( x, h(x) \big)$ with $x \sim \mathcal{D}$, and publishes a classifier $f: X \to \left\{ -1, 1 \right\}$. The following information is available to the Jury:
    • the cost function $c$
  2. Contestant produces a function $\Delta: X \to X$. The following information is available to the Contestant:
    • cost function $c$
    • Jury's classifier $f$

The payoffs are:

  • Jury:

    \begin{equation*}
\mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big]
\end{equation*}
  • Contestant:

    \begin{equation*}
\mathbb{E}_{x \sim \mathcal{D}} \big[ f \big( \Delta(x) \big) - c \big( x, \Delta(x) \big) \big]
\end{equation*}

Strategy-robust learning

Let $\mathcal{A}$ be an algorithm producing a classifier $f: X \to \left\{ -1, 1 \right\}$.

If for

  • all distributions $\mathcal{D}$
  • for all classifiers $h$
  • for all $c \in \mathcal{C}$ ($\mathcal{C}$ denoting the space of cost functions)
  • for all $\varepsilon, \delta > 0$

given a description of $c$ and access to labeled examples of the form $\big( x, h(x) \big)$ where $x \sim \mathcal{D}$, $\mathcal{A}$ produces a classifier $f$ so that

\begin{equation*}
\mathbb{P}_{x \sim \mathcal{D}} \big[ h(x) = f \big( \Delta(x) \big) \big] \ge \mathrm{OPT}_h(\mathcal{D}, c) - \varepsilon \quad \text{wp.} \quad 1 - \delta
\end{equation*}

where

\begin{equation*}
\Delta(x) = \underset{y \in X}{\argmax}\ \Big( f(y) - c(x, y) \Big)
\end{equation*}

Then we say that $\mathcal{A}$ is a strategy-robust learning algorithm for $\mathcal{C}$.

One might expect, as with PAC-learning, that a strategy-robust learning algorithm would need to restrict $h$ to some concept class $\mathcal{H}$; however, this will turn out not to be the case!

Separable cost functions

A cost function $c(x, y)$ is called separable if it can be written as

\begin{equation*}
c(x, y) = \max \left\{ 0, c_2(y) - c_1(x) \right\}
\end{equation*}

for functions $c_1, c_2 : X \to \mathbb{R}$ satisfying $c_1(X) \subseteq c_2(X)$.

  • The condition $c_1(X) \subseteq c_2(X)$ means that there is always a 0-cost option, i.e. the Contestant can always reach some point without paying anything
    • Since it implies that for every $x \in X$ there exists $z \in X$ with $c_2(z) = c_1(x)$, and hence $c(x, z) = 0$

For a class $\mathcal{F}$ of functions $f: X \to \mathbb{R}$, the Rademacher complexity of $\mathcal{F}$ with sample size $m$ is defined as

\begin{equation*}
R_m(\mathcal{F}) := \mathbb{E}_{x_1, \dots, x_m \sim \mathcal{D}} \mathbb{E}_{\sigma_1, \dots, \sigma_m} \Bigg[ \sup_{f \in \mathcal{F}} \bigg( \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \bigg) \Bigg]
\end{equation*}

where $\sigma_1, \dots, \sigma_m$ are i.i.d. Rademacher random variables, i.e. $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$.

Rademacher random variables can be considered as a "random classifier", and thus $R_m(\mathcal{F})$ kind of describes how well we can correlate the classifiers in $\mathcal{F}$ with a random classifier.
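
As a quick illustration (mine, not from the paper), the inner expectation over $\sigma$ for a fixed sample can be estimated by Monte Carlo for a finite function class; the outer expectation over $x_1, \dots, x_m \sim \mathcal{D}$ would be an additional average over samples:

  import random

  def empirical_rademacher(functions, xs, n_trials=2000):
      """Monte-Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(x_i) ]
      for a finite class `functions` on the fixed sample `xs`."""
      m = len(xs)
      total = 0.0
      for _ in range(n_trials):
          sigma = [random.choice([-1, 1]) for _ in range(m)]
          total += max(sum(s * f(x) for s, x in zip(sigma, xs)) / m for f in functions)
      return total / n_trials

  # Toy class of 1-d threshold classifiers on a small sample.
  thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]
  print(empirical_rademacher(thresholds, xs=list(range(5))))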

$\Gamma(f)$ is the set of $x \in X$ s.t. $f \big( \Delta(x) \big) = 1$ when $\Delta$ is the best response of Contestant.

$\Big( \Gamma(f) \subseteq \left\{ x \in X \mid f \big( \Delta(x) \big) = 1 \right\} \Big)$: Let $x \in \Gamma(f)$.

  • Suppose $f(x) = 1$, then $\Delta(x) = x$, so $f \big( \Delta(x) \big) = 1$ and we're done.
  • Suppose $f(x) = -1$, but that $\exists z \in X$ s.t. $f(z) = 1$.

    \begin{equation*}
f(z) - c(x, z) = 1 - c(x, z)
\end{equation*}

    The question is then whether or not the argmax is outside or inside the 2-ball. If $c(x, z) > 2$, we immediately have

    \begin{equation*}
f(z) - c(x, z) = 1 - c(x, z) < 1 - 2 = -1 = f(x) - c(x, x)
\end{equation*}

    i.e. $\Delta(x) \ne z$ since $\Delta(x)$ aims to maximize the above and staying at $x$ is strictly a better option. Hence $\Delta(x)$ must be some $z$ with $f(z) = 1$ and $c(x, z) < 2$; such a $z$ is guaranteed to exist since $x \in \Gamma(f)$, and moving to it beats staying at $x$, so $f \big( \Delta(x) \big) = 1$.

Therefore, if $x \in \Gamma(f)$ then $f \big( \Delta(x) \big) = 1$, i.e. $\Gamma(f) \subseteq \left\{ x \in X \mid f \big( \Delta(x) \big) = 1 \right\}$.

$\Big( \left\{ x \in X \mid f \big( \Delta(x) \big) = 1 \right\} \subseteq \Gamma(f) \Big)$: Let $x \in X$ be s.t. $f \big( \Delta(x) \big) = 1$. Suppose, for the sake of contradiction, that $c \big( x, \Delta(x) \big) > 2$. Then the Contestant's payoff at $\Delta(x)$ is $1 - c \big( x, \Delta(x) \big) < -1$, whereas staying at $x$ gives payoff $f(x) - c(x, x) = f(x) \ge -1$, contradicting the optimality of $\Delta(x)$. Since, by assumption, $c(x, z) \ne 2$ for all $z \in X$, we must therefore have $c \big( x, \Delta(x) \big) < 2$, and so $\Delta(x)$ itself witnesses $x \in \Gamma(f)$ (it satisfies $f \big( \Delta(x) \big) = 1$ and $c \big( x, \Delta(x) \big) < 2$).

Note that

\begin{equation*}
\min \left\{ c_2(y) : f'(y) = 1 \right\} = \min \Big\{ c_2(y) : c_2(y) \ge \min \left\{ c_2(z) : f(z) = 1 \right\} \Big\} = \min \left\{ c_2(y) : f(y) = 1 \right\}
\end{equation*}

Hence

\begin{equation*}
\begin{split}
  \Gamma(f') &= \left\{ x: c_1(x) > \min \left\{ c_2(y) : f'(y) = 1 \right\} - 2 \right\} \\
  &= \left\{ x: c_1(x) > \min \left\{ c_2(y) : f(y) = 1 \right\} - 2 \right\} \\
  &= \Gamma(f)
\end{split}
\end{equation*}

And to see that we can equivalently consider the classifier $f'$, we can rewrite the Jury's payoff as

\begin{equation*}
\begin{split}
  \mathbb{P} \Big\{ h(x) = \max \left\{ f(y) : c(x, y) < 2 \right\} \Big\} &= \mathbb{P} \left\{ x \in \Big( \Gamma(f) \triangle \left\{ y : h(y) = 1 \right\} \Big)^\complement  \right\}
\end{split}
\end{equation*}

where

  • $\Gamma(f) \triangle h^{-1}(\left\{ 1 \right\})$ is basically all $y \in X$ such that $f(y) = 1$ and $h(y) = -1$ OR $f(y) = -1$ and $h(y) = 1$, i.e. all the mistakes
  • Complement of the above is all the correct classifications

Noting that if we did the same for $f'$, the only necessary change would be $\Gamma(f) \mapsto \Gamma(f')$:

\begin{equation*}
\begin{split}
  \mathbb{P} \Big\{ h(x) = \max \left\{ f(y) : c(x, y) < 2 \right\} \Big\} &= \mathbb{P} \left\{ x \in \Big( \Gamma(\textcolor{green}{f}) \triangle \left\{ y : h(y) = 1 \right\} \Big)^\complement  \right\} \\
  &= \mathbb{P} \left\{ x \in \Big( \Gamma(\textcolor{green}{f'}) \triangle \left\{ y : h(y) = 1 \right\} \Big)^\complement  \right\} \\
  &= \mathbb{P} \Big\{ h(x) = \max \left\{ f'(y) : c(x, y) < 2 \right\} \Big\}
\end{split}
\end{equation*}

Note that

\begin{equation*}
\Gamma(f) = \Gamma(f') = \Gamma(c_2[s]) = \left\{ x: c_1(x) > s - 2 \right\} = \big\{ x: c_1[s - 2](x) = 1 \big\} 
\end{equation*}

where

\begin{equation*}
s = \min \big\{ c_2(y) : f(y) = 1 \big\} 
\end{equation*}
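
Putting the above together, a rough sketch of a learner for separable costs: score each threshold classifier $c_2[s]$ by the empirical Jury payoff, computed through $\Gamma(c_2[s]) = \left\{ x : c_1(x) > s - 2 \right\}$. This is my own sketch in the spirit of this reduction, not necessarily the paper's exact Algorithm 1:

  def learn_threshold(samples, c1, c2):
      """samples: list of (x, h(x)).  Chooses s among observed c2-values (and +inf)
      so that the post-gaming prediction '1 iff c1(x) > s - 2' maximizes empirical
      agreement with h; the learner would then publish the classifier c_2[s]."""
      candidates = sorted({c2(x) for x, _ in samples}) + [float("inf")]
      best_s, best_acc = None, -1.0
      for s in candidates:
          # Under the separable cost, exactly the x with c1(x) > s - 2 end up classified +1.
          acc = sum((1 if c1(x) > s - 2 else -1) == y for x, y in samples) / len(samples)
          if acc > best_acc:
              best_s, best_acc = s, acc
      return best_s, best_acc

  # Toy example: c1(x) = c2(x) = x on the integers, true label 1 iff x >= 5.
  samples = [(x, 1 if x >= 5 else -1) for x in range(10)]
  print(learn_threshold(samples, c1=lambda x: x, c2=lambda x: x))  # -> (6, 1.0)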

miller19_strat_class_is_causal_model_disguis

Notation

  • Exogenous variables refer to variables whose value is determined outside the model and is imposed on the model
  • Endogenous variables refer to variables whose value is determined by or inside the model.
  • $U = \big( U_1, \dots, U_n \big)$ exogenous variables
  • $X = \big( X_1, \dots, X_n \big)$ endogenous variables
  • In a structural causal model (SCM), a set of structural equations determine the values of the endogenous variables:

    \begin{equation*}
X_i = g_i(\mathbf{PA}_i, U_i), \quad i = 1, \dots, n
\end{equation*}

    where $\mathbf{PA}_i$ represents the other endogenous variables that determine $X_i$, and $U_i$ represents exogenous noise due to unmodeled factors

  • An intervention is a modification to the structural equations of an SCM, e.g. an intervention may consist of replacing the structural equation $X_i = g_i(\mathbf{PA}_i, U_i)$ with a new structural equation $X_i := x_i$ that holds $X_i$ at a fixed value
  • $Z_{X := x}$ denotes the variable $Z$ in the modified SCM with structural equation $X := x$, where $X$ and $Z$ are two endogenous nodes of the SCM
  • $Z(u)$ denotes the deterministic value of the endogenous variable when the exogenous variables $U$ are equal to $u$
  • Given some event $E$, we denote $Z_{X := x}(E)$ the rv. $Z$ in the modified SCM with structural equations $X := x$ where the distribution of the exogenous variables $U$ is updated by conditioning on the event $E$; this is what's referred to as a counterfactual
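
To make the gaming vs. improvement distinction concrete, here is a tiny toy SCM (my own illustration, not from the paper): $X_1$ is a causal parent of $Y$, $X_2$ is a non-causal proxy used by the classifier, and we compare intervening on each.

  import random

  def sample_scm(do_x1=None, do_x2=None):
      """Toy SCM: U ~ N(0, 1); X1 := U; Y := 1 if X1 > 0 else -1; X2 := X1 + noise.
      An intervention X_i := x_i replaces the corresponding structural equation."""
      u = random.gauss(0, 1)
      x1 = u if do_x1 is None else do_x1
      y = 1 if x1 > 0 else -1                      # Y is caused by X1 only
      x2 = (x1 + random.gauss(0, 0.1)) if do_x2 is None else do_x2
      y_hat = 1 if x2 > 0 else -1                  # classifier only sees the proxy X2
      return y, y_hat

  random.seed(0)
  n = 10000
  gaming = [sample_scm(do_x2=5.0) for _ in range(n)]    # intervene on the non-causal feature
  improve = [sample_scm(do_x1=5.0) for _ in range(n)]   # intervene on the causal feature
  print("gaming:  P[Yhat=1] =", sum(yh == 1 for _, yh in gaming) / n,
        " P[Y=1] =", sum(y == 1 for y, _ in gaming) / n)
  print("improve: P[Yhat=1] =", sum(yh == 1 for _, yh in improve) / n,
        " P[Y=1] =", sum(y == 1 for y, _ in improve) / n)

Gaming flips $\hat{Y}$ without moving $Y$ (here $\mathbb{P}[Y = 1]$ stays at roughly $1/2$), whereas intervening on the causal feature changes $Y$ itself.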

haghtalab2020maximizing

Notation

dong17_strat_class_from_reveal_prefer

  • Institution's utility (Stackelberg regret):

    \begin{equation*}
\mathcal{R}_S(A, B) = \sum_{t = 1}^{n} \ell \big( \hat{x}_t(\beta_t), y_t, \beta_t \big) - \min_{\beta \in K} \sum_{t = 1}^{n} \ell \big( \hat{x}_t(\beta), y_t, \beta \big)
\end{equation*}

    where

    • $\ell$ denotes the loss function for the classifier
    • $A = (a_1, a_2, \dots, a_n)$ denotes the sequence of agents
    • $B = (\beta_1, \beta_2, \dots, \beta_n)$ denotes the sequence of classifier-parameters
  • Individual's utility:

    \begin{equation*}
u_t(\hat{x}, \beta) = \left\langle \beta, \hat{x} \right\rangle - c_t(x_t, \hat{x})
\end{equation*}
  • Each round $t$:
    1. Institution commits to linear classifier parameterized by $\beta_t \in K$
    2. An adversary, oblivious to the institution's choices up to round $t$, selects an agent $a_t = (x_t, y_t, c_t)$
    3. Agent $a_t$ sends the data point $\hat{x}_t$ to the learner:
      • If $y_t = -1$, then the agent is strategic and sends the best response $\hat{x}_t(\beta_t)$ defined:

        \begin{equation*}
\hat{x}_t(\beta_t) \in \argmax_{x} u_t(x, \beta_t)
\end{equation*}
      • If $y_t = 1$, then the agent is non-strategic and sends the true features to the institution

        \begin{equation*}
\hat{x}_t(\beta_t) = x_t
\end{equation*}
      • In conclusion, the revealed feature $\hat{x}_t$ is defined

        \begin{equation*}
\hat{x}_t(\beta_t) = 
\begin{cases}
  \argmax_{x} u_t(x, \beta_t) & \quad \text{if strategic } (y = -1)\\
  x_t & \quad \text{if non-strategic } (y = 1)
\end{cases}
\end{equation*}
    4. Institution observes $y_t$ and experiences classification loss $\ell_t(\beta_t) = \ell \big( \hat{x}_t, y_t, \beta_t \big)$
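
A minimal sketch of one such round, under the additional assumption (mine, for concreteness) of a quadratic agent cost $d_t(x, \hat{x}) = \frac{1}{2} \norm{\hat{x} - x}_2^2$, for which the strategic best response has the closed form $\hat{x} = x + \beta$, and hinge loss for the institution:

  import numpy as np

  def agent_report(x, y, beta):
      """Revealed feature: a strategic agent (y = -1) best-responds to the quadratic
      cost d(x, xhat) = 0.5 * ||xhat - x||^2, which gives xhat = x + beta;
      a non-strategic agent (y = +1) reports truthfully."""
      return x + beta if y == -1 else x

  def hinge_loss(xhat, y, beta):
      return max(0.0, 1.0 - y * float(np.dot(xhat, beta)))

  rng = np.random.default_rng(0)
  beta = np.array([1.0, -0.5])          # institution's committed linear classifier
  x, y = rng.normal(size=2), -1         # a strategic agent with true features x

  xhat = agent_report(x, y, beta)
  print("true x:", x, " reported xhat:", xhat)
  print("loss on reported features:", hinge_loss(xhat, y, beta))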

Assumptions:

  • $x_t$ has bounded norm $R_1$
  • $K \subseteq \mathbb{R}^d$ is convex, contains the $\ell_2 \text{-ball}$ and has norm bounded by $R_2$, i.e. $\norm{\beta}_2 \le R_2$ for all $\beta \in K$

Conditions for problem to be convex:

  • Approach:
    1. Show that both choices of logistic and hinge loss for $\ell$ are convex if $\beta \mapsto \left\langle \hat{x}_t(\beta), \beta \right\rangle$ is convex.
      • Show this for the case where $c_t(\hat{x}, x) = f(\hat{x} - x)$ for some $f$ satisfying some reasonable assumptions:
        • $f: \mathbb{R}^d \to \mathbb{R}$ is positive homogeneous of degree $k$, i.e.

          \begin{equation*}
f(\alpha x) = \alpha^k f(x), \quad \forall x \in \mathbb{R}^d
\end{equation*}

          for any $\alpha \in \mathbb{R}^{ + }$

Notation

  • $\hat{x}$ and $\hat{x}(\beta, x, y)$ denote the "manipulated" feature derived from the original feature $x$. Assume manipulations occur only if the label is "bad", i.e.

    \begin{equation*}
\hat{x}(\beta, x, y) =
\begin{cases}
  x, \quad &  \text{if } y = 1 \\
  \underset{\tilde{x} \in \mathbb{R}^d}{\argmax}\ u(\tilde{x}, x, \beta), \quad & \text{if } y = -1
\end{cases}
\end{equation*}
  • $d(x, \hat{x})$ and $d_t(x, \hat{x})$ represent the cost for the agent to change his features from $x$ to $\hat{x}$
  • $c(\hat{x}, y, \beta)$ and $c_t(\hat{x}_t, y_t, \beta_t)$ denote the loss of the learner
  • A strategic agent $a = \big( x, y, d \big)$ (with $y = -1$), when facing a classifier $\beta$, will misreport his feature vector $x$ as $\hat{x}(\beta)$ by solving the maximization problem

    \begin{equation*}
\hat{x} (\beta) \in \underset{\hat{x} \in \mathbb{R}^d}{\argmax}\ u(\hat{x}, x, \beta)
\end{equation*}

    where the utility of the agent $a$ is defined

    \begin{equation*}
u(\hat{x}, x, \beta) := \left\langle \hat{x}, \beta \right\rangle - d(\hat{x}, x)
\end{equation*}
  • $\partial c_t \big( \beta_t \big)$ denotes the subgradients of $c_t$
  • Stackelberg regret: for a history involving $n$ agents $A = (a_1, a_2, \dots, a_n)$ and the sequence of $n$ classifiers $B = (\beta_1, \beta_2, \dots, \beta_n)$, the Stackelberg regret is defined to be

    \begin{equation*}
\mathcal{R}_{S}(A, B) := \sum_{t = 1}^{n} c \big( \hat{x}_t(\beta_t), y_t, \beta_t \big) - \min_{\beta \in K} \sum_{t = 1}^{n} c \big( \hat{x}_t(\beta), y_t, \beta \big)
\end{equation*}

    where the minimizing $\beta$ denotes the optimal fixed classifier in hindsight (taking into account that the agents would behave differently against it)

  • In each strategic round, since $\hat{x}_t$ is a function of $\beta_t$, we do not have direct access to the subgradients of $c_t$ and only see 0-th order information $c_t(\beta_t)$. In this case we instead work with a "smoothed" version of the classification loss, whose gradient can be estimated from this 0-th order feedback (see the one-point gradient estimator sketch after this list): for any $c_t$ and parameter $\delta$, the $\delta$-smoothed version of $c_t$ is defined as flaxman04_onlin_convex_optim_bandit_settin

    \begin{equation*}
\overline{c}_t \big( \beta \big) := \mathbb{E} \big[ c_t \big( \beta + \delta S \big) \big]
\end{equation*}

    where $S$ is a random vector drawn from the uniform distribution over the unit sphere $\mathbb{S}^{d - 1}$.

  • Convex conjugate of a function $f: \mathbb{R}^d \to \mathbb{R}$, denoted $f^*: \mathbb{R}^d \to \mathbb{R}$ is defined

    \begin{equation*}
f^* \big( x^* \big) := \sup_{x \in \mathbb{R}^d} \Big( \left\langle x^*, x \right\rangle - f(x) \Big)
\end{equation*}

    and is always convex
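
As referenced above, a minimal sketch of the one-point (bandit) gradient estimator behind this smoothing, in the spirit of flaxman04_onlin_convex_optim_bandit_settin; the $d / \delta$ normalization is the standard one for the ball-averaged smoothing and should be treated as an assumption here:

  import numpy as np

  def one_point_gradient_estimate(c_t, beta, delta, rng):
      """Zeroth-order gradient estimate: query the loss once at a randomly perturbed
      point beta + delta * S, with S uniform on the unit sphere, and return
      (d / delta) * c_t(beta + delta * S) * S."""
      d = beta.shape[0]
      s = rng.normal(size=d)
      s /= np.linalg.norm(s)                 # S uniform on the unit sphere
      return (d / delta) * c_t(beta + delta * s) * s

  # Sanity check on a smooth convex loss c(beta) = ||beta||^2 (true gradient 2 * beta).
  rng = np.random.default_rng(0)
  beta = np.array([1.0, -2.0, 0.5])
  estimates = [one_point_gradient_estimate(lambda b: float(b @ b), beta, 0.05, rng)
               for _ in range(20000)]
  print(np.mean(estimates, axis=0))          # approximately 2 * beta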

Main take-away

  • Problem:
    • Considers decisions through time (in contrast to hardt15_strat_class)
    • Cost function $d_t(x, \hat{x}_t)$ is unknown to the learner
    • Restricts the learner to linear classifiers $\beta_t$
    • Agent's utility is then defined

      \begin{equation*}
u_t(\hat{x}, \beta) = \left\langle \beta, \hat{x} \right\rangle - d_t(x_t, \hat{x})
\end{equation*}
      • Similar to others:
        • hardt15_strat_class: uses $f (\hat{x})$ instead of $\left\langle \beta, \hat{x} \right\rangle$ and restricts the form of $d_t$ to be separable
        • haghtalab2020maximizing: uses mechanism $g$ instead of $\left\langle \beta, \hat{x} \right\rangle$
          • Though also restricts the mechanism to be, say, linear, in which case it becomes equivalent to the above
  • Under sufficient conditions on the cost functions $c \big( \hat{x}_t(\beta_t), y_t, \beta_t \big)$ and agent cost $d_t(\hat{x}_t, x_t)$
    • Strategic agent's best response set is

      \begin{equation*}
\underset{\hat{x} \in \mathbb{R}^d}{\argmax}\ u(\hat{x}, x, \beta) = x + w^*(\beta) = x + \partial f^*(\beta)
\end{equation*}
    • Satisfied by examples:
      • Hinge loss:

        \begin{equation*}
c_h \big( \hat{x}_t, y_t, \beta_t \big) = \max \left\{ 0, 1 - y_t \cdot \left\langle \hat{x}_t, \beta_t \right\rangle \right\}
\end{equation*}
      • Logistic loss:

        \begin{equation*}
c_{\text{log}} \big( \hat{x}_t, y_t, \beta_t \big) = \log \Big( 1 + e^{- y_t \cdot \left\langle \hat{x}_t, \beta_t \right\rangle} \Big)
\end{equation*}
  • Focuses on obtaining conditions for when the cost function $c$ for the learner is convex
  • Drawbacks:
    • No high-probability guarantees, but instead we just get bounds on the expectation of the Stackelberg regret

Conditions for a Convex Learning Problem

  • This section describes the general conditions under which the learner's problem in our setting can be cast as a convex minimization problem
    • Even if the original loss function is convex wrt. $\beta$ (keeping $\hat{x}$ and $y$ fixed), it may fail to be convex in $\beta$ once $\hat{x}$ is chosen strategically as a function of $\beta$

4. Regret Minimization

4.1 Convex Optimization with Mixture Feedback

Let

  • $T_1 := \left\{ t: y_t = 1, t = 1, 2, \dots, n \right\}$ be the non-strategic rounds
  • $T_{-1} := \left\{ t: y_t = -1, t = 1, 2, \dots, n \right\}$ be the strategic rounds

Then for any $\beta \in K$

\begin{equation*}
\mathbb{E}  \Bigg[ \sum_{t=1}^{n} c_t \big( \beta_t^{ + } \big) - c_t(\beta) \Bigg] \le \mathbb{E} \Bigg[ \sum_{t \in T_{ - 1}}^{} \overline{c}_t(\beta_t) - \overline{c}_t(\beta) \Bigg] + \mathbb{E} \Bigg[ \sum_{t \in T_1}^{} c_t(\beta_t) - c_t(\beta) \Bigg] + 3 n L \delta
\end{equation*}

where the expectation is wrt. the randomness of the algorithm.

\begin{equation*}
\underbrace{\Big( c_t \big( \beta_t^{ + } \big) - \overline{c}_t \big( \beta_t \big) \Big)}_{\le 2 L \delta} + \underbrace{\Big( \overline{c}_t\big(\beta\big) - c_t\big(\beta\big) \Big)}_{\le L \delta} \le 3 L \delta
\end{equation*}

which implies

\begin{equation*}
c_t \big( \beta_{t}^{ + } \big) - c_t\big(\beta\big) \le \overline{c}_t \big( \beta_t \big) - \overline{c}_t \big( \beta \big) + 3 L \delta
\end{equation*}
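
The per-round bounds in the braces follow from Lipschitzness. Spelled out, under the assumptions (as I read the notation) that each $c_t$ is $L$-Lipschitz and that $\beta_t^{ + } = \beta_t + \delta S_t$ is the perturbed point actually played:

\begin{equation*}
\big| c_t \big( \beta_t^{ + } \big) - c_t \big( \beta_t \big) \big| \le L \norm{\delta S_t}_2 = L \delta, \qquad
\big| \overline{c}_t \big( \beta \big) - c_t \big( \beta \big) \big| \le \mathbb{E} \big[ \big| c_t \big( \beta + \delta S \big) - c_t \big( \beta \big) \big| \big] \le L \delta
\end{equation*}

so that $c_t \big( \beta_t^{ + } \big) - \overline{c}_t \big( \beta_t \big) \le \big( c_t ( \beta_t^{ + } ) - c_t ( \beta_t ) \big) + \big( c_t ( \beta_t ) - \overline{c}_t ( \beta_t ) \big) \le 2 L \delta$, and summing the resulting per-round inequality over the $n$ rounds gives the $3 n L \delta$ term.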

References

Bibliography