Balanced poker strategy
Introduction
Having played more poker and chess than is strictly necessary for a productive life, I’m struck by the ‘brittleness’ of poker strategy. What I mean by this is that equilibrium poker strategies are balanced in the way that a pencil balancing on its point is balanced. In chess, by contrast, a move that would be strong against a beginner would likely be the best try against a grandmaster too.
This article considers how a computer constructs an unexploitable poker strategy, and discusses how useful (or otherwise) this is for practical play.
Consider the game of rock / paper / scissors, played many times in succession against the same opponent. Suppose we realise that our opponent plays [rock, paper, scissors] in the proportions [40%, 30%, 30%]. Clearly, we should lean towards playing paper (as paper beats rock), but in fact the optimal strategy is to play paper not just 60% or 70% of the time, but 100% of the time.
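To check the arithmetic, here is a quick sketch in Python (the payoff matrix is written from our perspective: +1 for a win, -1 for a loss):

```python
import numpy as np

opp = np.array([0.4, 0.3, 0.3])     # opponent's rock/paper/scissors frequencies
payoff = np.array([[ 0, -1,  1],    # our rock vs their rock/paper/scissors
                   [ 1,  0, -1],    # our paper
                   [-1,  1,  0]])   # our scissors
print(payoff @ opp)  # [0.0, 0.1, -0.1]: paper is strictly best, so any time
                     # spent on rock or scissors simply gives up expected value
```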
Analogously, if we are at the river in poker with a bluff catcher, and our opponent makes a pot-sized bet, then if our opponent is perfectly balanced we would be indifferent between folding and calling. But it’s almost impossible to be perfectly balanced. If we perceive that our opponent bluffs too often then we should call 100% of the time. We’re getting 2:1 odds and we think our opponent is bluffing more than 1/3 of the time, so calling is on average profitable and folding is on average unprofitable. If they bluff too little, on the other hand, then as an exploit we should fold 100% of the time.
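The same knife-edge in numbers, a quick sketch assuming a $100 pot and a pot-sized bet:

```python
pot, bet = 100, 100                     # opponent bets pot on the river
for p_bluff in (0.25, 1/3, 0.40):
    ev_call = p_bluff * (pot + bet) - (1 - p_bluff) * bet
    print(f"bluff frequency {p_bluff:.0%}: EV(call) = {ev_call:+.2f} vs EV(fold) = 0")
```

At exactly 1/3 the two options are worth the same; a few points either side and the correct frequency jumps to calling 0% or 100% of the time.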
Counterfactual regret minimisation
If this seems counter-intuitive, let us teach a computer to play rock / paper / scissors, using a machine learning algorithm known as “counterfactual regret minimisation”. Pranav Ahluwalia wrote a Python program that does exactly this (see https://www.pranav.ai/CFRM-RPS). I copied the code into my Jupyter notebook and keyed in the opponent’s strategy for rock, paper, scissors as [40%, 30%, 30%]. After 1 million iterations it returned that my maximally exploitative strategy is [0.01%, 99.99%, 0.01%]. As expected, I should pretty much always play paper.
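For readers who want to tinker, here is a minimal sketch of the same idea (my own reconstruction, not the code from Pranav’s post): regret matching against a fixed opponent strategy. The payoff matrix and update rule are the standard ones; the function and variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff to us: rows = our action, columns = opponent action
# (order: rock, paper, scissors)
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def exploit(opp_strategy, iterations=100_000):
    regret_sum = np.zeros(3)
    strategy_sum = np.zeros(3)
    for _ in range(iterations):
        # Regret matching: mix in proportion to positive cumulative regret
        positive = np.maximum(regret_sum, 0)
        strategy = positive / positive.sum() if positive.sum() > 0 else np.full(3, 1/3)
        strategy_sum += strategy

        my_action = rng.choice(3, p=strategy)
        opp_action = rng.choice(3, p=opp_strategy)

        # Counterfactual values: what each action would have scored this round
        values = PAYOFF[:, opp_action]
        regret_sum += values - values[my_action]
    return strategy_sum / strategy_sum.sum()  # average strategy

print(exploit([0.4, 0.3, 0.3]))  # converges towards (almost) always paper
```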
The training algorithm can also be adapted to train two agents simultaneously, i.e. our opponent adjusts as well. After another 1 million iterations it returned that player 1 should play the strategy [33.43%, 33.35%, 33.22%] and player 2 [33.27%, 33.33%, 33.40%]. In other words, again as expected, both players settle on playing rock, paper and scissors in equal one-third proportions.
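The self-play variant is only a small change: both players run the identical regret-matching update against each other. Again a sketch of mine, repeated in full so it stands alone:

```python
import numpy as np

rng = np.random.default_rng(0)
PAYOFF = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # rock/paper/scissors

def regret_strategy(regret_sum):
    positive = np.maximum(regret_sum, 0)
    return positive / positive.sum() if positive.sum() > 0 else np.full(3, 1/3)

def self_play(iterations=100_000):  # the article ran 1 million
    regrets = [np.zeros(3), np.zeros(3)]
    strategy_sums = [np.zeros(3), np.zeros(3)]
    for _ in range(iterations):
        strategies = [regret_strategy(r) for r in regrets]
        actions = [rng.choice(3, p=s) for s in strategies]
        for me in (0, 1):
            strategy_sums[me] += strategies[me]
            # The game is symmetric and zero-sum, so the same payoff
            # matrix serves both players
            values = PAYOFF[:, actions[1 - me]]
            regrets[me] += values - values[actions[me]]
    return [s / s.sum() for s in strategy_sums]

print(self_play())  # both average strategies drift towards [1/3, 1/3, 1/3]
```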
Poker solvers
Counterfactual regret minimisation is the same technique that powers poker solvers like ‘piosolver’. How it works (at a high level) is that, on each iteration, the payoff each alternative action would have achieved (the counterfactual) is compared with the payoff of the action the candidate strategy actually took. If an alternative would have done better, we accumulate ‘regret’ for not having taken it. Over many iterations the candidate strategy is nudged towards the actions carrying the most accumulated regret, until eventually no alternative is regretted, hence “counterfactual regret minimisation”.
For example, in rock / paper / scissors, suppose the opponent played rock and our candidate strategy also played rock (a tie, value 0). Against their rock, the actions [rock, paper, scissors] were worth [0, +1, -1] respectively, so subtracting the value of the action we actually took gives regrets of [0, +1, -1]. The positive regret accumulated on paper shifts our strategy towards playing paper more often in future iterations.
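In code, that single update is just a vector subtraction:

```python
import numpy as np

values = np.array([0, 1, -1])      # payoffs of rock/paper/scissors vs their rock
played = 0                         # our candidate strategy played rock (a tie)
regret = values - values[played]   # [0, +1, -1]: positive regret on paper
```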
I hope I haven’t lost you, and have given you some inkling of how this technique might be used in poker to construct an optimal fold / call / raise strategy on every street for every possible combination of two starting hole cards. Clearly this takes an enormous amount of computing power, even if simplifying assumptions are made! Players’ pre-flop opening hands need to be limited to particular ranges, and the possible bet sizings must be constrained. It’s noteworthy that even in the simple game of rock / paper / scissors, the result hasn’t fully converged even after 1 million iterations of machine learning.
In this sense poker is more computationally demanding than chess, where the next move can generally be computed to grandmaster level using relatively little time and computing power. With poker solvers it is often more efficient to trade computing speed for storage space, and to pre-calculate (over days or weeks) and save the solved result for every possible hole card combination and board run-out. Boards that are identical up to a relabelling of the suits can be grouped together exactly, and boards of merely similar texture can be bucketed together approximately (a KsKh7s flop plays much like a KdKc7s flop, even though one carries a flush draw and the other is rainbow).
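To illustrate the exact kind of grouping, here is a brute-force sketch (function name mine) that canonicalises a flop by trying all 24 relabellings of the suits. Two flops with the same canonical form are strategically identical; merely similar textures have to be bucketed with some loss of precision.

```python
from itertools import permutations

def canonical_flop(cards):
    """Canonical form of a flop under suit relabelling.
    cards: a list like ["Ks", "Kh", "7s"] (rank then suit)."""
    best = None
    for perm in permutations("shdc"):
        mapping = dict(zip("shdc", perm))
        relabelled = tuple(sorted(card[0] + mapping[card[1]] for card in cards))
        if best is None or relabelled < best:
            best = relabelled
    return best

# KsKh7s and KhKd7h are the same board up to suits; rainbow KdKc7s is not
print(canonical_flop(["Ks", "Kh", "7s"]) == canonical_flop(["Kh", "Kd", "7h"]))  # True
print(canonical_flop(["Ks", "Kh", "7s"]) == canonical_flop(["Kd", "Kc", "7s"]))  # False
```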
Implications for learning to play better poker
The first implication is that it is hard to learn to be a good poker player! We can learn opening ranges and responses to computer-optimal play until the cows come home, but if our opponents are slightly unbalanced (as they will be) then our correct responses change not just slightly, but drastically. In chess, by contrast, if the computer suggests a good move, it would be a good move against almost anybody. Copy it.
Neither chess computers nor poker solvers explain why they suggest a certain strategy (though I think this may become possible in the future). Occasionally the chess computer suggests a move that only a “metal-head” would think of, but the latest computers generally play in an attacking and aesthetically pleasing style. By contrast, poker solvers suggest strategies that are not only difficult to explain but are impossible for a human to memorise and replicate at the table. The suggestion in a particular spot with your QJs, say, might be to check 10% of the time, bet small 30% of the time and bet big 60% of the time.
Chess computers are cheap as chips, but (and this is just a side gripe) poker solvers are enormously expensive. ‘Piosolver’ costs hundreds of dollars. If your game of choice is $5/$10 NL, and owning a copy of piosolver paired with some diligent study can plausibly improve your win rate by 5 big blinds (i.e. $50) per 100 hands, then you should be able to recoup the cost of the software in a reasonable period of time. At $0.20/$0.40 NL it would take years...
Some constructive suggestions
The above comments paint a pessimistic picture of the value of solvers, so here are some more positive observations.
Computer-generated game theory optimal (GTO) strategies do establish a baseline from which to deviate against unbalanced opposition. To give a concrete example, suppose we are in the small blind (SB) with 100BB facing a raise of 2.5BB from the lojack (LJ) in a 6-max game.
The GTO strategy is to 3-bet 7% of the time and fold 93% of the time. Looking at specific holdings, we should always raise with QJs, mix raising and folding with QTs, and always fold with Q9s and weaker Qxs. How would we apply this ‘advice’ to human opposition?
- We can observe that the solver never flats (calls) from the SB. It’s raise or fold. Our play against real live humans should lean in this direction too. One reason for never calling is that this allows the BB to ‘squeeze’ (i.e. 3-bet us) and we’ll often have to fold. If we think that in practice our BB opponents will 3-bet infrequently, we can develop a calling range. But it will be a narrow one.
- The solver plays very tightly, folding the vast majority of the time. Our play against real live humans should lean in this direction too. Despite getting favourable pot odds to call (we’ve already committed our 0.5 BB blind; see the short calculation after this list), the SB has to play out of position on the flop, turn and river. It’s a negative expected value (EV) spot, overall.
- If we know that the LJ raises with a vastly looser range than is solver-approved, then maybe our VPIP (the percentage of hands in which we voluntarily put money in the pot) can be greater than 7%. Maybe Q9s becomes a call instead of a fold. But if we go overboard and call with all suited Qx, this simply justifies the lojack’s loose play. I should also note that in our game the LJ usually opens much bigger than 2.5BB (5BB is common), which justifies tighter play from the SB.
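For the pot-odds point in the second bullet, the raw numbers look like this (a sketch that counts only the money already in the pot and ignores all postflop action):

```python
def required_equity(to_call, pot_before_call):
    """Break-even raw equity for a call, ignoring future betting."""
    return to_call / (pot_before_call + to_call)

# SB has posted 0.5 BB and faces a 2.5 BB open; the BB's 1 BB is also in the pot
pot = 2.5 + 0.5 + 1.0
to_call = 2.5 - 0.5
print(required_equity(to_call, pot))  # 0.33: we only "need" ~33% raw equity,
# yet the solver folds 93% of hands; out of position, much of that raw
# equity can never be realised, which is the gap between pot odds and EV
```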
Human opposition tends to be unbalanced in similar ways, regardless of player population. We are pre-programmed to under-bluff (it’s immoral, you know...) and to over-call (there’s no closure if we don’t see the outcome, and we hate being bluffed, a.k.a. cheated).
So look out for opponents who donk-bet, check-raise, or bet big on the river. In theory, these ‘strong’ moves can be perfectly balanced by bluffs; in practice we should fold anything short of the nuts. Naturally, a minority of players in the population pool are ‘maniacs’ who over-bluff in these situations. Against them we need to close our eyes and call...
The ‘maximally exploitative strategy’ (fold or call 100% of the time) may lead to our opponents adjusting. This may be giving our opponents’ observational powers too much credit though! Even if they perceive that we are under-bluffing, say, they will likely still call our bet if they like their hand, regardless.
An alternative approach is the ‘minimally exploitative strategy’ (see Michael Acevedo, ‘Modern Poker Theory’), which assumes that our opponent isn’t completely oblivious to our play, and catches on slowly but eventually. Let’s say the opponent hates to be bluffed, so calls more than is optimal. The maximum exploit is to bluff 0% of the time. The less extreme approach is to ascribe an artificial and arbitrary extra penalty of $x, in the opponent’s mind, for being bluffed; we can then solve for a bluffing frequency y% > 0 that yields extra profit for as long as the opponent keeps this leak in his game.
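One way to make this concrete (my own formalisation, so treat the model as an assumption rather than Acevedo’s exact construction): suppose the opponent behaves as though folding to a bluff costs them an extra $x on top of the pot. Equating their EV of calling with their perceived EV of folding, and solving for our bluffing frequency, gives a number below the equilibrium frequency but still above zero:

```python
def bluff_frequency(pot, bet, penalty=0.0):
    """Bluff frequency that keeps an opponent indifferent to calling,
    if they attach an extra imagined cost `penalty` to folding to a bluff.
    EV(call) = b*(pot + bet) - (1 - b)*bet ; perceived EV(fold) = -b*penalty
    Setting these equal gives b = bet / (pot + 2*bet + penalty)."""
    return bet / (pot + 2 * bet + penalty)

print(bluff_frequency(100, 100))       # 0.333: the equilibrium frequency
print(bluff_frequency(100, 100, 60))   # 0.278: bluff a little less, but not 0%
```

The profit comes from our value bets getting paid off more often than at equilibrium, while our reduced (but non-zero) bluffing frequency is much harder for the opponent to detect than never bluffing at all.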
Conclusion
In summary, it is possible to teach a computer to play a perfectly balanced game of poker, but it is doubtful how valuable it is to try to replicate this strategy against all but the strongest opposition. There are many situations where playing an extreme & exploitable strategy (taking an action 0% or 100% of the time) will maximise our expected value in the long run.