Something weird is happening with LLMs and chess

(dynomight.substack.com)

220 points | by crescit_eundo 13 hours ago

33 comments

  • chvid 37 minutes ago
    Theory 5: GPT-3.5-instruct plays chess by calling a traditional chess engine.
    • bubblyworld 24 minutes ago
      Just think about the trade-off from OpenAI's side here - they're going to add a bunch of complexity to gpt3.5 to let it call out to engines (either an external system monitoring all outputs for chess-related stuff, or some kind of tool-assisted CoT, for instance) just so it can play chess incorrectly a high percentage of the time and, even when it doesn't, only at a mere 1800 Elo? In return for some mentions in a few relatively obscure blog posts? Doesn't make any sense to me as an explanation.
    • pixiemaster 11 minutes ago
      I have this hypothesis as well, that OpenAI added a lot of "classic" algorithms and rules over time (e.g. rules for filtering, etc.)
    • kylebenzle 34 minutes ago
      Yes! I was also waiting for this seemingly obvious answer in the article. Hopefully the author will see these comments.
  • niobe 5 hours ago
    I don't understand why educated people expect that an LLM would be able to play chess at a decent level.

    It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.

    • mannykannot 1 hour ago
      One of the main purposes of running experiments of any sort is to find out if our preconceptions are accurate. Of course, if someone is not interested in that question, they might as well choose not to look through the telescope.
      • bowsamic 0 minutes ago
        Sadly there’s a common sentiment on HN that testing obvious assumptions is a waste of time
    • viraptor 5 hours ago
      This is a puzzle given enough training information. An LLM can successfully print out the state of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. "Decent" is subjective, but that should beat at least beginners. And the lowest level of Stockfish used in the blog post is around low intermediate.

      I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.

      • grugagag 4 hours ago
        LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.

        Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.

        • viraptor 4 hours ago
          > but LLMs don’t even "see" the board

          This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.

          > LLMs have limited memory

          For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.

          > so they struggle to remember previous moves

          Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.

          > They’re great at explaining chess concepts or moves but not actually competing in a match.

          What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?

          • sfmz 3 hours ago
            Chess is not stateless. En passant requires knowing the last move, and castling rights depend on nearly all previous moves.

            https://adamkarvonen.github.io/machine_learning/2024/01/03/c...

            • viraptor 3 hours ago
              Ok, I did go too far. But castling doesn't require all previous moves - only one bit of information carried over. So in practice that's board + 2 bits per player. (or 1 bit and 2 moves if you want to include a draw)
              • aaronchall 3 hours ago
                Castling requires no prior moves by either piece (King or Rook). Move the King once and back early on, and later, although the board looks set up for castling, the King may not castle.
                • viraptor 3 hours ago
                  Yes, which means you carry one bit of extra information - "is castling still allowed". The specific moves that resulted in this bit being unset don't matter.
                  • aaronchall 3 hours ago
                    Ok, then for this you need a minimum of two bits - one for the kingside Rook and one for the queenside Rook; both would be set if you move the King. You also need to count the moves since the last capture or pawn move for the 50-move rule.
                    • viraptor 3 hours ago
                      Ah, that one's cool - I've got to admit I've never heard of the 50 move rule.
                      • User23 2 hours ago
                        Also the 3x repetition rule.
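
            Pulling the above sub-thread together: beyond piece placement, the extra game state is small. A rough sketch of what a full position needs to carry - roughly the FEN fields plus a position history for repetition - with illustrative field names (this is an editorial sketch, not from any comment):

              from dataclasses import dataclass, field

              @dataclass
              class ChessState:
                  piece_placement: str           # where the pieces are (first FEN field)
                  side_to_move: str              # "w" or "b"
                  castling_rights: str           # subset of "KQkq" still available (4 bits total)
                  en_passant_target: str | None  # square behind a pawn that just advanced two steps, if any
                  halfmove_clock: int            # moves since the last capture or pawn move (50-move rule)
                  position_history: list[str] = field(default_factory=list)  # needed for threefold repetition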
          • ethbr1 3 hours ago
            > Chess is stateless with perfect information.

            It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.

            > What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?

            Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.

            Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.

            Because many of those next moves were making that next move in support of some broader strategy.

            • viraptor 3 hours ago
              > it's played as a series of moves connected to a player's strategy.

              That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.

              I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.

              • ethbr1 3 hours ago
                If we're talking about LLMs, then the state belongs to it.

                So even if the rules of chess are (mostly) stateless, the resulting game itself is not.

                Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.

          • mjcohen 3 hours ago
            Chess is not stateless. Three repetitions of same position is a draw.
          • cool_dude85 3 hours ago
            >Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.

            In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.

            • aetherson 3 hours ago
              They mean that you only need board position, you don't need the previous moves that led to that board position.

              There are at least a couple of exceptions to that as far as I know.

              • User23 2 hours ago
                The correct phrasing would be: is it a Markov process?
        • jerska 4 hours ago
          LLMs need to compress information to be able to predict next words in as many contexts as possible.

          Chess moves are simply tokens as any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.

        • zeckalpha 2 hours ago
          Language is a game with strict rules and strategies.
      • shric 2 hours ago
        Stockfish level 1 is well below "lowest intermediate".

        A friend of mine just started playing chess a few weeks ago and can beat it about 25% of the time.

        It will hang pieces, and you can hang your own queen and there's about a 50% chance it won't be taken.

    • xelxebar 2 hours ago
      Then you should be surprised that turbo-instruct actually plays well, right? We see a proliferation of hand-wavy arguments based on unfounded anthropomorphic intuitions about "actual reasoning" and whatnot. I think this is good evidence that nobody really understands what's going on.

      If some mental model says that LLMs should be bad at chess, then it fails to explain why we have LLMs playing strong chess. If another mental model says the inverse, then it fails to explain why so many of these large models fail spectacularly at chess.

      Clearly, there's more going on here.

      • flyingcircus3 17 minutes ago
        "playing strong chess" would be a much less hand-wavy claim if there were lots of independent methods of quantifying and verifying the strength of stockfish's lowest difficulty setting. I honestly don't know if that exists or not. But unless it does, why would stockfish's lowest difficulty setting be a meaningful threshold?
    • computerex 5 hours ago
      Question here is why gpt-3.5-instruct can then beat stockfish.
      • fsndz 5 hours ago
        PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close. "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
        • computerex 5 hours ago
          Maybe there's some difference in the setup because the OP reports that the model beats stockfish (how they had it configured) every single game.
          • Filligree 5 hours ago
            OP had stockfish at its weakest preset.
            • fsndz 4 hours ago
              Did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a diff in Stockfish version? I am using Stockfish 16.
              • mannykannot 1 hour ago
                That is a very pertinent question, especially if Stockfish has been used to generate training data.
      • lukan 4 hours ago
        Cheating (using an internal chess engine) would be the obvious reason to me.
        • TZubiri 4 hours ago
          Nope. Calls via the API don't use function calls.
          • girvo 1 hour ago
            How can you prove this when talking about someones internal closed API?
          • permo-w 4 hours ago
            that you know of
      • shric 2 hours ago
        I'm actually surprised any of them manage to make legal moves throughout the game once out of book moves.
      • bluGill 5 hours ago
        The article appears to have only run Stockfish at low levels. You don't have to be very good to beat it.
    • pizza 1 hour ago
      But there's really nothing about chess that makes reasoning a prerequisite, a win is a win as long as it's a win. This is kind of a semantics game: it's a question of whether the degree of skill people observe in an LLM playing chess is actually some different quantity than the chance it wins.

      I mean at some level you're saying that no matter how close to 1 the win probability (1 - epsilon) gets, both of the following are true:

      A. you should expect the computation that you're able to do via conscious reasoning alone to be sufficient, at least in principle, to asymptotically get a higher win probability than a model, no matter what the model's win probability was to begin with

      B. no matter how close to 1 that the model's win rate p=(1 - epsilon) gets, because logical inference is so non-smooth, the win rate on yet-unseen data is fundamentally algorithmically random/totally uncorrelated to in-distribution performance, so it's never appropriate to say that a model can understand or to reason

      To me it seems that people are subject to both of these criteria, though. They have a tendency to cap out at their eventual skill cap unless given a challenge to nudge them to a higher level, and likewise possession of logical reasoning doesn't let us say much at all about situations that their reasoning is unfamiliar with.

      I also think, if you want to say that what LLMs do has nothing to do with understanding or ability, then you also have to have an alternate explanation for the phenomenon of AlphaGo defeating Lee Sedol being a catalyst for top Go players being able to rapidly increase their own rankings shortly after.

    • danielmarkbruce 4 hours ago
      Chess does not clearly require that. Various purely ML/statistical model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play at a decent amateur level.

      The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.

    • SilasX 5 hours ago
      Right, at least as of the ~GPT3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
      • cma 5 hours ago
        But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that, it would need to try, from context, to make it so white doesn't make critical mistakes.
    • slibhb 4 hours ago
      Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.

      > It has no idea about the quality of its data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation, which chess clearly requires.

      No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".

    • aqme28 5 hours ago
      Yeah, that is the "something weird" of the article.
    • TZubiri 4 hours ago
      Bro, it actually did play chess, didn't you read the article?
      • mandevil 4 hours ago
        It sorta played chess - he let it generate up to ten candidate moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try he picked a random legal move for it. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
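
        A rough sketch of that retry loop (assuming the python-chess library and a hypothetical query_llm(pgn) helper; an illustration of the described procedure, not the author's actual code):

          import random
          import chess

          def next_move(board: chess.Board, pgn_so_far: str, query_llm, max_tries: int = 10) -> chess.Move:
              """Ask the LLM for a move in algebraic notation; after max_tries failed
              attempts, fall back to a random legal move."""
              for _ in range(max_tries):
                  candidate = query_llm(pgn_so_far).strip()   # e.g. "Nf3"
                  try:
                      return board.parse_san(candidate)       # raises ValueError if illegal or unparseable
                  except ValueError:
                      continue                                # throw the attempt away and re-prompt
              return random.choice(list(board.legal_moves))   # give up: pick a random legal move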
  • azeirah 7 hours ago
    Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it perceives the world by limiting the structure of the information streams that come into the model from the very beginning.

    I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.

    Surprised I don't see more research into radically different tokenization.

    • aithrowawaycomm 7 hours ago
      FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.

      E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
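
      For example, the difference between a direct ask and a step-by-step ask might look like this (illustrative prompts only, not taken from the comment above):

        SENTENCE = "the quick brown fox jumps over the lazy dog"

        DIRECT_PROMPT = f"How many words are in this sentence? Answer with a number only: {SENTENCE}"

        COT_PROMPT = (
            "Count the words in this sentence. Go word by word, writing a running count "
            f"(1. the, 2. quick, ...), then give the final total on its own line: {SENTENCE}"
        )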

      • ipsum2 6 hours ago
        The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
        • aithrowawaycomm 5 hours ago
          I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
          • ipsum2 4 hours ago
            What you're saying is an explanation of what I said, but I agree with you ;)
      • meroes 2 hours ago
        At a certain level they are identical problems. My strongest piece of evidence is that I get paid as an RLHF'er to find ANY case of error, including "tokenization". You know how many errors an LLM gets in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems, with multiple models, etc?

        My assumption is that these large companies wouldn't pay livable wages to hundreds of thousands of RLHF'ers through dozens of third-party companies if tokenization errors were just that.

      • TZubiri 4 hours ago
        > FWIW I think most of the "tokenization problems"

        List of actual tokenization limitations: 1. strawberry 2. rhyming and meter 3. whitespace (as displayed in the article)

      • Der_Einzige 6 hours ago
        I’m the one who will fight you including with peer reviewed papers indicating that it is in fact due to tokenization. I’m too tired but will edit this for later, so take this as my bookmark to remind me to respond.
        • aithrowawaycomm 5 hours ago
          I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better tokenizing right-left rather than L-R). But I am talking about counting, and talking about counting words, not characters. I don’t think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
          • cma 4 hours ago
            Words vs characters is a similar problem, since tokens can be less than one word, multiple words, multiple words plus a partial word, or words with non-word punctuation like a sentence-ending period.
    • layer8 4 hours ago
      Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
    • ATMLOTTOBEER 1 hour ago
      I tend to agree with you. Your post reminded me of https://gwern.net/aunn
    • jncfhnb 4 hours ago
      There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
      • joquarky 1 hour ago
        It's not even possible to encode all forms of knowledge.
    • og_kalu 4 hours ago
      Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
    • numpad0 58 minutes ago
      hot take: LLM tokens are kanji for AI, and just like kanji they work okay sometimes but fail miserably at accurately representing English
    • cschep 7 hours ago
      How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
      • skylerwiernik 7 hours ago
        Couldn't we just make every human readable character a token?

        OpenAI's tokenizer makes "chess" "ch" and "ess". We could just make it into "c" "h" "e" "s" "s"
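
        If you want to check how a particular OpenAI model actually splits a string, tiktoken makes that easy (the exact split depends on the model's encoding, so the "ch"/"ess" split above may or may not hold):

          import tiktoken

          enc = tiktoken.encoding_for_model("gpt-3.5-turbo-instruct")
          token_ids = enc.encode("chess")
          print([enc.decode([tid]) for tid in token_ids])  # prints the subword pieces "chess" is split into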

        • cco 6 hours ago
          We can, tokenization is literally just to maximize resources and provide as much "space" as possible in the context window.

          There is no advantage to tokenization, it just helps solve limitations in context windows and training.

          • TZubiri 4 hours ago
            I like this explanation
        • taeric 7 hours ago
          This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.

          That is, the groups are encoding something the model doesn't have to learn.

          This is not much astray from "sight words" we teach kids.

          • TZubiri 4 hours ago
            > This is just more tokens?

            Yup. Just let the actual ML git gud

            • taeric 4 hours ago
              So, put differently, this is just more expensive?
        • tchalla 7 hours ago
          aka Character Language Models which have existed for a while now.
      • viraptor 7 hours ago
        That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.
  • quantadev 55 minutes ago
    We know from experience with different humans that there are different types of skills and different types of intelligence. Some savants might be superhuman at one task but basically mentally disabled at all other things.

    It could be that the model that does chess well just happens to have the right 'connectome' purely by accident of how the various back-propagations worked out to land on various local optima (model weights) during training. It might even be (probably is) a non-verbal connectome that's just purely logic rules, having nothing to do with language at all, but a semantic-space pattern that got landed on accidentally, which can solve this class of problem.

    Reminds me of how Daniel Tammet just visually "sees" answers to math problems in his mind without even knowing how they appear. It's like he sees a virtual screen with a representation akin to numbers (the answer) just sitting there to be read out from his visual cortex. He's not 'working out' the solutions. They're just handed to him purely by some connectome effects going on in the background.

  • anotherpaulg 1 hour ago
    I found a related set of experiments that include gpt-3.5-turbo-instruct, gpt-3.5-turbo and gpt-4.

    Same surprising conclusion: gpt-3.5-turbo-instruct is much better at chess.

    https://blog.mathieuacher.com/GPTsChessEloRatingLegalMoves/

    • shtack 1 hour ago
      I’d bet it’s using function calling out to a real chess engine. It could probably be proven with a timing analysis to see how inference time changes/doesn’t with number of tokens or game complexity.
      • scratchyone 39 minutes ago
        ?? why would openai even want to secretly embed chess function calling into an incredibly old model? if they wanted to trick people into thinking their models are super good at chess why wouldn't they just do that to gpt-4o?
  • jrecursive 7 hours ago
    i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number

    that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.

    while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.

    • jayrot 7 hours ago
      > i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number

      Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.

      • rcxdude 3 hours ago
        Sure, but so does the number of paragraphs in the English language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.

        (And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)

      • metadat 7 hours ago
        What about the number of possible positions where an idiotic move hasn't been played? Perhaps the search space could be reduced quite a bit.
        • pixl97 6 hours ago
          Unless there is an apparently idiotic move that can lead to an 'island of intelligence'
    • astrea 2 hours ago
      Since we're mentioning Shannon... What is the minimum representative sample size of that problem space? Is it close enough to the number of freely available chess moves on the Internet and in books?
    • torginus 5 hours ago
      Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
      • jrecursive 5 hours ago
        you should try it and post a rebuttal :)
    • BurningFrog 7 hours ago
      > I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.

      Yeah, once you've deviated from a sequence you're lost.

      Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.

  • underlines 7 hours ago
    Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"

    Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.

    Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.

    One could try using DSPy for automatic prompt optimization.
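
    A sketch of what such a step-by-step prompt might look like (the wording here is illustrative, not something tested in the article):

      COT_CHESS_PROMPT = """You are playing White. Game so far (PGN): {pgn}

      Before moving:
      1. Reconstruct the current board position from the moves above.
      2. List the candidate moves you are considering and what each one threatens or defends.
      3. Pick the 3 most promising candidates and look one or two moves ahead for each.

      Then output your chosen move on its own line in standard algebraic notation."""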

    • pavel_lishin 5 hours ago
      > 1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3.

      Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?

      • TZubiri 4 hours ago
        Can be forced through inference with CoT type of stuff. Spend tokens at each stage to draw the board for example, then spend tokens restating the rules of the game, then spend tokens restating the heuristics like piece value, and then spend tokens doing a minmax n-ply search.

        Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.

        Essentially user would have to teach gpt to play chess, or training would fine tune chess towards these CoT, fine tuning, etc...

    • viraptor 6 hours ago
      Yeah, the expectation of an immediate answer definitely limits results, especially for the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
  • PaulHoule 8 hours ago
    Maybe that one which plays chess well is calling out to a real chess engine.
    • og_kalu 7 hours ago
      It's not:

      1. That would just be plain bizarre

      2. It plays like what you'd expect from a LLM that could play chess. That is, level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of stockfish etc does. Also the specific chess notation being prompted actually matters

      3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame

      4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval

      5. You can or well you used to be able to inspect the logprobs. I think Open AI have stopped doing this but the link in 4 does show the author inspecting it for Turbo instruct.

      • aithrowawaycomm 6 hours ago
        > Also the specific chess notation being prompted actually matters

        Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.

        Likewise:

        - The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)

        - I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.

        I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"

        • janalsncm 4 hours ago
          > Couldn’t this be evidence that it is using an engine?

          A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.

          Much more likely is this model was trained on more chess PGNs. You can call that a “neural engine” if you’d like but it is the simplest solution and explains the mistakes it is making.

          Game state isn’t just what you can see on the board. It includes the 50 move rule and castling rights. Those were encoded as layers in AlphaZero along with prior positions of pieces. (8 prior positions if I’m remembering correctly.)

      • aaron695 6 hours ago
        [dead]
    • aithrowawaycomm 7 hours ago
      The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:

      - In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.

      - A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.

      - Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.

      - Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!

      - Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPT-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.

      - Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.

      I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."

      • jmount 4 hours ago
        Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
      • gardenhedge 6 hours ago
        Not that convoluted really
        • refulgentis 6 hours ago
          It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*

          If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself, people find it offputting), it probably wouldn't seem at all convoluted.

          * layer chess into gpt-3.5-instruct only, but not chatgpt, not GPT-4, to defeat the naysayers when GPT-4 comes out? shrugs if the issues with that are unclear, I can lay it out more

          ** fwiw, at the time, pre-chatgpt, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. it would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *

    • selcuka 5 hours ago
      I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.

      OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?

      [1] https://github.com/thomasahle/sunfish

      [2] https://lichess.org/@/sunfish-engine

    • sobriquet9 7 hours ago
      This is likely. From example games, it not only knows the rules (which would be impressive by itself, just making the legal moves is not trivial). It also has some planning capabilities (plays combinations of several moves).
    • janalsncm 4 hours ago
      Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.

      If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.

    • singularity2001 8 hours ago
      this possibility is discussed in the article and deemed unlikely
      • probably_wrong 8 hours ago
        Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.

        The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count until 10000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.

        [1] https://dynomight.substack.com/p/chess/comment/77190852

        • refulgentis 6 hours ago
          What do you mean LLMs can't count to 10,000 for known reasons?

          Separately, if you are able to show OpenAI is serving pre canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.

          I'm not saying this in an aggro tone, it's a genuinely interesting subject to me because I wrote off LLMs at first because I thought this was going on.* Then I spent the last couple years laughing at myself for thinking that they would do that. Would be some mix of fascinated and horrified to see it come full circle.

          * I can't remember what, exactly; it was as far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way way before ChatGPT.

      • margalabargala 8 hours ago
        I don't see that discussed, could you quote it?
  • lukev 4 hours ago
    I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.

    OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.

    Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.

    So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?

    For the record, I don't actually believe this. But given the data it's a logical possibility.

    • TZubiri 4 hours ago
      Stallman may have his flaws, but this is why serious research occurs with source code (or at least with binaries)
    • zeven7 4 hours ago
      Why do you doubt it? I thought it was well known that Chat GPT has degraded over time for the same model, mostly for cost saving reasons.
      • permo-w 4 hours ago
        ChatGPT is - understandably - blatantly different in the browser compared to the app, or it was until I deleted it anyway
        • lukan 4 hours ago
          I do not understand that. The app does not do any processing, just a UI to send text to and from the server.
  • Havoc 4 hours ago
    My money is on a fluke inclusion of more chess data in that models training.

    All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar so training data is the most likely explanation

    • permo-w 4 hours ago
      I feel like a lot of people here are slightly misunderstanding how LLM training works. yes the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety, but for any desired feature

      these companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves

      • simonw 4 hours ago
        From this OpenAI paper (page 29): https://arxiv.org/pdf/2312.09390#page=29

        "A.2 CHESS PUZZLES

        Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in-game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the models ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining format, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrapping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."

    • bhouston 4 hours ago
      Yeah. This.
  • fsndz 5 hours ago
    wow I actually did something similar recently and no LLM could win and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess

    I am very surprised by the perf of gpt-3.5-turbo-instruct. Beating Stockfish? I will have to run the experiment with that model to check that out.

    • fsndz 5 hours ago
      PS: I ran it and, as suspected, gpt-3.5-turbo-instruct does not beat Stockfish; it is not even close.

      "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"

      https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...

      • janalsncm 4 hours ago
        > I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting

        I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).

        • fsndz 3 hours ago
          Did the same and gpt-3.5-turbo-instruct still lost all the games. Maybe a diff in Stockfish version? I am using Stockfish 16.
  • ericye16 6 hours ago
    I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
  • digging 8 hours ago
    Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:

    1. The author mentioned that tokenization causes something minuscule, like a " " at the end of the input, to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?

    2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one or more models would perform significantly better with higher or lower temperatures?

    Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...

    • semi-extrinsic 6 hours ago
      The author mentions in the comment section that changing temperature did not help.
  • ChrisArchitect 8 hours ago
  • ynniv 7 hours ago
    I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
    • og_kalu 7 hours ago
      There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
  • cmpalmer52 7 hours ago
    I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.

    I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).

  • abalaji 5 hours ago
    An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
  • ks2048 4 hours ago
    Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
  • bryan0 6 hours ago
    I remember one of the early "breakthroughs" for LLMs in chess was that they could actually play legal moves(!) In all of these games, are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next-token prediction.
    • kenjackson 5 hours ago
      I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.

      I then said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!

    • pama 6 hours ago
      The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference) or sample the model for a valid move up to ten times, then pick a random valid move.
    • zelphirkalt 6 hours ago
      I think it only needs to have read sufficient pgns.
  • justinclift 4 hours ago
    It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)

    Or maybe it's able to recognise the chess game, then get moves from an external chess game API?

  • tqi 6 hours ago
    I assume LLMs will be fairly average at chess for the same reason they can't count the Rs in "strawberry" - they're reflecting the training set and not using any underlying logic. Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the reward models used were able to distinguish high-quality moves from subpar moves...
    • ClassyJacket 4 hours ago
      LLMs can't count the Rs in strawberry because of tokenization. Words are converted to tokens (integer IDs) and then embedding vectors, so the actual transformer network never sees the individual letters that make up the word.

      ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]

  • davvid 55 minutes ago
    Here is a truly brilliant game. It's Google Bard vs. Chat GPT. Hilarity ensues.

    https://www.youtube.com/watch?v=FojyYKU58cw

  • Xcelerate 6 hours ago
    So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
  • ks2048 4 hours ago
    How well does an LLM/transformer architecture trained purely on chess games do?
  • kmeisthax 4 hours ago
    If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
  • uneventual 5 hours ago
    my friend pointed out that Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1 mini's poor performance is puzzling, though.
  • pseudosavant 8 hours ago
    LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.

    Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.

    • viraptor 7 hours ago
      > That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.

      This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.

  • astrea 2 hours ago
    Well that makes sense when you consider the game has been translated into an (I'm assuming monotonically increasing) alphanumeric representation. So, just like language, you're given an ordered list of tokens and you need to find the next token that provides the highest confidence.
  • DrNosferatu 7 hours ago
    What about contemporary frontier models?
  • jacknews 3 hours ago
    Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.
  • m3kw9 4 hours ago
    If it was trained with moves and hundreds of thousands of entire games at various levels, I do see it generating good moves and beating most players except the high-Elo players
  • permo-w 4 hours ago
    if this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to do so