Monday, 14 October 2024

The grice that damned LLMs

Gary Marcus, in "LLMs don’t do formal reasoning", on the GSM-NoOp test in Mirzadeh & al.'s "GSM-Symbolic":

One particularly damning result was a new task the Apple team developed, called GSM-NoOp

Maybe. Say Rudy asks me that question, he of the challenging but reasonable and well-intentioned sort that conforms to conventions of ordinary conversations at The Gloucester. I take Rudy to imply that the small size of the five kiwis is relevant to the question, which it would be, for instance, if there were the background assumption that a canonical kiwi needs to be a certain size, and so "how many kiwis" probably refers to kiwis of that size, and I should exclude the five small ones. Or else we're coming up on the fifth pint and it's time to go.

If I replace Rudy with a text prompt, years of sadistic schooling kick in, which trained me to suspect questions about quantities of gathered kiwis (contributed gooseberries, birthday rats) and to learn a twisted game with its own rules. Once I spell out the relevant rules, an LLM does just fine:

Me & ChatGPT, some free version as of 14.10.2024:

Me: Please answer the following question from a general intelligence test: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have? Indicate your reasoning step by step. Take into account the nature of such tests: in particular, they need not conform to pragmatic expectations of ordinary communication, and therefore can contain information irrelevant to the solution of the problem with the intent to mislead.

LLM: Let's break the problem down step by step and focus on only the relevant information:

1. Kiwis picked on Friday: Oliver picked 44 kiwis on Friday.

2. Kiwis picked on Saturday: Oliver picked 58 kiwis on Saturday.

3. Kiwis picked on Sunday: On Sunday, he picked double the number of kiwis he picked on Friday (88 kiwis).

4. Irrelevant Information: The statement about five of the kiwis being smaller than average is irrelevant to the total count of kiwis.

5. Total number of kiwis: 44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

Thus, Oliver has 190 kiwis in total.
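For what it's worth, the two readings differ by a single subtraction. A minimal Python sketch of the arithmetic, using the figures from the prompt above; the variable names are mine, and the "pragmatic" total is only my guess at what a conversational reading would yield if the five small kiwis are excluded:

```python
# Figures taken from the prompt above.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"
smaller_than_average = 5

# Test-taker's reading: the remark about size is irrelevant to the count.
literal_total = friday + saturday + sunday

# Conversational reading (an assumption for illustration): the remark about
# size is taken as relevant, so the five small kiwis are excluded.
pragmatic_total = literal_total - smaller_than_average

print(literal_total)    # 190
print(pragmatic_total)  # 185
```

Neither sum is hard; the work is in deciding which one the question is after.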

So this sort of thing might only be telling us that the training data and procedure for LLMs get them to approximate what we tend to analyse as our knowledge, semantic and pragmatic alike, of language in use. That seems expected. Presumably LLMs also get that "Kate finished her homework and went to bed" invites the inference that the first conjunct precedes and is relevant to the second:

Me: Please give me a three-sentence description of a scenario, or a story, where the following sentence would be true, even if you don't know the meaning of the individual words, including temporal and causal relationships among the actions described: The homonda finished its gymbling and climbed out of the wabe.

LLM: In a dense, misty forest, a creature called a homonda moves about in an open glade, performing a task known as gymbling—a rhythmic, repetitive action it carries out near the edge of a strange structure called the wabe. After completing this action, which might involve gathering energy or materials, the homonda pulls itself up and out of the wabe, a trench-like formation hidden by vines. The gymbling appears to be essential to its survival or progress, as it only climbs out after the task is fully done.

(That was an intentional misspelling of gimble on my part, and an unintentional one of homomda.)

I'd not jump to conclusions about the capacities of LLMs without better controls, the same as are needed, though with far more cause, for going from other Others leaving some wiggle room for camels in Berlin to their having "no conception of logical quantifiers" or "inability to distinguish between logical and narrative order" (cf. Hallpike, The foundations of primitive thought, with Gärdenfors 1993 "The role of expectations in reasoning", Matusov & St. Julien 2004 "Print literacy as oppression", or Liberman 2008 "One question ...").