I've said what I can about "prompt engineering": why the idea is facile nonsense, how it lacks any theoretical or empirical basis, and what the organizational impact is of promoting people on the basis of "prompt engineering" skills.

But there's something else to be said, I think, about the underlying fallacy that makes "prompt engineering" appear to be real: namely, the incorrect assumption that LLMs can be "programmed" at all.

Let me step back, though, and be careful about what I mean by "programmability."

Traditionally, in computer science, that's meant that the device or software in question can be made to emulate arbitrary Turing machines; in its strongest form, it even requires no worse than polynomial overhead.

Colloquially, though, one can program a TV remote or similar — that doesn't mean you can run DOOM on it (well, some of you can). It means you can intentionally affect its behavior.

I posit that LLMs admit neither the strict nor colloquial sense of programmability. The former could be argued formally in terms of the number of states that can be explored, using the busy beaver numbers, but that would be original research well beyond the scope of a few toots.

Practically speaking, LLMs fail even the TV remote sense of programmability — if there is some desired behavior you wish for an LLM to exhibit, there's no a priori way to decide what actions to take to achieve that.

Should you use "please" and be more polite to it, or should you use "fuck" to bypass its guard rails? Who knows! As discussed, there's no theory to help you here, so you're stuck without a map to track back to what "prompt" you should start with.

The problem is that some prompts *seem* to work. Because these things are trained on lots and lots of stolen labor, it's not too hard to find text that superficially resembles some task or another.

It's the whole thing where if you ask an LLM to multiply two small numbers together, someone has probably done that somewhere, so it "works," but it completely fails for larger numbers. "Reasoning" models can get around that by giving the model an escape hatch out to eval, as in the original chain-of-thought paper, but then why not just use eval directly?
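To make that concrete, here's a minimal sketch of the contrast. `complete()` is a purely hypothetical stand-in for whatever LLM completion call you might have; it is not a real API.

```python
def complete(prompt: str) -> str:
    # Stand-in for some hypothetical LLM completion API; not a real library call.
    raise NotImplementedError("plug in your model of choice")

def multiply_directly(a: int, b: int) -> int:
    # Plain arithmetic: correct for every pair of integers, every time.
    return a * b

def multiply_via_llm(a: int, b: int) -> int:
    # "Prompting" for the same task: the answer is whatever text the model
    # happens to emit, with no guarantee it parses, no guarantee it's right,
    # and no bound on which (a, b) it quietly stops being right for.
    reply = complete(f"What is {a} times {b}? Answer with only the number.")
    return int(reply)

def multiply_via_escape_hatch(a: int, b: int) -> int:
    # The "escape hatch" version: the model emits an expression and eval does
    # the arithmetic. At that point the model isn't doing the math at all,
    # so you could have written the expression (or called eval) yourself.
    expression = complete(f"Write a Python expression for {a} times {b}.")
    return eval(expression)
```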

But regardless, if you think of a task common enough that it has been solved in the training corpus, then it "works," right?

The next step of the fallacy is to conflate the strict and colloquial senses of programmability. If you can give an LLM "instructions" to multiply two numbers, couldn't you give it "instructions" to do anything else?

Well, no. That's a giant leap, and one that's fully unsupported by evidence. But it *feels* right, if only because you can again try out common problems and find solutions somewhere in the training corpus.

Computer scientists, as a lot, tend to have a pocket industry of proving that this or that toy model is equivalent to Turing machines. Billiard balls can be used to implement the Fredkin gate, which is universal for computation, and thus you can program your local pool table. Magic: the Gathering is infamously Turing complete, a fun fact to bring up during draft.

Those kinds of proofs are useful because the easiest way to show something can be programmed is to do it.
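For contrast, here's roughly what "doing it" looks like when a rulebook exists: the Fredkin gate is just a controlled swap, and fixing some of its inputs to constants gives you ordinary logic gates. This is a sketch of the standard construction in Python rather than billiard balls; the point is that the rule is explicit and checkable.

```python
def fredkin(c: int, a: int, b: int) -> tuple[int, int, int]:
    # Controlled swap: if the control bit c is 1, the two data bits trade places.
    return (c, b, a) if c else (c, a, b)

def AND(x: int, y: int) -> int:
    # Route y through only when x is 1; otherwise the constant 0 passes through.
    return fredkin(x, 0, y)[1]

def NOT(x: int) -> int:
    # Swap a constant 1 and 0 under control of x.
    return fredkin(x, 1, 0)[1]

# Because the rule is explicit, the claim is checkable by brute force:
assert all(AND(x, y) == (x & y) for x in (0, 1) for y in (0, 1))
assert all(NOT(x) == 1 - x for x in (0, 1))
```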

That's precisely what you *can't* do with LLMs, though. Can you use an LLM to simulate a Fredkin gate if you say "fuck" enough, or do you need a "please" or two in there?

What's the construction? What *concrete* steps do you take to get an LLM to do what you want, and how do you know that you're correct?

With billiard balls, you have Newtonian physics. With MtG, you have the comprehensive rules. With LLMs, you've got ?????????.

Even if you had an LLM-based implementation of programmability, will it always work? Will it work for programs up to a certain size? How will you know when it stops working? Will it work for a different seed or at a different temperature?
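As a sketch of that asymmetry, under the same assumptions as before (`complete()` remains a hypothetical placeholder, not a real API): the explicit gate above can be verified exhaustively in eight cases, while an LLM-backed "gate" can at best be spot-checked, and a passing run tells you nothing about the next one.

```python
from itertools import product

def verify_fredkin(gate) -> bool:
    # Exhaustive check against the rule: eight input combinations, done.
    return all(
        gate(c, a, b) == ((c, b, a) if c else (c, a, b))
        for c, a, b in product((0, 1), repeat=3)
    )

def llm_fredkin(c: int, a: int, b: int) -> tuple[int, ...]:
    # A hypothetical LLM-backed "gate"; complete() is the same placeholder as before.
    reply = complete(
        f"Act as a Fredkin gate. Control={c}, inputs={a},{b}. "
        "Reply with the three output bits separated by spaces."
    )
    return tuple(int(bit) for bit in reply.split())

# verify_fredkin(llm_fredkin) might even return True on some run -- but that run
# says nothing about the next seed, the next temperature, the next model version,
# or the same request phrased slightly differently.
```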

The programmability fallacy, as I'll coin it, lies to us and tells us not to worry about those problems because we can tell LLMs to multiply small numbers by each other and sometimes get the right answer.

(The programmability fallacy, for what it's worth, also applies to running FPGAs outside of design parameters.)