Compositionality Gap

Can we get strong guarantees from AI tools that are known to hallucinate? We discuss some strategies, and ways that Elm might be a great target for AI assistance.

Audio: https://cdn.simplecast.com/audio/6a206baa-9c8e-4c25-9037-2b674204ba84/episodes/d1c5f97c-9700-48b0-ab35-a039edbfd0d5/audio/16dc506d-5aa1-42c1-8838-9ffaa3e0e1e9/default_tc.mp3 (Elm Radio, episode 080: Elm and AI)

[00:36:16] So one thing that I was really amazed by, I'll share a link to this tweet, but I saw this demo, this was actually with GPT-3, and this example stuck with me: somebody found that GPT-3 did a poor job if you asked it questions that went through several steps.

Tweet showing the intermediary-questions prompt engineering technique (GitHub, arXiv)

YouTube: The Compositionality Gap Explained (with GPT-3) (https://www.youtube.com/watch?v=A3GtlwwWDhI)

[00:36:40] Like if you said, what is the capital of the country where the Taj Mahal is? [00:36:47] Then it would give Agra or something, the city where the Taj Mahal is, instead of [00:36:53] New Delhi, which is the capital of India. [00:36:55] So what they did is some prompt engineering.

[00:37:00] So they gave it a starting prompt, which said, here's an example of doing this thing. [00:37:06] And they kind of gave an example question. [00:37:09] They gave some intermediary questions, where it said, okay, well, in order to answer this [00:37:14] question, I need to answer an intermediary question first.

[00:37:18] And so they gave an example of that as the sort of context for the prompt. [00:37:23] And then they said, now answer this question. [00:37:25] Question: what is the capital of the country where [00:37:32] the Taj Mahal is?

[00:37:33] And then it went ahead and followed that set of steps. [00:37:37] Intermediary question, what country is the Taj Mahal in? [00:37:41] Answer, India.

[00:37:43] Intermediary question, what is the capital of India? [00:37:47] New Delhi. [00:37:48] Final answer, New Delhi. [00:37:50] And it got the correct answer.

[00:37:51] So it got higher quality results because it was guided to break the problem down into [00:37:57] sub-problems. [00:37:58] Pretty interesting, right? [00:38:01] Yeah.
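Reconstructed from the description above, the shape of that prompt looks roughly like this (the exact wording in the linked tweet and the self-ask paper differs; this is just a sketch):

```text
Question: What is the capital of the country where the Taj Mahal is?
Intermediary question: What country is the Taj Mahal in?
Answer: India.
Intermediary question: What is the capital of India?
Answer: New Delhi.
Final answer: New Delhi.
```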

[00:38:02] It's a bit funny because I feel like everyone has this like, oh, well, you know, I tried [00:38:09] ChatGPT and it didn't give me great results, but then I tried to prime it, or I tried to [00:38:16] give it this kind of prompt written this way, and now I get great results. [00:38:21] And I feel like maybe everyone will have their own ideal prompt and people will share it. [00:38:29] It's kind of like, well, I have my set of key bindings, of shortcuts, for my computer.

[00:38:35] Oh, you don't have a shortcut for this action in your IDE? [00:38:39] Oh, let me share it with you. [00:38:41] Or people will have their own proprietary prompts. [00:38:44] Maybe, yeah. [00:38:45] And we will start to have like our very custom experience around working with AI. [00:38:51] It's like, hey, this is how I do it. [00:38:54] And this works for me. [00:38:55] And this might not work for you.

[00:38:58] And then it'll be like, here's a very good prompt for generating the best prompt. [00:39:06] I kind of feel like maybe people who use Vim will have something like that. [00:39:13] There's a real art to it.

Solving Jigsaw Puzzles

[00:39:18] And I mean, you're getting it to break down problems. [00:39:21] So I took this concept, and I applied this idea of breaking the problem down into sub-problems. [00:39:28] I had actually written this blog post years back about solving [00:39:36] type puzzles in Elm like a jigsaw puzzle: frame, then fill in. [00:39:41] So like a jigsaw puzzle, you start by filling in the borders and the corners. [00:39:47] The corner pieces are easy to find.

[00:39:48] So you find the corners, then you've got sort of a set of fixed points. [00:39:53] So that's like one piece of low-hanging fruit. [00:39:56] It's easy to solve those problems. [00:39:58] Then you find the edge pieces and you can fill those in. [00:40:00] And now that you've got the edges, it makes solving the rest of the puzzle easier, right? [00:40:05] So that's one approach to solving jigsaw puzzles: break it down into sub-problems.

[00:40:09] And with Elm types, I use this technique a lot when I'm writing Elm code as a human. [00:40:16] I'll say, okay, I don't know exactly what's going to fit here in this pipeline. [00:40:23] But I know I want to take this starting value, I want to do something to it. [00:40:29] And then I want it to be passed through this pipeline to this next function.

[00:40:38] So sometimes I'll just put a Debug.todo there. [00:40:41] And then maybe I'll extract that Debug.todo to a function or a let binding. [00:40:46] And then I'll try to get a type annotation for that value I have in my let binding. [00:40:52] That would satisfy the compiler. [00:40:55] Exactly.

[00:40:56] Now I've broken it down into a sub-problem. [00:40:59] So I took this sort of, fill in this code, I don't know what goes here, and I've turned [00:41:03] it into, okay, I know the type I need. [00:41:06] So that's sort of like finding the edge pieces in the puzzle. [00:41:10] So now I've created the sub-problem for myself.

[00:41:13] And now I can do things like, so I've got a Debug.todo with a type annotation. [00:41:20] Now I can turn that into maybe a Result.map around a Debug.todo, or a List.map with a Debug.todo.

[00:41:29] So now I'm saying, well, I know that if I apply some function over a list, it will work. [00:41:38] And now I've further broken down that sub-problem. [00:41:40] And now I can follow that step again and say, okay, we'll break out that Debug.todo, [00:41:46] give that a type annotation.

[00:41:47] So now it's List.map of some function, I don't know what yet; now follow that same process with that.
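As a concrete sketch of those steps (a hypothetical function, not the example from the episode):

```elm
module FrameThenFillIn exposing (render)

-- Step 1: leave a hole. The annotation frames the puzzle, and the
-- compiler accepts Debug.todo anywhere, since it has type `a`.
render0 : List Int -> List String
render0 counts =
    Debug.todo "replace me"


-- Step 2: narrow the hole. Mapping some function over the list
-- satisfies the outer type, so the remaining hole has a smaller
-- type: Int -> String.
render1 : List Int -> List String
render1 counts =
    List.map (Debug.todo "replace me: Int -> String") counts


-- Step 3: fill in the final, smaller hole.
render : List Int -> List String
render counts =
    List.map String.fromInt counts
```

At every step the compiler accepts the program, so you can stop and ask it for the type of the remaining hole whenever you want.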

[00:41:53] So it's breaking it down into sub-problems. [00:41:55] I use this technique all the time. Often with API design, you're doing [00:42:01] weird type puzzles, but also just with user code, like trying to parse some markdown [00:42:08] and take the markdown blocks and find and traverse them and map them into this type, and whatever.

[00:42:16] Now I tried to use this sort of same prompt engineering idea to teach GPT how to follow [00:42:24] this set of steps of breaking down a type puzzle. [00:42:28] And it was actually really good at it.

Dillon's prompt engineering type puzzle examples

* Decode mapping solution (correct on first try): gist
* Markdown solution with 2 corrections from compiler feedback: gist
* Dillon's Frame Then Fill In blog post describes a similar method to the GPT prompt

[00:42:30] So I'll share a little GitHub gist of my GPT prompt and the results it got. [00:42:39] But what I basically did is I told it, I said, your job is to solve this type puzzle. [00:42:46] And I gave it some guardrails. [00:42:48] So the guardrails I gave it were: I'm going to give you some code, which has Debug.todo [00:42:54] "replace me" in it. [00:42:56] Your job is to satisfy the type for that Debug.todo. [00:43:02] And your final solution cannot include Debug.todo.

[00:43:06] You can write intermediary steps, which may include Debug.todo. [00:43:10] And you are not allowed to change any code other than the section that says Debug.todo. [00:43:16] So I gave it these guardrails. [00:43:17] And these are verifiable things, right?

[00:43:19] So you can test to see if it's given you a valid solution, given the guardrails [00:43:25] you wanted it to honor. Because it might hallucinate and not do that, but you can check that.
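Paraphrasing the guardrails described here (the actual prompt is in the gist linked above; this is just its shape):

```text
Your job is to solve a type puzzle.
- I will give you some Elm code containing Debug.todo "replace me".
- Replace that Debug.todo with code that satisfies its type.
- You may write intermediary steps that still contain Debug.todo.
- Your final solution must not contain Debug.todo.
- You may not change any code other than the Debug.todo section.
```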

[00:43:32] So one thing that I'm thinking of is, will you be able to verify things accurately? [00:43:39] Maybe I'm trying to play the devil's advocate here a bit, and I might be [00:43:44] a bit too hard on ChatGPT. [00:43:47] But for instance, whenever you're working in that TDD style, when you do things one step [00:43:52] at a time, you discover edge cases, right?

[00:43:56] So for instance, you have a function that takes a list as an argument and needs to return a number. [00:44:01] And first you hard-code that number. [00:44:03] And then you notice, oh, well, what if that list is empty? [00:44:08] Oh, well then I need to write a test for if the list is empty. [00:44:14] Ah, okay.
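In elm-test terms, the progression being described might look something like this (a hypothetical sum function, not an example from the episode):

```elm
module SumTest exposing (suite)

import Expect
import Test exposing (Test, describe, test)


-- The function under test: a list in, a number out.
sum : List Int -> Int
sum =
    List.foldl (+) 0


suite : Test
suite =
    describe "sum"
        [ test "the first, hard-coded case" <|
            \() -> sum [ 1, 2 ] |> Expect.equal 3
        , test "the edge case you only discover by working step by step" <|
            \() -> sum [] |> Expect.equal 0
        ]
```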

[00:44:17] But the AI might not do that, might not notice that. [00:44:20] So maybe it is going to fix things correctly. [00:44:24] Maybe not. But let's say it's going to do it correctly; it's still not going to have a [00:44:27] test for that.

[00:44:29] Or you're not going to know, or, I mean, you are not going to notice that [00:44:36] you're going to need to write a test with an empty list. [00:44:39] So the process is a bit hard to figure out if you don't do it yourself. [00:44:45] I think that's also one of the reasons why people say, well, you should pair [00:44:49] program rather than review someone else's code, because you will discover those [00:44:55] things while you're working.

[00:45:04] I think we're going to see like a lot of junior developers using these AI tools in exactly [00:45:10] the kind of way you're describing where maybe they trust it too much to do too many steps. [00:45:18] And then what happens is you're not really engaging with the problem intellectually, and [00:45:24] you're not thinking about these test cases that come up.

[00:45:27] So I think there's an art to knowing when to use it. [00:45:31] So the type of problem here, this sort of frame-then-fill-in problem I'm talking about, is a class of problem that I find myself spending a lot of mental effort solving on a regular basis.

You Know It When You See It

[00:45:45] It has this quality we talked about with the traveling [00:45:50] salesman problem, where you know it when you see it: if you have a good solution, you can easily [00:45:54] verify that it solved it.

[00:45:57] And yet it's not really doing anything too creative, because it's fitting pieces together. [00:46:02] It's not really writing too many implementation details. [00:46:06] And I find that often with Elm code, you arrive at these types of situations where, if [00:46:11] you could make the types work, you would trust that the code worked, because there's [00:46:16] only really one good way to fit these functions together to get the right thing. [00:46:20] You're not doing too much logic, like, give it this empty value, or give it this default. Now, it might hallucinate these types of things.

[00:46:27] But you could even use prompt engineering to tell it like, you're just fitting these [00:46:33] functions together. [00:46:34] I don't want you to write new logic or use default values. [00:46:37] So I think these types of things can be improved through prompt engineering and also through [00:46:41] selecting the right types of problems.

[00:46:43] For example, I gave it a pretty challenging problem with this sort [00:46:50] of prompt I designed. [00:46:52] I had it fill in this thing in an elm-pages example that I have, where it [00:46:58] takes markdown blocks and traverses these headings. And what I did is I primed [00:47:08] it with a list of all of the available functions that it could use. [00:47:15] And another thing you can do as a set of guardrails is you can say, only use these functions. [00:47:19] And you could even say, you must use these functions. [00:47:23] And these other functions are available, and you may not use any other functions. [00:47:27] And of course, these things are easily verifiable in the final output.
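The kind of context being described might look like this (hypothetical signatures; the real prompts in the gists list the actual functions in scope):

```text
These are the available functions, with their types. You may not use
any other functions.

List.map : (a -> b) -> List a -> List b
Result.map : (a -> b) -> Result x a -> Result x b
Tuple.mapFirst : (a -> x) -> ( a, b ) -> ( x, b )
String.fromInt : Int -> String
```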

Limiting Its Creativity

[00:47:32] But why would you tell it you can only use these functions? Because you're now limiting its creativity.

[00:47:41] For instance, if you forget to mention List.map, then it's not going to use that. [00:47:46] And it's going to use something else, like List.foldl, instead.

[00:47:50] Well, the way I was basically doing this was as a proof of concept of saying, I'm [00:47:58] going to give you all of the functions that are in scope in this particular codebase. [00:48:04] And I'm going to give you all of the functions from the List module and from [00:48:09] the Result module.

[00:48:10] So you can give it a certain set of available functions and say, these are the types. And [00:48:14] I even taught it how partial application works.

[00:48:18] So I can take an example. [00:48:20] And I said, given this type, if you pass in this function, now the type annotation is this.
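For example, the kind of partial application lesson being described might be spelled out like this (a sketch, not the exact prompt wording):

```elm
module PartialApplication exposing (mapToStrings)

-- List.map : (a -> b) -> List a -> List b
-- String.fromInt : Int -> String
--
-- Passing String.fromInt as the first argument fixes a = Int and
-- b = String, so the partially applied function has this type:
mapToStrings : List Int -> List String
mapToStrings =
    List.map String.fromInt
```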

[00:48:26] And I played around with it and tweaked it in these ways. [00:48:29] And it did a better job at solving these puzzles. [00:48:32] So it's pretty interesting. [00:48:33] You can kind of teach it in these prompts.

[00:48:45] I wanted to solve a sort of uncreative problem of wiring things together. If [00:48:52] I could automate that as an atomic step, I would love that. You [00:48:57] could even think of it this way: AI, give me all of the possible solutions, using [00:49:04] the functions in scope here, that satisfy this type, right?

[00:49:09] And then you can just select: okay, these are all the valid ways to do that. [00:49:13] So if I can teach the AI, through a GPT prompt, to solve that type of puzzle and give [00:49:20] me all the possible results, then I can use a tool to actually verify those results, [00:49:25] so I know I can 100% trust them.
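Checking a candidate is mechanical: paste it into the file and type-check it. For instance (a hypothetical file path):

```text
elm make src/Main.elm --output=/dev/null
```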

[00:49:28] That becomes a very interesting tool in my toolkit, where now I'm more capable as a programmer. [00:49:34] And I'm not hallucinating weird things into my codebase, or, as you said, [00:49:39] having it write out these new cases that I'm not considering in my testing, or not considering [00:49:45] how they affect the user experience.

[00:49:47] And if that's what I want the experience to be, it's just like, how do I solve this type problem?

[00:49:52] Oh yeah, that's the only sensible, meaningful solution, and it found it. [00:49:58] And I can easily look at it and say, oh yeah: Result.map, Tuple.mapFirst, List.map. Yes.

[00:50:06] […], that it didn't use anything too crazy. [00:50:10] It satisfies the types. [00:50:11] It's good. [00:50:12] I can easily look at it and see it did the right thing. [00:50:14] But it would have taken me a long time to solve it myself.
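The rough shape of the kind of solution being described (hypothetical types; the actual puzzle is in the gist):

```elm
module Solution exposing (solution)

-- There is only one sensible way to fit these together: unwrap the
-- Result, reach into the first element of the tuple, and map over
-- the list.
solution :
    Result error ( List Int, meta )
    -> Result error ( List String, meta )
solution =
    Result.map (Tuple.mapFirst (List.map String.fromInt))
```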

[00:50:18] And so in case it wasn't clear, my idea is, this is something I want to experiment [00:50:24] with a little more, using the OpenAI API to fully automate this.

[00:50:28] Because it's a pretty easy thing to automate to just say, what are the types of all the functions in scope here? [00:50:37] That's an automatable thing. [00:50:40] So feed that into the prompt, and give it the ability, through writing a small script that hits the OpenAI API, to see what the compiler error is and iterate on it.

[00:50:53] So when I gave it this problem as a proof of concept manually, I manually told it the [00:50:58] compiler errors, and with two iterations of telling it the compiler error, it was [00:51:03] able to get the correct result, which I thought was pretty cool.

[…]

[00:52:09] So a lot of our work as developers is not writing new code, it's editing existing code. [00:52:15] So I feel like that's going to be somewhat missing now, but it's probably going to be [00:52:21] solved pretty soon.

[00:52:31] Actually, GPT-4, can't it take like 25,000 tokens as input? [00:52:36] So you can feed it a huge input.

[00:52:40] That's a game changer.

[…]

[00:52:50] But until then, I'm thinking, for instance, if we had a tool that extracted information [00:52:56] from your codebase, and that you could then just copy-paste into the prompt, that could [00:53:02] be pretty cool. [00:53:03] So like having all the types for the functions. [00:53:06] And there is a tool called elm-review, which has an extract feature.
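From memory, extracting looks something like this; check the elm-review documentation for the exact invocation. A rule opts in with Rule.withDataExtractor to decide what JSON to emit, and the extracted data shows up in the JSON report:

```text
npx elm-review --report=json --extract
```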

[00:53:25] Unless you automate it, and why not? [00:53:27] It gives good results.

[00:53:40] Like, I work on a 200,000-line codebase. [00:53:46] But how many type annotations does that represent, right? [00:53:51] Like hundreds of thousands, probably? [00:53:54] You think so? [00:53:55] That are in scope at a particular point in the code? [00:53:58] Well, you can import anything. [00:54:00] Not everything is exposed, probably.

[00:54:03] And a lot of things are going to be values that are locally scoped to a let. [00:54:08] I would venture to guess, in a 200,000-line codebase... [00:54:12] With code comments? [00:54:13] Maybe there are a thousand. [00:54:16] Maybe there are on the order of one to ten thousand. [00:54:20] Definitely no more than 10,000.

[00:54:28] Oh, there are also dependencies, but that's also not that much. [00:54:33] Right, maybe it's doable. [00:54:36] And you can use heuristics to filter, and not even heuristics, reliable rules, [00:54:43] to filter things out.

[00:54:44] You know that if you're trying to arrive at this type, but none of the types [00:54:51] connect in any way...

[00:54:52] So if you can take a string and use it on this thing, and you can take an int and turn [00:54:58] it into a string, then those two things connect.

[00:55:01] But if this custom type can't be turned into a string, then you know it's not [00:55:09] going to be involved in your problem at all, because your problem is dealing with strings, [00:55:13] and you can't turn that type to or from a string, for example.

[00:55:17] Yeah, you can also remove all the modules that would create an import cycle. [00:55:22] Right. [00:55:23] So basically, you can't import anything that imports this file directly or indirectly. [00:55:28] Exactly.
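A minimal sketch of that connectivity heuristic, assuming signatures have been simplified to the type names they mention (hypothetical names throughout):

```elm
module RelevantFunctions exposing (Signature, relevantFunctions)

-- A signature, simplified to the type names it mentions; e.g.
-- String.fromInt becomes
-- { name = "String.fromInt", typesMentioned = [ "Int", "String" ] }.
type alias Signature =
    { name : String
    , typesMentioned : List String
    }


{-| Keep only the functions whose types connect, directly or
indirectly, to the types involved in the hole we want to fill.
-}
relevantFunctions : List String -> List Signature -> List Signature
relevantFunctions goalTypes allSigs =
    let
        -- Types reachable in one more step from the current set.
        step : List String -> List String
        step reached =
            allSigs
                |> List.filter (touches reached)
                |> List.concatMap .typesMentioned
                |> List.append reached
                |> dedupe

        -- Iterate until no new types are added (a fixpoint).
        expand : List String -> List String
        expand reached =
            let
                next =
                    step reached
            in
            if List.length next == List.length reached then
                reached

            else
                expand next

        reachable =
            expand (dedupe goalTypes)
    in
    List.filter (touches reachable) allSigs


touches : List String -> Signature -> Bool
touches reached sig =
    List.any (\t -> List.member t reached) sig.typesMentioned


dedupe : List String -> List String
dedupe =
    List.foldr
        (\x acc ->
            if List.member x acc then
                acc

            else
                x :: acc
        )
        []
```

Filtering first keeps the list of signatures small enough to fit into the prompt.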

[00:55:30] So to me, this is the really interesting intersection. [00:55:33] So you were mentioning earlier that you think that these tools will start to understand [00:55:38] your code more over time, but we're not there yet. [00:55:41] Actually, I don't see it that way.

[00:55:44] I believe that these tools are going to continue doing what they're doing, which is that they've [00:55:49] been remarkably successful in being able to solve complex problems just through this crazy [00:55:56] synthesis and predictive text sort of thing.

[00:56:00] No, I didn't mean it in the sense of understanding. [00:56:02] I meant it's just of gathering the information.

[00:56:07] Like every time you ask something to ChatGPT, you need to provide information about your [00:56:12] code base. [00:56:13] At the moment, it does not look at your code base.

[00:56:19] But I don't think that they will, for example, know type information about your code base, [00:56:25] except insofar as it's part of the predictive text because it's in there.

[00:56:29] But I think that they're getting enough mileage solving problems through this sort of predictive [00:56:36] text that they're going to keep going with that. [00:56:39] But I think the interesting intersection, especially with typed, pure functional programming [00:56:46] languages, is that humans have their role, these compiler tools and [00:56:55] static analysis tools have their role, and these AI tools have their role.


PRESS, Ofir, ZHANG, Muru, MIN, Sewon, SCHMIDT, Ludwig, SMITH, Noah A. and LEWIS, Mike, 2022. Measuring and Narrowing the Compositionality Gap in Language Models. Online. 7 October 2022. arXiv. arXiv:2210.03350. [Accessed 19 April 2023].

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease.

This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning.

We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly instead of implicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and then answers) follow-up questions before answering the initial question. We finally show that self-ask’s structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.