Can we get strong guarantees from AI tools that are known to hallucinate? We discuss some strategies, and ways that Elm might be a great target for AI assistance.
https://cdn.simplecast.com/audio/6a206baa-9c8e-4c25-9037-2b674204ba84/episodes/d1c5f97c-9700-48b0-ab35-a039edbfd0d5/audio/16dc506d-5aa1-42c1-8838-9ffaa3e0e1e9/default_tc.mp3 elm radio – 080: Elm and AI page
[00:14:04] […] writing skills have been valuable for a long time, but this unlocks a whole new set of things you can do, including engineering with your writing. And it really is like, I mean, if you think about what these prompts are doing, like the way that they work is they're based on the Context they're given, they're sort of like role playing with that context, essentially, because their basic premise is given like this context, what's likely to follow it, right? >> context role play
[00:14:36] So if you write in a certain style, it is going to be more likely to follow with a certain style. If you write in a style that sounds very much like a scientific paper, a scientific journal publication, and you write about your method for your study and all these things, it's probably going to give you more rigorous results. And it's probably going to do the things that it's seen by gathering information from a bunch of scientific journals, like coming up with a rigorous method and talking about whatever, like counterbalancing, you know, addressing particular concerns and stuff. So like, you have to really get it to role play to solve the problem you want it to solve, to be like the smartest thing to address the problem you want.
[00:15:27] And that's where nerds come into play again. [00:15:30] Yes, totally. Because we kind of get that. It's not just like some magic box to us, like we kind of can understand how it could synthesize information so we need to give it useful context for that.
[00:15:44] Now I was thinking, people play Dungeons and Dragons. [00:15:48] Oh, that too. [00:15:49] Or are used to role playing. [00:15:52] Yeah, very true.
[00:15:55] […] priming it with good context is one thing that I've been thinking about as I've been playing with it. >> context
[00:16:04] And another thing I think about is how verifiable is the problem you're giving it?
[00:16:28] I'm not giving it a specific problem to solve. I'm not giving it a problem where I can verify that the answer I got is correct. But if I give it, say, an Elm problem, then I have a way to check. There are certain problems where it's difficult to find an answer, but it's easy to know that the answer is correct once you have it.
[00:16:56] NP problems. [00:16:57] Is that the term? [00:16:58] No, but you know, P equals NP. [00:17:02] I never quite got that.
[00:17:16] I guess things like the traveling salesman problem would be an example of that, right? And then yeah, that's an NP problem.
[00:17:23] If you have a solution and it does fit an optimal Path, it's easy to tell, but it's not easy to derive an optimal path, something like that. Yeah, almost. You know whether the solution that is given is a valid solution because you can check it, and the checking is pretty easy. But knowing whether it's optimal is extremely hard. I see. >> path
[00:17:48] So like, is this the most optimal solution? Well, to check that you would need to check all other solutions. And it's easy if you find a counter example: then you know it's not the most optimal one. But to know whether it is indeed the most optimal one, you're going to have to check everything. And that's extremely expensive. Right, exactly.
[00:18:10] So I think like, to me, that's the type of mindset for finding good ways to use these [00:18:15] tools to automate our coding.
~
~
[00:18:18] Also, like you mentioned, finding a counter example in the traveling salesman problem is easy to verify because you just check the total distance it traverses or whatever, right? And is the number smaller, right? So that's a very cheap operation, to test a counter example. So if you're able to, let's say, write a prompt for ChatGPT to solve the traveling salesman problem for something, and you set it up and prime it with some context, and you found one solution, but now it needs to find a better path. And if it gives you a more optimal path, then you're done.
[00:19:07] You can easily verify that and you can say, you know that it provided you with something [00:19:13] valuable because you can easily verify that it's a valid solution and a more optimal solution.
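A minimal Elm sketch of why that verification is cheap (hypothetical module and helper names; cities are just 2D points here):

```elm
module TourCheck exposing (isBetterTour, tourLength)

-- Verifying a proposed tour is cheap: sum the edge lengths and compare
-- against the best length found so far. Finding an optimal tour is the
-- expensive part.


tourLength : List ( Float, Float ) -> Float
tourLength cities =
    case cities of
        first :: _ ->
            -- Close the loop by returning to the starting city.
            List.map2 distance cities (List.drop 1 cities ++ [ first ])
                |> List.sum

        [] ->
            0


distance : ( Float, Float ) -> ( Float, Float ) -> Float
distance ( x1, y1 ) ( x2, y2 ) =
    sqrt ((x1 - x2) ^ 2 + (y1 - y2) ^ 2)


isBetterTour : Float -> List ( Float, Float ) -> Bool
isBetterTour bestLengthSoFar candidate =
    tourLength candidate < bestLengthSoFar
```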
[00:19:21] So this class of problems that is easy and cheap to verify is the kind of thing where I find it to be a very interesting space. And I think that Elm is very well suited to this type of problem. So one very simple example: if you want to write a JSON decoder. And now another consideration here is what inputs can we feed it to prime it to give us better results.
[00:19:55] So we want to prime it with good context, and we want to have verifiable output. I've also been thinking about that verification through a feedback cycle. So iterating on that verification becomes interesting.
[00:20:12] If you use the OpenAI APIs, you can automate this process where you can test these results. And then you can even give it information like Elm compiler output, or whether a JSON decoder succeeded.
[00:20:27] So if, for example, you're trying to solve the problem of "I want to write a JSON decoder", and you either have a curl command to run to hit an API, or some JSON example of what the API gives you, that's your input. You prime it with that.
[00:20:44] You can even prime it with a prompt that steps it through that process to give you higher quality output, but then you can verify the result. So you say: your job is to write a JSON decoder. It needs to decode into this Elm type, and at the end it needs to give me a compiling JSON decoder of this type, and it needs to successfully decode this input. That's all verifiable.
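As a rough illustration, the verifiable target might look like this (hypothetical User type, field names, and sample payload, not from the episode):

```elm
module Api.User exposing (User, decoder)

import Json.Decode as Decode exposing (Decoder)


-- The Elm type we tell the AI to decode into.
type alias User =
    { id : Int
    , name : String
    , email : Maybe String
    }


-- The output we ask for: a decoder that compiles and successfully decodes
-- the sample input, e.g.
--
--     Decode.decodeString decoder """{ "id": 1, "name": "Jane", "email": null }"""
--
-- must return `Ok ...`, which is easy to check mechanically.
decoder : Decoder User
decoder =
    Decode.map3 User
        (Decode.field "id" Decode.int)
        (Decode.field "name" Decode.string)
        (Decode.field "email" (Decode.nullable Decode.string))
```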
[00:21:14] So if it gives you garbage, or hallucinates something, or gives you invalid syntax, you can even tell it, and it can iterate on that. And you can kind of reduce the amount of noise. Because I don't want to hear about hallucinations from AI. So, you know, like before, I mentioned how much we want guarantees, not just somewhat high confidence. I want guarantees.
[00:21:37] But if we can throw away anything that's garbage and only get signal, no noise, then we can [00:21:44] do really interesting things. [00:21:46] And Elm is really good for that.
[00:21:48] You would like to have a system where you skip the intermediate steps of telling it, hey, this is wrong because this doesn't compile. So: here's some source code, here's my request. And then there's some back and forth between the Elm compiler, for instance, and the system, the AI.
[00:22:06] And then you only get to know the ending result. [00:22:09] Exactly.
[00:22:11] And then it's like a proven result. [00:22:14] It's a guarantee at that point.
[00:22:15] So this is kind of the cool thing: with a little bit of glue, a little bit of piecing things together, a little bit of allowing it to iterate, get feedback, and adapt based on that feedback, which GPT-4 is actually very good at, you can get guarantees, you can get guaranteed safe tools, especially with Elm.
Writing Tests
[00:22:41] But I'm guessing, or at least whenever you say verifying the results, I'm thinking of [00:22:46] the Elm compiler. [00:22:47] But I'm also thinking of writing tests, you know.
[00:22:51] I would probably also try to include the results of Elm tests in the prompt, if possible. But that does mean that you need to verify things. And that's kind of what our industry is all about, right? Why we have software engineers and not just coders.
[00:23:11] That's why we call ourselves engineers: because we make things, and we know we shouldn't trust even ourselves.
[00:23:23] We shouldn't trust the code that we're writing, the code that we're reading, and the code [00:23:28] that has been running for years, because we know, well, there are bugs everywhere. [00:23:34] So that's why we have all those tools, type systems, test suites, formal logic, manual [00:23:42] QA, all those kinds of things, to make sure that we do things correctly.
[00:23:48] And also even the processes, like the Agile movement, are about working in such a way that you get better results out of it. So we do need to verify our results. And we can't just use the results of the AI willy-nilly.
[00:24:11] I mean, we can, and people are. I think that's actually kind of the norm. It's going to become increasingly common to see, sort of, "this is a really weird piece of code, does this even give the right results?" Like, oh, somebody just YOLOed this ChatGPT or Copilot completion into the code and committed it.
[00:24:34] But I mean, it's something very different from what we do today. Because in a lot of cases, we are still running code with not a lot of tests in practice. I feel like most people don't write enough tests, myself included. So this is maybe just strengthening the need for adding tests.
[00:24:59] Our role becomes more like verifying and guiding systems rather than like, I can write a line [00:25:07] of code. [00:25:08] That's not the super valuable asset anymore.
[00:25:12] But I do feel like because it's going to be so easy to write code, and because you don't [00:25:16] go through all the steps of writing good code, you're not going to do it as much. [00:25:22] For instance, what you like to do, and myself as well, is to do things in a TDD style. [00:25:29] You know, you start with a red test, and you change the code to make the test green, but [00:25:40] you only change the code as much as necessary for that test to become green. [00:25:46] And then you continuously improve or complexify that function until it hits all the requirements.
[00:25:54] But if I ask the tool, hey, can you give me a function that does this? Well, I probably won't have all the tests that would have resulted from that process.
[00:26:07] It's just like writing tests after the fact. So you can probably ask the tool to write tests, but do you want an AI to write your tests?
[00:26:18] It's kind of like, who watches the watchmen, or whatever that saying is.
JSON Decoders
[00:26:42] Do we want to take JSON decoders for granted? We want to be able to write them with a lot of flexibility, but we don't want to spend a lot of brainpower creating and maintaining them. So if they're verifiable, that's great. If we can continue to verify them, and, better still, if we can use something like GraphQL to make sure they stay in sync, even better.
[00:27:13] But we don't really want to have to think too much about building and maintaining those low-level details. We want that to just be: given a decoder that works. So this is a very good thing to delegate to AI, in my opinion. Whereas "solve this complex problem that has a lot of edge cases", with a lot of things to consider about the use case and how we want it to behave, those are the types of things where I think our job as an engineer is still extremely relevant.
[00:27:46] Thinking about the user experience. In my opinion, engineering these types of things is going to become a more important part of the job. Sure, these AI systems can sort of do that, and we can tell them: think about the user experience, think about these different use cases, and think about that in the test suite you write.
[00:28:11] But I think you want a human involved in really artisanally crafting user experiences and [00:28:18] use cases. [00:28:20] And then you want to say, okay, now that I've figured these things out, here's a suite of [00:28:24] tests. [00:28:25] And if some AI thing can just satisfy those tests, maybe you're good, you know?
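As a concrete (and entirely hypothetical) example of "here's a suite of tests" that an AI could be asked to satisfy, assuming a module Sorter exposing sortNumbers : List Int -> List Int:

```elm
module SortNumbersTest exposing (suite)

import Expect
import Sorter
import Test exposing (Test, describe, test)


-- The tests act as the verifiable specification: if elm-test passes,
-- the generated implementation is acceptable.
suite : Test
suite =
    describe "Sorter.sortNumbers"
        [ test "sorts an unsorted list" <|
            \_ ->
                Sorter.sortNumbers [ 3, 1, 2 ]
                    |> Expect.equal [ 1, 2, 3 ]
        , test "keeps duplicates" <|
            \_ ->
                Sorter.sortNumbers [ 2, 1, 2 ]
                    |> Expect.equal [ 1, 2, 2 ]
        ]
```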
[00:28:33] Actually, one of the things that I tried with ChatGPT 3, so maybe it's better now, but I think my point still holds, is I told it: please write a function that satisfies these Elm tests.
[00:28:48] So I wrote some tests and basically told it to write a function. [00:28:53] And it did so and it was pretty good, but it wasn't correct. [00:28:58] Like there were syntax errors, which I told it to fix.
[00:29:03] And when those were gone, well, the tests were not passing. Some of them were, but not all of them.
[00:29:09] And the function that I needed was a bit too complex to be such an easy function to implement, as you said before.
[00:29:19] So basically the code that it wrote was pretty hard to read. And so that means, okay, I have something that I can use as a basis, and that I need to change to make the few failing tests pass.
[00:29:34] But because it was so complex, I was like, well, how do I make the tests pass? Well, to make the tests pass, I need to change the code. To change the code, I need to understand the code.
[00:29:46] So how do we understand the code? Well, if there's anything you've taught me, you or other people in the agile community, it's that you can get an understanding of the code by changing the code, by applying refactoring techniques:
[00:30:03] extracting variables, renaming things, changing how conditions work. And as you do these tiny steps, because we like them, you start to get some insights into the code, and then you can finally notice: oh, this is clearly wrong. Now I know what I need to change.
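A tiny sketch of what those refactoring steps can look like in Elm (a hypothetical Pricing example, not from the episode): the behaviour stays the same, but extracting a value and naming the condition makes hard-to-read generated code legible:

```elm
module Pricing exposing (discount, discountBefore)

-- Hypothetical "generated" code: correct, but hard to follow at a glance.
discountBefore : Float -> Int -> Float
discountBefore price qty =
    if qty >= 10 && price * toFloat qty > 100 then
        price * toFloat qty * 0.9

    else
        price * toFloat qty


-- The same behaviour after a couple of tiny refactoring steps
-- (extract a value, give the condition a name).
discount : Float -> Int -> Float
discount price qty =
    let
        total =
            price * toFloat qty

        qualifiesForBulkDiscount =
            qty >= 10 && total > 100
    in
    if qualifiesForBulkDiscount then
        total * 0.9

    else
        total
```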
Work with Legacy Code
[00:30:25] And the thing that I find funny is that this is exactly how you work with legacy code. But this code is only a few seconds or a few minutes old, which means that working with legacy code is becoming even more relevant,
[00:30:41] even for this brand new code, which I find very odd, and also more interesting. That's a nice insight. I like that.
[00:30:48] I mean, I do think that we need to guide what kinds of results we want, also with these steps, with prompt engineering and priming. But I think you're right that this does become a sort of process of creating some code where we can look at its behavior, get a test around it, see that the test is passing and verify it, but not really understand the code, and then need to do that process of refactoring to get clarity and get it into a shape that fits our mental model, or to take away complexity.
[00:31:26] But also, we can say, you know, here's a unit test, write some code that makes this test pass. And we can do some prompt engineering that says: do that using the simplest thing that could possibly work. Here's an example of the simplest thing that could possibly work.
[00:31:47] In this test, there's this error message that the test is giving, and you write this thing where, okay, for sorting a list, it returns the hard-coded list, and that makes it green. And that's the simplest way it could make that work.
[00:32:01] So you can actually illustrate that with examples. You can write very long prompts and you can get it to do a sort of fake-it-till-you-make-it style process that you can actually understand. So you can get it to follow the kind of process you would follow, and it totally changes the results you get.
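For instance, sticking with the hypothetical Sorter.sortNumbers tests from above, the "simplest thing that could possibly work" against only the first test is a hard-coded return value:

```elm
module Sorter exposing (sortNumbers)

-- "Fake it till you make it": with only the "sorts an unsorted list" test
-- in play, the simplest implementation that makes it green is to return the
-- expected list verbatim. Later tests then force a real implementation
-- (e.g. List.sort).
sortNumbers : List Int -> List Int
sortNumbers _ =
    [ 1, 2, 3 ]
```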
[00:32:22] And if, in addition to that, you connect it to test output and compiler output so it can iterate on that, you can actually automate some of those things, which starts to become very interesting.
[00:32:33] I'm wondering whether that would have the same effect, in the sense that if I do this and I only see the end result, which is kind of the point, well, will I have any insight into how this function works, since I didn't write it?
[00:32:50] So now it's just like someone else's code. And again, if I need to change it, then I need to go through all those refactoring steps to make it easier to understand for myself, or just go read it carefully.
[00:33:05] But definitely the thing that I will keep in mind is that all these techniques about writing good code, they will stay relevant.
[00:33:15] So if I don't want to lose my job, this is the kind of thing that I should maybe focus on, because I think these will stay relevant.
[00:33:25] Maybe my whole job will be removed. Maybe I will get fired if it becomes way too good. But maybe my chances of not being fired increase if I am one of those who are better at these tasks.
What Do I Want to Be Atomic?
[00:33:40] And one of the things that keeps coming up for me is like, what do I want to be atomic? [00:33:45] Like there's a certain philosophy of using tools that I've arrived at through a lot of craftsmanship principles and TDD and things like that, which is like, I don't want tools that I can partially trust and I don't want tools that give me partial results. >> atomic
[00:34:04] I want tools that I can completely trust and that allow me to take a set of low level steps, [00:34:12] but think of them as one high level step. [00:34:15] So to me, that's the question.
[00:34:16] Now, in the case of making a red test green in a TDD step, for example: do the simplest thing that could possibly work.
[00:34:26] What if that was an atomic step I could take for granted? [00:34:30] That instead of a set of low level steps, I will look at the code, I will hard code the return value, I will create a new module with the name that's failing. It says could not find module of this name. I will create that module. [00:34:43] I will create a function of the name that the error message in the failing test says [00:34:48] is missing. >> atomic
[00:34:49] I will write a type annotation that satisfies the compiler and return an empty value and [00:34:56] have a failing test. [00:34:58] And then to make it green, I will change that empty value to a hard coded value that makes [00:35:03] the test green.
[00:35:05] What if I could just take that for granted and say, hey, computer, do that step, do that [00:35:09] TDD step to make it red and then make it green in the simplest way possible. [00:35:13] And I could take that for granted and then I can take it from there. [00:35:15] That would be great.
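A sketch of that atomic step with a hypothetical Greeting.greet function (names invented for illustration):

```elm
module Greeting exposing (greet)

-- Step 1 (red): the failing test says `Greeting.greet` is missing, so create
-- the module, the function, and a type annotation that satisfies the
-- compiler, returning an "empty" value. The test still fails.
greet : String -> String
greet _ =
    ""

-- Step 2 (green): change the empty value to the hard-coded value the failing
-- test expects, e.g.
--
--     greet _ =
--         "Hello, Alice!"
```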
Guardrails
[00:35:16] And then that's something I can fully trust and I can sort of verify it. [00:35:22] And so another principle I've been thinking about in sort of like designing these prompts [00:35:27] and these workflows using AI tools is guardrails.
[00:35:31] So not only verifying something at the end, that it did the correct thing, because you can run the Elm compiler, you can do that.
[00:35:39] But also along the way, you can say: OK, for example, you can create a new module and a new function, but you can't touch any other code and you can't touch the test. The test has to remain the same and the test must succeed at the end. You sort of set up guardrails, and you say, listen, if the AI, given these guardrails, can give me a result that satisfies all these criteria by the end of it, then I can verify that it gave me what I wanted and I can fully trust it.
[00:36:14] Those are the types of tools that I want.
[00:36:16] So one thing that I was really amazed by, I'll share a link to this tweet, but I saw this demo, this was actually with GPT-3, and this example stuck with me, where somebody was finding that GPT-3 did a poor job if you asked it questions that went through several steps.
YOUTUBE A3GtlwwWDhI The Compositionality Gap Explained (with GPT-3)
compositional questions ⇒ Compositionality Gap
~
PRESS, Ofir, ZHANG, Muru, MIN, Sewon, SCHMIDT, Ludwig, SMITH, Noah A. and LEWIS, Mike, 2022. Measuring and Narrowing the Compositionality Gap in Language Models. Online. 7 October 2022. arXiv. arXiv:2210.03350. [Accessed 19 April 2023].
~
[00:56:29] But I think that they're getting enough mileage solving problems through this sort of predictive text that they're going to keep going with that. But I think the interesting intersection, especially with typed pure functional programming languages, is that humans have their role, these compiler tools and static analysis tools have their role, and these AI tools have their role.
Trifecta
[00:56:59] So with this Trifecta, I think each of these pieces needs to do what it is best at. Compilers are good at verifying things. Humans are good at... do we even need humans anymore? Humans are good at thinking critically and guiding these tools. Humans have goals. Humans are good at gathering requirements.
chat.openai: Trifecta is a term used in various contexts to refer to a sequence of three events, actions, or elements that occur together or are grouped together. In horse racing, a trifecta is a type of bet in which a bettor selects the first three finishers of a race in the correct order. In politics, a trifecta is a situation where a single political party holds the majority in all three branches of government: the executive, legislative, and judicial branches. In sports, a trifecta can refer to a player or team achieving three major accomplishments in a single game or season, such as hitting a home run, stealing a base, and making a great defensive play in baseball. Overall, trifecta is a versatile term that can refer to any three related or significant things that occur together or are grouped together.
~
⇒ Trifecta