Tereza Sokol

YOUTUBE M9ulswr1z0E Demystifying Parsers

terezka/yaml github

I'm Tereza, and today I am going to be doing some demystification of parsers.

Now you might be asking: why? That's a reasonable question, because who cares about parsers? Until recently, literally the only person I knew who cared about parsers was Evan, and he'd be really weird about it too. He'd come over to me like, "Oh man, my parser is allocating even less now," and I'm like, "I don't know what these words mean." Little did I know. I was like, this doesn't sound like something I should get into; this sounds like something that would turn you into a crazy person.

But meanwhile, I'm working at NoRedInk, as some of you know, and we have this one particular interview which I conduct a lot. Basically, you have some data, and that data is in YAML. You parse the data, you do some transformation on it, and then you do a tiny visualization of it. The YAML part is totally trivial in most languages and doesn't matter to the interview; it's a mere detail that shouldn't matter. At the same time, we really encourage our candidates to use the language that they're most comfortable with. The problem, though, is that if you want to use Elm, this is a really bad choice, because there isn't a YAML parser, and I have to say, "I'm sorry, you can't do it." Which is sad, because obviously we get a lot of candidates who are really comfortable with Elm.

So this made me think: somebody should write a YAML parser for Elm. And then weeks passed by, and I'm like, I guess I'm making a YAML parser right now. So now I'm one of two people I know who care about parsers, but hopefully after this talk there will be a slightly broader interest. With those words, let's parse some YAML.

Step one is knowing what a parser is. A parser is very simple. In its most abstract form, it's really just anything that takes some input and turns it into a data structure, and that input is, in 99% of cases,
just text. One of the more common uses of parsing, as Evan mentioned, is programming languages, which is really funny, because I always think of programming as something different, but it's really just text that you then read, and it's kind of weird. You can also parse natural languages; obviously that's harder. Other common uses are parsing JSON and XML and YAML, like we're going to be talking about today, and lots of other stuff. The cool thing is that once you parse it and turn it into a data structure, you can do a lot more cool stuff with it: you can analyze it in different ways, you can visualize it, you can translate it. So it's really step one: getting the data out of the file, so that you can get the information out afterwards. So yeah, parsers are pretty simple, depending on what you're parsing.

This is what YAML looks like. As we can tell, it's a lot, so we're just going to focus on lists, and on one of the ways of writing YAML lists. So this is the text input that we're going to be parsing, and we want to turn it into the data structure List String. How do we start doing that? Well, we can start by installing the library, which is called elm/parser, which is made by Evan. It's great.

video – I usually take a look at the API first and that looks like this.
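For readers without the slide, here is a minimal sketch of a few elm/parser building blocks in use. The module name and the names answer and maybeInt are made up for this illustration; only run, int, oneOf, map, and succeed come from the library.

```elm
module ApiSketch exposing (answer, maybeInt)

import Parser exposing (Parser, int, map, oneOf, run, succeed)


-- Running a primitive parser against an input string gives back a Result.
-- This one evaluates to Ok 42.
answer : Result (List Parser.DeadEnd) Int
answer =
    run int "42"


-- oneOf tries its options from the top, much like Json.Decode.oneOf.
-- Hypothetical helper: an optional integer.
maybeInt : Parser (Maybe Int)
maybeInt =
    oneOf
        [ map Just int
        , succeed Nothing
        ]
```

The shape is the same as with a JSON decoder: build a value of type Parser a out of smaller pieces, then hand it to run together with the input.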


I was really pleasantly surprised, because I thought parsers were gonna be something difficult and hard, but I was like: this is just like a JSON decoder.

Oh yeah? Well, we'll see about that. So it has all the elements that I'm familiar with: there's an int and a float and oneOf and andThen, and you just run the parser with the input and it gives back the result. Easy peasy.

I was further encouraged when I saw my first example, which was a little bit like this. We have a tuple, this is our string, and we want to turn it into a data structure, this Point, where the first integer is the x and the second integer is the y. The parser that we end up with looks a little bit like this. You might already be getting the point of this (I'm sorry, it's not on purpose, it's compulsive), but the only part that might be unfamiliar is these operators. I like to think about them like this: the one with the dot, (|.), just eats the stuff and throws it away, and the other one, (|=), keeps the stuff.

If you just go through it one by one, what happens is: it finds the parenthesis and throws it away, finds the spaces, throws them away, finds an integer, keeps it, finds a comma, throws it away, spaces, throws them away, integer, keeps it, spaces, throws them away, parenthesis, and throws it away. It's very important to get into this mindset of reading one character at a time; it's crucial for making parsers. What happens now is that it has the two integers that it kept, and those are the arguments for the thing in the succeed, the function in the succeed. If you've ever worked with the JSON pipeline library, it works kind of like that.

But let's see how this knowledge can help us make a list parser. It starts out well: we can parse the dash. It's exciting. Actually, we want to parse the space afterwards as well, because that's required by YAML. But what's that? That's a "t". T is for trouble. In our first example, as you just saw, we had an integer parser, and that made our life really easy, but if we look at the API now: where is string? It actually took me a minute to realize why this
is: everything is a string. Your whole job in making a parser is to define what is a string and what is trash that you want to throw away. So I was like, oh, this is harder now. But there is another part of the API which helps us out with that. It looks like this: it's chomping. This library has great names. The first one is chompWhile: you give it a criteria, which is the check that it does on every character. As long as that check returns true, it keeps going to the next character, and once it's false, it stops. chompIf is very similar and takes the same kind of criteria, except it only reads one character, and the parser fails if the criteria returns false.

Now, the observant among you will have noticed that these do not return a Parser String. They return a Parser of unit, which is silly, because I know how to make a unit; I don't need to parse it. But Evan is ahead of that, so he made another function called getChompedString, and that does what the name suggests: if you chomped some characters, it gets you the string that you chomped.

This is all very abstract, so let's take a look at some examples first. Imagine that we have "123 lizards". This is heaven for some, hell for others. We just want to focus on the numbers, and we can do this by making a parser that looks like this. We chomp while the character is a digit. It goes 1, 2, 3, those are all digits; when it comes to the space, it's like, oh, that's not a digit, we're gonna stop here. The result is that it will return "123", which is great.

What we're gonna do now is create a few helpers that we're gonna use later. We can start by making a string one that's a little more flexible, so we decide the criteria and can use it around later. It looks very much like the digits one, except we decide the criteria. One thing that's worth noticing, though, is that chompWhile always succeeds, meaning that if it doesn't find any of the
characters that fit the criteria, it returns just an empty string; it just doesn't chomp anything. So a more appropriate name for this function is actually "zero or more". That implies that if we want to do a "one or more", we want to use the chompIf function, because that will actually fail if it doesn't find one of the characters that we're looking for.

Now, how can we use this for our list journey? We still have this one, and this is how far we came last time. We want to use "zero or more", because the entry can actually be an empty string, that's fine, and we want the criteria to be "not a newline", the thing at the end there. So we'll chomp up until the newline, and then it stops. And now we want to chomp the symbol, right, the newline. Or do we? It's very hard to make parsers this exciting, but I'm really trying. Because, you know, next time there isn't a newline at the end; there's the end of the file. So what do we do about these things that are only sometimes there?

Well, we can use oneOf, which, if you've used the JSON decoder, is quite like that one. You give it a list of parsers, which are the options that you want to check, and it just takes them from the top: if the first one parses, it's done, great; if not, it tries the next one. If we try to use that in our case, it goes like this: we were here, and then we add a oneOf and say, OK, if there's a newline, then we want to parse the next element in the list, and if not, then we're done.

There is tricky stuff, though. If you come from JSON decoders and that is your metaphor for how to approach this library, then you're in trouble with this one, because it acts slightly differently. To show that, we have a small example. We have some strings, and they can be dollars or something else, it doesn't matter, and we want to turn them into a data structure that is Dollars or NotDollars, the two building blocks of the universe. We can have a parser that looks a little bit like
this. This is pretty straightforward: you ask if there is a dollar sign, and if so, we take the number afterwards and put it in our data structure as Dollars; if not, we just take whatever it is, it doesn't even matter, and put it into NotDollars.

Now, what if somebody writes a dollar sign followed by something that isn't a number? What will happen then? Well, it will come in and be like, OK, we found the dollar symbol, but the int is failing. If you come from the JSON thing, you'll say, OK, that one is failing, so it will just try the next one. That's not the case, though, because we're committed to the path now. There's no going back once you've eaten the first character and accepted that this is the path you're on. Instead, we actually have to continue down this path: is there an int? OK, then we can come back up. But if there's not, then we have to deal with that case on this path as well. So the moral of the story is: once you get on the path, there is no escape. That's just how life is; you just have to deal with it. This is actually not relevant for our list parser, but it is an important lesson for dealing with parsers, and others.

But there are some questions left from our previous progress: what is next? What is done? Good questions. I just made them up; they don't mean anything. What we want to do, though, is that when we find a newline, we somehow go back up and parse the next entry. The next thing after the newline is the dash, so that's what we want to parse. As you can imagine, you can actually just call the parser again and make a recursive parser, although that might end up in some stack overflow things, so there's a special API for loops that we're gonna talk about now.

The first function, as you can tell, is loop. It takes an initial state, and then it takes a function which takes that intermediate state and returns a parser of a Step. And what is a Step? It is either Loop or Done. If you return Loop, then we go back around,
and when you're Done, you're done. I don't like talking about types like this; it always confuses me, so let's just look at how it actually plays out. This is what our parser would look like. We start out with the loop and an empty list as our initial state. Then, as before, we parse the dash and the turtles, and we find a newline, and if it's a newline, we return Loop. Then we end up in the finish function, where the entry is "turtles" and next is the Loop. We add the entry to our temporary list of strings and go back around, find the dash, parse the lizards, and this time we find the end of the file instead, so we're Done, and we reverse the list because it's backwards. And that's it: we can parse the list now. That isn't even that crazy.

So we learned about chomping, and we learned about parsing things that are only sometimes there, and loops, and this is like 99% of all the parsers that you need to write. So let's try our knowledge on something a teeny bit harder, which is YAML inline lists. They look like this; they should be familiar. The only thing that is really different is that the list is encased in these brackets, so we want to start out by parsing that bracket and the spaces. Oops. And then we want to parse the "a". Or do we? It can be an empty list instead. So we actually want to ask ourselves: is it an empty list? If so, we send back the empty list, and if not, we continue to the actual looping. This one looks very much like the one before. You start out with the empty list, then you parse the "a" and end up at the comma. Then you ask about the comma: if there is one, you return Loop, go back around to the next entry, finish it, and trim it, because there can be some extra spaces. You do that until you've read the last element and see the closing bracket character, and then that's the end of the list. And that's it: you made an inline list parser. That's great, although no project is done,
well, it can be good, but it's not finished before it has good error messages. The API for that is really simple: you just have a string, which is your error message, and you put it in wherever you need it. So in our past example, where we want to eat one or more characters, we can make that into a more customized function; this is what it would look like now, and we just add it back in here. Then, if we want to create an error for when we do not actually have a character that matches, we can put a oneOf in, which is like the natural environment of the problem, and if the chompIf fails, we can send an error saying, "I was expecting at least one valid element character." If we want to apply that, we can use it in the entry here that we just made, and we can add another one in the oneOf here, and that's it.

So parsers are actually really easy. That's the point of this talk: they're just really dumb. Primarily, you just have to pretend that you're explaining a data format to someone who was born yesterday, who can only read one character at a time, and who must absolutely know what to do next based on that character. So they're not scary at all.

I only have one recommendation before you start parsing everything on the Internet, as I of course know you will immediately when you go home: test-driven development is really excellent for parsers. It helped me so much to create a full test suite, because the more complex your format gets, the more edge cases there are. If you write many parsers and they're inside each other, you're like, "Ah, this is crazy," so you can just run the test suite and you'll feel a lot more comfortable with your parsers.

Now, there's one thing we haven't covered, which is the craziest part, which is the indentation. It's unappreciated, so it's YAML appreciation
day, as you can tell [inaudible]. The way that I did this in the library was actually: I saw these functions in the API and I was like, I don't get it, so I didn't use those. I just used getCol; I figured I could work it out myself.

The way this works is: if you have this new list of ours, first of all we have to change the data structure we're gonna parse into, because now we don't only have a list of strings. The list can also have lists inside of itself, and it continues. So this is our new structure, and the parser would end up looking like this. The first thing is that when we call this function the first time, we pass down one, as in column one is the first one. Then, for each step, I made this helper function which just asks: what is the indentation compared to the one we expect? That way we can act relative to that indentation. If it's smaller, it means that we're going back into a parent list, and then we should be done. If it's exactly the same, then we expect some new element. If it's larger, that means you messed something up, and we have to report a problem. And if it's the end, you're ending.

So in this case, when we read the text, you start with one, and the column is exactly one. We read the entry, we read zero or more with the turtles, we find... oh, we don't, because I forgot to write that up. We go back up, and then we expect another element and go down, and then we find the dash and the newline, and then we go to the next non-whitespace character and get the column again. Now we're starting in the list again with a new indent, so the indent is now three, and then we do it all over again. So it's actually not that crazy. The helper function that I made, you can look it up in the library, because it's actually very simple.

But with those words, we don't really have a lot more time, so if you want to try out the library, you can find it at
terezka/yaml on GitHub, or you can talk to me more about parsers if you want to. Yeah, thank you. [Applause]
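The slides are not part of this transcript, so here is a minimal sketch of the block-list parser the talk builds up, using elm/parser's chompWhile, getChompedString, oneOf, end, and loop. The module name and the helper names zeroOrMore, entry, and step are assumptions made for this illustration and are not taken from terezka/yaml. The sketch also assumes there is no trailing newline after the last entry.

```elm
module YamlListSketch exposing (example, list)

import Parser exposing ((|.), (|=), Parser, Step(..), chompWhile, end, getChompedString, loop, oneOf, run, succeed, symbol)


-- Zero or more characters that pass the check, returned as a String.
-- chompWhile never fails, so this may return an empty string.
zeroOrMore : (Char -> Bool) -> Parser String
zeroOrMore isOk =
    getChompedString (chompWhile isOk)


-- One list entry: throw away "- ", keep everything up to the newline.
entry : Parser String
entry =
    succeed identity
        |. symbol "- "
        |= zeroOrMore (\c -> c /= '\n')


-- One trip around the loop: parse an entry, then decide what comes next.
-- A newline means go around again; end of input means we are done.
step : List String -> Parser (Step (List String) (List String))
step reversed =
    succeed (\item next -> next (item :: reversed))
        |= entry
        |= oneOf
            [ succeed Loop
                |. symbol "\n"
            , succeed (\items -> Done (List.reverse items))
                |. end
            ]


-- Start with an empty list; entries are collected backwards and
-- reversed once at the end.
list : Parser (List String)
list =
    loop [] step


-- Evaluates to Ok [ "turtles", "lizards" ].
example : Result (List Parser.DeadEnd) (List String)
example =
    run list "- turtles\n- lizards"
```

Building the list in reverse and flipping it once in the Done branch keeps each step cheap, which matches the "reverse the list because it's backwards" remark in the talk.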