Okay, welcome back everybody. We are going to pick up where we left off last time, talking about distributed decision-making. Just to set the context here: the consensus problem is basically one in which we have many nodes in the system (and by the way, just in case you were wondering, a node is a separate physical box that might be connected only via a network), some nodes can crash and stop responding, and eventually all nodes decide on the same value from a set of proposed values. So that's basically the consensus issue we're trying to solve, and the key thing that makes it difficult is the fact that nodes might crash and stop responding and then come back, and we want to make sure that everybody still does the same thing. That's what we're going after here. In a little bit we'll also talk about what happens if nodes are actively malicious and trying to screw up the process, but we're not there yet.

So distributed decision-making is, for instance, the notion that all of the nodes are going to choose between true and false, or commit and abort; these are kind of equivalent ideas. And it's going to be atomic in the sense that all of them will decide on true or all of them will decide on false, but we'll never get a mixed grouping. Equally important, but something that sometimes gets forgotten in this whole process, is making sure that once the decision is made it's not forgotten. If a set of nodes makes a decision and then they immediately crash and lose all their information, it's as if they never made the decision in the first place. Just to remind you, this is the durability, the D portion of ACID. In a global-scale system D gets a little bit trickier, but we talked last time about erasure coding, massive replication, or even blockchains, which have a replication aspect to them, for getting our durability.

So we were at the very end of last lecture talking about two-phase commit, and really we came to two-phase commit because we couldn't solve the generals' paradox. If you remember, the generals' paradox is that two or more parties have to decide on a time at which to perform some action, like attacking, and the messages going back and forth are unreliable. What we showed is that this is impossible to do, so what we're going to do instead is solve a simpler problem: we're going to get the machines to agree to do something, or not do it, atomically, but we're not going to force them to all agree on a time.

All right, so the two-phase commit protocol, roughly speaking, has two phases, not surprisingly. The prepare phase is one in which a coordinator (and there is a single coordinator here) requests that all participants make a promise to either commit or abort (roll back) the transaction. Participants record their promise in the log, and then they acknowledge by saying whether they will commit or abort. The coordinator then makes a decision to either commit or abort based on what it hears from the participants: if any of them say they want to abort, the coordinator will abort, and if all of them say they're going to commit, then and only then will it actually commit. The commit phase is that point after we've heard from everybody and we make the decision: either everybody wants to commit, so we'll commit, or somebody doesn't want to commit, in which case we'll abort.

A key aspect of this is the persistent log on every machine that's participating, both the coordinator and all of the participants. What is this log doing for us? It's basically helping those nodes remember what decision they've made, so that if they crash and come back up they will continue to make the same decision. This is where two-phase commit gets interesting, because what we're trying to do is make this atomic decision, in which everybody makes the same choice and acts on that same choice, regardless of the fact that some of these nodes may crash and come back in the middle of it.

Okay, so let's set this up a little bit more; we were starting to look at the meat of this, and any of you who have actually started experimenting with homework 8 will kind of understand what's going on now. We have a question in the chat here, basically saying "not yet convinced by this process: what if the coordinator fails to receive an abort?" Well, if the coordinator fails to receive a vote, that's a timeout, and it will assume an abort. So we can make the rule that any time we don't hear from somebody, we're going to assume they're aborting, and you'll see that that allows us to keep our atomicity.

So the coordinator initiates the protocol and asks every machine to vote on the transaction, and the two possible votes the participants can come up with are commit or abort. The commit will only happen if it's unanimously approved by everybody. The coordinator waits until it receives votes from everybody, and if it times out, it assumes that somebody was going to abort and treats it as such.

Now, in the prepare phase, if a machine has decided to agree to commit, then at the point it makes that decision it is guaranteeing that it will accept the transaction. What does it do? After it decides that it's going to accept the transaction, it makes a little mark in its log before it responds, so that it will remember the decision. Even if it crashes before it tells its decision to the coordinator, when it comes back up it looks in its log, sees what it decided to do, and sticks with that decision. If it agrees to abort instead, we have the same idea: the machine guarantees that it will never accept the transaction, even if it crashes and comes back up again, and because the decision is recorded in the log, the machine will remember it if it ever crashes and restarts.

So in the commit phase, the finishing phase, the coordinator learns that all machines have agreed to commit, records its decision to commit in the log, applies the transaction, and tells all the voters to go ahead and commit, and we're good to go. Even if the coordinator crashes and comes back up after it has made the decision to commit, it'll see that in its log; if it comes back up before it has made that decision, it assumes that it missed some messages from the participants and just tells everybody to abort. The abort action is when the coordinator learns that at least one machine has voted to abort: it records its decision in the log and tells the voters to abort. And if you notice, because no machine can take back its decision, thanks to the log, we get this atomicity out of this.
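To make that message flow concrete, here is a minimal, single-process sketch of the protocol as just described. This is not any particular implementation: the "network" is just method calls, the `log` lists stand in for stable storage, and all of the class and constant names are made up for illustration.

```python
# A minimal sketch of two-phase commit. "Network" calls are method calls;
# the log lists stand in for stable storage. Names are illustrative only.

VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.log = []                      # stands in for the persistent log

    def prepare(self, txn):
        # Phase 1: decide, and record the promise in the log *before* replying.
        vote = VOTE_COMMIT if self.will_commit else VOTE_ABORT
        self.log.append((txn, vote))       # must hit stable storage first
        return vote

    def finish(self, txn, decision):
        # Phase 2: obey the coordinator's decision, whatever we voted.
        self.log.append((txn, decision))

class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, txn):
        # Phase 1: collect votes; any abort (or, in reality, a timeout) means global abort.
        votes = [p.prepare(txn) for p in self.participants]
        decision = GLOBAL_COMMIT if all(v == VOTE_COMMIT for v in votes) else GLOBAL_ABORT
        self.log.append((txn, decision))   # decision is durable before anyone is told
        # Phase 2: tell everybody the outcome.
        for p in self.participants:
            p.finish(txn, decision)
        return decision

workers = [Participant("w1"), Participant("w2"), Participant("w3", will_commit=False)]
print(Coordinator(workers).run("add key=42"))   # GLOBAL_ABORT, since w3 votes abort
```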
Okay, now there's a question here: if a node crashes indefinitely, is a backup node pulled in and used in its place? The answer is no, and as you've identified, one of the big issues with two-phase commit is that it can potentially block indefinitely in bad circumstances; we'll talk about that in a second. Two-phase commit by its nature does not have the ability to pull in backups. It's a simple algorithm, and we'll go from there.

Now, here's an example just so that you know how it goes. The coordinator says, "I'd like to know what you want." The workers wait for that, and if they're ready to commit they send back a vote-commit; if they're ready to abort they send back a vote-abort. The coordinator waits until it hears from everybody: if it gets a vote-commit from everyone it sends a global-commit, otherwise it sends a global-abort. Then finally the worker waits until it hears the status from the coordinator: if it's a global-commit it does the commit operation, and if it's an abort it does an abort operation. Notice, by the way, that regardless of what the worker decided during the first phase, if it hears that it's supposed to abort during the second phase, it will abort.

So here are some examples. Here's a failure-free example: the coordinator sends the vote request, they all say commit, the coordinator says global-commit, and we're good to go; everybody commits. Now, this might be a good time to ask: what does it mean to commit? Remember for a moment that this algorithm is really just trying to make a decision, commit or abort, that's it, and it's an atomic decision. They all make that decision to do something, but what it is that they do depends on what you're applying it to. For instance, it could be an update to a database: if it's a key-value store, you're going to add this key with this value to the global data store, and the global commit would be that everybody has agreed that should happen. So the commit and abort actions are basically the global decision on what to do with some proposed action.

Now, you can also view this as a state machine on both the coordinator and the workers (we'll sketch the worker side in code shortly). The coordinator has a simple state machine with four states: it starts in init, and when it's ready to start it sends out a vote request and then waits until it hears from everybody. If it hears a vote-commit from everybody it goes forward; if it hears a vote-abort from anybody, it sends out a global-abort. The worker looks like this: the worker also has an init state, and if it hears a vote request and it wants to commit, it goes to the ready state and waits; otherwise, if it wants to abort, it just goes to an abort state. Now, what happens if a worker misses a global-commit? A good example of that would be if it crashes and comes back up. At that point it can't make any forward progress, because it doesn't know what the decision of the coordinator is, so the worker can start polling the coordinator to see whether there's a decision yet.

All right, so how do we deal with worker failures? If you notice (and this is a good segue from the previous question), the state I've colored in red is one where the coordinator is waiting to hear, and failure really only affects states in which the coordinator is waiting for messages. The coordinator only waits for votes in the wait state, and if it doesn't receive all N votes it times out and sends an abort. So the way this protocol is set up, whatever failure cases happen, we always keep the main constraint, which is that either everybody agrees to commit or everybody agrees to abort, and we never get a fifty-fifty split where some of them do one thing and some do the other.

Here's an example of a worker failure: the coordinator sends out the requests, but only two of the votes come back and nothing else happens. Eventually there's a timeout, and at that point the decision is made to abort. Since that's recorded in the log, even if the coordinator crashes and comes back up there will never be confusion about it, and if worker 3 eventually reboots, it can ask the coordinator and it will find out that an abort was what happened.

So how do the workers deal with coordinator failure? This is a little more interesting. We wait in one of a couple of places. First, we wait for the vote request from the coordinator, and in that instance, if nothing happens and we time out, we can just decide to abort. If we want to commit, we go to the ready state and wait to find out what the decision was; if we decide to abort, then we already know what the decision is going to be, which is abort. So the worker waits for the vote request in init, and there it can time out and abort; the worker waits for the global-commit or global-abort message in ready, and if the coordinator fails there, the worker has to wait. The reason is that once the worker has said vote-commit, it has to find out what the decision was; it has no idea on its own, and the only way it can move forward is by hearing from the coordinator. So in this instance we have to just stall; there's really no other option, and this is part of why this is a blocking protocol.

Here's an example of the coordinator failing. Maybe it sends out vote requests and then nothing else happens; the workers eventually time out and might all abort from the init state. Or here, the coordinator sends out vote requests, everybody says commit, and then the coordinator crashes. At that point the workers can't do anything; they're all waiting in the ready state. But when the coordinator comes back up, it figures out that it has missed some messages and it just says abort.

All right, so all nodes have to use stable storage to store their current state, and stable storage is basically non-volatile: it could be a disk, it could be an SSD or NVRAM, whatever. On recovery, nodes can restore their state and resume: the coordinator aborts if it was in init, wait, or abort, commits if it had already logged commit, and so on; we can list all of these out. The key thing is that this algorithm is one such that no matter when and how the nodes crash and come back up, if we do the right thing with the log we always maintain our atomic behavior, where they all either decide to commit or they all decide to abort.

So really the key issue here is blocking while waiting for the coordinator to recover. A worker waiting for the global decision can ask its fellow workers about their state: for instance, if we're in the ready state and we don't know what's going on, we can ask other workers, and if one of them has already gotten a global commit, then we can take that decision and move forward.
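Here's a small sketch of the worker-side state machine we just walked through, including the timeout behavior: a timeout in init can safely abort, but a timeout in ready has to block. The state and method names are my own shorthand, not from any real system.

```python
# A sketch of the worker-side 2PC state machine described above.
# State and event names are illustrative only.

INIT, READY, COMMITTED, ABORTED = "INIT", "READY", "COMMITTED", "ABORTED"

class Worker:
    def __init__(self, wants_to_commit=True):
        self.state = INIT
        self.wants_to_commit = wants_to_commit

    def on_vote_request(self):
        if self.wants_to_commit:
            self.state = READY            # logged VOTE_COMMIT; now bound by the outcome
            return "VOTE_COMMIT"
        self.state = ABORTED              # free to abort unilaterally
        return "VOTE_ABORT"

    def on_global(self, decision):
        assert self.state == READY
        self.state = COMMITTED if decision == "GLOBAL_COMMIT" else ABORTED

    def on_timeout(self):
        if self.state == INIT:
            self.state = ABORTED          # never promised anything, safe to abort
        elif self.state == READY:
            # Promised to commit and doesn't know the outcome: must block,
            # polling the coordinator (or fellow workers) until it learns it.
            pass
```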
Okay, so a question that's come up is: does some variation on this scheme allow for non-unanimous decision-making, i.e., simple majority voting? The answer is: not two-phase commit. Two-phase commit is all-or-nothing; we'll talk about some majority-voting options in just a moment.

So, if another worker is in abort or commit, then the coordinator must have sent a global message, and the worker can safely abort or commit, respectively. There are cases, in other words, where we can exchange information between the workers to find out what happened, if a message we needed was missed or lost. If another worker is still in the init state, then both workers can decide to abort. But if all workers are in ready, then we really do need to block, because we do not know what the coordinator is going to do. You might guess that because they're all in ready they all said vote-commit, and therefore the coordinator is going to choose to commit; but in fact the coordinator might have crashed before recording a decision in its log, and in that instance it's going to abort when it comes back. So when everybody is in ready, we really have to wait. (I'll show a little sketch of these rules in code in a moment.)

So why is distributed decision-making desirable? The answer is fault tolerance. We want a bunch of nodes together making a decision, and we wait until they all make the same decision; then we know we aren't acting on faulty information. A group of machines comes to a decision even if one or more of them fail during the process, because they eventually come back up and move forward, and after the decision is made, the result is recorded in many places, so we'll know what the decision was even if nodes subsequently crash.

So why is two-phase commit not subject to the generals' paradox? I saw that question on Piazza too. Really, two-phase commit is about all nodes eventually coming to the same decision, but not necessarily at the same time. We're allowing nodes to reboot and continue, and reboot and continue, gathering information, so that eventually they all have the decision recorded in their logs and we get an atomic decision. The generals' paradox had the problem that we were never quite sure our messages made it through, and therefore there was no way to settle on a time for sure.

Now, an undesirable feature of two-phase commit is blocking, as we've mentioned: one machine can be stalled until another site recovers. Say site B writes "prepared to commit" in its log, sends a yes vote to the coordinator, site A, and then crashes; site A crashes too. Site B wakes up, checks its log, and realizes that it has voted yes, but it can't commit or abort on its own; B is blocked until A comes back. There really is nothing in this protocol that lets a node be absent indefinitely without someone potentially blocking forever, and that's an issue. A blocked site essentially holds resources, which might be locks, or pages pinned in memory, or whatever, until it learns the fate of the update, and so that's an issue.

All right, so there are a number of interesting alternatives to two-phase commit. There's three-phase commit, which adds one more phase and allows nodes to fail without everybody blocking; it's more of a majority-voting kind of scenario, and it's a little better. Paxos is a very popular example: it's an alternative that's used by Google and others, it does not have the two-phase commit blocking problem, and it was another protocol developed by Leslie Lamport, whom I mentioned last time.
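Going back to the cooperative termination idea from a moment ago (workers asking their peers what state they're in), here's a sketch of that decision rule. The state labels are the same informal ones used above.

```python
# A sketch of the cooperative-termination rule: a worker stuck in READY
# asks its peers what state they are in and decides from their answers.

def resolve_from_peers(peer_states):
    if "COMMITTED" in peer_states:
        return "COMMIT"        # coordinator must have sent GLOBAL_COMMIT
    if "ABORTED" in peer_states:
        return "ABORT"         # coordinator sent (or will send) GLOBAL_ABORT
    if "INIT" in peer_states:
        return "ABORT"         # that peer never voted, so commit is impossible
    return "BLOCK"             # everyone is in READY: only the coordinator knows

print(resolve_from_peers(["READY", "READY"]))        # BLOCK
print(resolve_from_peers(["READY", "COMMITTED"]))    # COMMIT
```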
There's no fixed leader; it can choose a new leader on the fly and deal with failure, so it's extremely fault tolerant. There are some who would claim it's also extremely complex, and there have been a lot of papers about taming the complexity of Paxos and so on, but Google seems to have done so and is using it actively. Raft is a variant developed at Stanford as an alternative to Paxos; the claim of John Ousterhout and his students is essentially that it's much easier to understand, and therefore much easier to implement correctly, but it's another fault-tolerant protocol in the same family.

What's interesting, though, is what happens if one or more of the nodes is malicious, where malicious means actively trying to screw up the protocol. All right, hold on a second, I'm going to pause; I'll be right back. Okay, sorry about that, I'm back. So the question on the chat is which protocol is commonly used in industry. Right now Paxos is pretty commonly used; two- and three-phase commit have been common for a long time in databases, particularly distributed databases, but Paxos is used pretty widely by Google and there are libraries that implement it. There's another question here about workers waiting once the server crashes; I'm not sure I understand that question, I may have missed the middle of it.

Okay, so what's interesting about these other protocols up top here is that if a node is malicious, meaning it's been broken into or it's running a version of the protocol designed to mess up the decision-making, then they are not resilient against that. Even though things like three-phase commit, Paxos, Raft, and so on might be resilient against failures and manage to move forward as long as, say, a majority are still functioning properly, if one of those nodes is actually malicious, they are not. So this becomes an interesting question: what do we do in that instance?

And so we have another Leslie Lamport paper, which is quite interesting; I'll put both the Paxos paper and the Byzantine generals paper up on the resources page, if I haven't done that already. The Byzantine generals problem is as follows: there's one general and n minus one lieutenants, so n total participants, and some number of these are malicious, or insane, or going to act weirdly, incorrectly, or maliciously. The question is, what do we do then? Before we can actually solve the problem, we really have to figure out what we want as our semantics. What we'd like is this: the commanding general sends an order to all of his lieutenants, and we'd like the following integrity constraints to hold. IC1 says all loyal lieutenants (those that are not malicious) obey the same order; if you notice, these two lieutenants, the ones in the red hats, are both deciding to attack. And IC2 says that if the general is loyal, and not malicious, then the loyal lieutenants do what the general says.

So how could a general be malicious? Well, a general could tell each of his lieutenants to do something different, and in that instance we say the general is malicious, and what happens then is that the remaining loyal lieutenants all still do the same thing: they always satisfy IC1, and in the good case in which the general is loyal, they also do what the general wants, which is IC2. Now notice that I've introduced some notation here: there are f malicious entities in the system and n total entities. And what's interesting about the original paper from Leslie Lamport is that he shows we can't solve the Byzantine generals problem with n = 3, because one malicious player can basically mess everything up.
So here we have an instance where one lieutenant is malicious, the general is not, and the other lieutenant is not; the blue ones are not malicious. The general tells each of them to attack, but the blue lieutenant has no idea whether the general is malicious or not, so it has to find out what the tan lieutenant says, and the tan lieutenant says, "well, the general told me to retreat." Now this blue lieutenant is stuck, because it has no way to make a decision that lets it satisfy both of the interactive consistency constraints, IC1 and IC2. In the case of the general being malicious, saying attack to one lieutenant and retreat to the other, this poor lieutenant on the left is once again lost, because he's hearing attack from the general and retreat from the other lieutenant. If you notice, these two scenarios, the one on the left and the one on the right, look exactly the same as far as this good lieutenant is concerned, and so the impossibility result says you can't solve this problem with n = 3. In fact it quickly generalizes to show that if you have f faults, meaning f malicious nodes, then you have to have n greater than 3f total participants in order to solve the problem.

Okay, so there are a bunch of algorithms that exist to solve this problem. The original algorithm was purely a thought exercise, because it was exponential in n, which is never great. Newer algorithms, although "new" is perhaps overstating it since they're from 1999, have a message complexity of about order n squared in the number of nodes (the slide should say n squared, sorry about that). A message complexity of n squared is doable, but you probably don't want n to be too big, and I will say that I've designed systems using the MIT version of the Byzantine generals solution where we kept n to 4 or 7 or 10 and not much bigger, because the message complexity gets pretty expensive beyond that. This Byzantine fault tolerance algorithm, BFT, is what the Castro and Liskov algorithm is called, and it allows multiple machines to make a coordinated decision even if some subset of them, fewer than n over 3, are malicious.

So what you can think of, going back to our earlier discussion of distributed decision-making, is that requests come in from somebody, a client or one of the participants; they go through this decision mill, which runs the n-squared algorithm that solves the Byzantine generals problem; and as long as the red, malicious nodes number fewer than n over 3 of the total, what comes out are distributed decisions that are agreed upon by all the non-malicious parties. That's a key advantage of a good Byzantine generals solution: we can have a coordinated set of nodes that together come to a decision even if a few of them are malicious. All right, questions on this? And notice, by the way, in reference to some questions earlier: here more than two-thirds of the nodes have to be non-malicious in order to solve this, whereas in the previous algorithms, which are not tolerant of malicious nodes, you only need more than half of them to be non-faulty, so nearly half of them can be faulty.
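Just to pin down the two thresholds we've been comparing, here's a tiny pair of helpers; the function names are mine, written only to make the inequality explicit.

```python
# The fault thresholds just described, written out as checks.

def tolerates_crash_faults(n, f):
    # Crash-only (non-malicious) settings: need a majority of correct nodes.
    return n > 2 * f

def tolerates_byzantine_faults(n, f):
    # Byzantine settings (Lamport's result): need n > 3f.
    return n > 3 * f

print(tolerates_byzantine_faults(3, 1))   # False: the classic n = 3 impossibility
print(tolerates_byzantine_faults(4, 1))   # True: 4 nodes can mask 1 traitor
```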
All right, so now, enter blockchain. What's interesting these days is that there's lots of discussion of blockchain, ever since the 2009 introduction of Bitcoin; it's hard to believe that's been over a decade ago. Let's start with what a blockchain is. A blockchain is a set of transactions that are back-linked with hash pointers, and those of you who have taken 161 or know something about security will know that what this really means is that you take the contents of a block on the left, you run a cryptographically secure hash over it, like SHA-256, and you put the resulting hash into the next block. As a result, as long as you know that hash, it's impossible to alter something earlier in the chain without being detected. The way a blockchain typically works is that everything is in a single chain, except at the very head where new transactions are being added; there, there may be several candidates for the new head, some branches, and eventually the branch with the longest chain becomes, probabilistically, the final head. If you run this long enough, all the new stuff eventually looks like what I have on the left here.

So the blockchain itself is a chain of blocks connected by hashes back to a root block; the chain has no branches except at the head, and blocks are an authentic part of the chain when they carry the right authenticity information. Now, if you're taking 161 and not talking about Bitcoin, you probably think something is authentic because there's a signature on it. That's one way to do it. In Bitcoin or Ethereum or some of these other blockchains, what actually happens is that the head is chosen by a consensus algorithm, and in many of them the head is chosen by solving a really hard problem, an extensive search of the hash space, to produce cryptographic proof that this is the right head. This is the job of the miners, who try to find a set of bits to put into the package such that when you take the hash, the resulting hash has some number of leading zeros; we can talk about that offline or at office hours if you're interested. This is called proof of work, because you have to burn a lot of cycles on a processor, and a lot of energy, to find it. The earlier blocks up here presumably already have the proof of work in them; this hashed block I've drawn in green is one that has a proof of work but isn't known by everybody yet, so it's not considered part of the final chain; it's still tentative. This is a longest-chain-wins kind of scenario.

Now, why is this good for Bitcoin? These transactions represent the exchange of money. If you buy coffee with some Bitcoin, that's an awful lot of coffee these days; it used to be that bitcoins were worth seventy dollars, or a few dollars, and it made sense to spend some micro-bitcoins on coffee. Now they're worth thousands of dollars and it's a little less obvious that you want to do that. But you might ask: is this blockchain algorithm a distributed decision-making algorithm? The answer is that you really can think of it that way, because once some item, some choice of commit or abort, is in one of these solid green blocks that has become part of the long chain, you can't change it, and so it's a distributed decision that everybody will agree on.

So look at the way a typical blockchain system works: here's the cloud, and you've got these miners around the world trying to solve these proof-of-work problems, and they're constantly copying information to each other. As soon as somebody solves a proof of work, it's very quickly replicated to everybody else; that person's success at solving the problem earns them a few fractions of a bitcoin, everybody hears about it, and that block becomes the new head of the chain. The way you'd use this for distributed decision-making is that you make a proposal to one of the miners that, instead of being a Bitcoin transaction, would be something like "I would like to commit the following record to my distributed database." It doesn't matter whom you send it to; eventually they send it to everybody else, those transactions get put into the blockchain, and they become distributed decisions. A decision in this case means the proposal is locked into the blockchain: it could be a commit/abort decision, it could be a choice about some value or state transition, whatever. If you submit a proposal and don't get an acknowledgment back, you might have to retry because something went wrong, but once it's in the blockchain, everybody can observe it. Those of you who know anything about Bitcoin know that there are far fewer miners than there are people observing and using the blockchain, and pretty much anybody who gets a copy of the blockchain can verify the decision. So we have the nice property that decisions locked into the blockchain can be verified by everybody. So I would say that yes, the blockchain is a distributed decision-making algorithm, and interestingly enough, there are a number of systems out there now that use, not necessarily the Bitcoin blockchain, but other blockchains, to solve the Byzantine agreement problem despite malicious parties being in the system. What's interesting is that whereas the MIT BFT algorithm we talked about needs on the order of n squared messages, blockchain-style Byzantine agreement tends to be closer to linear in the number of nodes, so it can involve many more nodes. This is interesting and kind of exciting to see where it goes; these blockchain-based Byzantine agreement algorithms are relatively recent, within the last five years or so.

There's a question on the chat asking me to say a little more about what a block contains. I'm not going to say too much, because I don't want to spend too much time here, but take one of these green blocks: it's a series of transactions from different people. When different people propose transactions, those are epidemically sent to everybody, and what the miners do is collect all the new transactions into a block and then start adding numbers to that block and hashing it; that's the problem they're solving. The first one to figure out which four bytes to add to the set of transactions, such that the hash over the whole thing has the number of leading zeros specified by the current state of the system, solves the problem and gets the coins. From the standpoint of our discussion of Byzantine agreement, the proposals go into these green blocks, and those proposals are things like commit or abort, usually with something attached like "commit this key-value pair to my global store." I hope that helps a little bit with the question on the chat.
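Here's a toy sketch of the back-linked hashing and proof-of-work search described above, using SHA-256 from Python's standard library. Treating "difficulty" as a count of leading zero hex digits is a simplification of how real chains like Bitcoin express their target, and the block layout here is made up for illustration.

```python
import hashlib, json

# A toy back-linked chain with a proof-of-work search.
# "Difficulty" = number of leading zero hex digits in the block's hash.

def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def mine(transactions, prev_hash, difficulty=4):
    nonce = 0
    while True:
        block = {"prev": prev_hash, "txns": transactions, "nonce": nonce}
        h = block_hash(block)
        if h.startswith("0" * difficulty):      # found a valid proof of work
            return block, h
        nonce += 1

# A stand-in genesis block (it carries no proof of work of its own here).
genesis = {"prev": None, "txns": ["commit key=42"], "nonce": 0}

blk, h = mine(["commit key=7", "abort key=9"], block_hash(genesis))
print(h)   # a hash with the required number of leading zeros
# Changing anything in `genesis` changes block_hash(genesis), which breaks
# blk["prev"], so tampering with earlier history is detectable.
```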
The other thing I will point out is that you can sort of see one of the big problems here. There are a lot of people who like to think of blockchains as the solution to all of the world's problems, and they talk about rather silly things like "I'm going to put my videos into the blockchain, because then they're guaranteed to be authentic and everybody can verify them." Well, if you look at what's going on here, any data you put into the blockchain gets replicated all over the world, and it's an extremely expensive process, so putting everything into the blockchain is actually pretty much a non-starter, although many people forget that and do it anyway. There's another question here on the chat, but let's talk about that one offline if we could, Geoffrey; I think it's a more extensive question. Okay, so: anyone in the world can verify the result of the decision-making. And you can Google "blockchain," by the way; there's a lot of really interesting information out there.

So now I want to switch gears a little bit. There are many levels of networking protocol, and what I want to do, before we move on to some of the more interesting distributed file storage systems and peer-to-peer protocols, is very quickly establish some common terminology on networking. Many of you have probably taken 168, so some of this will be things you've heard before, but I just want to make sure we're all on the same page. Networking protocols are abstracted at a number of different levels. There's the physical level, which is the mechanical and electrical network itself, roughly how the zeros and ones are represented. There's the link level, which is how to actually transmit small packets of bits over those physical links. And then, more interesting for us in this class at least, there are the network and transport levels, where we put small packets together into bigger, reliable ones and figure out how to deliver a packet from here to the other side of the world, and to the right application on a particular node.

So, protocols on today's internet: I'm showing you these layers, really four of them illustrated here. The physical and link layer is things like Ethernet or Wi-Fi or LTE; that's one hop's worth of communication. The network layer ties it together with IP to transmit data over multiple hops, and the transport layer starts worrying about things like how to deliver to an application directly, how to do reliability, and so on.

To start, I want to say a tiny bit about the physical and link layer, and to do that we're going to talk about broadcast networks. A broadcast network is a shared communication medium, and although it doesn't have to be wireless (it can be a wired setup, as I'm showing up here in the upper right corner), just think of it as a network where we're sending to everybody who can hear us. The equivalent view of this might be something like a bus, where all of these items, a processor and a bunch of I/O devices and memory, are attached to the same wires, and as a result, when the processor sends out a request, everybody can listen in. So the shared medium could be a set of wires, or it could be the space around a Wi-Fi access point, and so on.
What's perhaps interesting here is that Ethernet, in its original incarnation, was actually used as a broadcast medium: you had a whole bunch of items, here three workstations and, say, a router, all connected to the same cable, and all communication went to everybody all the time within a local subnet. There are many examples of this: cellular phones, CDMA, Wi-Fi, and so on.

What's interesting about a broadcast network is that when I'm sending from, say, node 3 to node 2 over that broadcast medium, every one of the nodes has to listen at least long enough to read the header, to know whether it needs to process the packet or can just ignore it. That header address is typically called a media access control address, a MAC address, and most of the ones you'll encounter now are 48-bit physical addresses. In theory, which is kind of amusing, they're supposed to be unique: every device everywhere in the world is supposed to have a unique 48-bit address, and there's a defined way to interpret the fields in those 48 bits that identifies the manufacturer and which item number it is, and so on. There are also some reserved bits, so not every combination is a normal unique address, and any of you who have played a little with the networking stack on your machine know that in many cases you can just set a software-specified address into the network card and it will use that instead of its own built-in ID. So the idea of these being unique is more aspirational than real, but every card that comes out of the factory should have a unique address. So how do you deliver a packet when you broadcast it? You put a header on the front containing the MAC address; everybody gets the packet and discards it if they're not the target, and typically this is all done in hardware so the software stack doesn't have to deal with it much.

Now, one little interesting tidbit. As you can imagine, if everybody is on a broadcast medium and multiple nodes start talking at once, you get chaos, so how do we deal with that? The way we deal with it is something called CSMA/CD: carrier sense multiple access with collision detection. It's from the early-1980s Ethernet, which was the first practical local area network, and most of the Ethernet protocol has survived for the last thirty or forty years. It uses a wire instead of radio, but it's still a broadcast medium, and the key advance that made it work was this arbitration mechanism, CSMA/CD. It works as follows. Anyone attached to the network puts a carrier on the wire when they talk, so before you talk, you listen; that's the carrier sense part, and if you hear somebody talking, you don't say anything until they're done. That's how you avoid talking over people. However, it's possible that two nodes start talking at exactly the same time, so neither hears the other and they transmit simultaneously; to handle that, while they're talking they also listen to the medium, to notice when there's a collision. If there's a collision, both nodes stop talking, back off, and retry later. The backoff scheme is all about choosing how long to wait before trying again, and how do you determine that? Well, if everybody always waits the same amount of time, you just collide over and over again, so instead you back off randomly. It's an adaptive, randomized waiting strategy: you don't want to wait too long, because that destroys your bandwidth, so you want to figure out how long to wait but do so randomly. The first time, you pick a random wait within a small interval, and every time you collide you double your interval, so the average wait keeps increasing by a factor of two with every collision until you eventually get to go.
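Here's a sketch of that randomized exponential backoff; the slot-time granularity and the cap on doubling are illustrative choices rather than a claim about the exact Ethernet constants.

```python
import random

# Randomized exponential backoff: after the k-th collision, wait a random
# number of slot times in [0, 2**k - 1]. The cap is illustrative.

def backoff_slots(collisions, cap=10):
    k = min(collisions, cap)              # stop doubling after `cap` collisions
    return random.randint(0, 2 ** k - 1)  # slots to stay quiet before retrying

for attempt in range(1, 6):
    print(f"after collision {attempt}: wait {backoff_slots(attempt)} slot times")
```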
What's nice about this CSMA/CD protocol is that it automatically figures out, probabilistically, how far to back off, so two people trying to talk versus four people trying to talk produce different backoff behavior. This works remarkably well, and it's still in most of the Ethernet stacks you'll run into. As for the question about how the sender checks for collisions: you notice that your bits are being trampled on by somebody else, for instance because a checksum is failing.

All right, a little more about the MAC address. It's a unique physical address for the interface; if you take 168 you'll see that it lives at the physical and data link layers. I just wanted to mention, for those of you who want to take a look at your phones, that the Wi-Fi hardware of your phone has this 48-bit address; notice it's written as six pairs of hex digits, which is 48 bits. Also, if you do something like ipconfig on a Windows box, you'll see that the wireless LAN adapter has a 48-bit MAC address and your Ethernet adapter has one as well. Every physical network interface in your system has a MAC address.

So you might ask: why have a shared bus at all? Why not simplify and only have point-to-point links? The answer is that originally it wasn't cost-effective; it was much easier to drop a single cable that snaked around a whole floor. In fact, where I was a graduate student, we had one of these up in the ceiling, with drops coming down to every one of our machines, so there was one shared medium for every machine on the floor. However, you can imagine that has bandwidth issues, because you'd really like point-to-point networks where only the communicating parties are using the wire. So how do we do that? We get a switch. With a switch, we have a piece of hardware with point-to-point connections; it's a bridge, essentially, that transforms the shared broadcast medium into a point-to-point network, and you can buy Ethernet switches pretty much anywhere, for example at Fry's. What happens is that even though you can, in principle, broadcast or multicast to everybody connected to the same switch, if you're doing point-to-point communication the switch adaptively learns where all the MAC addresses are, and when you send a message addressed to a particular MAC address, the switch just routes it internally to the right port. Now I can have as many pairs communicating at once as my switch has bandwidth for.
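Here's a toy sketch of that "adaptively learning" behavior: the switch remembers which port each source MAC address was last seen on, and floods only when it doesn't know the destination. The class name and port numbers are invented for illustration.

```python
# A toy learning switch: learn source MAC -> port, flood unknown destinations.

class LearningSwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}                      # MAC address -> port

    def handle_frame(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port        # learn where the sender lives
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]     # point-to-point: one output port
        # Unknown destination: fall back to sending on every other port.
        return [p for p in range(self.num_ports) if p != in_port]

sw = LearningSwitch(4)
print(sw.handle_frame("aa:aa", "bb:bb", in_port=0))   # floods: [1, 2, 3]
print(sw.handle_frame("bb:bb", "aa:aa", in_port=2))   # learned: [0]
```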
Now, a little different from that is a router. What a router does is transfer packets from one switch domain to another, and when you go across a routing domain, packets are not routed by MAC address; something else has to happen, and this is the point at which IP comes into play. All right, we're going to take our brief break here, and we'll be back in just a second.

Okay, so IP, as you're all aware, is the protocol that has really taken off. It wasn't originally the only protocol for routing across physical domains, but by now it has pretty much taken over, and it's a way of getting packets from some source to some destination, no matter how far away. This is the internet's network layer; if you take 168 you'll see it as the layer-three protocol. The service it provides is best-effort. What does that mean? It means that when I send packets from source to destination, they can get lost, they can get corrupted, they can get duplicated, or they can arrive out of order. So you might say it doesn't guarantee much, but surprisingly it guarantees enough to build the very interesting applications that we all know and love. The IP packet itself is called a datagram, and IP is a datagram service which can route from source to destination across many hops, across the planet; it's remarkable that it works as well as it does.

So there are IPv4 and IPv6 addresses. The IPv4 address space, which is still much more common, uses an address that's a 32-bit integer; notice that's different from our 48-bit MAC address. It identifies the destination of the IP packet, and it's often written as four dot-separated integers. Here's an example: at one point the file server for CS was 169.229.60.83, and sometimes you see it written as hex digits, like 0xA9E53C53. A host on the internet is a computer connected directly to the internet, and a host has one or more IP addresses used for routing. Some of those may be private and unavailable for routing, so not every 32-bit address can go everywhere; I'll say more about that in a moment. But groups of machines may actually share a single IP address, and in that case we get what's called network address translation. I'm sure many of you who have networks in your house have a router to a service provider like Comcast or whoever, and that router connects to all of your devices, wired or over Wi-Fi. What happens in that case is that the world sees your house as a single IP address, but inside you have local addresses. Network address translation rewrites a query from your laptop, going through the gateway to the world, from the local address your laptop has to the public address, and it does so in a way that keeps each connection unique, which is what allows it to work. What's interesting about this is that the number of network-connected devices in the world is tremendously larger than what you'd get by counting all of the 32-bit addresses reachable on the public internet; network address translation gives us that capability.
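Here's a toy sketch of the translation table a home gateway keeps, in the spirit of what I just described. The addresses and the port-numbering scheme are made up for illustration, and a real NAT also tracks protocols, timeouts, and more.

```python
import itertools

# A toy NAT: rewrite each outbound (private address, port) pair to the single
# public address plus a fresh public port, and remember the mapping so
# replies can be delivered back to the right internal host.

class NATGateway:
    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = itertools.count(50000)     # arbitrary starting port
        self.forward = {}                           # (priv_ip, priv_port) -> pub_port
        self.reverse = {}                           # pub_port -> (priv_ip, priv_port)

    def outbound(self, priv_ip, priv_port):
        key = (priv_ip, priv_port)
        if key not in self.forward:
            pub_port = next(self.next_port)
            self.forward[key] = pub_port
            self.reverse[pub_port] = key
        return self.public_ip, self.forward[key]    # what the outside world sees

    def inbound(self, pub_port):
        return self.reverse[pub_port]               # where to deliver the reply

gw = NATGateway("73.12.5.9")                        # made-up public address
print(gw.outbound("192.168.1.20", 4321))            # ('73.12.5.9', 50000)
print(gw.inbound(50000))                            # ('192.168.1.20', 4321)
```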
Now, within this, a subnet is a network connecting hosts with related IP addresses, typically either the broadcast domain or the switch domain I mentioned earlier. A subnet is identified by a 32-bit value, with the bits that differ set to zero. An example might be 128.32.131.0/24, or written another way, 128.32.131.xx: this says that every host address that matches in the first 24 bits, regardless of the last eight, is considered to be on the same subnet. Typically that's a set of machines that are all connected together, either through a common switch or on the same physical network. Often there's a mask used to identify the subnet: in this address, 128.32.131.x, the first 24 bits represent a unique subnet and the last eight bits represent the host, and the mask, 255.255.255.0, is 24 ones followed by 8 zeros. When you AND it with a 128.32.131.whatever address, you keep only the 24 bits that represent the subnet (that operation is sketched in code below). Often, routing within the subnet is done by MAC addresses, by the switches, not necessarily by IP addresses; in fact, on a lot of the ports in, for instance, Soda Hall, there's a lot of MAC-address routing going on within the subnets to make things fast.

Okay, so address ranges in IP. I won't go into this in great detail, but back when IP first became popular, there were what were called Class A networks, where only the first octet is fixed: 10.x.x.x, or 6.x.x.x, or 127.x.x.x are Class A addresses. MIT, for instance, is 18.x.x.x, which means all 2^24 of those host addresses are owned by that one organization.
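Here's that mask-and-AND operation written out with Python's standard ipaddress module, using the 128.32.131.0/24 example from above.

```python
import ipaddress

# 128.32.131.0/24: the first 24 bits name the subnet, the last 8 name the host.

host = ipaddress.ip_address("128.32.131.57")
net = ipaddress.ip_network("128.32.131.0/24")

mask = int(net.netmask)                    # 255.255.255.0 -> 24 ones, 8 zeros
subnet_bits = int(host) & mask             # AND the mask onto the host address
print(ipaddress.ip_address(subnet_bits))   # 128.32.131.0, the subnet part
print(host in net)                         # True: same first 24 bits
```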
Now, there's a question about something I haven't covered yet: what's the difference between a MAC address and an IP address? We'll say more in a moment, but the MAC address is 48 bits and only routes within a switch domain, while IP routes across switch domains, through routers. Think of a MAC address as a physical address attached to an actual network card, and an IP address as a virtual host address used for routing at the larger scale.

So Class A is basically a /8 network, where the first eight bits are fixed; Class B fixes the first 16 bits, and Class C the first 24 bits. Some of these are what are called private networks: for instance, 10.x.x.x is a private Class A range, so if you use the VPN at Berkeley and log in, you'll find that your computer gets a 10.x.x.x address associated with that VPN. And if you buy a router at Fry's or Best Buy and put it on your home network, you'll see that 192.168.x.x is a very commonly used private Class C range. Oh, and by the way, someone asked how MAC addresses are different from Ethernet addresses; I guess I didn't quite answer that question from the group chat. The Ethernet addresses are the MAC addresses.

Okay, so address ranges are often owned by organizations and can be further subdivided into subnets. For instance, MIT is one of the few institutions that actually has a Class A address, and they certainly don't have 2^24 hosts all tied to a single physical domain; instead it's divided into a bunch of subnets, which are the physical domains, but the Class A address range is something MIT has full control over.

All right, now the IPv4 format, as you've seen if you've taken 168, is a set of header bytes that go in front of the data. It's a well-defined format, and I'm not going to go through it in detail, but in the packet header you can see, for instance, a 4 for IPv4, the total length of the packet, some flags, a checksum, and then the source and destination IP addresses; the source address is where it comes from and the destination address is where it's going. This is a basic IP datagram, and it's sent unreliably from one host to another. Notice there are no ports; we'll get to ports in a second, but IP by itself can only go from machine to machine, not from application to application.

So now, what's a wide area network? It's a network with many of these physical domains. The internet is a wide area network: it connects multiple physical data-link layers with routers, and you can see those routers here. What goes on inside a subnet is up to the owner of those devices, so even though I show host A entering this domain and then going through R2 and R4 to the destination, there may be other hops inside, handled by the owner of that domain, either via MAC addresses or something else. The data-link-layer networks are connected by routers, as I mentioned. We'll say more about a router here: a router forwards each packet received on an incoming link to an outgoing link.
Here the router is circled, and what this says is that if a packet needs to go from point A to point B and it passes through a router, we have to make sure that when the packet arrives, the router knows what the next hop is. The router is a highly optimized hardware and software device that takes packets coming in off the network on, say, a 10-gigabit or 100-gigabit link, and, often at line rate if it's owned by Comcast or some other service provider, pulls the packet in, looks at the header, figures out which port is next, and sends it on its way, keeping things moving at one gigabit or 10 gigabits or 100 gigabits, whatever the link speed is. So here's an example of packet forwarding: host A is talking to host B, and on receiving a packet the router figures out how to forward it, that is, which way gets it closest to the destination; if it doesn't know anything specific about the destination, it sends it to a default route which hopefully has more information (I'll sketch this lookup in code below). So here's the packet hopping along; everybody catch that?

All right, so what about IP addresses versus MAC addresses? Why not route everything by these 48-bit MAC addresses? The answer is that it doesn't scale well. The analogy is that a MAC address is like a unique social security number that everybody has, and an IP address is like your current home address. The nice thing about your home address is that it's hierarchical: it's in some state, in some city in that state, in some part of that city, and so on, so you can do hierarchical routing to home addresses. Hierarchical routing to social security numbers isn't doable, because each social security number is assigned uniquely to a person and has nothing to do with locality; it's based on the person. So a MAC address is like a social security number: it's uniquely associated with the device for the device's entire lifetime, whereas your IP address changes depending on where you are. When you're in Soda Hall, your laptop gets one IP address; in the dorms it gets a different one; back home it gets a different one again, and that's because, by and large (not exclusively, but by and large), IP addresses relate to physical locations. So if we move, our address moves to something new, and that's what makes routing easier.

So why does packet forwarding use IP, and why does it scale better? We just said it, but specifically, IP addresses are aggregated and hierarchical. All IP addresses at UC Berkeley might start with 0xA9E5; in reality, 128.32.x.x and 169.229.x.x are the two ranges that represent UC Berkeley addresses, and it's that aggregation that helps the routing. All right, I think I've said enough on that, and yes, somebody has noticed that the first few digits of a social security number were originally based roughly on where you were born, but people move around a lot, and social security numbers are, I think, in the process of being recycled now, so there's no actual locality left in them as a result.
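Here's a sketch of the forwarding decision a router makes: find the most specific (longest) matching prefix in its table, falling back to a default route. The table entries and port names here are made up for illustration.

```python
import ipaddress

# Longest-prefix-match forwarding over a made-up routing table.

ROUTES = [
    (ipaddress.ip_network("128.32.0.0/16"),   "port-to-campus"),
    (ipaddress.ip_network("128.32.131.0/24"), "port-to-local-subnet"),
    (ipaddress.ip_network("0.0.0.0/0"),       "default-route"),
]

def next_hop(dst):
    dst = ipaddress.ip_address(dst)
    matches = [(net, port) for net, port in ROUTES if dst in net]
    # The longest (most specific) matching prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("128.32.131.81"))   # port-to-local-subnet
print(next_hop("18.9.22.169"))     # default-route
```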
Okay, so how do you set up the routing tables? Well, the internet has no centralized state: no single machine knows the entire topology of the internet. In fact, it's fascinating to read about the topology of the internet, because the internet is a whole series of loosely collaborating administrative domains with agreements between each other, and there are exchange points where certain groups of IP addresses route quickly while others route more slowly, all based on those agreements. So no single machine knows the topology, and the topology is always changing; there are faults and reconfigurations and so on, so you really need a dynamic algorithm that somehow builds the routing tables so we can even figure out how to get a packet onto its next hop. There are many possible algorithms you could imagine; the one that's common now is called BGP. Roughly, the routing table has a cost for each entry reflecting how many hops it takes to reach a given destination address, there's some optimization for hop count, and neighbors periodically exchange routing tables to try to improve their costs. The problem is that an algorithm that optimally minimizes hop count scales like n squared, so that's not generally what's done on the internet; instead, whole groups of addresses are handled at different scales, and so on. So your path from point A to point B on the internet is certainly not optimal, and in fact sometimes there are loops: packets on their way to their destination can get routed in a loop and just keep going around, and they would loop forever if it weren't for the time-to-live field, which keeps getting decremented until they eventually time out and are dropped. There have also been some pretty interesting disasters: back in, I think, the early 2000s, there was a single tunnel carrying fiber that effectively partitioned the internet, with one side of the internet on one side of that fiber and the other side on the other, and a truck fire in the tunnel actually took out the ability to communicate across the internet until it was fixed. There's a lot more redundancy now, but it's a pretty chaotic process; it's fascinating, and a good reason to take 168, where I'm sure they talk about internet routing.

The other thing I want to talk about here is that these IP addresses, whether IPv4 (32-bit) or IPv6 (128-bit), are really not something you can remember easily. Since we've got humans in the picture, we need to go from a name to an IP address somehow. How do we do this? We want to map something like www.berkeley.edu to 128.32.139.48, or google.com to, believe it or not, the closest Google facility that can serve you. These human-readable names need to be turned into IP addresses so that the underlying system can route to them, and to do that you need a system that maps human names to IP addresses. This is necessary because humans have trouble remembering IP addresses, unless they're particularly attached to them, and because IP addresses change: if one server crashes and an alternative comes up, you'd like the name the humans are using to automatically switch over to the new one, so they don't even have to know there was a failure. The mechanism for this, as many of you know, is called the Domain Name System, DNS.
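As a preview of the resolution process I'm about to walk through, here's a toy, cached name lookup. A real resolver walks a hierarchy of name servers (root, then edu, then berkeley.edu, and so on); here that whole walk is collapsed into one made-up dictionary, and the MIT address shown is invented.

```python
import time

# A toy caching resolver: answer from the cache while the entry is fresh,
# otherwise "walk the hierarchy" (collapsed here into a flat lookup).

AUTHORITATIVE = {
    "www.berkeley.edu": "128.32.139.48",   # from the lecture example
    "www.mit.edu": "18.9.22.69",           # made-up illustrative address
}

class CachingResolver:
    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self.cache = {}                         # name -> (address, expiry time)

    def resolve(self, name):
        entry = self.cache.get(name)
        if entry and entry[1] > time.time():
            return entry[0]                     # fresh cached answer
        addr = AUTHORITATIVE[name]              # stand-in for the hierarchical walk
        self.cache[name] = (addr, time.time() + self.ttl)
        return addr

r = CachingResolver()
print(r.resolve("www.berkeley.edu"))   # goes "up the hierarchy" the first time
print(r.resolve("www.berkeley.edu"))   # served from the cache until the TTL expires
```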
The Domain Name System is hierarchical, okay? There's a top level of the hierarchy that's managed by a centralized organization, and then for something like berkeley.edu the next level down is edu, and then there's berkeley.edu, and then there's eecs.berkeley.edu, etc. All right, so it's hierarchical, and organizations own parts of the hierarchy, okay? This top-level organization is just a really big organization that's global, okay? So it's a hierarchical mechanism for naming: names are divided into domains right to left, as I mentioned, so you start with edu, then berkeley.edu, then eecs.berkeley.edu. And resolution is a series of queries. So when you're somewhere and you want to get to, you know, mit.edu — here I'm attached at Berkeley — I might first see whether my local cache has an address, and if it doesn't, then you work your way up the hierarchy to get to the edu domain, which will then tell you where mit.edu is and send that back to you for a full resolution, or it may tell you how to get to mit.edu, which then has the server you're really interested in, okay? And there's caching, because this is expensive, as you can imagine. What's interesting is that the caching is loosely consistent, and it takes some time for cache entries to time out, so if you make a query and it gives you one answer and then something changes, you don't always get a very quick change to the answer, okay? Which is one of the reasons that DNS is not great if you've got items that are moving rapidly and changing their IP addresses with some frequency — you need something different for that, and perhaps we'll talk a little bit about it in some of our remaining lectures.

But how important is this correct resolution? Notice that when I'm trying to get to a particular server like www, I need to know that it's 169.229.131.81, and I want to do that in a way that a malicious person can't get in there and give me the wrong answer, because that could at minimum deny me service, and it could also potentially be a security hole if I'm not careful and don't notice that this server is not the one I thought I was talking to, okay? So how important is correct resolution? Very. All right: get somebody to route to a server thinking they're routing to a different server, and get them to log in to their bank and give up their username and password. Now of course one of the ways that banks prevent this is by having certificates, but certificates can also be faked under some circumstances, and so incorrect DNS resolution, complete with a breached top-level certificate, can lead you to route to something and give up your username and password if the wrong sort of circumstances line up. So you might ask, is DNS secure? It's definitely a weak link in this whole process, because you think you're talking to one thing and you're actually talking to something else, and the answer is that DNS has not always been secure. What was interesting is that in July 2008 a hole in DNS was located — the security researcher who discovered it quickly informed a bunch of authorities before it was published at a conference — and it was a very high-profile problem. Basically, because DNS wasn't properly authenticated, it was possible for one node to send out a query, to a top-level DNS server for instance, and for somebody else to quickly come in and give a different answer, and it wasn't noticed that the party answering wasn't the one we were asking the question of, and you could actually pollute the DNS cache of a whole ISP in one swell foop, so to speak — one fell swoop, sorry, joking.
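To see why that loose consistency matters — and why polluting the cache of a big shared resolver was so damaging — here's a toy sketch of the kind of TTL-based cache a resolver keeps; the record and TTL below are invented for illustration, and the key point is that whatever is sitting in an entry, stale or poisoned, keeps getting served until it expires.

```python
# Toy loosely-consistent cache: answers are served until their TTL expires,
# even if the authoritative answer has changed (or was never legitimate).
import time

class TTLCache:
    def __init__(self):
        self._entries = {}  # name -> (address, expiry time)

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, time.monotonic() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None                # miss: would trigger a real resolution
        address, expiry = entry
        if time.monotonic() > expiry:
            del self._entries[name]    # expired: forget it, re-resolve next time
            return None
        return address                 # possibly stale, but served anyway

cache = TTLCache()
cache.put("www.example.com", "93.184.216.34", ttl_seconds=300)
print(cache.get("www.example.com"))    # same answer for the next five minutes
```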
As a result, this was a pretty serious bug. All right, so DNS is definitely a weak link, and it's had many upgrades over the years.

So now, moving on: we need layering in our network, which is building complex services from simpler ones. The physical link layer is pretty limited, okay? Basically, if you look at what can go on an Ethernet link or a Wi-Fi link, there's a maximum transmission unit (MTU) size that's often in the range of 200 to 1500 bytes, and across slow links the MTU can get small, okay? So packets actually have to be fragmented into small pieces to get over long distances, and they have to be reassembled, or something, in order to let us do something large. So our goal in the following few slides — and we're going to pick this up next time as well — is basically going from the physical reality of the networks to the abstraction we really want, so we're going to go from packets, which are limited, to messages, which are potentially unlimited, okay? And this is kind of like our virtual machine abstraction that we talked about at the very beginning of the last lecture — or the very beginning of the class, I mean. So: packets are of limited size, but we would like arbitrary-size communication; packets are not always ordered — they can be reordered — but we'd like ordered messages; packets may be unreliable and get lost, but we'd like reliable ones; packets' basic communication is machine to machine, but we would like it to be process to process instead; packets might only be on a local area network, using the MAC addresses, whereas what we'd like is to route them anywhere; they might be asynchronous, because they're just sort of sent when the hardware is ready, where perhaps we want something synchronous that we can do some synchronizing on; and packets might be insecure, while we want secure ones. So this is basically an abstraction process of giving us a better communication mechanism than what the hardware gives us, okay? That's a theme we've had throughout the term.

So process-to-process communication is a good one to start with. You know, machines have an IP address, and so that's machine-to-machine communication; what we really want is routing from process to process — a process on machine A to a process on machine B — and the way we do that, as we've talked about earlier, is by adding something in addition to the IP address: we're going to add ports. So basically a communication channel, which we have mentioned, is actually a five-tuple: a source address and source port, which tell us what application we're talking to on the source side; a destination address and destination port, which tell us the application on the destination side; and then a protocol, which tells us sort of what transport protocol we're using — and those protocols are things like TCP or UDP, etc. And just to see the simplest example of a protocol, this is IP protocol 17. Remember, the protocol field, if you look back at the header from earlier, is an 8-bit field, so when we fill those 8 bits with the number 17 — that's going to be up in those 20 bytes — then we've got a datagram, and in addition to that we wrap a new header on this which has a source and destination port, which now lets UDP go from an application at one side to an application at the other. There are some additional things, like a length for our UDP data and a checksum and so on, but it's a very, very simple protocol here for UDP: it's an unreliable datagram from application to application.
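Just to make that header concrete, here's a sketch that packs the 8-byte UDP header described above — source port, destination port, length, checksum, each 16 bits — in front of a payload; the port numbers are arbitrary, and the checksum is left at zero, which IPv4 UDP permits to mean "not computed" (a real stack would normally compute it over a pseudo-header plus the data).

```python
# Build a UDP datagram: 8-byte header followed by the payload.
import struct

def build_udp_datagram(src_port, dst_port, payload):
    length = 8 + len(payload)   # header is 8 bytes, then the data
    checksum = 0                # 0 = "no checksum", allowed for UDP over IPv4
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + payload

datagram = build_udp_datagram(5000, 53, b"hello")
print(len(datagram), datagram[:8].hex())  # 13 bytes total, 8-byte header in front
```

An IP header carrying protocol number 17 would then be wrapped around this datagram, which is exactly the layering story coming up next.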
It's often used for very high bandwidth video streams, etc., but you can be very antisocial about your use of UDP: if you send too much, you fill up your network, okay? So it has none of the well-behaved aspects of TCP/IP, which we'll talk about next time.

So just to finish this out, if you can bear with me, I have a couple more slides I want to make sure to get through for today. Process-to-process delivery is technically a layer-four, or transport-layer, thing, okay? And if you look, we start out with our data and we start wrapping headers: our data gets a transport header — like the UDP header — which adds a port to it, and then we wrap a network header, which gives us the IP address of our destination, and then we wrap a MAC address on top of that, which might be our Ethernet address, for instance, okay? So this goes through several different layers in the operating system down to the physical layer, where the data is actually transmitted, and then it comes back up at the other side and we start unwrapping: the data link layer makes sure it only gets to a node that has the same MAC address as our desired destination; it comes in, we strip the frame header — the data link layer header — off, and we bring it up to the networking layer. That networking layer is going to check, and this, for instance, could be a router, in which case we see, oh, this isn't the right IP address, and we're going to forward it back down to a different data link layer and out a port; but if it turns out this is the IP address of the local node, then we'll forward it up to the transport layer, which will grab the port and further demultiplex it by forwarding it up to an application. So this idea of wrapping headers and unwrapping headers is a common theme in all of the layering that you're going to run into.

There are many transport protocols. We just talked about UDP, which is considered best-effort IP, and it's protocol 17. Protocol 6 is a pretty common one that you're well familiar with, called TCP, which offers a bunch more semantics than UDP: it lets us set up and tear down connections, it discards corrupted packets, it retransmits lost packets, and it gives us flow control and congestion control — which really means that if we use TCP across the planet, for instance, the flow control and congestion control will actually make us good citizens, and we won't use more than our fair share of the network links, okay? So that's a nice property of TCP. There are actually a bunch of other examples that are often not heard of, things like DCCP, the Datagram Congestion Control Protocol; RDP, a reliable datagram protocol; and SCTP, the Stream Control Transmission Protocol — these are all transport protocols that you may not have heard of. What's interesting about SCTP, for instance, is that it's like TCP but has a bunch of different streams that can be simultaneously connected. The transport protocols do not provide a bunch of services, okay — that's up to applications — and so when we get into the things we can build with, for instance, UDP and TCP, like distributed storage or peer-to-peer storage, we'll be able to do things like provide bandwidth guarantees or survive a change of IP addresses and so on, okay? So the problem we're going to solve next time is the reliable message delivery problem, which is basically: how do we get reliable delivery out of unreliable packets? All right, and we'll pick that up next time.
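Before the wrap-up, here's a minimal, runnable sketch of that process-to-process idea using UDP and the standard sockets API: one socket binds a port so the OS knows which application gets datagrams for that (address, port) pair, and another socket just sends. The loopback address and the port number are arbitrary choices for the example, and notice that nothing here sets up connections, acknowledges, retransmits, or congestion-controls anything — that's the TCP story for next time.

```python
# Minimal UDP send/receive on one machine, to show ports doing the
# process-to-process demultiplexing.
import socket
import threading

ADDR = ("127.0.0.1", 9999)   # hypothetical (address, port) for the example

# "Server" side: bind the port so datagrams addressed to ADDR reach us.
recv_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # SOCK_DGRAM = UDP
recv_sock.bind(ADDR)

def receiver():
    data, (src_ip, src_port) = recv_sock.recvfrom(1500)  # roughly one Ethernet MTU
    print(f"received {data!r} from {src_ip}:{src_port}")

t = threading.Thread(target=receiver)
t.start()

# "Client" side: no connection setup, no acks -- if this datagram were lost
# or reordered on a real network, UDP itself would do nothing about it.
send_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_sock.sendto(b"hello over UDP", ADDR)

t.join()
recv_sock.close()
send_sock.close()
```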
So just to finish up for today: in conclusion, we talked about two-phase commit as a distributed decision-making protocol — first you make sure that everybody guarantees that they will commit if asked, or that they won't, and then everybody is asked to commit, and through these two phases we're able to get either-everybody-commits-or-everybody-aborts semantics, as long as we allow nodes to reboot in the process. We also talked about the Byzantine generals problem, which is distributed decision-making with malicious failures: one general and N minus one lieutenants, where some number F of them may be malicious, and here malicious pretty much means they can do anything they want — and that maliciousness can include looking correct whenever they're probed while still behaving incorrectly — and what we saw is that it's only solvable as long as the number of nodes is greater than or equal to 3F plus 1. We also talked about how blockchain protocols can be used for distributed decision-making as well. Then we started talking about IP, which is datagram packet delivery used to route messages through routers across the globe, with 32-bit addresses for IPv4 and 16-bit ports. We talked about DNS, which is the system for mapping from names to IP addresses; flaws that have been discovered are problematic, and they've been continuously fixed as they show up. And we started talking about how to get good semantics — next time we're going to talk about ordering and reliability. Okay, so we'll finish at this point. I hope you all have a good Wednesday, and we'll see you on Thursday.