15 October 2019
CHAIR: Hi, good afternoon, I have just one question, is Thomas Weible from Flexoptix here?
We're about to start, please all the candidates to the RIPE Programme Committee come here because you will have the opportunity to present yourselves.
SPEAKER: Hello, I am Pavel Lunin from Scaleway ‑‑
CHAIR: I was just going to say that we are having to sit for re‑elections at the RIPE PC, and you are able to vote for them from today till Thursday, and we will announce the results on Friday. We have five candidates so far ‑‑ six candidates, oh, we have a new one ‑‑ we have six candidates so far. If any of the candidates is here and not here, you can come to the stage and present yourselves. And we're beginning now with candidate presentations.
SPEAKER: Hello, Pavel Lunin from Scaleway, I have been with the RIPE NCC since two years, as ENOG representative. And today my term is over, and I am stepping down as ENOG representative because I don't live in Russia any more and I don't feel myself reliable to represent the ENOG community here. But I am running independently to collaborate with the RIPE NCC to make the agenda more interesting and make it more, let's say, technical to get back to the roots and just speak more about DHCP, RPKI and IPv6 and stuff and less about GDPR and blah‑blah and stuff. This is my programme, and if you are interested, please vote for me. Thank you.
SPEAKER: Dmitry Kohmanyuk from Ukraine. My main employer is ccTLD of Ukraine, Hostmaster, but here I'm independent. I already serve a three‑year term. I'd like to stay and improve the meetings for the 21st century. I think we should have more diversity in presenters, in topics and also kind of get out of our RIPE bubble, not to say that we are already not out of it. So, with that being said, I would just like to thank Pavel for his application and I'll pass the microphone to Fernando. Thank you.
SPEAKER: Fernando Garcia, from Spain. I work for the society of the incumbent telecom in Spain. I have been attending RIPE since RIPE 30, that's a long time ago, but I have never been part of the ‑‑ any group, not anything. And I think this is an opportunity to contribute, given perhaps an old man with an ambition to this group and to try to reach meetings that are useful for everybody. That's it.
CHAIR: Okay. Thank you. Thanks to the three candidates, and remember, we have six candidates, and you can check the biographies at the website, at the RIPE 79 website, and vote for them, please.
And with that said, we are starting the last session in the afternoon. Our first speaker is Doug Madory from Oracle. He is going to speak about how BGP AS path prepending is a self‑inflicted vulnerability. So...
DOUG MADORY: Hello everybody. My name is Doug, from Oracle by way of Dyn and Renesys [acquisitions]. This is a bit of BGP analysis that I did with my colleague, Matt Prosser, who is in the audience, about BGP prepending that I thought would be pretty interesting. The original title was why route servers should be dropping BGP prepended routes, but I had to change that after Erik's talk this morning. That was a joke.
So, to get on with it here, for those who don't know what I'm talking about, just to be clear, what we're talking about is a kind of traffic engineering technique to deprioritise a route by artificially increasing the AS path length. Typically, this is done by repeating an AS, typically your own or maybe an upstream AS. Like in this example here, we have 4192 repeated in the AS path. And with the BGP selection criteria, assuming all other criteria are equal, the shortest path is preferred; a longer path would be less preferred and therefore deprioritised.
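The mechanics described here can be sketched in a few lines of Python. This is only an illustration, assuming every other selection criterion is tied so that AS path length alone decides. AS 4192 comes from the example in the talk; 64500 and 64501 are documentation ASNs standing in for hypothetical upstreams, and the function names are invented for this sketch.

```python
def prepend(as_path, asn, times):
    """Return a new AS path with `asn` repeated `times` extra times at the front."""
    return [asn] * times + list(as_path)


def best_by_path_length(candidates):
    """Pick the shortest AS path, assuming all other criteria are equal."""
    return min(candidates, key=len)


# Announcement via a first (hypothetical) upstream, no prepending.
plain = [64500, 4192]

# Announcement via a second (hypothetical) upstream, origin prepended 3 extra times.
prepended = [64501] + prepend([4192], 4192, 3)

# The longer, prepended path loses the tie-break.
print(best_by_path_length([prepended, plain]))
```

The point of the sketch is that prepending only shapes traffic when a shorter alternative exists; if every announcement carries the same prepends, the comparison no longer favours any of the origin's own paths.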
So, BGP prepending can cause some problems. Rarely is it the direct source of an issue. But here is a notable exception to that. It's a little more than ten years ago, and some of the older folks in the audience will probably remember this case, where an AS started to announce a very, very long AS path that ended up causing a bunch of routers to crash around the Internet. What was happening in this case was due to a command line configuration difference between MikroTik routers and Cisco. If I recall correctly, on Cisco, the way you would configure it, you would put the AS you wanted prepended to the left of your AS, and on MikroTik, they figured they would save you some keystrokes and you would type in the number of times you wanted it prepended to the left. So if, like this one unfortunate AS, you didn't realise the subtle difference in the command line and you typed 47,868, you were telling it to prepend that many times. The value was stored as an 8‑bit integer, so this flipped over many times until it landed on 252 prepends, and then, as that propagated through the Internet and more ASes were added to the AS path, eventually it crossed the length of 255, triggering a bug in Cisco IOS that was unknown to that date and causing machines to crash.
That was one of those rare examples where prepending was the cause, but it's also a kind of sub‑plot in a lot of routing incidents that's worthy of consideration. I have written a lot of autopsies of routing events over the last decade or so, and when we look at the routes most affected by propagation, those that are most affected are often those that are heavily prepended. So, in this case, this was the China Telecom leak of 2010. I kind of thought this was put to bed, the hijack of 15% of the Internet. I think everybody in this room can rattle off a number of reasons why that's either not possible or didn't happen. That's still in circulation though, I still see this come up, and maybe it's worth trying to get them to put that to bed.
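Going back to the MikroTik incident for a moment, the 8‑bit wraparound is simple modular arithmetic. The sketch below assumes the repeat count was simply truncated to its low 8 bits, and uses 47868 as an assumed input because it reduces to the 252 prepends quoted above. It also treats those 252 entries as the whole path at the origin, which is a simplification. The function names are invented for illustration.

```python
def wrapped_count(requested: int, bits: int = 8) -> int:
    """Value left after storing `requested` in an unsigned field of `bits` bits."""
    return requested % (1 << bits)


def hops_until_longer_than(limit: int, start_len: int) -> int:
    """How many more ASes must be added before the path is longer than `limit`."""
    return max(0, limit - start_len + 1)


count = wrapped_count(47868)               # 47868 = 186 * 256 + 252
extra = hops_until_longer_than(255, count)
print(count, extra)
```

Under these assumptions the path only needs to pick up a handful of transit ASes before it crosses 255, which is consistent with the crash showing up some distance away from the origin.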
This graphic on the left is to take a look at that incident in 2010. This is a way that we have started looking at routing, especially routing leaks. It's trying to take into consideration propagation. So in this case, we have got three dimensions in this graphic. Time is going, you know, on the bottom right. The amount of prefixes, so there are about 50,000 involved in this case, is the other axis along the bottom, and the top is peer percentage. So basically that's the percentage of our BGP sources that saw each prefix, and as you sort this, you end up having, and they all look along these lines, a small set of prefixes that were propagated widely and then a rapid decay down to a long tail, and so a lot of these numbers end up maybe really overstating the impact.
But I bring this one up because this is where I first learned of this prepending issue, because I was looking at that in 2010 and just flipping through the data, and if you looked at the top ten or even the top five most propagated prefixes that came out of that leak, the vast majority were Chinese. But in the top five, two of them were US, by our measurement. And so is there a grand conspiracy theory here, why are these routes getting propagated? It turned out that this was a small ISP in Charlottesville, Virginia in the US, which was prepending. They had only one upstream and they were prepending their own AS another five times, so it would be six times, through just one upstream, and for two of their routes. And so, if anybody does an origin leak or any other kind of leak, you run the risk of other ASes then choosing the leaker as opposed to the legitimate route, and that's what happened with these guys. We brainstormed on how we would term this kind of scenario and settled on prepending to all. When I say excessive, I'm not referring to an AS path length of like 252 ASes in an AS path; I'm referring to prepending to all. So you are prepending to a point where this is how you are announcing it to the entire Internet and therefore not achieving much in the way of traffic engineering.
If we were to go back to, this is the Indosat leak, if you go back and look at our write‑up, again, we were highlighting the prefixes that were most impacted by propagation, and each of those that was most impacted were typically ones that were prepended to all. In the TM Net leak the following year, the same story. If you were to go back and look at it, again prepending is a kind of undertone. If you are prepending you are getting whacked harder than other people due to your own configuration.
So, like I said, we have known about this for a while. These are ‑‑ we kind of coined a phrase of prepended to all, to refer to a situation where you are no longer shaping route propagation and therefore traffic engineering, you are simply incentivising anybody else to be chosen as the origin if another entity were to start announcing your range. What I didn't know, was how much this was, and I asked Matt, my colleague, if he could run a query and try to gather up, I thought it would be like 500 or so cases of this. It ended up being amazing how much there is out there. And so we'll go through each of these routing table v6 and v4 that we had come across.
So, this graph, just to explain it here. The X axis is the v4 global routing table sorted by how much of our peering base, so this is the Renesys data, it's pretty equivalent to RIPE or Route Views, how many of our BGP sources see prepending on the way to this prefix. If we set a threshold, obviously it's a kind of a curve as it decays here, but the low end of that curve is still pretty significant. So if we said the threshold was 95% of our BGP sources seeing prepending, we're at 60,000 routes on any given day in the global routing table. That's 8% of the routing table, one out of 12 routes, and if you flip through the list of who is affected, it's anything of any stripe, there are Internet companies, everything is in there.
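A rough sketch of this kind of measurement follows. It is not the actual query that was run: it assumes the input is a list of (peer, prefix, AS path) observations, and it counts a path as prepended when any AS appears at least twice in a row. The peer names, prefixes and ASNs are documentation values, and the function names are invented.

```python
from collections import defaultdict


def has_prepending(as_path):
    """True if any AS appears at least twice in a row in the path."""
    return any(a == b for a, b in zip(as_path, as_path[1:]))


def prepended_fraction(observations):
    """Map each prefix to the fraction of peers whose path to it shows prepending."""
    seen = defaultdict(set)
    prepended = defaultdict(set)
    for peer, prefix, as_path in observations:
        seen[prefix].add(peer)
        if has_prepending(as_path):
            prepended[prefix].add(peer)
    return {prefix: len(prepended[prefix]) / len(peers) for prefix, peers in seen.items()}


def prepended_to_all(observations, threshold=0.95):
    """Prefixes that at least `threshold` of the peers see as prepended."""
    fractions = prepended_fraction(observations)
    return sorted(p for p, f in fractions.items() if f >= threshold)


observations = [
    ("peer-1", "192.0.2.0/24", [64500, 64496, 64496, 64496]),
    ("peer-2", "192.0.2.0/24", [64501, 64496, 64496, 64496]),
    ("peer-1", "198.51.100.0/24", [64500, 64497]),
    ("peer-2", "198.51.100.0/24", [64501, 64497, 64497]),
]
print(prepended_to_all(observations))
print(prepended_to_all(observations, threshold=0.5))
```

Lowering the threshold from 0.95 to 0.5 corresponds to moving along the curve on the slide: more prefixes qualify, but fewer of them are prepended towards the whole Internet.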
V6 is similar. Obviously the numbers are smaller. The percentage is also smaller. For 95% of our sources we're seeing about 3,000 routes, 5.6 percent of the routing table. If we drop that down to 50%, because that's maybe where the traffic engineering starts to decay, after you cross the 50% line, then it's up to maybe 6,000 routes.
It's smaller; maybe there's just been less time with v6, or there is less traffic engineering going on in the v6 Internet.
But I think one of the takeaways of this talk is that prepending is frequently being used in a manner such that it renders routes vulnerable to disruption or to misdirection, accidental or otherwise. These were all accidents we went over just a minute ago. But I tried to look for something that maybe was an intentional thing. It's hard to find. Let's just take an example. By the way, this is an example from the blog I wrote a couple of months ago. It's since been fixed. Every example I used has since been fixed, but you could find another example if you wanted to on any given day. Here is a prefix that is prepended about 20 times to a single upstream. If you were a bad guy and you wanted to insert yourself into the path of traffic going to this prefix, you have a lot of room to work with here in the AS path you announce and still get selected. You could maybe preserve the origin, maybe preserve some of the prepending, fabricate the single upstream. And we have reported on a lot of BGP path prepending, whether that's Bitcanal, BackConnect, things out of Ukraine and Russia. People do this if they want to. We know this. And if you were to announce this, it would get in circulation, you would attract some traffic and it would be very hard to detect. I think existing solutions that are in the field right now would struggle trying to alert on this new kind of AS path that would appear. So I think that's maybe a theoretical security risk, but there are plenty of accidental incidents that would also maybe motivate people to rethink some of this prepending.
And then while we're on the subject, I think it's interesting to talk about how the impact of prepending is sometimes maybe not so straightforward. So, for example, here is a prefix that is announced two different ways. It has two upstreams, Cogent and Hurricane Electric. For Hurricane, it's prepended many, many times. However, based on our sources and others, we see that the Hurricane Electric route is chosen at a far higher percentage. What's happening, you have probably figured this out already: the peering base of Hurricane Electric is trumping the AS path length. So this is a decision of ASes choosing to send traffic through a free peering link versus going through a transit path, and it doesn't matter, they could prepend this 100 times and it wouldn't make much difference. And maybe this is how much prepending is necessary to achieve this level of balance, and it would be even worse without it. So maybe there is some rationality to it, but I think we come across stuff where it's not obvious that what's happening matches up with the intent. Because I think this is one of those rare cases in BGP where you can infer intent out of the routing.
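Why the prepending makes so little difference here can be shown with a two-step comparison: in the standard BGP decision process, local preference is evaluated before AS path length. The sketch below assumes a network that gives its peering routes a higher local preference than its transit routes, which is common practice but a local choice. The preference values, the `Route` type and the origin AS 64496 are illustrative assumptions; 6939 and 174 are the ASNs of Hurricane Electric and Cogent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    learned_from: str
    local_pref: int
    as_path: tuple


def best_route(routes):
    """Highest local preference wins; AS path length only breaks ties."""
    return min(routes, key=lambda r: (-r.local_pref, len(r.as_path)))


# Learned over settlement-free peering, origin prepended many times.
via_peering = Route("peering", 200, (6939,) + (64496,) * 10)

# Learned from a transit provider, no prepending at all.
via_transit = Route("transit", 100, (64510, 174, 64496))

print(best_route([via_peering, via_transit]).learned_from)

# With equal local preference, the shorter path would win instead.
equal_pref = Route("peering", 100, via_peering.as_path)
print(best_route([equal_pref, via_transit]).learned_from)
```

Under these assumptions the number of prepends never gets consulted for networks that peer with the prepended upstream, which matches the observation that a hundred prepends would not change much.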
So, looking back over time, is this something that's increasing or getting worse? And I guess it does kind of vaguely look like it's increasing. We went back about 18 months, just taking a day out of each month, and you can see that there is a slow increase. I think what's happening here is you are just kind of accumulating configuration cruft from one day to the next, and it just ekes up over time. I don't foresee this going to a hundred percent, but it does go up. V6 is a little harder; if you squint your eyes it looks like it's vaguely moving up.
So, you know, I feel like we made the case that an inadvertent origin leak could disrupt traffic to these routes that are heavily prepended to all. Accidents happen, so why deliberately put your routes at risk? Why would you bother? So, I was talking to Job Snijders about this, and we teamed up to ask companies to get someone to explain why they are doing what they are doing. We couldn't go to everybody, so we had to kind of just pick some significant players in the Internet, starting with AS 30321, which you probably don't know, because that is the AS of the Burning Man festival, which is a counterculture festival that happens out in the desert in the western part of the United States. That's a photo from it. It just so happened that one of my colleagues knows one of the guys on the ops team for Burning Man. They have their own AS. They route their own prefixes, so good on them. We sent them a message, we said: you have got one upstream, you are prepending to it. What gives? We think there is a vulnerability here. And they went and they looked at it and they fixed it. We didn't get an explanation on that one, but they were very responsive, and good on them. We went to other companies too, Cloudflare, Google, we also brought them cases, they looked at it, said yeah, that's not right, and fixed it. So they are up there with Burning Man as far as responsiveness on this.
That was actually the minority; most companies, I guess, didn't respond at all. And then a couple of them kind of came back and claimed there was some sort of operational issue, and yeah, we didn't get much of an explanation, and most of these are still out there. In fact, I think the numbers have even increased since we ran this analysis last.
So, just accumulating our interactions with people who are doing this, as well as other experts in the space, we came up with a few theories as to why it is that this takes place. Why do we see all this prepending?
Theory one is pure housekeeping. This probably accounts for the vast majority, where there probably was a time in the past when this made sense. There were two upstreams, maybe they don't have any communities to allow traffic engineering, so you are left with just prepending, let's just take that scenario. You have prepended to one of them, not to the other, the other goes away, and now you are left with just the one upstream, and you never went back to remove the prepending when it became unnecessary. I would suspect that probably accounts for a large number of these.
We have also seen a lot of tricky stuff going on in BGP space as far as return path influence. So, if you recall the Level 3 leak that occurred in November 2017, there were about 20,000 or so prefixes that were more specifics that were announced as a part of that leak. In the end, it was Comcast and Bell Canada, both were announcing more specifics into Level 3 to try to guarantee the return path of traffic, and then this blew up when Level 3 leaked those out onto the Internet. So we see a lot of people doing this. The theory is, if you were sending out prepending basically to the Internet, maybe that would force the traffic back on your peering link. That ought to be the priority anyway. But you never can tell.
The other thing, especially for anybody who spends any time grepping through BGP data: you encounter a lot of mistakes, like typos. For example, this one is a bit of an eye test here. You can see that the final two digits in this repeated AS are oscillating between 29, 29, 92, 92, 29, 29, 92, 92, and there is just a bunch of mistakes out there, so I think that would explain probably a portion, as well, of how this prepending shows up.
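One way to hunt for this sort of typo, sketched here as a heuristic only, is to flag neighbouring ASNs in a path that differ but are made of exactly the same digits. It will produce false positives for unrelated networks whose numbers happen to be digit permutations of each other, so a hit is a prompt to look closer, not proof of a mistake. The ASNs below are private-use values picked to mimic the 29/92 pattern, and the function names are invented.

```python
def looks_like_transposition(a: int, b: int) -> bool:
    """True if two different ASNs are written with exactly the same digits."""
    return a != b and sorted(str(a)) == sorted(str(b))


def suspect_typos(as_path):
    """Adjacent pairs in the path that look like a mistyped prepend."""
    return [(a, b) for a, b in zip(as_path, as_path[1:]) if looks_like_transposition(a, b)]


# A hypothetical path where the origin 64529 was sometimes typed as 64592.
path = [64500, 64529, 64529, 64592, 64592, 64529, 64529]
print(suspect_typos(path))
```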
So, in conclusion. Long AS paths are problematic in general, whether the AS path length is due to prepending or not. So if you are in Tajikistan and you get your service through Kazakhstan, you are going to have long AS paths through no fault of your own, just the geography, and in that case, if someone originates your prefix, it's likely going to be shorter to a lot of other ASes.
I would recommend that network operators go back and look at the prepending they are doing and remove it if it's unnecessary. If you have any questions, you can come find me, give me your AS and I'll tell you your most ridiculous routing, happy to do that.
Because it feels like almost everybody has got examples. I am happy to share those. And I think with these numbers, one out of 12 v4 routes and 5.6 percent of the v6 table that's prepended to basically the whole Internet, I think we can safely conclude that this traffic engineering technique has kind of gone awry here and is, you know, overused to the point of creating some vulnerabilities.
So that is the excessive prepending. I'm happy to take any questions.
CHAIR: Thank you. Any question or comments? We do have seven minutes.
AUDIENCE SPEAKER: Hi. My name is Iljitsch van Beijnum. So maybe I missed it, but I don't think I noticed where you observe all these different paths.
DOUG MADORY: Okay. So this is Renesys infrastructure, so we have basically, for v4, about 400 full tables from peering sessions that we collect. RIPE has a similar BGP route collection system, and so does Route Views.
AUDIENCE SPEAKER: So that's big enough for there to be no real bias?
DOUG MADORY: Yeah, I mean, every data source is going to have maybe a little bit of bias, but I think as far as the other ‑‑ like, if you compared us to other, I think on this one, we're all looking at the same Internet, and there is a lot of overlap with the BGP sources, so I think if it's looking, you know, a hundred percent to us, it's probably a hundred percent to the other sources.
AUDIENCE SPEAKER: Okay. So ‑‑ and do you have any thoughts about the number of prepends?
DOUG MADORY: Like length?
AUDIENCE SPEAKER: Yes.
DOUG MADORY: I think, it doesn't take much to ‑‑ I think after, like, one or two, there is not much value. So, there is a lot of hilarious ‑‑ hilariously long AS paths that are in circulation in any given time. It's like getting up to 20 times and stuff ‑‑ yeah ‑‑ excessive in the other dimension.
AUDIENCE SPEAKER: Okay. Thanks.
AUDIENCE SPEAKER: Hello. My name is Alex. So, if AS paths prepending is so bad but I think it's maybe one, one chance for traffic engineering to have a multihomed AS, you don't have another option to make an engineer ‑‑ maybe if you don't have where network marked engineer and very good uplinks who can make, who can give you an opportunity to use in your community so any other smart options.
DOUG MADORY: Okay. Yeah, I don't want to ‑‑ was there ‑‑ sorry, was there a question?
AUDIENCE SPEAKER: It's just a notice.
DOUG MADORY: I don't want to come off as the anti‑prepending. Just not prepending to the whole Internet.
AUDIENCE SPEAKER: Leandro Bertholdo, University of Twente. Two questions. In fact, first of all, you tried to make or tried to measure the effective or really use prepending to manipulate traffic, like to see the return of traffic if it's really changed when you make a prepend or not. How many prepends are effective or like this?
DOUG MADORY: That wasn't the analysis in this project to determine what was ‑‑ yeah, I guess you could look at ‑‑ I think the way would I approach that is look at, you know, if there is prepending now, if there wasn't yesterday, how did the routing change from one state to another? That would be how I would go about answering that question, but that wasn't what I was after today.
AUDIENCE SPEAKER: Second question. Did you analyse the AS path and did you just look for your own AS in using to making prepends but did you look for another autonomous system? Because I can use another autonomous system or I can put whatever ‑‑
DOUG MADORY: That's a good methodology question. So we're just looking for a repeated AS. It doesn't have to be the origin. Somewhere upstream.
AUDIENCE SPEAKER: You just look for your own autonomous system, not if I have injected my provider autonomous system.
DOUG MADORY: You want to know how I did this? I was just looking for did we observe from one of our BGP sources a repeated AS like multiple times consecutively on route to the origin.
AUDIENCE SPEAKER: But just your own?
DOUG MADORY: Not us, no, not ours.
AUDIENCE SPEAKER: Okay. Okay.
DOUG MADORY: Somewhere in the AS path.
AUDIENCE SPEAKER: Okay. Thank you.
RANDY BUSH: Somebody really ought to mention that the best inbound traffic steering for a multihomed customer is to announce two /24s here and a /23 there or whatever.
DOUG MADORY: Okay.
RANDY BUSH: Right? That's the way to steer your inbound. Not prepending, at all.
DOUG MADORY: Sure. More specifics, you are saying.
RANDY BUSH: More specifics. That's the way to steer your traffic. And that will really steer it and it's not crap.
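The reason more specifics steer traffic so firmly is longest-prefix matching: forwarding follows the most specific covering route before AS path length is ever compared. A small sketch using Python's standard `ipaddress` module is below. The prefixes come from the 198.18.0.0/15 benchmarking range and the link names are made up for illustration; this is not anyone's real announcement plan.

```python
import ipaddress


def select_link(destination, announcements):
    """Return the link of the longest matching prefix, or None if nothing covers it."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for prefix, link in announcements:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, link))
    return max(matches)[1] if matches else None


# Two /24s announced on one link, the covering /23 on the other as a fallback.
announcements = [
    ("198.18.0.0/24", "link-a"),
    ("198.18.1.0/24", "link-a"),
    ("198.18.0.0/23", "link-b"),
]
print(select_link("198.18.1.7", announcements))

# If link-a goes away, only the /23 remains and traffic follows it.
print(select_link("198.18.1.7", announcements[2:]))
```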
AUDIENCE SPEAKER: Hi. Do you have a big idea of what the longest non‑prepended path you saw in there is like if I wanted to set my filters to, like, double or triple that or something?
DOUG MADORY: Like naturally occurring AS path? I didn't look at that. I feel like it's going to be less than 20, but I have heard someone tell me that they drop everything longer than 40. I don't know. I'm not making any recommendations along those lines. 16 is Geoff Huston ‑‑ we can bid who has got the longest AS path. Geoff is hijacking my time here. What's a reasonable limit of AS path length?
AUDIENCE SPEAKER: My filter is set to 42 because I set it 15 years ago and it's the answer to everything. And I haven't had a complaint since then.
DOUG MADORY: Maybe I heard of your solution.
AUDIENCE SPEAKER: I have seen reasonable answers anywhere between 254 so our Ciscos don't blow up and anything shorter we don't get complaints about down to 64.
DOUG MADORY: I think 42 is pretty conservative. The Cisco bug was fixed ten years ago.
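The filter being discussed amounts to a length check on the received AS path. The sketch below uses the limit of 42 mentioned from the floor as a default, purely as an example and not as a recommendation, and adds a helper that collapses consecutive repeats, which is one way to measure the naturally occurring path length the question asked about. The function names are invented.

```python
from itertools import groupby


def collapse_prepends(as_path):
    """Drop consecutive repeats so each run of the same AS counts once."""
    return [asn for asn, _run in groupby(as_path)]


def accept(as_path, max_len=42):
    """Accept a route only if its AS path is no longer than `max_len`."""
    return len(as_path) <= max_len


path = [64500, 64501] + [64496] * 50
print(len(path), len(collapse_prepends(path)))
print(accept(path), accept(path, max_len=64))
```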
CHAIR: Thank you.
Next speaker we have Geoff Huston from APNIC, 30 years of BGP. Two quick notes and reminders, do not forget to rate the talks and also vote for the PC candidates. Thank you.
GEOFF HUSTON: Afternoon all, my name is Geoff Huston, I am the chief thing at APNIC, I do researchy, sciencey stuff. I wanted to speak about BGP at a more generic level than Doug. And in looking around at the audience, as far as I can see there are a few protocols that are older than half of you here and BGP is one of them. Because, in actual fact, it was 30 years ago in June that the first standard specification of BGP came out. Kirk Lougheed and Yakov Rekhter did it as RFC 1105. It really wasn't that much of a change from the predecessor. Basically, it was eBGP ‑‑ sorry, EGP. Part of the issue at the time was actually the introduction of the National Science Foundation's NSFNET, because prior to that, the core of the network was the ARPANET and it was kind of a single routed core and everyone just attached into that single routed core. With the rise of the NSFNET, there were now a number of Internet cores and, all of a sudden, traffic policy, particularly in the United States, became a real issue and so we needed a protocol that treated the Internet as more of a set of domains rather than just one single routing bunch. And this is where BGP came from back in '89.
It went through a small number of quick refreshes, and roughly what we use now, with a few minor tweaks, actually dates back to 1994. This is BGP 4, and what you run today and what you ran in '94, I'm not sure that they would interoperate if you just put them back‑to‑back, but they probably would; there's not that much that's changed.
What I'm going to talk about is why a protocol can last 30 years. Because if you look back at what we did then, there was no HTTP, no worldwide web and some of you might recall FTP was really really popular. You know, why some of these protocols died, and why some of them are still there and actually still keep the Internet running.
One of the things that BGP did actually was I suppose done with other protocols, and at the time DECNET phase 4 was around and it had a very similar idea. When you get really, really big networks routing everything becomes a mind‑bogglingly difficult problem. So what you do is, you divide and conquer, and hierarchies are your friends, so BGP like DECNET 4 adopted this idea of areas. Inside an area, it's a bit like inside a city, the details are the details, don't bore me with details. I am only interested in how the cities interconnect. How the areas interconnect. So, like in DECNET and I suppose like real world routing, the way BGP works is, inside an autonomous system, that's fully interconnected. That's not BGP's problem. Don't even try and go there.
How those clouds interconnect is how BGP works. Unlike DECNET, BGP didn't even bother to define what you do inside your network. And whether you still like RIP, God bless you, or RIPv2 because you are really with it or you are still running EIGRP, love you, whatever you choose to use is your problem. BGP just doesn't care.
And so there is a lesson there about protocol design. And the real lesson, which for a lot of you OCD people out here is a really hard lesson to understand: don't overachieve. Underachieving is a really big virtue. Don't try and solve every problem. Solve a very focused set of problems and just ignore the rest; someone else will solve them a different way. So underachieving can be a virtue.
The second thing BGP did, which at the time was quite radical, because most of the other routing protocols actually used some kind of datagram transport: RIP uses UDP, IS-IS uses its own protocol, as does OSPF. BGP does not. It uses TCP port 176, as I recall. 179? Thank you. I'm getting old.
TCP does all the hard work. All of the hard work. TCP frames it. TCP makes sure that the order in which frames are sent is the order in which they are received. TCP does flow control, because by advertising a window back to the sender you can say, look, I'm still processing the last stuff you sent, just hang on, I'm not ready for more. All of that flow control, including rate adaptation, works in TCP. BGP should not lose messages because TCP doesn't. And so there is a really good lesson to be learned from all of that, that says: reuse, don't reinvent. Don't duplicate what's already underneath you. So that choice of TCP actually made BGP incredibly simple, because a huge amount of work that other protocols have to do, including, oh, my God, I missed a message, now what do I do?, BGP doesn't have to worry about. If I send it to you and you acknowledge it in TCP, it's now your problem, not my problem. I can move on with my life. And so this idea of simply reusing other protocols for BGP, that was actually quite an insightful and brilliant move.
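How little is left for BGP to do on top of TCP can be illustrated with a sketch of the receive side. Because the transport already delivers an ordered, reliable byte stream, the receiver only has to cut that stream into messages using the fixed 19‑byte header from the BGP‑4 specification (a 16‑byte marker of all ones, a 2‑byte length and a 1‑byte type). There is no sequencing, retransmission or acknowledgement logic at all. The function name and the sample stream are made up, and a real implementation would also check the 4096‑byte upper bound and parse the message bodies.

```python
import struct

HEADER_LEN = 19
MARKER = b"\xff" * 16
KEEPALIVE = 4


def split_messages(buffer: bytes):
    """Cut a received byte stream into (type, body) messages plus unconsumed bytes."""
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER_LEN:
        marker, length, msg_type = struct.unpack_from("!16sHB", buffer, offset)
        if marker != MARKER or length < HEADER_LEN:
            raise ValueError("not a BGP message header")
        if len(buffer) - offset < length:
            break  # rest of this message has not arrived yet
        messages.append((msg_type, buffer[offset + HEADER_LEN:offset + length]))
        offset += length
    return messages, buffer[offset:]


keepalive = MARKER + struct.pack("!HB", HEADER_LEN, KEEPALIVE)
stream = keepalive + keepalive + keepalive[:7]  # two whole messages and a fragment
messages, leftover = split_messages(stream)
print(len(messages), len(leftover))
```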
BGP talks about autonomous systems. But it doesn't write it into the packet. It doesn't alter the packet in any way at all. It's just forwarding and it just informs a forwarding decision. BGP doesn't stamp your packet. The packet doesn't contain any BGP information. No ASes, nothing. What does that mean? Well, interestingly, it makes multihoming, remultihoming and all those other things you do incredibly easy. Because BGP is just forwarding, it doesn't actually get into the data plane. Let me go back there. So there is a lesson.
Focus. If your job is forwarding, just do forwarding. Don't try and do anything else. Don't do side effects. Don't try and solve other people's problems. Focus on your problem and leave it at that.
This was always the hard bit in routing, because every other routing protocol tries to sort of smear a forwarding system across a whole bunch of players. You all have to play by the same rules. You all have to see the same metrics, you all have to make consistent forwarding decisions. So, shortest path first. All those SPF‑style algorithms mean that every player has to play with exactly the same set of rules. In BGP, my policies are my policies. They are not your policies. Go and invent your own. Do your own thing, and that's okay.
So every autonomous system can act autonomously and have its own set of policies about what traffic it prefers to import and to export.
Why is there an AS path? To prevent loops. It was just by seeing myself in an incoming path, obviously I have seen it before. It's just there for loops. It's nothing more than a loop preventer. And by default, if you haven't got any other policy, when you are tie‑breaking you will prefer a shorter AS path to a longer one, hence prepending. But that's only in the absence of any other local policy. But it's your policies that actually make sense. So, if you want to use any other metric to make forwarding decisions, knock yourself out, BGP doesn't care.
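The loop-prevention role of the AS path fits in a few lines. In this sketch each AS puts its own number on the front of the path when it passes a route on, and refuses any route whose path already contains its own number. The ASNs are documentation values and the function names are invented for illustration.

```python
def advertise(my_asn, as_path):
    """Path as sent to a neighbour: my own AS goes on the front."""
    return [my_asn] + list(as_path)


def accept(my_asn, as_path):
    """Reject any route that has already passed through my AS."""
    return my_asn not in as_path


# 64496 originates a route; it travels 64496 -> 64497 -> 64498.
path = advertise(64496, [])
path = advertise(64497, path)
path = advertise(64498, path)

# A fourth AS can take it, but the originator sees itself and drops it.
print(accept(64499, path), accept(64496, path))
```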
The other thing about this too is that I don't have to tell you what my policy decisions are. I can go to the Internet Routing Registry of choice and say, transport policies, opaque. It's my call. Whether I choose to expose it or not, is my call. So, whatever I do, is actually up to me. I don't need to tell everyone else. What I accept and what I readvertise, my business.
The protocol is not forcing my actions. The protocol is supporting my actions. So the real lesson there is stop trying to make the world behave according to the protocol. Let the protocol actually work the way the world wants to work, and if you think about it, the business model of peering and transit and the way BGP works are, oddly enough, much the same. And that's not a coincidence.
There is no single answer in routing, and BGP does not come to the same answers every time. It's actually non‑deterministic. Why? You have export policies. I have import policies. Who wins? Well, you know, it's a negotiation. And sometimes, depending on who starts first and what the other criteria are, it will send traffic one way. Sometimes it will send it the other way. You might try and bias my decision with more specifics. I might filter them out. You might do communities. I might not listen. It's a negotiation. So, BGP doesn't have a single best answer. It just has an answer as an outcome of working out import and export. It's a negotiation protocol. Changes in timers, changes in information, different outcomes.
So, the answer about this is that's okay. Don't be obsessive about it. Any solution that comes to an answer is an answer. Right. If it works, it works. Walk away and leave it alone.
So, why has it lasted? Because it underachieves. Because it reuses. Because it doesn't duplicate. Because it has very sharp focus on what it wants to do. It doesn't force business models. And it's not obsessive.
That's what we thought. And that's what we intended. How has it worked?
Well, when we thought the Internet was exploding back in 1992, we started an effort, some of you were there, I know, I saw you, and it was called the ROAD effort. Routing and addressing, not just addressing. It wasn't just we're running out of v4 addresses, it was the actual routing system was getting bigger than those poor old IBM XTs could manage the stuff inside the tiny amounts of memory they had. And what we thought at the time was we actually needed a new routing protocol as much as we needed a new addressing system for the Internet. We actually needed a full transplant. Now, we came up with two kinds of interim solutions to give us breathing space. In the addressing world we thought let's just go NAT. It's not a long‑term solution, but it will give us time and space. In the routing world, we said let's just drop classes, it's not the ultimate solution, it will give us space. And, you know, that kind of worked.
Because when we went classless, back in sometime April '94 when it got released by Cisco, the whole impetus of the routing explosion changed to a new trajectory, and that sort of worked for a while, so the stop‑gap solution was kind of okay.
Did v6 change the routing paradigm? Well, any of you remember the massive fights over top‑level aggregate announcements and sub‑TLAs and the world was going to have 7,000 transit providers and no more. Not 7,001. And those interminable debates made us realise we didn't have a clue. The IETF was just filling up space and, realistically, v6 didn't change the routing paradigm whatsoever. The whole effort of the IETF to impose a new business model on routing was a dismal failure, as it should necessarily have always been. We just used BGP for v6 the same way as we used it for v4. And, quite frankly, we just hand out blocks, you advertise them. Half of the routing table in v4 is more specifics. Half of the routing table in v6 is now more specifics. A /48 makes it everywhere. A /64 makes it everywhere. Nothing new.
So, now, traffic engineering, hijack defending, as Randy pointed out with his /24s, you just advertise more specifics. Same, same, same. Nothing has changed.
Because traffic engineering and BGP should never be put in the same sentence but we have no other way of doing traffic engineering. BGP is shit at it. All BGP is meant to do is topology maintenance, but it's a really crude protocol. It doesn't give you choice. It gives you the best path. And if you want two best paths, it's kind of brain explode, can't do that, won't do that, won't balance traffic, not cooperating, not working. The ideal two‑year‑old.
So if you can't do this easily, we get into all kinds of sort of really weird contortions to try and make traffic engineering work in an environment that doesn't naturally support it. So, yes, people do AS prepending because, you know, when you are drowning, anything that floats looks good. Not, but you know, people try. We do communities. Sometimes they get on, sometimes they don't. Sometimes they are transitive. Sometimes they are not. You know. So the leverage in traffic engineering is really difficult. What we get as a result is a huge amount of senseless BGP abuse. Because what actually goes on is that the attempts to try and engineer traffic, introduce more routes, more updates, more churn.
You can see that. This is the scale of BGP since 2009, so ten years, and we have grown from around 300,000 routes to around 800,000. Congratulations, well done. This is the amount of updates per day, and, up until 2013, independently of the size, we were seeing around 100,000 updates per day on any eBGP session and it started to grow and grow and grow. Why? Traffic engineering updates. What we're actually seeing is more specifics have a higher churn than the aggregates, and, as you move things around, that moving causes churn in the entire BGP system.
Why do we know that? Because the withdraw rate, the bottom line, is constant. There is only 10,000 withdrawals a day. I don't know how you guys do it, but you have organised amongst yourselves in some room that you never invited me into and you have said, right, today you can withdraw 10,000 routes, tomorrow, it's your turn, tomorrow it's your turn. And somehow, the stuff is orchestrated in a way that no other self‑organising system should ever exhibit. This is scale‑free. It's remarkable. When you think, okay, well, that's cool. Later. But what's even cooler is the fact that BGP converges in 50 seconds. Any update comes to a new state, whether withdrawn or stable in 50 seconds. It did that in 2009. It did that yesterday.
Wow! What are you doing? Keep doing it, it's really cool. Right. I have no idea what you're doing either, but it's brilliant.
Now, the problem is, that at this kind of scale, 800,000 entries, 70‑odd thousand ASes, 75,000 actually, scale becomes inertia. It's really, really difficult to change a massive system. So, you know, can we set up a new inter‑domain routing protocol tomorrow? We had a hard time with the KSK roll in the DNS. If you think the KSK roll was hard, this is monumental. Scale is just a massive problem here. It generates its own inertia.
What's going on behind this that's causing us to think that way? Well, in actual fact, the way BGP was designed and the way we use it have some really critical differences. We weren't meant to keep sessions up for three years. Think about that for a second. Because you guys really try and do it. You know, perfection for operations is a five‑year‑old BGP session, because it's stable. But, no, the original design was if you have a problem, reset the session and go and start it again. Learn all those routes and reset, yeah? That's the design experience in BGP. There is no soft error in BGP design, it was all just hard errors. We all love each other, we all trust each other, and we all think that everyone doesn't tell lies.
It's a public function. Who is going to attack me? Yeah, right! So, there are a number of problems with BGP in terms of its security, i.e. it has none. We noticed this first with the session attacks on TCP itself. Long‑held TCP sessions, you start just injecting random resets, sooner or later you are going to get the right number in the window sequence, bang they are down. In the payload integrity, no one is meant to lie. What's this BGP hijacking? What's this malicious routing? We're all friends, right. We have no defence against that in BGP per se.
Lastly, performance. The BGP MRAI timer works every 30 seconds. Yes, it takes 50 seconds for a route to converge, but, no, it doesn't take less than 30, no matter what. The thing is slow, and there have been efforts to make BGP faster but in general BGP is simply slow. Protocol performance is not one of the key achievers. And of course the last thing here, the use. It was meant to maintain topology, not do traffic engineering. If you were running out of capacity, get more capacity, don't mess with BGP, was the theory behind BGP. Don't try and make BGP compensate for the fact you haven't got enough bandwidth. So BGP isn't perfect. We have said, insecurity, instability, sparseness of signalling, and, quite frankly, traffic engineering is a nightmare.
So what are we going to do?
Obviously, it's not ticking all the boxes. There are obvious problems. So, we do what the IETF is really, really good at ‑ tinkering. Because, you know, they just can't avoid it, take a protocol and just nibble away at the edges. Capability negotiation. I can stand on my head. Can you? Let's stand on our head in this BGP. Let's just not send the best path. Let's send a number of additional paths, let's add paths. You know, this community stuff, I want more communities, I love them. Let's have extended communities. Let's do fast BGP and cut the timers down. All of those things are tweaks. They are not changes to the fundamental protocol, they are just changes to the top end of it, just the little bits of functionality.
Does it work? It might be fun and you might attend lots of IDR meetings and have lots of debates, but does it work? No, not in the slightest. Because, quite frankly, to get everyone to adopt it, you have got a real problem with acceptance. And unless you get sort of universal adoption, your tweak is just a local thing and that's not good enough. So as long as tweaks are localised in both their impact and their benefit, you are never going to get everyone else to do it.
Now, the only thing that BGP did universally is, we did run out of 16‑bit AS numbers and we have changed to 32‑bit. Now, it took a whole lot longer than we thought. And, you know, over in America they were still carefully handing out the last of the 16‑bit numbers because they wanted to do community‑based signalling, and they didn't have the protocol support for it, so they were leaving Europe and Asia to actually work on the 32‑bit numbers initially. But, you know, 4‑byte ASes finally came out. That's been the only one that's fundamentally changed the protocol.
So maybe it's time for a new protocol. Maybe this whole tweaking thing is a joke.
But this is not the first time someone said let's do a new inter‑domain routing protocol. We tried that back in '92, we tried it in '98, we tried it in 2008, and so on and so forth. There are only a couple of ways to do routing that we truly understand. There is a flooding‑style distributed computation of shortest path first. Everyone has the same topology, everyone runs the same computation, everyone has the same forwarding tables. That assumes flooding, uniformity and metrics, and the inter‑domain space doesn't do that. Or you go back to Bellman‑Ford, you do distance vector, I send you the best paths, how I compute them is up to me. This is where we are. If someone has a new idea, let's talk about it, but, until you do, there are not many other ways of doing this, it's a really constrained space. We have no new insights into how to do this kind of routing.
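As a rough sketch of the second approach, the Bellman‑Ford, distance‑vector style, the following Python loop has every node repeatedly hear its neighbours' current best distances and keep whatever improves its own table. The function name, the link format and the single numeric cost are assumptions made for illustration; BGP itself carries AS paths and applies local policy rather than summing metrics.

```python
def distance_vector(links):
    """links maps (a, b) -> cost; every link is treated as bidirectional."""
    nodes = {n for edge in links for n in edge}
    cost = {}
    for (a, b), c in links.items():
        cost[(a, b)] = c
        cost[(b, a)] = c

    # dist[n][d] is node n's current best known cost to reach d.
    dist = {n: {n: 0} for n in nodes}
    changed = True
    while changed:
        changed = False
        # Snapshot the tables so each round uses the previous round's adverts.
        adverts = {n: dict(table) for n, table in dist.items()}
        for (a, b), c in cost.items():
            # b tells a its best distances; a keeps anything that is an improvement.
            for dest, d in adverts[b].items():
                if c + d < dist[a].get(dest, float("inf")):
                    dist[a][dest] = c + d
                    changed = True
    return dist
```

No node ever sees the whole topology here; each one only learns what its neighbours choose to tell it, which is the property that lets every network keep its own computation private.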
We didn't plan for 30 years of BGP. It just happened. It happened because it's simple. Oddly enough, in an end‑to‑end world, hop by hop protocols are actually really powerful. And hop by hop was actually what made BGP work.
TCP is astonishingly powerful. Don't underestimate TCP. And it's not perfect, it never will be. It's a set of compromises that we can all live with.
So where are we going from here? Well, will BGP even be relevant? Because if you think the world is transit, lots of providers, large space of the Internet, that's one future and BGP has a role. If you think the CDNs are going to take over and that all we're doing is last‑mile delivery of television content for a world that's obsessed with Netflix, then BGP has no role because, at that point, in a uniprovider CDN, there is no transit, there is no peering, there is no routing. There is only one way of doing it.
So which way are we heading? The reason why v6 was never sort of given the urgency it sort of had originally was that the problem that v6 solved, lack of addresses in a peer‑to‑peer network, was actually solved by changing the architecture of the Internet completely. It was easier for the industry to make a client‑server network that didn't need to number its clients than it was to change to v6. We changed the architecture, which is bizarre, as the easiest way out of a problem with peer‑based network addressing. If we have a problem with BGP, it might well be yet again we are going to change the Internet rather than fixing BGP. Never underestimate the perverse nature of this industry.
So, RPKI, ROAs, secure, secure, secure, will it ever work? Are we ever going to get rid of folk announcing crap address space? Are more specifics ever going to die and senseless routing vandalism disappear? Senseless prepending, are you ever going to stop? Are we ever going to see an end to those massive route leaks from year to year to keep us all amused? No. None of those things are going to happen. We're just not going to be able to do this. Why?
It works well enough that we can cope. We are used to it. It's functional. And like any old ageing car that only works three out of four cylinders and a few of the wheels are falling off, etc., if you know how to drive it, it's just fine. And this is like BGP, we know how to drive it. We have actually trained a large body of folk to put up with BGP's bullshit and we trained them on this bullshit. If you see this, you should do something else, right. And so we have eventually grown used to BGP and all of its foibles. The levels of abuse are tolerable, because if they weren't, you'd fix it, and you don't. And there is no plan B. There is no new protocol waiting to take over. This is BGP. You are stuck with it.
AUDIENCE SPEAKER: Hi. Brian Trammell, Google. There is a thing that you went over really quickly that I want to go back to is, so, yeah, there are all of these features, there are all of these IETF tweaks to the protocol. And only one ever made an impact, and that was the 32‑bit AS numbers. And I think it makes sense to dwell on that a little bit and ask why? I don't know why. Because if you say that, well, running out of numbering is the reason that you leave IPv4, I mean BGP 16 based numbers, then obviously IPv6 has already been deployed, we can toast to its universal deployment. So is there any insight that we can take from that particular success that we can generalise to changes that we can make to the Internet routing domain architecture or is it just a one‑off emergency?
GEOFF HUSTON: That's a great question, and I suppose the answer is, we could have done AS NATs in 2‑byte ASes, we could have. And let me say at the outset that there are at least ten different networks in this world that all call themselves AS1. They do. Ten of them camp out on the same AS. They can't speak BGP to each other, obviously, but if they have default, they can send packets. So, we could have gone down that way. But I suppose no one was brave enough to go let's just do AS NATs, right, because after the v4 NATs got such a sterling reception in the IETF, no routing person was willing to go there saying NATs in BGP just fine. There was a potential future. I agree. But you know we went the other way and did the whole transition. Thanks.
AUDIENCE SPEAKER: Hi, Geoff. As you know, I am Iljitsch van Beijnum. So you say underachievement is a virtue, but then explain to me how this is possible. BGP is 30 years old. BGP 4, 25 years old. IPv6 is less than 25 years old, a little over 20 years, so, how can a protocol that was created before another protocol provide routing for that newer protocol?
GEOFF HUSTON: Magic. Isn't it brilliant? Because in some ways ‑‑
AUDIENCE SPEAKER: Yes, it is brilliant. You are not giving BGP enough ‑‑
GEOFF HUSTON: In some ways, the way BGP was designed is that every network is a bit like an airline. It may not look like it, but the airline doesn't like your luggage, and it really wants to piss it off. So that when you arrive without your luggage, the job of the airline is, Christ, I will find it and get rid of it. The job of BGP is to get rid of the packet. And what they really want is an exit point and it doesn't matter what address, family or protocol that exit point is. When it accepts a packet in the protocol, you know, the packet is in, it goes, what's the exit point? Go there. Now, the interior of the AS and the interior of the routing, you don't have to be v6, you just need to know the exit point, the same as the airline system. Once again, this beauty of BGP, simple, elegant and just enough. I mean, it's a great design.
AUDIENCE SPEAKER: Okay, that was not the answer that I was looking for, so I'll give you the one that I was ‑‑
GEOFF HUSTON: You are looking for your answer.
AUDIENCE SPEAKER: My answer is that back in the early nineties when they designed BGP 4, they did such a brilliant job that there was no need to ever go to BGP 5 because you can extend BGP 4 with 32‑bit ASes with all kind of new crap, including IPv6. We don't need BGP 5. BGP 4 can do anything.
GEOFF HUSTON: I'll let that stand on its own merits. There are people at the mics.
CHAIR: We have only two minutes.
AUDIENCE SPEAKER: Remco here. Well, actually, tying into Iljitsch's comment, BGP 4 is perfect. That reminds me of another protocol that the IETF has been tinkering with for a good three decades, called DNS, where, basically, DNS was the protocol of choice where if the IETF went fuck if I know where this has to go, let's put it in the DNS.
GEOFF HUSTON: I have a T‑shirt.
REMCO: Isn't it that ‑‑ I mean, right now, DNS, this whole discussion about the DNS camel that you might have heard of where a protocol ‑‑ if you actually want to read through the entire DNS protocol, it's like three‑and‑a‑half thousand pages of specification, and, with BGP, it's heading in the same direction, I feel, because if there is something about routing or actually stuffing packets in a certain direction, if you want to call that routing, needs to happen, the default choice is let's throw it into BGP so now we have all sorts of MPLS extensions, we have got FlowSpec, we have got God knows what else, is it maybe time to say let's call it a day? And let's leave BGP 4 as it is and anyone who wants to touch it from now on can ‑‑ well, go find a new hobby.
GEOFF HUSTON: So, there is eBGP, which knows nothing about MPLS, VPNs, AFI/SAFIs or anything else, as far as I can see, because what goes across the boundaries is generally not enterprise based solutions. And then there is iBGP, the dramatically good flooding signalling protocol that can carry data to all parts of your network, and iBGP has been randomly abused and tweaked like crazy, but in some ways it's not the protocol, it's the payload that gets augmented, ornated and adorned. The fundamental operation, TCP message parsing, hop by hop relaying, nothing changed. So, what you're talking about, the kitchen sink BGP, we have had these debates in the IETF from time to time, but if you think about it the BGP protocol behaviour never changed. The ornateness and decoration was actually down in the payload rather than in the fundamental operation of the protocol. And part of the issue is, iBGP, we have given up, do what you want. eBGP out there in the Internet, the way it works is actually quite sane and reasonable for the moment. But saying that, I am sure some of you said, right, let's make it worse because Geoff has said it's going to be easy.
AUDIENCE SPEAKER: In DNS, the packets still look the same and, arguably, a DNS resolver from 1995 could still parse ‑‑ in terms of you saying this is all enterprise and internal. I would like to contest that, because I remember presentations here at the RIPE meetings long ago, I think 2002, when people were doing inter‑domain MPLS with BGP signalling and actually today we had a conversation about BGP FlowSpec into the inter‑domain, so, saying that all of this just sits inside of a network I think shall I mean ‑‑
GEOFF HUSTON: I think the folk doing inter‑domain FlowSpec are, in the best sense of the English word, courageous.
AUDIENCE SPEAKER: I don't disagree.
CHAIR: We are running out of time.
AUDIENCE SPEAKER: Chris Woodfield, ARIN AC. I pretend to be an Internet person on TV but, in reality, I play a lot inside data centres just as much as I do on Internet devices. I wanted to mention that about ten‑odd years ago, so there was some data centres that were being built and somebody realised, hey, I have more network devices in my data centre than there were Internet routers at one point. I wonder if BGP would work just as well for routing inside a data centre as it does for Internet routing, as it turns out they were mostly right. I was wondering if ‑‑ and a lot of the extensions and tweaks that we have seen have actually come, a lot of those have come from the people who were operating BGP inside data centres just as much as people running BGP to route between Internet networks. I was wondering what your opinion on that is?
GEOFF HUSTON: I have conveniently ignored the whole AFI/SAFI yawning chasm of ornate detail inside the data centre. There is a whole universe of minions tweaking it in that space. We only had 25 minutes, not five days, but yes...
CHAIR: Last one.
AUDIENCE SPEAKER: It's Tasha, consultant for German government. I would propose to change the perspective a little bit. Because I like the thing Remco said, he said, okay, there is another protocol like this, it's DNS, and he is right. I read a lot of articles in the last couple of months about, oh, the Internet is in danger, it goes down in a black hole and it's not working any more, and they are all going on to DNS, these articles, and BGP. And the issue in my perspective is, someone says, okay, there is a problem with this Internet, and the Internet is globally important, and then some politicians say, okay, we need to do something. If we don't fix it, we ‑‑ I say the knowledge community, the IETF, someone else says, okay, when they don't, if they are not able to fix it, perhaps we can do it, you know. And you know the organisation I'm thinking about, I think.
GEOFF HUSTON: So, this is this whole area of market‑based economies regulation and increasingly sophisticated regulator, but an increasingly rich market‑based economy. And in some ways, we have the BGP you want. Because if you wanted to change it, we can. But all of you are prepared to live with it the way it works, including the occasional route leaks and the insecurity. You tolerate it. If you didn't tolerate it, we would change it. But you don't want to spend the money on changing it. So, the regulators are sitting there going, what am I going to do? The real answer is, if you want to walk in, I hope you have a massive chequebook, I hope you know what you are doing because you are about to mess with something that the rest of the industry doesn't want you to play with. So this is dangerous territory to walk in, but I agree, the regulator is becoming increasingly active. This talk about ‑‑ yesterday, about I want national boundaries around my Internet is as much as I want national boundary around my routing as it is about DNS. We are going to see more sophistication in the requirements, but the market is going down a different trajectory. You wanted deregulation, you have it. You don't have phone companies doing national bidding any more, you have something entirely different.
These communities need to figure out what they have done to this industry. Because if you are going to say it's a market based, you know, private enterprise, you are going to have to live with the consequences. If you want to reimpose control, wow, good luck. Because we have let the cat out of the bag, we have let this loose, it won't come back easily. So I don't see any solution for what you are saying. It's a problem. I admit. And we're going to spend a lot of time talking about it and this may be the first of billions, unfortunately, I hope I can run away. But you can see the issue, there is no easy answer. We have the BGP we are prepared to live with, weaknesses and all. This is it, this is life. Get over it. Thank you.
CHAIR: Thank you.
Our last speaker today is Thomas Weible from Flexoptix who is going to speak about the complexity of HyperSpeed transceivers.
THOMAS WEIBLE: Hi everybody. I'm the co‑founder of Flexoptix. When I started in networking industry, especially in the field of networking, I ‑‑ the speed of 100 megabit was somehow commodity and gigabit was newly introduced to pluggable transceivers. And at this day also, Google launched their services connecting everything together with a broke 4,000 switch, and, these days now, we are mainly getting there: 100 gigabit gets more and more commodity, and 400 gigabit is somehow the bleeding edge. If you look at the complexity of your networks, I'm sure you see differences there, that it has changed tremendously and it will get more complex. The same happened to transceivers. That's what I want to talk about. I want to show now what happened to the transceivers.
So what I did there is, I opened up a 10‑gigabit SFP+ and a 400‑gigabit QSFP‑DD and, besides the fact that they are different in size, there is also a new component introduced in the 400‑gigabit transceiver. So that's one part of the talk at the end. And then I do have those two tiny golden boxes, the transmitter and receiver section, which keep quite interesting components inside and that's what I want to start with now.
Let's start with the transmitter first, how it brings light on the fibre, and it's a golden box at the end of the day. For now we see an empty golden box and we want to fill it up. So we got an electrical connector attached to this golden box, it's also called a TOSA, Transmitter Optical Sub Assembly. It's connected with the electrical connector to the PCB. And first of all, we need somehow the laser diodes in there. We are talking about 400 gigabit. The same set‑up applies for 100 gigabit. And they have two characteristics, basically. First of all, their main purpose is to convert the electrical signal to an optical light level and then they emit, somehow, light, but this light is not really focused. And the second one is, they need current to operate properly and this current heats them up.
So, the next one what we need is, we need a cooler, and actually to keep those laser diodes in a certain temperature level, and then to keep the wavelength stable. That's an important thing because we don't want that the wavelength is going to drift away. We want to keep it stable because when it drifts away for one laser, for example, then our link will fail and then we have packet loss and so on.
So, the cooler also is an electrical component. It takes current again to cool down the laser diodes.
And the second one is to focus the laser itself, to use lenses. It's pretty much like you wear glasses, and pretty much everyone here, or contact lenses. But they are a little bit small in the transceiver itself because that golden box we're talking about has a dimension of 5 by 10mm so each lens has a dimension of 0.5mm, and the challenge there is in production, for example, just to get them placed properly in X, Y, Z axis and we do this in roughly 15 nanometre steps in the mechanical positions. Then that's the first challenge. And the second one is to keep them there for the next years. And it's typically done with glue and this glue is hardened by UV light. Now, you can imagine that this glue needs some certain characteristics that it stays in that position and consistency for now, tomorrow and also 2025.
So that's a bigger part there.
Now, we have focused light for four wavelengths actually, but they are not really hitting the receptacle at the far end, which we have here. So, we need another component in there to bundle everything together, and there we come back a little bit to school physics and the principle of mirrors. We add an optical multiplexer, what it does is combine all the four wavelengths together in that filter block and the light is doing a little bit of zigzag and reflecting and bouncing back and forward, getting to that section where all four wavelengths can be combined through that receptacle here and that's a bigger drawing.
So now the four wavelengths are on the fibre, when they are on the receptacle we plug in the fibre cable there and on the other side we will have the receiver. It's pretty much vice versa. Again, we have now a grey box which is opened and it's empty. So the four wavelengths are coming in and this now is the ROSA, Receiver Optical Sub Assembly. We need an optical demultiplexer. It works a little bit differently compared with the multiplexer but it has the same principle, and there are filters, so what happens is the green light is passing through the filters and the other colours are bouncing back and forward again and going down to the yellow light here and this one is passed and so we got a separation now again of all the four wavelengths inside that ROSA.
Then, again, we need to focus it. It's the other way around, we need lenses again, to position them precisely, hitting the pin diode, the receiver itself who does the conversion from the light level to the electrical level.
Now this electrical level has an issue, actually. The amplitude is quite low, so the post processing systems behind it, like the DSP, or your switch or router, they can't handle that signal, it's very, very low. So what we need there is an amplifier and this amplifier will boost up the signal a little bit higher. It takes current again, so it's going to heat up again, and brings a stronger signal to the DSP or the post processing systems.
One side story here about the amplifier: some of you might have done this, I assume. When you have an 80 km transceiver, for example, or even longer, and you connected two of them back to back without any attenuation, over 2 metres of single‑mode cable, then most probably the link came up for a couple of seconds and then it dropped again and then it stayed down forever because you broke something. So, what you actually did there, and quite a lot of people think they burned the receiver, actually you didn't burn the receiver, you burned the amplifier, because the amplifier got way too much current in there and that's actually what was broken. So you could have fixed that if you had just replaced the amplifier at the end of the day.
Let's move on. So, we have those two components now, the TOSA and ROSA and we need to assemble them now somehow to the PCB. I don't want to get into the details there, but I have pointed out one thing which I think is important to know about production itself.
When it comes to timing, the production is ‑‑ so I explained the graph here a little bit. On the X axis, I do have three sections, 10 gigabits, 100 and 400 gigabits. On the Y one I have timing in minutes. And I have two graphs, basically; the blue one is the assembly and the green one is testing, And the combination of both of them is the orange one.
For a 10‑gig transceiver, you can say roughly it's 8 minutes which you take for production and testing, we do this in batches so it's normalised a little bit. But for a 400‑gigabit transceiver, the assembly is a little bit higher, it's roughly 20 minutes in total. So it's a more complex component to build, but when we look at the curve, the testing to do the entire setup, to make it a proper product, is way, way steeper, which we can see here, it's going really way up. So, an interesting point is if you want to, for example, save cost or optimise your production, actually it's not on the assembly part, it's more on the testing and you can do it in a good way and a bad way. When you just cut down the timing, this is going to be a bad way to save costs there. But also, if you, for example, have a more intelligent process of testing, there is more automation in there, this also saves you time and, at the end, costs.
One example about testing I want to point out here is to give you a rough idea of how you can do different types of tests. I took the example of temperature testing, for example. And I do have my artificial TOSA and ROSA here, at the temperature of 0 degrees and on the right side it is at 70 degrees centigrade. You can just do testing at 0 degrees and 70 degrees and just wait until my green laser is normalised again, so here what happens is that the wavelength is drifting away a little bit and you wait and you wait and it somehow will normalise again, and you say, well, test passed, passed. But this can be one test case, if you say well, okay, just instead of having a temperature we add a second variable, which is the time, and we say we start at room temperature, 25 degrees, we cool it down within 30 seconds to zero degrees and it needs to be stabilised, let's say, after 30 seconds again and then we ramp up the temperature within the 60 seconds to 70 degrees centigrade and it needs to stabilise again within 30 seconds, you have a totally different test set‑up. And if a component like inside there can't make it, the test has failed, but if you don't take the time and consideration, it might pass.
And with the timing, for example on the laser side, which is drifting away, you can also test other components like the cooler, that it's properly operational.
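The difference between the two temperature tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RampStep` structure, the function names and the measured settle times are hypothetical, and only the temperatures and time limits come from the talk.

```python
from dataclasses import dataclass


@dataclass
class RampStep:
    target_c: float        # temperature to reach, degrees centigrade
    ramp_s: float          # time allowed to reach the target
    settle_limit_s: float  # time allowed for the wavelength to stabilise


# Profile described in the talk: start at 25 C, cool to 0 C within 30 s,
# then heat to 70 C within 60 s; stabilise within 30 s after each ramp.
PROFILE = [RampStep(0.0, 30.0, 30.0), RampStep(70.0, 60.0, 30.0)]


def static_test(measured_settle_s):
    """Temperature-only test: wait as long as needed.

    Passes whenever the part settles at all (None means it never settled).
    """
    return all(t is not None for t in measured_settle_s)


def timed_test(profile, measured_settle_s):
    """Time-aware test: every settle time must be within the step's limit."""
    return all(
        t is not None and t <= step.settle_limit_s
        for step, t in zip(profile, measured_settle_s)
    )


# Hypothetical part: settles after 12 s at 0 C but needs 45 s at 70 C.
slow_part = [12.0, 45.0]
print(static_test(slow_part))          # True
print(timed_test(PROFILE, slow_part))  # False
```

The same part passes the temperature-only test and fails once time is a second variable, which is the point made above.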
How important test cases are, I am pretty sure you are aware; for all the software you are writing, you do test cases. I give you another example. I have here an artificial crane which is horse‑power‑driven. So we have here a wheel which the horse is spinning, and the gear itself, what it does, it lifts up the weight which we have here. But if even the tiniest cog wheel within that gear breaks and we haven't tested it properly, it will break the gear itself, and it can be even worse, that this weight here might destroy something else because it's heavy and so on. And this can also happen in a transceiver, so you would have a chain reaction of certain components which then will fail.
The goal, for sure, is to have a production which, at the end of the day, passes all the tests within a proper test setup, which is not taking too much time either.
At the beginning, I mentioned the DSP, which is the third component and which is newly introduced in 400 gigabit transceivers. The DSP mainly holds two functions. One is the modulation; I don't want to talk about modulation now. But I want to talk about the second one, which is forward error correction.
In gigabit and, for example, 10‑gigabit transmission, there was already an error detection: the cyclic redundancy check at the end of the Ethernet frame, which does the math over all of the frame of 1,500 bytes, for example. With that CRC you were able to at least detect errors. Now, with 100 gigabit and 400 gigabit, this is not sufficient any longer; we need not only to detect errors, we need to correct those errors, and that's the role of the forward error correction.
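As a minimal sketch of what "detect but not correct" means, the following uses Python's `zlib.crc32`, which computes a CRC-32 with the same polynomial as the Ethernet frame check sequence. The all-zero 1,500-byte payload and the helper name `frame_is_intact` are assumptions made for illustration.

```python
import zlib

# Stand-in for a 1,500-byte Ethernet payload (contents are arbitrary here).
frame = bytes(1500)
fcs = zlib.crc32(frame)  # checksum the sender appends to the frame


def frame_is_intact(payload: bytes, expected_fcs: int) -> bool:
    """Receiver side: recompute the CRC and compare it with the received one."""
    return zlib.crc32(payload) == expected_fcs


# Flip a single bit somewhere in the payload, as a noisy link might.
corrupted = bytearray(frame)
corrupted[100] ^= 0x01

print(frame_is_intact(frame, fcs))             # True
print(frame_is_intact(bytes(corrupted), fcs))  # False
```

The receiver learns that the frame is damaged, but the checksum alone does not say which bit flipped, so all it can do is drop the frame.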
How does it work? In the 100 gigabit world, it's quite straightforward: on the switching side, on the left side, you have an encoder which takes care to encode and bring it over to the right side, which is the receiver. There is the decoder. So when we look inside there, and I used a little bit of another example here, with the lorry and the X provider. What we have here is, we have bricks on a lorry, and that lorry ‑‑ so the bricks are basically our frame of 1,500 bytes, which is roughly 12,000 bits, and this figure is important later on. So that lorry carries that payload over a dodgy road, which is our fibre, and all the bricks are disordered somehow and even one is lost here, the blue one. And on the receiving side, the decoder brings all the bricks back in order. And how powerful is it? When we have those 12,000 bits, for example, most probably, when the errors are equally distributed, it can correct up to 330 bits, and when they are clustered, the performance goes down, so we can correct only 33 bits there.
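The encoder and decoder idea can be illustrated with a toy code. The sketch below uses a three-times repetition code with majority voting. This is deliberately not the Reed-Solomon FEC that Ethernet uses, and its correction limits are not the 330 and 33 bit figures quoted above; it only shows how redundancy added by the encoder lets the decoder repair errors, and why it matters where the errors land. All names are hypothetical.

```python
def fec_encode(bits):
    """Encoder: send every bit three times (the added redundancy)."""
    return [b for b in bits for _ in range(3)]


def fec_decode(coded):
    """Decoder: majority vote over each group of three recovers one bit."""
    return [
        1 if sum(coded[i:i + 3]) >= 2 else 0
        for i in range(0, len(coded), 3)
    ]


payload = [1, 0, 1, 1, 0, 0, 1, 0]
line = fec_encode(payload)

# Three bit errors spread over three different groups: all are corrected.
spread = list(line)
for i in (0, 7, 14):
    spread[i] ^= 1

# Only two bit errors, but clustered in the same group: the vote goes wrong.
burst = list(line)
for i in (0, 1):
    burst[i] ^= 1

print(fec_decode(spread) == payload)  # True
print(fec_decode(burst) == payload)   # False
```

In this toy code, three scattered errors are repaired while two adjacent ones are not, which mirrors the distinction drawn above between equally distributed and clustered errors.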
Another thing it does is, it adds latency or delay to our overall payload, so you can imagine it takes a while to bring all the bricks back in order, and the same happens with the forward error correction in your switch, it takes some time. In fact, it takes roughly 100 nanoseconds, depending on how it's configured, which you add as a delay to your link. Plus, the third one is, it takes power again. So error correction has a high power consumption there.
And why do I mention that? Because in 400 gigabit we will have more FECs. So we still have the FEC in the switch itself, which is now the FEC 2, and it does the encoding there to get all the bits and bytes shovelled over to the transceiver, where the transceiver does the decoding. And then the FEC in the transceiver needs to encode again and brings it on the line, and on the other side it's the other way around, so the transceiver decodes the line signal and brings it onto the direct connect interface to the switch itself, where the decoding needs to be done. And there are, I think, mainly two takeaways on the forward error correction. One is that we now have a higher complexity of four FECs. Well, if you have four involved, you have four times the delay on your link, that's a really important thing, and you will take way more power then.
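A back-of-the-envelope version of the delay argument, as a sketch: the 100 ns per FEC is the rough figure from the talk, while the fibre propagation delay of about 5 ns per metre is an assumption added here only to give the numbers some scale. The function names are hypothetical.

```python
FEC_DELAY_NS = 100   # rough per-FEC figure quoted in the talk
FIBRE_NS_PER_M = 5   # assumption, not from the talk: ~5 ns per metre of fibre


def added_delay_ns(fec_count, per_fec_ns=FEC_DELAY_NS):
    """Total delay added by fec_count FEC instances on the path."""
    return fec_count * per_fec_ns


def equivalent_fibre_m(delay_ns, ns_per_m=FIBRE_NS_PER_M):
    """Length of fibre that would add the same propagation delay."""
    return delay_ns / ns_per_m


for count in (1, 4):
    delay = added_delay_ns(count)
    print(f"{count} FEC(s): {delay} ns, like {equivalent_fibre_m(delay):.0f} m of fibre")
```

With four FECs at roughly 100 ns each, the link picks up about 400 ns, four times the single-FEC case.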
Another really important take‑away is the number of FECs. So, what you really want to avoid is an odd number of FECs involved. So if you have three or five ‑‑ three or one involved, sorry, five is not possible ‑‑ then it won't work. He can't handle it, and the same can happen in your network actually, because, typically, when you plug in the transceiver, the switch OS should determine, according to the type of transceiver that's plugged in, whether to disable or enable the FEC; some standards require the FEC to be enabled, and some of them, like LR4, don't require it. Depending on the implementation, the switch should actually do it automatically, but sometimes it doesn't, and then you have an odd number of FECs in there and your link won't work at all. And the other aspect is, you might have a working link, but you have bit errors on that link itself. So you'd better double check the FEC settings: not only that they match on both sides, for sure, but whether FEC is enabled at all, because it can also happen that it's just not enabled and then you end up with bit errors on your link.
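The double check suggested here could look roughly like the following sketch. Everything in it is hypothetical: the `Port` structure, the `FEC_REQUIRED` table and the messages are made up for illustration and do not correspond to any switch operating system's API. The table only encodes the point from the talk that some transceiver types need FEC while LR4 does not.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    transceiver: str
    fec_enabled: bool


# Illustrative table only: whether the host-side FEC is needed per type.
FEC_REQUIRED = {"100G-SR4": True, "100G-LR4": False}


def check_link(a, b):
    """Return a list of FEC problems found on the link between ports a and b."""
    problems = []
    # An encoder without a matching decoder: the link will not come up.
    if a.fec_enabled != b.fec_enabled:
        problems.append("FEC enabled on one side only")
    # FEC needed by the optic but switched off: expect bit errors.
    for port in (a, b):
        if FEC_REQUIRED.get(port.transceiver) and not port.fec_enabled:
            problems.append(f"{port.name}: {port.transceiver} needs FEC, but it is off")
    return problems


good = check_link(Port("sw1:1", "100G-SR4", True), Port("sw2:1", "100G-SR4", True))
mismatch = check_link(Port("sw1:2", "100G-SR4", True), Port("sw2:2", "100G-SR4", False))
both_off = check_link(Port("sw1:3", "100G-SR4", False), Port("sw2:3", "100G-SR4", False))

for result in (good, mismatch, both_off):
    print(result or "OK")
```

The third case is the subtle one: both sides agree, so nothing looks mismatched, yet FEC is off where the optic needs it.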
Okay. Coming almost to the end of the presentation. I always give a little bit of an overview of what is the latest development, what we can achieve currently on the market, and I pointed out three things which I think ‑‑ three types of transceivers which are really interesting at this time.
I took the 400 gig LR4 again because it's still not standardised. And what is not standardised is the wavelength which is going to be used. A lot of people assume we will have the same wavelength grid as in 100 gigabit LR4, that is, LAN‑WDM with 5 nanometre spacing between each other, but there are also proposals from the industry for a 20 nanometre wavelength grid for LR4. And the reasons are quite simple: it's way cheaper to produce lasers with 20 nanometre spacing and keep them stabilised than lasers with 5 nanometre spacing. Let's see where they will actually go there. There are those proposals for the wavelength grid and we'll see when LR4 is standardised.
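To put numbers on the two grids, the sketch below compares the nominal 100GBASE-LR4 LAN-WDM lane wavelengths with the 20 nm CWDM grid around 1310 nm. The wavelength values are the commonly published nominal centre wavelengths as recalled by the editor, not figures from the talk, and the helper names are hypothetical; check the relevant specifications before relying on them.

```python
# Nominal lane centre wavelengths in nanometres (assumed values, see lead-in).
LAN_WDM_NM = [1295.56, 1300.05, 1304.58, 1309.14]  # 100G LR4 grid
CWDM_NM = [1271.0, 1291.0, 1311.0, 1331.0]         # 20 nm grid


def spacings(grid):
    """Distance between neighbouring lanes, in nanometres."""
    return [round(b - a, 2) for a, b in zip(grid, grid[1:])]


def span(grid):
    """Distance between the outermost lane centres, in nanometres."""
    return round(grid[-1] - grid[0], 2)


for name, grid in (("LAN-WDM", LAN_WDM_NM), ("CWDM", CWDM_NM)):
    print(name, "spacings:", spacings(grid), "span:", span(grid))
```

The LAN-WDM lanes sit about 4.5 nm apart (the "5 nanometre" above is a round figure), so each laser has far less room to drift than on a 20 nm grid, which is the cost argument made in the talk.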
Another new development is the long reach for 100 gigabit, but I'm only talking about interesting form factors, which is QSFP28; I don't talk about the CFPx form factors. The 100 gigabit long reach is coming up. We will have them, most probably, by next week in our lab, and they can do 40 km and 80 km distance. This was actually a big issue in the last couple of years; there was somehow the landmark of 25, 30 km and not going beyond that, and now, with that solution, we actually have something getting up to 80 km.
And the third one is more an interesting approach; I'm not quite sure if it will be interesting for us as ISPs or data centre providers. It is the 50 gig Ethernet itself, and also the possibility to do 50 gig Ethernet up to 40 km. It's coming from the mobile networks doing 5G deployments, because their need is, if you have 1,000 poles and you have to connect them, you need somehow a technology there, and it was simple math: 50 gigabit Ethernet was sufficient, so that's what they decided to do for their 5G roll‑out, and 100 gigabit was not required. It's way cheaper to provide a 50 gigabit transceiver compared to 100 gigabit.
If this will ramp up, I don't know, but what is for sure is that the ASICs inside your switch which are handling 100 gig and 400 gigabit are capable of doing 50 gig Ethernet. It's more a question of whether the operating system which is driving those ASICs is going to enable the 50 gig Ethernet or not. Those ASICs can also do 10 gig, 25, 40 gig, up to 400 gig, so they are quite multi‑speed; it's more a matter of the operating system. But what is also for sure is that all those three technologies are making heavy use of the 1310 nanometre window, and they use the full spectrum there of the 17 nanometres which we have, so I would really give the recommendation to keep that window open and not add additional filters in our network, because, as we have seen previously on my slides, the filters are actually in the transceivers themselves on the higher speed ones at 100 gig and 400 gig, and you don't want additional filters in the network design there. That's really important. So the 1310, keep it open.
And if there is the need to do slicing, DWDM is still a proper solution there. The speeds went up there as well, in single solutions doing 25 gig or even 40 gig, and they are currently not only legacy space environment, they are already available.
Okay. That brings me to the end. There will be a lot of space for us to do some improvement, also when it comes now to the DSP: instead of just having the microcontroller, which we have to tweak, we will also have to put some work into the DSP in the future with the FLEXBOX to change the programming.
Thank you very much, and if there are questions, I am happy to take them.
CHAIR: Thank you, Thomas. Any questions?
AUDIENCE SPEAKER: Hi. Chris Woodfield. You mentioned the increased testing times that are required for the higher speed components, or higher speed optics. Is that a ‑‑ is that because the, you know, 100, 400 gig optics have more complexity and require more complex scenarios, or are you seeing lower silicon yields in those parts than you do on lower speed components?
THOMAS WEIBLE: First of all, actually both. You have more parts in there and they need more testing because it's higher speeds, yeah, it will adapt double the amount actually, yeah.
AUDIENCE SPEAKER: So it's kind of, kind of testing squared.
THOMAS WEIBLE: Yeah.
AUDIENCE SPEAKER: Hi, Thomas. Cool slides.
THOMAS WEIBLE: Thank you very much. One disclaimer, I gave all the toys back to my kids. They have it back again. I didn't keep them.
There is one more thing I want to add. I would like to have Fergus on the stage. And he is coming... please... well, then, you have to take the celebration that way. Fergus has his birthday today.
You also get a present over there.
Happy birthday to you...
Happy birthday to you...
CHAIR: A couple of reminders: we do have a BoF in the side room about diversity. Also, there is a RACI session at 6, and do not forget tonight's social event; it's a 20‑minute walk to the ship. The ship will leave the dock at 9:45, so be there at 9:00. Thank you.
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC