17 October 2019
JOB SNIJDERS: Ladies and gentlemen, welcome to the Routing Working Group at RIPE 79. I am still catching my breath. I ran in just 30 seconds ago, my apologies.
Today we have an awesome programme, packed with highly technical insightful presentations, I am very much looking forward to what the presenters can teach us today. But before we delve into the actual meat of this Routing Working Group, I will hand over to Rob, who will go through the administration of this Working Group.
ROB EVANS: Thank you. So, what have you done to deserve me back up here again?
Ignas is, as you know, Chair of this Working Group, but he has been unwell, so Job and I stepped in to run this session and maybe sort out some chairs for the future.
This is the agenda. First of all, a big thank you to the scribe and the Jabber monitor from the RIPE NCC, and also a big thank you to Mary, who is doing the stenography and does an amazing job.
I have got a note on here to approve the minutes of RIPE 78, but when I was looking through the mailing list, I don't think they were ever sent out. So I sent a little note out this morning. Have a look through; if they meet with your approval, don't say anything. If you object, then speak up, and we'll get them changed appropriately and declare them final in a couple of weeks.
This is the agenda. Let's bash it.
The first thing we're going to do after we have gone through it is select some co‑chairs. Then we will have Florian and Emile from the RIPE NCC talking about route collection, changes in the architecture, and how they're doing that.
Massimo from NTT is going to talk about BGP alerter.
Geely from UCL is going to give a presentation on groups of border links. This presentation has come from the RACI programme.
Nathalie from the RIPE NCC is going to talk about RPKI resilience.
Andrei Robachevsky is going to talk about validating MANRS over the network.
It has been brought up that there might be one issue of AOB to cover at the end. We are going to run quite tight on time, so I would ask the speakers to do their best to keep to the times allocated.
The Routing Working Group co‑chair selection policy is fairly lightweight. I should probably take some of the blame for that. But it says that "As a large part of the work of the chairs is creating an agenda for the Working Group meetings, these co‑chairs are generally appointed by consensus of those present at a working group meeting."
So Hans Petter sent a request out to the list a couple of weeks ago, and we had volunteers, who are both present here today. I was going to do hums; Job suggested a show of hands. I really don't mind. Let's start with a hum, because it's consensus based rather than voting. I'll do two hums: the first is whether we agree to appoint both of these as co‑chairs for the Working Group, and the second is whether there are objections.
If the objections are louder than the agreement, then we'll go through and do a bit more. So first of all, do we all agree to appointing both Paul and Job as co‑chairs of the Routing Working Group? Please hum now.
Are there objections to appointing both Paul and Job as co‑chairs, please hum now.
I think that sounds fairly clear to me. So congratulations to Paul and Job.
So, one quick observation from myself on the co‑chair selection process. I think a few of the RIPE Working Groups should get to, get a bit of a handle on this.
The way we do it at the moment, I can see how it could be quite intimidating for newcomers trying to volunteer to be co‑chairs, because we have really good candidates who then get a lot of public support on the mailing lists. Which means that if you are new, you think there is no point in standing. I'm going to suggest that maybe we tweak the selection procedure so that in the future candidates make themselves known to the existing co‑chairs, then they are all announced together, and then we work on the consensus of the group from there.
I know the DNS Working Group already aims to reserve one of their co‑chair slots for a new co‑chair, to try and train them up and bring them into the fold. That might be an idea to follow for this. But that is now up to you, Ignas and Job and Paul, as to how you want to do that. That's the admin trivia done. Florian is up first, to start talking about RIS route collection.
FLORIAN OBSER: Good morning. The last RIS architecture update we did was at RIPE 70, so we figured we should probably give you an update on what we have been busy with since then.
So, about RIS. We are collecting BGP data, updates and withdrawals, on machines that we call remote route collectors (RRCs); those are basically BGP speakers. We operate 19 at Internet Exchanges, where they can only talk to peers over the exchange, and we now have two multihop collectors, one in Amsterdam and one in Montevideo.
We have been doing this for quite some time: we have been storing data since 1999. We generate update files every five minutes in MRT format. You can access these via this website. And we store RIB dumps every eight hours; you can access those at the same location.
Additionally, we provide processed data on RIPE Stat, and recently we added RIS live. There was a presentation about that one at the last RIPE meeting.
This is how this looks on the map. We are a bit crowded in Europe. And there are some thoughts on how we can maybe extend this over the world; Emile will talk about that a bit.
Since the last presentation, we added these RRCs. Of note, I think, is RRC 24, sponsored by LACNIC. This is the other multihop RRC, so we are facilitating peerings all over South America on that one.
What we were also busy with was ‑‑ so some of the RRCs are operated by the NCC, so it's our hardware. We did a full hardware refresh on those. That means getting the boxes in and shipping them around the world. It takes quite some time. And we also had to do a last update; we updated all of them to 7. And the big change that we made: we moved from Quagga as the BGP speaking software to ExaBGP, which had some benefits for us.
So, about ExaBGP. It's a simple enough BGP speaker written in Python. It just keeps the session alive and collects data. Because of that, for our use case, it is much more scalable than the Quagga we had before. Quagga was really struggling with a lot of peers that give us a full table.
With ExaBGP, we can massively simplify this. We just get the data in and store it on disk. We don't need to process the data. We don't actively care what people give us as long as the session stays up.
The other thing we can do with ExaBGP: the remote route collectors also announce some prefixes over time so that we can monitor propagation. With ExaBGP, we actually have this inside the daemon, which makes the timing much more accurate. Previously this was triggered by a cron job, where you have an accuracy of maybe a minute. You would hope not to be so unlucky that it fires 59 seconds after your time slot, but maybe it's 20 seconds after. With what we're doing now, we're quite sure that we have this within a second, which helps with propagation measurements.
So we have the data on disk on the RRC; now we need to do something with it. As I was saying, we are generating MRT files. What we're doing is taking the data off the disk, outside of the BGP speaker, shipping it to Amsterdam and storing it in a Kafka cluster. The upshot is that this does not interfere with BGP.
Previously, with Quagga, we had the issue that when it was generating the dump files, it stopped doing anything else. If you have a lot of peers on there that maybe have a full table, then it might stop doing anything else for 30 seconds, and suddenly your hold timer expires and your sessions go down. Then you are creating artefacts and you are seeing an instability. But the instability is not caused by the Internet. It's caused by the measurement. So we have that solved.
Also, the individual update is on our collector in Amsterdam in the queue system within seconds. Previously we had a delay of at least minutes, we created the update file every five minutes, so it would take five minutes to see the very first update in that file. Now we have this more or less instantaneous, which in turn allows us to do RIS live.
The other thing is, as I was alluding to, that Quagga was struggling with actually writing out the dump files. Since we no longer need to do this on the RRC, we no longer need to keep state there. We can do this in Amsterdam, where it's much easier to scale up. If you have an RRC that is sponsored by an IX and we ask them, can you please upgrade that to three or four more machines, that might be a bit rude.
It also allows us to do more accurate timing when we create the update files. We actually look at when each message came in. What previously happened was that a cron job triggered writing the update file or the dump file, and whatever data was there when that cron job triggered was the data that was in the file. Now, if we want to do this every five minutes, we know that the data from those five minutes is in there.
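The timestamp-based file creation described above can be sketched as follows. This is a minimal illustration, not the RIS implementation: the function names and the `(received_at, payload)` tuple layout are assumptions made for the example. Each message is assigned to a five-minute file by its own arrival time, rather than by the moment a periodic job happens to run.

```python
from collections import defaultdict

INTERVAL = 300  # five minutes, in seconds


def bucket_start(received_at, interval=INTERVAL):
    """Return the start of the interval that contains the arrival timestamp."""
    return received_at - (received_at % interval)


def group_by_interval(messages):
    """Group (received_at, payload) pairs by the interval in which they arrived.

    The tuple layout is hypothetical; it stands in for whatever record
    the collector stores for each BGP message.
    """
    files = defaultdict(list)
    for received_at, payload in messages:
        files[bucket_start(received_at)].append(payload)
    return dict(files)


messages = [
    (1571300399, "update-a"),  # last second of one interval
    (1571300400, "update-b"),  # first second of the next interval
    (1571300699, "update-c"),
    (1571300700, "update-d"),
]
grouped = group_by_interval(messages)
```

With this approach a late-running writer cannot move a message into the wrong file, because the file is chosen from the message timestamp alone.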
Unless you have questions about any of these specific slides, I would suggest that we do questions at the end of the presentation. I would hand over to Emile now.
EMILE ABEN: So onto the second part. My name is Emile, I do data analysis at the RIPE NCC, and so, doing that, I process a lot of RIS data and that's where I want to share some thoughts about how to further develop RIS based on that.
So, here is my problem in a nutshell. I'm not sure if you can read this, but RIS is growing: a single view, a dump of all of the RIS tables, was 1.4G at the beginning of the year. Now it's 2.1G. That's medium sized data, but still, if you want to process that comfortably on a single machine, it increases the time, and you can throw more machinery at this. But I was also wondering: if we're growing this thing, how are we actually growing it? Is there redundancy in this data? So this brings up all types of questions. If I have a problem with this type of processing and prototyping, and I have a bit of time to actually do that as a researcher, then network operators are probably not going to be able to bother with this.
So, do we have redundancies in the data? Can we give you a reduced dataset that provides maybe a little bit less value but is a far smaller dataset? It also raises the question of whether what we are actually doing here is diverse. And what does it actually mean for services? We have BGPlay in RIPE Stat and we have RIS‑live, and we stream a lot of data into those. Maybe there are parts of that that are redundant, and by removing some of that, the visualisations and the like that we make might actually be clearer.
And so, what we are currently doing as an expansion is adding route collectors at IXPs, but we also do multihop, and there is more room for growth there with all the upgrades. I was wondering, do we actually need to focus more on diversity instead of volume here? So, growing it in a way that gives you more signal from the whole Internet.
Because what you really would want, I think, is a system that is really representative of the Internet. A lot of visualisations and a lot of analysis, and I am also guilty of this, have the implicit assumption that RIS is representative of the Internet, or that any route collector project is representative of the Internet.
But is it? And/or to what extent is it? So actually the way we collect peers suggests it is biased. I'll get back to that later. And what is the actual value proposition for somebody to peer with RIS and should we do something there to attract a more diverse set of peers?
So, the current value propositions are "it's for the good of the Internet" and, as I have also heard, "I look better in the Internet rankings", but I don't know if there are any others. I would love to hear them. And there is a term I borrowed, or heard, from Randy Bush: are we observing the clue core, and what's outside of that? There may be really interesting stuff happening there that we just don't know about because we don't see it. Or, to use an analogy from search: if you have personalised search, you find yourself in filter bubbles. Are we, with all of our route collector projects, and this is not just a RIS problem, in a filter bubble?
If you go back and think about it, there is a really interesting Wikipedia article about convenience sampling, which is pretty much what we do with RIS. You draw a sample of the population you want to measure from the part of the population that is close at hand. It is speedy and very cost‑effective, but the disadvantage is that you cannot even know the bias of what you measure.
So, what can you measure and what can we actually see is the diversity within our route collector project.
So what I actually did, and this is based on a RIPE Labs article, the URL is here. What you see here is a matrix: every row and every column is a single full feed RIS peer, and I look at how much of the topology they see the same or see differently. If something happens on that topology, you expect two very similar peers to see the same thing. So there might be some redundancy there, and what I think would be interesting is to have as diverse a set of peers as possible.
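A peer-by-peer matrix of this kind can be sketched as below. This is an illustration under stated assumptions, not the method of the RIPE Labs article: it assumes each peer is represented by the AS paths in its table, compares peers by the set of AS adjacencies those paths contain, and uses Jaccard similarity as the score. The peer names and AS numbers are made up (private-use ASNs).

```python
def links_from_paths(as_paths):
    """Collect the undirected AS adjacencies seen in a peer's AS paths."""
    links = set()
    for path in as_paths:
        for left, right in zip(path, path[1:]):
            if left != right:  # ignore AS-path prepending
                links.add(frozenset((left, right)))
    return links


def jaccard(a, b):
    """Share of links that two peers have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_matrix(peers):
    """Return {(peer_x, peer_y): similarity} for every ordered pair of peers."""
    names = sorted(peers)
    links = {name: links_from_paths(peers[name]) for name in names}
    return {(x, y): jaccard(links[x], links[y]) for x in names for y in names}


# Hypothetical peers and paths, for illustration only.
peers = {
    "peer_a": [[64500, 64501, 64510], [64500, 64502, 64511]],
    "peer_b": [[64520, 64501, 64510], [64520, 64502, 64511]],
    "peer_c": [[64530, 64531, 64512]],
}
matrix = similarity_matrix(peers)
```

In this toy input, `peer_a` and `peer_b` share the two links beyond their first hop, while `peer_c` shares nothing with either, so it would show up as the light row and column in such a matrix.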
For instance, if you look at this one, which is very lightly coloured: the peer that you see here you also see here, because rows and columns are ordered the same way, and it is very different from the rest, so there are lighter colours there. And this is IPv4 and IPv6. There is a whole different story there: these are people who tend to be behind Hurricane Electric, and they are very different from these, which are behind Cogent, surprise surprise! But still, that's a type of diversity, right? And in v6, this is pretty well known.
But the fact that we actually have this suggests, to me, that as we further develop RIS we could look at how much diversity new peers bring, which I think might be an interesting way to further develop this.
And also for detecting BGP hijacks, as was suggested in an ARTEMIS tutorial: what systems like that, and also what Massimo is going to present right after me, need is a very diverse set of data to start with. So if we spend some effort on making RIS more diverse, we'll be able to see hijacks that you won't see in the current system. For globally visible events, you need only a single peer to see it, but for detecting local stuff, I think you need this diversity.
So, how? So this is my big ‑‑ I have some ideas here, I have talked to people in the hallway on how to actually do this. I got some good suggestions already. So, this is my big question. How can we actually do this with the least amount of effort.
Technical: BMP, add path. There are all kinds of things we could further develop there. We could focus on the multihop collectors, as they seem to provide the most diversity, and maybe have one per RIR region, or one for the Middle East; you name it, you can do a lot of things there.
But also, think about the incentives. For somebody who does not see the things that I mentioned, it's for the good of the Internet or I look better in the Internet rankings, what other value can we provide? We could analyse your table as compared to other people's tables: how are you different from others? We could do benchmarking, you name it. T‑shirts work well in IPv6 RIPEness, for instance. Or we could do specific targeting. We have a room here full of people who probably have really interesting peers. Can you maybe be ambassadors and get these people to also peer with RIS? We could go to the NOGs and do really specific targeting there. Or we could set ourselves targets, like all the state telcos or former state telcos, or one RIS peer in every country in our service region. You name it.
So, that's actually my question to all of you. How can we actually do this? So I would really like some input on that. There is four minutes on the mic, four minutes for this presentation and otherwise, my e‑mail address is there. I'll be here in the hallways, so, I would really like to get some suggestions on that.
We can't answer this "is this representative" question. It's a shame, but we can assess and steer where new peers add to our diversity, and my question is: do we want to go there? I personally believe we should, because that will make the system so much better. There is also an interesting quote there from Chris; I'm not going to read the complete thing out. But he also sees that for services like RIS live and the BGP alerter that Massimo at NTT did, you really need diverse vantage points for this to be even more useful.
So with that, I'll end and open up for questions to Florian or me.
JOB SNIJDERS: We have a few minutes, so, let's try and keep this concise.
AUDIENCE SPEAKER: Blake. Thanks for putting this together. RIS is super useful. Back in the day I used them as stratum 1 NTP servers. Two questions. One, do you take a feed from the route servers at the IXs as well?
FLORIAN OBSER: No, currently not. We don't know what ‑‑ how to represent that data.
AUDIENCE SPEAKER: And question two I forgot, so I'll let you go on and I'll think about it.
JOB SNIJDERS: One question per person.
AUDIENCE SPEAKER: Martin Levy, CloudFlare. You have a lot of data. Therefore in theory, you know where you don't have data from, and I wonder whether instead of the passive asking people will they feed, that you take a more active role and from your data, or from let's say common knowledge there are N, and I won't specify what N is, tier 1s in the world, that you actively go out and forcefully ask for a feed. You have become a de facto source of data. You should be very proud of that. We all use you, quite frankly. Now we want you to fill in the gaps. You have the data. Would you consider taking this more active aggressive role at finding your collection points?
EMILE ABEN: I would say yes. But...
AUDIENCE SPEAKER: What does your boss say? That's my second question, sorry.
JOB SNIJDERS: The Routing Working Group could use the PDP process to propose that everybody that sets up a session to the RIS collector gets a €50 discount per year on their LIR membership.
AUDIENCE SPEAKER: Andrei Robachevsky. As you mentioned, there are multiple route collection efforts, not only RIS. Have you analysed diversity for Route Views or PCH or other collections, and is there any mileage in combining those efforts to sort of address this issue?
EMILE ABEN: I know there is an effort under way, and there was a RACI talk earlier this week about that, so I suggest you look at that one. Or anybody else, by the way.
AUDIENCE SPEAKER: Paolo. I have a comment. In the first part of the presentation I was about to say that I see BMP missing, it was all BGP, and then the word BMP came up in one of Emile's last slides. Fantastic. So my comment is: in that last slide there was "BMP / add path", and I would just say that you really want to look into BMP. They are not on the very same line, because you can collect add path with BMP. There is a lot of work going on in the GROW working group at the IETF on BMP: we are adding TLVs, so we will transport BGP information and any metadata, vendor specific or whatever. So I totally think that to collect more information you want to go the BMP way all the way.
JOB SNIJDERS: All right.
AUDIENCE SPEAKER: Hi. Sebastian. In a similar way, since the RIPE Atlas team has been pushing the idea of having one Atlas probe in each ASN in the world, I would suggest picking up Martin's idea, but nudging rather than forcing people: it would be really nice to have a collector in X or Y place. I'm sort of biased on this, in that we have a very specific view of the world back in New Zealand. I cannot promise you a collector now, but we could talk about that. It would be really interesting for you to go and say, these are the places where we would appreciate having a collector. I think we can influence people to go and cooperate with you.
JOB SNIJDERS: A round of applause for our fellow researchers.
Thank you for the updates. So to now hook into an application of RIPE RIS live. I would like to invite my colleague Massimo to the stage to share an update on a new Open Source tool called BGP alerter.
MASSIMO CANDELA: Good morning everybody. I am Massimo. Today I will present the latest tool that we released as Open Source; it's called BGP alerter. This presentation also contains a quick tutorial. So if you have your laptop with you, I suggest you open it, because in a few minutes you will be able to monitor your prefixes in realtime.
So, why did we develop this? Well, the first reason is that we need it for our own prefixes, at least to monitor hijacks and visibility loss. I additionally added other monitors, and we will see something of that later.
And we decided to release it as Open Source, BSD 3, I would say almost as Open Source as you can get. We released it because we thought that somebody else may be interested in it, and also because by providing an easy way to monitor prefixes, we can make, I would say, the global Internet slightly better.
And I suggest you also connect to this repo now, because you will need it for the quick tutorial. This is it here, or you can just Google NTT BGP alerter. It works in realtime, so you will get alerts within seconds if something happens, and it is easy to use. We have our automation in place, so we really didn't need the auto configuration, but we thought it was going to be useful for other people who may have less automation. So we delivered the application with an auto configuration: you simply type your autonomous system and it will work out what it has to monitor.
This is an example of a Slack channel that we use. Of course the messages are fake; they are just a test. You will get alerts with various colours when something happens. I have it on my phone, Job has it on his phone. It is a great source of realtime alerts, and also a great source of heart attacks. But if you have the laptop ready, we can start with the setup. Really easy.
So the first thing is you go to the GitHub repository. There you have all the source code, which is completely Open Source, but if you are not interested in the source code, just go to the releases tab. You get the latest release, and you have the binaries. These binaries include everything, with no dependencies; you just have to run them.
So you download them. You make the file executable. And basically you are good to go. The installation is done; it's just an executable. If you run it now, it will tell you that there are no prefixes to monitor. So the next step is to tell it to auto configure: you run the command generate with -a followed by your autonomous system. You can put more than one autonomous system by giving a comma separated list. And -o is the output file, so you put prefixes.yml. I suggest you do that so you don't have to change anything else in the configuration and it will just run.
So what this script will do is work out what your autonomous system is currently announcing. It will generate a snapshot and it will check that you keep the same status in the future. Of course, if you have prefixes that don't have a ROA associated, a warning will appear and tell you to verify this configuration by hand, because you don't know if what we currently see is the status that you would really like to monitor.
So, the list generated in prefixes is something like this. These are Job's prefixes, but you will see yours in your case. It's just a block for each prefix. And that's all. Now you just have to run it. No parameters or anything: by default the application will start, load the config and tell you which prefixes it's monitoring.
So, now, quickly, I will just run through this. Essentially the idea behind this is that we deliver everything you need for basic monitoring, but if you want to do something more advanced or create your own monitor, you can do that. There are three components that you can combine: the connectors; the monitors, which analyse each single BGP update and generate alerts in case something is wrong; and then the reports, which receive these alerts and decide whether to aggregate them and how to dispatch them, to Slack or whatever.
This is an example of composition: the list of monitors that you want to run, where each monitor has a specific issue that it will monitor. If you implement another one, you just point to the file where it is.
Then there is the channel where the monitor will put the alerts, and other parameters if needed, like how many peers have to see that issue before they trigger an alert, and things like this.
After that you have a list of reports. By default it reports to file, so you will see logs. Here it lists the channels from which these alerts will come. So basically, a report takes alerts from these channels and dispatches them to file or Slack or wherever you want. You can do other combinations, where you dispatch some to Slack and visibility loss to files; you can do whatever you want.
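The combination of monitors, channels and reports described above can be sketched in a few lines. This is a simplified illustration and not the tool's actual code or class layout: the class names, the update dictionary keys (`prefix`, `origin`, `peer`) and the `threshold` parameter are assumptions chosen for the example. A monitor publishes an alert on its channel once enough distinct peers have seen the issue, and each report only receives alerts from the channels it lists.

```python
from collections import defaultdict


class HijackMonitor:
    """Hypothetical monitor: alerts on the 'hijack' channel when the origin AS is unexpected."""

    channel = "hijack"

    def __init__(self, expected_origins, threshold=2):
        self.expected_origins = expected_origins  # prefix -> expected origin AS
        self.threshold = threshold                # distinct peers needed before alerting
        self.seen_by = defaultdict(set)           # (prefix, origin) -> peers that saw it

    def monitor(self, update):
        expected = self.expected_origins.get(update["prefix"])
        if expected is None or update["origin"] == expected:
            return None
        key = (update["prefix"], update["origin"])
        before = len(self.seen_by[key])
        self.seen_by[key].add(update["peer"])
        # Alert exactly once, at the moment the threshold is reached.
        if before < self.threshold <= len(self.seen_by[key]):
            return f'{update["prefix"]} announced by AS{update["origin"]}, expected AS{expected}'
        return None


class ListReport:
    """Hypothetical report: keeps alerts in a list (a stand-in for file, e-mail or Slack)."""

    def __init__(self, channels):
        self.channels = set(channels)
        self.alerts = []

    def report(self, channel, alert):
        self.alerts.append((channel, alert))


def run(updates, monitors, reports):
    """Feed every update to every monitor and route alerts to subscribed reports."""
    for update in updates:
        for monitor in monitors:
            alert = monitor.monitor(update)
            if alert is None:
                continue
            for report in reports:
                if monitor.channel in report.channels:
                    report.report(monitor.channel, alert)


hijack_report = ListReport(["hijack"])
visibility_report = ListReport(["visibility"])
updates = [
    {"prefix": "192.0.2.0/24", "origin": 64666, "peer": "p1"},
    {"prefix": "192.0.2.0/24", "origin": 64666, "peer": "p2"},
    {"prefix": "192.0.2.0/24", "origin": 64666, "peer": "p1"},
    {"prefix": "192.0.2.0/24", "origin": 64500, "peer": "p3"},
]
run(updates, [HijackMonitor({"192.0.2.0/24": 64500})], [hijack_report, visibility_report])
```

Keeping the channel name as the only link between monitors and reports is what makes the combinations free: a new report or monitor can be added without touching the others.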
So, connectors. There was a presentation before, so I don't have to say anything about this. For now we use only RIS, but we plan to implement others. You can implement yours if you want. But RIS is amazing and it's free, it's huge. We don't use the MRT dumps; we go directly through the web sockets, so we get really fresh realtime data. And we deliver in the same application, for now, report by e‑mail, report on Slack and report on file. You can also attach whatever you want and deliver it on, because the report doesn't contain only the alert, it also contains the BGP messages that triggered it. So you can, for example, store them in your database, or put them on Kafka and replicate the issue later, or whatever you want.
And this is an example of a report on file. Just a simple log. It will tell you, oh, this prefix is announced by this AS but it should be another AS instead, or things like this. Or, we no longer see this prefix from a peer. So, logs like this.
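As a rough illustration of what a narrow WebSocket subscription looks like, the sketch below only builds the subscription messages and does not open a connection. The endpoint and the field names (`ris_subscribe`, `prefix`, `moreSpecific`) reflect the RIS live API as best understood at the time of editing and should be checked against its documentation; the function names and the `client` value are hypothetical.

```python
import json

# Assumed endpoint; verify against the RIS live documentation before use.
RIS_LIVE_URL = "wss://ris-live.ripe.net/v1/ws/?client=example-monitor"


def build_subscription(prefix, more_specific=True):
    """Build one subscription message for a single monitored prefix.

    Subscribing per prefix, instead of to the full stream, keeps the
    volume received by the client small.
    """
    return json.dumps({
        "type": "ris_subscribe",
        "data": {
            "prefix": prefix,
            "moreSpecific": more_specific,
        },
    })


def build_subscriptions(prefixes):
    """One message per monitored prefix, to be sent after the socket opens."""
    return [build_subscription(prefix) for prefix in prefixes]


subscriptions = build_subscriptions(["192.0.2.0/24", "2001:db8::/32"])
```

A client would send each of these strings over the open socket and then read JSON messages as they arrive.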
This is by e‑mail. And this is the slack that I showed you before.
Now, a bit more detail on the prefix list. By default, you will see these three parameters: description, autonomous system and ignore more specifics. They are the mandatory ones. Description is a human readable text that will be reported in the alert. Autonomous system is the one that is supposed to originate that prefix, or a list of autonomous systems. And ignore more specifics says whether or not you want to monitor sub‑prefixes.
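How the three mandatory parameters interact can be sketched as follows. This is an illustrative check, not the tool's code: the dictionary keys (`description`, `asn`, `ignore_more_specifics`) are hypothetical stand-ins for the parameters named above, and the prefixes and AS numbers are documentation values.

```python
import ipaddress

# Hypothetical in-memory form of a prefix list with one monitored prefix.
monitored = {
    "192.0.2.0/24": {
        "description": "example prefix",
        "asn": [64500],
        "ignore_more_specifics": False,
    },
}


def check_announcement(prefix, origin, monitored):
    """Return an alert string if the announcement has an unexpected origin, else None."""
    announced = ipaddress.ip_network(prefix)
    for configured, entry in monitored.items():
        covering = ipaddress.ip_network(configured)
        if announced.version != covering.version:
            continue  # never compare IPv4 with IPv6
        is_exact = announced == covering
        is_more_specific = not is_exact and announced.subnet_of(covering)
        if not (is_exact or is_more_specific):
            continue  # unrelated prefix
        if is_more_specific and entry["ignore_more_specifics"]:
            continue  # sub-prefixes are deliberately not monitored
        if origin not in entry["asn"]:
            return f'{entry["description"]}: {prefix} announced by AS{origin}'
    return None
```

For example, `check_announcement("192.0.2.0/25", 64666, monitored)` produces an alert because sub‑prefixes are monitored here, while the same call against an entry with `ignore_more_specifics` set to `True` returns `None`.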
And you have other optional parameters, like ignore, if you want to skip the monitoring of that prefix: there is this prefix, but ignore it. You also have group, if you want to dispatch to specific user groups. For example, if you have Slack channels, you deliver one set of prefix alerts on one channel, and on another channel, with another team of people, you deliver another set of prefix alerts.
And you have exclude: for example, if you have a prefix that is for experiments and you know that it's going to go up and down, and you want to exclude visibility, you can do that. The last part is for path matching, so you can get alerts when the AS path matches some regular expression, or doesn't match it, or exceeds some length that you set. This is a simple example, but you can do pretty complex stuff with this one.
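The three path conditions mentioned above (matches an expression, does not match an expression, exceeds a length) can be sketched like this. It is a generic illustration using Python's `re` module; the function name, the parameter names and the choice to represent the path as space-separated AS numbers are assumptions, not the tool's configuration syntax.

```python
import re


def check_path(as_path, match=None, not_match=None, max_length=None):
    """Return the list of reasons why this AS path should raise an alert."""
    text = " ".join(str(asn) for asn in as_path)
    reasons = []
    if match is not None and re.search(match, text):
        reasons.append(f"path matches {match}")
    if not_match is not None and not re.search(not_match, text):
        reasons.append(f"path does not match {not_match}")
    if max_length is not None and len(as_path) > max_length:
        reasons.append(f"path longer than {max_length}")
    return reasons


path = [64500, 64666, 64510]
through_unwanted_as = check_path(path, match=r"\b64666\b")
wrong_origin = check_path(path, not_match=r"\b64510$")
too_long = check_path(path, max_length=2)
```

The word boundaries (`\b`) matter: without them, a pattern for AS 6451 would also match AS 64510.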
Usually, I collaborate a lot with researchers, so I thought I would make it easier for them. Basically, all the components extend superclasses where all the rest is done. For example, you extend the monitor and implement the monitor method with your own logic, or you can use this as a simple way to filter BGP updates and pipe them into whatever other system you are doing research with.
And I have a quick example of, I just did it for fun and it's not a service, it's not anything, it's just an example.
I basically started listening to all the BGP updates, for both the v4 and the v6 address space, all of it, and for each originated prefix I validated it with the Cloudflare validator, which, by the way, I thank Louis and Vasco for, because this validator is fast. What I basically do is collect 5,000 and more BGP updates per second and validate all of them, and on average I have 5 invalid updates per second. I dump them in a kind of repository; it's just a page you can go to, where there are various log files in this format, divided by date. So you can see which BGP updates were invalid, and it tells you more information.
And well, basically, that's all. Please contribute on GitHub or open issues. A big thank you to these guys for helping me; we talked a lot to make things work better between RIS and this product.
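The validation step itself, deciding whether an announced prefix and origin are valid, invalid or not covered, can be sketched with the standard route origin validation logic of RFC 6811. This is a generic illustration and not the validator used in the talk; the function name and the `(prefix, max_length, asn)` tuple layout for ROAs are assumptions, and the ROA shown is made up.

```python
import ipaddress


def validate_origin(prefix, origin, roas):
    """Classify an announcement against a list of ROAs.

    roas: iterable of (roa_prefix, max_length, asn) tuples (hypothetical layout).
    Returns "valid", "invalid" or "not-found".
    """
    announced = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if announced.version != roa_net.version:
            continue
        if not announced.subnet_of(roa_net):
            continue  # this ROA does not cover the announced prefix
        covered = True
        if asn == origin and announced.prefixlen <= max_length:
            return "valid"
    # Covered by at least one ROA but matched by none: invalid.
    return "invalid" if covered else "not-found"


# Made-up ROA: 192.0.2.0/24, maximum length 24, origin AS64500.
roas = [("192.0.2.0/24", 24, 64500)]
```

An announcement is only invalid when some ROA covers it and none matches; prefixes without any covering ROA fall into the not-found state rather than being counted as invalid.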
And well, that's all. If you have questions. I see already you have a question.
AUDIENCE SPEAKER: Good morning, Chris Buckridge from the RIPE NCC, and there is a comment or a question in the chat.
McHale Klava, with no hats on, asks would you consider offering this BGP monitoring service as a hosted solution for others?
MASSIMO CANDELA: That's a good question, I don't know what should I answer? Maybe.
JOB SNIJDERS: We can definitely consider.
MASSIMO CANDELA: We can consider that, yes. It is indeed, it would be cool. But... I have to ask to somebody higher in command.
JOB SNIJDERS: We need to ask our manager for approval to spend time on that. But yeah, it's crossed our mind.
AUDIENCE SPEAKER: Ivan Bench. IG. You mentioned being able to send messages to Slack. Is there a configurable way to send them to a locally hosted equivalent?
MASSIMO CANDELA: Okay, the reporter does that; it essentially uses AJAX queries or webhooks. If Mattermost supports the same, and I believe it does, then, since the code is Open Source, you essentially have to change the JSON format and the end point, and you create your own reporter. If you do that, since Mattermost is used a lot anyway, you can do pull requests so other people can enjoy that. But it's really easy, and ‑‑ yeah ‑‑
AUDIENCE SPEAKER: Will, so basically I did what you suggested to do, take my laptop and try it and so I just installed it. It works, thank you. So basically, hosted service would be great, but it was so easy to do that I ‑‑ yeah, maybe it's not so needed. So it's really cool and thank you for your work.
MASSIMO CANDELA: Thank you very much for this. I'll just add one thing, just in case. If you are worried about the resources for hosting it, it's really lightweight, both in terms of CPU and memory. That was done on purpose. And when it connects to RIS, it takes whatever you have to monitor and optimises the subscription accordingly. So you will not get the full stream, unless it's really needed because you did some crazy complex query. In general, it gets only the minimum. Sorry.
AUDIENCE SPEAKER: Robert. Thank you very much for this work, it's excellent. Speaking a bit for the back end of this, the RIS live part: the team is about to declare that production. It hasn't happened yet. So, you know, if 400 of you jump on it, it may behave differently from what we expect. But there are levels to this. If someone is actually trying to read from the fire hose on RIS live, and 4,000 people do that, that's quite expensive. On the other extreme, what we already have is filtering: this is my prefix, this is my AS, let me know if something happens. Awesome. What the team is also working on is: this is my AS, this is my prefix, let me know if someone else announces it. That seriously cuts down the amount of traffic. Once we get there, we would like to work together with you to integrate that into the system. We will try to keep a close eye on how the system behaves, and I wish more people used it so that we know it can handle the load. If it can't, we will do something about it.
And finally, it is entirely possible that the NCC can take up such a hosted service. After all we are there to serve the community. So if that is a wish from the community, let us know and we will look into it.
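The filtering Robert describes can be sketched as follows. This is a minimal illustration, not code from RIS Live or from the tool: the message shape (`ris_subscribe`, `prefix`, `moreSpecific`, `type`) follows my reading of the public RIS Live documentation and should be checked against it, and the helper names and the client-side "someone else announces it" check are hypothetical.

```python
import json


def ris_subscribe_message(prefix, more_specific=True):
    """Build a subscription message for one prefix.

    Assumption: field names follow the public RIS Live documentation.
    """
    return json.dumps({
        "type": "ris_subscribe",
        "data": {"prefix": prefix, "moreSpecific": more_specific, "type": "UPDATE"},
    })


def origin_of(as_path):
    """Return the origin AS(es): the last path element, which may be an AS set (a list)."""
    if not as_path:
        return set()
    last = as_path[-1]
    return set(last) if isinstance(last, list) else {last}


def announced_by_someone_else(as_path, my_asn):
    """Hypothetical client-side check: True if the origin is not my AS."""
    origins = origin_of(as_path)
    return bool(origins) and my_asn not in origins


# Example with documentation-range values.
message = ris_subscribe_message("192.0.2.0/24")
suspicious = announced_by_someone_else([64500, 64502], my_asn=64501)
```

Subscribing per prefix keeps the stream small on both ends; the origin check is the part that would move to the server side once the feature Robert mentions exists.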
MASSIMO CANDELA: Thank you Robert.
AUDIENCE SPEAKER: Randy Bush: I'm with Will. Thanks. I have been using it for, I don't know, weeks, and it rocks. It's also very easy, as Will says, to set up. But scaling: does it put a wider load on the NCC's servers if we aggregate users? What's the tradeoff space? Personally I don't like outsourcing security, so I'm real happy to run it. In fact, I might ask to be able to feed it myself, instead of using BGPStream.
MASSIMO CANDELA: So, the load, as I said, we tried to optimise it a lot in initial collaboration with RIS, and also the subscriptions are going to be minimum. In general, it is less than a megabit for sure ‑‑
RANDY BUSH: Traffic is cheap. We charge for it, spend 10 megabits, great. It's the load on the back end in RIS that I worry about.
MASSIMO CANDELA: For now we did, well, some tests in collaboration with Christopher and it looks able to hold it for now, with multiple instances, but we really tried hard to reduce the load in the subscription phase. I don't have a number to give you on exactly how many users, but it's also the phase that we are in now. Basically, we developed this because I used to work for RIPE before, so I knew that that was going to happen, and we collaborated and it went out, but it's still not official, as Robert says, on the service end. But if you start using it, if many people use it, it works. So, we'll see what happens. If you want to load your own data, you can create a connector, and also in that case it's really easy: you have a method that's called publish that you call every time that you have ‑‑ you can use whatever format also. You have a transform method that allows you to transform it into JSON at some point. I don't know if it answers your question.
JOB SNIJDERS: Definitely, the tool is specifically designed under the assumption that RIPE's RIS Live API would not be the only thing that feeds it but that other things can feed into it as well.
MASSIMO CANDELA: That's why you have connectors.
JOB SNIJDERS: Massimo. Thank you very much for your time and your software.
Next up we have a contributor that came through the RACI programme, on the usage of contemporary GBLs in Internet multipath routing.
JIE LI: Sorry, we changed the title of the slides just now. I am a Ph.D. student currently. Thanks to RACI for granting me this opportunity to present our work on GBL. This work is a cooperation among these universities.
What is a GBL? It's a group of border links that are used between an AS pair for the routing between the same host pair.
This work is carried out under the background of multipath routing. As we all know, there have been some old assumptions in this area, for example, there is only one single inter‑domain path between a pair of hosts, and if multiple paths were observed, they were likely to be caused by measurement errors, misconfigurations, or dynamic routings.
But some recent works have shown that the usage of multipath routing can be legitimate, for the purpose of implementing multipath BGP or load balancing. And more recent research has shown that there are some periodical patterns in these changing paths.
The aim of our research is to get to know the practical usage of border links among some top transit providers.
Okay, the key question for our research is: what are border links? A border link is the IP‑level link between two routers that belong to different ASes. The problem here is that the egress interface is actually invisible in traceroute data, so it would be difficult to infer. So we use a replica of a border link, which is two consecutive ingress addresses of border routers. And these are some terms we used in our research.
A host pair consists of the IP addresses of both source and target hosts. An AS pair is the two adjacent ASs along the AS‑level routing path between a pair of hosts, and we describe them as the near‑side AS and the far‑side AS along the routing path.
And here, a GBL is a group of border links used for the same AS pair for the routing between the same host pair.
Here are some examples of GBL cases. These cases were all observed for the same host pair. For the first AS pair along this AS path, we observed two border links, and they were used with different usage rates. This would be the first case for this host pair. And for the second AS pair, we observed more than three border links, and this would be another case of GBL.
Seeing as we only observed one border link for the third AS pair, we wouldn't consider it as a case of GBL, so we observed two cases of GBL for this given host pair.
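The definition above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: the function name `find_gbls` and the tuple layout (host pair, AS pair, border link as two consecutive ingress addresses) are hypothetical, and the addresses and AS numbers are documentation-range placeholders.

```python
from collections import defaultdict


def find_gbls(observations):
    """Group border links by (host pair, AS pair); a GBL needs at least two links.

    observations: iterable of (host_pair, as_pair, border_link) tuples, where
    border_link is a (near_ingress_ip, far_ingress_ip) pair (assumed layout).
    """
    links = defaultdict(set)
    for host_pair, as_pair, border_link in observations:
        links[(host_pair, as_pair)].add(border_link)
    return {key: sorted(found) for key, found in links.items() if len(found) >= 2}


host_pair = ("198.51.100.1", "203.0.113.1")
observations = [
    (host_pair, (64500, 64501), ("192.0.2.1", "192.0.2.9")),
    (host_pair, (64500, 64501), ("192.0.2.5", "192.0.2.13")),
    (host_pair, (64501, 64502), ("192.0.2.21", "192.0.2.25")),  # single link: not a GBL
]
gbls = find_gbls(observations)
```

Only the first AS pair qualifies here, mirroring the slide where the third AS pair, with a single border link, is not counted as a GBL.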
These are some research questions we need to answer for this study. First, we need to know how to identify these cases of GBL from traceroute data, and we also want to know the frequencies and connection patterns of the border links used in a GBL.
Finally, we want to find if there are any periodic patterns on the usage of these GBL.
So to answer those questions, we conducted traceroute measurements on the RIPE Atlas platform. We studied the top 50 ASs according to CAIDA's AS Rank data and, at the time we started our measurements, there were 30 ASs among these top 50 hosting RIPE Atlas probes, and we chose one probe in each AS to issue traceroute queries. The queries were issued between each pair of these hosts every five minutes for 56 days. We obtained more than 16,000 measurements, and this figure shows a part of the default settings on RIPE Atlas. All the measurements and analysis are based on IPv4.
After obtaining the traceroute data, we used a two‑step methodology to identify the border links. We used bdrmap for IP‑to‑AS mapping, and we used MIDAR for IP alias resolution.
Applying this methodology to the traceroute paths, which include about 1200 IP addresses, we used IP‑to‑AS mapping and obtained 249 border links with 267 IP addresses. We then used IP alias resolution to reduce the number of border links to 227. These border links were observed between 242 border routers.
We calculated the usage rates of these border links and we can see that around 36% of the usage rate values are less than 1%. Conservatively, we only consider the border links with a usage rate larger than 1%.
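The 1% cut-off can be illustrated with a short sketch. The talk does not define the usage rate formally, so this assumes it is the share of a host pair's traceroute measurements in which a given border link was used for the AS pair; the function names and sample counts are hypothetical.

```python
from collections import Counter


def usage_rates(link_per_measurement):
    """Return the usage rate of each border link in percent.

    Assumption: one border link is recorded per traceroute measurement.
    """
    total = len(link_per_measurement)
    if total == 0:
        return {}
    counts = Counter(link_per_measurement)
    return {link: 100.0 * seen / total for link, seen in counts.items()}


def significant_links(rates, threshold_pct=1.0):
    """Keep only links whose usage rate is strictly above the threshold."""
    return {link: rate for link, rate in rates.items() if rate > threshold_pct}


# 1000 hypothetical measurements over three links.
samples = ["link-a"] * 700 + ["link-b"] * 295 + ["link-c"] * 5
rates = usage_rates(samples)
kept = significant_links(rates)
```

With these made-up counts, `link-c` is seen in 0.5% of measurements and is dropped, which is the conservative filtering described above.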
Finally we got 121 border links, which were used between 13 AS pairs for routing between 13 different host pairs.
These border links were identified in 22 cases, and in each case, different numbers of border links were involved ranging from 2 to 32.
We also studied the ASs involved in these cases. These ASs were either the source or the target of the host pair, or the near side or far side of the AS pairs. There are about 13 ASs involved.
And we can see that GTT is involved in 19 of the 22 cases. So it seems that GTT uses GBL extensively for exporting or importing its traffic. We haven't gone deeper into the reason for this, so we would love to hear your views on it.
Okay. We validated our cases of GBL using Looking Glasses. We validated the true positives by querying the publicly accessible Looking Glasses in three or four near‑side ASs in our GBL cases.
All of them confirmed the usage of multipath eBGP, which means that 13 out of the 22 cases shown in our dataset were confirmed, and we also validated the true negatives by querying destinations where no GBL was observed.
The results showed that none of the routes were denoted as multipath external, which means that we didn't miss any potential GBL cases in our traceroute data.
We classified our GBL cases into different types and patterns according to the usage rates and the connection patterns.
In Type I cases, the difference between the top two usage rates should be larger than 20%, and in Type II cases, the difference between the highest usage rate and the lowest usage rate should be smaller than 5%. The other cases fall into Type III.
We observed that in each case of GBL, the border links started from different border routers. To be specific, in ten cases the border links ended at the same border router, and we classified them as pattern A. In the other 12 cases, the border links ended at different border routers, and they were classified as pattern B.
Here are some examples of the types and patterns we used. In the left case, only two border links were observed, and we can see that the difference between the usage rates is far larger than 20%. So, it's a Type I case. And because the border links ended at different routers, it's pattern B. In the right case, we actually observed nine border links for this AS pair; we only show the three links with the top usage rates. The highest usage rate is around 31% and the second is around 18%. So, the difference is less than 20%, and we can also see that ‑‑ sorry, here the lowest rate is around 4%, so the difference between the highest usage rate and the lowest usage rate is larger than 5%, so this case falls into Type III. And because all the border links ended at the same border router, it's a pattern A case.
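The type and pattern rules can be written down directly. This is a sketch of the thresholds as stated in the talk (20% and 5%), not the authors' implementation: the function names and router labels are hypothetical, and the 85/15 split for the left case is an invented stand-in for "far larger than 20%".

```python
def gbl_type(rates_pct):
    """Classify a GBL by usage rates (in percent), using the 20% and 5% thresholds."""
    ordered = sorted(rates_pct, reverse=True)
    if len(ordered) < 2:
        raise ValueError("a GBL needs at least two border links")
    if ordered[0] - ordered[1] > 20:
        return "Type I"    # one link clearly dominates
    if ordered[0] - ordered[-1] < 5:
        return "Type II"   # all links used at similar rates
    return "Type III"      # everything else


def gbl_pattern(border_links):
    """Pattern A if all links end at the same far-side router, otherwise pattern B.

    border_links: iterable of (near_router, far_router) pairs (assumed layout).
    """
    far_routers = {far for _near, far in border_links}
    return "A" if len(far_routers) == 1 else "B"


left_type = gbl_type([85, 15])        # invented rates for the left example
right_type = gbl_type([31, 18, 4])    # rates quoted for the right example
right_pattern = gbl_pattern([("r1", "x"), ("r2", "x"), ("r3", "x")])
```

For the right-hand example, 31 - 18 = 13 is below 20 and 31 - 4 = 27 is above 5, so neither of the first two rules applies and the case falls through to Type III.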
This slide shows a study of a Type III, pattern A case where nine border links were observed for the AS pair from GTT to AS3, for the routing between this given host pair. The top two figures show the appearance of border links at each time point with different intervals of data. Note that the 15‑minute interval data was sampled from our five‑minute interval measurements. We can see that the five‑minute interval measurements can actually provide more details than the 15‑minute interval data, and for each link, the usage of border links actually appears stochastic, without any apparent pattern.
And we also plotted the daily usage rates of these links. We can see that when border links are used with high rates, they are actually used with persistently high rates, and the links with low rates are used with persistently low rates.
We also plotted the number of appearances of these links as a function of the hour and the day, as measured over the 56 days. We only show the plots for the two links with the top usage rates. We can see that there is no apparent periodical pattern here. All these conclusions apply to the other 21 cases in our data.
So in conclusion, our observations have confirmed the usage of groups of border links for the routing between the same host pair. Given the relatively small scale of the measurements in our data, we would say that we have observed many cases, and our study also suggests that five‑minute interval traceroutes could be a fine granularity for network operators to do measurements on their own network to check the performance.
And there are some applications of our research. For example, we can get to know how resilient a network is to the failure of border links, or we can find situations where different border links were preferred when two ASs were peered. This may involve conflicts of interest.
As future work, we will include more measurement data for the top 200 ASs; the sources include RIPE and CAIDA, and we will do periodicity analysis. Currently we have started doing the analysis on the 15‑minute interval RIPE anchor dataset. We look forward to hearing your views on how to set the thresholds, like the values of 1%, 20% and 5%, and we also would like to hear your comments on how to improve this.
Thank you. Any questions?
JOB SNIJDERS: I have one question myself. Can you go back a few slides to the listing of the AS numbers where I think GTT was fairly high up.
Why is NTT not in this list? AS2914?
JIE LI: These are only the ASs involved in the cases, but it doesn't mean NTT is not in the measurement.
JOB SNIJDERS: I see. Because it's a matter of providing ground truth: I can confirm that we use many different paths between the same pairs, so I was kind of expecting to see us there. But thank you for the verification. Thank you.
IVAN BEVERAGE: Ivan Beverage, IG. I was just sort of thinking it would be useful to have sort of feedback from the carriers to understand why you were actually seeing the stats. That was it.
SPEAKER: Thank you. Thank you.
Next up, we have a presentation from my friend Nathalie Trenaman on how trustworthy RIPE's trust anchor is, and I am kind of hoping the answer will be a hundred percent, but who knows.
NATHALIE TRENAMAN: Good morning, my name ‑‑ good afternoon, my name is Nathalie Trenaman, I am the routing security programme manager at the RIPE NCC and today I would like to share with you some of our plans that we have with the RPKI core infrastructure for the next year or year‑and‑a‑half.
You might have seen, if you actually read the budget of RIPE NCC, that there is an increase in the budget for RPKI, and I hope with this presentation, I shine some light on why we needed this extra budget.
So, in the last two years, we have seen significant uptake in all aspects of RPKI. We have only in our region now more than 12,000 certificates ‑‑ no, that's not true ‑‑ 12,000 ROAs, almost 10,000 certificates. So, and every week we hear about organisations that start doing origin validation.
So the last thing that we can use right now is a big outage on RPKI. I think you agree with me. So, we sat around the table and we thought okay, what can we improve? What can we do there? We identified three main risk areas.
Operations. Technical, and trust.
And I'll explain very briefly each of them.
If you look at the technical infrastructure, we have quite a good track record in terms of uptime. For the repository, it was clearly a hundred percent. For the core, it was a little bit lower; that is because we had two planned maintenance windows, one in May, one in July. And those consisted of a planned outage of the LIR portal, the REST API and the up/down protocol. If you have read the notifications you will know that tomorrow, Friday, and Saturday, we also have a planned outage for the LIR portal, which means that you cannot create or update ROAs through the LIR portal tomorrow, for one hour I think, and Saturday. Something like that.
So, that's that.
Uptime is good. Down time was only due to scheduled maintenance. We had the down time for the REST API and for the up/down protocol because we replaced servers this year.
Then, on the operational side, we want to improve the knowledge of the RPKI core. We had some staff changes, which is normal. But this is very specific knowledge, so it takes time to build up that knowledge again and to make sure that it is transferred in a good way, and that means good documentation as well. So we have to spend quite some time and effort on that as well.
At the moment, we have got three technical teams operating RPKI. We have got the software engineering team, the IT department and the security officers, and what we are starting to do now is build a DevOps team to have more interaction between the software engineering team and the IT department, so that goes a bit more smoothly. So that's also to improve the quality of that.
This is a nice one: the procedures and processes. We had an interesting one. We have a lot of very strict processes and procedures, and we are the only RIR that provides 24/7 support on the RPKI repository, and the LIR portal for that matter. So, I thought in February it might be a good idea to check what would happen if I called the 24/7 help hot‑line after hours, and see if they had even heard of RPKI. Because in the past, you might know that, on the RIPE website, we were mentioning certification and not the word RPKI. I know that a lot of these guys work with scripts, so, if they hear a new word, they might not recognise it as such and don't know where to start in the process. So I did that. I called the 24/7 number. We did indeed discover that there could be some progress there. All the dashboards were green, of course, so I had to tell them that it was really broken, and persuade them to get me to a department, which they eventually did, but it was the wrong department.
So it's always a good thing to actually, if you have those procedures, to occasionally also check them to see if they actually work.
All those procedures have been revamped, tested again, tried again, so for now that is good. I'll try again in February.
Also, in the current RPKI structure, RIPE NCC is a trust anchor. And that means that we sign our own certificate authority. And this means that we tell the world to trust us and that we're doing the right thing. But are we? That's a fair question. And I want to have a very crystal clear answer to that.
So I think it's a wise idea to have a third‑party to look at our code to do an assessment there, but not only the code, I plan to do a full risk and security assessment to identify the weak points and find areas where we can improve. Because the ultimate goal is of course to have a rock solid, stable and resilient core for RPKI and that's basically it for me for now. I will be back. Berlin might be a bit too early, but definitely the meeting after, to give an update on what we found and what we did and, if you have any questions about that, do please let me know. Thank you.
RANDY BUSH: Thank you for your diligence. Just to reassure people while you're wanting to do more to ensure service: this whole technology was designed by crypto and security and RIR and vendor people, and then some actual operators who enable it on their routers, and so should you go out for a day or two, as other RIRs have, things will keep working. No FUD. So, that should not discourage you from trying to keep it operationally good. But no need to panic.
NATHALIE TRENAMAN: That's true, and thanks for clarifying that. It's very much a misconception that the world will stop spinning if RPKI goes down. That's not true. Everything reverts to unknown. Now that's not good, we know that, but it's not invalid. That's very important to understand. Still, yes, Randy...
RANDY BUSH: Even more than that. The validators, or at least ones that I am aware of, will use stale data if they can't get fresh data.
NATHALIE TRENAMAN: Yes. Thanks.
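The "reverts to unknown, not invalid" point follows from how route origin validation is defined in RFC 6811. The sketch below is a simplified illustration of that procedure, not any validator's actual code: the function name and the tuple layout for validated ROA payloads (VRPs) are assumptions, and the prefixes and AS numbers are documentation-range placeholders.

```python
import ipaddress


def validate_origin(prefix, origin_asn, vrps):
    """Simplified RFC 6811 origin validation.

    vrps: iterable of (vrp_prefix, max_length, asn) tuples (assumed layout).
    Returns "valid", "invalid" or "not-found" (the "unknown" state).
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for vrp_prefix, max_length, asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # Only VRPs of the same address family that cover the route are relevant.
        if route.version != vrp_net.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        if asn == origin_asn and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


vrps = [("192.0.2.0/24", 24, 64500)]
with_data = validate_origin("192.0.2.0/24", 64500, vrps)
hijack = validate_origin("192.0.2.0/24", 64501, vrps)
without_data = validate_origin("192.0.2.0/24", 64501, [])
```

With an empty VRP set every route is `not-found`, including the one that would otherwise be `invalid`, so losing the data removes protection without causing routes to be rejected. Validators that keep serving stale data, as Randy notes, avoid even that.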
JOB SNIJDERS: I would like to second Randy's comments that I very much appreciate RIPE NCC's efforts to ensure that this operational service is of the highest quality. It makes perfect sense the trajectory you are on. So thank you for that.
NATHALIE TRENAMAN: I see Rudiger ‑‑
RUDIGER VOLK: I certainly appreciate the work that the RIPE NCC is doing. I certainly appreciate that we are now seeing uptake in the community and some noise about it. Nevertheless, I think there is still some way to go, and kind of developing a full understanding in all the parts where it's needed certainly is going to take a little bit of time. So, I think that the ‑‑ well, okay, not only ‑‑ verifying that your stuff is good by a third‑party is fine. I think however that is not sufficient. Kind of the published details of how you are operating, I think, should be at a state where it is actually publishing all the relevant things, and typically and formally, that disclosure is done in the CPS. I would guess the fact that the current CPS is very many years old, without looking at any details, proves that it cannot be the current information.
NATHALIE TRENAMAN: Correct. And I am already in the process with legal and Athina and the software engineering team to do a full review of that. Yes.
JOB SNIJDERS: I'm going to cut you off, Rudiger. Sorry.
RUDIGER VOLK: One short ‑‑
JOB SNIJDERS: You talked for three minutes. We have one more presentation to go. Your point has been taken. I am sorry, we need to stay within the timeline.
NATHALIE TRENAMAN: We'll talk further Rudiger. Thank you.
AUDIENCE SPEAKER: Peter Hessler: You mentioned that some of your downtime outages were due to the servers being upgraded. What sort of redundancy, and especially geographical redundancy, do you currently have and plan to have in the near future?
NATHALIE TRENAMAN: There are two things. First of all, we have the repository of course, with the 100% uptime, and that's because it's quite a distributed system. We have got RRDP, and that runs on Amazon clusters, but then we have rsync, and that runs on local servers. So, that explains why there is a hundred percent. Then if you look at the core infrastructure, I think there are four or five servers, where two of them were old hardware, basically, that we replaced by virtual machines. First one, then the other one. And then also the other down time was because of the LIR portal, so you can't really always avoid it and have ‑‑ yeah, everything still keep running.
JOB SNIJDERS: Thank you Nathalie.
Our final presentation today:
ANDREI ROBACHEVSKY: Thank you Job. Hi everyone. This is another tooling presentation, but this project is in pretty early stage, so, rather than offering you something, this project is actually asking for your feedback and contribution.
So, validating MANRS. It's MANRS, I will not go into detail. We have a table outside if you have more questions, please come to our table, get some goodies as well. Essentially what MANRS does, it specifies four major requirements that network operators need to use. That's filtering, anti‑spoofing and these.
What's important for an effort like this? It's not just a web page, and for the credibility of this effort, transparency is important. It's also important that networks that are joining this effort are standing by their commitment in the long term. So there is assurance that indeed those requirements are implemented by those networks; that's the only way an effort like this can make an impact, and that's the only way an effort like this can gain credibility.
So, we launched this MANRS Observatory for this reason. Partly to look at how those networks look from the outside, how well they are performing as it sort of relates to MANRS requirements. But you can only see so much from the outside. For instance, for filtering, if you don't see any incidents in which this network is implicated, you cannot assume that the proper controls are in place, so I think it's important to also be able to look at how the network looks from the inside. But also for engineers, I think it's useful if they can analyse their configuration and figure out if their configuration actually meets the requirements of MANRS before joining this effort.
So that's why this sort of idea came up, with this project vision to create a locally running tool that will analyse your router configurations and will report on any shortcomings compared to MANRS.
MANRS is just one baseline, and you can think of other baselines that we can create if you have this framework in place. Also, we started sort of thinking, well, let's start with one configuration, but we can extend to other platforms and other vendors.
So, when talking about checks, what kind of checks can we do?
Well, if you look at, like, the filtering action, right, so, what kind of checks? Are inbound routing advertisements from customers and peers secured by applying prefix‑level filters? Is the router configured to use the RPKI‑to‑Router protocol for ROA validation? For anti‑spoofing you can check: are there uRPF statements for the particular interfaces that are customer‑facing? Are there ACLs applied on those customer interfaces to prevent IP address spoofing?
Those are relatively easy sort of questions that we can figure out by looking at the configuration. For the more difficult checks, we don't have an answer yet on how to implement them. It would be useful to implement them as well. Are prefix‑level filters dynamically applied from the IRR entries, or do prefix filters match the customer cone, however you define the customer cone? And with anti‑spoofing, for instance, do the ACLs correctly match the customers' network blocks? These are more difficult questions, and they require knowledge of some configuration, pre‑configuration of your network topology.
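Two of the easier checks can be sketched as plain text matching over a configuration. This is a hypothetical illustration, not the prototype's code and not a complete MANRS test: it assumes a Junos configuration in "set" format, the function names are invented, and the list of customer-facing interfaces is metadata the operator would have to supply.

```python
import re


def has_rtr_session(config_lines):
    """True if an RPKI-to-Router session is configured (Junos 'set' format assumed)."""
    pattern = re.compile(r"set routing-options validation group \S+ session \S+")
    return any(pattern.match(line.strip()) for line in config_lines)


def interfaces_without_urpf(config_lines, customer_interfaces):
    """Return customer-facing interfaces that have no rpf-check statement."""
    pattern = re.compile(r"set interfaces (\S+) unit \d+ family inet6? rpf-check")
    protected = set()
    for line in config_lines:
        match = pattern.match(line.strip())
        if match:
            protected.add(match.group(1))
    return sorted(set(customer_interfaces) - protected)


# Made-up configuration fragment using documentation-range addresses.
config = [
    "set routing-options validation group rpki session 192.0.2.10",
    "set interfaces ge-0/0/1 unit 0 family inet rpf-check",
    "set interfaces ge-0/0/2 unit 0 family inet address 198.51.100.1/30",
]
rtr_ok = has_rtr_session(config)
missing = interfaces_without_urpf(config, ["ge-0/0/1", "ge-0/0/2"])
```

The harder checks Andrei lists, such as whether prefix filters match the customer cone, cannot be answered this way because they need data from outside the configuration.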
So, before going into sort of deep detailed requirements: actually this idea came from one of the MANRS participants, from Charter Communications, and what was sort of developed there is a small prototype as part of the hackathon. It uses Robot Framework, that's a framework to do automated testing, and it was used for an automatic router configuration analyser. It uses a single high‑level platform tool and it is sort of expandable; it can be used for other platforms and other vendors. They started with Junos, that's the prototype that is currently available. There is a GitHub repository open to everyone, so you can look at this. And it produces graphical reports to make it easier to understand and act.
So that's how the current command line interface looks. You run this on your router configuration, it runs those checks and lets you know whether you failed or not.
And a more interesting sort of extended web interface that allows you also to see what shortcomings, what needs to be fixed to make your configuration conformant.
That was a very rough prototype. I think it was developed in one day. It supports Junos, it performs only a few tests, not the full sort of MANRS suite, and it also doesn't answer more complex questions about sort of the whole infrastructure that allows you to automate your routing configuration.
So, what needs to be done to get from this sort of prototype to a tool? Some better testing is needed. We need to verify that the results that this tool produces are results that can actually be useful and used in network operations. And also verify that this tool is useful, that people can and will use this tool.
And if that is the case, then I think we can work together and increase support for this tool, create sort of a library of different configurations and operating system versions to be able to sort of cover a wide range of possible platforms being used in network operations.
And then of course, share this tool to encourage to use this. We can ‑‑ I mean you as network operators, we can also think of this being an auditing tool back to the MANRS initiative, sort of establish some form of you not just submitting your form but you also submit this result of the tests you run on your router and that would indicate the level of conformance with MANRS requirements.
So, that's sort of what I wanted to present. And I wanted your feedback, maybe not now immediately. But immediately is appreciated as well. Do you think something like this is useful? And would you like to contribute? So if you have any ideas to share now, please, if not, you can contact us or you can come to the table outside and of course MANRS participants are aware of this, so that's also appreciated any contribution here. Thank you.
AUDIENCE SPEAKER: Peter Hessler: Your tool, is this primarily analysing the configurations of the router? Or is it sending live traffic?
ANDREI ROBACHEVSKY: No, it's just the input is your router configuration and maybe some metadata and then the output is how this configuration matches the requirements.
PETER HESSLER: Okay. So this does sound like quite an interesting project, both to tell MANRS that we are compliant but also for operators to actually go, are we compliant? Do our configurations match what we hope we're trying to match? So this does sound like a very sound project.
JOB SNIJDERS: All right. Thank you Andrei.
So we have two minutes left, which brings us to the any other business part of our agenda.
Is there any other business that somebody wants to contribute? Benno.
BENNO OVEREINDER: Thank you. First of all. Thank you Job and Paul for standing up as Working Group Chair, but I also want to think about ‑‑ take this moment to think about the next generation of leadership. And I think the PC is a great thing to introduce young new professionals into the RIPE community, but I also think the Working Groups could fulfil a role.
So I am asking you publicly now, maybe we could discuss it later, can we think of two experienced Working Group Chairs in the Routing Working Group and one young person as a trainee intern, fellow? I am happy to talk to that with you in more detail about how to implement that. But I just want to call‑out this publicly.
JOB SNIJDERS: Speaking as myself, now as new co‑chair of this Routing Working Group, I wholeheartedly agree. I think it would be fantastic if I can use the next two RIPE meetings to teach a younger person what it means to be a Chair in this Working Group, and then hand it over.
But we have some work to do in terms of how we select chairs. I think in an updated version of how we select chairs, it would be good to emphasise a degree of diversity, to emphasise the notion of having a fellow or a trainee and ‑‑ I think it would be good if there is a degree of churn in the leadership of this Working Group.
All right? I think that's it. Thank you for attending the Routing Working Group session. See you in roughly six months at RIPE 80. Enjoy lunch.
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC