15 October 2019
DMITRY KOHMANYUK: Hello everybody. So we are ready to start, I hope you had a good lunch. I am moderating the session together with Pavel Lunin, and this is Giovane Moura, who will talk about TTLs in DNS and how they work. Thank you.
GIOVANE MOURA: Can you hear me? So, yeah, good afternoon, everybody. Thanks for being here; I hope everybody had a good lunch and you are not falling asleep.
Now, this is actually a paper I am also going to present next week in Amsterdam at the IMC conference: Cache Me If You Can, research done together with USC/ISI, UPF and SIDN Labs. Just a bit of context before we move forward with this talk. We are a team of researchers from different organisations who have been working together over the last several years on DNS security and stability. These are a bunch of papers we have done to improve DNS stability and security. We had another paper in 2017 looking specifically at the resolvers and how things can be changed there, and my colleagues (I was not involved in the third study) also looked at Anycast engineering, so the references are here. Last year we looked at caching on the resolver side, and when I presented that paper to the Ops team at SIDN, the ccTLD operator for the Netherlands, they said: it's nice to know TTLs are important, but which should I use? That's this talk, this paper, and it's going to be presented next week in Amsterdam.
So, before I start, how many DNS folks are in the room here? Can you raise your hand, please. I would like some feedback here. But even if you are not working with DNS, this is about network performance. We have here a user that wants to send a query to a resolver, say for google.com; the resolver will forward that query and the answer comes back. But the resolver, this server that does the resolution job on behalf of the user, can be optimised by having a local cache. If you have that, the next time another user in the same network queries the resolver for the same thing, it can get an answer straight from the cache. So you avoid having to go to the authoritative side here and you get a way faster answer. Having a cache hit, having a cache in the resolver, actually improves performance.
The question is, for how long should you cache an answer that you have fetched before? That is actually answered by the TTL, the time to live. The TTL pretty much allows the authoritative server, in this example here Google's, to signal to the resolvers and their caches for how long they are allowed to cache that record. Because Google, or anyone else, may want to change the values of their records, so it's a way for them to signal that. And caching is very important for performance and improves user experience a lot. If you register a domain, you have to choose a TTL, there is no way around it. You see here the TTLs for different records for this domain, one hour and one day. And this figure here shows a DNS operator attempting to change a TTL for a certain zone. It turns out that everybody is setting TTLs but there are no clear recommendations on how to actually choose them. Most operators I talked to, a bunch of them, different ccTLDs, roots, whatever, usually take the approach of if it ain't broke, don't fix it, and they just skip that. And when they have to change, people are very afraid of that. I think we can help and do some research, to eliminate a bit of fear, uncertainty and doubt in this process.
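To make the caching idea concrete, here is a minimal sketch of a resolver-side cache that keeps an answer only for as long as its TTL allows. It is an illustration, not how any particular resolver is implemented; the class name `TTLCache`, its methods and the injectable `clock` are assumptions made for this example.

```python
import time


class TTLCache:
    """Toy resolver cache: an answer is served until its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the expiry logic can be exercised without waiting.
        self._clock = clock
        self._entries = {}  # (name, rtype) -> (value, absolute expiry time)

    def put(self, name, rtype, value, ttl):
        """Store an answer together with the moment it stops being valid."""
        self._entries[(name, rtype)] = (value, self._clock() + ttl)

    def get(self, name, rtype):
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # TTL has run out: drop the entry so the caller re-queries the authoritative.
            del self._entries[(name, rtype)]
            return None
        return value
```

With a TTL of 3600 seconds, a lookup one second before the hour is still a cache hit, and a lookup at the hour is a miss that has to go back to the authoritative server.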
So our contribution here: the effective TTL comes from multiple places, there is parent and child, and I will be covering that soon. And with the scans we did, we see that a lot of TTLs are unnecessarily short. They can be longer, because you get better performance that way. We contacted a bunch of ccTLDs and three of them changed the TTLs of their NS records, and that improved latency by 20 milliseconds at the median and 171 at the 75th percentile. So for the rest of this talk I am going to walk through these points: we will look at different measurement experiments, look at parent versus child and at record dependencies, show the variation that exists in the real world, and finally give recommendations.
So, parent versus child. Let's say you have a domain here, again cachetest.net, and you want to send this query. Here we have the DNS hierarchy tree and here we have a user, in this case Rembrandt, who sends a query to this resolver. The resolver can ask the same thing at two different places. You are going to get this TTL from .net, but the actual authoritative server for cachetest.net is this one here, so the resolver can ask there and get a different TTL. So the question we want to answer is which TTL Rembrandt, who here represents all the users, will end up with: the one from the parent or the one from the child. Our experiment for that uses .uy. On this particular date here, the TTLs at the parent for .uy were equal to two days, but the child had very short ones, minutes for the A. That makes it very easy for us to query from RIPE Atlas every ten minutes: by the time the probes do the second or third round, the records must have expired from the cache because their TTLs are so short.
We look at the TTL values received from 15,000 vantage points; thanks again, RIPE Atlas, for the platform.
This is the CDF of the values observed. On the Y axis you have the percentage of the answers, and on the other axis the TTLs they carry. What do we see here? We see a spike for the A record around 120 seconds and for the NS around 300 seconds, so most resolvers in the experiment from RIPE Atlas are actually child centric, meaning they trust the values from the child over the parent. We also tested with passive data from .nl: we looked at how often resolvers come asking for A records that have a value of one hour in the child, and we see spikes here. This is the inter-arrival time from resolvers, the minimum for the A record, so there too we confirm that most of them are child centric. We did the same for a second-level domain; it's in the paper, and we find the same. So the conclusion here is that most resolvers are child centric and the child controls caching most of the time.
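The reasoning behind reading child or parent centricity off an observed TTL can be sketched as follows. This is a simplified illustration, not the classification method from the paper: it assumes the child TTL is much shorter than the parent TTL (as in the .uy case, where TTLs are counted in seconds and the parent value is two days, or 172,800 seconds) and that the cache has been allowed to expire between rounds. The function names and labels are assumptions for this example.

```python
from collections import Counter


def classify_answer(observed_ttl, child_ttl, parent_ttl):
    """Guess which side of the delegation a cached TTL came from.

    Cached TTLs only count down, so a value above the child's TTL cannot
    have originated at the child.
    """
    if observed_ttl <= child_ttl:
        return "child-centric"
    if observed_ttl <= parent_ttl:
        return "parent-centric"
    return "other"  # e.g. a resolver that rewrites TTLs upwards


def summarise(observed_ttls, child_ttl, parent_ttl):
    """Count how many answers fall into each category."""
    return Counter(classify_answer(t, child_ttl, parent_ttl) for t in observed_ttls)
```

For example, with a child TTL of 120 and a parent TTL of 172,800, observed values of 120 and 60 count as child centric, while 172,800 counts as parent centric.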
Okay, so now let's look at a different problem: zone configurations and effective TTL. Let's say you have a domain name like this one here. There are two ways you can configure its NS records, which say which authoritative name servers can answer for it. One is called in zone and the other out of zone. In zone means that the name server's domain name is under the tree of the zone you want to set, so there is a match between this part here and that part here. Out of zone, you just choose something totally different, so there is no match between this part here and that part. And of course, the names have to have IP addresses; those are the A records you see here. So let's set some TTLs for them, one hour and two hours, 3600 and 7200, and these are the two scenarios we are going to evaluate with RIPE Atlas. To resolve anything under this sub zone here you need to know both the A and the NS, and what we want to know is whether these records are cached independently on the resolver side. Because you need both, you could expect some coupling between them, and we want to evaluate that. So what do we do? At time equals zero, all the RIPE Atlas probes query for this particular domain and fill their caches. At this time here, the NS should have expired while the A should still be valid. What is happening at this moment? Once the NS has expired and the A is still in cache, do the resolvers trust the cached A, or do they refresh it again since the NS expired? The question again: are they caching independently or not? The trick we use to figure this out: at 540 seconds we renumber the A record, keep both machines running and have them give different answers, so we know where each answer came from. So the question is whether the user, and we are going to use a historical figure here, receives a cached answer or a new one.
So, what do we see here? We see two different figures; the X axis is the time and on the Y axis you see, for both, the number of answers. We have in zone here and out of zone down here. In the first round of measurements the caches are just being filled, so everybody goes to the blue colour, the original server we set up for the A record. After time equals nine minutes here, we renumber the A record, so we have both servers active. The part we are interested in is the part where the NS has expired and the A is still valid, which is this colour here, and the blue colour shows the resolvers that still had the A record in cache but went to the new server, so somehow they found out about the new A. What we see is that in zone, in bailiwick, the A record gets refreshed and they go to the new server. Out of zone, the results are completely different: it seems to be completely independent caching. That's very interesting, because these are domain names that have been configured with the same TTLs and everything. The only thing that changed was in or out of zone.
When you look at the answer, it comes down to the glue records. When the NS expires, the resolvers ask the parent what the NS records are, and if the name servers are in bailiwick they get the A records and the AAAAs along with them. That means the resolvers automatically get an update of those as well, without asking for one, and that seems to be causing this behaviour. It's not a problem, just a feature that forces them to update somehow.
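The in or out of zone distinction, and its effect on how long the name server's A record effectively stays cached, can be sketched like this. It is a simplified reading of the experiment's outcome, not a model of any specific resolver; the function names and the example host names are hypothetical.

```python
def is_in_bailiwick(ns_name: str, zone: str) -> bool:
    """True if the name server's name sits at or under the zone it serves."""
    ns = ns_name.rstrip(".").lower()
    apex = zone.rstrip(".").lower()
    if apex == "":
        return True  # every name is under the root
    # Compare on label boundaries so "notexample.org" is not under "example.org".
    return ns == apex or ns.endswith("." + apex)


def effective_a_lifetime(a_ttl: int, ns_ttl: int, in_bailiwick: bool) -> int:
    """Rough cache lifetime in seconds of the name server's A record.

    Assumption taken from the measurements described above: in bailiwick,
    the A record is refreshed together with the NS when the NS expires;
    out of bailiwick, the two are cached independently.
    """
    return min(a_ttl, ns_ttl) if in_bailiwick else a_ttl
```

With the TTLs used in the experiment (NS 3600, A 7200), this sketch gives an effective A lifetime of 3600 seconds in bailiwick and 7200 seconds out of bailiwick.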
So, yeah, in this case, our user Marcus will notice an early-refreshed A for in bailiwick. Whether you are in or out matters; how caching occurs in DNS depends on it. So we investigated two different corners of how TTLs are used for caching, but as I said, people still have to choose TTLs, so how do they choose them in the wild? We took a bunch of data sources and asked for the TTLs of the records: the TLDs, the root and the entire .nl zone as well. We analysed the child TTLs and we discussed the results with operators.
We then analysed this data and found a bunch of stuff. One of the things we found is that most domains are out of zone, meaning their NS records point to totally different domain names. The importance of that is that if you are out of bailiwick, out of zone, records are cached independently, so the values they have are going to be respected. You see the yellow colour here showing the percentage for the different zones; the exceptions are the root and the TLDs, which have a lot of in-bailiwick configurations as well. Then you start looking at the different types of records. In DNS, NS records point to authoritative name servers, and here we have a CDF of the TTLs in hours. A lot of NS records spike around one hour, meaning 60% of the records are long, which is very good for caching, but 40% are not, which is not very good for caching. We also looked at how NS records compare to A records, and what we found is that A records are far shorter than typical NS records for the same domains. They are spiking around one hour here, and shorter A records lead to poor caching. I am going to explain the consequences of that in terms of performance.
I think this part of the paper is what got us in at IMC. We found 34 TLDs that had a very short TTL for their NS records, under half an hour. We notified eight ccTLDs, and three of them increased the TTL of their NS records to one day after our notification. One of them was Uruguay, and we had measured Uruguay before the change and then also measured afterwards with RIPE Atlas. Here you see the two different colours. This is a CDF, with the percentage of answers here and the RTT in milliseconds here; the blue colour shows the CDF before the change and this one shows the performance after the change. What we see is that the median RTT for users querying .uy went from 28 milliseconds to 8 once they changed their TTLs. That is this purple bar here. It went down from 173 milliseconds to 21, so it's saving this amount of time here. Before that, I should say, when you use RIPE Atlas measurements people are always concerned about the bias it may have towards probes in Europe. So we broke down the results per region based on the location of the probes, and we found that all regions in the world experienced latency reductions, up to 150 milliseconds for Africa. And if you don't follow football, this is their top scorer here; we helped to improve their performance. I think it's a good thing for this paper that we are able to demonstrate that by changing one parameter you can improve performance in Africa and the Middle East. So this natural experiment, which we didn't design but measured before and after, shows how important TTLs are for performance and user experience. And after giving some thought to that, it seems to me that TTLs are like the turbo button on old computers. If you don't know what that is, that means you are young, lucky you. But back in the day there was a button you could press and performance would double. It's the same thing with TTLs.
If you change your TTLs you can improve performance for all your users because of caching, so that's what we learned from this part of the paper.
So now I am going to look at the problem from a different perspective: caching using longer TTLs versus Anycast. People at CDNs spend a lot of money on huge Anycast deployments. These are massive deployments where you announce one IP address from multiple locations across the globe. And since people are afraid of how to choose or change TTLs, they say: I will have a short TTL for my records just in case, and because I use Anycast all my users are going to get great performance anyway, Anycast will make up for that. Can it really? Let's look into that.
So here we have this diagram: the RIPE Atlas probes are here, and in this salmon colour is the resolver. We have 10,000 probes and we set up two scenarios. In one, we set up a Unicast name server for a test domain in Frankfurt, and in the other we use the Route 53 Anycast network, which had 45 sites globally at the time we measured. Our idea is just to compare the two. We measure the performance for these users, the probes here. In one case they query Frankfurt, one Unicast location only, and you can imagine the distance can be huge, there might be users in New Zealand or Africa, but there we enable caching for one day, which means good caching. For the Anycast users we set a TTL of 60 seconds, and since we are probing every ten minutes, that means we are effectively disabling caching for the users of Anycast in this particular experiment. So the question we wanted to answer: which one is faster? Having a single Unicast server in Frankfurt with long caching, or having a big Anycast network with 45 locations globally but no caching?
So this again is a CDF, the distribution of the queries over the RTT here, in log scale. Anycast with no caching in this case has a median of 29.96 milliseconds, which is very good. Unicast at Frankfurt comes in at under 8 milliseconds, so it was 22 milliseconds faster than Anycast with no caching, and on top of that the query load on the site was 77% lower in comparison to Anycast. That is the purple colour here, which shows the performance difference. I am not saying you should not use Anycast. You should, if you are a large provider, because it's the only way to get good performance globally. But I am also saying pay attention to your TTLs, because if you want your users to have good performance you need to choose them carefully. We had a paper on that, I think last year or two years ago, where we strongly recommend that you use Anycast as well.
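Why a distant Unicast server with long caching can beat a nearby Anycast site without caching comes down to simple arithmetic: a cache hit never leaves the resolver. The sketch below is a back-of-the-envelope model of the mean latency, not a reproduction of the measured medians; the function name and all numbers in the usage note are hypothetical.

```python
def mean_latency_ms(hit_ratio: float, cache_rtt_ms: float, authoritative_rtt_ms: float) -> float:
    """Expected query latency as seen by a client.

    A fraction `hit_ratio` of queries is answered from the resolver cache;
    the rest pay the round trip to the authoritative server.
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_rtt_ms + (1.0 - hit_ratio) * authoritative_rtt_ms
```

As an illustration with made-up numbers: a far-away authoritative at 100 ms with a 90% cache hit ratio and 5 ms to the resolver averages about 14.5 ms, while a close Anycast site at 30 ms with no cache hits stays at 30 ms.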
So, that brings us to discussion. What are the reasons to have longer or shorter TTLs? Longer TTLs mean longer caching and faster responses: users can get answers from the cache so it's much faster, and there is lower DNS traffic on the authoritative side. Shorter TTLs allow you to make faster operational changes, which is very useful if you are using DNS-based DDoS mitigation, and some CDNs use DNS-based load balancing. Each organisation must weigh the trade-offs to find the right balance. So that moves us to recommendations and conclusions here.
If I could give one recommendation from this paper, it is to use longer TTLs if you can. Unless you are using, let's say, CDN load balancing, set your records to one day, because then your users are going to be able to use the cache and get better performance. Keep on using Anycast too, it's good for many different things. I am talking about domains that get a lot of queries, like globally popular domains. And the thing is, resolver operators have designed very complex, efficient caches; use them wisely to improve the performance of your own users. You can do that with TTLs. And the question I want to pose: should you consider TTLs as well? The full paper is available here if you want it. We also have a draft at the IETF, which was before DNSOP, if you want to check it out; these are the two things. And I think I am good.
PAVEL LUNIN: Thank you.
Questions? No questions. Thank you. Our next speaker is Sofia on community problems and the ways to solve them, please.
SOFIA SILVA BERENGUER: I am a manager with APNIC and originally from Uruguay, so I am happy to hear my folks took the recommendation and improved things for my friends and family back there.
Before starting with the main topic of my presentation, which is how we are trying to solve problems for our community, I wanted to mention a few things about me. I started my journey in this RIR world nine years ago. Time has gone by quite fast, it feels like it was yesterday, but it was actually in 2010 that I joined LACNIC as a Hostmaster and policy officer. After some time I moved to some technical roles: I was a networks and security engineer and then a senior security and stability engineer. It was fun. I was with LACNIC for four years, but five years ago I moved to Madrid to do a master's, and I was doing some research on AS-level interconnection in the Latin American region. It was very interesting, I learned a lot about proper research methodology and everything, but I wasn't sure about going for the PhD. So after I finished my master's I decided to take some sabbatical time and I was travelling around Asia, and by chance I ended up visiting the APNIC offices. They were curious about my research, I presented about it, and I got a job offer. I didn't have a plan, so I thought, why not? I decided to move to Australia. So for more than two years now I have been living in Australia. I joined them as a data scientist but I changed roles: a year and a half ago I became a product manager. Before that I had always suffered from imposter syndrome, and I just want to see a show of hands: who in the room doesn't know what imposter syndrome is? Okay. A few of you don't. I'm glad to see there are a lot of people who do. I didn't know what it was, and there is research that shows that 70 percent of us have already suffered or will suffer from imposter syndrome at some point. Even if you don't know what it is, you will suffer, or have already suffered, from it.
So before moving into my main topic I wanted to share a personal story. It was just a few months after I had joined APNIC as a data scientist, and I was struggling with one of my scripts. I felt that the server was like that picture and I was really freaking out: my script is not working, it's consuming a lot of memory, it will crash everything running on that server. I was very scared of being found out, because they would know I am not as smart as they think I am. I didn't want to tell anyone because I was so embarrassed. I wish I had seen Sasha's presentation yesterday about testing and all that back then; yeah, I wish I had had some testing for my scripts. The thing is that I ended up talking to my colleague George Michaelson, who was sitting next to me, and I was like, don't tell anyone, but I have this issue and I don't want anyone to know, so please help me fix it. He ended up helping me, but then we had a conversation and he was like, you have imposter syndrome. And I was like, what do you mean, I have what? I just don't want them to know I am this stupid. And he shared an article about imposter syndrome, which says most of us suffer from it. It was interesting to see it was not just me, and I stopped feeling so alone in the world. I thought I had tricked everyone and had just been lucky during my interviews, and then I realised it was all about expectations, the expectations I thought people had of me. And the thing is that a year and a half ago I changed roles and became a product manager, and I don't have imposter syndrome any more, because no one expected me to know what I was supposed to be doing as a product manager. My boss knew I had to learn to become a product manager.
Now, in retrospect, I realise the main difference is that I don't have such high expectations of myself either. I'm just trying to embrace my ignorance, and as a product manager I am acknowledging that I don't know as much as I thought I knew. I want to learn from users and from network operators what the underlying problems are, so that as a product manager I can do my job and try to solve those problems. That's why the title of my presentation is trying to solve problems for our community.
As I said, I am now a product manager. I manage a product family we call information services. We have three main products and some internal information services, and we are also doing discovery of new ideas. Today I will focus on the three main products. The first one is the Internet Directory. Instead of URLs I have started using QR codes; I think they are cool and they are easier than typing. If you are curious you can scan the QR code or download the slides. The Internet Directory is a portal where we offer information about resources delegated by APNIC. We have some charts showing delegations to the different economies in the region, grouped by sub region. We have a map view where you can see accumulated IP space delegated to the different economies. We also have some IPv6 deployment charts that show IPv6 capability in the different economies, and we recently enabled a comparison mode for these charts as well, so you can compare IPv6 deployment in different economies. And we also have these diagrams that show the AS interconnection in IPv4 and in IPv6 in the different economies of the region.
So, as I said, this is a portal that provides information about delegations and how these Internet number resources are used in the Asia Pacific region. We have had three different versions of it. The first two versions of the Internet Directory were developed by third parties and we didn't have that much control over the code, so we decided to rebuild it ourselves, and in September last year we launched this new version. So these new features are not that new any more, because they are already more than a year old. But for those of you who haven't tried the Internet Directory, the kind-of-new features are the IPv6 deployment charts I was showing and the comparison mode, which lets you choose the whole Asia Pacific region or specific sub regions or economies and compare the delegation charts or the IPv6 deployment charts. We also have those visual diagrams; that used to be a stand‑alone website and is now part of the Internet Directory. And we converted all the charts into web components so they can easily be embedded. We use that for blog posts about the situation in a specific sub region of the Asia Pacific: we embed the chart so it is interactive and always fresh.
I think I skipped one.
What is next, and what we are currently working on. A few weeks ago we had our conference in Thailand and we ran some new research activities, some interviews with users. We are trying to clearly define the unique value proposition of the Internet Directory. This is reverse engineering the product: the thing is, I adopted it, it was already there when I became a product manager. In order to really understand who the main target audience is, and to develop new features for that audience, we want to clearly define what the unique value proposition of this product is. That is work we are currently doing. We are also always working on continuous improvements, so if you have used it, or are happy to try it at any point, and you want to give us any feedback, it is more than welcome.
And another thing we are starting to think about is that maybe the Internet Directory doesn't work that well as a name any more, because it has changed a lot since it was born, so we are considering that maybe it needs a new name.
So also if you have any ideas, any cool ideas for that, I would be happy to hear about that.
The next product is what we call DASH, the dashboard for autonomous system health. It is a product that visualises information about malicious traffic, or rather suspicious traffic, that we see going out of a network. Members of APNIC who hold address space can log into this product and see whether we have seen suspicious traffic going out of their networks. We do this through a honeynet project: we have our own honeynet, but there are also honeynets from the community, who deploy their own and share data with us. From the collection of all that data we inform our members that we have seen some suspicious traffic coming from their networks and that they may want to do something about it. We show them a comparison between last month and the current month, and they can compare themselves with their economy and sub region.
So that's a chart showing the AS the user has selected compared to the economy, or they could also compare to the region. Then we show them what percentage of their routed space has been seen sending suspicious traffic. They can see a list of prefixes. At this stage this is a prototype, a partial version of the product that only uses SSH attacks: of all the data we are collecting from the honeypots, we are only offering information about SSH traffic through this interface.
So, as I said, it's currently a prototype just showing SSH attacks, and we are planning to implement some more features. There is a short video that explains the product at that URL. This is also a product that we are kind of reverse engineering, though a bit less so because it's still a prototype. It was already a project when I started as a product manager, so we stepped back and said, we need to understand motivations and behaviours when network operators deal with network security incidents, to better help them through these products. A few weeks ago in Thailand we had some interviews with technical people from our community, asking them how they deal with network security incidents. We are trying to understand what information they need, what information they need to report, and to identify pain points where we could help them with this product. And as I said, it's just a prototype, so we will not launch it openly to just anyone in our membership yet. We will start with a soft launch, identifying a set of members who would see interesting information at this stage. Because it's just SSH traffic, just part of the data, there would be some users who log in and see 0 percent, nothing wrong here, and that wouldn't be interesting for them. We will focus on a set of members who would see something interesting in the product at this stage.
The last product is NetOX, the network operators toolbox. You can search for resources, and it looks very similar to RIPE Stat, which is no coincidence. We are working in collaboration with the RIPE NCC, with Christian specifically, so currently this is an APNIC version of RIPE Stat. The idea is that we will differentiate: we are working on some problems and needs we have already identified to create some more widgets of our own, and those may also become part of RIPE Stat. Another benefit of this collaboration is that we will be moving content to a CDN, so that the performance perceived by users in the Asia Pacific region is better.
So we are exploring how to implement features to solve those problems, and one of them is bringing in points of contact from sources other than Whois. We have had some interviews with network operators and they all agree that the points of contact in Whois are not adequate when dealing with routing issues. The Whois points of contact are not meant for that, so requests usually go to the wrong team, or the contacts are not up to date. The de facto place to look for points of contact when dealing with routing issues is Peering DB. So instead of trying to replace what is already out there and widely used, we are consuming data from Peering DB and making it easier for our users to find it.
We also identified that upstream providers are the usual escalation point: when a routing issue is not getting solved quickly, the upstream provider of the AS causing the issue is the natural place to escalate. So we also consume information from CAIDA, the inferred relationship dataset, to make it easier for our users to identify who the providers of a given AS are. And something that kept coming up during interviews was that a lot of ISPs use IRRs to automate their route filters, so we are exploring how to make that information available in an easy-to-consume way for automating route filters.
Another idea we are exploring is currently under the NetOX umbrella but will probably become a separate product. It is for network operators as well, but probably on a completely separate platform, and I would like to have some feedback on it too, because there are lots of NOGs in this region and you are a very active community. Something that I kept hearing during interviews, informal conversations and other interactions is that it's a bit of a pain to go back to discussions that happened on a mailing list. Digging through a mailing list archive and trying to find what the conclusion was, whether there was consensus or something like that, is kind of tricky. So we are exploring the idea of offering a shared knowledge base, a kind of wiki for network operators. In Thailand during our conference we also ran some interviews to try to validate this problem. It seems that the people we talked to are already using platforms like forums and support platforms and everything, but they are also using mailing lists and they are experiencing these difficulties. So we are now exploring what alternatives we have to implement this idea. We don't want to implement anything from scratch; we expect to use existing technology for the back end and adapt a front end to the features we want to offer our community.
As I mentioned, we are working on some new widgets to differentiate. The first is based on CAIDA's relationship dataset.
The second one uses Job's IRR Explorer to show what the situation is in terms of the different IRRs and RPKI, and if users want to know why that message is what it is, they can go to IRR Explorer, to the proper website. And the last column shows two instances of the points of contact, because it consumes information not just from Whois but also from Peering DB, as I mentioned. In some cases network operators choose not to make their points of contact public in Peering DB. So for example 608, the one at the top, that's APNIC's, and apparently they chose not to make their points of contact public in Peering DB; you can only see those points of contact if you log into Peering DB. So we are letting the user know that, and they can log in to Peering DB and find those contacts there.
So, I did pretty well with time, I am happy with that, because I am always ‑‑ it's hard to estimate how long it will take to tell my stories. But, yeah, as I said I would be really happy to hear from you, if there are any questions, comments or feedback.
DMITRY KOHMANYUK: Your timing is perfect. So please, guys, don't be shy and shall I know it's English right, I say guys and girls, I don't know how to say it better, people, come to these mics and you can ask our presenter your questions. Thank you.
SOFIA SILVA BERENGUER: I was so happy the RIPE community is so good in providing feedback. The APNIC community is very silent and quiet so I thought it would be better here.
DMITRY KOHMANYUK: Sometimes people at the presentations are on the computers and you can't really scan unless you are assuming you have a phone ‑‑ I tried one of them it didn't work. Put it at the line below or above, especially for people who are visually impaired they may not be able to use QR codes at all.
AUDIENCE SPEAKER: It's a thank‑you from one imposter to another for raising this issue.
DMITRY KOHMANYUK: That is a good answer, thank you.
With that being said, I need ‑‑
RANDY BUSH: IIJ and Arrcus. Anybody here who says they do not suffer imposter syndrome either lies or does not know what it means.
DMITRY KOHMANYUK: Lai Yi Ohlsen.
LAI YI OHLSEN: Hi, I am Lai Yi Ohlsen, I have the highly coveted spot of after lunch and before coffee, so thank you for joining me. I am the new project director of Measurement Lab. We provide the largest open dataset of Internet performance measurements. It's my first time at RIPE so thank you for having me. Measurement Lab was present at RIPE 69, where Colin Anderson presented our 2015 interconnectivity report, which used M‑Lab data to study ISP interconnection and its effects on consumer Internet performance.
So today I will be introducing ‑‑ reintroducing Measurement Lab and sharing the work we have been doing to support open Internet research for the last eleven years, including some recent updates.
First to speak about who we are, I forgot a slide for this. But we are a fiscally sponsored project of Code for Science and Society, which is a US‑based nonprofit focused on science issues, and we work alongside a team at Google. Previously we worked with the Open Technology Institute at New America and Princeton's PlanetLab.
Over the years we have worked with numerous partners and collaborators who I will talk about as I present our work. And they come from all corners of the industry and we do our best to be a resource to all of them.
So, some history:
Back in 2008, Internet researchers started discussing the many challenges of studying Internet performance. It's hard. One of the problems was the lack of widely deployed servers with sufficient connectivity to support the type of Internet measurement experiments that researchers wanted to do. Another issue was that there was no good way to share the large data sets that the experiments produced. And finally, there was no real public resource that existed to provide aggregated performance data to regulators and researchers and consumers that would give them usable or useful and actionable information about their Internet performance over time. And so, for the past eleven years our mission has been to measure the Internet, save the data and make it universally accessible and useful.
So, in 2008, it was hard to measure the Internet and in 2019 it still is but we hope slightly less so and I will describe what we are doing to help with that.
So, to measure the Internet we do a few things. First, we run a measurement platform which means that we run hardware in well‑connected data centres where ISPs can interconnect with one another.
Putting these high capacity servers next to the content allows us to measure the user experience of the full route from user to content. At each location we place at least one, and multiple if we can, of what we call pods, and pods are three to four servers and one switch connected directly to an upstream provider. We try to be in every large metro to obtain diversity in transit and routes.
So here is a map, it's a screenshot that I took of our infrastructure map this morning. It is interactive, so if you access it on the website you can zoom in and see more specifically where we are placed.
We run 500 plus servers in 130 plus locations and, as an exciting update, we are about two‑thirds of the way through upgrading the entire platform to be managed by Docker and you can read more about that on our website.
Then on the hardware we run a number of network measurement services and tests provided by several partners through the years, who are listed here. One of the tests is NDT, which stands for Network Diagnostic Tool. It was originally built by Internet2 and is now maintained by Measurement Lab. NDT is a single stream performance measurement of a connection's capacity for bulk transport as defined by the IETF's RFC 3148; it reports upload and download speeds and can be useful in identifying what problems are limiting performance.
Like all the tests that run on our platform, it's Open Source and linked here. And in fact, today you can host your own NDT test by running this Docker command on a Linux machine, to run NDT 5 which is the latest version. You can also run NDT 7, which is a new version of the protocol that has been designed and is currently in development.
NDT 7 uses TCP_INFO, which allows us to support BBR based TCP, resulting in measurements that are compatible with IETF RFC 8337, and it works over TLS and WebSockets so it's compatible with modern browsers. And so if you run that Docker command and point your browser to your localhost you can run tests using the original protocol, and if you pass a TLS certificate and run on port 443 you can run NDT 7 using TLS and SSL.
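As a rough illustration of what an NDT 7 client deals with, here is a minimal Python sketch. It does not open a connection; it only builds the secure WebSocket URL and turns one measurement message into a throughput figure. The path `/ndt/v7/download`, the subprotocol name and the `AppInfo` field names follow the publicly documented ndt7 specification as best understood here, and since the protocol was still in development at the time of this talk, they should all be treated as assumptions to verify. The helper names are hypothetical.

```python
import json

# Assumed values, taken from the public ndt7 specification; verify before use.
NDT7_SUBPROTOCOL = "net.measurementlab.ndt.v7"
NDT7_DOWNLOAD_PATH = "/ndt/v7/download"


def ndt7_download_url(host: str, port: int = 443) -> str:
    """Build the secure WebSocket (wss) URL for a download test on `host`."""
    return f"wss://{host}:{port}{NDT7_DOWNLOAD_PATH}"


def mean_throughput_mbps(measurement_json: str) -> float:
    """Average application-level throughput from one measurement message.

    Assumes the message carries AppInfo.NumBytes (bytes transferred so far)
    and AppInfo.ElapsedTime (microseconds since the test started).
    """
    app_info = json.loads(measurement_json)["AppInfo"]
    elapsed_seconds = app_info["ElapsedTime"] / 1_000_000
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    bits = app_info["NumBytes"] * 8
    return bits / elapsed_seconds / 1_000_000


if __name__ == "__main__":
    # Made-up sample message: 25 MB transferred in 2 seconds.
    sample = json.dumps(
        {"AppInfo": {"ElapsedTime": 2_000_000, "NumBytes": 25_000_000}}
    )
    print(ndt7_download_url("localhost"))
    print(mean_throughput_mbps(sample), "Mbit/s")
```

A real client would open the URL with a WebSocket library, request the subprotocol above, and read such messages until the server closes the connection.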
Other network measurement services and tests we host include DASH, which is different than what we just talked about, Reverse Traceroute, SideStream, Paris Traceroute, BISMark, SamKnows and more, and the experiments that we host are written by researchers, mostly academics, and are approved by our Experiment Review Committee. And the new platform makes it easier for tests to be deployed and maintained and we hope to see this enable new experiments to be proposed. So if you are interested in potentially proposing an experiment please come forward after the presentation or let's talk about it this week.
So, in addition to running the platform, the second thing we do to measure the Internet is run tests, but to clarify, we, meaning Measurement Lab, we do not run the tests, all tests are active measurements and use ‑‑ users run tests via client integrations.
But we get about 3 million new NDT measurements per day and we are proud to say there are about 2 billion rows in the BigQuery table as of I think sometime in June. There is a blog post there about it, you can't see it but it's there. That is not even including all the data from other tests. So here is a time‑line of how it's grown and you can see we have almost doubled since 2017.
So, an easy way to run a speed test if you are just curious about your connection at any time is to go to speed.measurementlab.net. There are a lot of client integrations including Google search, various software, a Chrome extension we developed to enable automatic measurements from the browser, and some of the software integrations are built and maintained by local communities which I will speak to later. But I will mention to you that with NDT 7 building these clients is easier than it's ever been so also if you are interested in what that could mean, let's talk.
And the second part of our mission was the save the data part. We use a push based ETL pipeline that stores everything in Google Cloud Storage, you can access all of the raw data there, which includes a lot, and includes raw packet traces and metadata. It's mostly used by academics and researchers but it is really important to us that anyone can access it and see how we got the answer to the equation, essentially. But of course, the mission doesn't end there. We also want the data to be universally accessible and useful. Raw data is hard to work with and kind of unnecessary for most people so we parse the data, annotate it and put it into BigQuery.
And accessing the data in BigQuery is free and requires a sign up to the M‑Lab discuss list. And once you are in you can use SQL to query the data as needed. You can access the NDT data as well as data from other tests. BigQuery is easy to navigate but we are here to support you so please reach out. It's what the team is there for and we really like hearing from people and about the questions that users are asking of the data, and it's something that we already do but I'd like to start doing more of, you know, connecting people based on their similar research interests so if you are playing in the data, please always feel free to reach out. And there is our support address.
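To give a feel for the kind of SQL query described here, the following Python sketch runs a small aggregation against an in-memory SQLite database standing in for BigQuery. The table name `ndt_downloads`, its columns and the sample rows are all hypothetical and made up for illustration; the real M-Lab tables have their own schema, and BigQuery has its own SQL dialect and client library.

```python
import sqlite3

# Made-up sample rows: (test_date, region, isp, download_mbps).
SAMPLE_ROWS = [
    ("2019-09-01", "Rotterdam", "ISP-A", 90.0),
    ("2019-09-02", "Rotterdam", "ISP-A", 110.0),
    ("2019-09-02", "Rotterdam", "ISP-B", 40.0),
    ("2019-09-03", "Utrecht", "ISP-A", 60.0),
]


def build_sample_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical ndt_downloads table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ndt_downloads ("
        "test_date TEXT, region TEXT, isp TEXT, download_mbps REAL)"
    )
    conn.executemany("INSERT INTO ndt_downloads VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    return conn


def average_download_by_region_and_isp(conn: sqlite3.Connection) -> list:
    """Count tests and average download speed per (region, isp) pair."""
    query = """
        SELECT region, isp, COUNT(*) AS tests, AVG(download_mbps) AS avg_mbps
        FROM ndt_downloads
        GROUP BY region, isp
        ORDER BY region, isp
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in average_download_by_region_and_isp(build_sample_db()):
        print(row)
```

The same GROUP BY pattern carries over to a hosted dataset once the real table and column names are substituted.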
But the most common form of accessing data and the easiest one is the Viz site, and here you can run just basic queries for your region, aggregate by location, transit ISP and provider ISP. So we identify the ISP by using a combination of Google App Engine's geolocation service and the freely available GeoIP database from MaxMind.
And there is more. And so the goal with all of these tools again is to make the data universally accessible and useful because running a speed test once gives you information about that one moment, but by collecting longitudinal open data we hope to be able to provide meaningful information about the behaviour of the Internet over time.
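The point about longitudinal data can be sketched in a few lines of Python: rather than looking at one test, group many tests by month and take the median of each group. The function name and the sample values are hypothetical; this is only one simple way to summarise a time series, not a description of how the visualisation site computes its figures.

```python
from collections import defaultdict
from statistics import median


def monthly_median(samples):
    """Group (iso_date, mbps) samples by YYYY-MM and return each month's median."""
    by_month = defaultdict(list)
    for iso_date, mbps in samples:
        by_month[iso_date[:7]].append(mbps)  # "2019-08-17" -> "2019-08"
    return {month: median(values) for month, values in sorted(by_month.items())}


if __name__ == "__main__":
    # Made-up download speeds in Mbit/s.
    tests = [
        ("2019-08-03", 20.0),
        ("2019-08-17", 30.0),
        ("2019-08-29", 100.0),
        ("2019-09-05", 40.0),
        ("2019-09-20", 50.0),
    ]
    print(monthly_median(tests))
```

The median is used here because a single unusually fast or slow test, such as the 100 Mbit/s sample above, shifts it far less than it would shift an average.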
So, that's the tech, but it's really only part of Measurement Lab because making the data useful doesn't only mean making it available but making sure that it's being used by researchers. So I am going to give a few examples of how that is being done.
Due to our open access methodologies, we have become a trusted data source for the academic Internet measurement research community and you can find a list of publications that cite us as a ‑‑ this includes work from University of New South Wales, Foundation Alliance for Affordable Internet, Belgian University, but this is just a few of them. And I encourage you to take a look to get a sense of what the research ‑‑ what is possible with our data and also because they can speak to their research better than I can. And we do try and keep track of all the publications that cite us as a resource but we might be missing some, so if you know of any or maybe even if you have authored one, please send them our way.
So another big part of our work is collaborating with local governments and institutions who are interested in specific questions about their community's Internet access and performance. And first I am going to give a bit of background on the US regulatory context regarding broadband access and performance. So, the FCC, which is the Federal Communications Commission, is the agency that has overseen the regulation of the Internet in the US. They measure broadband in a few different ways, one of which is the Measuring Broadband America Programme, which measures fixed broadband access; we provide the off net servers for the study. They also collect data from the ISPs on a bi‑annual basis around broadband subscriptions and speeds.
So, with that acknowledged, there is still a lot of discussion and interest in improving the available broadband data and in looking for alternative and complementary sources of information because the fact is, the digital divide is really felt in America and other places and it's being felt through indicators such as economic opportunity and citizen engagement and the homework gap etc. And leaders of communities are often unable to properly advocate for their constituents because they don't have the right numbers, meaning the data that is available to them is not reflective of what's being indicated when we look at the expected outcomes of sufficient digital access. So we have supported some community initiatives that help communities gather data using a user volunteered and locally maintained approach so that communities can be the stewards of their own data and help promote a metrics based understanding of what their broadband needs are and what their priorities for improvement should be.
So, the first example is we have worked with the folks at MERIT in Michigan, along with a state university, on a project called Michigan Moonshot. They were interested in understanding the homework gap of their students as it pertains to the digital divide. More than 380,000 do not have access to broadband and accurate connectivity data has the potential to assess the community needs and build support for funding sources, so we worked with them to build a portal on mibroadbandtest.us, where you can access an NDT speed test and a survey with questions relevant to their research such as does your family own or rent the place you live in or on a scale of one to five how satisfied are you with your Internet? And in addition to the ability to ask these questions, the portal allows them to collect enhanced geolocation and this data is not stored in the public dataset but it is useful to researchers seeking information past the location data that our test usually provides. My favourite part is the creative approach that the researchers took to conducting the survey, so students were given the quiz ‑‑ or the survey ‑‑ both at school and at home too as a homework assignment to provide the study with a complete picture of the students' broadband experience, which I think is just a genius way to make sure that surveys get taken.
The project started with three schools and is expanding into more schools state‑wide this year. This portal was built using Piecewise, which has been used in other survey and mapping initiatives in Washington and Idaho, and the goal of this is to make it easy for communities to conduct a broadband survey, speed test and aggregation of the data and visualisation of the tested area on a map so they can see how areas are doing in relation to one another within a specific region. I am happy to say this tool recently got a bit of funding so we will be seeing some further development by the end of this year.
We also participated in a public private partnership to create a similar tool called Speed Up Louisville and it's now being developed further, as is Speed Up America, by the Technology Association of Oregon, and it's quite similar to Piecewise, I think just written in Ruby, I believe, but it's a good example of how you can ‑‑ anyone can create a client that fits their specific needs of measurement.
And then we also have the national ‑‑ TestIT, which is worked on by the National Association of Counties, which is a member organisation of locally elected officials in the US. It's another example of collecting and aggregating user volunteered data, this time collected by a mobile application, specifically intended to collect data in rural areas. With the press of a button on their mobile device users are able to run a speed test, again with that enhanced geolocation that is saved in the database, not the public one, and many of the members of NACo represent communities with broadband access spanning from poor to none, and it helps with the identification and verification of areas lacking the access they need to participate in the digital world.
And then last year we collaborated with Penn State on broadband availability and access in rural Pennsylvania, also providing a comparison to the FCC data on subscription and speed that I mentioned earlier so you can actually go in to the census tract and the county and Senate and House districts and see what the user volunteered speeds are versus reported speeds. It's useful when you are trying to get a sense of actual versus reported and can compare how far off the data points are from one another, if they are close or far apart, and it can vary.
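The comparison of user volunteered versus reported speeds can be sketched as a small Python function: for each area, take the median of the measured tests and express it as a fraction of the reported speed. The function name, the area names and the numbers are hypothetical and made up for illustration; they are not figures from the Pennsylvania study.

```python
from statistics import median


def reported_vs_measured(reported, measured):
    """Compare reported speeds with the median of user-run tests per area.

    reported: {area: reported_mbps}
    measured: {area: [mbps, ...]} from user-volunteered tests
    Areas without tests, or with a non-positive reported speed, are skipped.
    """
    comparison = {}
    for area, reported_mbps in reported.items():
        tests = measured.get(area)
        if not tests or reported_mbps <= 0:
            continue
        measured_median = median(tests)
        comparison[area] = {
            "reported": reported_mbps,
            "measured_median": measured_median,
            "ratio": measured_median / reported_mbps,
        }
    return comparison


if __name__ == "__main__":
    # Made-up values in Mbit/s.
    reported = {"County A": 25.0, "County B": 100.0, "County C": 10.0}
    measured = {"County A": [5.0, 10.0, 15.0], "County B": [80.0, 100.0]}
    for area, row in reported_vs_measured(reported, measured).items():
        print(area, row)
```

A ratio close to 1 means the two sources roughly agree for that area, while a low ratio flags a place where measured speeds fall well short of what is reported.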
And the last community based research I will talk about is a programme we are in the second year of called Measuring Library Broadband Networks, and it's a partnership with Internet2 and Simmons University, and we are working with public librarians and their IT staff, which are sometimes the same person, to build a system for automatic measurement using M‑Lab speed tests. For the next year we will work with them to understand the broadband speeds and quality of service that libraries are currently receiving and to define the priorities for what they need to improve their access and, you know, this is important because libraries are often the main broadband access point for towns and cities, especially smaller ones but also bigger ones, and understanding their performance is really critical to understanding the region's digital needs. We are packaging the tests in a Docker container and running them from small computers, they are similar to Raspberry Pi, connected to libraries' public access networks, and we are currently developing an option to manage them locally using a web framework.
So, test data from each device will be pushed to the library's local database and our dataset and this year the team will start designing visualisations based on the research done last year and a study of what metrics are most useful and understandable to the average library IT staff. As excited as we are about the library use case, there are a large number of other contexts that these could be useful for outside of the library and so developing them for this project is a way for us to continue exploring how automatic measurement can be used to understand public networks.
And so those are some of the ways that Measurement Lab is supporting community based research and all of these examples come from the US context but our website has more information and if you have thoughts on how we can be useful in your region, whether local or national, we should talk about that.
And so those are all ‑‑ and so those are all some active or recently active research projects that use M‑Lab but here is kind of like just a random list of potential research areas that we think about a lot and would be happy to see pursued so I will just sort of rattle through them and put them out there.
As mentioned every M‑Lab test generates a Paris Traceroute test and there is ample opportunity to be researching routing. We talked about city based initiatives and it's interesting to think about how those could be used to develop and create standards or key performance indicators for cities that are trying to reach a specific level of digital access, it could be useful in the planning of smart cities and I come from an Internet freedom background and I am particularly interested in how we can be useful in the measurements of throttling and shutdowns by creating metric based definitions, obvious behaviours to enable the behaviour of anomaly detection tools that can work against disruptions on the network. Throwing those out there.
All of this, I hope, reflects our principles. Mostly, you know, all measurements are active measurements, we take user privacy very seriously. Anyone can build a client, they are built by and for the community mostly.
And finally, openness. Our tests are open, our data is open and our methodologies are open and this is because we believe this is the key to improving the Internet and understanding it through Open Source and open science methodologies. You know, there is kind of no point in having standards if we don't define how to measure that standard.
And so, with that, I hope you have a better sense of what Measurement Lab does and why we do it. If you'd like to get involved, there are a number of ways to do that. You can propose an experiment, use your data ‑‑ use our data in your research and planning.
I forgot to put this bullet‑point and it's the most important one ‑‑ if you are using RIPE database, we would love to see how they can play together. You can integrate NDT into your clients and if you are a data centre or internet exchange you can host a pod. It goes without saying, we are a community resource, and I'm here to give this talk but also to learn how we can be useful to all of you, so if you have ideas about this or want to be involved in how we think about that, please feel free to approach me here or throughout the week. Thank you.
PAVEL LUNIN: Thank you. 20 minutes for questions. Anyone?
ROBERT KISTELEKI: RIPE NCC. I guess representing RIPE Atlas and R&D and stuff, I would be happy to pick up the thread again. We tried cooperating with M‑Lab before, it didn't lead to a very happy place although there were advancements. So let's pick it up again.
LAI YI OHLSEN: That sounds great, I joined the project in July and have been caught up to speed so would love to continue, fresh face.
PAVEL LUNIN: No more questions? Thank you.
This is the end of this session so see you at 4 p.m.
LIVE CAPTIONING BY AOIFE DOWNES, RPR