DNS Working Group
Wednesday, 16 October 2019
9 a.m.
JOAO DAMAS: Good morning. Welcome to the session of the DNS Working Group at RIPE 79. As Dave mentioned, in case someone came in late: if you were looking for Address Policy, that's in the side room.
We have a full agenda today, and we'd like to go through it as quickly as possible. We circulated the minutes for the DNS Working Group session at RIPE 78 recently; hopefully no one has anything to add, in which case I'll declare them final. I'd like to thank the RIPE NCC for providing the minute taker and Jabber monitor, as well as the stenographers, who make the meeting actually possible.
One other item we had for this session: Shane Kerr, one of the three co-chairs, was at the end of his term. We solicited volunteers for the co-chair position. The only volunteer we got was Shane, who wanted to stand again. When the full list was disclosed, no one said anything against him and a few people spoke in favour. So, I think that means that Shane starts a new term. Thank you, Shane.
(Applause)
So, as far as I'm concerned, that's it. Our first speaker is Petr; come up and go for it.
PETR SPACEK: Good morning everyone, I appreciate that you are so eager to hear about benchmarking that you came here so early. We'll start with the obligatory motivation, look at the classic problems with the approach we use all the time, and then in the second half of the presentation I will introduce a new tool, a new approach to DNS resolver benchmarking. So, obligatory motivation: running a resolver costs money, so for cost savings we have to optimise it, and optimisation doesn't make any sense without benchmarks. With that, I would like to remind you of the presentation yesterday during the Plenary, which was called "Cache Me If You Can", and the most important takeaway from that presentation is that cache hit rate is the most important parameter: DNS TTLs affect the resolvers, and that will in the end affect the performance perceived by the users. So cache hit rate is the important parameter. Okay.
A DNS resolver can either answer from its cache, which we call a cache hit, or it might happen that the answer is not in the resolver's cache, in which case the resolver has to go out to the Internet and do a lot of work. A cache hit is super easy: look it up in memory, send it back, that's it.
For a cache miss, the resolver has to do a lot more work, and in the end it's one, maybe two orders of magnitude slower than a cache hit, depending on the network and other factors.
So, any benchmarking attempt which ignores the cache hit ratio is going to fail. And that's what usually happens. The classic tool for resolver benchmarking is resperf. If you look at the manual it very nicely summarises the process, but let's think about it a little bit more; it basically boils down to three steps. The first step is getting the data: typically running tcpdump and dumping the list of queries, in the form of DNS name and query type, into a text file. That's the input for the resperf tool. Then we take this input file and give it to resperf. Resperf goes through the file from the top down and sends the queries to the DNS resolver. The problem is that resperf totally ignores any timing information from the original traffic: as it goes through the file, it increases the speed, the queries per second, over time, and it keeps increasing the query rate as long as the resolver can respond to all, or almost all, of the queries. At the point where the resolver cannot keep up, the resolver starts dropping queries, we see an edge on the chart, and that's the maximum number of queries per second the resolver can handle. That's more or less what the resperf manual says. The problem is that this completely ignores the timing information. The most important problem is that in the original traffic, a lot of clients are using their own local cache, for example Firefox or the Microsoft Windows resolver, so a client is not going to ask the very same question five times in a row, because it will keep the answer for the TTL, or at least for some short period of time. Resperf ignores this, because everything gets conflated into one huge text file without any timing, and it just goes through the text file as quickly as it can.
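For context, resperf's input really is this flat a format. A minimal sketch of the conversion step (my own illustration, assuming scapy is installed and a hypothetical clients.pcap capture) shows how all the timing information gets thrown away:

```python
# Convert a PCAP of client queries into the flat "name type" text format
# that resperf consumes. Note that the packet timestamps are simply
# dropped, which is exactly the problem described above.
from scapy.all import rdpcap, DNS

packets = rdpcap("clients.pcap")  # hypothetical capture file
with open("queries.txt", "w") as out:
    for pkt in packets:
        if pkt.haslayer(DNS) and pkt[DNS].qr == 0 and pkt[DNS].qd is not None:
            q = pkt[DNS].qd
            name = q.qname.decode()             # e.g. "www.example.com."
            qtype = q.sprintf("%DNSQR.qtype%")  # e.g. "A", "AAAA"
            out.write(f"{name} {qtype}\n")      # timing is lost here
```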
Besides other things, the text file doesn't contain any information about EDNS buffer sizes and other stuff, so all the TCP traffic is missing from the benchmark. It has other problems too; for example, the ramp-up used by resperf doesn't exist in practice. If you restart your DNS resolver, clients don't care, they will just shoot the queries at it and the resolver has to deal with it no matter what.
There are also some implementation problems, or shortcomings, in resperf. It uses a relatively small number of sockets, so the port numbers are not as randomised as we would like, and it doesn't reflect a population of hundreds of thousands of clients, so it can somewhat distort the workload distribution inside the resolver. And we are not even talking about TCP, TLS and other stateful transports, which have become more relevant in the past years.
We can sum up the resperf approach and say that it over-focuses on QPS, the number of queries per second. Because in the end, the QPS value is not what's interesting. If I'm an Internet service provider, I am interested in how many clients my resolver can handle, not in the QPS. If I get a QPS value I have to do my magic equation and find out how many clients it means for me, for my deployment. So that was the motivation why we came up with a different approach, which we are experimenting with right now.
It's implemented in a tool called DNS Shotgun. It's open source and based on dnsjit, and the aim of the tool is to make DNS resolver benchmarking more realistic. So, what do we do?
As I told you before, over-focusing on QPS is probably a mistake, because that's not the value the Internet service provider is interested in. The interesting value is how many clients we can handle. Of course, not all clients are equal: a little IoT device which sends some telemetry data once a day will have a totally different DNS traffic pattern than a mail server which is doing tonnes of queries all day, every day, because it's looking up MX records and so on.
So to solve this problem, we split it into two phases, two separate utilities. In the first phase, the tool processes a traffic capture from your particular deployment, and in the second phase it replays the traffic, simulating the real clients based on your particular setup. So, the traffic analysis. Let me show a picture, that will be easier.
In the top half of the slide you can see the first table, and this table represents a PCAP, a traffic capture from an actual DNS resolver deployment. Clients 1, 2 and 3 represent three source IP addresses in the PCAP, and the columns 1 to 8 represent time. So, in the first second, the three clients each send a different query. In the second second only the second client sends a query, and so on. Okay. This is the original traffic capture. For the purposes of benchmarking, we obviously need more traffic, so how do we do that?
The idea is that if we want to, for example, double the workload on the DNS resolver, we can shorten the period to half: split the PCAP in half, and replay the second half in parallel with the first half. So in the end the table at the bottom shows the result with doubled DNS traffic, and the idea is that now we have six virtual clients with IP addresses 1 to 6, sending queries which keep exactly the same timing as previously, but we have double the number of clients and the length of the traffic capture is just half.
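A minimal sketch of that client-multiplication idea (my own illustration, not the DNS Shotgun source), working on (timestamp, client, query) tuples rather than raw PCAPs:

```python
# Split the capture at the midpoint and overlay the halves: this doubles
# the number of concurrent clients while preserving each client's own
# query timing.
def double_clients(events, duration):
    """events: list of (t, client, query); duration: capture length in seconds."""
    half = duration / 2
    max_client = max(c for _, c, _ in events)
    out = []
    for t, client, query in events:
        if t < half:
            out.append((t, client, query))              # first half stays as-is
        else:
            # second half is shifted to t=0 and given fresh client IDs,
            # so it runs in parallel with the first half
            out.append((t - half, client + max_client, query))
    return sorted(out)

# Example: 3 clients over 8 seconds become 6 clients over 4 seconds.
events = [(0, 1, "a.example"), (0, 2, "b.example"), (5, 3, "c.example")]
print(double_clients(events, 8))
```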
That's the basic principle. And once we have done that, that's the first phase, the data preparation, we can start sending queries.
So, we are trying to simulate a real client, so we use a new socket for every client and then we replay the queries while keeping the timing, plus or minus one second, to keep the cache hit rate realistic. And then we monitor the response rate from the resolver. In practice, this process is done in a loop. We generate data for, let's say, a hundred thousand clients, then we replay the queries, monitor the answers and check that the response rate is almost 100%. If the resolver is able to respond to this number of clients, we will, for example, go to 200,000 clients and repeat the process. We keep increasing the number of clients as long as the resolver keeps up, and once it breaks, once it's no longer able to respond to almost 100% of the queries, we declare the previous value the maximum.
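That measurement loop can be sketched like this (illustration only; run_benchmark is a hypothetical stand-in for the replay-and-monitor step):

```python
# Increase the client count until the resolver can no longer answer
# (nearly) everything; the last count that passed is the result.
def find_max_clients(run_benchmark, start=100_000, step=100_000,
                     threshold=0.99):
    clients = start
    last_ok = 0
    while True:
        rate = run_benchmark(clients)   # replay queries, return response rate 0.0-1.0
        if rate >= threshold:
            last_ok = clients           # resolver kept up; push harder
            clients += step
        else:
            return last_ok              # first failure: previous value is the max
```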
On the next slides, you will see experiments on real resolvers. We took a traffic capture from one of the Czech universities, started with an empty cache, ran the experiments for two minutes, and measured the response rate to make sure the resolver didn't go crazy and fail to respond to everything, or something like that.
The resolvers in the test were configured with roughly the same configuration; we tried to be as close as we could, but sometimes it's not that easy.
This slide shows the results for PowerDNS; that was the first experiment. The blue line here represents 100,000 clients in our dataset, and you can see that we start with an empty cache and flood the resolver with queries from the very first millisecond, and the resolver is not able to respond to all of the queries because it doesn't have the data in cache; it takes some time to go out to the Internet and so on.
So you can see that over time, as the resolver builds its cache and fills it with data, the response rate goes up to 100 percent, and that's where we want to be.
Eventually the resolver will just answer everything. If we increase the number of clients, as you see here, we get to 300,000. That's the red line. You can see that the resolver is still keeping up, even though it takes a considerable amount of time to get to a reasonable response rate. Eventually, it will respond to everything. But if we increase the number of clients too much, the resolver is simply never going to respond to all of them. So, we can say that in this particular configuration, with this data, the resolver is able to handle approximately 300,000 clients.
Okay. PowerDNS has tonnes of options and we are not experts, so we picked two of them almost at random and tried to tweak them. As you can see, the tool can be used for comparison of two different configurations. The blue line is the original configuration with the defaults, and the orange line is after tuning, which apparently didn't help, because we don't know what we are doing; we are not PowerDNS experts.
We also tried to tweak an option which is disabled by default. That's the blue line; it worked okay. After tuning it was worse. So don't touch it if you don't know what you are doing.
In short, do not take these slides as guidance; you really should do the measurement yourself, because every deployment is different. The traffic capture used as input has a tremendous impact on the results, so do not generalise, measure it yourself.
Okay. The same goes for BIND. We just ran BIND to see how it performs with default values; it could handle about a hundred thousand clients. That's the blue line. After some time, we tried with 160,000, and it wasn't that good. Surprisingly, the trend was going downwards; you can see it going down a little bit. That was totally surprising, so we kept tuning things to see what was happening. Eventually we tried the large tuning, which is a compile-time option. That didn't help at all: we saw this drop to a 0% response rate after about ten seconds, then it recovered and it was fine, but for a higher number of clients it just died. So we found a bug and reported it to ISC; they are looking into it. Eventually we found another bug, related to the synth-from-dnssec feature. If you disable this feature, which is supposed to increase performance, you will see that the trend is no longer downwards, it's actually increasing. So we found a couple of bugs, and they will be fixed. And again, do not generalise: a different build of BIND will have different properties, a different system will have different properties, a different traffic capture will change the results. Do not generalise, measure it yourself.
Knot Resolver, finally, my home project. We started to develop the tool so that we would be able to realistically compare two versions of Knot Resolver, an old one and a new one, and see whether we are improving things or making them worse.
So, this is the measurement for the latest version; you can see it at approximately 350,000 clients, more or less. We were interested in comparing with the new, not yet released version 4.3.0, and that's here: the original version is the blue line, the new version is the green line, and you can see that the new version can handle 50,000 more clients. That's another way you can use this tool: to compare two different versions.
Again, measure it yourself.
Of course, having measured the three previous resolvers, we had to measure Unbound as well, so we did. And it turns out that Unbound performs really well. But we found a weird artifact when we were pushing the software to its limits on that particular box: it seemed that everything was fine, but after one minute there was a sharp drop, and then it recovered and went on. We don't know why. We looked into the data, and it doesn't seem like an artifact in the data, because no other resolver showed that behaviour, so we don't know. The guys from NLnet Labs are looking into it, so eventually we will know.
I will preempt the question: why didn't we run it for longer, to see whether the drop repeats or not? We didn't have enough PCAPs. So, next time.
Again, maybe in your configuration the drop will not be there, or it will be elsewhere, or something else will happen; maybe the sky will fall down or something. So measure it yourself.
That was about the experiments. You can try the tool. There are some limitations. The most practical and pressing issue is that for this method to work you need a tonne of PCAPs, literally gigabytes and gigabytes of data, so you can flood the resolver with realistic traffic. That's a bit inconvenient, but if you are running your own DNS infrastructure, just let tcpdump run for a day or a week and you will have enough, assuming you have enough storage.
The other problem is, in my opinion, a feature, and that is that the results shouldn't be generalised, because the number of clients, the number you get from the tool, is absolutely dependent on the dataset. It will give you an answer for your deployment, and that result will not be right for my deployment. So, again, measure it yourself. Right now we don't have support for stateful transports, but we would like to add it. So I encourage you to try the tool. It's a prototype; it might crash and burn horribly. Let us know and we'll fix it. But right now it works for us. If you are interested in benchmarking TCP, TLS or whatever other transport, let us know: if you would like us to support it, talk to us about your specific requirements and so on.
That's it. So, to sum it up: do not generalise. If you have a QPS number somewhere, trust me, it doesn't tell you that much. It's a nice number, you can hang it on your wall, but that's it. It doesn't really help you.
Do not generalise. Again, everything depends on the configuration, the version of the software, the way you compiled it, the network drivers, everything. So measure it yourself.
If you are interested, get in touch. I'm more than happy to discuss here or in the hallway next steps of this project.
Thank you for your time.
(Applause)
JOAO DAMAS: Okay. A couple of minutes for questions. Benno?
AUDIENCE SPEAKER: Benno. Thank you for your presentation, for all the work you have done developing the code, and for the interaction with us. I have one question; it's not a critique, just a question. If you compress the arrival rates, you take a PCAP and you kind of halve the time, compress the arrivals, did you look at the statistical characteristics of the compressed load? The question behind that is: if the arrivals are independent statistical events, I can imagine you can compress them like this. But if it's not that kind of arrival process, I'm not sure...
PETR SPACEK: The theory behind it is the law of large numbers, which more or less says that, okay, if you have a hundred thousand clients, the next hundred thousand clients are probably going to behave about the same.
AUDIENCE SPEAKER: I think you know more about that than I do, but that was my question. Thank you very much. Happy to help and to look into this.
AUDIENCE SPEAKER: We'll take that offline, but we should talk about it; I am around tomorrow. Another thing I am around tomorrow for: do you have a plan for how you are going to do DoT and DoH, the encrypted things? They are different and --
PETR SPACEK: Let's talk about that in the hallway; that's a long discussion.
AUDIENCE SPEAKER: You have thought about it. That's good, I am happy. Thanks.
AUDIENCE SPEAKER: Hello, Christoph. Thank you very much for your work. I faced this problem quite a few years ago when I tried to measure a DNS server. I have a practical question. When you try to measure a server, you don't really know whether you hit the limits of the server or of the tool you're using to measure. So --
PETR SPACEK: We have an answer for that, because we have developed a little BPF program which runs inside the kernel and just shoots answers back. If you replace the DNS resolver with this thing, which just reflects back whatever came in, you get totally different numbers, much higher than with the DNS resolver.
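The principle can be illustrated with a userspace sketch (the real tool is an in-kernel BPF program and far faster; this only shows the idea of reflecting queries back so the load generator itself becomes the bottleneck):

```python
# Receive a DNS query over UDP and immediately echo it back with the QR
# bit set, doing no real resolution. Replacing the resolver under test
# with this reflector puts an upper bound on what the measurement tool
# itself can achieve. Port 5300 is a hypothetical choice.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 5300))
while True:
    data, addr = sock.recvfrom(4096)
    if len(data) >= 3:
        # flip the QR bit (top bit of byte 2) so the packet looks like an answer
        reply = data[:2] + bytes([data[2] | 0x80]) + data[3:]
        sock.sendto(reply, addr)
```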
AUDIENCE SPEAKER: And do you run your tests on the same server?
PETR SPACEK: Yes, it's the same hardware.
AUDIENCE SPEAKER: I mean, do the server you are measuring and your tool run on the same hardware?
PETR SPACEK: Ah, okay, you mean the same machine? No, they are sitting in the rack next to each other, connected by a network.
AUDIENCE SPEAKER: Thank you very much for this work. Thank you.
AUDIENCE SPEAKER: Warren, from Google. First off, thank you very much. This is awesome; I really like the idea of using your own data to benchmark. Just one thing to keep in mind, obviously, is traffic-type shift: what you manage to benchmark with this is the steady state.
PETR SPACEK: Of course. Use recent data, not two-year-old PCAPs.
AUDIENCE SPEAKER: From what we have observed on our resolvers, the DNS resolution traffic changes a lot, especially at a high school during the breaks and so on. At first I was thinking that you were going to use machine learning or something like that, to learn the pattern and simulate new clients. Did you look into that? Because if you just take what's later in the PCAP and put it now, it's probably not the same traffic pattern.
PETR SPACEK: That depends on the period. I mean, you can take two days and replay it over two days, and that should smooth it out.
AUDIENCE SPEAKER: Thanks again for all this; it's very cool stuff. We have tonnes of tools like this as well, but they are tiny and more like scripts, so having something a bit more complete would be nice. One thing for those who are afraid of PCAPs because of GDPR and privacy and so on: we have a tool available called dnswasher. You just feed it the PCAP and it will replace all the IP information in there, but it will retain the same clients. It just starts at 1, and that same address in the whole PCAP will be replaced with 1. So, don't be afraid to capture, but please anonymise.
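The anonymisation principle being described, consistent pseudonymisation of client addresses, can be sketched like this (my illustration, not dnswasher's actual code):

```python
# Every distinct client address is consistently replaced by a small
# counter value, so per-client query patterns survive in the capture
# but the real addresses do not.
from itertools import count

class Pseudonymiser:
    def __init__(self):
        self._map = {}
        self._next = count(1)          # first client seen becomes "1"

    def replace(self, addr: str) -> int:
        if addr not in self._map:
            self._map[addr] = next(self._next)
        return self._map[addr]

p = Pseudonymiser()
assert p.replace("192.0.2.7") == 1
assert p.replace("192.0.2.9") == 2
assert p.replace("192.0.2.7") == 1     # same client keeps the same pseudonym
```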
JOAO DAMAS: Thank you, Petr. Next up.
NEDA KIANPOUR: Good morning everyone. I am a lead network engineer, and I am joined by Tyler Shaw from F5. We're going to talk to you about the challenges and successes of DNSSEC-signing F5-hosted zones.
So, last year we officially announced that we were in the process of signing our zones, and to fully meet our customers' compliance requirements we determined that we had to sign our whole footprint, which involved the F5 DNS devices. The F5 DNS devices in our infrastructure listen and respond to queries for the zones hosted on them, and signing the zones on them had its own challenges. But with the help of the F5 guys, and two engineering hot-fixes later, we are now in the process of rolling out DNSSEC in our environment.
Some of the challenges we came across are listed on this slide. I'm going to go into as much detail as time permits, but I'm happy to answer any questions afterwards.
After implementing DNSSEC on the F5 DNS devices, the first issue we came across was the signature inception. We noticed that every time we queried a record in one of these hosted zones, the signature inception date that was returned was set to 0. This was quite concerning, because resolvers with clock skew would have marked those responses as invalid, causing major impact for the clients sitting behind those resolvers. We noticed it just by running a test and using diagnostic tools to see that the signature inception date was set incorrectly. A little bit of research and we came across RFC 6781, which describes a parameter called the inception offset; what it does, basically, is set the signature inception time to some time in the past, and this is what we were looking for to resolve this issue. So we brought this request to the F5 guys, and they added a feature to their code that sets the signature inception to 60 minutes in the past, resolving this problem for us. This is available in version 15.1, but it can be backported to version 12, and I have included the bug ID on this slide. So if you are interested in this, or dealing with a similar issue, you can reference it in your support case with the F5 guys.
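The inception-offset idea from RFC 6781 amounts to the following (a sketch only; the F5 fix implements this inside their signing code, and the 14-day validity period here is a hypothetical value):

```python
# Set the signature inception slightly in the past so that validators
# with skewed clocks do not reject otherwise-valid signatures.
from datetime import datetime, timedelta, timezone

INCEPTION_OFFSET = timedelta(minutes=60)   # the value the F5 fix uses
VALIDITY = timedelta(days=14)              # hypothetical signature lifetime

def signature_window(now=None):
    now = now or datetime.now(timezone.utc)
    inception = now - INCEPTION_OFFSET     # 60 minutes in the past
    expiration = now + VALIDITY
    return inception, expiration

inc, exp = signature_window()
# RRSIG timestamps are formatted as YYYYMMDDHHmmSS
print(inc.strftime("%Y%m%d%H%M%S"), exp.strftime("%Y%m%d%H%M%S"))
```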
The other issue we came across was a vulnerability in the DNSSEC implementation on the DNS devices. This was actually brought to our attention by the good folks at CZ.NIC, so thank you guys, who also advised on the types of attacks that could be performed by exploiting this vulnerability. The issue was that the F5 devices were returning an incorrect NSEC3 record for a DNS query for a record type that does not exist at a given name, even though there were A or AAAA records at that name that could be queried by clients.
The impact of this issue was that resolvers using aggressive negative caching could incorrectly infer the nonexistence of other record types from the NSEC3 record.
F5 acknowledged it and published a knowledge base article advising of a work-around, which was quite complicated. So we had a discussion with them, they came up with a permanent fix, and they gave us another engineering hot-fix. This is the second engineering hot-fix of the implementation we're talking about.
The third issue we came across was the master key change. When you implement DNSSEC on these devices, there is a master key that is used specifically to encrypt and decrypt the DNS keys on the devices. The issue was that sometimes, when a device was rebooted or taken down for maintenance or a code upgrade, the configuration would fail to load, and this was due to the master key changing on the device. When the configuration failed to load, it also affected the DNSKEYs on the device: with no DNSKEYs getting decrypted, no zones were getting signed, causing DNS resolution failures for those zones.
The current work-around is to manually take a copy of the key prior to doing any maintenance and use that key to restore the configuration should you come across this issue. We are still in discussion with them; hopefully they come up with a better plan than just a work-around, but this is one of the issues we came across.
The fourth item, I don't want to call it an issue, is rather a feature request that we have: support for algorithm 13, which is not available at the moment. We have expressed our interest in using it for our DNSSEC implementation, and there is already another company who has expressed interest to the F5 guys; our company is also attached to this feature enhancement request. It's referenced here, so if you're interested, please chime in as well, so we get it working at some stage with your help.
Last but not least was implementing DNSSEC on the DNS devices, formerly known as GTMs, which are in an active stand-alone state, which is the scenario we're dealing with in our infrastructure. In this scenario, both devices are dynamically responding to DNS queries, and to implement DNSSEC on them we had to create separate ZSKs and KSKs and use them to sign the same zones, ending up with two DS records to share with the parent zone. For this model to work, the devices need to be aware of each other's ZSKs, which is something the F5 devices are not capable of today. So the work-around is to have them share the GTM configuration between them, basically by enabling a DNS sync group on them. This is still being tested by us to make sure we have covered all the failure scenarios, so we don't shoot ourselves in the foot. But this is a work-around, and hopefully there will be a better fix for it, by making them aware of each other or having a way of importing the ZSKs from one to the other, so that it works like a multi-signer model.
I couldn't quite squeeze this picture into the previous slide, but this is, at a very high level, how it works in our environment. And that's all we have for today. If you have any questions, we're happy to answer, and Tyler is happy to answer the F5-related questions. So thank you so much.
(Applause)
AUDIENCE SPEAKER: Petr Spacek. First of all, thank you for doing this, because we were hoping that F5 would start taking DNS seriously. Thank you for working on it with them.
AUDIENCE SPEAKER: Edoardo. I have found another problem with the F5 process: the default documented process has auto-rotation of the KSK. Do you know about this problem?
TYLER SHAW: Is the issue that it has auto-rotation?
AUDIENCE SPEAKER: Configuring F5 as documented, it auto-rotates the key.
TYLER SHAW: There are standard timings in there, yeah. Depending on your use case and what type of provider or enterprise you are, you need to figure out those timings yourself. But I'd be happy to talk with you specifically about how we should improve that documentation and/or the defaults.
AUDIENCE SPEAKER: I'll catch you afterwards.
NEDA KIANPOUR: We have left it at the defaults, because it's too complicated; we said leave it as is, don't mess with it.
AUDIENCE SPEAKER: Peter from PowerDNS. We run into a lot of broken F5s, mostly broken in the ways you have just described. What do I tell people who have these problems? Can they get the same hot-fixes?
TYLER SHAW: Absolutely. Neda put the actual bug IDs, or the knowledge base articles that identify where those are at, in the presentation. Some of the fixes, like the NSEC3 issue that Petr found and reported to us, are in code now; that one has gone beyond a hot-fix and has made it into our mature code. So those specific things can be found within this presentation.
AUDIENCE SPEAKER: And those have been backported as well?
TYLER SHAW: As far as they can be, yes.
NEDA KIANPOUR: To version 12, I think.
TYLER SHAW: The bug IDs are called out specifically, so people can check whether the fix is in the version they are currently running. If it can be backported, they can reference those IDs in a support case to F5 and have an engineering hot-fix built for it.
AUDIENCE SPEAKER: Even if they are not Salesforce?
TYLER SHAW: Even if they are not Salesforce.
NEDA KIANPOUR: This is why we're here: to give you the advantage of access to the F5 guys.
JOAO DAMAS: Okay. Well thank you very much. We'll see how people respond to it. Thank you very much.
Next up is Austin, talking about the costs and benefits of DNS, DoT and DoH, the hot topic of the day.
AUSTIN HOUNSEL: Hi everyone, I am from Princeton University, and I am going to be talking about research we have been doing on measuring the performance of DoH, DoT and traditional DNS, in terms of query response times and page load times, and also under emulated network conditions.
So, I probably don't have to tell you this, but DNS privacy has become a significant concern: on-path network observers can observe and spy on traditional DNS traffic, or Do53 as I'll call it for the rest of this talk. Two protocols have been proposed to encrypt DNS traffic: DoT in 2016, from RFC 7858, and DoH in 2018, from RFC 8484. The general idea is that we want to provide the confidentiality guarantees that traditional DNS lacks. This might have significant performance impacts, and as researchers we wanted to measure them.
The primary things we measured are query response times, just to see what the impact of using DoH, DoT or Do53 is, and also page load times: from the user perspective, what does it mean for an end-to-end page load if you use any of these protocols, with different public recursors and also the local recursive resolver. We didn't actually perform measurements with real cell phones, but we nonetheless wanted to see, if we performed some traffic shaping to emulate 4G and 3G settings, how that would impact page load times and query performance. We performed these measurements from five global vantage points: Ohio, northern California, Frankfurt, Seoul and Sydney, just to see what this would look like from a global perspective.
Just for the sake of time today I'm only going to be focussing on our measurements from Frankfurt.
The kind of unexpected finding from this work was that, despite the higher response times you might get compared to unencrypted DNS, page load times with encrypted DNS transports can actually be lower, which, going into this research, we did not expect at all. Towards the end of this talk I'll go into why we think that might be and dig more into the data.
First, we're just going to look at query response times. As mentioned, we performed these measurements with different large public recursors, Cloudflare, Google and Quad9, and then we performed measurements with the local recursive resolver that was given to us on Amazon EC2. Starting with Cloudflare, we can see this interesting behaviour where DoT starts out faster than DoH, but then DoH seems to catch up: for about 50% of queries, DoH is actually outperforming DoT. And of course we're seeing that traditional DNS, not only to the local recursive resolver but also to Cloudflare, is generally outperforming DoH and DoT. These are just CDFs of DNS response times, and for the rest of the talk, whenever I say default Do53, I'm referring to the local recursor provided by Amazon EC2.
Also, importantly, the local recursor does not support DoT or DoH, so we were only able to get unencrypted Do53 measurements from Amazon.
Another interesting behaviour we observed is that DoH is actually outperforming DNS, or rather Do53, in the tail. I'll go into why we think this might be a bit later. This is unexpected behaviour that we are interested in digging more into.
We are seeing similar behaviour for Google. You do see this behaviour where DoH actually catches up to DoT in response times; it happens a bit later than with Cloudflare, but nonetheless we're seeing the same general trend. And we're also seeing that DoH is actually outperforming Do53 and DoT in the tail response times.
And it's also outperforming the local recursive resolver provided by Amazon.
And lastly, with Quad9, we're seeing a bit more interesting behaviour, in the sense that DoH is outperforming DoT. We disclosed our results to Quad9: we're seeing this interesting behaviour with DoT where it seems like there might be some caching that may or may not be correctly configured for DoT, and this is something we're actively looking into. But nonetheless, we see that once again DoH is catching up to Do53 in the tail response times, which, seeing that across all three large public recursors, is once again something we did not expect, and it really sticks out to us.
The takeaway here is that DoH can actually catch up to, and even outperform, Do53 in response times. Another way of putting this is that it has a higher mean query response time but a much lower variance. One possible explanation for this is HTTP caching, where the entire response might get cached because the DNS message ID is set to 0, which was recommended in RFC 8484 to prevent two semantically identical queries from being cached in different ways. And being able to pull something out of an HTTP cache may be faster than having to parse and construct a new DNS response. Again, we're actively looking into this and trying to come up with explanations, so if anybody would like to talk to me after this presentation about why you think this might be the case, I would love to hear it.
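The RFC 8484 detail being referred to can be sketched with dnspython (assuming dnspython is installed; the endpoint shown is Cloudflare's public DoH URL):

```python
# With the DNS message ID forced to 0, two semantically identical queries
# serialise to identical wire bytes, so a GET for them has an identical
# URL and can be answered from an ordinary HTTP cache.
import base64
import dns.message

q = dns.message.make_query("example.com", "A")
q.id = 0                                   # identical queries -> identical bytes
wire = q.to_wire()
# RFC 8484 uses base64url without padding for the "dns" query parameter
param = base64.urlsafe_b64encode(wire).rstrip(b"=").decode()
print(f"https://cloudflare-dns.com/dns-query?dns={param}")
```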
Next, we also wanted to measure page load times, and in addition to measuring them in the default network configuration that Amazon EC2 gives us, we wanted to measure them under emulated cellular conditions. The idea here is that DoH and DoT are starting to be offered on phones, whether it's Android 9 supporting DoT or Cloudflare putting out their 1.1.1.1 app, and we think it's very interesting to see how performance differs on phones. So we performed traffic shaping on Amazon EC2, using network conditions provided by OpenSignal for 4G, lossy 4G and 3G, to see what the performance would look like for query response times and page load times. Again, for the sake of time, I'm only going to focus on page load times, and only on Cloudflare.
What we're seeing, from our measurements in Frankfurt under the default network conditions, are CDFs that show the differences in page load times between, for example, Cloudflare DoT and Cloudflare Do53, or Cloudflare DoH and Cloudflare Do53, and the vertical line shows the median. If you pick any two protocols, they are performing within 30 milliseconds of each other in page load times. This is without traffic shaping of any kind: DoT was 1 millisecond slower than Do53 in the median page load time, and DoH was 16 ms slower. The pages we loaded were from a list of websites called the Tranco top list, which I believe was published at NDSS; it takes the existing top lists and averages them over a period of time. We took the top thousand websites, and then we took the bottom ones, I believe it was websites 99,000 to 100,000. We wanted to combine these together to measure not only the highly optimised websites that are hosted on CDNs, but also websites that are not. Importantly, these graphs put all of these websites together and show all the data as one graph.
Now I'm moving on to our 4G settings. Again, these are based on settings provided by an OpenSignal report. We traffic-shaped our measurements with an additional 50 milliseconds of latency, 0.5 percent packet loss and 8 milliseconds of jitter. We also changed the uplink rate, to I believe 7.44 megabits per second, and the downlink rate to 22 megabits per second. This shows that DoT actually speeds up compared to Do53: instead of being 1 ms slower, it's now 11 ms faster, and this is with traffic shaping. DoH has gotten slower, where it's now 5 ms slower than Do53. So we're seeing this really interesting behaviour where one protocol is getting faster and another one is getting slower. Things change even more in the lossy 4G setting, where we have taken the 4G settings and increased the loss rate to 1.5 percent packet loss. Now DoT has become 101 ms faster in the median page load time compared to Do53, and DoH is 33 milliseconds faster. Put another way, both encrypted protocols are faster than unencrypted DNS in the median page load times, even in these lossy emulated conditions, which was really surprising to us. And in the 3G settings, both protocols collapse compared to Do53: DoT is 156 ms slower than Do53 and DoH is 310 ms slower. The takeaway here, we believe, is that TCP might actually be helping page load times: lost TCP segments can be retransmitted much faster than lost UDP queries, where the default Do53 timeout in Linux's resolv.conf is set to 5 seconds. These kinds of differences in retransmission times might explain why the encrypted protocols can perform better in median page load times than the unencrypted protocol.
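The kind of shaping described for the 4G profile can be sketched by driving tc/netem from Python (Linux only, requires root; the device name and exact parameters are assumptions taken from the numbers above, not the authors' actual scripts):

```python
# Apply latency, jitter, loss and a rate cap matching the emulated 4G
# uplink profile described in the talk, then remove it afterwards.
import subprocess

def shape_4g(dev="eth0"):
    cmds = [
        # latency, jitter and loss on the egress path
        ["tc", "qdisc", "add", "dev", dev, "root", "handle", "1:",
         "netem", "delay", "50ms", "8ms", "loss", "0.5%"],
        # cap the rate to the emulated uplink
        ["tc", "qdisc", "add", "dev", dev, "parent", "1:", "handle", "2:",
         "tbf", "rate", "7.44mbit", "burst", "32kbit", "latency", "400ms"],
    ]
    for cmd in cmds:
        subprocess.run(cmd, check=True)

def reset(dev="eth0"):
    # remove all shaping from the device
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
```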
In summary, we measured Do53, DoT and DoH performance from five vantage points. This talk only focused on Frankfurt, for the sake of time, but in our full paper we show results for the other vantage points. In future work we want to do analyses over diverse networks, for example residential ISPs, and we'd love to hear from the community about other interesting performance measurements that you would like to see.
Thanks.
(Applause)
AUDIENCE SPEAKER: Jen Linkova. It would be really interesting to see the measurements from real devices, because I suspect what you are going to see in real life is that, when you are using normal DNS, even if you are trying to ask 8.8.8.8 or 1.1.1.1, the response will not be coming from them in many ISPs. It will be coming from something which sits much closer to the user and pretends to be the real resolver you are asking, so you might actually see much faster response times over UDP than for the encrypted protocols. So I'm looking forward to seeing measurements from real devices, real networks.
AUSTIN HOUNSEL: Yeah, that's something we are doing right now: performing measurements from whitebox devices that are actually located within residential ISPs. So that is something we want to see as well, and we're trying to measure it.
AUDIENCE SPEAKER: Jelte. One other measurement I would like to see added is plain-text DNS over TCP, because that would tell you whether DoH or DoT have something extra that could explain the differences, instead of it just probably being TCP, which I also expect. That would be another one to add, I think.
AUSTIN HOUNSEL: Absolutely. Thanks.
AUDIENCE SPEAKER: Petr Spacek. I have a question about whether you have taken into account the cache hit rate when you are comparing the default, which means Amazon's resolver, with Cloudflare or whoever else. Did you take into account the difference in cache hit rate? Because in the long tail, where DoH to the cloud resolver outperforms the local resolver, it might be caused by the difference in cache hit rate: Cloudflare has a much higher cache hit rate than the local resolver, right?
AUSTIN HOUNSEL: Right. To be clear, our measurements were performed continuously over the span of about three and a half weeks, and it might be the case that domains appearing in the Amazon resolver's cache were not also in the Cloudflare cache; these big resolvers are being queried by users from web browsers. So it might just be that the same domain names are not in the caches of both resolvers, like Cloudflare and Amazon, and that might totally be the case. But we're still seeing that even within Cloudflare, the DoH queries are outperforming the Do53 queries in the tail. But definitely, with Amazon, they might just not have the same domains in the cache, so the cache hit rates for the same domains are probably much higher within Cloudflare, Google or Quad9 than within the Amazon cache.
AUDIENCE SPEAKER: Thanks. Let's talk in the hallway.
AUDIENCE SPEAKER: Jim Reid. I think this is interesting work, and I'm very, very glad to see that you and your colleagues are starting to research and analyse this. There is a lot that needs to be done here. A couple of suggestions for you. I think it might be helpful to have a clearer separation between measurements to do with the DNS transports and how they are being used by browsers and other applications, as opposed to page load times, because you might be comparing apples and oranges a little bit. So it may be better to look at the overall impact on page load times, see things like DoT is better than DoH, which is better than vanilla DNS, in certain circumstances, and have some clearer metrics around that. For example, it might be a good idea to see if you can measure the latency for page loads where the DNS lookups are eliminated, because the web servers are returning the DNS answers along with the application data, so the end device is not having to do the DNS lookups to resolve those names in order to fetch content from URLs. I think that would be something very, very worthwhile doing.
Another potential project for the future is going to be assessing the impact of Cloudflare's lack of ECS support in DoH. Could you perhaps stand up a web server on another CDN, as opposed to Cloudflare, and compare what page load times are when you are on a Cloudflare CDN node as opposed to, say, an Akamai or Amazon CDN node?
AUSTIN HOUNSEL: Both of these things are very interesting. One thing we note in our full paper is that the page load times with Cloudflare, which as you mentioned generally does not support ECS, were actually faster than the page load times using Google's resolver, which does support ECS. This is something we want to dive more into; for example, we could set up a web server and try to play around with it. The preliminary results we have do seem to suggest that the lack of ECS support may not be hurting page load times, and may even improve them, with Cloudflare.
JIM REID: I'd love to have a chat with you later this week.
AUDIENCE SPEAKER: Dave Lawrence, Oracle. I actually got up to make the same comment, that it would be interesting to compare plain-text TCP, but more specifically, a deep dive into that should include the difference between persistent connections, whether encrypted or not, versus the not uncommon TCP implementations that don't keep persistent connections open. And to add to Jim's observation on the ECS issue: we have seen a couple of presentations recently that have pointed out the importance of cache hit rates and TTLs, so I think that's going to explain a lot of the difference that comes up with ECS.
AUDIENCE SPEAKER: Willem Toorop. I want to suggest you also do measurements from vantage points which are a little further from the cloud DNS providers than Frankfurt; for example, what would DNS over HTTPS mean for people in Africa?
AUSTIN HOUNSEL: It's funny you mention that. There are some colleagues we're working with at a university in South Africa, and we are actively working with them to perform measurements within different countries in Africa to actually get those numbers. So...
CHAIR: Thank you very much Austin. We look forward to you coming back with more results soon.
(Applause)
FLORIAN OBSER: Good morning everyone. I am Florian, from the RIPE NCC. I am giving you the DNS update: things we worked on since the last meeting and things we plan to work on in the coming period.
There is no change in the DNS team this time around. As our first update: at RIPE 78 we informed you that Verisign sold the secondary DNS service that we were using for DDoS mitigation purposes to Neustar, and that we were evaluating whether we wanted to stay with the contract or go with someone else. We decided to stay with the contract, and we completed the migration in September this year.
The next update is on CDS and CDNSKEY. We were asked by the community to implement this to ease DNSSEC deployment. We had hoped to have it ready for this meeting; we are nearly, nearly there, but we did not want to rush it, so now we think we will have it next month. What this is about is: if you already have a DS record in the RIPE Database, we will track the child zone, look to see if a CDS record is present, check if it's valid, and then update the database.
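A sketch of that polling flow (my illustration, not the RIPE NCC's implementation; assumes dnspython and a hypothetical zone name):

```python
# For a delegation that already has a DS record, poll the child zone for
# a CDS record and derive candidate DS data from it.
import dns.resolver

def poll_cds(zone: str):
    try:
        answer = dns.resolver.resolve(zone, "CDS")
    except dns.resolver.NoAnswer:
        return None                       # child has published no CDS yet
    # In a real implementation the CDS answer must be DNSSEC-validated
    # against the *current* DS/DNSKEY chain before it is trusted.
    return [rr.to_text() for rr in answer]

new_ds = poll_cds("example.nl")           # hypothetical delegated zone
if new_ds:
    print("candidate DS records:", new_ds)
```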
To establish trust initially, you still need to go to the database and put your DS record there yourself.
Updates on K-root. Since RIPE 78, we have brought three more sites online, and we have three more in the pipeline that are not online yet. The first one is Beijing, China. It does about 2,500 queries per second, almost completely from within China. The second one is in Berlin, Germany, at 250 queries per second; the queries are completely from Berlin actually, which is a bit surprising. And the third one is in Salvador, Brazil, with 5,000 queries per second. This is mostly South and Central America; curiously, it also leaks to New Zealand, and I have not looked into whether that is actually a good thing or not.
More about K-root. This is what a day in the life of K-root looks like: queries per second over a whole day. The first two colours on top are the hosted nodes around the world, run by the community, by volunteers, and they serve about 50% of the query load. Below that, the other colours are the core nodes that we run. This is about 140,000 queries per second at peak time on a normal day.
Comparing this with AuthDNS, we see about 200,000 queries per second during a normal day. K-root is currently present at 70 locations; we recently did a hardware refresh on the core sites, and adding new hosted nodes is routine for us. We need to look at AuthDNS more: it's only present at four locations, we need to do a hardware refresh there, and we are also thinking of adding hosted nodes there. We currently run one experimental one. We do not have a concrete timeline for that.
That's it from me. Are there any questions or comments?
AUDIENCE SPEAKER: Peter Koch, DENIC. Thanks, Florian, for that update. If you go back a few slides, where you had the K-root instances, the three additional ones, and ignoring Berlin for a second, which I can easily explain: these 7,500 queries per second seem to be a non-negligible part of the overall query load. Did you have a chance to look into where they disappeared from, which other nodes got less load, and whether it was operationally relevant in any case?
FLORIAN OBSER: Not for these nodes, unfortunately. For those, I do not know. What we noticed when we brought up the site that Anand talked about at the last meeting is that those queries were not disappearing from K-root at other sites; we actually took queries from other letters. I have not had a chance to look into where these are coming from.
PETER KOCH: That would be like 2% of the total load, if I'm not completely mistaken, so you might want to look into that.
JOAO DAMAS: No one else? Thank you very much Florian.
(Applause)
Next up, Geoff Huston on the resolvers we use.
GEOFF HUSTON: Morning all. I work with APNIC. This is a report on some measurements we have been doing across this year to try and understand some of the substance of the debate behind the lightning-fast standardisation of DNS over HTTPS. Any of you who have been vaguely alive in the last few months will have seen considerable debate over DoH. On one side there is this whole issue that the DNS is phenomenally abused; it's one of the few remaining open protocols, and everyone looks at it, everyone manipulates it, everyone plays with it, and maybe it's time we stopped. The privacy argument basically says this needs encryption and it needs it now.
On the other side, you have to acknowledge that the DNS is one of the few remaining pieces of common infrastructure, and there is a huge amount of diversity in it, because there is a huge diversity of ISPs themselves. With DoH, all that changes. Instead of being infrastructure that comes with your ISP, you have to admit the possibility that with DoH your application can make independent choices and resolve names without letting the platform or the ISP in on the story. And there might only be a few folk actually offering that service, so all of this massive amount of query information might go to a very small number of players.
The centrality argument.
So, will DoH make centralisation better or worse? Of course, that question is hard to answer without knowing the starting point, and the starting point is: how centralised is the DNS today? You can say a whole bunch of things, but it would be really good to have some data about it. So, we decided to have a look.
We run an ad-based measurement network; thanks to the generous support of Google Ads, we run between 12 and 15 million ads a day. Basically, we ask users, when the ad gets presented: don't click, we'd have to pay more. When the ad simply appears on your screen and you leave it alone, it goes and fetches a small number of URLs. The domain names are always unique, which eliminates caching, and all of the domains, and the corresponding URLs, are served by servers we run around the planet. So all of the interaction between the client running the ad and the target is with our servers. And what we do in this particular exercise, because we're the authoritative DNS server for all these queries, is match the queries we see against the resolvers they come from and map them back to the original user.
So that's what that says.
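The cache-busting trick of always-unique names can be sketched like this (the domain names here are hypothetical):

```python
# Every ad impression fetches a DNS name that has never been seen before,
# so no resolver cache can answer it and every query must reach the
# experiment's authoritative servers.
import uuid

def unique_probe_name(experiment: str,
                      base: str = "example-measurement.net") -> str:
    return f"u{uuid.uuid4().hex}.{experiment}.{base}"

print(unique_probe_name("ad1"))
# e.g. u1f3c9f0e2a4b4d9e9c1a7b3f5d2e8a10.ad1.example-measurement.net
```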
First answers. Kind of weird. The first thing you have to understand is that some ISPs cover a huge number of customers: there are not that many large-scale providers in the Philippines, but an awful lot of people live there, so oddly enough that particular resolver is one of the biggest resolvers we can see as a single IP address. In other cases, if you look hard, what you find, for example in Reliance Jio, is a whole bunch of almost sequential addresses, each handling a huge amount of load, all sitting inside one subnet, and I see it in v4 and in v6. This looks clearly like a bunch of engines, a farm with a front-end load balancer: the query comes in, gets fed to a machine, probably by hashing the domain name, and off it goes. So maybe we should look at this a different way, and the simplest way of grouping this up is to bunch it up by origin AS. But as soon as you do that, you start to see some pretty strange things. Google now covers around 9.4% of all users on the planet: their first go-to resolver is Google. There are also some very big ISPs out there. There are a lot of Chinese providers: China Unicom, and of course China Mobile and the China Backbone; there are an awful lot of Chinese users of the Internet. Oddly enough, the next one is 114DNS; there is a huge number of open resolver users in China. After that comes OpenDNS, Chinese, Chinese, Cloudflare. So there are not that many open resolvers sitting inside that list.
Now, one thing you probably notice when you actually look inside your own resolver configuration is that you don't necessarily have one resolver. You might have two; if you are feeling silly you might have ten; if you are feeling insane, you might have 100. What might be interesting to understand is: if we look at the entirety of the set, do those numbers change? So we set up a subtly different experiment where, no matter what the question, the answer is SERVFAIL, and if you know your DNS, what happens with SERVFAIL is that you use the next resolver in the list, and the next, and the next.
Almost one quarter, 22.48 percent, of the world have Google listed somewhere, which is actually really quite large. So the list is subtly different, but Google is certainly still there. OpenDNS rises, and there is Level 3, which oddly enough isn't used as a first resolver but is used far more often as a second or third. Cloudflare; OneDNS, which is Chinese. So, again, not as many open resolvers as you'd think. But nevertheless, you are now seeing clustering: three resolver farms handle 30 percent of users. But if you strip out Google, you have got to understand that the other two are China and India, and there are an awful lot of users there. So what we are really saying is that the distribution of users behind ASes is heavily centralised, and the DNS is just an artifact of the user populations of ISPs, with the exception of Google. 90% of users are covered by basically 450 of those visible resolver sets. So there is some degree of centralisation, but realistically most of this is all about Google. So here is the worldwide figure, day by day over some time: 55% of users use resolvers in the same AS, 40% use resolvers that we can geolocate to what looks like the same country, 23%, as we saw, use Google, and then there is everyone else, literally everyone else.
What about who uses open resolvers around the world? Let's do the converse: where do we see people using their own ISP's resolvers? Oddly enough, India, Iran and bits of Africa seem to be the outliers, the places where it's not as high; almost everywhere else, you just use the default. Most customers just use what they are given. Where are open resolvers common? Africa -- Google. Iran -- Google. Mongolia, sort of Google. Myanmar, Papua New Guinea, bits of Brazil.
So, where is Google actually used, as a percentage of the country? Huge numbers of ISPs in Africa send all their queries to Google. This is forwarding: it saves them doing the DNS themselves. We see some evidence of it in Iran, Mongolia, Myanmar. Wow, is that Suriname or somewhere else? French Guiana. Let's invert this: if you were Google, where are your users? Oddly enough, the greatest population of Google's users as a pool is India, and the second greatest is China. Even though as a share of the Chinese population Google isn't that high, there are an awful lot of Chinese using the Internet, and so they rank high in terms of Google users. Why? I don't think it's centrality so much as an issue around Google's Public DNS. Over the year, 9 percent of users used Google as their first choice, and 23% have it somewhere in their choice set. Very rarely do people fiddle with the DNS themselves; it's not common. Almost all the other times, when we see it, it's just your ISP going: it's just cheaper, I'll send the queries to Google, then I don't have to run the DNS. So just remember, most users never twiddle the knobs. What does the Netherlands look like? Well, 66 percent use their own network, 55 percent basically use resolvers in the country, 19 percent of users in the Netherlands use Google, and then everyone else, which in this case is Cloudflare, Quad9 and OpenDNS.
So, is the DNS centralised? No real evidence; quite frankly, no. Google, yes, but that's ISPs, not users. All the other open resolvers, in spite of all the hype, all the fanfare: no one else gets a look in. There is just no data, in terms of volume, going to the other open resolvers.
But this leads to a few questions that I think we should all spend some time thinking about. Like the content world, where if you want your content to survive you can only use one of five publication platforms, because everyone else will die at the slightest amount of DDoS pressure, is the DNS under pressure to aggregate into a small number of both servers and recursive resolvers? The depressing answer is yes. Will it continue to be free? Yes, but the only way it's going to be free is if the data is exploited and monetised. Your queries are valuable; they will get sold, if they are not being sold already. That's obvious.
Can we still use these common resolver caches and reduce the amount of information we leak? Are we all going to be running Oblivious DNS? I doubt it. It's too complicated. None of us understand it. That's not going to happen.
So, what's going to happen instead? My suspicion is that DoH actually opens up an entirely new world of possibility that says: I can take the namespace, and in my application I can run DoH, but I don't have to go to a server that also exists in the public DNS. I can go to an application-specific namespace and do my stuff encrypted between me and that application-based service. Are we going to blow up the DNS into tiny applications and fragment this entire space? You betcha. Just watch it happen.
Thanks.
JOAO DAMAS: Okay. Questions anyone?
AUDIENCE SPEAKER: Hi Geoff. Allison Mankin. A question: does your data show before and after for the US with Mozilla's decision? Because it would be interesting to see what kind of impact a decision about defaulting can have on the overall picture, and that could help justify your picture of the future, if you see trends like that.
GEOFF HUSTON: Hard volume data starts on June 1. When did Firefox turn it on in their release set? Post June 1. Right, we need the fingerprint knowledge to map it, so we can see Mozilla users and the resolvers they use; so hidden deep in the data is that data. I haven't actually lifted the lid to see it yet. But it can be and will be done.
AUDIENCE SPEAKER: Robert, RIPE NCC. I notice that North Korea has a very distinct colour in many of your maps. Is that because there is no data, or is it because they are different?
GEOFF HUSTON: No, they are not different. Oddly enough, there are actually mobile phones in that country; they all play games, games have ads, ads run this script, we see them. What we see is those systems ‑‑ I think I saw the use of Google; is that what you picked up on? They use Google. It doesn't lie. So, why not? Yes.
AUDIENCE SPEAKER: The other question is ‑‑ IPv6 seems to be a topic nowadays ‑‑ do you have separate data for v6 proportions, or any data on that?
GEOFF HUSTON: No. This was a topic a few RIPE meetings ago, where I talked about DNS happy eyeballs and this whole issue of the transport you use for the DNS. Do not forget one thing about v6: when you use any kind of extension header, the survival rate of that packet through the network is less than 67 percent. So one third of the time, a packet with an extension header in v6 doesn't make it. Fragmentation is an extension header. DNSSEC means large responses, and large responses over UDP lead to fragmentation. And so the real issue is, if you are running v6 in the DNS, think very, very carefully about fragmentation, because fragmented UDP and v6 as a combination is close to toxic, which means you have really, really got to wonder how to do a quick fallback to TCP if you are going to run v6. Would it be different? It's not. And so part of the caveat when you turn on v6 in the DNS is: be awfully careful about what you are doing, monitor the size of the responses, and what you might want to think about is setting the TC bit and truncating to avoid fragmentation, because any answer is better than waiting for an answer that never comes. End of rant.
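As a rough illustration of the fallback being described here ‑‑ a sketch only, assuming the dnspython library and a placeholder resolver address ‑‑ a client can advertise a small EDNS buffer and retry over TCP when the TC bit comes back:

    import dns.flags
    import dns.message
    import dns.query
    import dns.rdatatype

    SERVER = "192.0.2.53"  # placeholder resolver address

    # Advertise a small EDNS UDP buffer so the server truncates
    # instead of sending a fragmented response.
    query = dns.message.make_query("example.com", dns.rdatatype.DNSKEY,
                                   use_edns=0, payload=1232,
                                   want_dnssec=True)
    response = dns.query.udp(query, SERVER, timeout=3)
    if response.flags & dns.flags.TC:
        # Truncated: fall back to TCP for the full answer.
        response = dns.query.tcp(query, SERVER, timeout=3)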
AUDIENCE SPEAKER: I am still trying to understand one thing about your data on in-network versus in-country resolvers. Using the Netherlands as an example: as I recall, it said like 66 percent of users in the Netherlands were using their in-network resolver, but only 55 percent were in-country. So in some of these cases you are geolocating their network to outside the country.
GEOFF HUSTON: The first issue is you might use two or three resolvers for one question, because the DNS is extravagant. If you use your ISP's resolver, that's one; if you use a different resolver that is in the same country but not your ISP's, that's a different one; and if you use an open resolver in another country, that's a different-country answer. It might sum up to more than a hundred percent. Why? Because the DNS loves asking questions.
AUDIENCE SPEAKER: Very quickly. The open source DNS vendors are planning to decrease the default UDP buffer size to 1232, so fragmentation won't be a thing by default.
GEOFF HUSTON: You are going to turn on truncation and TCP at ‑‑
AUDIENCE SPEAKER: We set UDP at 1232, so the truncation will happen quicker and there won't be any fragmentation in the DNS packets.
GEOFF HUSTON: Why don't you think about using ATR? Thank you.
SPEAKER: This talk is about XoT ‑‑ zone transfers over TLS ‑‑ and all the fun things you can do with XoT.
Seriously. So, currently zones are transferred over the Internet with DNS, and TSIG doesn't help because, while it does data authentication, it does not provide confidentiality. So, some operators use IPsec to distribute their zones from the primary over the Internet, but we don't want you to have to do that kind of hacking ‑‑ we want this to be available within the name servers, so it would be on by default. Hence transfers over TLS, which is XoT.
So, just doing transfers over TLS will already provide confidentiality, so it protects you against passive eavesdroppers, but not against active in-path attackers that hijack the TLS session; for that you need a bit more authentication, which I will address later.
Also, with the development of DoT, DNS over TLS, a lot of thought went into how to do DNS over TLS efficiently, and none of that is currently used for transfers. So there is a lot of low-hanging fruit: performance optimisations for transfers that can benefit from this work.
So, on the left side you see how transfers currently work according to RFC 1995. When there is a new version of the zone, the primary notifies the secondary with the serial number. The secondary requests the serial from the primary to check that it has really changed, and then does an incremental transfer according to RFC 1995 ‑‑ over UDP first, falling back to TCP if the reply would be too large. But all the open source software that we have seen uses TCP directly and does not do the initial UDP query.
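As a minimal sketch of that flow ‑‑ assuming the dnspython library, with placeholder primary address, zone name and serial ‑‑ the serial check and the incremental transfer look roughly like this, going straight to TCP as the open source implementations do:

    import dns.message
    import dns.query
    import dns.rdatatype

    PRIMARY = "192.0.2.1"       # placeholder primary address
    ZONE = "example.com"        # placeholder zone name
    local_serial = 2019101601   # serial of the copy the secondary holds

    # Step 1 (after a NOTIFY): ask the primary for its current SOA serial.
    soa_query = dns.message.make_query(ZONE, dns.rdatatype.SOA)
    soa_response = dns.query.udp(soa_query, PRIMARY, timeout=3)
    remote_serial = soa_response.answer[0][0].serial

    # Step 2: if it changed, request an incremental transfer (RFC 1995)
    # over TCP, skipping the initial UDP attempt.
    if remote_serial != local_serial:
        for message in dns.query.xfr(PRIMARY, ZONE,
                                     rdtype=dns.rdatatype.IXFR,
                                     serial=local_serial):
            for rrset in message.answer:
                print(rrset)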
So if you have a large, dynamic, DNSSEC-signed zone, where the signatures on the records change regularly, or if you have a lot of different zones going from primary to secondary, then the rate of these incremental transfers can be really high. Currently, every change means a new TCP session, so the primary is easily flooded with TCP sessions.
So the performance optimisation would be to keep the TLS session which has just done a transfer open, and use it for all the transfers between this primary and secondary ‑‑ not only for a single zone, but for different zones as well.
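A rough sketch of that reuse ‑‑ assuming Python's standard ssl module and dnspython, with placeholder host and zone names ‑‑ where one TLS session on port 853 carries transfer requests for several zones (a real transfer can span multiple response messages; this reads only the first per zone for brevity):

    import socket
    import ssl
    import struct

    import dns.message
    import dns.rdatatype

    PRIMARY = "primary.example.net"   # placeholder primary name

    def read_exact(conn, n):
        # TLS sockets can return short reads; loop until n bytes arrive.
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise EOFError("connection closed mid-message")
            buf += chunk
        return buf

    context = ssl.create_default_context()   # verifies chain and name
    with socket.create_connection((PRIMARY, 853)) as sock:
        with context.wrap_socket(sock, server_hostname=PRIMARY) as tls:
            # One TLS session, reused for the transfers of several zones.
            for zone in ("example.com", "example.org"):
                wire = dns.message.make_query(zone,
                                              dns.rdatatype.AXFR).to_wire()
                # DNS over TCP/TLS prefixes each message with a 2-byte length.
                tls.sendall(struct.pack("!H", len(wire)) + wire)
                (length,) = struct.unpack("!H", read_exact(tls, 2))
                reply = dns.message.from_wire(read_exact(tls, length))
                print(zone, len(reply.answer), "RRsets in first message")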
So, TSIG provides the data authenticity, and TLS, even done opportunistically, provides general confidentiality; but to be protected against TLS hijacking we also need channel authentication. With TLS this can be done in two different ways. Strict means that the client validates the name in the certificate of the primary, so the secondary validates that it is really connected to the primary DNS server. Mutual means the primary also authenticates the secondary name server. For various reasons, we think the most convenient default is that the secondary authenticates the primary, while the primary still authenticates the client by way of an ACL ‑‑ and also by the data authentication it is already using, the shared TSIG secret. Other authentication is possible, so we would like your feedback on that.
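A minimal sketch of the two modes from the secondary's side, using Python's ssl module; the certificate paths are placeholders:

    import ssl

    # Strict: the secondary validates the primary's certificate chain and
    # name, which is what ssl.create_default_context() gives you.
    strict = ssl.create_default_context()

    # Mutual: the secondary additionally presents a client certificate so
    # the primary can authenticate it at the TLS layer as well.
    mutual = ssl.create_default_context()
    mutual.load_cert_chain(certfile="secondary.crt",   # placeholder paths
                           keyfile="secondary.key")

    # In the default described above, the secondary uses strict TLS while
    # the primary keeps authenticating it via ACL and the TSIG secret.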
So the secondary side is very easy to implement; it's already in Unbound. The primary side, in the setup I have just described, is even easier, and can be done with a TLS proxy in front of a normal authoritative DNS server.
ALLISON MANKIN: Maybe you can do the slides. A question arises, as it does with any encrypted protocol: how much does the traffic pattern still reveal, and do you need padding? That is still a consideration while we're working on the spec. We have done measurements, and what we have found is that the answer is 'yes, probably', as opposed to knowing exactly what to do.
A quick preview of a surprising thing: DNSSEC actually helps you a lot here, because the DNSSEC re-signings create the equivalent of flak. The signatures are valuable in themselves, but they also create a lot of noise that makes it harder to tell what is going on ‑‑ how many records are being changed, how much updating is happening. And I should have said before: what leaks when you don't pad is that people can tell something about the size of your zones and something about the frequency of your updates, and that can be very valuable information for somebody.
So we have a series of data collections that show this. The first one shows you what you can see, encrypted, for one change, one addition, two additions, three additions, four additions, in a regular pattern of updates. A lot of zones don't change very often, so you might not see this regular pattern, but you would probably see a fairly predictable packet size if you have X number of changes or X number of additions. That's what this is saying. It's actually difficult to be sure whether it's changed records: two changed records can look the same as four added records, because each changed record is a deletion plus an addition. But there is still information to be garnered here by the dedicated folks who try to look at what's going on in an encrypted packet.
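As a sketch of the padding countermeasure this implies ‑‑ assuming dnspython's EDNS options, with an illustrative block size ‑‑ a sender can round every message up to a fixed block so that one change and four changes look alike on the wire:

    import dns.edns
    import dns.message
    import dns.rdatatype

    BLOCK = 468  # illustrative block size; RFC 8467 discusses policies

    query = dns.message.make_query("example.com", dns.rdatatype.SOA,
                                   use_edns=0)
    base = len(query.to_wire())
    # The EDNS(0) padding option (code 12, RFC 7830) adds a 4-byte
    # option header plus the padding bytes themselves.
    pad_len = (-(base + 4)) % BLOCK
    query.use_edns(edns=0, payload=1232,
                   options=[dns.edns.GenericOption(12, b"\x00" * pad_len)])
    assert len(query.to_wire()) % BLOCK == 0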
Here is an example to show you what's going on when you have an IXFR in DNSSEC, and the main point here is that the IXFR in DNSSEC has a lot of padding of its own. If you are using NSEC3, for example, you have 28 records in the response for one SOA change: that has to do with the four SOAs which RFC 1995 uses to frame an incremental transfer ‑‑ it's an efficient way of doing IXFRs ‑‑ and then the 12 removes and 12 adds, and there are 12 removes and 12 adds because of NSEC3 records and things like that.
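Spelling out the arithmetic of that slide, with the counts as just described:

    # One SOA change in an NSEC3-signed zone, as in the example:
    soas = 4        # RFC 1995 frames the incremental transfer with SOAs
    removes = 12    # deleted records, NSEC3 chain entries and signatures
    adds = 12       # their replacements
    assert soas + removes + adds == 28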
So, here is an example where there is not much change happening in the zone, but there is periodic signing, and what you are going to see is a similar pattern whether you are changing the zone or whether you are just re-signing it frequently. And the size is somewhat stereotypical: it has to do with one SOA change, the roughly 3-kilobyte packet that we saw ‑‑ it's a TCP packet you can pick out of your encrypted channel. Then there is one record update in this particular data collection, and that's the relatively small one that you see in orange.
Now, here is where you have lots and lots of changes going on, but you also have the DNSSEC re-signing happening. If you can go back to the other slide ‑‑ you can see that there is probably some sort of dedicated pattern analyst who could tell the difference between those, but there is a lot of safety in the fact that the re-signing records are dominating and they are happening in a regular ‑‑ or rather, a random ‑‑ pattern.
Finally, the takeaways: we do think that unsigned zones will leak the number of record updates and other information pretty easily. Not perfectly, but easily. The re-signing helps you to disguise that pattern, and if you have DNSSEC signing with jitter ‑‑ which we highly recommend for very large zones ‑‑ that helps you even further.
A future subject for both our sections is that, because you might want to control the padding depending on the type of zone you have, you might want to use the encrypted channel with EDNS(0) signalling to set up some parameters in common with your secondaries.
And that's it. And we have one minute left for questions.
AUDIENCE SPEAKER: Matthias. I like a lot that you are thinking of the next step ‑‑ privacy not only towards the resolver but also for XFR. If you can go to slide 12, please, or something like that. I think this shows that, while the change is small, you can get a pretty big IXFR; this is why I actually pitched minimal IXFR, but I don't encourage you to do that in XoT. The other comment I have is: you are looking at the performance of this way of doing zone transfers over TLS. I wonder how much of this is XFR-related and how much is generic. Can we apply the same logic to, for example, dynamic updates, since those are a different use of the transport protocol?
SPEAKER: I would suspect that dynamic updates come from many different clients, so there is no long-standing TLS connection there. If there is a long-standing connection there, we need to diagnose ‑‑
AUDIENCE SPEAKER: I think the generic answer is probably that there is some overlap and some differences, but it would be really nice if we could talk in the hallway about what those are.
ALLISON MANKIN: I think it would be an interesting study, and this whole area has been a little bit neglected because of where it takes place.
CHAIR: Thank you very much. We have one last micro item. Jim wanted to present.
JIM REID: I am sorry for keeping you from your coffee break; I'll be as short as possible. I want to talk to you about a new thing that's being set up: EDDI, the Encrypted DNS Deployment Initiative. The objective of this organisation is to try and help with deployment, and with your experiences of deploying new encryption technologies such as DoT and DOH and things of that nature. The organisation was soft-launched just over a month ago, in September. Developing best practices, enhancing privacy and security, all the usual kind of things; finding out information about metrics, performance issues, and tools to help with this kind of deployment work.
So this is open to everybody who has any kind of interest in what we can loosely term encrypted DNS. There is a mailing list and a website. It's not a membership organisation: we don't have by-laws, we don't have a budget, and it's self-organising; anybody is welcome to participate as long as you have an interest in this kind of deployment. Comcast is providing the infrastructure, hosting the website and mailing list. We have something of the order of 40 or 50 participants already, in the space of a month. This is an example of some of the people that have shown interest: content providers, TLD registries, consulting firms, big telcos, big cable companies, people who do threat analysis work, and enterprise DNS people too.
Final slide: at the moment we have ad hoc meetings and conference calls. We think we are likely to have physical meetings adjacent to other events.
And the initial workstreams that are being proposed by some initial participants are up on GitHub just now. Everything is out in the open and if you are interested go to the website. If you have any questions, now is your time or over the coffee break.
I guess we're done. Thank you.
CHAIR: One last thing.
SPEAKER: I have minus 30 seconds. Please come to the DNS Dev room. Please rate the talk. Thank you.
CHAIR: With that, the session is now over. The coffee break started a couple of minutes ago. I hope to see you all around and at the next RIPE meeting.
LIVE CAPTIONING BY
MARY McKEON, RMR, CRR, CBC
DUBLIN, IRELAND.