As I said in my last post, I think the underlying accuracy of Load Balancing DNS systems (operated by CDNs and others) contributes more to global slowness than misconfigured users behind Local DNS do. In this post I share real-world data that confirms my premise.
To do that I used 50k Local DNS (open recursive DNS servers) to perform DNS lookups of production customer domain names of many CDNs using Load Balancing DNS. The idea is that by using a Local DNS in France, I should see the servers the CDN uses in that region, and by using a Local DNS in Japan, I should see a different set of servers. It should be very unlikely that the most optimized decision a Load Balancing DNS system could make would be to direct a single user to servers in both France and Japan.
Using a large and widely distributed set of Local DNS to perform the lookups should give me a fairly complete map of each CDN, with some (most likely minor) exceptions, and a good sample of each CDN’s “confidence” in their decision making.
The Exceptions in the Map
First, I will only see the servers/locations that the CDN has configured for that customer. The CDN might have other locations that they use for other customers/services that will not be discovered. I will also only see servers/locations of each CDN that their Load Balancing system thinks my 50k Local DNS should be mapped to. For most CDNs, the distributed 50k Local DNS should find “all” servers. In cases where a CDN might have made a specific PoP for a particular ISP, these might be missed.
Regardless…none of that should matter. My interest is in looking at where the CDN maps a given Local DNS to, how consistently they do that, when they are not consistent how far apart are the PoPs they use, how much their decision is like that of other CDNs, etc. To further that goal I looked up each name from each of the 50k local DNS 10 times over the course of a few hours.
The Candidates and the Exclusions
Below is a list of domain names I used to evaluate each CDN. The domain names came from URLs that we use internally at Highwinds to monitor the performance of each CDN. The names are our best guess at a representative customer for how each CDN performs. In each case I looked up the CNAME (not the Customer Domain) to avoid any conditional responses that might be generated by the customer.
A few CDNs are absent, with the most notable being my company – Highwinds. I only included CDNs that use Load Balancing DNS. Highwinds uses Anycast. If you use DNS to look up a Highwinds CNAME (tlb.hwcdn.net) from anyplace in the world you will get the same IP every place, with no exceptions. Because of that, we are not impacted by the problem of devices not being ‘near’ their Local DNS (the purpose of the Google IETF draft) which this follow-up post is trying to investigate. Upon completion of this project, I may do another follow-up post on Anycast vs. DNS Load Balancing.
Also, Google is not represented. I would have really liked to include them, but I was forced to omit them since I could not find any indication that they self-identify server locations (more on that below).
Mapping CDN Server Locations
For each CDN I looked up the production CNAME 10 times from each of the 50k local DNS and created a unique server IP list from the results. I then had to map each of the servers to a CDN location.
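The lookup-and-dedupe step can be sketched roughly as follows. This is a minimal illustration, not the tooling used for the study: the `lookup` callable, the `fake_lookup` stand-in, the name `cdn.example.net`, and the documentation-range IPs are all assumptions made so the sketch runs offline. A real run would plug in a function that sends the query to the given resolver.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical signature: lookup(resolver_ip, name) -> list of A-record IPs.
Lookup = Callable[[str, str], List[str]]


def collect_answers(resolvers: Iterable[str], cname: str,
                    lookup: Lookup, rounds: int = 10) -> Dict[str, List[List[str]]]:
    """Ask every resolver for `cname` `rounds` times and keep every answer."""
    resolver_list = list(resolvers)  # allow generators; we iterate more than once
    answers: Dict[str, List[List[str]]] = defaultdict(list)
    for _ in range(rounds):
        for resolver in resolver_list:
            answers[resolver].append(lookup(resolver, cname))
    return dict(answers)


def unique_server_ips(answers: Dict[str, List[List[str]]]) -> Set[str]:
    """Flatten all answers from all resolvers into one set of server IPs."""
    return {ip for per_resolver in answers.values()
            for answer in per_resolver
            for ip in answer}


def fake_lookup(resolver_ip: str, name: str) -> List[str]:
    """Stand-in for a real DNS query (illustrative data only)."""
    table = {
        "192.0.2.53": ["198.51.100.10", "198.51.100.11"],
        "203.0.113.53": ["198.51.100.11"],
    }
    return table.get(resolver_ip, [])


answers = collect_answers(["192.0.2.53", "203.0.113.53"],
                          "cdn.example.net", fake_lookup, rounds=10)
print(sorted(unique_server_ips(answers)))
```

Keeping every per-resolver answer, rather than only the deduplicated set, matters later: the consistency of the answers each Local DNS receives is what the confidence measure is built on.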
Don’t read too much into the Unique IP counts for each CDN. In some cases the IPs are an actual count of servers in use (for that domain), and overstate the number of locations in use (multiple servers in a single location). More important than that, an end user’s performance on any given CDN is only impacted by the single IP they download something from. How and why that IP is selected is far more important than the count of all IPs not used by that user. In other words, when I make a connection to Level 3 I get connected to one IP. The performance, count and location of the other 175 has no bearing on my performance (ignoring if they do tiered caching, etc.).
The next step was to map each server to a location, and then to a Latitude and Longitude. For most of the IPs the mapping to a corresponding location was fairly easy. I used the following techniques:
1) I looked at our Gomez test results. If a Gomez test agent had a connect time of less than 2 ms to one of the CDN’s IPs, I assumed that IP must be in the same city as the Gomez agent. You can’t go many miles in 2ms.
2) Many of the CDNs return a server name in their HTTP header responses. By making a request against each server I could see how the CDN themselves labeled the location of the server.
3) The same kind of CDN self-labeling existed in reverse DNS lookups (mapping IPs to names) of the IPs for some CDNs.
4) Some CDNs used an intermediate CNAME that included location information that pointed to the IPs.
5) I looked to see if I had already mapped an IP one higher or one lower and used that as a guess.
6) If all else failed, I did a traceroute from multiple locations and manually analyzed the results.
Only a small handful of the IPs needed to be manually analyzed, and many correlated across multiple methods.
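Two of the techniques above lend themselves to a short sketch: the 2 ms connect-time rule (technique 1) and the adjacent-IP guess (technique 5). The code below is an illustration under stated assumptions, not the scripts used for the study. It assumes a connect time is roughly one network round trip and that signals travel through fiber at about 200 km per millisecond (roughly two thirds of the speed of light). The function names and sample IPs are hypothetical.

```python
import ipaddress
from typing import Dict, Optional

SPEED_IN_FIBER_KM_PER_MS = 200.0  # assumption: about 2/3 of the speed of light
KM_PER_MILE = 1.609344


def max_one_way_miles(connect_ms: float) -> float:
    """Upper bound on distance, treating the connect time as one round trip.

    Real paths are shorter: routing detours, queuing and server processing
    all use up part of the measured time.
    """
    one_way_ms = connect_ms / 2.0
    return one_way_ms * SPEED_IN_FIBER_KM_PER_MS / KM_PER_MILE


def guess_from_neighbor(ip: str, known: Dict[str, str]) -> Optional[str]:
    """Return the location of the IP one lower or one higher, if already mapped."""
    addr = ipaddress.ip_address(ip)
    for offset in (-1, 1):
        try:
            neighbor = str(addr + offset)
        except ValueError:  # stepped outside the valid address range
            continue
        if neighbor in known:
            return known[neighbor]
    return None


print(round(max_one_way_miles(2.0)))                                # theoretical ceiling in miles
print(guess_from_neighbor("198.51.100.11", {"198.51.100.10": "SEA"}))
```

The ceiling this produces for 2 ms is a physical limit, not a typical distance. Once handshake processing and real-world routing are accounted for, a sub-2 ms connect time points to a server much closer than that ceiling, which is why it works as a same-city signal.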
The Akamai mapping was the exception. The only method that worked (other than manual mapping) was the Gomez correlation, but it identified fewer than half of the IPs. I was going to exclude them from the study as I didn’t want to manually map them all (it would take too long and be far too error prone).
In the end I had the IPs Geo mapped using the Akamai Edgescape service. I had avoided using IP to Geo systems for the mapping in general as they are much too error prone, but I assumed the Akamai server IPs would be accurate in their own Geo Data. I manually checked a few, and they looked correct, and any place I had Gomez data to compare it to matched.
For all the CDNs, each IP got mapped to a three-letter airport abbreviation, which is what many of the CDNs use to identify their locations. I did this based on the Geo information that I gleaned, and I normalized the airport codes a bit so that I only had one per city (as an example, I just used SEA in Seattle even if a CDN had self-labeled their server as BFI). I also took some server locations within about 50 miles of another and used the larger of the two cities as the designation for both.
After all the mapping I was able to create a table that showed each CDN’s server locations. A 1 in a column for a given CDN indicates that I found servers in the location represented by that row.
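The normalization could look something like the sketch below. Only the BFI to SEA alias comes from the text. The `merge_nearby` helper, the placeholder codes `AAA`/`BBB`/`CCC`, their coordinates, and the "size" ranks are hypothetical, and the haversine great-circle formula is one reasonable way to apply the 50-mile rule, not necessarily the one used for the study.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8

# Secondary airport code -> the single code kept for that city.
CITY_ALIASES: Dict[str, str] = {"BFI": "SEA"}


def normalize_airport(code: str) -> str:
    """Collapse alternate airport codes so each city has exactly one."""
    code = code.strip().upper()
    return CITY_ALIASES.get(code, code)


def miles_between(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))


def merge_nearby(sites: Dict[str, Tuple[float, float, int]],
                 max_miles: float = 50.0) -> Dict[str, str]:
    """Map each code to the largest site within `max_miles`, or to itself.

    `sites` maps code -> (lat, lon, size); a bigger size means a larger city.
    """
    kept: List[str] = []
    mapping: Dict[str, str] = {}
    # Visit larger cities first so they become the designation for neighbors.
    for code, (lat, lon, _size) in sorted(sites.items(), key=lambda kv: -kv[1][2]):
        for other in kept:
            if miles_between((lat, lon), sites[other][:2]) <= max_miles:
                mapping[code] = other
                break
        else:
            kept.append(code)
            mapping[code] = code
    return mapping


# Placeholder sites: BBB sits about 20 miles from the larger AAA.
demo_sites = {"AAA": (40.0, -75.0, 5), "BBB": (40.3, -75.0, 1), "CCC": (45.0, -75.0, 3)}
print(normalize_airport("bfi"), merge_nearby(demo_sites))
```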
I assume the above table alone will generate a lot of questions and comments and possibly be the first time some people have seen the data for 8 CDNs’ locations organized in one place. A few things to remember about the table:
The table is only accurate…
· …based on the Local DNS I used. While it was a large group (50k) it might not have found ALL locations for each CDN (though I suspect it is VERY close to 100% complete for every CDN, with Akamai perhaps slightly less complete but still very close to 100%).
· …for the locations in production on the evening of September 13, 2011. If a location was offline that evening (or is new since then), it could not have been discovered.
· …for the specific customer I tested on each CDN. Other customers of a given CDN may be configured to use more or fewer locations.
· …for the given service on that CDN. Before people start asking where Akamai’s 1000’s of locations are — I tested their SSL service (which seems to use a subset of their locations) to make things easier for the study.
I also don’t want people to read the table with a bias that more locations is better and leave at this point having picked “winners”.
I will discuss one small aspect (load balancing accuracy) of why that notion is not true but there are many more. The locations are only as good as the CDNs’ systems’ ability to direct you to the right one. Assuming you make it to the right one, lots can still go wrong; including getting an overloaded server, having to wait for a disk read, or worse, having to pull from an intermediate cache/origin. But that is all for another post.
The first measure I made of the data was to assess what level of confidence a CDN had in their decision making, regardless of whether that decision was “correct”. If a given CDN directed a given Local DNS to the same PoP all 10 times, they seemed confident in their decision (I am not suggesting it was always right, just confident). If over the course of the 10 lookups the Local DNS was directed to 10 different locations, this seemed less confident. If the 10 different locations were all within a small geographic area, that would increase the confidence compared to a CDN that used 10 different locations all over the world.
To calculate the confidence, I computed the Geographic Midpoint of all locations used for a given Local DNS on a given CDN. I then calculated the distance from the Midpoint to each of the locations and used the maximum distance (in miles) to describe the confidence of the choice for that Local DNS on that CDN. That distance (or radius) contains all the locations.
So let’s say a CDN directed a given user to the following locations (as one CDN did); ORD, DAL, IAD, TTN, BOS, SEA, and SJC. I calculated the midpoint for the 7 locations (labeled MID in the table and M in the picture) and then computed the distance from each location to the Midpoint.
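The midpoint-and-radius calculation can be sketched as below. It rests on two assumptions: the "Geographic Midpoint" is taken to be the center of gravity of the points on a sphere (average the 3D unit vectors, then convert back to latitude and longitude), and distances are great-circle miles from the haversine formula. The airport coordinates are approximate values supplied for illustration, not the study's data.

```python
import math
from typing import Iterable, Tuple

EARTH_RADIUS_MILES = 3958.8
Point = Tuple[float, float]  # (latitude, longitude) in degrees

# Approximate airport coordinates, for illustration only.
AIRPORTS = {
    "ORD": (41.97, -87.91), "DAL": (32.85, -96.85), "IAD": (38.95, -77.46),
    "TTN": (40.28, -74.81), "BOS": (42.37, -71.01), "SEA": (47.45, -122.31),
    "SJC": (37.36, -121.93),
}


def geographic_midpoint(points: Iterable[Point]) -> Point:
    """Center of gravity of points on a sphere, returned as (lat, lon)."""
    x = y = z = 0.0
    n = 0
    for lat, lon in points:
        la, lo = math.radians(lat), math.radians(lon)
        x += math.cos(la) * math.cos(lo)
        y += math.cos(la) * math.sin(lo)
        z += math.sin(la)
        n += 1
    if n == 0:
        raise ValueError("need at least one point")
    x, y, z = x / n, y / n, z / n
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))


def haversine_miles(a: Point, b: Point) -> float:
    """Great-circle distance in miles between two points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(h)))


def confidence_radius(points: Iterable[Point]) -> float:
    """Largest distance from the midpoint to any location (smaller = more confident)."""
    pts = list(points)
    mid = geographic_midpoint(pts)
    return max(haversine_miles(mid, p) for p in pts)


example = [AIRPORTS[c] for c in ("ORD", "DAL", "IAD", "TTN", "BOS", "SEA", "SJC")]
print(geographic_midpoint(example), confidence_radius(example))
```

With these approximate coordinates the midpoint falls in the upper Midwest and the radius comes out well over 1000 miles. A Local DNS that always received the same PoP would score a radius of essentially zero.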
I ran this computation for all the DNS lookups for each CDN and sorted the results from most confident (least average miles) to least confident (most average miles), and plotted the 1% – 100% in 1% increments for each CDN. The vertical axis is Log Scale and I added a shaded background representing 0 – 400 Miles in green, 400 – 1000 miles in yellow and greater than 1000 miles in red. The numbers are somewhat arbitrary but seem to indicate cases where a CDN is directing users outside of a region. For reference, Boston and Washington DC are about 400 miles apart.
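The sort-and-plot step could be approximated as in the sketch below, which produces the 100 points of one CDN's curve and tags each with its background band. The nearest-rank percentile method and the handling of values exactly on a band boundary are assumptions, as the text does not specify either. The sample radii are made up.

```python
from typing import List, Sequence


def percentile_curve(radii: Sequence[float]) -> List[float]:
    """Value at the 1st..100th percentile (nearest-rank), most confident first."""
    ordered = sorted(radii)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one radius")
    # Integer ceiling of p*n/100 avoids floating-point rounding surprises.
    return [ordered[(p * n + 99) // 100 - 1] for p in range(1, 101)]


def band(miles: float) -> str:
    """Background color used in the chart for a given radius in miles."""
    if miles <= 400:
        return "green"
    if miles <= 1000:
        return "yellow"
    return "red"


# Made-up radii (miles) for one CDN, one value per Local DNS.
sample = [0, 0, 0, 12, 35, 80, 410, 950, 1200, 4000]
curve = percentile_curve(sample)
print(curve[49], band(curve[49]))   # median point
print(curve[99], band(curve[99]))   # worst case
```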
I concede that there may be errors in this study, or misinterpretation in analyzing the data in this way, but there is certainly something to it.
Take for example the Local DNS IP of 18.104.22.168. These are the PoPs that each CDN used for that IP and the max miles from the midpoint I computed. A consensus seems to develop amongst the CDNs that the location of this Local DNS IP (or at least the best location to serve it from) is between Seattle and Dallas. Level 3 and Mirror Image both used many different PoPs for this single Local DNS, some separated by a great distance.
I don’t mean to pick on Level 3 or Mirror Image…every CDN has issues for some IPs. In the case below most CDNs seem to agree the best location is London, but Akamai (and again Mirror Image) use locations much further away.
Now it could be in these two cases that Akamai, Level 3, and Mirror Image know something that the other CDNs do not, and it is somehow better to jump between locations separated by thousands of miles over the course of an hour to serve a single IP.
Somehow I doubt it.
It could also be that those CDNs have limited connectivity (peering) to those two particular IP spaces, and they are load balancing across 1000’s of miles deliberately to solve that problem – looking for a location (even if remote) with connectivity. If this is the case, it may not be an issue with the Load Balancing DNS system used to determine a best location, but it is still a problem.
I cannot think of a single explanation that makes these answers seem anything but suboptimal, and as I said, they exist for ALL of the CDNs in the study at some percentage.
The following table shows what percentage of the Local DNS used a group of servers on each CDN with an average distance of greater than 400 Miles and 1000 Miles.
| | Akamai | CDNetworks | CloudFront | Cotendo | Internap | Level 3 | Limelight | Mirror Image |
|---|---|---|---|---|---|---|---|---|
| > 400 Miles | 4.40% | 9.59% | 9.18% | 3.13% | 5.53% | 10.83% | 1.87% | 62.52% |
| > 1000 Miles | 1.90% | 3.70% | 2.59% | 0.70% | 3.51% | 3.33% | 0.99% | 62.31% |
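Each cell in the table is a simple share: the percentage of Local DNS whose radius for that CDN exceeds the threshold. A minimal sketch follows, with made-up radii and a hypothetical function name.

```python
from typing import Iterable


def share_above(radii: Iterable[float], threshold_miles: float) -> float:
    """Percentage of Local DNS whose radius exceeds `threshold_miles`."""
    values = list(radii)
    if not values:
        raise ValueError("need at least one radius")
    over = sum(1 for r in values if r > threshold_miles)
    return 100.0 * over / len(values)


# Made-up radii (miles), one per Local DNS, for a single CDN.
sample = [0, 10, 450, 1200]
print(f"> 400 Miles: {share_above(sample, 400):.2f}%")
print(f"> 1000 Miles: {share_above(sample, 1000):.2f}%")
```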
The results show that some CDNs did better than others. Cotendo limited exposure to locations 400 miles or more apart to about 3% of the Local DNS (still, 3 out of every 100 Local DNS were impacted), and to locations more than 1000 miles apart to under 1%. By contrast, over half of the Local DNS I tested against Mirror Image used locations 1000 miles or more apart (so Mirror Image either has real issues or is doing something I just truly don’t understand).
Regardless of the percentages, I think all of the results point to a bigger accuracy issue than the problem the Google DNS draft looks to address. In other words, the Web is slowed down more by Load Balancing DNS making the wrong decision (even if only a few percent of the time) than it is by end users behind wrong DNS.
This second part still needs to be quantified. If someone wants to save me the time of mapping the geographic size of an end user base behind a large collection of Local DNS, that would be helpful and very interesting to discuss. I am sure all the Load Balancing DNS folks that implemented the Google solution already did the calculation to prove the return on doing the work. 😉
There are a lot of other confidence calculations that one could compute across CDNs; did they all direct the local DNS to the same part of the world, what would a “group vote” by the CDNs’ decisions show as a consensus midpoint, did each CDN tend to use their locations nearest this midpoint, if not how far off, etc. But those are calculations I will leave for another day or another person. If you would like a copy of the data, please email me (rich dot day at highwinds dot com) and I am more than happy to share it.
So what does this all mean? While distant users behind Local DNS might make finding the optimal PoP tricky for Load Balancing DNS systems, it is also difficult for many of those systems to find the optimal PoP in the best of circumstances. Using Anycast (which certainly has its own unique challenges) avoids many of the pitfalls of Load Balancing DNS. The choice of using Anycast, in addition to a ton of other technical choices, continues to allow Highwinds to drive stellar performance at the top of the CDN pack.
By Rich Day, President, Highwinds