from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
    if (code_show){
        $('div.input').hide();
        $('div.output_stderr').hide();
    } else {
        $('div.input').show();
        $('div.output_stderr').show();
    }
    code_show = !code_show;
}
$( document ).ready(code_toggle);
</script>
<form action='javascript:code_toggle()'><input STYLE='color: #4286f4'
type='submit' value='Click here to toggle on/off the raw code.'></form>''')
Disclaimer: if you do not see the code cells, consider toggling them at the top of the document.
The purpose of this report is to give an introduction to how pinging works. Ping time is the latency between two hosts over an internet connection. It is usually measured in milliseconds, and several attempts are made one after another; the results can also change over time. In this report, I measure the ping times of various universities in the USA based on this website. Then, after looking up the locations of the IPs, visualizing them on a map shows how geographical distance affects latency. Comparing the results of pinging in the morning (between 4 AM and 8 AM in the USA) and in the evening (between 6 PM and 10 PM) shows how the traffic of these networks differs (along with some interesting phenomena).
To run this notebook, you need the data provided in this folder; other dependencies can be found in the code cells below.
import sys
sys.path.append('/home/user/anaconda3/lib/python3.7/site-packages')
sys.path.append('/home/user/graph-tool-2.29/src/')
sys.path.append('/home/user/pkgs/')
import folium
import os
import time
import operator
import ipinfo
import pandas as pd
from tqdm import tqdm
import platform # For getting the operating system name
import subprocess # For executing a shell command
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from collections import namedtuple
%pylab inline
Tracerouting is also a useful tool: it tells us which servers our packet passed through. In the following example one can see how our packet travelled towards MIT; however, it did not reach its destination.
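The traceroute output loaded below consists of hop lines. As an illustration of the format (the router name and IP here are made up), a single hop line can be split into its fields like this:

```python
# A hypothetical traceroute hop line (router name and IP are made up):
line = " 5  example-router.net (93.184.216.34)  24.1 ms  23.8 ms  24.5 ms"

fields = line.split()
hop_no = int(fields[0])       # hop index along the route
name = fields[1]              # reverse-DNS name of the router
ip = fields[2].strip("()")    # IP address, with the parentheses stripped
# the three round-trip times sit at every second field from index 3
rtts = [float(fields[i]) for i in range(3, len(fields), 2)]

print(hop_no, name, ip, rtts)
```

This is the same field layout the dataframe below recovers into the `tr`/`ms` columns.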
# Python Program to Get IP Address
import socket
hostname = socket.gethostname()
IPAddr = socket.gethostbyname(hostname)
print("Your Computer Name is: " + hostname)
print("Your Computer IP Address is: " + IPAddr)
# Column names used to break down the traceroute's message.
columns=['no', 'name', 'ip', 'tr1', "ms1", 'tr2', "ms2", 'tr3', "ms3"]
# Loading in the data to a dataframe
with open('mit.edu.txt') as f:
    data = f.readlines()
del data[0]
df2 = pd.DataFrame(columns=columns)
for d in data:
    removed = d.split().copy()
    clear_ = False
    while not clear_:
        try:
            removed.remove("*")
        except ValueError:
            clear_ = True
    ddict = {c: j for j, c in zip(removed, columns)}
    df2 = df2.append(pd.Series(ddict), ignore_index=True)
df2 = df2[df2.ip.notnull()]  # keeping only rows that have a real IP
df2.ip = [i[1:-1] for i in df2.ip]  # stripping the parentheses
df2
The function below helps us pinging websites.
def ping(host):
    """
    Returns True if host (str) responds to a ping request.
    Remember that a host may not respond to a ping (ICMP) request
    even if the host name is valid.
    """
    # Option for the number of packets, depending on the operating system
    param = '-n' if platform.system().lower() == 'windows' else '-c'
    # Building the command. Ex: "ping -c 1 google.com"
    command = ['ping', param, '1', host]
    return subprocess.call(command) == 0
Here I load in all the unis and colleges provided.
df_unis = pd.read_csv("University and College Websites - Sheet1.csv")
df_unis["Name"] = df_unis["School Name"]
df_unis.drop("School Name", axis = 1, inplace = True)
df_unis.head()
The commented code cell below pings all the websites 6 times each and saves the output under the $\textit{results}$ folder. You may not want to run this, since it takes about 6 hours; it is preferred to simply load in the saved results.
"""folder = "results/"
cnt = 0
for i, row in df_unis.iterrows():
    #row.URL = "https://www." + row.URL[7:]
    print(str(row.URL))
    ret = subprocess.run(["ping", "-c", "6", str(row.URL)[7:]], capture_output=True)
    print(ret.returncode)
    f = open(folder + str(i) + ".txt", "w")
    f.write(ret.stdout.decode("utf-8"))
    f.close()""";
# This is private, please do not share
access_token = '5192e0157e0108'
handler = ipinfo.getHandler(access_token)
# Here I define some functions

# reading in results
def read_in_ping_results(folder, length=1930):
    df = pd.DataFrame(columns=['name', 'times', 'true_ip', 'url', "avg_time", "fileindex"])
    proper = 0
    for i in tqdm(range(length)):
        with open(folder + str(i) + ".txt") as f:
            data = f.readlines()
        try:
            len(data[1].split())
        except IndexError:
            continue
        if len(data[1].split()) > 0:
            times = []
            url = data[0].split()[1]
            true_ip = data[1].split()[4][1:-2]
            for c, d in enumerate(data):
                if c in [1, 2, 3, 4, 5, 6]:
                    try:
                        # d.split()[7] is e.g. "time=24.1"; drop pings above 1000 ms
                        if float(d.split()[7][5:]) < 1000:
                            times.append(float(d.split()[7][5:]))
                    except IndexError:
                        continue
                    except ValueError:
                        continue
            if len(times) > 0:
                df.loc[proper] = [df_unis.Name[i], times.copy(), true_ip, url, np.mean(times), i]
                proper += 1
    return df
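To make the `float(d.split()[7][5:])` indexing above more transparent, here is how it extracts the round-trip time from a single ping reply line (the addresses below are made up):

```python
# A hypothetical Linux ping reply line (addresses are made up):
line = "64 bytes from 93.184.216.34 (93.184.216.34): icmp_seq=1 ttl=56 time=24.1 ms"

fields = line.split()
# fields[7] is "time=24.1"; [5:] cuts off the "time=" prefix
rtt_ms = float(fields[7][5:])
print(rtt_ms)
```

The same layout explains `data[1].split()[4][1:-2]`, which strips the `(` and `):` around the IP in field 4.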
# ask for locations
def get_locays(df, handler=handler, errored_ip=[0, 0]):
    locc = []
    errcount = 0
    errored_ones = []
    for k in tqdm(range(len(df))):
        try:
            details = handler.getDetails(df.iloc[k]["true_ip"])
            locc.append(details.loc.split(sep=','))
        except Exception:
            locc.append(errored_ip)
            errcount += 1
            errored_ones.append(df.iloc[k]["name"])
    if errcount != 0:
        print(f"There were {errcount} error(s) for {errored_ones}.\n" +
              f"Those were placed at {errored_ip}")
    df['locs'] = locc
# dropping outliers
def dropsome(df, smalls, bigs, based_on='avg_time'):
    """Drops _smalls_ number of pinging results from the fastest ones and
    _bigs_ from the slowest ones as outlying points."""
    smallest = list(df.nsmallest(smalls, based_on).index)
    largest = list(df.nlargest(bigs, based_on).index)
    return df.drop(smallest + largest)
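As a quick illustration of `dropsome` on made-up toy data (the function is restated so the snippet is self-contained):

```python
import pandas as pd

def dropsome(df, smalls, bigs, based_on='avg_time'):
    """Same logic as above: drop the `smalls` fastest and `bigs` slowest rows."""
    smallest = list(df.nsmallest(smalls, based_on).index)
    largest = list(df.nlargest(bigs, based_on).index)
    return df.drop(smallest + largest)

# Toy average ping times in ms (made-up numbers):
toy = pd.DataFrame({'avg_time': [1.0, 105.0, 110.0, 120.0, 900.0]})
kept = dropsome(toy, 1, 1)  # drop the single fastest and single slowest row
print(list(kept.avg_time))  # only the middle values remain
```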
# creating colormap
def plotmap(df, colmap=cm.get_cmap('viridis')):
    # normalizing the average times to [0, 1]
    minzed = (df.avg_time - df.avg_time.min()) / (df.avg_time - df.avg_time.min()).max()
    # inverted RGB values converted to hex color strings
    rgbs = [[int((1 - j) * 255) for j in colmap(i)[:-1]] for i in minzed]
    rgbs_h = ['#%02x%02x%02x' % tuple(i) for i in rgbs]
    df["clr"] = rgbs_h
    mymap = folium.Map(location=(39.903525, -101.551042), zoom_start=4)
    for k in range(len(df)):
        folium.CircleMarker(location=df.iloc[k].locs, color=df.iloc[k].clr,
                            popup=df.iloc[k]["name"] + ', avg: ' + str(np.round(df.iloc[k]["avg_time"], 3)),
                            radius=5, fill=True, fill_color=df.iloc[k].clr).add_to(mymap)
    return mymap
### Function adapted from:
# https://medium.com/@bobhaffner/folium-lines-with-arrows-25a0fe88e4e
def get_arrows(locations, color='blue', size=6, n_arrows=3):
    '''
    Get a list of correctly placed and rotated
    arrows/markers to be plotted.

    Parameters
    locations : list of lists of lat-lon pairs that represent the
        start and end of the line,
        e.g. [[41.1132, -96.1993], [41.3810, -95.8021]]
    color : default is 'blue'
    size : default is 6
    n_arrows : number of arrows to create, default is 3

    Return
    list of arrows/markers
    '''
    Point = namedtuple('Point', field_names=['lat', 'lon'])
    # creating points from our Point named tuple
    p1 = Point(locations[0][0], locations[0][1])
    p2 = Point(locations[1][0], locations[1][1])
    # getting the rotation needed for our marker;
    # subtracting 90 to account for the marker's orientation
    # of due East (get_bearing returns North)
    rotation = get_bearing(p1, p2) - 90
    # get an evenly spaced list of lats and lons for our arrows;
    # the first and last are discarded for aesthetics,
    # as markers denote the start and end
    arrow_lats = np.linspace(p1.lat, p2.lat, n_arrows + 2)[1:n_arrows + 1]
    arrow_lons = np.linspace(p1.lon, p2.lon, n_arrows + 2)[1:n_arrows + 1]
    arrows = []
    # creating each "arrow" and appending it to our arrows list
    for points in zip(arrow_lats, arrow_lons):
        arrows.append(folium.RegularPolygonMarker(location=points,
                                                  fill_color=color, number_of_sides=3,
                                                  radius=size, rotation=rotation))
    return arrows
### this is also adapted from the same source:
def get_bearing(p1, p2):
    '''
    Returns compass bearing from p1 to p2.

    Parameters
    p1 : namedtuple with lat, lon
    p2 : namedtuple with lat, lon

    Return
    compass bearing of type float

    Notes
    Based on https://gist.github.com/jeromer/2005586
    '''
    long_diff = np.radians(p2.lon - p1.lon)
    lat1 = np.radians(p1.lat)
    lat2 = np.radians(p2.lat)
    x = np.sin(long_diff) * np.cos(lat2)
    y = (np.cos(lat1) * np.sin(lat2)
         - (np.sin(lat1) * np.cos(lat2)
            * np.cos(long_diff)))
    bearing = np.degrees(np.arctan2(x, y))
    # adjusting for compass bearing
    if bearing < 0:
        return bearing + 360
    return bearing
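A quick sanity check of the bearing formula on the four cardinal directions (the function is restated compactly so the snippet is self-contained):

```python
from collections import namedtuple
import numpy as np

Point = namedtuple('Point', ['lat', 'lon'])

def get_bearing(p1, p2):
    """Compass bearing from p1 to p2, same formula as above."""
    long_diff = np.radians(p2.lon - p1.lon)
    lat1, lat2 = np.radians(p1.lat), np.radians(p2.lat)
    x = np.sin(long_diff) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(long_diff)
    bearing = np.degrees(np.arctan2(x, y))
    return bearing + 360 if bearing < 0 else bearing

origin = Point(0.0, 0.0)
print(get_bearing(origin, Point(10.0, 0.0)))   # due north -> 0.0
print(get_bearing(origin, Point(0.0, 10.0)))   # due east  -> 90.0
print(get_bearing(origin, Point(-10.0, 0.0)))  # due south -> 180.0
print(get_bearing(origin, Point(0.0, -10.0)))  # due west  -> 270.0
```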
# plotting the difference map:
def plot_diffmap(df, key='locs', suffixes=["_d", "_n"]):
    mymap = folium.Map(location=(39.903525, -101.551042), zoom_start=4)
    for k in range(len(df)):
        morn, nigh = df.iloc[k][key + suffixes[0]], df.iloc[k][key + suffixes[1]]
        folium.CircleMarker(location=morn, color='orange',
                            popup=df.iloc[k]["name"] + str(df.iloc[k][key + suffixes[0]]),
                            radius=5, fill=True, fill_color='orange').add_to(mymap)
        folium.Marker(location=nigh, icon=folium.Icon(color="blue"),
                      popup=df.iloc[k]["name"] + str(df.iloc[k][key + suffixes[1]]) + str(df.iloc[k][key + suffixes[0]])).add_to(mymap)
        folium.PolyLine(locations=[[float(morn[0]), float(morn[1])], [float(nigh[0]), float(nigh[1])]],
                        color='green').add_to(mymap)
        arrows = get_arrows(locations=[[float(morn[0]), float(morn[1])], [float(nigh[0]), float(nigh[1])]],
                            n_arrows=3)
        for arrow in arrows:
            arrow.add_to(mymap)
    return mymap
It is preferred to drop some outlying points. We also ask for the locations of all the IPs.
df = read_in_ping_results(folder = 'results/')
get_locays(df)
df_fewer = dropsome(df, 180, 50)
# Plot the sorted ping times
figsize(10,6)
plot(list(sorted(df.avg_time)), label = 'all')
plot(list(sorted(df_fewer.avg_time)), label = 'remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times")
grid()
One can see from the figure above that the real ping times start at around 100 ms; the much lower values come from servers that are located far closer to us while serving American universities' sites.
Below one can see the results on a folium map.
plotmap(df_fewer)
What is interesting to see here is how the average times grow as the distance increases, i.e. going from East to West. The labels on the circle markers show the name of the institution and the average ping time. Furthermore, even after dropping some outliers, there is still one relatively close to us, in Istanbul.
In this section I ran all the pings between 6 PM and 10 PM (local time), a few days later. This results in changes not only in the ping times: as one will see, the URLs and therefore the locations have also changed. The results can be found in the $\texttt{results\_night}$ folder.
df2 = read_in_ping_results(folder = 'results_night/')
get_locays(df2)
df2_fewer = dropsome(df2, 180, 50)
# Plot the sorted ping times
figsize(10,6)
plot(list(sorted(df2.avg_time)), label = 'all')
plot(list(sorted(df2_fewer.avg_time)), label = 'remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times (night round)")
grid()
plotmap(df2_fewer)
The same shading can be seen here, too, with some outliers.
There are some problems with comparing the two rounds: as I have mentioned, not only the ping times but also the universities' availability, URLs, and locations have changed.
# Combined plot
figsize(10,6)
plot(list(sorted(df.avg_time)),"orange" ,label = 'morning all')
plot(list(sorted(df_fewer.avg_time)), "red", label = 'morning remaining')
plot(list(sorted(df2.avg_time)), 'lightblue',label = 'night all')
plot(list(sorted(df2_fewer.avg_time)), 'darkblue',label = 'night remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times (altogether)")
grid()
From the figure above, one can see that the average ping time did decrease at night. The reason might be that the overall network traffic also decreased.
# plotting histograms
figsize(16,9)
plt.grid()
plt.hist(df2_fewer.avg_time, label = 'Nighttime', color = 'blue');
plt.hist(df_fewer.avg_time, histtype="step", label = 'Daytime', color = 'red');
plt.xlabel("Ping time [ms]")
plt.ylabel("Occurrences")
plt.title("Histograms of the ping times")
plt.legend(loc = "best");
The histograms support our suspicion that there is a decrease in ping time at night: the shapes are similar, but the night distribution is shifted to the left.
print(f"# universities responded in the morning: {len(df)}, and in the night: {len(df2)}.")
# Printing out differences
print("Unis present in the morning\'s dataset, but not in the night\'s:")
for n in df.name:
if n not in list(df2.name):
print(f"\t{n}")
print("\nUnis present in the night's dataset, but not in the morning's:")
for n in df2.name:
if n not in list(df.name):
print(f"\t{n}")
To compare the results, I created a dataset consisting of all the overlapping institutions.
df_section = pd.merge(df, df2, on='name', suffixes=['_d', '_n'])
print(f"Shape of the section: {df_section.shape}")
df_section.head()
df_mismatch = df_section[df_section.locs_d != df_section.locs_n]
plot_diffmap(df_mismatch)
On the map above, one can see the location changes. The morning places are marked with orange circle markers, and the night locations with the blue regular ("popup") markers. The green lines connect the two places for a given university, and the blue arrows on the lines show the direction of the "move". Since the institutions did not really move (or at least I hope so, see here), it is only their IPs that changed. Apart from three unidirectional changes (Chicago, LA and Europe), most connections are bidirectional. This means that those places are somewhat connected to each other: for example, Southwestern College (see the dataframe below) has two entries in the provided excel sheet as well, one in Miami and one in California. All the bidirectional connections are like this (or even more complicated, see the San Antonio, Virginia Beach and NY triangle).
Since this is only a small portion of the whole overlap (14 rows only), the rest can still be compared.
df_mismatch[df_mismatch.name=="Southwestern College"]
for l in df_mismatch.index:
    try:
        df_section.drop(l, inplace=True)
    except KeyError:
        continue
print(f"Shape of the remaining section: {df_section.shape}")
df_section['avg_time'] = df_section.avg_time_n - df_section.avg_time_d
df_section['locs'] = df_section.locs_d
df_section_fewer = dropsome(df_section, 20, 10)
# Plot of the differences
figsize(6, 10)
plt.subplot(2,1,1)
plot(list(sorted(df_section.avg_time)),"orange" ,label = 'all')
xlabel("Rank (universities sorted by ping time difference)")
ylabel("Ping time difference [ms]")
legend(loc="best")
title("Difference in ping times (night-day) -- total")
grid()
plt.subplot(2,1,2)
plot(list(sorted(df_section_fewer.avg_time)), "red", label = 'remaining')
xlabel("Rank (universities sorted by ping time difference)")
ylabel("Ping time difference [ms]")
legend(loc="best")
title("Difference in ping times (night-day) -- cropped")
grid()
The two figures show the difference between the average ping times at night and during the day. As the second figure makes clear, the average is below 0, indicating an increase in speed (a decrease in ping time). Please note that here I only dropped 20 points from the lower end and 10 from the upper end.
print(f"The mean of the differences: {df_section_fewer.avg_time.mean()}.")
print(f"The standard deviation of the differences: {df_section_fewer.avg_time.std()}.")
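As an additional sanity check not performed in the original analysis, one could also compute a one-sample t statistic of the differences against zero to gauge how far the mean shift sits from no change; here is a sketch on made-up numbers:

```python
import numpy as np

# Hypothetical night-minus-day differences in ms (made-up numbers):
diffs = np.array([-12.0, -8.5, -15.2, -3.1, 2.4, -9.8, -11.0, -6.7])

mean = diffs.mean()
std = diffs.std(ddof=1)  # sample standard deviation
# one-sample t statistic against a zero mean difference
t_stat = mean / (std / np.sqrt(len(diffs)))
print(round(t_stat, 3))
```

A strongly negative t statistic would support that the nightly decrease is not just noise; on the real `df_section_fewer.avg_time` column the same three lines apply unchanged.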
plotmap(df_section_fewer)
On the map above one can see the differences between night and day. If the difference is negative, the marker is greener, meaning there was a decrease from day to night, i.e. it became faster to access these points. If the marker is more blue, that indicates an increase in ping time.
# plotting histogram
figsize(16,9)
plt.grid()
plt.hist(df_section.avg_time,bins = 200, color = 'blue');
plt.xlabel("Ping time [ms]")
plt.ylabel("Occurrences")
plt.yscale("log")
plt.title("Histograms of the differences in ping times");
This histogram also shows that most of the ping times decreased. Please note that the y scale is logarithmic.
This exercise helped in getting to know the basics of pinging and, in some sense, tracerouting. Our presupposition was correct: the ping time did decrease at night, presumably because the network traffic also decreased.