from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
    if (code_show){
        $('div.input').hide();
        $('div.output_stderr').hide();
    } else {
        $('div.input').show();
        $('div.output_stderr').show();
    }
    code_show = !code_show;
}
$( document ).ready(code_toggle);
</script>
<form action='javascript:code_toggle()'><input STYLE='color: #4286f4'
type='submit' value='Click here to toggle on/off the raw code.'></form>''')
Disclaimer: if you do not see the code cells, consider toggling them at the top of the document.
The purpose of this report is to give an introduction to how pinging works. Ping time is the latency between two hosts over an internet connection. It is usually measured in milliseconds, and several attempts are made one after another; the results can also change over time. In this report, I measure the ping times of various universities in the USA based on this website. Then, after looking up the locations of the IPs, visualizing them on a map shows how geographical distance affects latency. Comparing the results of pinging in the morning (between 4 AM and 8 AM in the USA) and in the evening (between 6 PM and 10 PM) shows how the traffic of these networks differs (along with some interesting phenomena).
To run this notebook, you need the data provided in this folder; other dependencies can be found in the code cells below.
import sys
sys.path.append('/home/user/anaconda3/lib/python3.7/site-packages')
sys.path.append('/home/user/graph-tool-2.29/src/')
sys.path.append('/home/user/pkgs/')
import folium
import os
import time
import operator
import ipinfo
import pandas as pd
from tqdm import tqdm
import platform # For getting the operating system name
import subprocess # For executing a shell command
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
from collections import namedtuple
%pylab inline
Tracerouting is also a useful tool: it tells us which servers our packet passed through. In the following example one can see how our packet travelled towards MIT; however, it did not reach its destination.
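The traceroute output loaded below consists of hop lines. As an illustration of the format (the router name and IP here are made up), a single hop line can be split into its fields like this:

```python
# A hypothetical traceroute hop line (router name and IP are made up):
line = " 5  example-router.net (93.184.216.34)  24.1 ms  23.8 ms  24.5 ms"

fields = line.split()
hop_no = int(fields[0])       # hop index along the route
name = fields[1]              # reverse-DNS name of the router
ip = fields[2].strip("()")    # IP address, with the parentheses stripped
# the three round-trip times sit at every second field from index 3
rtts = [float(fields[i]) for i in range(3, len(fields), 2)]

print(hop_no, name, ip, rtts)
```

This is the same field layout the dataframe below recovers into the `tr`/`ms` columns.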
# Python Program to Get IP Address
import socket
hostname = socket.gethostname()
IPAddr = socket.gethostbyname(hostname)
print("Your Computer Name is: " + hostname)
print("Your Computer IP Address is: " + IPAddr)
# Column names used to break down the traceroute's message.
columns=['no', 'name', 'ip', 'tr1', "ms1", 'tr2', "ms2", 'tr3', "ms3"]
# Loading in the data to a dataframe
with open('mit.edu.txt') as f:
    data = f.readlines()
del data[0]
df2 = pd.DataFrame(columns=columns)
for d in data:
    removed = d.split().copy()
    clear_ = False
    while not clear_:
        try:
            removed.remove("*")
        except ValueError:
            clear_ = True
    ddict = {c: j for j, c in zip(removed, columns)}
    df2 = df2.append(pd.Series(ddict), ignore_index=True)
df2 = df2[df2.ip.notnull()]  # keeping only rows that have a real IP
df2.ip = [i[1:-1] for i in df2.ip]  # stripping the parentheses
df2
The function below helps us pinging websites.
def ping(host):
    """
    Returns True if host (str) responds to a ping request.
    Remember that a host may not respond to a ping (ICMP) request
    even if the host name is valid.
    """
    # Option for the number of packets, depending on the operating system
    param = '-n' if platform.system().lower() == 'windows' else '-c'
    # Building the command. Ex: "ping -c 1 google.com"
    command = ['ping', param, '1', host]
    return subprocess.call(command) == 0
Here I load in all the unis and colleges provided.
df_unis = pd.read_csv("University and College Websites - Sheet1.csv")
df_unis["Name"] = df_unis["School Name"]
df_unis.drop("School Name", axis = 1, inplace = True)
df_unis.head()
The commented code cell below pings all the websites 6 times each and saves the output under the $\textit{results}$ folder. You may not want to run this, since it takes about 6 hours; it is preferred to simply load in the saved results.
"""folder = "results/"
cnt = 0
for i, row in df_unis.iterrows():
    #row.URL = "https://www." + row.URL[7:]
    print(str(row.URL))
    ret = subprocess.run(["ping", "-c", "6", str(row.URL)[7:]], capture_output=True)
    print(ret.returncode)
    f = open(folder + str(i) + ".txt", "w")
    f.write(ret.stdout.decode("utf-8"))
    f.close()""";
# This is private, please do not share
access_token = '5192e0157e0108'
handler = ipinfo.getHandler(access_token)
# Here I define some functions

# reading in results
def read_in_ping_results(folder, length=1930):
    df = pd.DataFrame(columns=['name', 'times', 'true_ip', 'url', "avg_time", "fileindex"])
    proper = 0
    for i in tqdm(range(length)):
        with open(folder + str(i) + ".txt") as f:
            data = f.readlines()
        try:
            len(data[1].split())
        except IndexError:
            continue
        if len(data[1].split()) > 0:
            times = []
            url = data[0].split()[1]
            true_ip = data[1].split()[4][1:-2]
            for c, d in enumerate(data):
                if c in [1, 2, 3, 4, 5, 6]:
                    try:
                        # d.split()[7] is e.g. "time=24.1"; drop pings above 1000 ms
                        if float(d.split()[7][5:]) < 1000:
                            times.append(float(d.split()[7][5:]))
                    except IndexError:
                        continue
                    except ValueError:
                        continue
            if len(times) > 0:
                df.loc[proper] = [df_unis.Name[i], times.copy(), true_ip, url, np.mean(times), i]
                proper += 1
    return df
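To make the `float(d.split()[7][5:])` indexing above more transparent, here is how it extracts the round-trip time from a single ping reply line (the addresses below are made up):

```python
# A hypothetical Linux ping reply line (addresses are made up):
line = "64 bytes from 93.184.216.34 (93.184.216.34): icmp_seq=1 ttl=56 time=24.1 ms"

fields = line.split()
# fields[7] is "time=24.1"; [5:] cuts off the "time=" prefix
rtt_ms = float(fields[7][5:])
print(rtt_ms)
```

The same layout explains `data[1].split()[4][1:-2]`, which strips the `(` and `):` around the IP in field 4.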
# ask for locations
def get_locays(df, handler=handler, errored_ip=[0, 0]):
    locc = []
    errcount = 0
    errored_ones = []
    for k in tqdm(range(len(df))):
        try:
            details = handler.getDetails(df.iloc[k]["true_ip"])
            locc.append(details.loc.split(sep=','))
        except Exception:
            locc.append(errored_ip)
            errcount += 1
            errored_ones.append(df.iloc[k]["name"])
    if errcount != 0:
        print(f"There were {errcount} error(s) for {errored_ones}.\n" +
              f"Those were placed at {errored_ip}")
    df['locs'] = locc
# dropping outliers
def dropsome(df, smalls, bigs, based_on='avg_time'):
    """Drops _smalls_ number of pinging results from the fastest ones and
    _bigs_ from the slowest ones as outlying points."""
    smallest = list(df.nsmallest(smalls, based_on).index)
    largest = list(df.nlargest(bigs, based_on).index)
    return df.drop(smallest + largest)
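As a quick illustration of `dropsome` on made-up toy data (the function is restated so the snippet is self-contained):

```python
import pandas as pd

def dropsome(df, smalls, bigs, based_on='avg_time'):
    """Same logic as above: drop the `smalls` fastest and `bigs` slowest rows."""
    smallest = list(df.nsmallest(smalls, based_on).index)
    largest = list(df.nlargest(bigs, based_on).index)
    return df.drop(smallest + largest)

# Toy average ping times in ms (made-up numbers):
toy = pd.DataFrame({'avg_time': [1.0, 105.0, 110.0, 120.0, 900.0]})
kept = dropsome(toy, 1, 1)  # drop the single fastest and single slowest row
print(list(kept.avg_time))  # only the middle values remain
```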
# creating colormap
def plotmap(df, colmap=cm.get_cmap('viridis')):
    # normalizing the average times to [0, 1]
    minzed = (df.avg_time - df.avg_time.min()) / (df.avg_time - df.avg_time.min()).max()
    # inverted RGB values converted to hex color strings
    rgbs = [[int((1 - j) * 255) for j in colmap(i)[:-1]] for i in minzed]
    rgbs_h = ['#%02x%02x%02x' % tuple(i) for i in rgbs]
    df["clr"] = rgbs_h
    mymap = folium.Map(location=(39.903525, -101.551042), zoom_start=4)
    for k in range(len(df)):
        folium.CircleMarker(location=df.iloc[k].locs, color=df.iloc[k].clr,
                            popup=df.iloc[k]["name"] + ', avg: ' + str(np.round(df.iloc[k]["avg_time"], 3)),
                            radius=5, fill=True, fill_color=df.iloc[k].clr).add_to(mymap)
    return mymap
### Function adapted from:
# https://medium.com/@bobhaffner/folium-lines-with-arrows-25a0fe88e4e
def get_arrows(locations, color='blue', size=6, n_arrows=3):
    '''
    Get a list of correctly placed and rotated
    arrows/markers to be plotted.

    Parameters
    locations : list of lists of lat-lon pairs that represent the
        start and end of the line,
        e.g. [[41.1132, -96.1993], [41.3810, -95.8021]]
    color : default is 'blue'
    size : default is 6
    n_arrows : number of arrows to create, default is 3

    Return
    list of arrows/markers
    '''
    Point = namedtuple('Point', field_names=['lat', 'lon'])
    # creating points from our Point named tuple
    p1 = Point(locations[0][0], locations[0][1])
    p2 = Point(locations[1][0], locations[1][1])
    # getting the rotation needed for our marker;
    # subtracting 90 to account for the marker's orientation
    # of due East (get_bearing returns North)
    rotation = get_bearing(p1, p2) - 90
    # get an evenly spaced list of lats and lons for our arrows;
    # the first and last are discarded for aesthetics,
    # as markers denote the start and end
    arrow_lats = np.linspace(p1.lat, p2.lat, n_arrows + 2)[1:n_arrows + 1]
    arrow_lons = np.linspace(p1.lon, p2.lon, n_arrows + 2)[1:n_arrows + 1]
    arrows = []
    # creating each "arrow" and appending it to our arrows list
    for points in zip(arrow_lats, arrow_lons):
        arrows.append(folium.RegularPolygonMarker(location=points,
                                                  fill_color=color, number_of_sides=3,
                                                  radius=size, rotation=rotation))
    return arrows
### this is also adapted from the same source:
def get_bearing(p1, p2):
    '''
    Returns compass bearing from p1 to p2.

    Parameters
    p1 : namedtuple with lat, lon
    p2 : namedtuple with lat, lon

    Return
    compass bearing of type float

    Notes
    Based on https://gist.github.com/jeromer/2005586
    '''
    long_diff = np.radians(p2.lon - p1.lon)
    lat1 = np.radians(p1.lat)
    lat2 = np.radians(p2.lat)
    x = np.sin(long_diff) * np.cos(lat2)
    y = (np.cos(lat1) * np.sin(lat2)
         - (np.sin(lat1) * np.cos(lat2)
            * np.cos(long_diff)))
    bearing = np.degrees(np.arctan2(x, y))
    # adjusting for compass bearing
    if bearing < 0:
        return bearing + 360
    return bearing
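A quick sanity check of the bearing formula on the four cardinal directions (the function is restated compactly so the snippet is self-contained):

```python
from collections import namedtuple
import numpy as np

Point = namedtuple('Point', ['lat', 'lon'])

def get_bearing(p1, p2):
    """Compass bearing from p1 to p2, same formula as above."""
    long_diff = np.radians(p2.lon - p1.lon)
    lat1, lat2 = np.radians(p1.lat), np.radians(p2.lat)
    x = np.sin(long_diff) * np.cos(lat2)
    y = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(long_diff)
    bearing = np.degrees(np.arctan2(x, y))
    return bearing + 360 if bearing < 0 else bearing

origin = Point(0.0, 0.0)
print(get_bearing(origin, Point(10.0, 0.0)))   # due north -> 0.0
print(get_bearing(origin, Point(0.0, 10.0)))   # due east  -> 90.0
print(get_bearing(origin, Point(-10.0, 0.0)))  # due south -> 180.0
print(get_bearing(origin, Point(0.0, -10.0)))  # due west  -> 270.0
```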
# plotting the difference map:
def plot_diffmap(df, key='locs', suffixes=["_d", "_n"]):
    mymap = folium.Map(location=(39.903525, -101.551042), zoom_start=4)
    for k in range(len(df)):
        morn, nigh = df.iloc[k][key + suffixes[0]], df.iloc[k][key + suffixes[1]]
        folium.CircleMarker(location=morn, color='orange',
                            popup=df.iloc[k]["name"] + str(df.iloc[k][key + suffixes[0]]),
                            radius=5, fill=True, fill_color='orange').add_to(mymap)
        folium.Marker(location=nigh, icon=folium.Icon(color="blue"),
                      popup=df.iloc[k]["name"] + str(df.iloc[k][key + suffixes[1]]) + str(df.iloc[k][key + suffixes[0]])).add_to(mymap)
        folium.PolyLine(locations=[[float(morn[0]), float(morn[1])], [float(nigh[0]), float(nigh[1])]],
                        color='green').add_to(mymap)
        arrows = get_arrows(locations=[[float(morn[0]), float(morn[1])], [float(nigh[0]), float(nigh[1])]],
                            n_arrows=3)
        for arrow in arrows:
            arrow.add_to(mymap)
    return mymap
It is preferred to drop some outlying points. We also ask for the locations of all the IPs.
df = read_in_ping_results(folder = 'results/')
get_locays(df)
df_fewer = dropsome(df, 180, 50)
# Plot the sorted ping times
figsize(10,6)
plot(list(sorted(df.avg_time)), label = 'all')
plot(list(sorted(df_fewer.avg_time)), label = 'remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times")
grid()
One can see from the figure above that the real ping times start at around 100 ms; the much lower values come from servers that are located far closer to us while serving American universities' sites.
Below one can see the results on a folium map.
plotmap(df_fewer)
What is interesting to see here is how the average times grow as the distance increases, i.e. going from East to West. The labels on the circle markers show the name of the institution and the average ping time. Furthermore, even after dropping some outliers, there is still one relatively close to us, in Istanbul.
In this section I ran all the pings between 6 PM and 10 PM (local time), a few days later. This results in changes not only in the ping times: as one will see, the URLs and therefore the locations have also changed. The results can be found in the $\texttt{results\_night}$ folder.
df2 = read_in_ping_results(folder = 'results_night/')
get_locays(df2)
df2_fewer = dropsome(df2, 180, 50)
# Plot the sorted ping times
figsize(10,6)
plot(list(sorted(df2.avg_time)), label = 'all')
plot(list(sorted(df2_fewer.avg_time)), label = 'remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times (night round)")
grid()
plotmap(df2_fewer)
The same shading can be seen here, too, with some outliers.
There are some problems with comparing the two rounds: as I have mentioned, not only the ping times but also the universities' availability, URLs, and locations have changed.
# Combined plot
figsize(10,6)
plot(list(sorted(df.avg_time)),"orange" ,label = 'morning all')
plot(list(sorted(df_fewer.avg_time)), "red", label = 'morning remaining')
plot(list(sorted(df2.avg_time)), 'lightblue',label = 'night all')
plot(list(sorted(df2_fewer.avg_time)), 'darkblue',label = 'night remaining')
xlabel("Rank (universities sorted by average ping time)")
ylabel("Ping time [ms]")
legend(loc="best")
title("Sorted ping times (altogether)")
grid()
From the figure above, one can see that the average ping time did decrease at night. The reason might be that the overall network traffic also decreased.
# plotting histograms
figsize(16,9)
plt.grid()
plt.hist(df2_fewer.avg_time, label = 'Nighttime', color = 'blue');
plt.hist(df_fewer.avg_time, histtype="step", label = 'Daytime', color = 'red');
plt.xlabel("Ping time [ms]")
plt.ylabel("Occurrences")
plt.title("Histograms of the ping times")
plt.legend(loc = "best");
The histograms support our suspicion that there is a decrease in ping time at night: the shapes are similar, but the night distribution is shifted to the left.
print(f"# universities responded in the morning: {len(df)}, and in the night: {len(df2)}.")
# Printing out differences
print("Unis present in the morning\'s dataset, but not in the night\'s:")
for n in df.name:
if n not in list(df2.name):
print(f"\t{n}")
print("\nUnis present in the night's dataset, but not in the morning's:")
for n in df2.name:
if n not in list(df.name):
print(f"\t{n}")
To compare the results, I created a dataset consisting of all the overlapping institutions.
df_section = pd.merge(df, df2, on='name', suffixes=['_d', '_n'])
print(f"Shape of the section: {df_section.shape}")
df_section.head()
df_mismatch = df_section[df_section.locs_d != df_section.locs_n]
plot_diffmap(df_mismatch)
On the map above, one can see the location changes. The morning places are marked with orange circle markers, and the night locations with the blue regular ("popup") markers. The green lines connect the two places for a given university, and the blue arrows on the lines show the direction of the "move". Since the institutions did not really move (or at least I hope so, see here), it is only their IPs that changed. Apart from three unidirectional changes (Chicago, LA and Europe), most connections are bidirectional. This means that those places are somewhat connected to each other: for example, Southwestern College (see the dataframe below) has two entries in the provided excel sheet as well, one in Miami and one in California. All the bidirectional connections are like this (or even more complicated, see the San Antonio, Virginia Beach and NY triangle).
Since this is only a small portion of the whole overlap (14 rows only), the rest can still be compared.
df_mismatch[df_mismatch.name=="Southwestern College"]
for l in df_mismatch.index:
    try:
        df_section.drop(l, inplace=True)
    except KeyError:
        continue
print(f"Shape of the remaining section: {df_section.shape}")
df_section['avg_time'] = df_section.avg_time_n - df_section.avg_time_d
df_section['locs'] = df_section.locs_d
df_section_fewer = dropsome(df_section, 20, 10)
# Plot of the differences
figsize(6, 10)
plt.subplot(2,1,1)
plot(list(sorted(df_section.avg_time)),"orange" ,label = 'all')
xlabel("Rank (universities sorted by ping time difference)")
ylabel("Ping time difference [ms]")
legend(loc="best")
title("Difference in ping times (night-day) -- total")
grid()
plt.subplot(2,1,2)
plot(list(sorted(df_section_fewer.avg_time)), "red", label = 'remaining')
xlabel("Rank (universities sorted by ping time difference)")
ylabel("Ping time difference [ms]")
legend(loc="best")
title("Difference in ping times (night-day) -- cropped")
grid()
The two figures show the difference between the average ping times at night and during the day. As the second figure makes clear, the average is below 0, indicating an increase in speed (a decrease in ping time). Please note that here I only dropped 20 points from the lower end and 10 from the upper end.
print(f"The mean of the differences: {df_section_fewer.avg_time.mean()}.")
print(f"The standard deviation of the differences: {df_section_fewer.avg_time.std()}.")
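As an additional sanity check not performed in the original analysis, one could also compute a one-sample t statistic of the differences against zero to gauge how far the mean shift sits from no change; here is a sketch on made-up numbers:

```python
import numpy as np

# Hypothetical night-minus-day differences in ms (made-up numbers):
diffs = np.array([-12.0, -8.5, -15.2, -3.1, 2.4, -9.8, -11.0, -6.7])

mean = diffs.mean()
std = diffs.std(ddof=1)  # sample standard deviation
# one-sample t statistic against a zero mean difference
t_stat = mean / (std / np.sqrt(len(diffs)))
print(round(t_stat, 3))
```

A strongly negative t statistic would support that the nightly decrease is not just noise; on the real `df_section_fewer.avg_time` column the same three lines apply unchanged.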
plotmap(df_section_fewer)
On the map above one can see the differences between night and day. If the difference is negative, the marker is greener, meaning there was a decrease from day to night, i.e. it became faster to access these points. If the marker is more blue, that indicates an increase in ping time.
# plotting histogram
figsize(16,9)
plt.grid()
plt.hist(df_section.avg_time,bins = 200, color = 'blue');
plt.xlabel("Ping time [ms]")
plt.ylabel("Occurrences")
plt.yscale("log")
plt.title("Histograms of the differences in ping times");
This histogram also shows that most of the ping times decreased. Please note that the y scale is logarithmic.
This exercise helped in getting to know the basics of pinging and, in some sense, tracerouting. Our presupposition was correct: the ping time did decrease at night, presumably because the network traffic also decreased.