In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 $('div.output_stderr').hide();
 } else {
 $('div.input').show();
 $('div.output_stderr').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action='javascript:code_toggle()'><input STYLE='color: #4286f4' 
type='submit' value='Click here to toggle on/off the raw code.'></form>''')

# <center> Small World Effect on Twitter </center>

### <center> Final Report </center>

---

#### <center>by Kristóf Furuglyás </center>
##### <center> 2019. 11. 24. </center>


_Disclaimer: if you do not see the raw code, consider toggling it at the top of the page._

## Introduction

This short report summarizes my work with Twitter data and gives some practical insight to those who want to reproduce it. The task and the prerequisites can be found in the [lecture's private Github repo](https://github.com/sdam-elte/dslab2019/tree/master/projects/09-twitter_small_world). The raw code can be hidden or shown with the toggle at the top of the page. The report is an $\textit{ipynb}$ file, which can be rendered using [$\texttt{Jupyter Nbviewer}$](https://nbviewer.jupyter.org/). Most of the packages used here were already explored during the [Data exploration and visualization](https://github.com/sdam-elte/data-exp-vis-2019) course. All the work was done in Python 3.



[Twitter](https://twitter.com) is an open social media platform where users are free to share their thoughts and opinions in a so-called $\textit{tweet}$. Besides the raw text, a tweet can carry a lot of other information: images, locations, links to websites, etc. In this report, however, the text is in focus. A network built from the words occurring in tweets could help in observing contemporary online language through natural language processing (NLP). The main hypothesis is that this network is a small-world network; see scale-free networks by Barabási et al. [here](https://science.sciencemag.org/content/286/5439/509.full) and the small-world effect by Watts & Strogatz [here](https://www.nature.com/articles/30918.). Both topics are discussed later.

During this semester, the plan was the following:

1. __Set up Twitter API:__ numerous Python packages are available to access tweets smoothly; nonetheless, all of them require the user to hold developer rights in order to use the Twitter application programming interface (API).

2. __Gather tweets:__ once access has been granted, the next step is to obtain plenty of tweets. This means restricting a stream of tweets by applying different filters (location, hashtags, etc.).

3. __Clean the tweets:__ since a word can appear in many forms (e.g. run, running, runner), it is indispensable to strip the words down to a common stem.

4. __Create the word-graph:__ a word-graph is a network that has words as nodes and edges weighted by the number of co-occurrences of the two words.

5. __Explore small-world properties:__ as mentioned before, many interesting features of present-day language can be shown with the help of network science.



## __Setting up an API__

GDPR is the European regulation on handling personal data. Since I would be working with a lot of it, I was required to fill in a form stating that I would never give any information to any governmental institution nor publish any personal data. Furthermore, I had to declare that I am not a terrorist and that my work is purely for educational purposes. After this procedure I was granted developer rights, which in practice mostly meant receiving the authentication credentials. Below one can see a snapshot of the login screen.

<br>

<a href = "tw_dev_app.png" target="_blank">
    <center>
        <img src="tw_dev_app.png" alt="tw_api" width="600"/>
    </center>
    <center>
         The login screen of the Twitter developer API.
    </center>
</a>

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [12]:
# Yes, I am going to need all of these.

import sys
import time  # used by the stream listener when it backs off after an error
import pandas as pd
import json
import re
import nltk
import numpy as np
import networkx as nx
import operator
import plotly.graph_objects as go
import plotly
import folium
import collections
from collections import Counter
from nltk.stem import SnowballStemmer
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import itertools
import pickle
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

sys.path.insert(1, '/home/workdir/')
import twitter_credentials # Here are the credentials stored

## Gathering tweets

The tweets were gathered using the $\texttt{tweepy}$ package. The only restriction was on location: I streamed only those tweets that were located in a bounding box around England. The code below is adapted from [vprusso's tutorial](https://github.com/vprusso/youtube_tutorials/blob/master/twitter_python/part_1_streaming_tweets/tweepy_streamer.py). The output of the stream was multiple $\textit{.json}$ files. Handling this file type had also been discussed at the $\texttt{dataexp}$ course, therefore I do not explain the details here; in short, each tweet is basically a nested dict with values of different types. For the main project I streamed $\sim 8500$ tweets, which took about an hour.
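As a minimal sketch of what "nested dict" means here, the snippet below parses one line of such a file and reads two nested fields. The record is a shortened, made-up example (the name and coordinates are placeholders, not real data); only the key names follow the tweet structure shown in this report.

```python
import json

# A shortened, made-up record; real tweets carry many more keys.
raw_line = (
    '{"id": 1, "text": "Seen.", "user": {"name": "Example"}, '
    '"place": {"bounding_box": {"coordinates": '
    '[[[-0.26, 51.42], [-0.26, 51.49], [-0.19, 51.49], [-0.19, 51.42]]]}}}'
)

tweet = json.loads(raw_line)        # one line of the stream becomes one dict
user_name = tweet["user"]["name"]   # nested access: one key per level

# The corners of the bounding box are stored as [longitude, latitude] pairs.
sw_lon, sw_lat = tweet["place"]["bounding_box"]["coordinates"][0][0]

print(user_name, sw_lat, sw_lon)
```

The same pattern (one `json.loads` call per line, then plain dictionary indexing) is what the loading and cleaning cells rely on.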

In [13]:
# # # # TWITTER STREAMER # # # #
class TwitterStreamer():
    """
    Class for streaming and processing live tweets.
    """
    def __init__(self):
        pass

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):
        # This handles Twitter authentication and the connection to the Twitter Streaming API
        listener = StdOutListener(fetched_tweets_filename)
        auth = OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
        auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
        stream = Stream(auth, listener)

        # This line filters the Twitter stream to capture data by the keywords:
        stream.filter(track=hash_tag_list)

In [14]:
# # # # TWITTER STREAM LISTENER # # # #
class StdOutListener(StreamListener):
    """
    This is a basic listener that prints received tweets to stdout and appends them to a file.
    """
    def __init__(self, fetched_tweets_filename):
        self.fetched_tweets_filename = fetched_tweets_filename
        self.counter = 0
        self.limit = 10

    def on_status(self, status):
        try:
            userid = status.user.id_str
            geo = str(status.coordinates)
            if geo != "None":
                print(userid + ',' + geo)
            else:
                print("No coordinates")
            self.counter += 1
            if self.counter < self.limit:
                return True
            else:
                return False  # returning False tells tweepy to stop the stream
        except BaseException as e:
            print('failed on_status,',str(e))
            time.sleep(5)
            
    def on_data(self, data):
        try:
            if self.counter <= self.limit:
                print(data)
                with open(self.fetched_tweets_filename, 'a') as tf:
                    tf.write(data)
                self.limit += 1
            return True
        except BaseException as e:
            print("Error on_data %s" % str(e))
        return True
          

    def on_error(self, status):
        print(status)

In [15]:
# hash_tag_list = ["donal trump", "hillary clinton", "barack obama", "bernie sanders"]
# 
# Here I define the output file
fetched_tweets_filename = "tweets.txt"


# Authenticate using twitter_credentials.py and connect to Twitter Streaming API.


listener = StdOutListener(fetched_tweets_filename)
auth = OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
stream = Stream(auth, listener)

# twitter_streamer.stream_tweets(fetched_tweets_filename, hash_tag_list)
# 
# England's bounding box as [west, south, east, north] (longitude, latitude pairs)
stream.filter(locations=[-6.38,49.87,1.77,55.81])

{"created_at":"Tue Dec 10 12:06:04 +0000 2019","id":1204371700722348037,"id_str":"1204371700722348037","text":"Seen.","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":910222392898834435,"id_str":"910222392898834435","name":"Alison","screen_name":"alisonjbryan","location":"Dublin\/Meath","url":null,"description":"@SYPIreland - publishing - culture vulture - probably remind you of someone else - big oule nerd","translator_type":"none","protected":false,"verified":false,"followers_count":214,"friends_count":367,"listed_count":0,"favourites_count":11879,"statuses_count":3722,"created_at":"Tue Sep 19 19:21:44 +0000 2017","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"is_translator":false,"

{"created_at":"Tue Dec 10 12:06:05 +0000 2019","id":1204371705721962499,"id_str":"1204371705721962499","text":"@BillyKilby @Arsenal I've never seen him play well. I'd rather Upamecano","display_text_range":[21,72],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1204087547644788737,"in_reply_to_status_id_str":"1204087547644788737","in_reply_to_user_id":599992465,"in_reply_to_user_id_str":"599992465","in_reply_to_screen_name":"BillyKilby","user":{"id":438632479,"id_str":"438632479","name":"Tom RH","screen_name":"Tomrh1988","location":"Maidstone, Kent","url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":357,"friends_count":446,"listed_count":6,"favourites_count":2380,"statuses_count":19666,"created_at":"Fri Dec 16 20:41:34 +0000 2011","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_ena

{"created_at":"Tue Dec 10 12:06:08 +0000 2019","id":1204371720351748096,"id_str":"1204371720351748096","text":"@singuIaravery have fun today :)","display_text_range":[15,32],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1204361881743441921,"in_reply_to_status_id_str":"1204361881743441921","in_reply_to_user_id":967449349952364547,"in_reply_to_user_id_str":"967449349952364547","in_reply_to_screen_name":"singuIaravery","user":{"id":1160357073038848000,"id_str":"1160357073038848000","name":"\ud83c\udf52\u22c6\u2727 kennedy \u2727\u22c6\ud83c\udf52","screen_name":"corbynvc","location":"idk","url":"http:\/\/instagram.com\/astrxcqrbyn","description":"v indecisive but thats ok","translator_type":"none","protected":false,"verified":false,"followers_count":197,"friends_count":241,"listed_count":15,"favourites_count":5290,"statuses_count":646,"created_at":"Sun Aug 11 01:07:39 +00

{"created_at":"Tue Dec 10 12:06:09 +0000 2019","id":1204371723631702017,"id_str":"1204371723631702017","text":"@timminchin Yep brilliant ending when it all falls together. And you start balling. Well done Timbo.","display_text_range":[12,100],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1204332463838781440,"in_reply_to_status_id_str":"1204332463838781440","in_reply_to_user_id":18980276,"in_reply_to_user_id_str":"18980276","in_reply_to_screen_name":"timminchin","user":{"id":1202537975961280517,"id_str":"1202537975961280517","name":"Joe Dack","screen_name":"JoeDack11","location":null,"url":null,"description":null,"translator_type":"none","protected":false,"verified":false,"followers_count":1,"friends_count":29,"listed_count":0,"favourites_count":52,"statuses_count":14,"created_at":"Thu Dec 05 10:39:53 +0000 2019","utc_offset":null,"time_zone":null,"geo_enabled":true,"

{"created_at":"Tue Dec 10 12:06:10 +0000 2019","id":1204371727431667715,"id_str":"1204371727431667715","text":"@WendyMaisey @warringtonnews Sad","display_text_range":[29,32],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1202997513327919104,"in_reply_to_status_id_str":"1202997513327919104","in_reply_to_user_id":302601234,"in_reply_to_user_id_str":"302601234","in_reply_to_screen_name":"WendyMaisey","user":{"id":2742693677,"id_str":"2742693677","name":"Mark Ridgley","screen_name":"zogman64","location":null,"url":null,"description":"Hate Bullys, Tory's. Love Nature.","translator_type":"none","protected":false,"verified":false,"followers_count":736,"friends_count":878,"listed_count":0,"favourites_count":1515,"statuses_count":14008,"created_at":"Thu Aug 14 14:43:34 +0000 2014","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":null,"contributors_enabled":false,"i

{"created_at":"Tue Dec 10 12:06:11 +0000 2019","id":1204371729650466819,"id_str":"1204371729650466819","text":"@OrdnanceSurvey #maps #cartography","display_text_range":[16,34],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1204370250319454208,"in_reply_to_status_id_str":"1204370250319454208","in_reply_to_user_id":22614266,"in_reply_to_user_id_str":"22614266","in_reply_to_screen_name":"OrdnanceSurvey","user":{"id":118802611,"id_str":"118802611","name":"Peregrine Bush","screen_name":"PerezMapManBush","location":"Deepest Rural Suffolk","url":"http:\/\/www.pb-photos.com","description":"Cartographer, photographer & aviation enthusiast. Author of UK Military Airfields Guide https:\/\/t.co\/MrvT5h2zLk Married to Heather and dad to Fin & Zac also see @pemaps1","translator_type":"none","protected":false,"verified":false,"followers_count":185,"friends_count":141,"listed_count":

{"created_at":"Tue Dec 10 12:06:12 +0000 2019","id":1204371735384133633,"id_str":"1204371735384133633","text":"@Hallemillerwil1 @Scampicus @DavidLammy @allisonpearson Cummings and Johnson are trying to outdo Trump, all with Putin\u2019s help. Wake up UK!","display_text_range":[56,138],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":1204371002081398784,"in_reply_to_status_id_str":"1204371002081398784","in_reply_to_user_id":1097159644722667520,"in_reply_to_user_id_str":"1097159644722667520","in_reply_to_screen_name":"Hallemillerwil1","user":{"id":39570149,"id_str":"39570149","name":"Roy Bailey","screen_name":"DrRoyBailey","location":"Crowthorne","url":"http:\/\/www.linkedin.com\/in\/royurl","description":"Former senior police officer, lecturer and business consultant. Labour activist and Chair of Bracknell CLP. Doctorate in criminal justice.","translator_type":"none","prot

{"created_at":"Tue Dec 10 12:06:13 +0000 2019","id":1204371737976229888,"id_str":"1204371737976229888","text":"@DeannaMidSussex @MidSussexViews @gembolton @mimsdavies @RobbieEggleston Yes, @DeannaMidSussex was definitely the s\u2026 https:\/\/t.co\/fdfXfT3Rkx","display_text_range":[73,140],"source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":true,"in_reply_to_status_id":1204345814950260736,"in_reply_to_status_id_str":"1204345814950260736","in_reply_to_user_id":1189699466653241344,"in_reply_to_user_id_str":"1189699466653241344","in_reply_to_screen_name":"DeannaMidSussex","user":{"id":130106723,"id_str":"130106723","name":"James Thompson","screen_name":"jamesthompsonuk","location":"East Grinstead","url":null,"description":"Homeless Officer, trade unionist and Europhile. Love cats and wine.","translator_type":"none","protected":false,"verified":false,"followers_count":541,"friends_count":1733,"listed_count

KeyboardInterrupt: 

In [None]:
# Loading in the data

with open(fetched_tweets_filename) as f:
    data = f.readlines()

tweets = []
for k in data:
    tweets.append(json.loads(k))

## Cleaning the tweets


Now that I have my tweets it is important to take a look at them. Below you can see the keys for a tweet.

In [142]:
tweets[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'extended_tweet', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])

In [143]:
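# Keep only the tweets that carry a place (and hence a bounding box).
# Building a new list avoids skipping elements, which happens when
# items are deleted from a list while iterating over it.
tweets = [tweet for tweet in tweets if tweet.get("place") is not None]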

In [144]:
df = pd.DataFrame()

df['id'] = np.array([tweet["id"] for tweet in tweets])
df['date'] = np.array([tweet["created_at"] for tweet in tweets])
df['source'] = np.array([tweet["source"] for tweet in tweets])
df['likes'] = np.array([tweet["favorite_count"] for tweet in tweets])
df['retweets'] = np.array([tweet['retweet_count'] for tweet in tweets])
df['name'] = np.array([tweet['user']['name'] for tweet in tweets])
df['locs'] = [[loc[::-1]for loc in tweet['place']['bounding_box']['coordinates'][0] if loc is not None] for tweet in tweets]

Sometimes the text was longer than 140 characters, in which case the complete text is stored under the so-called 'extended_tweet' key. The ratio of tweets for which the 'full_text' option was not needed:

In [291]:
texts = []
cnt = 1
for i, tweet in enumerate(tweets):
    try:
        texts.append(tweet['extended_tweet']['full_text'])
    except KeyError:
        texts.append(tweet['text'])
        cnt += 1
print(f"{cnt/i}")

0.7425100259495164
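The fallback from `extended_tweet` to `text` can also be wrapped in a small helper. The following is only a sketch: the function name `full_text` and the two sample records are made up for illustration, and the share is computed as a plain count divided by the number of tweets.

```python
def full_text(tweet):
    """Return the untruncated text of a tweet dict from the streaming API."""
    extended = tweet.get("extended_tweet")
    if extended is not None and "full_text" in extended:
        return extended["full_text"]
    return tweet["text"]

# Two made-up records for illustration
sample = [
    {"text": "You were great"},
    {"text": "Side saw good chance thwa",
     "extended_tweet": {"full_text": "Side saw good chance thwarted"}},
]

texts_demo = [full_text(t) for t in sample]
# Share of tweets that did not need the extended form
share_short = sum("extended_tweet" not in t for t in sample) / len(sample)
print(texts_demo, share_short)
```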


Since I did not need every single property, I only kept a few: the id, date, source and location of the tweet, the number of likes and retweets, the name of the user, and the raw text (and its length) of the tweet. Below you can see the first five rows.

In [146]:
df['text'] = texts
df['len'] = np.array([len(i) for i in df.text])

In [147]:
df.head()

Unnamed: 0,id,date,source,likes,retweets,name,locs,text,len
0,1181300835122327552,Mon Oct 07 20:10:41 +0000 2019,"<a href=""http://twitter.com/download/android"" ...",0,0,Peter Dudley,"[[51.543815, 0.010398], [51.626165, 0.010398],...",Side saw good chance thwarted by last ditch ta...,147
1,1181300835239759872,Mon Oct 07 20:10:41 +0000 2019,"<a href=""http://twitter.com/download/iphone"" r...",0,0,Damian Wawrzyniak,"[[52.603205, -0.202653], [52.620535, -0.202653...",@KarenWhiteFood @tweetertucker @AdamHandling I...,322
2,1181300835348815874,Mon Oct 07 20:10:41 +0000 2019,"<a href=""http://twitter.com/download/iphone"" r...",0,0,deli Muru,"[[54.543241, -6.036116], [54.648497, -6.036116...",@maggimccvil Congratulations Caroline! ⭐️,41
3,1181300836519075841,Mon Oct 07 20:10:41 +0000 2019,"<a href=""http://twitter.com/download/android"" ...",0,0,Jordan Carroll,"[[53.438332, -3.058666], [53.501369, -3.058666...",@Eric_Toffee1878 Original mate. Disturbing but...,57
4,1181300838846926848,Mon Oct 07 20:10:42 +0000 2019,"<a href=""http://twitter.com/download/iphone"" r...",0,0,sarah guerra,"[[51.417277, -0.259465], [51.486036, -0.259465...",You were great,14


It is worth plotting a few tweets to double-check their location. Below one can see the lower-left corner of the bounding box of each plotted tweet.

In [148]:
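mymap = folium.Map(location=[52.809865, -2.118092], zoom_start=5.4, tiles='cartodbpositron')
# Plot the first 1000 tweets (or fewer, if the frame is smaller).
for i in range(min(1000, len(df))):
    # parse_html=True escapes the tweet text so that it cannot break the popup markup
    popup = folium.Popup(df.text[i], parse_html=True)
    folium.Marker(location=df.locs[i][0], popup=popup).add_to(mymap)
mymap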

To remove URLs from the tweets I used the $\texttt{re}$ package, and to reduce each word to its stem I used $\texttt{SnowballStemmer}$ from the $\texttt{nltk}$ package. Splitting the cleaned text into individual words is what tokenizing means. After tokenization I was able to access the raw words in a list. Below you can see the steps of this procedure on one example.

In [149]:
snow = SnowballStemmer('english',ignore_stopwords=False)

In [150]:
print(f"Text of tweet no.8: \n\n{df.text[8]}")

Text of tweet no.8: 

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange


In [151]:
clean = re.sub(r'http\S+', '', df.text[8])
print(f"\nAfter cleaning:\n\n{clean}")


After cleaning:

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition!  via @UKChange


In [152]:
clean_tknzd = [snow.stem(word) for word in re.findall(r'\w+', clean.lower())]
print(f"After tokenizing:\n\n{clean_tknzd}")

After tokenizing:

['bob', 'prattey', 'liverpool', 'exhibit', 'centr', 'do', 'not', 'host', 'trophi', 'hunt', 'safari', 'compani', 'sign', 'the', 'petit', 'via', 'ukchang']


In [153]:
df["nourl"] = [re.sub(r'http\S+', '', t) for t in df.text]

In [154]:
df['tkzd_clnd'] = [[snow.stem(word) for word in re.findall(r'\w+', t.lower())] for t in df.nourl]

Even though some meaningless words might appear, this method standardizes every form of a word, so that we are able to compare them.
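The cleaning steps above can be collected into one function. This is a sketch using only the standard library: `clean_tokens` and its `stem` parameter are names introduced here for illustration, and by default no stemming is applied so that the example runs without $\texttt{nltk}$.

```python
import re

URL_PATTERN = re.compile(r"http\S+")   # same URL pattern as used above
WORD_PATTERN = re.compile(r"\w+")      # runs of letters, digits and underscores

def clean_tokens(text, stem=lambda word: word):
    """Strip URLs, lowercase, split into words and apply `stem` to each word."""
    no_url = URL_PATTERN.sub("", text)
    return [stem(word) for word in WORD_PATTERN.findall(no_url.lower())]

tokens = clean_tokens("Sign the Petition! https://t.co/W8VcURxSLN via @UKChange")
print(tokens)  # ['sign', 'the', 'petition', 'via', 'ukchange']
```

Passing the stemmer's `stem` method as the `stem` argument (for example `clean_tokens(text, stem=snow.stem)`) should give the stemmed tokens used in the rest of the report.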

## Network of tweets

To create a word-graph, I generated a $\texttt{MultiGraph}$ with the $\texttt{networkx}$ package. In this network the nodes are the tokenized and cleaned words, and two nodes are connected if they appear in the same tweet. No self-loops are allowed. Two cases can be distinguished: in the first, parallel edges are allowed but carry no weight; in the second, there is at most one edge between two nodes but every edge is weighted. Accordingly, a more frequent co-occurrence means more parallel edges in the first case and a greater weight in the second. I also removed the nodes "s" and "t" for practical reasons (for example, the text "John's" would be split into "John" and "s", resulting in two different nodes, whilst "s" has no explicit meaning). Below you can see some interesting facts about the network.

In [155]:
g = nx.MultiGraph()

for index, tweet in df.iterrows():
    for i, w in enumerate(tweet['tkzd_clnd']):
        g.add_node(w)
        for j in range(i):
            if tweet['tkzd_clnd'][j]!=w:
                g.add_edge(tweet['tkzd_clnd'][j],w)

g.remove_node("s")
g.remove_node("t")
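For the weighted variant, the same co-occurrence counts can be collected without building the multigraph first. The sketch below uses only the standard library; `cooccurrence_weights` and the two toy token lists are made up for illustration. Like the loop above, it counts one co-occurrence per pair of word positions and skips pairs of identical words.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_weights(tokenized_tweets, skip=("s", "t")):
    """Map each unordered word pair to its number of co-occurrences."""
    weights = Counter()
    for tokens in tokenized_tweets:
        kept = [w for w in tokens if w not in skip]
        for a, b in combinations(kept, 2):
            if a != b:  # no self-loops
                # sort the pair so that (a, b) and (b, a) share one key
                weights[tuple(sorted((a, b)))] += 1
    return weights

demo = [["sign", "the", "petit"], ["the", "petit", "s"]]
weights = cooccurrence_weights(demo)
print(weights.most_common(1))  # [(('petit', 'the'), 2)]
```

The resulting counter can be loaded into a simple graph with `nx.Graph.add_weighted_edges_from`, passing `(a, b, w)` triples built from `weights.items()`.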

In [287]:
# Size of the largest connected component
lcc = len(max(nx.connected_components(g), key=len))
print(f"Num of nodes and edges: {len(g.nodes), len(g.edges())}")
print(f"Largest connected component consists of {lcc} nodes, which is {np.round(lcc/len(g.nodes()), 3)} of the all")

Num of nodes and edges: (22181, 1569028)
Largest connected component consists of 21279 nodes, which is 0.959 of the all


In [288]:
# nx.connected_component_subgraphs was removed in networkx 2.4; this is the equivalent
Gc = g.subgraph(max(nx.connected_components(g), key=len)).copy()

The words with the highest degree:

In [161]:
file = open("words_zipf.txt", "r") 

In [162]:
words = []
freqs = []

for line in file:
    t = line.split()
    words.append(t[0])
    freqs.append(t[1])

In [None]:
degs = sorted(g.degree, key=lambda x: x[1], reverse=True)
node, occ = [x[0] for x in degs], [x[1] for x in degs]

n = 20
len_g_e = len(g.edges()) 

data =[go.Bar(x = node[:n], y = [x/len_g_e for x in occ[:n]], name = "From twitter"),
      go.Bar(x = [x.lower() for x in words[:n]], y = [float(x)/1e6 for x in freqs[:n]], name = "Real" )]  # occurrences per million words

fig2 = go.Figure(data, layout=layout)
#fig1.update_layout(xaxis_type="log")#, yaxis_type="log")
fig2.update_layout({'hovermode': 'x',})
fig2.update_xaxes(title_text = "Nodes", tickangle=315)
fig2.update_yaxes(title_text = "Num of edges (relative)")
fig2.update_layout(title="Most common words" ,     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

In [164]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig2, filename='mostcommnodes')

The figure above shows the most common words (the words with the highest degree) with blue bars. Their measure is the relative occurrence, which is the number of edges connecting to a word divided by the total number of edges. The red bars show words taken from [this website](http://www.cs.cmu.edu/~cburch/words/top.html), whose frequencies are in accordance with [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law). The measure here is the number of occurrences per million words. What is interesting to see is that, despite the two underlying text collections being very different, 15 out of the 20 words match, and often their positions do, too. If we were to compare the 50 most frequent words, the relative mismatch might decrease. This is a useful sanity check that our measurement is based on real-world data.

In [165]:
edgs = list(g.edges())

edgs_d = dict(Counter(edgs))

sorted_edgs = sorted(edgs_d.items(), key=operator.itemgetter(1), reverse=True)

In [166]:
n = 20

toplot = [[e[0], e[1]] for e in sorted_edgs[:n]]  # the n most frequent word pairs
nums = [100*z[1]/len(edgs) for z in toplot]
labels = [str(z[0][0]+' - '+ z[0][1]) for z in toplot]

Below you can see the 20 most common co-occurrences (the most numerous parallel edges).

In [None]:

fig = go.Figure(go.Bar(x = labels,y =  nums, name = 'Occurrence'), layout=layout )

fig.update_xaxes(title_text = "Connection pairs", tickangle=315)
fig.update_yaxes(title_text = "% of all the connections")
fig.update_layout(title_text="Relative occurrence of the most common edges",     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

In [168]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='simple-3d-scatter')

This figure shows the edges with the greatest weight, i.e. the word pairs that occurred together the most. The measure is the number of occurrences divided by the total number of edges. As could have been predicted, the word "the" is present in many pairs, and the pair of the two most frequent nodes stands out well above the rest.

# <center> Exploring small-world properties </center>

As mentioned in the introduction, many real-world networks are scale-free and exhibit the small-world effect. The first means that the degree distribution follows a power law:


$$
p(k) \sim k^{-\gamma},
$$

where $p(k)$ is the probability of finding a node with degree $k$ and $\gamma$ is the scaling exponent. This exponent is mostly above 2 for real-world networks. The small-world effect means that the longest shortest path ($d_{max}$) between two nodes in a network of $N$ nodes scales as

$$
d_{max} \sim \log(N). 
$$

This means that every node is reachable from any other in only a few steps. Furthermore, I measured the betweenness of all the nodes and compared it to their relative frequency. The betweenness centrality of a node $v$ is given by the expression
$$
    g(v)=\sum_{s\neq v\neq t}\frac{\sigma_{st}(v)}{\sigma_{st}},
$$

where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$. In graph theory, betweenness centrality is a measure of centrality based on shortest paths. For every pair of vertices in a connected graph there exists at least one shortest path, i.e. a path that minimizes either the number of edges it passes through (for unweighted graphs) or the sum of the weights of its edges (for weighted graphs). The betweenness centrality of a vertex counts how many of these shortest paths pass through it ([source](https://en.wikipedia.org/wiki/Betweenness_centrality)).
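To make the definition concrete, here is a small pure-Python sketch that evaluates $g(v)$ on a toy graph with Brandes' accumulation scheme. It assumes an unweighted, undirected graph given as a dict of neighbour sets and counts every unordered pair $\{s, t\}$ once; it is meant as an illustration, not as the routine used for the results in this report.

```python
from collections import deque

def betweenness(adj):
    """Unnormalised betweenness g(v) for an unweighted, undirected graph.

    `adj` maps each node to the set of its neighbours.
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s: count shortest paths (sigma) and remember predecessors
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back towards s
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every unordered pair {s, t} was visited from both ends
    return {v: value / 2 for v, value in score.items()}

# Path a - b - c: the only shortest path between a and c passes through b
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
scores = betweenness(path)
print(scores)
```

On the three-node path only the middle node gets a non-zero score, because it lies on the single shortest path between the two end nodes.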

Below I created the weighted graph and compared its number of nodes and edges to those of the non-weighted multi-edged one.

In [169]:
G2 = nx.Graph()
for u,v,data in g.edges(data=True):
    w = data['weight'] if 'weight' in data else 1.0
    if G2.has_edge(u,v):
        G2[u][v]['weight'] += w
    else:
        G2.add_edge(u, v, weight=w)


In [171]:
gnodes , gedges = len(g.nodes()),len(g.edges())
G2nodes, G2edges = len(G2.nodes()),len(G2.edges())

In [292]:
print(f"Num of nodes and edges in the multie, non-w: {gnodes}, {gedges}")
print(f"Num of nodes and edges in the non-m, weighted: {G2nodes}, {G2edges}")
print(f"Ratio between the nodes and edges: {G2nodes/gnodes}, {G2edges/gedges}")

Num of nodes and edges in the multie, non-w: 22181, 1569028
Num of nodes and edges in the non-m, weighted: 21982, 683128
Ratio between the nodes and edges: 0.9910283576033542, 0.4353829249701089


As one can see, the number of nodes decreased. The most likely reason is that some nodes had no edges at all, for example the words of tweets that contained only one word. Since I built the new network from the edge list of the former one, these isolated nodes were left out.

In [None]:
nx.write_weighted_edgelist(G2, "edgelist2.txt")  # safety first, and visualization second

In [174]:
# Measure the degree distribution for both of the networks
degree_sequence = sorted([d for n, d in g.degree()], reverse=True)  # degree sequence
degreeCount = collections.Counter(degree_sequence)
# NumPy arrays, so that the fitted power law can be evaluated on slices later
deg, cnt = map(np.array, zip(*degreeCount.items()))

degree_sequence2 = sorted([d for n, d in G2.degree()], reverse=True)  # degree sequence
degreeCount2 = collections.Counter(degree_sequence2)
deg2, cnt2 = map(np.array, zip(*degreeCount2.items()))

In [175]:
# Arbitrary limits: deg is sorted in descending order, so the fit skips the
# `fro` largest and the `to` smallest distinct degrees.
fro, to = 530, 25
def func_powerlaw(x, m, c):
    return x**m * c

popt2, pcov2 = curve_fit(func_powerlaw, deg2[fro:-to], cnt2[fro:-to], p0=[2, 10**2], maxfev=2000)
popt, pcov = curve_fit(func_powerlaw, deg[fro:-to], cnt[fro:-to], p0=[2, 10**2], maxfev=2000)

In [None]:
avg_clus2 = nx.average_clustering(G2)

In [249]:
# The average degree is 2E/N. (nx.average_degree_connectivity returns the average
# neighbour degree per degree class, which is a different quantity.)
avg_deg_2_a = 2 * G2.number_of_edges() / G2.number_of_nodes()
avg_deg_a = 2 * g.number_of_edges() / g.number_of_nodes()


In [None]:

fig = go.Figure([go.Scatter(x = deg,y =  cnt, name = 'non-w, multie', mode='markers', marker=dict(color="blue")),
                go.Scatter(x = deg2,y =  cnt2, name = 'weighted, non-multie', mode='markers', marker=dict(color="red")),
                go.Scatter(x = deg[fro:-to],y =  func_powerlaw(deg[fro:-to], *popt), name = 'gamma' + ' = ' +str(np.round(abs(popt[0]), 3)), mode='lines', marker=dict(color="#17becf")),
                  go.Scatter(x = deg2[fro:-to],y =  func_powerlaw(deg2[fro:-to], *popt2), name = 'gamma' + ' = ' +str(np.round(abs(popt2[0]), 3)), mode='lines',  marker=dict(color="#e377c2")),
                ], layout=layout )


fig.update_xaxes(title_text = "Degree", tickangle=315)
fig.update_yaxes(title_text = "Num of words")
fig.update_layout(xaxis_type="log", yaxis_type="log")
fig.update_layout(title_text="Degree distribution" + ". The avg. deg. are " + str(np.round(avg_deg_a, 3)) + ", " + str(np.round(avg_deg_2_a, 3)) + ",\nAvg clust:" + str(np.round(avg_clus2, 3))  ,     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

fig.update_layout({'hovermode': 'x',})


In [263]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='degdistnew')

Here one can see the degree distribution of both networks; over the fitted range both follow a power law reasonably well. For the modified network, $\gamma = 2.226$ is consistent with the exponents reported for real-world networks. The title also shows the average degree, which decreased after merging the parallel edges into weighted ones; this is not surprising, and its value tells us that the network is densely connected.
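A quick cross-check of the exponent is a straight-line fit on log-log axes, since $p(k) \sim k^{-\gamma}$ implies $\log p = -\gamma \log k + \text{const}$. The sketch below uses only the standard library and synthetic data; `fit_gamma` is a name introduced here. Note that a log-log least-squares fit weights the points differently from a fit in linear space, so the two estimates need not agree on real data.

```python
import math

def fit_gamma(degrees, counts):
    """Least-squares slope of log(count) against log(degree); returns gamma = -slope."""
    xs = [math.log(k) for k in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx

# Synthetic check: counts that follow c * k^(-2.2) exactly
ks = list(range(1, 200))
cs = [1000.0 * k ** -2.2 for k in ks]
gamma_hat = fit_gamma(ks, cs)
print(round(gamma_hat, 3))
```

On the real data one would pass the same slices of the degree and count arrays that were used for the fit above.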

Unfortunately, computing $d_{max}$ exactly, even for the giant component alone, would take more computing resources than I have, so I skipped it.
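A cheaper alternative, which was not used in this report, is to run a breadth-first search from a handful of randomly chosen nodes: the largest distance found is a lower bound on $d_{max}$. The sketch below assumes an unweighted graph given as a dict of neighbour sets; the function names and the sample size are made up for illustration.

```python
import random
from collections import deque

def eccentricity(adj, source):
    """Largest BFS distance from `source` within its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return max(dist.values())

def diameter_lower_bound(adj, n_samples=10, seed=0):
    """Lower bound on d_max from BFS runs started at randomly sampled nodes."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    sources = rng.sample(nodes, min(n_samples, len(nodes)))
    return max(eccentricity(adj, s) for s in sources)

# Path graph 0-1-2-3-4: the true diameter is 4
path = {i: {j for j in (i - 1, i + 1) if 0 <= j <= 4} for i in range(5)}
bound = diameter_lower_bound(path, n_samples=5)
print(bound)
```

Each search costs time proportional to the number of nodes plus edges, so a few samples on the giant component should be feasible; the adjacency dict could be built as `{n: set(Gc[n]) for n in Gc}`.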

In [182]:
r = nx.centrality.betweenness_centrality(G2, weight='weight')

sorted_x = sorted(r.items(), key=operator.itemgetter(1), reverse=True)

In [None]:
num_to_show = 30

x = [i[0] if i[0] == j else i[0] + " (" + j + ")" for i,j in zip(sorted_x[:num_to_show], node[:num_to_show])]
y = [i[1] for i in sorted_x[:num_to_show]]

fig10 = go.Figure([go.Bar(x = x,y =  y, name = 'betweenness', )
                ], layout=layout )


fig10.update_xaxes(title_text = "Node name", tickangle=315)
fig10.update_yaxes(title_text = "Betweenness")
#fig10.update_layout(xaxis_type="log", yaxis_type="log")
fig10.update_layout(title_text="Nodes w highest betweenness",     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

fig10.update_layout({'hovermode': 'x',})


In [200]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig10, filename='betw')

This figure shows the 30 nodes with the highest betweenness. Where another word appears in parentheses, it is the word with the same rank in the frequency table. For example, "have" is in 15th place in this figure (it is the node with the 15th highest betweenness), whereas "this" was the 15th most frequent word. Mostly there are only small changes, meaning that high betweenness correlates with high frequency.
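
The agreement between the two rankings could also be quantified rather than read off the figure. A minimal sketch, assuming both orderings contain the same items without ties, is the Spearman rank correlation; the word lists below are hypothetical toy data, not the notebook's results.

```python
def spearman_rho(order_a, order_b):
    """Spearman rank correlation of two orderings of the same items (no ties)."""
    rank_b = {item: rank for rank, item in enumerate(order_b)}
    n = len(order_a)
    # Sum of squared rank differences between the two orderings.
    d_squared = sum((rank - rank_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical rankings: two neighbouring words swap places.
by_betweenness = ["the", "to", "a", "have", "this"]
by_frequency = ["the", "to", "a", "this", "have"]

rho = spearman_rho(by_betweenness, by_frequency)
print(rho)
```

A value close to 1 would support the statement that betweenness and frequency rank the words almost identically.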

# <center> Communities </center>

Furthermore, I tried to find communities with different algorithms, but unfortunately the nested stochastic block model ([nsbm](https://graph-tool.skewed.de/static/doc/demos/inference/inference.html)) not only killed the kernel every single time but nearly killed my laptop as well. Because of this, I decided to use a greedy modularity algorithm instead.

Community detection with the greedy modularity algorithm is based on the modularity

$$
Q = \frac{1}{2m} \sum_{ij} \bigg[ A_{ij} - \frac{k_i k_j}{2m} \bigg] \delta(c_i, c_j),
$$

where

- $A_{ij}$ is the weight of the edge between nodes $i$ and $j$,
- $k_{i}$ is the sum of the weights of the edges of node $i$,
- $m$ is the normalization due to the weights (the sum of all edge weights),
- $c_{i}$ and $c_{j}$ are the communities of nodes $i$ and $j$,
- $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise.

The goal is to find the partition that maximizes $Q$.
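
The formula can be evaluated directly for a small example. The sketch below is a plain-Python illustration on a hypothetical toy graph (two triangles joined by one edge); it assumes an undirected graph without self-loops and is not the implementation used by `greedy_modularity_communities`.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Evaluate Q for weighted undirected edges (u, v, w) and a node -> community map."""
    adjacency = defaultdict(float)  # A_ij, stored symmetrically
    strength = defaultdict(float)   # k_i, sum of edge weights at node i
    m = 0.0                         # sum of all edge weights
    for u, v, w in edges:
        adjacency[(u, v)] += w
        adjacency[(v, u)] += w
        strength[u] += w
        strength[v] += w
        m += w
    q = 0.0
    nodes = list(strength)
    for i in nodes:
        for j in nodes:
            # The Kronecker delta keeps only pairs in the same community.
            if community_of[i] == community_of[j]:
                q += adjacency.get((i, j), 0.0) - strength[i] * strength[j] / (2 * m)
    return q / (2 * m)

# Hypothetical toy graph: triangles {a, b, c} and {d, e, f} joined by the edge c-d.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
two_triangles = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_block = {node: 0 for node in two_triangles}

q_split = modularity(edges, two_triangles)
q_single = modularity(edges, one_block)
print(q_split, q_single)
```

Splitting the toy graph into its two triangles gives a positive $Q$, while putting every node into one community gives zero, which is why a modularity-maximizing algorithm prefers the split.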

In [None]:
from networkx.algorithms.community import greedy_modularity_communities
c = list(greedy_modularity_communities(g))

In [209]:
comms = sorted([len(i) for i in c], reverse=True)  # community sizes, largest first
comCount = collections.Counter(comms)  # number of communities of each size
degc, cntc = zip(*comCount.items())  # sizes and how often each size occurs

In [None]:
fig10 = go.Figure([go.Bar(x = degc ,y = cntc, name = 'communities', )
                ], layout=layout )


fig10.update_xaxes(title_text = "Cardinality", tickangle=315)
fig10.update_yaxes(title_text = "Num of comms")
fig10.update_layout(xaxis_type="log", yaxis_type="log")
fig10.update_layout(title_text="Distribution of the cardinality of communities",     font=dict(
                family='Courier New, monospace',
                size=14,
                color='black'
            ))

fig10.update_layout({'hovermode': 'x',})


In [294]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig10, filename='betw')

In this plot one can see the distribution of the cardinality of the communities found by the algorithm. Barely visible at the far right is a community consisting of $\sim 4400$ nodes. We can also see that there are many nodes which could not be clustered.

In [224]:
# Collect one representative node from every community of size 1 or 2.
ones_twos = []
for i in c:
    rtz = len(i)
    if rtz <= 2:
        ones_twos.append(list(i)[0])

In [226]:
print(ones_twos)



Above one can see the nodes which were not clustered at all or were clustered in pairs (one node is shown per pair). Apart from some interesting nodes ("epsteinsuicidecoverup" and "こっちのおっさん達本当笑い声大きくて草", roughly "the old guys over here laugh really loudly, lol") and a few normal words ("occasion", "overthink" or "priceless"), most of the words are meaningless or are usernames/hashtags.

In [230]:
for i in c:
    rtz = len(list(i))
    if 5 <= rtz <=21 :
        print(rtz, list(i)[:5])

21 ['द', 'आजक', 'स', 'क', 'ल']
20 ['brenmac01', 'themuddlersclub', 'fuckri', 'so', 'westindi']
18 ['b_isto__ar', 'indianregista', 'nycfc', 'hollybeang', 'towel']
18 ['28s', 'suemurray14', 'snatch', 'thekophq', 'mall']
16 ['ryan_kenny1', 'clivegreenwd', 'jasbar', 'cihfutur', 'eviecopland']
16 ['pcso3580ruth', 'alicel', 'emilylinka', 'have', 'shirleysetia']
14 ['эплу', 'хороший', 'очень', 'выпуск', 'iamvasabi']
14 ['ଆଙ', 'ସ', 'ନ', 'ପ', 'ଳନ']
14 ['samozřejmě', 'bambilionnásobně', 'platit', 'croplus', 'podl']
13 ['debbiezimmer54', 'jilliemari', 'newworlddd555', 'djdebster', 'suzannelepage1']
13 ['maureenb2b', 'burgessbev', 'joel_b2beditor', 'nailedit', 'elise_amil']
13 ['eileenbwyatt', '7th_layer', 'lovemypir', 'c_licar', 'yulvazquez']
13 ['boy1010tori', 'toxicambassador', 'sarzboogi', 'we', 'onejasonkayley']
13 ['brockleymax', 'jimiadefirany', 'catcornucopia', 'brockleybreweri', 'thebrockleybuzz']
12 ['misssdoherti', 'div', 'bravotv', 'scott_riley', 'button']
12 ['มต', 'ดเข', 'ท', 'ยย', '

The communities listed above are those with a cardinality between 5 and 21, shown together with 5 of their members. There are interesting clusters here: for example, the cluster with 21 nodes is probably a cluster of Hindi words, and all the communities with a cardinality of 14 are in distinct languages (Russian, another Indic script that looks like Odia rather than Hindi, and Czech, respectively). One can also find Japanese characters (which are whole sentences, by the way) and many Arabic communities. One Arabic group of six is about something bad, since the meanings of the words listed are the following: ['Mistake', 'You', 'Love', 'Mistake', 'Sorry']. Such a cheap drama.
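
Guessing the language of a community by eye can be supported by looking at the Unicode script of its members. The sketch below uses only the standard library and takes the first token of each letter's Unicode character name as a rough script label; the helper names `script_of` and `dominant_script` are hypothetical, and the script is only a proxy for the language (several languages share the Latin, Cyrillic or Arabic scripts).

```python
import unicodedata
from collections import Counter

def script_of(word):
    """Return the first token of the Unicode name of the word's first letter."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            return name.split(" ")[0] if name else "UNKNOWN"
    return "UNKNOWN"

def dominant_script(words):
    """Return the most common script label among the given words."""
    counts = Counter(script_of(w) for w in words)
    return counts.most_common(1)[0][0]

# Members of two of the communities printed above.
print(dominant_script(['эплу', 'хороший', 'очень', 'выпуск', 'iamvasabi']))
print(dominant_script(['द', 'आजक', 'स', 'क', 'ल']))
```

Applied to every community, such a label would make it easy to list the non-Latin clusters without inspecting them one by one.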

In [233]:
for i in c[:10]:
    rtz = len(list(i))
    print(rtz, list(i)[:10])

4419 ['beginn', 'improv', 'temperatur', 'loncon19', 'xcase_', 'russtnuttz', 'davidhall75', 'wildearth', 'bbcnickrobinson', 'mike_lions_71']
2962 ['constitu', 'regain', 'counter', 'shock', 'goodnewshackney', 'popsicle_____', 'penyrheolgerrig', 'lack', 'unlikeli', 'uklabour']
2042 ['habit', 'gerad', 'slept', 'pusscat', 'ellarinajohn', 'wish', 'waveoflight2019', 'safeti', 'yoga', 'fat']
1996 ['rasa', 'pásalo', 'mahez', 'marialal', 'illustriousg40', 'botham_sam', 'gedl', 'tywydd', 'tien', 'farklı']
1331 ['quid', 'dvd', 'fu', 'profil', 'gibsidehotel', 'wildathearthq', 'th3gasman', 'merseaisland', 'lx', 'waywardhu']
930 ['حرفيا', 'monetari', 'بعد', 'ربيع', 'يوم', 'سادگی', 'richardlatto', 'womeninart', 'المباركة', 'قسومنا']
914 ['feather', 'angelahilleri', 'darzi', 'mindless', 'finest', 'twofussyblok', 'drinkwat', 'vicky_mcclur', 'tegen', '070906']
559 ['bhamcitycouncil', 'sarashamma', 'referendumnin', 'royalfamili', 'gadaviman', 'gabe', 'chelseafran', 'nickybenedetti', 'audibleuk', 'vch_shro

Here you can see the 10 largest communities, with 10 members of each. Some interesting things:

- the second largest group may be based on political tweets (judging by these words),
- the third is probably about fitness ('habit', 'wish', 'yoga', 'fat'),
- the group with 930 members is probably the largest non-English group; it is listed in full below.

In [235]:
print(list(c[5]))

['حرفيا', 'monetari', 'بعد', 'ربيع', 'يوم', 'سادگی', 'richardlatto', 'womeninart', 'المباركة', 'قسومنا', 'ومراته', 'الأميرات', 'spay', 'الخاطفة', 'انا', 'الله', 'آکھے', 'بمشي', 'ثبات', 'حرام', 'georgia', 'favourit', 'بیغیرت', 'cleansman', 'الأساسية', 'قالتلي', 'الساعه', 'ایتھے', 'ph', 'غير', 'فخاري', 'فالدنيا', 'الرز', 'announc', 'joalsubai', 'طبية', 'الكوارث', 'العصر', 'بس', 'الشيخ', 'بنت', 'يفيدها', 'خبراء', 'الرياض', 'بالضرورة', 'تمنى', 'كأس', 'حساب', 'وسيوارى', 'العزاء', 'الانستقرام', 'كتاب', 'كلمة', 'مع', 'مررره', 'غيمة', 'زاد', 'karimahamed9', 'اريد', 'wigan', 'alfaskara7', 'travel', 'بندہ', 'انت', 'خربوك', 'احب', 'ونعوضها', 'لحضور', 'سوق', 'صار', 'لاحول', 'نجيب', 'ف', 'ترهبني', 'منزل', 'شيخة', 'complimentari', 'نحمد', 'م', 'مازرت', 'حاجة', 'اسهل', 'جعلنى', 'الغدا', 'classicbritcom', 'أجامل', 'الفردوس', 'اضبطها', 'الصغيرة', 'دار', 'الخرينج', '٧', 'المتعبة', 'roaa_alsabban', 'ghumman', 'ليه', 'عمرى', 'والديكتاتورية', 'والشيخة', 'آگ', 'أيامك', 'ق', 'المقربه', 'ظلمه', 'اطلع', 'ويتنا

Furthermore, I looked for Hungarian words. The code cell below lists about 60 of the most frequent Hungarian words (including some inflected forms), based on [Wiktionary](https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists/Hungarian_frequency_list_1-10000).

In [261]:
szavak = ["meg", "vagy", "van", "vagyok", "vannak", "vagytok", "volt", "már", "kell", 
          "még", "és", "mint", "azt", "az", "akkor", "sem", "lehet", "mert", "minden", "olyan",
          "szerint", "pedig", "ezt", "ez", "így", "után", "úgy", "nagy", "fel", "majd", "két", 
          "nem", "nagyon", "aki", "akik", "akit", "kit", "kik", "most", "több", "lesz", "itt",
          "magyar", "ami", "amik", "amit", "mit", "első", "között", "amely", "hanem", "nincs",
          "más", "illetve", "alatt", "egyik", "volna", "arra", "kft", "ilyen", "azonban"] #deleted "a"

In [82]:
not_in = []
for szó in szavak:
    stemd = snow.stem(szó)
    if g.has_node(stemd):
        print(szó, stemd)
        print(list(G2.edges(stemd))[:7])
        print()
    else:
        not_in.append(stemd)

van van
[('van', 'and'), ('van', 'i'), ('van', 'of'), ('van', 'is'), ('van', 'the'), ('van', 'me'), ('van', 'my')]

mint mint
[('mint', 'by'), ('mint', 'last'), ('mint', 'follow'), ('mint', 'it'), ('mint', 'this'), ('mint', 'the'), ('mint', 'absolut')]

az az
[('az', 'is'), ('az', 'via'), ('az', 'bekliyoruz'), ('az', 'sunayakin'), ('az', 'ağabey'), ('az', 'kaldı'), ('az', '19')]

sem sem
[('sem', 'do'), ('sem', 'a'), ('sem', 'como'), ('sem', 'de'), ('sem', 'se'), ('sem', 'o'), ('sem', 'que')]

ez ez
[('ez', 'is'), ('ez', 'via'), ('ez', 'az'), ('ez', 'te'), ('ez', 'nagyon'), ('ez', 'megérintett'), ('ez', 'ügi')]

fel fel
[('fel', 'i'), ('fel', 'un'), ('fel', 'y'), ('fel', 'n'), ('fel', 'w'), ('fel', 'r'), ('fel', 'yn')]

nagyon nagyon
[('nagyon', 'is'), ('nagyon', 'via'), ('nagyon', 'az'), ('nagyon', 'te'), ('nagyon', 'megérintett'), ('nagyon', 'ez'), ('nagyon', 'ügi')]

aki aki
[('aki', 'by'), ('aki', 'and'), ('aki', 'to'), ('aki', 'save'), ('aki', 'up'), ('aki', 'work'), ('aki', 'of')

Above are the "Hungarian" words which were present in the network. The listing reads as follows: the first line shows the original word and its stemmed form (which turned out to be the same for all of them), and the line below shows 7 of its connections. Unfortunately, many Hungarian words are also English words ("van", "mint", "kit", "most"). Two words have Spanish and/or Portuguese links ("sem", "más") and one has Turkish connections ("az"). However, Hungarian connections were found for "ez" and "nagyon". Do these words make up a whole cluster? Let us check the cardinality of their communities.

In [80]:
# Iterate over the communities from largest to smallest.
for i in sorted(c, key=len, reverse=True):
    for cou, szó in enumerate(szavak):
        stemd = snow.stem(szó)
        if stemd in i:
            print(cou, stemd, len(i))

11 mint 4441
13 az 4441
15 sem 4441
23 ez 4441
28 fel 4441
32 nagyon 4441
36 kit 4441
52 más 4441
2 van 2874
33 aki 2874
38 most 657
43 ami 139


The listing reads as follows: (index of the word in the list above, stemmed word, cardinality of its cluster). Unfortunately, the words which had Hungarian links ("ez" and "nagyon") ended up in the largest, "quite meaningless" group. Therefore we can say that the algorithm did not find a Hungarian cluster.
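
The same lookup can be done in a single pass by first mapping every node to the size of its community. A minimal sketch with hypothetical stand-ins for the community list and the stemmed word list (the real objects are `c` and the stems of `szavak`):

```python
# Hypothetical stand-ins for the community list and the stemmed words.
communities = [
    frozenset({"ez", "nagyon", "the", "is"}),
    frozenset({"van", "aki", "work"}),
    frozenset({"most"}),
]
stemmed_words = ["van", "ez", "nagyon", "most", "kell"]

# Map every node to the size of the community it belongs to.
size_of = {node: len(comm) for comm in communities for node in comm}

# Print (index, word, community size) for the words present in the network.
for index, word in enumerate(stemmed_words):
    if word in size_of:
        print(index, word, size_of[word])

# Words that do not occur in any community.
missing = [w for w in stemmed_words if w not in size_of]
print(missing)
```

This avoids stemming every word again for each community, which matters little for a short word list but keeps the loop linear in the number of nodes.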

## Conclusions


During this project I was not only able to use everything I learnt in the dataexp course, but I also obtained results which are in accordance with the literature. Gathering the tweets and preparing the data might have been the most difficult part of the work, yet measuring the properties of the network was undoubtedly the most time-consuming task. There is still room for many improvements, for example thresholding on the number of occurrences, or filtering out tweets which were reposted or whose author was posting disturbingly often. Furthermore, other stemmers would probably have stemmed the (foreign) words differently, which would then create a slightly different network. Finally, if I had had the resources to run more complex community-finding algorithms, comparing the resulting clusters would have said a lot about the robustness of each method.
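
As an illustration of the first two improvements, the following sketch drops reposted tweets and rare words before any network is built. It assumes that reposts are marked by the conventional "RT @" prefix in the tweet text and uses a hypothetical threshold `MIN_COUNT`; the example tweets are made up.

```python
from collections import Counter

# Hypothetical, already cleaned tweet texts.
tweets = [
    "RT @someone great yoga class today",
    "great yoga class today",
    "yoga is great",
    "overthink",
]

MIN_COUNT = 2  # assumed minimum number of occurrences for a word to be kept

# Drop reposts, assuming they start with the "RT @" prefix.
originals = [t for t in tweets if not t.startswith("RT @")]

# Count word occurrences over the remaining tweets and keep the frequent ones.
counts = Counter(word for t in originals for word in t.split())
kept = {word for word, n in counts.items() if n >= MIN_COUNT}
print(sorted(kept))
```

A threshold like this would probably remove many of the one-off usernames and hashtags that ended up as unclustered nodes.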


Nevertheless, demonstrating the scale-free property and finding communities based not only on language but also on topic are notable achievements. In my opinion, I did everything I could within my time schedule. This project unambiguously helped my professional career since, as it seems to me, many companies welcome knowledge of natural language processing.