My progress so far:
Since last week's presentation I have cleaned out (most of) the unnecessary punctuation and other noise, and after tokenizing the words from the tweets I was also able to build a graph.
I have already streamed a batch of tweets since last week ($\sim$8500).
df.shape
(8479, 8)
import folium

# Centre the map on the UK
mymap = folium.Map(location=[52.809865, -2.118092], zoom_start=5.4, tiles='cartodbpositron')

# Plot the first 1000 tweets; each marker's popup shows the tweet text
for i in range(1000):
    popup = folium.Popup(df.text[i], parse_html=True)
    folium.Marker(location=df.locs[i][0], popup=popup).add_to(mymap)

mymap
print(f"Text of tweet no.8: \n\n{df.text[8]}")
Text of tweet no.8: Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange
clean = re.sub(r'http\S+', '', df.text[8])
print(f"\nAfter cleaning:\n\n{clean}")
After cleaning: Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! via @UKChange
clean_tknzd = [snow.stem(word) for word in re.findall(r'\w+', clean.lower())]
print(f"After tokenizing:\n\n{clean_tknzd}")
After tokenizing: ['bob', 'prattey', 'liverpool', 'exhibit', 'centr', 'do', 'not', 'host', 'trophi', 'hunt', 'safari', 'compani', 'sign', 'the', 'petit', 'via', 'ukchang']
Other stemmers could be used as well (e.g. Porter).
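Swapping Snowball for Porter is a one-line change; a minimal comparison sketch (assuming NLTK is available, toy words taken from the example tweet):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

snow = SnowballStemmer('english')
porter = PorterStemmer()

# Words from the example tweet above
words = ['exhibition', 'hunting', 'companies', 'petition']

# Snowball reproduces the tokenized output shown earlier
print([snow.stem(w) for w in words])    # ['exhibit', 'hunt', 'compani', 'petit']
print([porter.stem(w) for w in words])  # Porter stems, mostly identical here
```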
Tool: $\texttt{MultiGraph}$ from $\texttt{networkx}$,
Connection = co-occurrence in the same tweet $\rightarrow$ the more shared tweets, the greater the weight,
Parallel edges are possible, but no self-loops.
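The construction above can be sketched as follows (the toy token lists stand in for the cleaned, stemmed tweets; the real graph is built from all $\sim$8500 of them):

```python
import itertools
import networkx as nx

# Toy token lists standing in for the cleaned, stemmed tweets
tweets = [['bob', 'liverpool', 'petit'],
          ['liverpool', 'petit', 'hunt']]

g = nx.MultiGraph()
for tokens in tweets:
    # Connect every pair of distinct words appearing in the same tweet;
    # repeated co-occurrence adds parallel edges, set() avoids self-loops
    for u, v in itertools.combinations(sorted(set(tokens)), 2):
        g.add_edge(u, v)
```

Here ('liverpool', 'petit') co-occurs in both tweets, so that pair gets two parallel edges.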
print(f"Num of nodes and edges: {len(g.nodes), len(g.edges())}")
Num of nodes and edges: (22181, 1569028)
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig1, filename='degdist')
The words with the highest degree:
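Note that in a $\texttt{MultiGraph}$ the degree counts parallel edges too, so frequent co-occurrence pushes a word up this list. A sketch with illustrative data (not the real graph):

```python
import networkx as nx

# Toy multigraph; parallel edges count towards degree
g = nx.MultiGraph()
g.add_edges_from([('the', 'petit'), ('the', 'hunt'),
                  ('the', 'hunt'), ('hunt', 'petit')])

# Nodes sorted by degree, highest first
top = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
print(top)  # 'the' has degree 3: one edge to 'petit' plus two to 'hunt'
```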
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig2, filename='mostcommnodes')
Below you can see the 20 most common co-occurrences (the 'most numerous parallel edges').
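Finding the most numerous parallel edges amounts to counting edge multiplicities; one way to do it (a sketch with illustrative data):

```python
from collections import Counter
import networkx as nx

g = nx.MultiGraph()
g.add_edges_from([('trophi', 'hunt'), ('trophi', 'hunt'), ('sign', 'petit')])

# Each parallel edge appears separately in g.edges(); frozenset makes
# the pair direction-independent before counting
pair_counts = Counter(frozenset(e) for e in g.edges())
top20 = pair_counts.most_common(20)
```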
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='simple-3d-scatter')