Small world effect on Twitter

Third presentation


by Kristóf Furuglyás

2019 Fall, Consultant: Eszter Bokányi, Eötvös Loránd University

Plan

  1. Setting up Twitter API $\checkmark$
  2. Gathering tweets $\checkmark$
  3. Cleaning the tweets $in$ $progress$
  4. Creating word-graph $\checkmark$
  5. Exploring small-world properties $in$ $progress$

About last week

My progress so far:

  • Streamed tweets via the $\texttt{tweepy}$ package
  • Selected them by location
  • Parsed the data from the .json format
  • Tokenized the words in the tweets
  • Cleaned the words of unnecessary characters
  • Created the word graph
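
The parsing and location steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual loader: the bounding box `UK_BOX`, the helper names, and the assumed JSON layout (a `coordinates` object holding a `[lon, lat]` pair) are all assumptions made for the example.

```python
import json

# Hypothetical bounding box roughly covering the UK: (west, south, east, north)
UK_BOX = (-8.7, 49.8, 1.8, 60.9)


def in_box(lon, lat, box=UK_BOX):
    """Return True if the point lies inside the (west, south, east, north) box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def parse_tweets(lines, box=UK_BOX):
    """Parse JSON lines and keep the tweets whose point location lies in the box."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines between records
        tweet = json.loads(line)
        coords = tweet.get("coordinates")  # assumed layout: {"coordinates": [lon, lat]}
        if not coords:
            continue  # tweet carries no point location
        lon, lat = coords["coordinates"]
        if in_box(lon, lat, box):
            kept.append({"text": tweet.get("text", ""), "loc": (lat, lon)})
    return kept


sample = [
    '{"text": "hello from Liverpool", "coordinates": {"coordinates": [-2.98, 53.41]}}',
    '{"text": "hello from Budapest", "coordinates": {"coordinates": [19.04, 47.50]}}',
    '{"text": "no location", "coordinates": null}',
]
kept = parse_tweets(sample)
print(kept)
```

Only the first sample tweet falls inside the box, so `kept` contains a single record.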

Since last week's presentation, I have removed most of the unnecessary punctuation and other noise from the tweets, tokenized their words, and built a graph from the tokens.

I already have $\sim$ 8500 tweets that were streamed last week.

In [7]:
df.shape
Out[7]:
(8479, 8)

Location of the tweets

In [34]:
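import folium

# Base map centred on the UK
mymap = folium.Map(location=[52.809865, -2.118092], zoom_start=5.4, tiles='cartodbpositron')
# Plot only the first 1000 tweets to keep the map responsive
for i in range(1000):
    # parse_html=True escapes the tweet text so it cannot break the popup HTML
    popup = folium.Popup(df.text[i], parse_html=True)
    folium.Marker(location=df.locs[i][0], popup=popup).add_to(mymap)
mymap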
Out[34]:

Cleaning the tweets

In [12]:
print(f"Text of tweet no.8: \n\n{df.text[8]}")
Text of tweet no.8: 

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange
In [13]:
clean = re.sub(r'http\S+', '', df.text[8])
print(f"\nAfter cleaning:\n\n{clean}")
After cleaning:

Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition!  via @UKChange
In [14]:
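# snow is the stemmer created earlier (presumably an nltk SnowballStemmer); raw string avoids an invalid-escape warning
clean_tknzd = [snow.stem(word) for word in re.findall(r'\w+', clean.lower())]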
print(f"After tokenizing:\n\n{clean_tknzd}")
After tokenizing:

['bob', 'prattey', 'liverpool', 'exhibit', 'centr', 'do', 'not', 'host', 'trophi', 'hunt', 'safari', 'compani', 'sign', 'the', 'petit', 'via', 'ukchang']

Other stemmers, such as Porter, could also be used.
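
The cleaning and tokenizing steps can be combined into one small function. The sketch below uses only the standard library; the function name `tokenize` and its optional `stem` argument are illustrative choices, not part of the original notebook. Stemming is left as a pluggable callable so that any stemmer can be swapped in.

```python
import re

URL_RE = re.compile(r"http\S+")  # same URL pattern as in the cleaning cell
WORD_RE = re.compile(r"\w+")     # same word pattern as in the tokenizing cell


def tokenize(text, stem=None):
    """Strip URLs, lowercase, split into word tokens and optionally stem them."""
    cleaned = URL_RE.sub("", text)
    tokens = WORD_RE.findall(cleaned.lower())
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return tokens


tweet = ("Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting "
         "safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange")
tokens = tokenize(tweet)
print(tokens)
```

Passing a stemmer's method as `stem` (for example `snow.stem`, assuming `snow` is an nltk Snowball stemmer for English) should give stemmed tokens like the ones shown above.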

Network of tweets

  • Tool: $\texttt{MultiGraph}$ from $\texttt{networkx}$,

  • Two words are connected if they occur in the same tweet $\rightarrow$ the more tweets they share, the greater the weight,

  • Parallel edges are possible, but self-loops are not.
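
The construction rule above can be illustrated without any graph library. In this sketch a `Counter` keyed by word pairs stands in for the $\texttt{MultiGraph}$: each count plays the role of the number of parallel edges between two words. Counting each distinct pair once per tweet is an assumption of the example, and the toy tweets are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokenized_tweets):
    """Count, for every unordered word pair, the number of tweets containing both."""
    pairs = Counter()
    for tokens in tokenized_tweets:
        # sorted(set(...)) drops repeated words, so a word is never paired
        # with itself (no self-loops) and each pair has one canonical order
        for u, v in combinations(sorted(set(tokens)), 2):
            pairs[(u, v)] += 1  # one more "parallel edge" between u and v
    return pairs


# Made-up tokenized tweets for illustration
tweets = [
    ["sign", "the", "petit"],
    ["sign", "the", "form"],
    ["the", "the", "end"],
]
pairs = cooccurrence_counts(tweets)
print(pairs.most_common(2))
```

`pairs.most_common(n)` lists the word pairs that share the most tweets, which corresponds to the pairs with the most parallel edges.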

In [20]:
print(f"Num of nodes and edges: {len(g.nodes), len(g.edges())}")
Num of nodes and edges: (22181, 1569028)
In [22]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig1, filename='degdist')

The words with the highest degree:

In [26]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig2, filename='mostcommnodes')
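
In a multigraph, a node's degree counts every parallel edge separately, so a word's degree grows with each co-occurrence. A minimal sketch of that calculation, using hypothetical toy counts rather than the real graph:

```python
from collections import Counter


def multigraph_degree(pair_counts):
    """Degree of each word when every co-occurrence is a separate parallel edge."""
    degree = Counter()
    for (u, v), n_parallel in pair_counts.items():
        degree[u] += n_parallel
        degree[v] += n_parallel
    return degree


# Hypothetical toy counts: (word, word) -> number of tweets containing both
pair_counts = {("sign", "the"): 2, ("petit", "the"): 1, ("petit", "sign"): 1}
degree = multigraph_degree(pair_counts)
print(degree.most_common(2))
```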

Below you can see the 20 most frequent word pairs, that is, the pairs connected by the most parallel edges.

In [30]:
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='simple-3d-scatter')

Upcoming tasks

  • Gather more tweets: more than a million, up from the current $\sim 8500$
  • Search for communities
  • Look for other measures (centralities)
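
Small-world networks are usually characterised by a high clustering coefficient together with a short average path length. Which measures this project will finally use is not fixed yet; the sketch below only shows how these two quantities could be computed on a simple, connected toy graph. All function names and the toy edge list are illustrative.

```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map (parallel edges collapse into one)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def average_clustering(adj):
    """Mean over nodes of the fraction of neighbour pairs that are themselves linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean breadth-first-search distance over all node pairs (connected graph assumed)."""
    total, count = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        count += len(dist) - 1  # every reached node except the source itself
    return total / count


# Toy graph: a triangle a-b-c with one extra node d attached to c
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(average_clustering(adj), average_path_length(adj))
```

On the real word graph the all-pairs search would be expensive, so sampling source nodes or using a graph library's built-in routines is likely to be more practical.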

Thank you for your attention!

