My progress so far:
Since last week's presentation I have cleaned out (most of) the unnecessary punctuation and other noise, and after tokenizing the words from the tweets I was also able to build a graph.
I have already streamed a batch of tweets since last week ($\sim$8500).
df.shape
(8479, 8)
import folium

# Centre the map on the UK
mymap = folium.Map(location=[52.809865, -2.118092], zoom_start=5.4, tiles='cartodbpositron')

# Plot the first 1000 tweets; each marker's popup shows the tweet text
for i in range(1000):
    popup = folium.Popup(df.text[i], parse_html=True)
    folium.Marker(location=df.locs[i][0], popup=popup).add_to(mymap)

mymap
print(f"Text of tweet no.8: \n\n{df.text[8]}")
Text of tweet no.8: Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! https://t.co/W8VcURxSLN via @UKChange
clean = re.sub(r'http\S+', '', df.text[8])
print(f"\nAfter cleaning:\n\n{clean}")
After cleaning: Bob Prattey: Liverpool Exhibition Centre - Do NOT host trophy hunting safari companies - Sign the Petition! via @UKChange
clean_tknzd = [snow.stem(word) for word in re.findall(r'\w+', clean.lower())]
print(f"After tokenizing:\n\n{clean_tknzd}")
After tokenizing: ['bob', 'prattey', 'liverpool', 'exhibit', 'centr', 'do', 'not', 'host', 'trophi', 'hunt', 'safari', 'compani', 'sign', 'the', 'petit', 'via', 'ukchang']
Other stemmers could be used as well (e.g. Porter).
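Swapping Snowball for Porter is a one-line change; a minimal comparison sketch (assuming NLTK is available, toy words taken from the example tweet):

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

snow = SnowballStemmer('english')
porter = PorterStemmer()

# Words from the example tweet above
words = ['exhibition', 'hunting', 'companies', 'petition']

# Snowball reproduces the tokenized output shown earlier
print([snow.stem(w) for w in words])    # ['exhibit', 'hunt', 'compani', 'petit']
print([porter.stem(w) for w in words])  # Porter stems, mostly identical here
```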
Tool: $\texttt{MultiGraph}$ from $\texttt{networkx}$,
Connection = co-occurrence in the same tweet $\rightarrow$ the more shared tweets, the greater the weight,
Parallel edges are possible, but no self-loops.
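The construction above can be sketched as follows (the toy token lists stand in for the cleaned, stemmed tweets; the real graph is built from all $\sim$8500 of them):

```python
import itertools
import networkx as nx

# Toy token lists standing in for the cleaned, stemmed tweets
tweets = [['bob', 'liverpool', 'petit'],
          ['liverpool', 'petit', 'hunt']]

g = nx.MultiGraph()
for tokens in tweets:
    # Connect every pair of distinct words appearing in the same tweet;
    # repeated co-occurrence adds parallel edges, set() avoids self-loops
    for u, v in itertools.combinations(sorted(set(tokens)), 2):
        g.add_edge(u, v)
```

Here ('liverpool', 'petit') co-occurs in both tweets, so that pair gets two parallel edges.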
print(f"Num of nodes and edges: {len(g.nodes), len(g.edges())}")
Num of nodes and edges: (22181, 1569028)
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig1, filename='degdist')
The words with the highest degree:
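Note that in a $\texttt{MultiGraph}$ the degree counts parallel edges too, so frequent co-occurrence pushes a word up this list. A sketch with illustrative data (not the real graph):

```python
import networkx as nx

# Toy multigraph; parallel edges count towards degree
g = nx.MultiGraph()
g.add_edges_from([('the', 'petit'), ('the', 'hunt'),
                  ('the', 'hunt'), ('hunt', 'petit')])

# Nodes sorted by degree, highest first
top = sorted(g.degree, key=lambda kv: kv[1], reverse=True)
print(top)  # 'the' has degree 3: one edge to 'petit' plus two to 'hunt'
```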
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig2, filename='mostcommnodes')
Below you can see the 20 most common co-occurrences (the 'most numerous parallel edges').
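Finding the most numerous parallel edges amounts to counting edge multiplicities; one way to do it (a sketch with illustrative data):

```python
from collections import Counter
import networkx as nx

g = nx.MultiGraph()
g.add_edges_from([('trophi', 'hunt'), ('trophi', 'hunt'), ('sign', 'petit')])

# Each parallel edge appears separately in g.edges(); frozenset makes
# the pair direction-independent before counting
pair_counts = Counter(frozenset(e) for e in g.edges())
top20 = pair_counts.most_common(20)
```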
plotly.offline.init_notebook_mode(connected=True)
plotly.offline.iplot(fig, filename='simple-3d-scatter')