Twitter Topic Modeling

Reproduced project presentation

by Kristóf Furuglyás

2019 Fall, Original Project: István Márkusz, Eötvös Loránd University

Goal of project

Gather tweets about one given topic (original: #maga, mine: #metoo)
Create word-word graph and word-tweet bipartite graph
Find communities with different algorithms
Associate communities with topics
Compare methods

Data availability and Preparation

Ready-made code and plenty of data were available
The original environment is complicated to install (Docker)
Created a virtual environment

Reproduced Methods

Collected tweets with #metoo
Cleaning and tokenizing done in the same way
Collected 1760 tweets instead of the original project's 44775
Word-word graph & word-tweet bipartite graph
hSBM & LDA
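
The cleaning and tokenizing step listed above is not shown in this presentation, so the following is only a minimal sketch of what such a step might look like. The stop-word list, the regular expressions, the function name `tokenize_tweet`, and the decision to drop the query hashtag are all assumptions made for illustration. The stemmed tokens in the topic outputs (for example `peopl`, `stori`) suggest the project also applied a stemmer, which this sketch leaves out.

```python
import re

# Hypothetical stop-word list for illustration; the project's actual list is not shown.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "was", "for", "with"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize_tweet(text, query_tag="metoo"):
    """Lowercase a tweet, drop URLs and @mentions, and return word tokens."""
    text = URL_RE.sub(" ", text.lower())
    text = MENTION_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    # Assumption: drop stop words and the query hashtag, which occurs in every tweet.
    return [t for t in tokens if t not in STOP_WORDS and t != query_tag]


example = ("Cosby was the first celebrity tried and convicted "
           "in the #MeToo era. https://t.co/avPRTAbrDa")
print(tokenize_tweet(example))
```

Each tweet then becomes a list of tokens, which is the shape the `data.tokens` column is used in by the graph-building cells.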

Word-word graph properties

In [152]:

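# Creating word-word graph
import itertools

import networkx as nx
import numpy as np
from tqdm import tqdm

G = nx.Graph()

# Every pair of tokens sharing a tweet gets an edge; the weight counts co-occurrences
for token_list in tqdm(data.tokens):
    for edge in itertools.combinations(token_list, 2):
        w = G.get_edge_data(*edge, default={'weight': 0})['weight'] + 1
        G.add_edge(*edge, weight=w)

# Relabel words as integers, keeping the word itself in the 'label' attribute
G = nx.convert_node_labels_to_integers(G, label_attribute='label')
print(nx.info(G))  # NetworkX 2.x only; nx.info was removed in 3.0, use print(G) there
# Median of the weighted degree (sum of incident edge weights)
deg_med = np.median([deg for node, deg in G.degree(weight='weight')])
print(f'Median degree: {deg_med}')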

Name: 
Type: Graph
Number of nodes: 1448
Number of edges: 66988
Average degree:  92.5249
Median degree: 74.0
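
To make the edge weights concrete, here is a small sketch that mirrors the counting logic of the cell above using only the standard library. The toy tweets and the helper name `cooccurrence_weights` are made up for illustration and are not part of the project code.

```python
import itertools
from collections import Counter


def cooccurrence_weights(token_lists):
    """Count, for every unordered word pair, how often the pair co-occurs."""
    weights = Counter()
    for tokens in token_lists:
        for a, b in itertools.combinations(tokens, 2):
            # Sort the pair so (a, b) and (b, a) map to the same undirected edge.
            weights[tuple(sorted((a, b)))] += 1
    return weights


# Hypothetical token lists, not taken from the collected data.
toy_tweets = [
    ["court", "case", "convict"],
    ["court", "case"],
    ["survivor", "court"],
]
weights = cooccurrence_weights(toy_tweets)
print(weights[("case", "court")])  # pair appears in two tweets
print(len(weights))                # number of distinct edges
```

As in the notebook cell, a word repeated within one tweet produces a self-pair and repeated increments, so weights count token pairs rather than distinct tweets.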

In [154]:

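# Degree distribution (unweighted)
import matplotlib.pyplot as plt

hist = nx.degree_histogram(G)  # hist[k] = number of nodes with degree k

plt.figure(figsize=(16, 6))
plt.bar(range(len(hist)), hist)
plt.xlabel("Degree")
plt.ylabel("Number of nodes")
plt.grid()

plt.show()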

Word-tweet graph

In [159]:

# Keep tweets with at least 6 tokens, then count the remaining tweets and tokens
filters = (data.tokens.str.len() >= 6)
print(f'Number of tweets: {len(data[filters])}')
print(f'Number of tokens: {data[filters].tokens.str.len().sum()}')

Number of tweets: 1174
Number of tokens: 14770
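
The construction of the word-tweet bipartite graph itself is not shown here. A minimal sketch of one way to represent it follows: tweets on one side, words on the other, and an edge weighted by how often the word occurs in the tweet. The 6-token threshold follows the filter above; the function name `bipartite_edges`, the dictionary representation, and the toy data are assumptions.

```python
from collections import Counter

MIN_TOKENS = 6  # same threshold as the filter in the cell above


def bipartite_edges(token_lists, min_tokens=MIN_TOKENS):
    """Return {(tweet_index, word): count} for tweets with enough tokens."""
    edges = {}
    for tweet_idx, tokens in enumerate(token_lists):
        if len(tokens) < min_tokens:
            continue  # short tweets are skipped
        for word, count in Counter(tokens).items():
            edges[(tweet_idx, word)] = count
    return edges


# Hypothetical token lists, not taken from the collected data.
toy = [
    ["court", "case", "court", "convict", "era", "first"],  # 6 tokens, kept
    ["women", "movement"],                                   # too short, dropped
]
edges = bipartite_edges(toy)
print(edges[(0, "court")])   # "court" occurs twice in tweet 0
print(sum(edges.values()))   # total edge weight = tokens in the kept tweets
```

In this representation the total edge weight equals the token count of the kept tweets, which is the quantity printed as "Number of tokens" above.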

In [201]:

# Top 10 words of each hSBM topic at hierarchy level 1
level=1
hsbm_model.topics(l=level, n=10)

Out[201]:

{0: [('women', 0.015546639919759278),
  ('movement', 0.014844533600802408),
  ('say', 0.011634904714142427),
  ('like', 0.011334002006018053),
  ('get', 0.011133400200601806),
  ('peopl', 0.008625877632898696),
  ('one', 0.008224674022066199),
  ('stori', 0.007321965897693079),
  ('think', 0.00712136409227683),
  ('make', 0.00712136409227683)],
 1: [('assault', 0.05037593984962406),
  ('era', 0.04360902255639098),
  ('court', 0.04360902255639098),
  ('survivor', 0.039097744360902256),
  ('first', 0.035338345864661655),
  ('watch', 0.03007518796992481),
  ('high', 0.02706766917293233),
  ('case', 0.02556390977443609),
  ('tri', 0.02556390977443609),
  ('convict', 0.021052631578947368)],
 2: [('sexual', 0.062146892655367235),
  ('men', 0.03615819209039548),
  ('need', 0.02711864406779661),
  ('harass', 0.026741996233521657),
  ('victim', 0.02448210922787194),
  ('abus', 0.021845574387947268),
  ('come', 0.021468926553672316),
  ('man', 0.01770244821092279),
  ('violenc', 0.01770244821092279),
  ('power', 0.01657250470809793)]}

---------------------------------------- TOPIC: 0 ----------------------------------------

@othermatt2 hahah my term paper is actually about using media to cultivate the practice of faithful presence in the midst of the #metoo & #churchtoo movements.

@laurahday @JayneBYoung @magenta_17 @cash4questions2 @topwak @SkyNews @BBCNews @DailyMirror @UKLabour @jeremycorbyn 2) I have my eyes wide open. All I'm seeing is your lively smile right now (that was a bit of humour). Don't go all #MeToo over it lol.

@susanthesquark Lol... Its a dumb observation that yout naive or too dumb to realize.... Your getting ratioed for good reason... Ever heard of #MeToo

@BiggFan77 #Dumboleena was coined for a reason. #ShehnaazGill is in task. To take max footage from HMs, to be seen. #Shehnaaz says "She can't be fake & talk to HMs who don't like her, Sid was the only one who she could take footage from"

Accused #Metoo cursed & dumbo was flirting with him.

---------------------------------------- TOPIC: 1 ----------------------------------------

The Superior Court ruling was being closely watched because Cosby was the first celebrity tried and convicted in the #MeToo era. https://t.co/avPRTAbrDa

15.08 ongoing attempt to murder GE #GE #FCPA #CORPGOV whistle-blower & #MeToo survivor Seema Sapra in Delhi High Court @realDonaldTrump @FBI @POTUS @gurgaonpolice @DelhiPolice @CPDelhi @StateDept @WhiteHouse @TheJusticeDept @HMOIndia @AmitShah @PMOIndia @narendramodi https://t.co/kWwbc3sk8t

0.12 Attempt to murder GE #GE #FCPA #CORPGOV whistle-blower & #MeToo survivor Seema Sapra at Gate 8 Delhi High Court @realDonaldTrump @FBI @POTUS @gurgaonpolice @DelhiPolice @CPDelhi @StateDept @WhiteHouse @TheJusticeDept @HMOIndia @AmitShah @PMOIndia @narendramodi https://t.co/xf3da9t2HY

UP16BC7271 0.02 Attempt to murder GE #GE #FCPA #CORPGOV whistle-blower & #MeToo survivor Seema Sapra at Gate 8 Delhi High Court @realDonaldTrump @FBI @POTUS @gurgaonpolice @DelhiPolice @CPDelhi @StateDept @WhiteHouse @TheJusticeDept @HMOIndia @AmitShah @PMOIndia @narendramodi

---------------------------------------- TOPIC: 2 ----------------------------------------

PWN men and PWN the WPRDL, with proprietary insights, from the N-biCOMACOPO "male dual loyalty" #gaming algorithm. Reverse engineered by the AIA. #ReadMyTweets #AI #MAGA #MeToo #IoT #DemDebate Eli Manning #EaglesvsGiants #fintech #infosec #Joker #hacking

https://t.co/RS3QsVW99Q

Damn You @GloriaAllred, Damn You @LisaBloom, Damn You #JudgeONeill, Damn You #KevinSteele! 😤👎👎

Damn You #TimesUp and #MeToo! #FirstThem! #MuteTimesUp and #MuteMeToo! I AM VERY PISSED! #BillCosby didn't deserve it! #FreeCosby #BillCosbyIsInnocent! https://t.co/6jAeVizmlW

"Community problems deserve a community response. My response was #metoo." #IHIForum

IT'S THE MEN, STUPID. This is algorithmically identical to #domesticabuse model—the ENDLESS "play by play" over PREPOSTEROUS MEN—a deep deep spiraling "male dual loyalty" #gaming dynamic. #ReadMyTweets #AI #MAGA #MeToo #IoT #DemDebate Eli Manning #EaglesvsGiants #fintech #infosec https://t.co/JUiZAGLg2L

Latent Dirichlet Allocation (LDA)

In [217]:

# Top 7 words of every topic of the LDA model fitted for this level
level=1
lda_models[level].show_topics(num_topics=-1, num_words=7, formatted=False)

Out[217]:

[(0,
  [('say', 0.01029575),
   ('women', 0.0095762685),
   ('men', 0.007937355),
   ('movement', 0.007825186),
   ('stori', 0.0073716524),
   ('see', 0.0059436252),
   ('right', 0.005807654)]),
 (1,
  [('sexual', 0.019692097),
   ('women', 0.017512914),
   ('movement', 0.012585554),
   ('like', 0.010749188),
   ('say', 0.010251208),
   ('get', 0.009469597),
   ('assault', 0.008004609)]),
 (2,
  [('would', 0.009614601),
   ('get', 0.009072569),
   ('know', 0.0076741227),
   ('movement', 0.007480507),
   ('like', 0.0064201523),
   ('new', 0.0060390034),
   ('one', 0.0057884566)])]

Comparison

[Figure not preserved in this export]
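
One simple way to compare the two sets of topics numerically is the Jaccard overlap of their top words. The sketch below uses the top 7 words per topic copied from the hSBM and LDA outputs shown earlier. The choice of Jaccard similarity and the helper name `jaccard` are assumptions for illustration, not the comparison method used in the project.

```python
# Top 7 words per topic, copied from the outputs shown earlier.
hsbm_topics = {
    0: ["women", "movement", "say", "like", "get", "peopl", "one"],
    1: ["assault", "era", "court", "survivor", "first", "watch", "high"],
    2: ["sexual", "men", "need", "harass", "victim", "abus", "come"],
}
lda_topics = {
    0: ["say", "women", "men", "movement", "stori", "see", "right"],
    1: ["sexual", "women", "movement", "like", "say", "get", "assault"],
    2: ["would", "get", "know", "movement", "like", "new", "one"],
}


def jaccard(words_a, words_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


# Overlap of every hSBM topic with every LDA topic.
overlap = {
    (h, l): jaccard(h_words, l_words)
    for h, h_words in hsbm_topics.items()
    for l, l_words in lda_topics.items()
}
for (h, l), score in sorted(overlap.items()):
    print(f"hSBM {h} vs LDA {l}: {score:.2f}")
```

A score near 1 means two topics share most of their top words, and a score of 0 means they share none. With only 7 words per topic the scores are coarse, so they are best read as a rough indication.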

Conclusions

Managed to set up the environment
Reproduced the results, but with less data
Many new great features

Thank you for your attention!