Algorithm | graph-tool (16 threads) | graph-tool (1 thread) | igraph | NetworkX |
---|---|---|---|---|
Single-source shortest path | 0.0023 s | 0.0022 s | 0.0092 s | 0.25 s |
Global clustering | 0.011 s | 0.025 s | 0.027 s | 7.94 s |
PageRank | 0.0052 s | 0.022 s | 0.072 s | 1.54 s |
K-core | 0.0033 s | 0.0036 s | 0.0098 s | 0.72 s |
Minimum spanning tree | 0.0073 s | 0.0072 s | 0.026 s | 0.64 s |
Betweenness | 102 s (~1.7 mins) | 331 s (~5.5 mins) | 198 s (vertex) + 439 s (edge) (~10.6 mins) | 10297 s (vertex) + 13913 s (edge) (~6.7 hours) |
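The exact benchmark setup behind these timings is not reproduced here, but the following is a rough, hypothetical sketch of how one could time a single algorithm (PageRank) on the same random test graph in both graph-tool and NetworkX (the graph size and repetition count are assumptions, not the benchmark above):

# Rough timing sketch (assumed toy setup, not the benchmark behind the table above)
import timeit
import networkx as nx
import graph_tool.all as gt

nxg = nx.gnm_random_graph(10_000, 100_000, seed=0)  # random test graph
gtg = gt.Graph(directed=False)
gtg.add_edge_list(nxg.edges())                      # same edges, loaded into graph-tool

print("graph-tool pagerank:", timeit.timeit(lambda: gt.pagerank(gtg), number=3), "s")
print("NetworkX  pagerank:", timeit.timeit(lambda: nx.pagerank(nxg), number=3), "s")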
pip install networkx
jupyter notebook
Compared to this, graph-tool is easiest to run from its official Docker image:
docker pull tiagopeixoto/graph-tool
docker run -p 8888:8888 -p 6006:6006 -it -u user -w /home/user -v /home/kfuruglyas:/home/user tiagopeixoto/graph-tool bash
[user@c20b92b8c4bf ~]$ jupyter notebook --ip 0.0.0.0
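Once inside the container, a quick sanity check (a minimal sketch; the printed version depends on the image):

import graph_tool
print(graph_tool.__version__)  # verify that graph-tool is importable inside the container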
from graph_tool.all import *
import numpy as np
import matplotlib.pyplot as plt
num_of_nodes, num_of_edges = 9, 18
def create_edges(num_of_nodes, num_of_edges, self_loops=False, multiedges=True, weighted=True):
    edges_labels = {}
    while len(edges_labels) < num_of_edges:
        # np.random.choice's third positional argument is `replace`:
        # allowing replacement means the same node can be drawn twice, i.e. a self-loop
        e_pair = tuple(np.random.choice(num_of_nodes, 2, self_loops))
        if multiedges or (e_pair not in edges_labels and e_pair[::-1] not in edges_labels):
            e_weight = np.random.rand() if weighted else 1
            edges_labels[e_pair] = e_weight
    return edges_labels
np.random.seed(0)
edges_labels = create_edges(num_of_nodes, num_of_edges)
g = Graph()
g.add_vertex(num_of_nodes);
for i, j in edges_labels.keys():
    g.add_edge(i, j)
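Equivalently (a sketch, not part of the running example; g2 is just an illustrative name), the whole edge list can be added in one call:

# Equivalent one-call alternative: add all edges at once
g2 = Graph()
g2.add_vertex(num_of_nodes)
g2.add_edge_list(edges_labels.keys())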
eprop_wei = g.new_edge_property("double")
eprop_str = g.new_edge_property("string")
for c, key in enumerate(edges_labels.keys()):
    e = list(g.edges())[c]
    eprop_wei[e] = edges_labels[key] * 10
    eprop_str[e] = f"{edges_labels[key]:.2f}"
vprop_id = g.new_vertex_property("string")
for i in g.vertices():
    vprop_id[i] = str(i)
pos = random_layout(g)
graph_draw(g, pos, vertex_text = vprop_id,
vertex_size=30, edge_pen_width=eprop_wei,
edge_text = eprop_str);
g.set_directed(False)
pos = arf_layout(g)
graph_draw(g, pos, vertex_text = vprop_id,
vertex_size=30, edge_pen_width=eprop_wei, edge_text = eprop_str, );
pos = radial_tree_layout(g, 2)
graph_draw(g, pos, vertex_text = vprop_id,
vertex_size=30, edge_pen_width=eprop_wei, edge_text = eprop_str,
vertex_color = 'green', vertex_shape = "heptagon",vertex_fill_color = 'k',
edge_color = "cyan", edge_font_family= "cursive", edge_mid_marker = "bar");
The description length (entropy) is
$$ \Sigma = -\ln P\left( \boldsymbol{A} \mid \boldsymbol{\theta}, \boldsymbol{b} \right) - \ln P\left( \boldsymbol{\theta}, \boldsymbol{b}\right), $$
which measures the amount of information required to describe the data.
Example of a nested SBM with three levels.
The goal is to minimize the description length [7], a.k.a. the entropy, which is equivalent to maximizing the posterior probability of the partition:
$$ P\left( \boldsymbol{b} \mid \boldsymbol{A} \right) = \frac{ \exp\left({-\Sigma}\right)}{P\left( \boldsymbol{A} \right)}, \text{ where } \Sigma = -\ln P\left( \boldsymbol{A} \mid \boldsymbol{\theta}, \boldsymbol{b} \right) - \ln P\left( \boldsymbol{\theta}, \boldsymbol{b}\right), $$
where $\boldsymbol{\theta} = \boldsymbol{e}$ for the traditional SBM and $\boldsymbol{\theta} = \{\boldsymbol{e}, \boldsymbol{k} \}$ for the degree-corrected model. For the former (undirected and directed graphs, respectively):
$$ \begin{split}P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b}) &= \frac{\prod_{r<s}e_{rs}!\prod_re_{rr}!!}{\prod_rn_r^{e_r}}\times \frac{1}{\prod_{i<j}A_{ij}!\prod_iA_{ii}!!},\\ P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b}) &= \frac{\prod_{rs}e_{rs}!}{\prod_rn_r^{e_r}}\times \frac{1}{\prod_{ij}A_{ij}!},\end{split} $$
For the latter (degree-corrected, undirected and directed):
$$ \begin{split}P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b},\boldsymbol{k}) &= \frac{\prod_{r<s}e_{rs}!\prod_re_{rr}!!}{\prod_re_r!}\times \frac{\prod_ik_i!}{\prod_{i<j}A_{ij}!\prod_iA_{ii}!!},\\ P(\boldsymbol{A}|\boldsymbol{e},\boldsymbol{b},\boldsymbol{k}) &= \frac{\prod_{rs}e_{rs}!}{\prod_re_r^+!\prod_re_r^-!}\times \frac{\prod_ik_i^+!\prod_ik_i^-!}{\prod_{ij}A_{ij}!},\end{split} $$
The minimization is done using Markov chain Monte Carlo (MCMC), via
minimize_blockmodel_dl()
minimize_nested_blockmodel_dl()
g = collection.data["football"]
print(g)
for k in g.graph_properties.keys():
    print(k + ":", g.gp[k])
<Graph object, undirected, with 115 vertices and 613 edges, 4 internal vertex properties, 2 internal graph properties, at 0x7f7137009d90> readme: The file football.gml contains the network of American football games between Division IA colleges during regular season Fall 2000, as compiled by M. Girvan and M. Newman. The nodes have values that indicate to which conferences they belong. The values are as follows: 0 = Atlantic Coast 1 = Big East 2 = Big Ten 3 = Big Twelve 4 = Conference USA 5 = Independents 6 = Mid-American 7 = Mountain West 8 = Pacific Ten 9 = Southeastern 10 = Sun Belt 11 = Western Athletic If you make use of these data, please cite M. Girvan and M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002). Correction: Two edges were erroneously duplicated in this data set, and have been removed (21 SEP 2014) description: American College football: network of American football games between Division IA colleges during regular season Fall 2000. Please cite M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821-7826 (2002). Retrieved from `Mark Newman's website <http://www-personal.umich.edu/~mejn/netdata/>`_. This file also contains corrections made by T. S. Evans, available `here <http://figshare.com/articles/American_College_Football_Network_Files/93179>`_. Thus, please also cite T.S. Evans, "Clique Graphs and Overlapping Communities", J.Stat.Mech. (2010) P12037 [arXiv:1009.0638].
states = [minimize_blockmodel_dl(g) for _ in range(10)]
entropies = [state.entropy() for state in states]
# minimize_blockmodel_dl(g) returns BlockState object
plt.plot(range(10), entropies, '-o')
plt.xlabel("Trial"), plt.ylabel("Entropy"), plt.grid()
state = states[np.argmin(entropies)]
b = state.get_blocks()
r = b[10] # group membership of vertex 10
print(r)
3
e = state.get_matrix()
im = plt.imshow(e.todense())
plt.colorbar(im);
print(f"Modularity of found partition: {modularity(g, b):.4f}")
print(f"GN modularity of found partition (from prev lesson): {0.5996290274077901:.4f}")
print(f"GN modularity of original partition: {modularity(g, g.vp.value_tsevans):.4f}" )
state.draw(pos=g.vp.pos,vertex_text= g.vp['value']);
Modularity of found partition: 0.6044 GN modularity of found partition (from prev lesson): 0.5996 GN modularity of original partition: 0.5744
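The description length itself can also be compared between the inferred partition and the known conference partition; a minimal sketch, assuming the integer labels in g.vp.value_tsevans can be passed directly as a fixed partition:

# Sketch: compare description lengths (lower = shorter description of the data)
state_conf = BlockState(g, b=g.vp.value_tsevans)   # fix the partition to the conferences
print(f"DL of inferred partition:   {state.entropy():.2f}")
print(f"DL of conference partition: {state_conf.entropy():.2f}")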
g = collection.data["celegansneural"]
print(g)
for k in g.graph_properties.keys():
    print(k + ":", g.gp[k])
<Graph object, directed, with 297 vertices and 2359 edges, 2 internal vertex properties, 1 internal edge property, 2 internal graph properties, at 0x7f7135db30d0> description: Neural network: A directed, weighted network representing the neural network of C. Elegans. Data compiled by D. Watts and S. Strogatz and made available on the web `here <http://cdg.columbia.edu/cdg/datasets>`_. Please cite D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). Original experimental data taken from J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, Phil. Trans. R. Soc. London 314, 1-340 (1986). Retrieved from `Mark Newman's website <http://www-personal.umich.edu/~mejn/netdata/>`_. readme: Neural network of the nematode C. Elegans Compiled by Duncan Watts and Steven Strogatz from original experimental data by White et al. The file celegansneural.gml describes a weighted, directed network representing the neural network of C. Elegans. The data were taken from the web site of Prof. Duncan Watts at Columbia University, http://cdg.columbia.edu/cdg/datasets. The nodes in the original data were not consecutively numbered, so they have been renumbered to be consecutive. The original node numbers from Watts' data file are retained as the labels of the nodes. Edge weights are the weights given by Watts. These data can be cited as: J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, "The structure of the nervous system of the nematode C. Elegans", Phil. Trans. R. Soc. London 314, 1-340 (1986). D. J. Watts and S. H. Strogatz, "Collective dynamics of `small-world' networks", Nature 393, 440-442 (1998).
states = [minimize_nested_blockmodel_dl(g) for _ in range(10)]
entropies = [state.entropy() for state in states]
# minimize_nested_blockmodel_dl(g) returns a NestedBlockState object
plt.plot(range(10), entropies, '-o')
plt.xlabel("Trial"), plt.ylabel("Entropy"), plt.grid()
state = states[np.argmin(entropies)]
cel_ent = state.entropy()  # entropy() returns a plain float, so no .copy() is needed
state.print_summary()
state.draw();
l: 0, N: 297, B: 15 l: 1, N: 15, B: 6 l: 2, N: 6, B: 2 l: 3, N: 2, B: 1
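The individual levels of the hierarchy can also be accessed directly; a minimal sketch using get_levels() and project_level() (the level index 1 is just an example):

# Sketch: inspect the hierarchy level by level
levels = state.get_levels()                # list of BlockState objects, one per level
print(levels[0])                           # bottom level: partition of the 297 neurons
b1 = state.project_level(1).get_blocks()   # level-1 groups projected onto the vertices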
g = collection.data["celegansneural"]
state = NestedBlockState(g)
state.print_summary()
state.entropy()
l: 0, N: 297, B: 1 l: 1, N: 1, B: 1 l: 2, N: 1, B: 1 l: 3, N: 1, B: 1 l: 4, N: 1, B: 1 l: 5, N: 1, B: 1 l: 6, N: 1, B: 1 l: 7, N: 1, B: 1 l: 8, N: 1, B: 1 l: 9, N: 1, B: 1
9468.527791619807
entropies = []
for _ in range(200):  # this should be sufficiently large
    state.multiflip_mcmc_sweep(beta=np.inf, niter=10)
    entropies.append(state.entropy())
state.print_summary()
plt.plot(range(200), entropies, label = "sweeps")
plt.axhline(cel_ent,color = "red", label = 'min_dl', )
plt.xlabel("Flip count"), plt.ylabel("Entropy"), plt.grid(), plt.legend();
l: 0, N: 297, B: 11 l: 1, N: 11, B: 4 l: 2, N: 4, B: 1 l: 3, N: 1, B: 1 l: 4, N: 1, B: 1 l: 5, N: 1, B: 1 l: 6, N: 1, B: 1 l: 7, N: 1, B: 1 l: 8, N: 1, B: 1 l: 9, N: 1, B: 1
state = NestedBlockState(g)
mcmc_anneal(state, beta_range=(1, 10), niter=500, mcmc_equilibrate_args=dict(force_niter=10))
state.print_summary()
print(f"Entropy: {state.entropy()}, with MDL: {cel_ent}")
l: 0, N: 297, B: 21 l: 1, N: 21, B: 7 l: 2, N: 7, B: 2 l: 3, N: 2, B: 1 l: 4, N: 1, B: 1 l: 5, N: 1, B: 1 l: 6, N: 1, B: 1 l: 7, N: 1, B: 1 l: 8, N: 1, B: 1 l: 9, N: 1, B: 1 Entropy: 8150.537898604023, with MDL: 8211.603988675375
Combinations such as minimize_nested_blockmodel_dl() $\rightarrow$ mcmc_anneal() could give better results (see the sketch below).
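A minimal sketch of such a combination (whether the copy(sampling=True) step is needed depends on the graph-tool version):

# Sketch: start from the heuristic minimum, then refine it with annealing
state = minimize_nested_blockmodel_dl(g)
state = state.copy(sampling=True)   # may be unnecessary in newer graph-tool versions
mcmc_anneal(state, beta_range=(1, 10), niter=500,
            mcmc_equilibrate_args=dict(force_niter=10))
print(f"Refined entropy: {state.entropy()}")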
g = collection.data["celegansneural"]
state_ndc = minimize_nested_blockmodel_dl(g, deg_corr=False)
state_dc = minimize_nested_blockmodel_dl(g, deg_corr=True)
print("Non-degree-corrected DL:\t", state_ndc.entropy())
print("Degree-corrected DL:\t\t", state_dc.entropy())
Non-degree-corrected DL: 8529.502159049163 Degree-corrected DL: 8216.06947699617
g = collection.data["football"]
state_ndc = minimize_nested_blockmodel_dl(g, deg_corr=False)
state_dc = minimize_nested_blockmodel_dl(g, deg_corr=True)
print("Non-degree-corrected DL:\t", state_ndc.entropy())
print("Degree-corrected DL:\t\t", state_dc.entropy())
Non-degree-corrected DL: 1733.5256851451725 Degree-corrected DL: 1780.5767169430787
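The difference in description length can be read as a log posterior odds ratio between the two model variants; a short sketch:

# Sketch: posterior odds ratio Lambda = exp(-(Sigma_dc - Sigma_ndc));
# ln(Lambda) > 0 favours the degree-corrected model, < 0 the non-degree-corrected one
dS = state_dc.entropy() - state_ndc.entropy()
print(f"ln(Lambda) = {-dS:.2f}")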
g = collection.data["lesmis"]
state = BlockState(g, B=1) # This automatically initializes the state with a partition
# into one group. The user could also pass a higher number
# to start with a random partition of a given size, or pass a
# specific initial partition using the 'b' parameter.
# Now we run 10,000 sweeps of the MCMC. Note that the number of groups
# is allowed to change, so it will eventually move from the initial
# value of B=1 to whatever is most appropriate for the data.
print(f"Initial dl: {state.entropy()}")
dS, nattempts, nmoves = state.multiflip_mcmc_sweep(niter=10000)
print("Change in description length:", dS)
print("Number of vertex moves attempts:", nattempts)
print("Number of accepted vertex moves:", nmoves)
print(f"Final dl: {state.entropy()}")
Initial dl: 787.4999819793873 Change in description length: -71.4314895705073 Number of vertex moves attempts: 3571289 Number of accepted vertex moves: 449625 Final dl: 716.0684924088499
# We will accept equilibration if 1000 iterations (of 10 sweeps each) are
# completed without a record-breaking event, 2 consecutive times.
state = BlockState(g, B=1)
mcmc_equilibrate(state, wait=1000, nbreaks=2, mcmc_args=dict(niter=10))
(698.9506193551209, 9157858, 1117729)
# We will first equilibrate the Markov chain
mcmc_equilibrate(state, wait=1000, mcmc_args=dict(niter=10))
bs = [] # collect some partitions
def collect_partitions(s):
    global bs
    bs.append(s.b.a.copy())
# Now we collect partitions for exactly 100,000 sweeps, at intervals
# of 10 sweeps:
mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
callback=collect_partitions)
# Disambiguate partitions and obtain marginals
pmode = PartitionModeState(bs, converge=True)
pv = pmode.get_marginal(g)
# Now the node marginals are stored in property map pv. We can
# visualize them as pie charts on the nodes:
state.draw(pos=g.vp.pos, vertex_shape="pie", vertex_pie_fractions=pv,)
<VertexPropertyMap object with value type 'vector<double>', for Graph 0x7f7136fe3340, at 0x7f7135d54be0>
g = collection.data["football"]
state = PPBlockState(g)
# Now we run 1,000 sweeps of the MCMC with zero temperature.
state.multiflip_mcmc_sweep(beta=np.inf, niter=1000)
print(modularity(g, state.get_blocks()))
state.draw(pos=g.vp.pos,vertex_text= g.vp['value'])
0.6005165407471079
<VertexPropertyMap object with value type 'vector<double>', for Graph 0x7f7137009d90, at 0x7f7137009f10>
For networks measured with noise or errors, graph-tool provides the reconstruction states MeasuredBlockState $\rightarrow$ MixedMeasuredBlockState $\rightarrow$ UncertainBlockState.
g = collection.data["lesmis"].copy()
# pretend we have measured and observed each edge twice
n = g.new_ep("int", 2) # number of measurements
x = g.new_ep("int", 2) # number of observations
e = g.edge(11, 36)
x[e] = 1 # pretend we have observed edge (11, 36) only once
e = g.add_edge(15, 73)
n[e] = 2 # pretend we have measured non-edge (15, 73) twice,
x[e] = 1 # but observed it as an edge once.
# We initialize MeasuredBlockState, assuming that each non-edge has
# been measured only once (as opposed to twice for the observed
# edges), as specified by the 'n_default' and 'x_default' parameters.
state = MeasuredBlockState(g, n=n, n_default=1, x=x, x_default=0)
# We will first equilibrate the Markov chain
mcmc_equilibrate(state, wait=1000, mcmc_args=dict(niter=10))
(1035.6463478081441, 10881739, 2640325)
# Now we collect the marginals for exactly 100,000 sweeps, at
# intervals of 10 sweeps:
u = None # marginal posterior edge probabilities
bs = [] # partitions
cs = [] # average local clustering coefficient
def collect_marginals(s):
    global u, bs, cs
    u = s.collect_marginal(u)
    bstate = s.get_block_state()
    bs.append(bstate.levels[0].b.a.copy())
    cs.append(local_clustering(s.get_graph()).fa.mean())
mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
callback=collect_marginals)
eprob = u.ep.eprob
print("Posterior probability of edge (11, 36):", eprob[u.edge(11, 36)])
print("Posterior probability of non-edge (15, 73):", eprob[u.edge(15, 73)])
print("Estimated average local clustering: %g ± %g" % (np.mean(cs), np.std(cs)))
Posterior probability of edge (11, 36): 0.7044704470447045 Posterior probability of non-edge (15, 73): 0.0349034903490349 Estimated average local clustering: 0.57169 ± 0.00336791
# The maximum marginal posterior estimator can be obtained by
# filtering the edges with probability larger than .5
u = GraphView(u, efilt=u.ep.eprob.fa > .5)
# Mark the recovered true edges as red, and the removed spurious edges as green
ecolor = u.new_ep("vector<double>", val=[0, 0, 0, .6])
for e in u.edges():
    if g.edge(e.source(), e.target()) is None or (e.source(), e.target()) == (11, 36):
        ecolor[e] = [1, 0, 0, .6]
for e in g.edges():
    if u.edge(e.source(), e.target()) is None:
        ne = u.add_edge(e.source(), e.target())
        ecolor[ne] = [0, 1, 0, .6]
# Duplicate the internal block state with the reconstructed network
# u, for visualization purposes.
bstate = state.get_block_state()
bstate = bstate.levels[0].copy(g=u)
# Disambiguate partitions and obtain marginals
pmode = PartitionModeState(bs, converge=True)
pv = pmode.get_marginal(u)
edash = u.new_ep("vector<double>")
edash[u.edge(15, 73)] = [.1, .1, 0]
bstate.draw(pos=u.own_property(g.vp.pos), vertex_shape="pie", vertex_pie_fractions=pv,
edge_color=ecolor, edge_dash_style=edash, edge_gradient=None,)
<VertexPropertyMap object with value type 'vector<double>', for Graph 0x7f712ff315e0, at 0x7f7135b29910>
Edge prediction: get_edges_prob()
Two non-existing edges in the football network (in red):
g = collection.data["football"]
missing_edges = [(101, 102), (17, 56)]
L = 10
state = minimize_nested_blockmodel_dl(g, deg_corr=True)
bs = state.get_bs() # Get hierarchical partition.
bs += [np.zeros(1)] * (L - len(bs)) # Augment it to L = 10 with
# single-group levels.
state = state.copy(bs=bs, sampling=True)
probs = ([], [])
def collect_edge_probs(s):
    p1 = s.get_edges_prob([missing_edges[0]], entropy_args=dict(partition_dl=False))
    p2 = s.get_edges_prob([missing_edges[1]], entropy_args=dict(partition_dl=False))
    probs[0].append(p1)
    probs[1].append(p2)
# Now we collect the probabilities for exactly 100,000 sweeps
mcmc_equilibrate(state, force_niter=10000, mcmc_args=dict(niter=10),
callback=collect_edge_probs)
def get_avg(p):
    p = np.array(p)
    pmax = p.max()
    p -= pmax
    return pmax + np.log(np.exp(p).mean())
p1 = get_avg(probs[0])
p2 = get_avg(probs[1])
p_sum = get_avg([p1, p2]) + np.log(2)
l1 = p1 - p_sum
l2 = p2 - p_sum
print("likelihood-ratio for %s: %g" % (missing_edges[0], np.exp(l1)))
print("likelihood-ratio for %s: %g" % (missing_edges[1], np.exp(l2)))
likelihood-ratio for (101, 102): 0.365437 likelihood-ratio for (17, 56): 0.634563
Related work:
Hannu Reittu, Ilkka Norros, Tomi Räty, Marianna Bolla and Fülöp Bazsó: Regular decomposition of large graphs: foundation of a sampling approach to stochastic block model fitting. Data Science and Engineering 4(1), pp. 44-60, 2019. https://doi.org/10.1007/s41019-019-0084-x
Hannu Reittu, Fülöp Bazsó and Robert Weiss: Regular decomposition of multivariate time series and other matrices. Structural, Syntactic, and Statistical Pattern Recognition, Lecture Notes in Computer Science, vol. 8621, pp. 424-433, 2014.
Sources: