What should a data generating process for graphs be able to produce?

This is the first post in a series of six on random and large graphs. The guiding question: what kinds of networks appear in the real world? Throughout, \delta (A) denotes the set of edges incident to A, and \Gamma (A) the set of nodes adjacent to A.

A lot has been written on the properties of real-world networks (for an introduction beyond the present blog post, see e.g. Strogatz’ article in Nature). In most cases, the argument is by example: take graphs such as parts of the internet and check what the empirical degree distribution looks like, how many edges are present, and so on.

This makes it difficult to evaluate foundational models of random graphs: we can reject generative models that cannot produce graphs with given properties, but we cannot confirm that a model is actually good. As this is the same situation as in all (natural and social) model-based sciences, we proceed the same way in this and the next post: we establish five properties (sparsity, shortest-path length, modularity, degree distribution, clustering) that many real-world graphs share, and will use them in the following posts to reject generative models of graphs as foundations for statistics on graphs and to unveil how versatile they are.

Let us start with sparsity, the property that a graph has not many more edges than nodes.


Sparsity says that a graph G=(V,E) has \lvert E \rvert \ll \lvert V \rvert^2. As \ll is not a formal relation, one usually defines sparsity for a graph sequence (G_n) by \lvert E(G_n) \rvert \in o(\lvert V (G_n)\rvert^2). Definitions of properties via graph sequences will reoccur at several places. Let us consider data on the internet as an example.

Artistic interpretation of the internet of 2006 (The opte project, Creative Commons Attribution-NonCommercial 4.0 International License).

This looks almost like a tree, which would imply \lvert V(G) \rvert \approx \lvert E(G) \rvert. It is difficult to get data on the whole internet owing to its massive size. Smaller datasets exist, however: take a “small” internet dataset that was used in Google’s programming contest in 2002 (!) (available at Stanford’s large dataset page). It has 875,713 nodes but only 5,105,039 edges, about 5.8 times as many.

Many networks are sparse due to bounds on the (in-/out-)degree. Consider the Wikipedia graph: nodes are individual pages, edges are hyperlinks. As pages in Wikipedia have bounded size (some less bounded than others), the number of links per page is bounded by some constant C, say, hence

\sum_{v \in V(G) } \lvert \delta^+ (\{v\} ) \rvert \le \sum_{v \in V(G) } C = C \lvert V(G) \rvert.
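To make the bound concrete, here is a minimal sketch on a made-up toy digraph (the edge list and the bound C are illustrative, not Wikipedia data):

```python
from collections import defaultdict

# Made-up toy digraph standing in for "pages linking to pages".
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0), (3, 1), (3, 2)]

out_deg = defaultdict(int)
for u, v in edges:
    out_deg[u] += 1

C = max(out_deg.values())                # empirical out-degree bound
n = len({x for e in edges for x in e})   # number of nodes

# Degree-sum bound: |E| = sum of out-degrees <= C * |V|.
assert len(edges) == sum(out_deg.values()) <= C * n
print(len(edges), C * n)  # 7 <= 12: the edge count is linear, not quadratic, in n
```

A bounded out-degree thus forces the edge count to grow at most linearly with the node count, which is exactly sparsity.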

Shortest-Path Length

Let \ell_{uv} be the length of a shortest uv-path in a graph G, or infinity if u and v are disconnected. We say a graph sequence (G_n) has small longest shortest-path length if

\max_{u, v \in V(G_n)} \ell_{uv} \in O(\log \lvert V(G_n) \rvert).

This demands that the graph somehow structurally resembles a balanced tree. In the ecology literature (on which the second-to-next post will focus), food webs describe who eats whom in an ecosystem, and the longest shortest-path length is called (in parts of the literature) the food-chain length. It has been claimed that a long food chain is a sign of ecological instability.
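The balanced-tree intuition can be checked directly: the following sketch computes the longest shortest-path length of a complete binary tree via breadth-first search (the construction and helper names are my own):

```python
from collections import deque
from math import log2

def eccentricity(adj, s):
    """Longest shortest-path distance from s, via breadth-first search."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

# Complete binary tree on n = 2**h - 1 nodes; node i has children 2i+1, 2i+2.
h = 10
n = 2**h - 1
adj = {i: [] for i in range(n)}
for i in range(n):
    for c in (2 * i + 1, 2 * i + 2):
        if c < n:
            adj[i].append(c)
            adj[c].append(i)

diameter = max(eccentricity(adj, s) for s in adj)
print(diameter, round(log2(n), 1))  # 2(h-1) = 18 vs log2(n) ~ 10.0
```

The diameter 2(h-1) is indeed O(\log n), as the definition demands.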


Modularity

Many graphs have a modular structure: modules that are more highly connected within themselves than they are between each other. The definition of modularity is the first instance where randomization is used. The (Newman–Girvan) modularity of a graph for a given partition f into modules is defined as the proportion of edges within the modules minus the expected proportion of edges within modules for a uniformly distributed random multigraph with the same node degrees; a partition maximizing this difference is then chosen. More concretely, for a graph G with adjacency matrix a, modularity is approximated by

\max_{\substack{\lvert A \rvert < \infty \\ f \colon V \rightarrow A}} \frac{1}{2\lvert E(G) \rvert} \sum_{k \in A} \sum_{v,w \in f^{-1} (\{k\}) } \left( a_{vw} - \frac{\lvert \delta (v) \rvert \lvert \delta(w) \rvert}{2\lvert E(G) \rvert} \right).

The former summand is clearly the proportion of edges within modules. To see that the latter summand approximates the above-mentioned expectation, consider the probability that a uniformly random edge ends in given vertices v or w, and sum this over all edges (for those in need of a hint, this section on Wikipedia gives a good explanation). Modularity lies between -1/2 and 1 (exercise).
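The formula can be evaluated for a fixed partition (finding the maximizing partition is hard in general, so this sketch only scores one hand-picked partition of a toy graph of my own choosing):

```python
# Toy graph (made up for illustration): two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
m = len(edges)

deg = {}
for u, v in edges:
    deg[u] = deg.get(u, 0) + 1
    deg[v] = deg.get(v, 0) + 1

# Symmetric adjacency lookup.
adj = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}

# Hand-picked partition into the two obvious modules.
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Q = (1 / 2m) * sum over ordered pairs v, w in the same module of
#     (a_vw - deg(v) * deg(w) / 2m)
Q = sum(
    (1.0 if (v, w) in adj else 0.0) - deg[v] * deg[w] / (2 * m)
    for v in part
    for w in part
    if part[v] == part[w]
) / (2 * m)

print(round(Q, 3))  # 5/14, about 0.357
```

A positive value indicates more intra-module edges than the degree-preserving random multigraph would give in expectation, matching the two-triangle structure.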


Degree Distribution

The degree distribution characterizes the relative importance of individual nodes in a network.

A random variable X taking values in the natural numbers is said to follow a power law with parameter \gamma > 1 iff \mathbb{P}[X = k] \propto k^{-\gamma}, k \in \mathbb{N}. This distribution has fat tails: its mean \sum_k k^{1-\gamma} is finite iff \gamma > 2. In the case \gamma \le 2, one speaks of a scale-free network, i.e. a network that does not have a characteristic degree scale.

Degrees of the bipartite GitHub (multidi)graph on a log–log scale; nodes are users and repositories, and edges connect users to the repositories they push to. Note the good fit to a power-law distribution. (Image taken from Hortonworks.)

Many real-world networks have fat-tailed empirical degree distributions, and it is a subject of debate whether they truly follow a power law.
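The threshold \gamma = 2 can be seen numerically: the (unnormalized) mean of a power law is the series \sum_k k \cdot k^{-\gamma} = \sum_k k^{1-\gamma}, which converges exactly when \gamma > 2. A small sketch (the function name is mine):

```python
# Partial sums of k * P[X = k], up to proportionality: the mean of a
# power law is finite exactly when sum_k k^(1 - gamma) converges.
def partial_mean_sum(gamma, K):
    return sum(k ** (1 - gamma) for k in range(1, K + 1))

for gamma in (1.5, 2.0, 2.5):
    sums = [round(partial_mean_sum(gamma, K), 1) for K in (10**2, 10**4, 10**6)]
    print(gamma, sums)  # gamma = 2.5 stabilizes; 1.5 and 2.0 keep growing
```

For \gamma = 2.5 the partial sums approach a limit, while for \gamma \le 2 they grow without bound, which is why such networks have no characteristic degree scale.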


Clustering

A last property is that real-world networks are clustered; that is, adjacent nodes tend to share neighbors. To formalise this, let v \in V(G) and define the clustering coefficient

{clust}(v) = \frac{ \sum_{\{u, w\} \subseteq \Gamma (\{v\})} a_{uw} }{\binom{\lvert\Gamma(\{v\})\rvert}{2}},

where a is the adjacency matrix of G.

Define the average clustering as {clust} (G) = \frac{1}{\lvert V(G) \rvert}\sum_{v \in V(G)} {clust}(v). Nonzero clustering should be possible: we will demand {clust}(G_n) \rightarrow c > 0 for a graph sequence (G_n). This is an instance of pattern counting (in how many different ways can a pattern, here a triangle, be embedded into a graph?), a topic beyond the scope of this article.
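These definitions translate directly into code; a minimal sketch on a made-up four-node graph:

```python
from itertools import combinations
from math import comb

# Made-up four-node graph, stored as neighbour sets.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

def clust(v):
    """Fraction of pairs of neighbours of v that are themselves adjacent."""
    nbrs = adj[v]
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    return links / comb(len(nbrs), 2)

avg = sum(clust(v) for v in adj) / len(adj)
print([round(clust(v), 3) for v in adj], round(avg, 3))  # average 5/6
```

Each triangle through v contributes one adjacent neighbour pair, so clust(v) is exactly the triangle count at v normalised by the possible pairs.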


We have formulated necessary conditions for random graph models: sparsity, small shortest-path length, modularity, degree distribution and clustering. In the next post, we study the drama of random graph models (in five acts).

More on modularity: this article in PNAS

