# What should a data generating process for graphs be able to produce?

This is the first post in a series of six on random and large graphs. The question: what kinds of networks appear in the real world? For this first post, recall that $\delta (A)$ denotes the set of edges incident to $A$ and $\Gamma (A)$ the set of nodes adjacent to $A$.

A lot has been written on the properties of real-world networks (for an introduction beyond the present blog post, see e.g. Strogatz’ article in Nature). In most cases, the argument is by example: take graphs such as parts of the internet and check what the empirical degree distribution looks like, how many edges are present, and so on.

This makes evaluating foundational models of random graphs difficult: we can reject generative models that cannot produce graphs with given properties, but we cannot confirm that a model is actually good. As this is the same situation as in all (natural and social) model-based sciences, we proceed the same way in this and the next post: we establish five properties (sparsity, shortest-path length, modularity, degree distribution, clustering) that many real-world graphs share, and we will use them in what follows to reject generative graph models as foundations for statistics on graphs and to unveil how versatile they are.

Let us start with sparsity, the property that a graph has not many more edges than nodes.

# Sparsity

Sparsity says that a graph $G=(V,E)$ has $\lvert E \rvert \ll \lvert V \rvert^2$. As $\ll$ is not a formal relation, one usually defines sparsity for a graph sequence $(G_n)$ by $\lvert E(G_n) \rvert \in o(\lvert V (G_n)\rvert^2)$. Defining properties via graph sequences will reoccur in several places. Let us consider data on the internet as an example.

It is difficult to get data on the whole internet owing to its massive size. Smaller datasets exist, however: take a “small” internet dataset that was used in the Google Programming Contest 2002 (!) (available at Stanford’s large dataset page). It has 875713 nodes, but only about 5.8 times as many edges, 5105039. This looks almost like a tree, which would imply $\lvert E(G) \rvert \approx \lvert V(G) \rvert$.

Many networks are sparse due to bounds on the (in- or out-)degree. Consider the Wikipedia graph: nodes are individual pages, edges are hyperlinks. As pages on Wikipedia have a bounded size (some, however, less bounded than others), the number of outgoing links per page is also bounded, by $C$ say, hence

$\sum_{v \in V(G) } \lvert \delta^+ (\{v\} ) \rvert \le \sum_{v \in V(G) } C = C \lvert V(G) \rvert$.
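As a sanity check, the bound can be simulated. The following sketch (all names hypothetical) builds a random directed graph whose out-degree is capped at a constant $C$ and confirms that the number of edges stays below $C \lvert V \rvert$, far from the quadratic worst case.

```python
import random

# A directed graph with bounded out-degree is sparse: cap each node's
# out-degree at C and the edge count grows at most linearly in n.
def bounded_outdegree_graph(n, C, seed=0):
    rng = random.Random(seed)
    edges = set()
    for v in range(n):
        # each node links to at most C distinct other nodes
        targets = rng.sample([u for u in range(n) if u != v], k=min(C, n - 1))
        for u in targets:
            edges.add((v, u))
    return edges

edges = bounded_outdegree_graph(n=1000, C=12)
assert len(edges) <= 12 * 1000  # |E| <= C * |V|, hence |E| in o(|V|^2)
```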

# Shortest-Path Length

Let $\ell_{uv}$ be the length of a shortest $u$-$v$-path in a graph $G$, or infinity if the nodes are disconnected. We then say that a graph with $n$ nodes has small longest shortest-path length if

$\max_{u, v \in V(G)} \ell_{uv} \in O(\log n).$

This demands that a graph somehow structurally resembles a balanced tree. In the ecology literature (on which the second-to-next post will focus), food webs describe who eats whom in an ecosystem, and the longest shortest-path length is called (in parts of the literature) food-chain length. It has been claimed that long food chains are a sign of ecological instability.
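To make the tree analogy concrete, here is a small sketch (pure Python, names hypothetical) that computes the longest shortest-path length by breadth-first search and checks it on a balanced binary tree, where it grows logarithmically in the number of nodes.

```python
from collections import deque

# adj is a dict: node -> list of neighbours; the graph is assumed connected.
def bfs_eccentricity(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return max(dist.values())

# longest shortest-path length = max over all sources of the eccentricity
def diameter(adj):
    return max(bfs_eccentricity(adj, v) for v in adj)

# A balanced binary tree on 15 nodes (depth 3): leaf-to-leaf distance is 6.
tree = {v: [] for v in range(15)}
for v in range(1, 15):
    parent = (v - 1) // 2
    tree[v].append(parent)
    tree[parent].append(v)
assert diameter(tree) == 6
```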

# Modularity

Many graphs have a modular structure: modules that are more highly connected within themselves than between each other. The definition of modularity is the first instance where randomization is used. The (Newman–Girvan) modularity of a graph for a given partition $f$ into modules is defined as the proportion of edges within the modules minus the expected proportion of edges within modules for uniformly distributed multigraphs with the same node degrees. A partition maximizing this difference is chosen. More precisely, modularity is approximated for a graph $G$ with adjacency matrix $a$ by

$\max_{\substack{\lvert A \rvert < \infty \\ f \colon V \rightarrow A}} \frac{1}{2\lvert E(G) \rvert} \sum_{k \in A} \sum_{v,w \in f^{-1} (\{k\}) } \left( a_{vw} - \frac{\lvert \delta (v) \rvert\,\lvert \delta(w)\rvert}{2\lvert E(G) \rvert} \right)$.

The former summand clearly yields the proportion of edges within modules; one sees that the latter summand approximates the above-mentioned expectation by considering the probability that a uniformly chosen edge ends in given vertices $v$ or $w$ and summing this over all edges (for those in need of a hint, this section on Wikipedia gives a good explanation). Modularity lies between $-1/2$ and $1$ (exercise).
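To see the formula in action, the following sketch (a naive implementation, names hypothetical) computes the modularity of a fixed partition on two triangles joined by a bridge; the within-module proportion is $6/7$ and the null-model expectation is $1/2$.

```python
# Newman-Girvan modularity for a given partition f (a dict node -> module
# label); edges is a list of undirected edges.
def modularity(edges, f):
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # proportion of edges inside modules
    inside = sum(1 for u, v in edges if f[u] == f[v]) / m
    # expected proportion under the degree-preserving null model
    expected = sum((sum(deg[v] for v in deg if f[v] == k) / (2 * m)) ** 2
                   for k in set(f.values()))
    return inside - expected

# Two triangles joined by a single bridge: a clearly modular graph.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
f = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, f), 3))  # 6/7 - 1/2, i.e. 0.357
```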

# Degree-Distribution

The degree distribution characterizes the relative importance of some nodes in a network.

A random variable $X$ that takes values in the natural numbers is said to follow a power law with parameter $\gamma > 0$ iff $\mathbb{P}[X = k] \propto k^{-\gamma}$ for $k \in \mathbb{N}$ (which requires $\gamma > 1$ for normalizability). This distribution has a fat tail; it has a finite mean iff $\gamma > 2$. In the case $\gamma \le 2$, the mean is infinite and one speaks of a scale-free network, i.e. a network that does not have a characteristic degree scale.

Many real-world networks have fat-tailed empirical degree distributions, and it is a subject of debate whether they follow a power law.
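The divergence of the mean for small $\gamma$ can be checked numerically. The sketch below (names hypothetical) truncates the power law at $K$ and tracks the mean as $K$ grows: for $\gamma = 2.5$ it stabilizes, while for $\gamma = 1.5$ it keeps growing.

```python
# Mean of the truncated power law P[X = k] proportional to k^(-gamma)
# on {1, ..., K}; the normalizing constant Z is a truncated zeta sum.
def truncated_power_law_mean(gamma, K):
    weights = [k ** (-gamma) for k in range(1, K + 1)]
    Z = sum(weights)
    return sum(k * w for k, w in zip(range(1, K + 1), weights)) / Z

# For gamma > 2 the mean stabilizes as K grows; for gamma <= 2 it diverges.
for gamma in (2.5, 1.5):
    print(gamma, [round(truncated_power_law_mean(gamma, K), 2)
                  for K in (10**2, 10**4, 10**6)])
```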

# Clustering

A last property is that real-world networks are clustered, that is, adjacent nodes tend to share neighbors. To formalise this, let $v \in V(G)$ and define the clustering coefficient

${clust}(v) = \frac{ \sum_{\{u, w\} \subseteq \Gamma (\{v\})} a_{uw} }{\binom{\lvert\Gamma(\{v\})\rvert}{2}}$,

where $a$ is the adjacency matrix of $G$ and the sum runs over unordered pairs of neighbors, so each edge among the neighbors of $v$ is counted once.

Define the average clustering as ${clust} (G) = \frac{1}{\lvert V(G) \rvert}\sum_{v \in V(G)} {clust}(v)$. Nonzero clustering should be possible: we will demand $clust(G_n) \rightarrow c > 0$ for a graph sequence $(G_n)$. This is an instance of pattern counting: in how many different ways can a pattern be embedded into a graph? That topic, however, is out of the scope of this article.
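A minimal sketch of both definitions (pure Python, names hypothetical), evaluated on a triangle with a pendant node: the two untouched triangle corners have clustering $1$, the corner attached to the pendant has $1/3$, and the pendant itself contributes $0$, giving an average of $7/12$.

```python
from itertools import combinations

# Local clustering coefficient; adj is a dict node -> set of neighbours.
def clust(adj, v):
    neigh = adj[v]
    if len(neigh) < 2:
        return 0.0
    # edges among the neighbours of v, each unordered pair counted once
    links = sum(1 for u, w in combinations(neigh, 2) if w in adj[u])
    return links / (len(neigh) * (len(neigh) - 1) / 2)

def average_clustering(adj):
    return sum(clust(adj, v) for v in adj) / len(adj)

# A triangle (nodes 0, 1, 2) with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_clustering(adj))  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```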

# Conclusion

We formulated necessary conditions for random graph models: sparsity, small shortest-path length, modularity, degree distribution and clustering. In the next post, we study the drama of random graph models (in five acts).