Algebraic Datatypes, Combinatorial Structures, and Generating Functions
This is structured as a preliminary article on the ideas that connect algebraic datatypes, combinatorial structures, and generating functions. I have attempted to keep this post minimal in its prerequisites as there are a handful of ideas in computer science and mathematics that you have to weave through to understand the ideas here. I have attempted to navigate through these ideas without getting (too much) into the intricacies of advanced ideas like combinatorial species, analytic functors, and differentiation as zippers. I have also attempted to make it maximally visual and intuitive for a person who might be coming across this nexus of ideas for the first time. This article was penned in response to this tweet.
If you have any feedback / further ideas or if you spot any errors/omissions in the narrative or the equations, feel free to reach out to me on X or via email.
The Combinatorics Algebra Transit
The key idea of this article is that algebra of datatypes can be interpreted as transformations on the combinatorial objects they track. We will start with some algebraic manipulations to obtain a few “forms” of the same algebraic data type and then relate them to their combinatorial interpretations. Then we will discuss a bit on what differentiation/integration operators on these structures mean. An annotated bibliography is present at the bottom which briefly recounts the evolution of these ideas and locates various papers, blogposts, and dissertations in which many brilliant authors have written about these and other adjacent ideas. The summary of the ideas I describe in this article can be captured in this image, details of which will be unpacked in the prose below.
Representing Lists in Algebra
Inner / Outer View
We can represent the type of a list in a typed programming language as:
List(X) := Nil | Cons a List(X)
If this is your first time coming across the concept of algebraic types and want to understand what they are, I recommend this lucidly written series of articles by Chris Taylor. There is also a nice talk by him on the same topic.
This can be algebraized as
Here, Cons a List(X)
The Cons operator in the type algebra. It is mathematically more accurate to think of these as linear orders. A linear order can be thought of as a list without any duplicate elements from the sampling set.
Now, let us proceed to do some equational manipulation with it. By shifting the
which can be transformed further to:
Let us call this the Outer View or Closed Form Equation.
Now, there's another equational manipulation we can do by repeatedly substituting the left hand side of
This will give us
Iterating this indefinitely gives us the generating series function:
Let us call this the Inner view or the Series Form.
Together, these manipulations gives us the following two ways of representing
The closed form
Let us summarize this as:
Turns out, it is not just lists, but many such algebraic data type representations have such equivalent forms and there are some fun combinatorial interpretations of these!
Family of Structures
To give the generating functions a combinatorial interpretation, we need to familiarize ourselves with some ideas from combinatorics. Given a finite set of labels, we can think of the family of structures that can be constructed from them.
We do this in a way that are considered to be the same under permutation of the labels we give them. These are called F-structures.
An important thing to notice in connection to F-structures is that these labels in themselves are not data but rather location in which data can be stored. As such, they represent the "form" in which data can be stored not its contents. That is when we say [a, b, c] denotes a list / linear order, we are saying that they are place holders named a, b, c, in which any data can be stored.
An F-structure can be viewed as taking in a set of labels and constructing another set of structures from it. We also need the detail of making it independent of the underlying labelling set, so it is defined in such a way that all such structures which can be generated by permuting the labels are considered to be the same. If you are interested in how this “lifting” is performed, I refer you to Brent Yorgey’s accessible paper on the idea.
Here are the different ways in which we can reassign labels to a 3 element set.
Corresponding to each of these permutation, we can locate a linear order in the family of structures generated above. We can think of the F-structures as a family that remains invariant under permutation of labels. This perspective will grant us a neat way to interpret the algebraic series as tracking these very combinatorial structures!
Labelled vs. Unlabelled structures
It is important to mention here that there are two kinds of structures here, unlabelled ones and labelled ones.
Notice, that we can assign labels in
Ordinary and Exponential Generating Functions
Now, we will proceed to give an interpretation of these F-structures in terms of generating functions. Keep in mind that every F-structure may correspond to several generating functions, not just a single one. Here, we will make use of two different families of generating functions to interpret our list expansions.
First is the ordinary generating function defined as:
These are used when we want to count the number of unlabelled objects. Using this interpretation, if we look at the coefficients of
Since
Second one is the exponential generating function. Using this, we can interpret the coefficient as tracking combinatorial structures.
These are used when we want to count the labelled objects, that is, when the order in which they appear are significant. Here,
As noted above in the family of structures section, look at how the OGF for unlabelled case and EGF for the labelled case vary by a factor of
We can construct a table of linear orders / lists here that summarizes our findings.
This study is extended to a full blown theory under the name of Combinatorial Species and the dissertation of Brent Yorgey linked in the bibliography acts as a good introduction. As an added bonus, it has a lot of pretty pictures to aid the intuition!
Integration and Differentiation
Now, let us think about what integration and differentiation means on these series. If we think about the anti-derivative of our list, we can see that:
or conversely
This gives us a new closed form equation,
Using the exponential generating function interpretation that tracks labelled structures, we can see that this corresponds to (n-1)!, which is the number of cycles for n labelled elements.
Here are the counts for unlabelled and labelled cycles.
So in a sense, what we have derived here is, differentiation of cycles is linear orders / lists:
or conversely, integrating cycles gives us linear orders.
Is there a combinatorial interpretation that we can assign to this transformation? Turns out, we can indeed! The idea of derivative is analogous to creating an hole in the data structure. Here in the example pictured above, 1-holed data structure of a 3 cycle is isomorphic to a linear order with 3 elements!
The differentiated F structure gives you the original F-structure unioned with a distinguished element. We can think of it as a “hole” which helps align this with the literature about differentiation on data types you can find in the annotated bibliography below.
So with that idea we can construct a more elaborate table recapitulating the set of ideas described here.
This is only a small sample of the fascinating ideas in this field, there are some really intriguing connections and deep results which are well worth perusing if the above ideas intrigued you. I have added an annotated bibliography with some brilliant material below which I hope will be helpful for the curious.
Open Questions
What does the outer view mean in terms of fraction and negative types? Hints in Sabry's work The Two Dualities of Computation: Negative and Fractional Types and Yorgey's blogpost on Virtual Species: Species subtraction made simple
What about the ordinary generating functions interpretation of the cycles? What does
represent? What is the intuition behind representing combinatorial objects using the Taylor-MacLaurin series? Can we get a better intuition that algebraic manipulation into this?
What is the intuition behind employing these two series in OGF and EGF to track unlabelled and labelled objects? How does each term act as a primitive for the coefficients?
If differentiation means data structures with a hole in it, what does integration mean? Hint by Tali here: https://news.ycombinator.com/item?id=41058125
How can we distinguish lists with duplicates from the lists described above? It is perhaps better to call them linear orders rather than lists because there is no repetition of elements involved, a list on [a, b, c] can have
elements rather than because we allow for repetition of elements in a list giving us [a, a, a], [a, a, b], but in a linear order, we only allow for distinct permutations as mentioned in the article above. It might be better to give an account of linear orders with multiplicities to clarify this distinction.
Annotated Bibliography
The ideas here have a long history going back to the Symbolic Method that developed through the last few centuries. In programming, algebraic data types were brought in and this connection to the underlying calculus of these ideas started finding prominence in the papers by Abbott, Altenkirch, Conor McBride, Ghani et. al. which got linked with the combinatorial species work of André Joyal. In another line of development, Flajolet and Sedgewick invented the Analytic Combinatorics which has a rich theoretic background. There are some other related ideas, which if you are curious can find in a tweet threads of mine:
Algebraic Datatypes
A nice introductory article into many of the ideas here by Chris Taylor:
The Algebra of Algebraic Data Types, Part 1
The Algebra of Algebraic Data Types, Part 2
The Algebra of Algebraic Data Types, Part 3
Other intros to algebra / calculus on types:
The Algebra (and calculus!) of algebraic data types by Joel Burget at RecurseCenter
The Algebra of Data, and the Calculus of Mutation by Kalani Thielen at Lab49
Generating Functions
Concrete Mathematics - Ronald Graham, Donald Knuth, Oren Patashnik (1994)
Generating Functionology by Helbert Wilf (1992)
Analytic Combinatorics
Analytic Combinatorics - Flajolet and Sedgewick
Introductory article on Generating Functions by Sedgewick on Generating Functions
Introductory slides on Analytic Combinatorics (Part One) by Robert Sedgewick
Introductory slides on Analytic Combinatorics (Part Two) by Robert Sedgewick
Combinatorial Species
Beautiful introductory article by Trevor Hyde:
Combinatorial Species and Generating Functions (Slides)
André Joyal's work on Combinatorial Species:
English translation by Brent Yorgey
Sigfpe's articles:
Formal Power Series and Haskell
Divided Differences and the Tomography of Types
Brent Yorgey has written a flurry of blogposts, a lucid Pearl on this, and an entire dissertation which is canonical on the topic.
First series by Brent Yorgey (A bit dependent on the implementation of his library):
Introducing Math.Combinatorics.Species!
Primitive species and species operations
Primitive species and species operations, part II
Species operations: differentiation
Second series by Brent Yorgey:
And now, back to your regularly scheduled combinatorial species
Combinatorial species definition
Species definition clarification and exercises
The algebra of species: primitives
Functional Pearl by Brent Yorgey:
Species and Functors and Types, Oh My! Functional Pearl (2010)
Dissertation by Brent Yorgey:
Combinatorial Species and Labelled Structures (2014)
Categories of Containers (2003) by Michael Abbott, Thorsten Altenkirch, and Neil Ghani
Species: making analytic functors practical for functional programming by Jacques Carette, Gordon Uszkay
Stackoverflow question on why algebra/calculus works on datatypes with some neat answers.
Other Works
Differentiating Data Structures and Zippers
∂ for Data: Differentiating Data Structures by Michael Abbott, Neil Ghani, Thorsten Altenkirch, Conor McBride (2005)
Clowns to the Left, Jokers to the Right: Dissecting Data Structures by Conor McBride (2008)
A series of articles introducing zippers in four parts, getting into why differentation on lists track these: Part 1: Zippers, Part 2: Derivatives, Part 3: Kiselyov Zippers, Part 4: Multi Zippers
Shapely Types
There is a parallel development on tracking data and their shape in the work of Barry Jay and Robin Cockett.
Shapely Types and Shape Polymorphism - Barry Jay and Robin Cockett (2004)
A semantics for shape - Barry Jay (1995)
Shape in Computing - Barry Jay (1996)
Separating Shape from Data - Barry Jay (2005)
The Functional Imperative: Shape! - Barry Jay (2006)
Colin Banger and David Skillicorn work
Flat arrays as a Categorical Datatype - Colin Banger and David Skillicorn (1992)
A Foundation for Theory of Arrays - Colin Banger and David Skillicorn (1993)
ADTs
Tangential but interesting work from Benjamin Pierce et al. on expressing ADTs: http://www.cis.upenn.edu/~bcpierce/papers/compobj.ps