Iordan Ganev

This page lists publications, preprints, and projects related to my work in machine learning. Comments and questions welcome.


Publications and Preprints

Symmetries, flat minima, and the conserved quantities of gradient flow.
With B. Zhao, R. Walters, R. Yu, and N. Dehmamy. International Conference on Learning Representations (ICLR), 2023.

Abstract: We present a general framework for finding continuous symmetries in the parameter space of deep neural networks, which carve out low-loss valleys in the loss landscape. Our framework uses equivariances of the activation functions and can be applied to different layer architectures. We also show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities help reveal that using common initialization methods, gradient flow only explores a small part of the global minimum. By relating conserved quantities to convergence rate and sharpness of the minimum, we provide insights on how initialization impacts convergence and generalizability.

[arXiv] [github]
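
As a toy illustration of the kind of conserved quantity studied in this paper (my own snippet, not the paper's code): for a two-layer linear network, the loss is invariant under the rescaling (W1, W2) -> (a W1, W2 / a), and the difference of squared Frobenius norms ||W1||^2 - ||W2||^2 is conserved exactly along gradient flow, hence approximately along small-step gradient descent.

    # Toy check (not the paper's code): the rescaling symmetry
    # (W1, W2) -> (a*W1, W2/a) leaves the loss of a two-layer linear
    # network invariant, so ||W1||^2 - ||W2||^2 is conserved under
    # gradient flow and drifts only slightly under small-step descent.
    import torch

    torch.manual_seed(0)
    X, Y = torch.randn(32, 5), torch.randn(32, 3)
    W1 = torch.randn(4, 5, requires_grad=True)   # hidden x input
    W2 = torch.randn(3, 4, requires_grad=True)   # output x hidden

    def conserved():
        return (W1.norm() ** 2 - W2.norm() ** 2).item()

    q0, lr = conserved(), 1e-3
    for _ in range(1000):
        loss = ((X @ W1.T @ W2.T - Y) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            W1 -= lr * W1.grad
            W2 -= lr * W2.grad
            W1.grad.zero_(); W2.grad.zero_()
    print(q0, conserved())   # the two values stay close for small lr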

Quiver neural networks.
With R. Walters. Preprint, arXiv:2207.12773.

Abstract: We develop a uniform theoretical approach towards the analysis of various neural network connectivity architectures by introducing the notion of a quiver neural network. Inspired by quiver representation theory in mathematics, this approach gives a compact way to capture elaborate data flows in complex network architectures. As an application, we use parameter space symmetries to prove a lossless model compression algorithm for quiver neural networks with certain non-pointwise activations known as rescaling activations. In the case of radial rescaling activations, we prove that training the compressed model with gradient descent is equivalent to training the original model with projected gradient descent.

[arXiv] [github]
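
A rough sketch of the basic picture (my own illustration, with a pointwise activation for simplicity; the paper's compression results concern rescaling activations): vertices of a quiver carry feature spaces, edges carry weight matrices, and each non-input vertex applies an activation to the sum of its incoming messages.

    # Illustrative only: a feedforward computation read off a quiver
    # (directed acyclic graph). Each edge carries a weight matrix; each
    # non-input vertex aggregates incoming messages and applies an
    # activation. Dimensions, graph, and activation are made up.
    import torch

    dims = {"in": 5, "h1": 4, "h2": 3, "out": 2}            # vertex dimensions
    edges = [("in", "h1"), ("in", "h2"), ("h1", "h2"), ("h2", "out")]
    W = {(s, t): torch.randn(dims[t], dims[s]) for s, t in edges}

    def forward(x, activation=torch.relu, order=("h1", "h2", "out")):
        feats = {"in": x}
        for v in order:                                     # topological order
            incoming = sum(W[(s, t)] @ feats[s] for s, t in edges if t == v)
            feats[v] = incoming if v == "out" else activation(incoming)
        return feats["out"]

    print(forward(torch.randn(5)).shape)                    # torch.Size([2])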

Universal approximation and model compression for radial neural networks.
With T. van Laarhoven and R. Walters. Preprint, arXiv:2107.02550.

Abstract: We introduce a class of fully-connected neural networks whose activation functions, rather than being pointwise, rescale feature vectors by a function depending only on their norm. We call such networks radial neural networks, extending previous work on rotation equivariant networks that considers rescaling activations in less generality. We prove universal approximation theorems for radial neural networks, including in the more difficult cases of bounded widths and unbounded domains. Our proof techniques are novel, distinct from those in the pointwise case. Additionally, radial neural networks exhibit a rich group of orthogonal change-of-basis symmetries on the vector space of trainable parameters. Factoring out these symmetries leads to a practical lossless model compression algorithm. Optimization of the compressed model by gradient descent is equivalent to projected gradient descent for the full model.

[arXiv] [github]
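
To make the notion concrete, here is a minimal sketch of a radial rescaling activation (my own choice of scaling function, not taken from the paper): the entire feature vector is rescaled by an amount depending only on its norm, so directions are preserved.

    # Minimal illustration of a radial rescaling activation: x is sent to
    # g(|x|) * x for a scalar function g, so only the norm enters the
    # nonlinearity. The particular g below is just an example.
    import torch

    def radial_shift(x, shift=1.0, eps=1e-8):
        # rescale the norm |x| to (|x| - shift)_+ while keeping the direction
        r = x.norm(dim=-1, keepdim=True)
        return torch.relu(r - shift) / (r + eps) * x

    x = torch.randn(10, 6)          # a batch of feature vectors
    y = radial_shift(x)
    print(y.shape)                  # torch.Size([10, 6])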




Notes and Projects

Gradient flow and conserved quantities.

These are notes stemming from and supplementing the paper Symmetries, flat minima, and the conserved quantities of gradient flow above (arXiv:2210.17216). We explain several different characterizations of conserved quantities, provide examples, and explore conserved quantities in the context of a loss function that is invariant under a group action. Of particular interest are orthogonal group actions, which appear in the study of radial neural networks. We provide a description of the stabilizer of a generic point in the parameter space of a radial neural network, which is related to the compression of the network. We also consider a particular diagonalizable action.

[pdf]
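
One standard computation along these lines, recorded here only for orientation (notation mine): when the loss is invariant under a one-parameter group generated by a symmetric matrix, the associated quadratic form is conserved along gradient flow.

    % Suppose L(e^{sM} \theta) = L(\theta) for all s, with M symmetric.
    % Differentiating at s = 0 gives \nabla L(\theta)^\top M \theta = 0.
    % Along gradient flow \dot{\theta} = -\nabla L(\theta), the quadratic
    % form Q(\theta) = \theta^\top M \theta is therefore conserved:
    \[
      \frac{d}{dt}\, Q(\theta(t))
        = 2\, \theta^\top M \dot{\theta}
        = -2\, \theta^\top M \nabla L(\theta)
        = 0.
    \]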

Backpropagation.

These notes are an exploration of the backpropagation algorithm for computing the gradient of the loss function with respect to the parameters of a neural network. We provide theoretical justification for the algorithm, give several explicit pseudo-code implementations, discuss backpropagation in batches, and include a list of exercises.

[pdf]
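
For orientation, a compressed sketch of the algorithm the notes justify, for a fully-connected network with a pointwise activation and a (one-half) squared-error loss; the numpy implementation and naming are mine, not taken from the notes.

    # Sketch of backpropagation for a fully-connected network with
    # pointwise activation sigma. The forward pass stores pre-activations
    # z and activations a; the backward pass propagates delta = dL/dz
    # layer by layer and reads off the gradients.
    import numpy as np

    sigma = np.tanh
    dsigma = lambda z: 1.0 - np.tanh(z) ** 2

    def forward(x, weights, biases):
        a, zs, acts = x, [], [x]
        for W, b in zip(weights, biases):
            z = W @ a + b
            a = sigma(z)
            zs.append(z); acts.append(a)
        return zs, acts

    def backprop(x, y, weights, biases):
        zs, acts = forward(x, weights, biases)
        delta = (acts[-1] - y) * dsigma(zs[-1])          # output-layer error
        grads_W, grads_b = [], []
        for l in range(len(weights) - 1, -1, -1):
            grads_W.append(np.outer(delta, acts[l]))     # dL/dW_l
            grads_b.append(delta)                        # dL/db_l
            if l > 0:
                delta = (weights[l].T @ delta) * dsigma(zs[l - 1])
        return grads_W[::-1], grads_b[::-1]

    # Tiny usage example with made-up layer sizes 3 -> 4 -> 2.
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    biases = [rng.standard_normal(4), rng.standard_normal(2)]
    gW, gb = backprop(rng.standard_normal(3), rng.standard_normal(2), weights, biases)
    print([g.shape for g in gW])    # [(4, 3), (2, 4)]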

EM algorithm.

These notes provide an exposition of the expectation-maximization (EM) algorithm for clustering. One can regard this algorithm as an unsupervised learning analogue of Linear Discriminant Analysis. The main reference for these notes is Section 9.2 of the book "Applied Machine Learning" by David Forsyth (link). We focus on the theoretical justification for the algorithm rather than implementation details.

[pdf]
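
A bare-bones sketch of one EM iteration for a Gaussian mixture with fixed spherical covariance (illustrative numpy code of mine, not taken from the notes or from the book): the E-step computes soft cluster responsibilities, the M-step re-estimates the mixture weights and means.

    # Illustrative E-step / M-step for a spherical Gaussian mixture.
    # pi: mixture weights, mu: component means; the covariance is held
    # fixed at sigma2 * I to keep the sketch short.
    import numpy as np

    def em_step(X, pi, mu, sigma2):
        n, _ = X.shape
        # E-step: responsibilities r[i, j] = P(cluster j | x_i)
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (n, k)
        log_r = np.log(pi)[None, :] - sq / (2 * sigma2)
        log_r -= log_r.max(axis=1, keepdims=True)               # stabilize
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and means from soft assignments
        Nj = r.sum(axis=0)
        return Nj / n, (r.T @ X) / Nj[:, None], r

    # Usage on made-up two-cluster data.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
    pi, mu = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]])
    for _ in range(20):
        pi, mu, r = em_step(X, pi, mu, sigma2=1.0)
    print(np.round(mu, 2))    # the means move toward the two clusters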

Notes on "Graph Neural Networks are Dynamic Programmers".

These notes delve into the technical aspects of the paper "Graph Neural Networks are Dynamic Programmers" (NeurIPS 2022, arXiv:2203.15544). We formulate integral transforms using bags and lists. The update step of a dynamic programming algorithm can be interpreted as an integral transform; a prominent example is the Bellman-Ford algorithm. Furthermore, the message-passing step of a graph neural network can also be stated in terms of a certain integral transform.

[pdf]
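
To fix ideas, a minimal sketch (mine, not code from the notes or the paper) of the analogy in its simplest form: one Bellman-Ford relaxation round sends a "message" d[u] + w(u, v) along each edge and aggregates the messages at each vertex with a min, which is exactly the shape of a message-passing update.

    # One synchronous Bellman-Ford round:
    #   d[v] <- min(d[v], min over edges (u, v) of d[u] + w(u, v)),
    # i.e. messages along edges followed by min-aggregation at each vertex.
    import math

    def bellman_ford_round(dist, edges):
        new = dict(dist)
        for u, v, w in edges:                  # message along edge (u, v)
            new[v] = min(new[v], dist[u] + w)  # min-aggregation at v
        return new

    edges = [("s", "a", 2.0), ("s", "b", 5.0), ("a", "b", 1.0), ("b", "t", 2.0)]
    dist = {"s": 0.0, "a": math.inf, "b": math.inf, "t": math.inf}
    for _ in range(3):                         # |V| - 1 rounds suffice here
        dist = bellman_ford_round(dist, edges)
    print(dist)    # {'s': 0.0, 'a': 2.0, 'b': 3.0, 't': 5.0}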

Principal Component Analysis.

Principal component analysis is a technique for finding a new ordered basis (or partial basis) of the predictor space in such a way that most of the variability in the data can be captured in fewer dimensions. In these expository notes, we first provide a formulation of principal component analysis from the point of view of finding directions that maximize variability. We then explain the relationship between principal component analysis and the singular value decomposition.

[pdf]
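
A compact sketch of the SVD route described in the notes (the numpy code and variable names are mine): center the data, take a singular value decomposition, and project onto the leading right singular vectors.

    # PCA via the singular value decomposition: the rows of Vt are the
    # principal directions of the centered data, and the singular values
    # give the variance captured along each direction.
    import numpy as np

    def pca(X, k):
        Xc = X - X.mean(axis=0)                     # center each predictor
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        components = Vt[:k]                         # top-k principal directions
        scores = Xc @ components.T                  # coordinates in the new basis
        explained_var = S[:k] ** 2 / (len(X) - 1)   # variance per direction
        return components, scores, explained_var

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
    comps, scores, var = pca(X, k=2)
    print(comps.shape, scores.shape)                # (2, 5) (200, 2)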

Geometric Relational Algebra.

These notes illustrate an abstract take on relational algebra inspired by constructions from category theory and algebraic geometry. We define a category for every header whose objects are relations with that header. Relational algebra operations such as union, join, and product are functors between the appropriate categories. We also use pushforwards and pullbacks to give interpretations of aggregate functions, group by, and window functions.

[pdf]
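
As a down-to-earth counterpart to the abstract formulation, a small sketch (mine) of two of the classical operations the notes reinterpret, on relations modelled as lists of dicts: natural join, and group-by with a sum aggregate.

    # Classical relational-algebra operations on relations modelled as
    # lists of dicts; the notes recast such operations as functors and
    # pushforwards, which this sketch does not attempt to show.
    from collections import defaultdict

    def natural_join(r, s):
        common = set(r[0]) & set(s[0])              # shared attributes
        return [{**a, **b} for a in r for b in s
                if all(a[c] == b[c] for c in common)]

    def group_by_sum(r, key, value):
        totals = defaultdict(float)
        for row in r:
            totals[row[key]] += row[value]
        return [{key: k, "sum_" + value: v} for k, v in totals.items()]

    orders = [{"order": 1, "cust": "a", "amount": 10.0},
              {"order": 2, "cust": "b", "amount": 7.5},
              {"order": 3, "cust": "a", "amount": 2.5}]
    customers = [{"cust": "a", "city": "Vienna"}, {"cust": "b", "city": "Boston"}]

    print(natural_join(orders, customers))          # each order gains a "city"
    print(group_by_sum(orders, "cust", "amount"))   # totals per customer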

Kaggle.

Minor learning projects for gaining familiarity with PyTorch. One project is a computer vision analysis for distinguishing bees from wasps using a convolutional neural network (link to the Kaggle dataset). Another is an NLP sentiment analysis of IMDB movie reviews (link to the Kaggle dataset).

[github]

Bias-Variance.

An analysis of the bias-variance tradeoff in supervised statistical learning. We examine both the regression and classification settings.

[pdf]
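
For reference, the standard decomposition in the regression setting that the analysis revolves around (a textbook formula, stated here in my own notation): at a fixed test point x, with y = f(x) + eps, E[eps] = 0, Var(eps) = sigma^2, and a learned predictor f-hat that is random through the training set,

    \[
      \mathbb{E}\big[(y - \hat f(x))^2\big]
      = \underbrace{\big(f(x) - \mathbb{E}[\hat f(x)]\big)^2}_{\text{bias}^2}
      + \underbrace{\operatorname{Var}\big(\hat f(x)\big)}_{\text{variance}}
      + \underbrace{\sigma^2}_{\text{irreducible noise}}.
    \]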

Principal Component Analysis (alternative take).

A slightly different perspective on principal component analysis, formulated in terms of covariance matrices.

[pdf]