Approximating Stochastic Functions with Multivariate Outputs
by Nicolas Arroyo Duran, Sep 2024

You can reproduce the experiments in this article by cloning https://github.com/narroyo1/pmt.
The previous article in this series, Approximating stochastic functions, introduced a novel method for training generative machine learning models capable of approximating any stochastic function with a single output variable. From this point on I will refer to this method as Pin Movement Training, or PMT for short, because of the analogy it uses of placing pins on a piece of fabric and moving them.
The method was described for functions with any number of inputs X but only a single output Y. The present article generalizes PMT to functions with any number of outputs. A summary of the method is provided below and should be enough to understand how it works, but if you would like a more in-depth description you can read the previous article.
The generalized method, for reasons you will learn below, utilizes an architecture similar to that of autoencoders. Because of this, and because a uniform sampling distribution may be more convenient for many applications, I believe this method is a valid alternative to Variational Autoencoders.
Let's say that we want to use a neural network to approximate a stochastic function defined as f(x) → Y, where x is an input with any number of dimensions in X and Y is a one-dimensional random variable.
The first thing we will do is introduce a secondary input Z, which we define as a uniformly distributed random variable in a range [Z_min, Z_max]. This is necessary in order to introduce randomness into an otherwise deterministic system. This gives us a neural network defined by f_θ(x, z∼Z) → Y, where θ represents the network's trained weights.
Now let's visualize any given point x′, s.t. x′ ∈ X. For this x′ we want to map the whole range [Z_min, Z_max] to Y_x′. That is, f(x′, Z_min) should be as similar as possible to min(Y_x′) and f(x′, Z_max) should be as similar as possible to max(Y_x′). Additionally, the mid-points f(x′, mid(Z)) and mid(Y_x′) should be as similar as possible, and of course the same goes for every other point in the range (see Fig. 1).
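To make this setup concrete, here is a minimal PyTorch sketch (my own illustration, not the code from the repository) of a network that takes the deterministic input x together with a uniformly sampled z; all names and layer sizes below are arbitrary choices:

```python
import torch
import torch.nn as nn

class StochasticApproximator(nn.Module):
    """Deterministic network f_theta(x, z) whose only source of randomness is z."""
    def __init__(self, x_dim: int, z_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, y_dim),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Concatenate the deterministic input x with the random input z.
        return self.net(torch.cat([x, z], dim=-1))

# Sample z uniformly in [z_min, z_max] for a batch of inputs.
z_min, z_max = -1.0, 1.0
x = torch.randn(32, 3)                              # batch of inputs x
z = torch.rand(32, 1) * (z_max - z_min) + z_min     # z ~ U[z_min, z_max]
y_pred = StochasticApproximator(x_dim=3, z_dim=1, y_dim=1)(x, z)
```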
In order to achieve this, let's think of the model f_θ as a stretchable and transparent fabric on which X is represented horizontally and Z is represented vertically. Also, let's imagine a board with all the data points in the dataset plotted on it; on this board X is represented horizontally and Y is represented vertically. We then place the fabric on top of the board.
For every data point we place a "pin" on the fabric at the vertical midpoint of Z, or mid(Z). We then compare the positions of the pin and the data point. If the data point is higher than the pin then, without unpinning it from the fabric, we move the pin upwards a predefined distance so that it lands in a higher position on the board. The pin stretches or shrinks the fabric with this motion. If the data point is lower, we move the pin downwards the same predefined distance. We add up the distances moved upwards and downwards and call the sum the total movement.
After processing every data point, if the pin did not start at the midpoint of the data, the total movement will be greater in the direction of the actual midpoint. After repeating the process enough times, the pin will reach a position close to the midpoint, where the total movement upwards and downwards is equal, that is, where the number of data points above it is the same as the number of data points below it. See Fig. 2 for an animation of how this process stabilizes.
Now suppose that instead of putting the pin at the midpoint of Z we put it at a point 1/3 of the way up the range [Z_min, Z_max] from the lowest point Z_min, and instead of moving it the same predetermined distance in both directions, we move it 1.5 times the predetermined distance when going downwards and 0.75 times the predetermined distance when going upwards. This pin will then reach a stability point (where the total movement upwards and the total movement downwards are equal) at a place roughly above 1/3 of the data points.
This is because at that point distance upwards × data points above = distance downwards × data points below, or 0.75 × 2/3 = 1.5 × 1/3 = 0.5. See Fig. 3 for an animation of how this process stabilizes for pins at Z_min + 1/3 and Z_min + 2/3.
How do we achieve this movement using a neural network? In order to move the "pins on the fabric" with a neural network, we select a value in Z (which we call a z-pin) and do backpropagation with the target value being the z-pin plus/minus the predetermined distance, like this:
Leveraging this principle, we can select points uniformly in Z and, over a number of epochs, obtain the mapping we require, i.e. f_θ(x, z∼Z) → Y.
- In the original article the fabric stretching/shrinking analogy referred to pins that were used to reshape the model; however, the model definitions and training method used the term z-samples to refer to the same concept. In the present article and in the future these will be referred to exclusively as z-pins.
- When selecting z-pins, the original article always placed them evenly distributed in Z and also used the same positions for every data point on every epoch. This is not necessary, though; the only requirement is that the z-pins are uniformly distributed in Z.
- The original article used multiple z-pins per data point. This is also not necessary; it is sufficient to select a single z-pin per data point. In the present article all the experiments will select a single z-pin per data point per epoch.
Having revisited the original method for one output, let's move on to the changes necessary to make it work for multiple outputs.
Let's define Z, the sampling space from which we select our z-pins. In the original article Z was defined as a one-dimensional range described simply by lower and upper bounds [Z_min, Z_max]. However, in the generalized method, and in order to be able to handle multidimensional outputs Y, Z must be defined in multiple dimensions as well (note, however, that the number of dimensions in Z and Y need not be the same).
In theory it could be any bounded n-dimensional space, but because it makes calculating the scalars easier, as you'll see later, I chose to use a hypersphere, which can be defined by an origin O, a radius R and a dimensionality N (see Fig. 4).
Now let's define a few concepts related to Z that will be needed to move ahead.
- z-pins: These are uniformly sampled points in Z. A z-pin can be defined as an N-dimensional vector z-pin = (z₁, z₂, …, z_N), where z₁, z₂, … are coordinates in Z (see the sampling sketch after this list).
- z-dirs: A z-dir is a direction in Z that can be defined as a unit vector based at the origin O, like this: z-dir = O + (ẑ₁, ẑ₂, …, ẑ_N).
- z-lines: A z-line is a line in Z running between any two points in Z. We will define it by a z-pin origin and a z-dir, including all points along it that lie inside Z, like this: z-line = z-pin + t · z-dir, s.t. ∀ z ∈ z-line : z ∈ Z.
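As a concrete illustration of these definitions, here is a minimal NumPy sketch (my own, not taken from the repository) of how z-pins and z-dirs could be sampled. It uses the standard trick of normalizing Gaussian samples to get uniform directions, and scales radii by u^(1/N) so that points are uniform over the hypersphere's volume; the origin and radius values are placeholders:

```python
import numpy as np

def sample_z_pins(count, origin, radius):
    """Uniformly sample points inside the hypersphere Z (origin O, radius R)."""
    n_dims = origin.shape[0]
    directions = np.random.normal(size=(count, n_dims))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Radii drawn as R * u^(1/N) so that the samples are uniform in volume.
    radii = radius * np.random.uniform(size=(count, 1)) ** (1.0 / n_dims)
    return origin + directions * radii

def sample_z_dirs(count, n_dims):
    """Uniformly sample unit direction vectors (points on the hypersphere surface)."""
    directions = np.random.normal(size=(count, n_dims))
    return directions / np.linalg.norm(directions, axis=1, keepdims=True)

origin = np.zeros(2)                                  # example: 2-dimensional Z
z_pins = sample_z_pins(1000, origin, radius=1.0)
z_dirs = sample_z_dirs(1000, n_dims=2)
```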
Moving to a multidimensional Z introduces an important challenge. In the case of one-dimensional Z and Y spaces, it was very simple to tell whether the selected z-pin projection, i.e. f_θ(x, z-pin), was greater or smaller than the observed data point in order to decide which direction to move it in. In one dimension, "greater than" in Y could simply be translated to "greater than" in Z and the z-pin could simply be moved up. This is because we were mapping a line to another line.
But with a multidimensional Y and Z it is not possible to assume that the spaces will have the same shape or even the same number of dimensions, which means that in order to decide the direction in which to move a z-pin based on its relation to a data point, it is necessary to map that data point from Y to Z. This means that in addition to training the function f_θ to generate values in Y, we'll also need to train an inverse function f_θ⁻¹ to map data points to Z. This changes our model architecture to something like this:
The left side of the model allows us to map points in Y to Z. The right side of the model allows us to generate random samples in Y by sampling points in Z.
You may have noticed that this architecture is similar to that of plain autoencoders, and indeed it is. This has the added benefit of making the method useful for learning latent representations that are bounded and evenly distributed.
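For concreteness, a minimal PyTorch sketch of this encoder/decoder pair could look like the following (again, my own illustration rather than the repository's code; layer sizes and names are arbitrary), with the encoder playing the role of f_θ⁻¹ and the decoder the role of f_θ:

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class PMTModel(nn.Module):
    """Autoencoder-like pair: encoder acts as f_theta^-1, decoder as f_theta."""
    def __init__(self, y_dim: int, z_dim: int):
        super().__init__()
        self.encoder = mlp(y_dim, z_dim)   # f_theta^-1 : Y -> Z
        self.decoder = mlp(z_dim, y_dim)   # f_theta    : Z -> Y

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # Reconstruction path, exactly as in an ordinary autoencoder.
        return self.decoder(self.encoder(y))
```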
Having defined all the concepts we need we can proceed to discuss how pin movement works in multiple dimensions.
The first step is to use the inverse function f_θ⁻¹ (or encoder, to use autoencoder terminology) to map all data points in the batch from the Y space to Z. We will call the original data points y-data and the mapped data points z-data.
Next we must select some z-pins. In order to do so, we start by selecting uniformly sampled z-dirs, one for every data point. The easiest way to do this is by choosing random points on the surface of a hypersphere with the same dimensionality as Z. Then we translate the selected z-dirs so that the data points mapped in the previous step, z-data, become their origins. This gives us a set of z-lines, as you can see in Fig. 7.
Once we have our z-lines, we proceed to randomly select points on these lines; these will be our z-pins. Fig. 8 shows what this can look like.
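Here is a sketch of this selection step under the assumption that Z is the hypersphere defined earlier: for each z-data point and unit z-dir, the segment of the z-line inside Z is found by intersecting the line with the sphere, and a z-pin is then drawn uniformly on that segment. The shapes and values below are purely illustrative:

```python
import numpy as np

def select_z_pins(z_data, z_dirs, origin, radius):
    """For each mapped data point, pick a z-pin uniformly on its z-line inside Z.

    z_data: (batch, N) encoder outputs, assumed to lie inside the hypersphere
    z_dirs: (batch, N) unit direction vectors
    """
    offset = z_data - origin
    # Solve |offset + t * dir|^2 = R^2 for t (the dirs are unit vectors).
    b = np.sum(z_dirs * offset, axis=1)
    c = np.sum(offset * offset, axis=1) - radius ** 2
    disc = np.sqrt(np.maximum(b * b - c, 0.0))
    t_min, t_max = -b - disc, -b + disc          # where the line enters/exits Z
    t = np.random.uniform(t_min, t_max)          # uniform along the segment
    return z_data + t[:, None] * z_dirs, t_min, t_max

# Example usage with made-up values for a 2-dimensional Z.
origin, radius = np.zeros(2), 1.0
z_data = np.random.uniform(-0.5, 0.5, size=(8, 2))
z_dirs = np.random.normal(size=(8, 2))
z_dirs /= np.linalg.norm(z_dirs, axis=1, keepdims=True)
z_pins, t_min, t_max = select_z_pins(z_data, z_dirs, origin, radius)
```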
For the method to work, it is necessary that for any given z-line in Z, every mapped data point z-data on it has an equal probability of occurring; otherwise the equations in Calculating the movement scalars would not hold up.
Given a 2-dimensional Z and any given z-line in it, let's picture the line as having a minimal width ε, so that it looks like a long rectangle, similar to the z-lines in Fig. 8. The probability of any given z lying in it is the area of this "thin" z-line over the area of Z.
Since this "thin" z-line is rectangular, any given segment s of minimal length δ along its length has an equal area, and therefore any given z has an equal probability of falling in any one segment.
Also, the probability of any given z inside this "thin" z-line selecting the z-dir of this "thin" z-line is constant, given that the z-dirs are selected using a uniform distribution.
Taking equations (2) and (3) together, we get the probability of any z being on a given segment of a z-line and selecting that same z-dir, and this probability is the same for every segment, which satisfies the requirement above.
Since this probability is independent of the position of z on the z-line, the distribution on any z-line is uniform.
After selecting the z-pins, we can proceed to calculate the target values (or z-targets) to use in our backpropagation. All we have to do is add to every z-pin the movement constant M in the direction of the mapped data point z-data.
Fig. 9 shows how the z-targets are calculated.
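In code this step is essentially a one-liner; the sketch below assumes z_pins and z_data arrays like the ones from the previous sketches and a hypothetical value for the movement constant M:

```python
import numpy as np

def compute_z_targets(z_pins, z_data, movement):
    """Move each z-pin a fixed distance `movement` towards its mapped data point."""
    direction = z_data - z_pins
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    return z_pins + movement * direction

# Tiny standalone example; during training z_pins and z_data come from the previous steps.
z_pins = np.array([[0.1, 0.2], [-0.3, 0.4]])
z_data = np.array([[0.5, -0.1], [0.0, 0.0]])
z_targets = compute_z_targets(z_pins, z_data, movement=0.05)  # M = 0.05 is hypothetical
```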
The way that movement scalars are calculated is similar to the way it was done in the original one-dimensional method.
Let's start by picturing a z-line along with a z-pin and some mapped data points z-data, as shown in Fig. 10.
Let's call a the distance from the z-pin to one end of the z-line and b the distance to the other end. And let's call the number of data points on the former side a′ and the number of data points on the latter side b′. Our purpose is to make the quantity a′ proportional to the distance a and b′ proportional to b, i.e. a : b :: a′ : b′.
Next, we will call α the scalar that we will apply to the movement of the z-pin for all data points on the side of length a, and β the scalar that we will apply for all data points on the side of length b.
We will also call T the total movement, which is the sum, over every data point, of moving the z-pin a constant distance M towards that data point's side, multiplied by that side's scalar.
We want T to be 0 (i.e. stabilized) when a′/(a′+b′) ≈ a/(a+b) ∧ b′/(a′+b′) ≈ b/(a+b), that is, when the z-pin divides the data points in the intended proportion on both sides. Substituting T with 0 in (5) gives us the equation:
Now let's remember that not all z-lines will have the same length; since they are bounded by the hypersphere that defines Z, the ones towards the center will be longer than the ones near the edges. Longer lines represent larger spaces in Z (see equation (1)), so their influence on the movement should be proportional to their length. We want T to be linearly proportional to the length of the z-line, which gives us the equation:
Putting (6) and (7) together, we get that the scalars should have these values:
These equations are similar to the one in the original article.
You may have noticed that these equations break towards the edges, i.e. when either a or b tends to 0.
In order to solve this problem, a maximum scalar constant S is introduced to clamp the scalars. Of course, when clamping the scalars we have to be careful to adjust the value for both sides; for example, if a is very small (and therefore α is large) but the data point is on side b, the scalar β must be adjusted as well, otherwise equation (5) will not hold.
We start by selecting the larger of the 2 scalars, max(α, β). Then we calculate an adjustment value by dividing S by max(α, β) and clamping the result to 1.0 so that it is always a number in the range [0, 1]. We use this adjustment value to prevent the scalars from going over S. Finally, if a is 0.0, then the values of α and β are S and 0.0 respectively, and the other way around if b is 0.0. This gives us the revised equation (8b).
Below you can see what the plots of the scalars look like as functions of a and b. Notice how they are clamped beyond the selected S.
Having calculated both scalars we can choose the one to use by determining the side on which the data point resides.
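The exact form of equations (8) and (8b) appears in the article's figures, so the sketch below is my own reconstruction: the formula α = (a+b)/(2a), β = (a+b)/(2b) satisfies the constraints stated above (α·a = β·b at stability, proportional to the z-line length, and blowing up as a or b tends to 0) and reproduces the 1D example's 1.5 and 0.75 scalars for a pin at one third of the range. The clamping then follows the description in the previous paragraphs:

```python
import numpy as np

def movement_scalars(a, b, max_scalar):
    """Scalars alpha (side a) and beta (side b), with clamping by the constant S.

    Assumes alpha = (a + b) / (2a) and beta = (a + b) / (2b), my reading of
    equation (8); the published formula is shown in the article's figures.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    length = a + b
    # Guard against division by zero; the degenerate edges are handled below.
    alpha = length / (2.0 * np.maximum(a, 1e-12))
    beta = length / (2.0 * np.maximum(b, 1e-12))
    # The adjustment keeps the alpha:beta ratio while capping the larger one at S.
    adjustment = np.minimum(1.0, max_scalar / np.maximum(alpha, beta))
    alpha, beta = alpha * adjustment, beta * adjustment
    # Edge cases of equation (8b): the whole budget S goes to the occupied side.
    alpha = np.where(a == 0.0, max_scalar, np.where(b == 0.0, 0.0, alpha))
    beta = np.where(b == 0.0, max_scalar, np.where(a == 0.0, 0.0, beta))
    return alpha, beta

# a and b follow from the line/sphere intersection: a = t - t_min, b = t_max - t.
alpha, beta = movement_scalars(a=[0.2, 0.0], b=[0.8, 1.0], max_scalar=4.0)
```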
Now that all the concepts involved are clear we can move on to describe the training algorithm.
The algorithm described assumes that the models f_θ⁻¹ and f_θ are inverses of each other. This can lead to a slow start if we train these 2 models to match each other at the same time as we do pin movement. So it has been found beneficial to run a "pretrain" stage in which we only train f_θ⁻¹ and f_θ to match each other. This stage is essentially ordinary autoencoder training. After the reconstruction error has reached a reasonably low value, the algorithm can proceed to the main training.
This pretrain stage has the added advantage that it makes it easier to define Z once it completes. In the section Redefining the Z-space it was mentioned that Z is defined by an origin O and a radius R. Having pretrained the model for some time, all we do is run a batch of data points through the inverse model to calculate a set z-data.
Then we take the mean of this set and use it as the origin O.
We can also use the mean distance from z-data to O as R; however, it has been observed that experimenting with and tuning this value may give better results.
This works because after the "pretrain" stage the model has found a region capable of representing the data, so defining Z in its vicinity will likely yield a low reconstruction error.
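A minimal sketch of this pretrain stage and of deriving O and R from it, reusing the hypothetical PMTModel from the earlier sketch (the loader is assumed to yield (y_batch,) tuples, e.g. from a TensorDataset):

```python
import torch
import torch.nn.functional as F

def pretrain(model, loader, epochs=10, lr=1e-3):
    """Ordinary autoencoder training: only the reconstruction error is minimized."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (y_batch,) in loader:
            loss = F.mse_loss(model(y_batch), y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def define_z_space(model, y_batch):
    """Derive the hypersphere Z from the encoder outputs after pretraining."""
    with torch.no_grad():
        z_data = model.encoder(y_batch)
    origin = z_data.mean(dim=0)                        # O: mean of z-data
    radius = (z_data - origin).norm(dim=1).mean()      # R: mean distance (tunable)
    return origin, radius
```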
To start pin movement, we select a batch of data y-data = {y-data₁, y-data₂, …, y-dataₙ} from the training dataset and map it to z-data = {z-data₁, z-data₂, …, z-dataₙ} as explained in Mapping data points to Z.
The next step is to randomly select the set of z-pins {z-pin₁, z-pin₂, …, z-pinₙ} (one for every data point in the batch) in the way described in the section Selecting the Z-pins.
Note that it is possible to select multiple z-pins per data point, but it is not necessary, and for simplicity we will use only one in the experiments.
Then we calculate the target values z-targets = {z-target₁, z-target₂, …, z-targetₙ} and the scalars s = {s₁, s₂, …, sₙ} as explained in the sections Calculating the target values and Calculating the movement scalars.
Having the z-targets, we calculate the current model predictions by running them through f_θ, which gives us:
Now we have everything for the first component of our loss function:
Notice that we are using a Weighted Mean Absolute Error (WMAE) function instead of a Weighted Mean Squared Error (WMSE). This is because the latter is designed to punish larger differences more, while we are moving all of our pins the same distance.
The next component of our loss function is the difference between our model f_θ and our inverse model f_θ⁻¹. This is very similar to the Reconstruction Loss in both variational and ordinary autoencoders. It requires passing the batch data points to f_θ⁻¹, feeding the results to f_θ and then running backpropagation on the difference between the results and the original data points.
Before we define the last component of the loss function, let's explain why it is necessary. Ideally, at the end of training both f_θ and f_θ⁻¹ will be bijective, meaning that there will be an exact one-to-one correspondence between the domain and codomain Z and Y. However, during training this is not guaranteed to be the case, and it is possible that some areas in Z will not be mapped to Y.
As you can see in Fig. 12, as a result of training with the component loss-y, f_θ and f_θ⁻¹ agree with each other as far as Y goes. That is, ∀y ∈ Y, f_θ(f_θ⁻¹(y)) ≈ y. However, not all of Z is used, and some points in Y map outside of it. This is a problem because it breaks the assumption that moving a z-pin lands it at a position that maps to a point in Y on which both f_θ and f_θ⁻¹ agree.
Fig. 12 shows 2 of the problems that may happen. A "fold" in the fabric happens when 2 or more points in Z map to the same point in Y. An "out-of-bounds" happens when a point in Z maps to a point outside of Y.
In order to solve this problem, we add a third component to the loss function:
What this does is synchronize f_θ and f_θ⁻¹ with regard to Z, and it does so by selecting random points in Z instead of using the training set data points.
Notice that for both the reconstruction loss and the inverse reconstruction loss we simply use Mean Squared Error (MSE).
Now that we have all the components of the loss function, all that's left is to define weights for each of them, which we will name W-p, W-y and W-z. We can put together (10), (11) and (12) to define the loss function like this:
All that's left after this is to run backpropagation on the loss.
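Putting the three components together, a training-step sketch could look like the following. The exact equations (9) through (13) are shown as figures in the article, so this follows my reading of the text: loss-p is a scalar-weighted MAE between f_θ(z-target) and y-data, loss-y is the usual reconstruction MSE, and loss-z is the inverse reconstruction MSE on random points of Z; all names and weights are placeholders:

```python
import torch
import torch.nn.functional as F

def pmt_loss(model, y_data, z_targets, scalars, z_random, w_p, w_y, w_z):
    """Combined PMT loss: pin movement + reconstruction + inverse reconstruction.

    z_targets and scalars come from the steps sketched earlier (as tensors);
    z_random are uniformly sampled points in Z; w_p, w_y, w_z are hyperparameters.
    """
    # (10) pin movement: weighted MAE between f_theta(z-target) and the data.
    y_pred = model.decoder(z_targets)
    loss_p = (scalars * (y_pred - y_data).abs().mean(dim=1)).mean()

    # (11) reconstruction: f_theta(f_theta^-1(y)) should give back y.
    loss_y = F.mse_loss(model.decoder(model.encoder(y_data)), y_data)

    # (12) inverse reconstruction: f_theta^-1(f_theta(z)) should give back z.
    loss_z = F.mse_loss(model.encoder(model.decoder(z_random)), z_random)

    # (13) weighted sum of the three components.
    return w_p * loss_p + w_y * loss_y + w_z * loss_z
```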
In the original article we used goal 1 and goal 2 testing, which measured the density of data points between z-pins and compared it to the densities in the test dataset. However, in a multi-dimensional space this approach is not practical, since the number of spaces between z-pins grows rapidly.
The original article also used the Earth Mover's Distance (EMD) as an indicator of the model's performance. For multi-dimensional PMT we will use EMD to measure the model's accuracy. We will define the EMD error by comparing data points from the training dataset against data points generated by the PMT model.
And in order to have an idea of what the lowest achievable EMD error would be, we will also calculate a base EMD by comparing data points from the training dataset against data points in the test dataset.
This gives us a baseline that we can compare against E-emd to measure the accuracy of the models.
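One way to compute this metric (a sketch of one possible approach, not necessarily what the repository does) is with the POT library, which solves the optimal transport problem between two point clouds:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def emd_error(samples_a, samples_b):
    """Earth Mover's Distance between two equally weighted sets of points."""
    cost = ot.dist(samples_a, samples_b, metric="euclidean")  # pairwise cost matrix
    weights_a = np.full(len(samples_a), 1.0 / len(samples_a))
    weights_b = np.full(len(samples_b), 1.0 / len(samples_b))
    return ot.emd2(weights_a, weights_b, cost)                # optimal transport cost

# E-emd: training data vs. model samples; base EMD: training data vs. test data.
# e_emd = emd_error(train_points, generated_points)
# base_emd = emd_error(train_points, test_points)
```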
The generative model most similar to PMT is the Variational Autoencoder (VAE). It has an almost identical neural network architecture and acts both as a generative model and as a latent representation mapper. The biggest difference between the two is that the source distribution in VAE is unbounded (Gaussian) while the one in PMT is bounded (uniform).
The experiments show, however, that for both bounded and unbounded target distributions PMT outperforms VAE. Furthermore, the reconstruction error in PMT is significantly lower than in VAE. The reason for this may be that the components of the loss function cooperate with each other in PMT as opposed to competing with each other in VAE. Also, because the target distribution is uniform, the spacing between data points in Z can be larger.
Another difference is that PMT takes a larger number of hyperparameters: S (maximum scalar), W-p (pin movement weight), W-y (reconstruction loss weight), W-z (inverse reconstruction loss weight) and M (movement constant), compared to VAE, whose only extra hyperparameter is the KLD weight. This may make PMT training more difficult.
Finally, PMT takes longer to train per epoch than VAE; this is because it is necessary to do an extra pass to calculate the z-targets, and also because the loss function has an additional component (see equation (12)).
Now I will try out the model on several datasets. In order to make them easier to plot, the experiments presented will have no X inputs.
Due to the similarities with VAE, every experiment will be done using both PMT and VAE models for comparison. In every experiment both models will have identical neural network architectures.
The source code and everything needed to reproduce the experiments below can be found at https://github.com/narroyo1/pmt.
The first dataset I'll try is generated using make_blobs() from the sklearn library. As its name suggests, it generates a number of Gaussian blobs, and it is a good dataset to test how PMT performs with unbounded datasets.
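For reference, generating such a dataset takes just a couple of lines; the sample count and number of centers below are arbitrary choices, not necessarily the ones used in the experiments:

```python
from sklearn.datasets import make_blobs

# 2D Gaussian blobs; the labels are discarded since PMT only needs the points.
y_data, _ = make_blobs(n_samples=5000, centers=3, n_features=2, random_state=0)
```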