A very simple case study to show the limits of discriminator networks in searching for new processes (make parallels with high energy particle physics if you wish!) and an equally simple setup for anomaly detection.

Imports

We will only need a few usual packages: numpy for array manipulation and random number generation, just enough of keras to build a dense neural network, pyplot to make simple plots, pandas and sklearn for data manipulation.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
Using TensorFlow backend.

Data generation

Now we define a custom function to generate our data. You can easily play with the various parameters and run the notebook again. Note the following points:

  • 3 pandas dataframes are created, corresponding to 3 sets of data: “background” (bkg), “signal” (sig) and “hidden” (hid)
  • each contains 4 variables and a label (0 for bkg and 1 for sig and hid)
  • the variables for the background sample correspond to a set of decaying exponentials (smoothly falling curve), whereas the signal and hidden samples are represented by Gaussian distributions
  • the numbers of sig and hid events can easily be changed

The scenario is the following: we have a well-modelled background process and we are looking for a new signal process that will look like an obvious bump in the tail of our distribution. However, unbeknownst to us, we’re ignoring the possibility of a more localised, tinier bump within the bulk of the background distribution.

def generate_input_data(n=1000):
    n_background = n
    n_signal     = int(n*0.5)
    n_hidden     = int(n*0.1)
    
    # note: the same uniform random numbers are reused for all four background
    # variables below, so they are just rescaled copies of one falling exponential
    random_numbers = np.random.rand(n_background)
    
    variable1_bkg = -30*np.log(random_numbers)
    variable1_sig = np.random.normal(105,15,n_signal)
    variable1_hid = np.random.normal(30,7,n_hidden)
    
    variable2_bkg = -35*np.log(random_numbers)
    variable2_sig = np.random.normal(115,10,n_signal)
    variable2_hid = np.random.normal(45,7,n_hidden)
    
    variable3_bkg = -25*np.log(random_numbers)
    variable3_sig = np.random.normal(95,25,n_signal)
    variable3_hid = np.random.normal(20,15,n_hidden)
    
    variable4_bkg = -30*np.log(random_numbers)
    variable4_sig = np.random.normal(95,20,n_signal)
    variable4_hid = np.random.normal(20,10,n_hidden)
    
    df_bkg = pd.DataFrame({"var1":variable1_bkg, "var2":variable2_bkg, "var3":variable3_bkg, "var4":variable4_bkg,
                          "label":0})
    df_sig = pd.DataFrame({"var1":variable1_sig, "var2":variable2_sig, "var3":variable3_sig, "var4":variable4_sig,
                          "label":1})
    df_hid = pd.DataFrame({"var1":variable1_hid, "var2":variable2_hid, "var3":variable3_hid, "var4":variable4_hid,
                          "label":1})
    
    return df_bkg, df_sig, df_hid

Let’s generate our three samples, and concatenate them (for later).

df_bkg, df_sig, df_hid = generate_input_data(10000)
df_mixed_bkg_sig = pd.concat([df_bkg,df_sig])
df_mixed_bkg_sig_hid = pd.concat([df_bkg,df_sig,df_hid])

Discriminator approach

Ignorant of the extra bump hiding in our data, we take the usual approach of building a deep neural network to discriminate our simulated signal sample from the background one.

Note that the particular network below is incredibly basic and far from optimal - but it makes for nicer-looking plots, since the separation task is, by default, actually not that hard! (feel free to make your own changes)

neuralnet = Sequential()
neuralnet.add(Dense(3, activation='relu', kernel_initializer='random_normal', input_shape=(4,)))
neuralnet.add(Dense(2, activation='relu', kernel_initializer='random_normal'))
neuralnet.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))
neuralnet.summary()
neuralnet.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 3)                 15        
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 8         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 3         
=================================================================
Total params: 26
Trainable params: 26
Non-trainable params: 0
_________________________________________________________________

We now split our mixture of background and signal samples into a training (80%) and a testing (20%) set, and reweight our classes to get around the likely class imbalance issue: because we may have many more background events than signal events, the network could simply learn to label everything as background and still score quite well!

X_train, X_test, y_train, y_test = train_test_split(df_mixed_bkg_sig[["var1","var2","var3","var4"]].values,
                                                    df_mixed_bkg_sig["label"].values,
                                                    test_size=0.2, random_state=42)
# class weights from the training labels: down-weight the more abundant background class
reweight = (y_train==1).sum()/(y_train==0).sum()
neuralnet.fit(x=X_train, y=y_train, batch_size=25, epochs=3, validation_split=0.1, class_weight={0:reweight, 1:1.})
Train on 10800 samples, validate on 1200 samples
Epoch 1/3
10800/10800 [==============================] - 1s 131us/step - loss: 0.3617 - acc: 0.5692 - val_loss: 0.3200 - val_acc: 0.7192
Epoch 2/3
10800/10800 [==============================] - 1s 79us/step - loss: 0.2917 - acc: 0.7976 - val_loss: 0.2652 - val_acc: 0.8367
Epoch 3/3
10800/10800 [==============================] - 1s 75us/step - loss: 0.2443 - acc: 0.8683 - val_loss: 0.2259 - val_acc: 0.8783
keras_score = neuralnet.evaluate(x=X_test, y=y_test, batch_size=25)
print("Test loss:     {:.3f}".format(keras_score[0]))
print("Test accuracy: {:.3f}".format(keras_score[1]))
3000/3000 [==============================] - 0s 33us/step
Test loss:     0.413
Test accuracy: 0.876

So it seems our (unoptimised, poorly trained) network performs nicely, reaching an accuracy of about 88% on the held-out test set. Not bad!

Let’s now see what the output of the network looks like, for all 2+1 classes…

keras_prediction_bkg = neuralnet.predict(df_bkg[["var1","var2","var3","var4"]])
keras_prediction_sig = neuralnet.predict(df_sig[["var1","var2","var3","var4"]])
keras_prediction_hid = neuralnet.predict(df_hid[["var1","var2","var3","var4"]])
bins = np.linspace(0,1,51)
plt.hist(keras_prediction_bkg, bins, density=True, label="bkg", alpha=0.4)
plt.hist(keras_prediction_sig, bins, density=True, label="sig", alpha=0.4)
plt.hist(keras_prediction_hid, bins, density=True, label="hid", alpha=0.4)
plt.legend()

[Figure: normalised network-output distributions for the bkg, sig and hid samples]

If you forget about the (unknown!) green distribution (the “hidden signal”), our simple neural network is performing great! You could put a decision boundary at around 0.6 (calling everything above it signal-like and everything below it background-like) and get a very good true positive rate.

But doing so, you would completely miss the additional “hidden” signal component!
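
To make that last statement a bit more concrete, here is a minimal sketch (reusing the predictions computed above and the illustrative 0.6 cut) that estimates which fraction of each sample would land in the signal-like region - the exact numbers will of course depend on your random seeds and training settings:

cut = 0.6
# fraction of each sample above the cut, i.e. classified as signal-like
frac_bkg_above = (keras_prediction_bkg > cut).mean()
frac_sig_above = (keras_prediction_sig > cut).mean()
frac_hid_above = (keras_prediction_hid > cut).mean()
print("Fraction above cut - bkg: {:.3f}, sig: {:.3f}, hid: {:.3f}".format(
      frac_bkg_above, frac_sig_above, frac_hid_above))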

Is it really that bad?

Short answer, no.

This is an incredibly simplistic approach to anomaly detection, and the discussion of real-world cases might look quite different. First of all, notice how the network output above looks very similar to the distributions we initially generated for our three samples: this is not surprising, because all 4 variables we created looked pretty much the same!

Our network was therefore successful in preserving, at least partially, the distances between the samples: the signal component was a clear bump in a smooth background tail, whereas the “hidden” signal component had a much smaller contribution to the bulk and hence is much harder to separate effectively. See in the blind plots of the 4 variables below whether you can tell them apart:

fig, axs = plt.subplots(2, 2)
_=axs[0, 0].hist(df_mixed_bkg_sig_hid["var1"].values, 100)
_=axs[0, 1].hist(df_mixed_bkg_sig_hid["var2"].values, 100)
_=axs[1, 0].hist(df_mixed_bkg_sig_hid["var3"].values, 100)
_=axs[1, 1].hist(df_mixed_bkg_sig_hid["var4"].values, 100)

[Figure: 2x2 grid of histograms of var1-var4 for the combined bkg+sig+hid sample]

If your data analysis is sound, you will look both to the left and to the right of your decision boundary in the network-output plot above. That is, in the signal-like region you might be able to claim a discovery if your experimental data lines up well with the simulated sample; and in the background-like region you might observe some discrepancies too, since you forgot to include the “hidden” sample!

You might then re-evaluate your background modelling and come to one of three possible conclusions:

  1. the “hidden” component is in fact small enough that its contribution is within the statistical and systematic uncertainties of the background - you won’t know it was there at all;
  2. in a dedicated analysis of the background component, you’ll realise that there are statistically significant and correlated deviations in multiple observables - great, you’ve discovered the “hidden” signal component independently!
  3. you don’t perform this extensive review of your background modelling and simply apply some shape and/or normalisation correction factors - this affects your background estimate in the signal-like region, but might still lead to a discovery of the main signal component (and like in point 1, you won’t know about the “hidden” signal!)

And of course, as you move the decision boundary around and change the simulated properties of the “hidden” component, the situation might get even more complicated, with a spillage of “hidden” signal counts in both regions…

A natural work-around: anomaly detection

In the previous section, we’ve tried to highlight a limitation of model dependence in searches for new processes (I’m still implicitly drawing strong parallels to high energy particle physics here, but I guess it could be any topic). Without a proper description (here, simulation model) of all signal components, we will run into trouble.

As a work-around, we could drop model dependence altogether. After all, we’re only sure of the existence of the background process: let’s focus on it. In what follows, we will train a very basic, unoptimised auto-encoder. This is a dense neural network that will take the inputs (our 4 variables), pass them through a 4-3-2-3-4 neural structure and try to output the same values again.

Two crucial points here:

  1. the bottleneck - with only 2 nodes available in the center of the autoencoder, the network is forced to find a lower-dimensional representation of our input data. The 4-3 and 3-4 parts of the network can then be seen as the encoder and the decoder, respectively.
  2. we are defining the loss to be a reconstruction error of the form $\frac{1}{4}\sum_{i=1}^{4}\left(x_i^{\rm out} - x_i^{\rm in}\right)^2$ (like the mse loss) and training only on the background sample - that way, non-background-like processes should have higher reconstruction errors, since our network has never seen them before.

There are many caveats to point 2 above, but they are beyond the scope of this short demonstration…

autoencoder = Sequential()
autoencoder.add(Dense(4, activation='relu', kernel_initializer='random_normal', input_shape=(4,)))
autoencoder.add(Dense(3, activation='linear', kernel_initializer='random_normal'))
autoencoder.add(Dense(2, activation='sigmoid', kernel_initializer='random_normal'))
autoencoder.add(Dense(3, activation='relu', kernel_initializer='random_normal'))
autoencoder.add(Dense(4, activation='linear', kernel_initializer='random_normal'))
autoencoder.summary()
# note: "accuracy" is meaningless for this regression-style target, we only track the mse loss
autoencoder.compile(loss="mse", optimizer="adam", metrics=["accuracy"])
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)             (None, 4)                 20        
_________________________________________________________________
dense_5 (Dense)             (None, 3)                 15        
_________________________________________________________________
dense_6 (Dense)             (None, 2)                 8         
_________________________________________________________________
dense_7 (Dense)             (None, 3)                 9         
_________________________________________________________________
dense_8 (Dense)             (None, 4)                 16        
=================================================================
Total params: 68
Trainable params: 68
Non-trainable params: 0
_________________________________________________________________
X_train, X_test, y_train, y_test = train_test_split(df_bkg[["var1","var2","var3","var4"]].values,
                                                    df_bkg["label"].values,
                                                    test_size=0.1, random_state=42)
history = autoencoder.fit(x=X_train, y=X_train,
                          batch_size=10, epochs=25, validation_split=0.1)
Train on 8100 samples, validate on 900 samples
Epoch 1/25
8100/8100 [==============================] - 2s 268us/step - loss: 1750.5724 - acc: 1.0000 - val_loss: 1513.6703 - val_acc: 1.0000
Epoch 2/25
8100/8100 [==============================] - 2s 189us/step - loss: 1346.3003 - acc: 1.0000 - val_loss: 1055.3551 - val_acc: 1.0000
Epoch 3/25
8100/8100 [==============================] - 2s 209us/step - loss: 921.6960 - acc: 1.0000 - val_loss: 698.2426 - val_acc: 1.0000
Epoch 4/25
8100/8100 [==============================] - 2s 196us/step - loss: 630.6976 - acc: 1.0000 - val_loss: 480.9412 - val_acc: 1.0000
Epoch 5/25
8100/8100 [==============================] - 2s 192us/step - loss: 461.5699 - acc: 1.0000 - val_loss: 362.1350 - val_acc: 1.0000
Epoch 6/25
8100/8100 [==============================] - 2s 234us/step - loss: 367.5061 - acc: 1.0000 - val_loss: 294.5653 - val_acc: 1.0000
Epoch 7/25
8100/8100 [==============================] - 2s 195us/step - loss: 299.9588 - acc: 1.0000 - val_loss: 229.2217 - val_acc: 1.0000
Epoch 8/25
8100/8100 [==============================] - 2s 197us/step - loss: 231.2490 - acc: 1.0000 - val_loss: 173.2358 - val_acc: 1.0000
Epoch 9/25
8100/8100 [==============================] - 2s 219us/step - loss: 179.0970 - acc: 1.0000 - val_loss: 133.9912 - val_acc: 1.0000
Epoch 10/25
8100/8100 [==============================] - 2s 200us/step - loss: 140.2602 - acc: 1.0000 - val_loss: 104.3013 - val_acc: 1.0000
Epoch 11/25
8100/8100 [==============================] - 2s 252us/step - loss: 111.2060 - acc: 1.0000 - val_loss: 81.9585 - val_acc: 1.0000
Epoch 12/25
8100/8100 [==============================] - 2s 205us/step - loss: 88.9883 - acc: 1.0000 - val_loss: 65.9761 - val_acc: 1.0000
Epoch 13/25
8100/8100 [==============================] - 2s 220us/step - loss: 71.9570 - acc: 1.0000 - val_loss: 53.1596 - val_acc: 1.0000
Epoch 14/25
8100/8100 [==============================] - 2s 205us/step - loss: 58.8327 - acc: 1.0000 - val_loss: 43.0322 - val_acc: 1.0000
Epoch 15/25
8100/8100 [==============================] - 2s 226us/step - loss: 48.8037 - acc: 1.0000 - val_loss: 34.8757 - val_acc: 1.0000
Epoch 16/25
8100/8100 [==============================] - 2s 196us/step - loss: 40.9370 - acc: 1.0000 - val_loss: 29.0557 - val_acc: 1.0000
Epoch 17/25
8100/8100 [==============================] - 2s 214us/step - loss: 34.8747 - acc: 1.0000 - val_loss: 24.9867 - val_acc: 1.0000
Epoch 18/25
8100/8100 [==============================] - 2s 218us/step - loss: 30.1346 - acc: 1.0000 - val_loss: 20.8473 - val_acc: 1.0000
Epoch 19/25
8100/8100 [==============================] - 2s 211us/step - loss: 26.2875 - acc: 1.0000 - val_loss: 18.7480 - val_acc: 1.0000
Epoch 20/25
8100/8100 [==============================] - 2s 210us/step - loss: 23.2625 - acc: 1.0000 - val_loss: 15.7640 - val_acc: 1.0000
Epoch 21/25
8100/8100 [==============================] - 2s 198us/step - loss: 20.7090 - acc: 1.0000 - val_loss: 15.0290 - val_acc: 1.0000
Epoch 22/25
8100/8100 [==============================] - 2s 211us/step - loss: 18.6336 - acc: 1.0000 - val_loss: 11.5554 - val_acc: 1.0000
Epoch 23/25
8100/8100 [==============================] - 2s 223us/step - loss: 16.5186 - acc: 1.0000 - val_loss: 10.5755 - val_acc: 1.0000
Epoch 24/25
8100/8100 [==============================] - 2s 197us/step - loss: 15.0918 - acc: 1.0000 - val_loss: 9.0644 - val_acc: 1.0000
Epoch 25/25
8100/8100 [==============================] - 2s 199us/step - loss: 13.8251 - acc: 1.0000 - val_loss: 7.7271 - val_acc: 1.0000

Let’s take a quick look at the history of our loss as a function of the number of epochs. This is the natural quantity to monitor, since (by construction) we’ve given up on the notion of “accuracy”. A small reconstruction loss means an efficient background auto-encoder:

loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(25)
plt.plot(epochs, loss, 'g', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.gca().set_yscale("log")
plt.title('Training and validation loss')
plt.legend()

[Figure: training and validation loss vs epoch, log scale on the y axis]

(In some cases there’s a bit of overtraining noticeable at the end, but I’m really not that bothered… Feel free to play around and optimise the auto-encoder!)
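
As a side note, if you are curious about what the 2-node bottleneck actually learns, here is a minimal sketch (assuming the layer ordering of the Sequential model defined above, where the third Dense layer is the bottleneck) that extracts and plots the latent representation of the background sample:

from keras.models import Model

# build an encoder-only model reusing the trained layers up to the 2-node bottleneck
encoder = Model(inputs=autoencoder.input, outputs=autoencoder.layers[2].output)
latent_bkg = encoder.predict(df_bkg[["var1","var2","var3","var4"]].values)
plt.scatter(latent_bkg[:,0], latent_bkg[:,1], s=2, alpha=0.3)
plt.xlabel("latent dimension 1")
plt.ylabel("latent dimension 2")

The exact shape of this latent cloud will vary from run to run, but it gives a feel for the compressed representation the decoder has to work with.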

Applying the auto-encoder

Let’s now use the auto-encoder we’ve just trained (on the background sample) to evaluate the two signal components as well:

ae_prediction_bkg = autoencoder.predict(df_bkg[["var1","var2","var3","var4"]])
ae_prediction_sig = autoencoder.predict(df_sig[["var1","var2","var3","var4"]])
ae_prediction_hid = autoencoder.predict(df_hid[["var1","var2","var3","var4"]])
ae_prediction_all = autoencoder.predict(df_mixed_bkg_sig_hid[["var1","var2","var3","var4"]])

In passing, let’s see how our auto-encoder actually performs on the background sample:

bins = np.linspace(0,200,51)
plt.hist(ae_prediction_bkg[:,0], bins, label="predicted", alpha=0.4)
plt.hist(df_bkg["var1"].values, bins, label="original",  alpha=0.4)
plt.legend()

[Figure: predicted vs original var1 distribution for the background sample]

… yeah, not bad but not great either! It’s figured out the decaying exponential distribution, but it could use some more tweaks and epochs of learning…
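
(As an aside, one tweak that often helps - not applied anywhere in this notebook - is to standardise the inputs before training, so that the sigmoid bottleneck does not have to cope with inputs of order 100. A minimal sketch with sklearn’s StandardScaler, purely for illustration:)

from sklearn.preprocessing import StandardScaler

# illustrative only: rescale the background variables to zero mean and unit variance;
# one would then train on the scaled inputs and apply scaler.transform() to sig/hid
scaler = StandardScaler()
X_bkg_scaled = scaler.fit_transform(df_bkg[["var1","var2","var3","var4"]].values)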

Moving on to the computation of the reconstruction errors: here we’ll simply take the squared difference $(x_i^{\rm out} - x_i^{\rm in})^2$ for each of the 4 variables and average them. (As it turns out, this is a good enough definition to illustrate my point.)

error_ae_bkg = ((ae_prediction_bkg-df_bkg[["var1","var2","var3","var4"]].values)**2).mean(axis=1)
error_ae_sig = ((ae_prediction_sig-df_sig[["var1","var2","var3","var4"]].values)**2).mean(axis=1)
error_ae_hid = ((ae_prediction_hid-df_hid[["var1","var2","var3","var4"]].values)**2).mean(axis=1)
error_ae_all = ((ae_prediction_all-df_mixed_bkg_sig_hid[["var1","var2","var3","var4"]])**2).mean(axis=1)
bins = np.linspace(0,2000,51)
plt.hist(error_ae_bkg, bins, label="bkg", alpha=0.4, density=True)
plt.hist(error_ae_sig, bins, label="sig", alpha=0.4, density=True)
plt.hist(error_ae_hid, bins, label="hid", alpha=0.4, density=True)
plt.legend()

[Figure: normalised reconstruction-error distributions for bkg, sig and hid]

Let’s zoom in a bit on that plot of the reconstruction errors for the three samples:

bins = np.linspace(0,500,51)
plt.hist(error_ae_bkg, bins, label="bkg", alpha=0.4, density=True)
plt.hist(error_ae_sig, bins, label="sig", alpha=0.4, density=True)
plt.hist(error_ae_hid, bins, label="hid", alpha=0.4, density=True)
plt.gca().set_ylim(0,0.02)
plt.legend()

[Figure: zoomed view of the reconstruction-error distributions for bkg, sig and hid]

And there we have it! All the expected effects of the anomaly detection/auto-encoder approach are there:

  • the background component is grouped in a sharp peak at low values of the reconstruction error
  • both signal components, since they weren’t seen during training, show up with larger reconstruction errors
  • the relative input distances between the three samples are preserved: the green “hidden” signal component is less spread out than the orange one and is closer to the background distribution
  • applying a sensible decision boundary on that plot, we would get very limited signal contamination in our background-like region (remember that the plot above is normalised!) and a rather pure signal-like region containing both signal components (a quick numerical check is sketched just below)
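
Here is a minimal sketch of that last point - the cut value of 100 on the reconstruction error is purely illustrative and would have to be tuned on your own distributions:

error_cut = 100.
# fraction of each sample above the cut, i.e. flagged as anomalous / signal-like
frac_bkg_anom = (error_ae_bkg > error_cut).mean()
frac_sig_anom = (error_ae_sig > error_cut).mean()
frac_hid_anom = (error_ae_hid > error_cut).mean()
print("Fraction above cut - bkg: {:.3f}, sig: {:.3f}, hid: {:.3f}".format(
      frac_bkg_anom, frac_sig_anom, frac_hid_anom))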

This was a very simple case study for anomaly detection - in future posts, I hope to come back with more formal comments and realistic studies! Stay tuned…