by: Chris Cave
In this tutorial we will go through the basic workflow of how to solve a regression problem using the QLattice and Feyn.
Feyn is the software development kit that we use to interact with the QLattice. It is named after Richard Feynman because his work on path integrals was an inspiration for our QLattice. This means that Feyn is pronounced like "fine".
import numpy as np
import pandas as pd
import feyn
We want to demonstrate Feyn on a good regression dataset so we've chosen this dataset from Kaggle. It's an Airbnb dataset of rental listings in New York City in 2019.
It's a good mixture of continuous and categorical features and we can give you a nice visualisation of our results at the end. Let's take a look at it!
rentals = pd.read_csv('AB_NYC_2019.csv')
rentals.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
What are we going to do with this dataset? Well, since this is a regression tutorial, we'd better have a regression problem to solve. Let's predict the price of a listing based on the other features.
The features are all pretty self-explanatory here. Perhaps the only one that needs explaining is availability_365: it shows the number of days in the year the listing is available to rent.
If we look a bit closer, there are a few things that aren't really that helpful here. The id and host_id columns are unique identifiers, so they're a definite no-no for us: we don't want to risk overfitting our data. We also want to get rid of the host_name and name features. They could provide some interesting insights with more exploration, but we aren't going to do that here. We're here to talk about QLattices!
rentals = rentals.drop(["id", "host_id","host_name", "name", "last_review"], axis=1)
We've also got some more news for you in case you haven't heard it yet. We're going to feed all this data in through something called registers. Registers can take either numerical or categorical values. We're working on expanding that, and we're hopeful that in the future we will have a date/time register. You'll know when we've solved this problem because we will shout it from the rooftops.
For now, though, that's why we dropped the last_review column above: it's a date.
We're also going to get rid of some of those outliers.
from scipy import stats

# We drop all rows with NaN values
rentals = rentals.dropna()

# Then we drop all prices that are equal to 0
rentals = rentals[rentals.price != 0]

# Finally we drop values that are further than three standard deviations from the mean
z = np.abs(stats.zscore(rentals['price']))
rentals = rentals[z < 3]
We are now ready to split our data. We are only going to split this into a train and test set, but it is good practice to split into three sets: train, cross-validation and test. We will in the future provide some guidance on how to use cross-validation with Feyn, but for now we will stick with just a train and test set.
Notice how we aren't splitting the columns into input columns and target variable. More on that later.
from sklearn.model_selection import train_test_split

train, test = train_test_split(rentals, random_state=42)
Initialize the QLattice
Now we are ready to play. We first connect to a QLattice. If you are running this on your local computer, you would instantiate the QLattice with a unique identifier. You can find out more about authenticating your QLattice here.

If you're running this on the playground, you do not need a unique identifier.
ql = feyn.QLattice()
ql.reset()
To continue we need to tell the QLattice what should be the inputs and outputs of the models it produces.
This is when we start playing with registers. They are our way of feeding data in and getting predictions out of models.
We will assign each column in the dataset to a register. Now the type of data we have is a mixture between numerical and categorical values.
num_columns = ['latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
               'reviews_per_month', 'calculated_host_listings_count',
               'availability_365', 'price']
cat_columns = ['neighbourhood_group', 'neighbourhood', 'room_type']
We need to tell each register what type of data it should expect: numerical or categorical. By default each register will be a numerical register and expect numerical values.
The categorical register is a special one. It takes only categorical values and automatically one-hot-encodes them!
Did you hear that? You do not need to one-hot encode categorical features! It's taken care of in the categorical register.
Likewise, the numerical register automatically scales the data into the range [-1, 1], so there's no need for you to do any scaling yourself: the numerical register applies min-max scaling for you.
We've already said this but we will say it again. You do not need to one-hot-encode the data for categorical registers. The categorical register does this automatically!
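To make this concrete, here is roughly the preprocessing the registers save you from. This is just an illustrative sketch in pandas and scikit-learn, not what Feyn actually does under the hood:

from sklearn.preprocessing import MinMaxScaler

# What the categorical register does for you, roughly: one-hot encoding
manual_onehot = pd.get_dummies(rentals['room_type'])

# What the numerical register does for you, roughly: min-max scaling into [-1, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
manual_scaled = scaler.fit_transform(rentals[['minimum_nights']])

With Feyn you skip both of these steps entirely.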
Let's assign some registers!
for col in num_columns:
    # This assigns numerical (default) registers to the continuous features
    ql.get_register(name=col)

for col in cat_columns:
    # This assigns categorical registers to the categorical features
    ql.get_register(name=col, register_type='cat')
The behaviour of get_register is to first check whether the QLattice already has a register with that name. If it does, it extracts it. If it doesn't, it creates a new register of that type and name and then extracts it.
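Conceptually it's a plain get-or-create pattern. Here is a toy sketch of that behaviour (just an illustration, not Feyn's internals):

# Toy illustration of get_register's get-or-create behaviour (not Feyn's internals)
registers = {}

def get_register_sketch(name, register_type='num'):
    if name not in registers:
        # No register with this name yet: create one of the given type and name
        registers[name] = {'name': name, 'type': register_type}
    # Either way, extract the register with this name
    return registers[name]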
This has some interesting consequences. But first a warning: be a little careful if you have columns with the same name that mean different things, because the QLattice will map them to the same register and end up fitting an input into the wrong register! This can only lead to confusion.
The more interesting feature is that you can use registers to share datasets across a single QLattice. Here we have a dataset of rentals in New York City, but what if we also had rental data from another country, say Germany? Then we could use the same QLattice, with the same registers used for the New York dataset, on the German rentals. Exciting, isn't it? In the future we will provide more guidance on how to use the same QLattice for multiple datasets in similar domains.
Let's put our QLattice to work. The main thing the QLattice does is produce QGraphs. Ideally a QGraph would be the collection of all possible models from inputs to output. Of course we can't have that, because computers are finite. Instead a QGraph is a finite list (usually in the thousands) of possible models for our data.
Here we're going to use our registers. We need to tell the QLattice what should be the inputs and output of the model. We include all the registers into the method below and then declare which one of those will be our output.
Let's show you some models.
target = 'price'

qgraph = ql.get_qgraph(registers=train.columns, output=target, max_depth=3)
qgraph.head()
You should keep running the above cell: you will get a different QGraph each time, so you will see models you have not seen before. Have fun with it! Experiment with different input and output registers! Just remember that for the rest of the code to make sense you need to use the original input and output registers.
Each model has two types of boxes. The green ones are our registers and the grey ones are what we call interactions. An interaction takes in a value, evaluates a function at that value, and passes the result on to the next interaction. The function used is the one named on the interaction, together with some learned weights and biases. If you are using Feyn on your local computer, you can hover over an interaction to see these weights and biases.
These models are somewhat similar to a neural network, but with far fewer nodes and less typical activation functions on the nodes. For now we use multiply, sine and Gaussian functions along with the classic tanh.
Each model has a natural flow from left to right, so we feed each row of our dataset into the input registers, evaluate at each grey box and then produce a prediction at the end at the output register.
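For intuition, here is a toy version of how such a model might evaluate a row, with made-up interactions, weights and biases (the real structure and weights are found by the QLattice):

import numpy as np

def gaussian(x):
    # One of the interaction functions mentioned above
    return np.exp(-x ** 2)

def toy_model(latitude, minimum_nights):
    # Hypothetical model: two interactions feeding a multiply at the output
    h1 = np.tanh(0.8 * latitude + 0.1)         # tanh interaction with made-up weight and bias
    h2 = gaussian(0.5 * minimum_nights - 0.3)  # gaussian interaction with made-up weight and bias
    return h1 * h2                             # multiply interaction produces the prediction

toy_model(0.4, -0.2)  # inputs arrive already scaled into [-1, 1] by the registers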
Like a neural network or any other machine learning model, each of these models needs to be trained. The weights on each interaction are initially random, so it is likely that all of the models you see above are just terrible.
The QGraph fit method needs to take the following arguments:
- The data the models in the QGraph should be trained on.
- The number of epochs to train.
- The loss function we want to optimise for.
At the moment there are three options for loss functions: mean squared error and mean absolute error for regression problems, and categorical cross entropy for binary classification. As we are solving a typical regression problem, we will use bog-standard mean squared error.
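For reference, here are the two regression losses written out by hand on some hypothetical prices and predictions:

import numpy as np

y_true = np.array([100.0, 150.0, 80.0])  # hypothetical actual prices
y_pred = np.array([110.0, 140.0, 95.0])  # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)   # mean squared error: punishes large errors harder
mae = np.mean(np.abs(y_true - y_pred))  # mean absolute error: treats all errors linearly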
qgraph.fit(train, epochs=5, loss_function=feyn.losses.mean_squared_error)
So what is happening here? We go through each model in the QGraph and train it for the number of epochs we've declared. The graph being displayed is the current model with the lowest mean squared error; whenever training finds a model with a lower loss, that model is shown instead. Note: the text "Examined n of N." indicates the number of graphs (n) examined out of the total number (N) in the QGraph.
Did you notice something funny here? Something a little unusual? We didn't specify in the fit function which columns are the features and which is the target. It's all taken care of by the registers, so none of that extra splitting of your data into X_train and y_train is ever needed.
Let's have a look at the best models so far. We do this with the select function, which returns a list of the best models according to some criterion we are free to choose. Let's stick to the easy case: we want the models with the lowest loss on the training set.
best_graphs = qgraph.select(train)
best = best_graphs[0]
best
Other things we can do:

- select with respect to a different loss function;
- limit the depth of the model, which makes it more readable and explainable;
- select on a different dataset, for example a cross-validation set.
Shhhh, the code below shouldn't really be allowed because it uses our test set. Our test set should be locked away somewhere, never to be seen again until we do the final, final validation. But what the hey, I won't tell anyone if you won't.
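Something like this, presumably (the same select call as before, just on the test set):

# Naughty: selecting the best models on the test set
qgraph.select(test)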
This next bit is the really clever part. I don't know if you can tell, but we're actually quite proud of how it works.
You remember that a QGraph is a finite list of random models for our data, right? Well, updating our QLattice is a way of saying to it: hey, this is a great model, more like this please! The QLattice uses this model to narrow its search space and home in on better models.
This is like an evolutionary process, but for models, and the analogy works surprisingly well. We can think of a QGraph as a generation of models. The fit method is the adaptation of the models to their environment (i.e. the loss function). The select method is the survival of the fittest (i.e. the models that are strongest with respect to a particular criterion). Finally, the update method passes the strongest model's characteristics on to the next generation of QGraphs.
We only need to update once. If we update too much then we could be narrowing the QLattice's search space too much and too soon. We do not want to rule out other potentially stronger models.
Now we are ready for the next generation of QGraphs. We are going to fit it all into one cell and not go through it in the same level of detail as above. We're also going to do two updates.
updates = 2

for loop in range(updates):
    qgraph = ql.get_qgraph(train.columns, target, max_depth=3)
    qgraph.fit(train, epochs=5, loss_function=feyn.losses.mean_squared_error)
    best = qgraph.select(train, loss_function=feyn.losses.mean_squared_error, n=1)[0]
    ql.update(best)
This evolutionary process is all randomly initiated. If we reset the QLattice with ql.reset(), it forgets all its learnings and we start the whole evolutionary process again. The models at the end of each evolutionary run are likely to have some similarities and some differences. They'd be similar because the same registers are likely to interact in similar ways; they'd be different because each initial QGraph starts off from different initial conditions.
The next questions to ask are:
- how do I know when to stop?
- what is the right balance between the number of epochs and the number of updates?
Well, as with most neural networks, you never really know when to stop training. You get a feel for when the loss is not really decreasing any more and you have pushed the QGraphs as far as you can.
As for the balance, this is much more fickle, and we're going to give an annoying answer: it depends on the dataset. We will write further guidance on this balance at a later date. One useful thing to note is that because these models have far fewer interactions than a neural network, they usually don't need as many epochs.
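If you want something more mechanical than gut feel, one crude option is to keep updating until the training loss stops improving. Here's a sketch of that idea; the cap of 10 updates and the tolerance of 1e-4 are arbitrary choices of ours, not Feyn recommendations:

prev_loss = np.inf
for loop in range(10):  # arbitrary cap on the number of update rounds
    qgraph = ql.get_qgraph(train.columns, target, max_depth=3)
    qgraph.fit(train, epochs=5, loss_function=feyn.losses.mean_squared_error)
    best = qgraph.select(train, loss_function=feyn.losses.mean_squared_error, n=1)[0]
    ql.update(best)

    # Stop once the best training loss has (roughly) plateaued
    loss = np.mean((train[target] - best.predict(train)) ** 2)
    if prev_loss - loss < 1e-4:  # arbitrary tolerance
        break
    prev_loss = loss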
Predicting and evaluating
Let's see what we've got for our efforts. We can predict on our datasets by running the predict method on our graph.
pred_train = best.predict(train)
pred_test = best.predict(test)
Here we will plot our results as a map of New York City, where the colour of each point represents the price of the rental.
import matplotlib.pyplot as plt

f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, sharex=True, figsize=(24, 8))

ax1.scatter(rentals["longitude"], rentals["latitude"], c=rentals[target], vmax=200, s=8)
ax1.set_title('Actual price')

ax2.scatter(train["longitude"], train["latitude"], c=pred_train, vmax=200, s=8)
ax2.set_title('Predicted Train price')

ax3.scatter(test["longitude"], test["latitude"], c=pred_test, vmax=200, s=8)
ax3.set_title('Predicted Test price')

pass
Not bad, as you can see. The model mostly captures the trend of prices in different neighbourhoods. To be honest, given the dataset, this is about the most we can expect: a lot of features that would likely have a huge effect on the price, such as square metres or number of rooms, simply aren't in the data. So instead the model mostly captures how prices vary across neighbourhoods.
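If you'd like a number to go with the maps, here's a quick check with scikit-learn's standard regression metrics:

from sklearn.metrics import mean_absolute_error, r2_score

print("Test MAE:", mean_absolute_error(test[target], pred_test))
print("Test R2: ", r2_score(test[target], pred_test))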
To wrap up, here is a summary of how to interact with the QLattice and use QGraphs. This is how a basic workflow looks when you want to use the QLattice and QGraphs.
# Here we initialise and reset the QLattice again. This means it's forgotten all its learnings
ql = feyn.QLattice()
ql.reset()

# This is the pro way of assigning registers to columns
for col in rentals.columns:
    if rentals[col].dtype == 'object':
        # Here we assign all categorical registers
        ql.get_register(name=col, register_type="cat")
    else:
        # and here we assign all numerical registers
        ql.get_register(name=col)

# We define the target variable
target = 'price'

# The number of update loops we are going to perform
updates = 2

for loop in range(updates):
    # First extract the QGraph using these registers and the target as the output register
    qgraph = ql.get_qgraph(train.columns, target, max_depth=3)

    # Fit each model in the QGraph
    qgraph.fit(train, epochs=5, loss_function=feyn.losses.mean_squared_error)

    # Select the best model
    best = qgraph.select(train, loss_function=feyn.losses.mean_squared_error, n=1)[0]

    # Update the QLattice with the best model
    ql.update(best)