Validate data
by: Kevin Broløs
(Feyn version 2.0.7 or newer)
validate_data is a function that checks for a few common data errors that can cause unwanted effects in Feyn. We advise running it once after loading your data, to confirm that the data is in good enough condition to use.
To validate your data properly, specify the kind of problem you intend to solve, the name of the output column, and the stypes you will pass to sample_models if any of your columns are categorical.
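If you are unsure which columns need an entry in stypes, a small helper can propose one. The sketch below is not part of Feyn: the function name infer_categorical_stypes is hypothetical, and it assumes that the string "c" is the stype marker for a categorical column, which you should confirm against your Feyn version. It accepts anything with an items() method that yields column names and values, so a plain dict or a pandas DataFrame should both work.

```python
def infer_categorical_stypes(columns, output_name=None):
    """Propose an stypes dict: every column holding a string value is marked categorical.

    `columns` is any mapping-like object whose .items() yields (name, values),
    for example a dict of lists. "c" is assumed to be the categorical marker.
    """
    stypes = {}
    for name, values in columns.items():
        if name == output_name:
            continue  # the output column is validated separately
        if any(isinstance(v, str) for v in values):
            stypes[name] = "c"
    return stypes


# Toy data, made up for illustration only
data = {"age": [31, 45, 27], "city": ["Oslo", "Rome", "Lima"], "y": [0, 1, 0]}
stypes = infer_categorical_stypes(data, output_name="y")
print(stypes)  # {'city': 'c'}
```

The resulting dict can then be passed as the stypes argument of validate_data. Review the proposal by hand, since a numeric column can also be categorical in meaning.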
Example
from feyn.datasets import make_classification
from feyn import validate_data
train, test = make_classification()
validate_data(data=train, kind='classification', output_name='y', stypes={})
Here's an example that doesn't validate, because we're using a continuous numerical output to do a classification:
from feyn.datasets import make_regression
from feyn import validate_data
train, test = make_regression()
try:
    validate_data(data=train, kind='classification', output_name='y', stypes={})
except ValueError as e:
    print(e)
y must be an iterable of booleans or 0s and 1s
The examples run it on the training data only, but we recommend running it on the full dataset.
validate_data will raise a ValueError in the following cases:
- If the output column does not consist of only numerical values for a regression case.
- If the output column does not consist of boolean-like values for a classification case.
- If any of the columns are object types, but have not been declared as categorical in stypes.
- If columns contain NaN values, and are not declared as categorical in stypes.
  - Note: categoricals support NaN values by assigning them their own weights, so we allow this. You should still consider whether that is the behaviour you want, and handle the NaN values yourself if it is not.
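To see how these conditions interact, the rules above can be approximated in plain Python. This is a rough sketch and not Feyn's implementation: the function precheck, its message strings, and the choice to return a list instead of raising are all hypothetical, and it assumes "c" or "cat" marks a categorical column in stypes. It treats only str values as "object types" and only float NaN as missing, which is narrower than what a real DataFrame can contain.

```python
import math
from numbers import Number


def _is_nan(value):
    # Only float NaN is detected here; None or other missing markers are not.
    return isinstance(value, float) and math.isnan(value)


def precheck(columns, kind, output_name, stypes=None):
    """Return a list of problems mirroring the documented ValueError cases.

    `columns` is any mapping-like object whose .items() yields (name, values).
    The categorical markers "c" and "cat" are an assumption about stypes.
    """
    stypes = stypes or {}
    problems = []
    for name, values in columns.items():
        values = list(values)
        categorical = stypes.get(name) in ("c", "cat")

        if name == output_name:
            if kind == "regression" and not all(isinstance(v, Number) for v in values):
                problems.append(f"{name}: output must be numerical for regression")
            if kind == "classification" and not all(v in (0, 1) for v in values):
                problems.append(f"{name}: output must be boolean-like for classification")

        if categorical:
            continue  # categoricals may hold strings and NaN values
        if any(isinstance(v, str) for v in values):
            problems.append(f"{name}: string values but not declared categorical")
        if any(_is_nan(v) for v in values):
            problems.append(f"{name}: NaN values but not declared categorical")
    return problems


# Toy data, made up for illustration only
data = {"x": [1.0, float("nan")], "city": ["a", "b"], "y": [0.5, 1.2]}
for problem in precheck(data, kind="classification", output_name="y"):
    print(problem)
```

Collecting every problem in a list, instead of stopping at the first one, makes it easier to fix a dataset in one pass before handing it to validate_data for the authoritative check.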