Automatic Feature Engineering: An Event
Feature Engineering: the Heart of Data Science
“Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.”
— Dr. Jason Brownlee
The groundwork for this field was laid long before the hype of machine learning. Key Performance Indicators (KPIs) are crucial for companies of all sizes: they offer concrete metrics on business performance. Take the classic RFM (recency, frequency, monetary value) paradigm that retailers use to measure customer value.
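The RFM scores mentioned above are easy to make concrete. Below is a minimal, self-contained sketch; the purchase log and its field layout are made up for the example.

```python
# Toy illustration of RFM (recency, frequency, monetary value) scoring.
# The data and field layout are hypothetical, made up for this example.
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount)
purchases = [
    ("alice", date(2018, 8, 30), 40.0),
    ("alice", date(2018, 8, 10), 25.0),
    ("bob",   date(2018, 6, 1),  90.0),
]

def rfm(purchases, reference_date):
    """Compute per-customer recency (days since last purchase),
    frequency (number of purchases), and monetary value (total spend)."""
    scores = {}
    for customer, day, amount in purchases:
        rec, freq, mon = scores.get(customer, (None, 0, 0.0))
        days_ago = (reference_date - day).days
        rec = days_ago if rec is None else min(rec, days_ago)
        scores[customer] = (rec, freq + 1, mon + amount)
    return scores

print(rfm(purchases, date(2018, 9, 1)))
# {'alice': (2, 2, 65.0), 'bob': (92, 1, 90.0)}
```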
We believe that a good feature engineering tool should not only generate sophisticated indicators, but should also keep them interpretable so that data scientists can use them either for the machine learning models or for the KPI dashboards.
The Need for Automated Feature Engineering
Imagine you are working for an e-commerce company. You have collected transactional data and are now almost ready to make some magic with machine learning.
The task at hand is churn prediction: you want to predict who might stop visiting your website next week so that the marketing team has enough time to react.
Before doing any feature engineering, you need to choose a reference date in the past; in this case it will be 2018-09-01. Only data before that date will be taken into account by the model, which predicts the churners of the week after 2018-09-01. This ensures there is no data leakage: we are not looking at the future to predict the past.
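The split described above can be sketched in a few lines; the event records and field names here are hypothetical.

```python
# Minimal sketch of splitting event data around a reference date to avoid
# leakage: only events strictly before the cutoff feed the features, while
# the week after it provides the churn labels. Field names are hypothetical.
from datetime import date, timedelta

reference_date = date(2018, 9, 1)

events = [
    {"user": "alice", "day": date(2018, 8, 20)},  # feature-window visit
    {"user": "bob",   "day": date(2018, 9, 3)},   # label-window visit
    {"user": "carol", "day": date(2018, 9, 15)},  # outside both windows
]

feature_events = [e for e in events if e["day"] < reference_date]
label_window = [
    e for e in events
    if reference_date <= e["day"] < reference_date + timedelta(weeks=1)
]

print([e["user"] for e in feature_events])  # ['alice']
print([e["user"] for e in label_window])    # ['bob']
```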
As an experienced data scientist, you know that one important feature for this type of problem will be the recency of the client: if the time between two consecutive visits of a client is increasing, that’s an alert to potential churn!
You put on your SQL ninja hat and write the following PostgreSQL query:
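The query itself is not reproduced here; a sketch of what such a recency query might look like in PostgreSQL, with a hypothetical events table and column names:

```sql
-- Hedged sketch (the article's actual query is not shown);
-- the table and column names are assumptions.
SELECT
    user_id,
    DATE '2018-09-01' - MAX(event_date) AS recency_days
FROM events
WHERE event_date < DATE '2018-09-01'
GROUP BY user_id;
```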
This is fine, but now you want to go further: you want to add a time filter to capture long- and short-term signals, then compute this feature for each type of activity the user performs, then add some more statistics on top of these results, and then …
You get the idea: the list keeps growing exponentially, and this is just for one feature!
EventsAggregator to the Rescue
Now let’s see how things change with EventsAggregator.
First, we need to instantiate the feature_aggregator object as follows:
We then apply the feature_aggregator to the input dataset:
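Neither snippet is reproduced in the source, so the real EventsAggregator API is not shown here. The following self-contained toy stand-in illustrates the two-step workflow; the class, its parameters, and the data layout are all hypothetical, not the library's actual interface.

```python
# Toy stand-in for the two steps above. This is NOT the real EventsAggregator
# API (which the source does not show); every name and parameter here is a
# hypothetical illustration of the instantiate-then-apply workflow.
from datetime import date

class ToyEventsAggregator:
    def __init__(self, time_windows, aggregations):
        self.time_windows = time_windows   # window lengths, in days
        self.aggregations = aggregations   # statistics to compute, e.g. ["count"]

    def transform(self, events, reference_date):
        """Return one feature dict per user: event counts per time window."""
        features = {}
        for window in self.time_windows:
            for e in events:
                age = (reference_date - e["day"]).days
                if 0 < age <= window:
                    row = features.setdefault(e["user"], {})
                    key = f"count_last_{window}d"
                    row[key] = row.get(key, 0) + 1
        return features

# Step 1: instantiate the aggregator with the windows/statistics we want.
feature_aggregator = ToyEventsAggregator(time_windows=[7, 30], aggregations=["count"])

# Step 2: apply it to the input dataset, cut off at the reference date.
events = [
    {"user": "alice", "day": date(2018, 8, 30)},
    {"user": "alice", "day": date(2018, 8, 5)},
]
print(feature_aggregator.transform(events, date(2018, 9, 1)))
# {'alice': {'count_last_7d': 1, 'count_last_30d': 2}}
```

The design to note is the separation between configuring the aggregator once and applying it to data: the combinatorial explosion of windows, activity types, and statistics lives in the configuration, not in hand-written queries.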
Under the hood, feature_aggregator generates a set of SQL queries corresponding to our criteria.
For example, you can see below one of the generated queries for the Postgres database, where it considers only the 6 most recent months of history:
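The generated SQL is not reproduced in the source; a hedged sketch of what a time-filtered query of this kind might look like, with assumed table and column names:

```sql
-- Hypothetical sketch of a generated, time-filtered query (the actual
-- generated SQL is not shown in the source); names are assumptions.
SELECT
    user_id,
    COUNT(*) AS n_events_6m,
    DATE '2018-09-01' - MAX(event_date) AS recency_days_6m
FROM events
WHERE event_date >= DATE '2018-09-01' - INTERVAL '6 months'
  AND event_date < DATE '2018-09-01'
GROUP BY user_id;
```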