Scikit-Learn: A silver bullet for basic machine learning
Let’s walk through a machine learning project workflow. The intention of this workflow is not to improve the accuracy or F1 score of the classification problem, but to touch on all the modules needed to complete a classification problem efficiently using scikit-learn. Most classification examples start with the iris dataset, so let’s pick another dataset within scikit-learn for this workflow. We will primarily work with the Wisconsin breast cancer dataset. The objective is to classify the diagnosis (cancer diagnosis: true or false) based on the patient’s clinical observation parameters. The dataset contains 569 observations and 30 continuous numeric features, with a class distribution of 212 malignant and 357 benign.
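As a quick sanity check on the numbers above, here is a minimal sketch that loads the dataset with `load_breast_cancer` and counts the classes. The variable names (`data`, `X`, `y`, `counts`) are just illustrative choices.

```python
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset bundled with scikit-learn
data = load_breast_cancer()
X, y = data.data, data.target

# 569 observations, 30 numeric features
print(X.shape)  # (569, 30)

# Count observations per class; target_names maps label index -> class name
counts = {
    str(name): int((y == idx).sum())
    for idx, name in enumerate(data.target_names)
}
print(counts)  # {'malignant': 212, 'benign': 357}
```

Note that in this dataset the label 0 stands for malignant and 1 for benign, which is worth keeping in mind when reading metrics later on.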
- Datasets and generators: Unlike unsupervised learning tasks, supervised tasks (e.g., classification) require labeled datasets, and the package ships with multiple datasets and dataset generators to help you get started with machine learning.
They are broadly split into two types:
a. Static/toy datasets: each dataset is a dictionary-like object holding the feature data (NumPy ndarray), a dataset description, the feature names, the target (a NumPy array, or an ndarray for multilabel problems), and the target names (e.g., fetch_20newsgroups contains text input grouped into 20 different newsgroups such as sports, politics, and science). These datasets have only a finite number of observations and target classes or prediction ranges. For example, the famous iris dataset has only 150 observations and 3 target classes. I have written a function to convert a built-in dataset, which comes in dictionary format, to a pandas DataFrame for visualization and exploration purposes.
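A conversion function of this kind could be sketched as follows. This is not necessarily the original helper: the name `bunch_to_dataframe` and its `target_col` parameter are hypothetical, and the sketch assumes a dataset with a 2-D numeric `data` array and integer class labels, such as the breast cancer dataset.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer


def bunch_to_dataframe(bunch, target_col="target"):
    """Convert a scikit-learn dataset (dictionary-like Bunch) to a DataFrame.

    Hypothetical helper for illustration; assumes a 2-D numeric feature array.
    """
    # One column per feature, named after the dataset's feature names
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    # Append the numeric target as the last column
    df[target_col] = bunch.target
    # For classification datasets, also add a readable class-name column
    if "target_names" in bunch:
        df[target_col + "_name"] = bunch.target_names[bunch.target]
    return df


cancer = load_breast_cancer()
df = bunch_to_dataframe(cancer)
print(df.shape)  # (569, 32): 30 features + target + target_name
print(df["target_name"].value_counts())
```

Recent scikit-learn versions can also return a DataFrame directly via `load_breast_cancer(as_frame=True)`, so a hand-written helper like this is mostly useful when you want control over the column layout.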