Compressing Feature Space for Classification Using PCA


In this project, we use Principal Component Analysis (PCA) to compress 100 unlabelled, sparse features into a more manageable number for classifying buyers of Ed Sheeran’s latest album.

Some Context

Our client is looking to promote Ed Sheeran’s new album, and wants to be both targeted with their customer communications and as efficient as possible with their marketing budget. As a proof of concept, they would like us to build a classification model for customers who purchased Ed’s last album, based upon a small sample of listening data acquired for some of their customers at that time.

If we can do this successfully, they will look to purchase up-to-date listening data, apply the model, and use the predicted probabilities to promote to customers who are most likely to purchase.


Approach

The sample data is short but wide: it contains only 356 customers, but for each one there are 100 columns representing the percentage of historical listening time allocated to each of 100 artists. On top of this, the columns do not name the artist in question; they are simply labelled artist1, artist2, and so on.

So, we will need to compress these 100 inputs down to a manageable number of components before we can apply a classification model and make predictions.
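As a rough sketch of what this compression step can look like (assuming the sample lives in a CSV with the 100 artist columns plus a purchase label, and that scikit-learn is used; the file and column names here are illustrative, not the notebook’s actual names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Illustrative file and column names - the real notebook may differ
data = pd.read_csv("sample_data.csv")
X = data.drop("purchased_album", axis=1)  # the 100 artist listening columns
y = data["purchased_album"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale first so that no single artist column dominates the components
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Passing a float to n_components keeps the smallest number of
# components that together explain at least that share of the variance
pca = PCA(n_components=0.75, random_state=42)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

print(pca.n_components_)  # the number of components retained
```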

More details on the project can be found in the Jupyter Notebook here: Principal Component Analysis


Results

Based upon an analysis of variance vs. components, we chose to retain 75% of the variance of the initial feature set, which reduced the number of features from 100 down to just 24 components. That is a substantial compression of the feature space!
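For reference, the variance vs. components analysis can be reproduced along these lines (a minimal sketch, continuing from the scaled training data in the snippet above; the dashed line marks our 75% cut-off):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit with all components first to inspect the full variance profile
pca_full = PCA(n_components=None, random_state=42)
pca_full.fit(X_train_scaled)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker=".")
plt.axhline(y=0.75, linestyle="--", color="grey")  # our 75% variance cut-off
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("Variance vs. Components")
plt.show()
```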

Figure: cumulative explained variance vs. number of components

Using these 24 components, we trained a Random Forest Classifier that predicts which customers would purchase Ed Sheeran’s last album with a Classification Accuracy of 93%.
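A minimal sketch of that final modelling step, assuming scikit-learn defaults and the PCA-transformed data from the earlier snippet (the exact hyperparameters in the notebook may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train on the 24 retained components rather than the raw 100 columns
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_pca, y_train)

y_pred = clf.predict(X_test_pca)
print(f"Classification accuracy: {accuracy_score(y_test, y_pred):.2f}")

# Predicted purchase probabilities, for ranking customers to target
purchase_probabilities = clf.predict_proba(X_test_pca)[:, 1]
```

Those predicted probabilities are exactly what the client would use to prioritise which customers receive the promotion.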


