Clustering

Overview

Clustering is an unsupervised machine-learning technique that groups similar data points together based on their features or characteristics. Here it is used to identify distinct groups of customers or market segments based on their preferences and attitudes towards Nike and other leading shoe brands. Two popular methods are applied: k-means clustering and hierarchical clustering.

K-means clustering is an iterative algorithm that partitions the data into a fixed number of clusters, k, by minimizing the sum of squared distances between each data point and its assigned centroid. It initializes the centroids (typically at random) and then alternates two steps: assigning each data point to its closest centroid, and recomputing each centroid as the mean of the points assigned to it. The loop stops when the centroids no longer move or the maximum number of iterations is reached.
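The assign-and-update loop of k-means can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the linked notebook: the function names, the `seed` and `max_iter` parameters, and the toy one-dimensional points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of numeric vectors (hypothetical helper)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop as soon as the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (invented): two well-separated groups on a line.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the starting centroids are random, which group receives label 0 depends on the seed; only the grouping itself is meaningful. On larger data sets, different initializations can also converge to different partitions, so the algorithm is often run several times.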

Hierarchical clustering is another method that builds a hierarchical representation of the data, usually drawn as a dendrogram. It can be either agglomerative, starting with each data point in its own cluster and repeatedly merging the closest pair of clusters until all the data points are in a single cluster, or divisive, starting with all the data points in a single cluster and recursively splitting them.
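The agglomerative variant can be illustrated with a small, unoptimized sketch. It assumes average linkage (the mean pairwise distance between two clusters) and Euclidean distance; the linkage and distance used in the linked analysis may differ, and the function names and toy points are invented for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, distance=euclidean):
    """Bottom-up clustering with average linkage; returns the merge history."""
    # Start with every point in its own cluster (clusters hold point indices).
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest mean pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(
                    distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                ) / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Record which clusters were merged and at what distance (height).
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        # Merge cluster j into cluster i, then drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Toy data (invented): four points on a line.
points = [[0.0], [1.0], [5.0], [6.5]]
merges = agglomerative(points)
for left, right, height in merges:
    print(left, right, height)
```

The recorded merge distances are the branch heights that a dendrogram would display. This brute-force loop is only meant to show the idea and would be slow on real data.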

Cosine similarity is commonly used as the basis for a distance measure in hierarchical clustering, particularly when working with text data. It measures the cosine of the angle between two vectors and is calculated by dividing the dot product of the vectors by the product of their magnitudes. The distance between two data points is then calculated as 1 minus the cosine similarity of the vectors representing those data points.
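As a small illustration of the cosine distance, the following sketch computes it for plain Python lists. The function names and the example vectors are assumptions for this example, and the code assumes neither vector is all zeros (otherwise the division is undefined).

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Invented word-count vectors: same direction, different lengths.
same_direction = cosine_distance([2, 1, 0], [4, 2, 0])
# Vectors that share no words are orthogonal.
no_overlap = cosine_distance([1, 0], [0, 1])
print(same_direction, no_overlap)
```

The first pair shows why this measure suits text: a review that repeats the same words twice as often points in the same direction, so its distance to the shorter review is essentially zero even though the raw counts differ.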

The aim of this analysis is to identify specific groups of customers with similar preferences for certain brands, price points, or features of footwear products. It can also reveal potential gaps in the market that Nike or other brands can tap into by targeting these customer segments with specific marketing strategies and product features.

The analysis is expected to find different groups of customers with varying preferences for shoe brands and features. These groups may be based on factors such as age, gender, location, and socioeconomic status. Certain segments may prefer high-end brands with premium prices and luxury features, while others may prioritize comfort and affordability over brand recognition. Nike may also turn out to be the top-ranked brand in some customer segments while other brands dominate elsewhere. This information can be used to guide strategic decision-making by companies looking to compete in the global footwear market.

Data Preparation

Clustering is typically applied to unlabeled numeric data, meaning that the data points have no predefined labels or categories. However, in real-world applications, the data often comes in other formats, such as customer reviews or other text, which are not numeric.

To apply clustering to such data, one common approach is to use a count vectorizer to convert the text data into numeric data. A count vectorizer is a technique that represents text data as a matrix of word counts, where each row corresponds to a document or review and each column corresponds to a unique word in the corpus. The entries in the matrix are the frequency of each word in each document.
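The word-count matrix described above can be sketched in plain Python as follows. This is a simplified illustration rather than the vectorizer used in the linked code: the `count_vectorize` helper and the three reviews are invented, and the tokenization simply lowercases and splits on whitespace.

```python
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts word j in document i."""
    # Lowercase and split on whitespace. Real pipelines usually also strip
    # punctuation and stop words, which is omitted here.
    tokenized = [doc.lower().split() for doc in documents]
    # One column per unique word in the corpus, in alphabetical order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Invented example reviews.
reviews = [
    "great shoes great price",
    "comfortable shoes",
    "price too high",
]
vocabulary, matrix = count_vectorize(reviews)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row of `matrix` is the numeric vector for one review, which is the form that clustering algorithms expect. In practice a library implementation such as scikit-learn's `CountVectorizer` is the more usual choice.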

With a count vectorizer, the text data is transformed into a numeric format suitable for clustering algorithms such as k-means or hierarchical clustering, which can then group similar customer reviews together based on their content. This makes the count vectorizer a useful tool for gaining insights from unstructured data such as customer reviews.

Code

Hierarchical Clustering: https://github.com/poonamthakur08/Nike-Footwear-Market-Analysis/blob/main/hierarchicalClustering.rmd

K-means Clustering: https://github.com/poonamthakur08/Nike-Footwear-Market-Analysis/blob/main/KmeansClustering.ipynb

Results

Visualization

Top 10 words of clusters (k=2)

Top 10 words of clusters (k=3)

Top 10 words of clusters (k=4)

The dendrogram illustrates how the data points are grouped and how the clusters are merged as the algorithm progresses. The height at which two branches join represents the distance between the clusters being merged: the longer the branch, the greater the distance. The dendrogram can be used to choose the number of clusters by cutting it at a given height, since the number of branches the cut crosses is the number of clusters. A cut placed where the branches are longest separates the most distinct groups.
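One common heuristic for reading a cluster count off a dendrogram is to cut inside the largest gap between consecutive merge heights. The sketch below applies that heuristic to a list of merge heights. The function name and the heights are hypothetical, and this is not necessarily how the number of clusters was chosen in the linked analysis.

```python
def suggest_cluster_count(merge_heights):
    """Suggest k by cutting the dendrogram inside its largest height gap."""
    heights = sorted(merge_heights)
    if len(heights) < 2:
        raise ValueError("need at least two merges")
    # Gap between each merge and the next one up the tree.
    gaps = [heights[j + 1] - heights[j] for j in range(len(heights) - 1)]
    widest = max(range(len(gaps)), key=lambda j: gaps[j])
    # There are n = len(heights) + 1 data points. After the first
    # (widest + 1) merges, n - (widest + 1) clusters remain.
    return len(heights) - widest

# Hypothetical merge heights for five data points.
heights = [0.10, 0.20, 0.25, 0.90]
k = suggest_cluster_count(heights)
print(k)
```

Here the last merge happens far above the others, so the cut falls just below it and leaves two clusters.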

The figure shows that, according to the silhouette method, the optimal number of clusters for hierarchical clustering is k = 2.

Silhouette Visualization (k = 2, 3, 4, 5)

For each value of k = 2, 3, 4, 5, the silhouette coefficient is calculated for each data point. The silhouette coefficient measures how similar a data point is to its own cluster compared to other clusters, and ranges from -1 to 1. A high coefficient indicates that the data point is well-clustered, a coefficient near 0 indicates that it lies near the border between two clusters, and a negative coefficient suggests that it may have been assigned to the wrong cluster.

The silhouette plot for each value of k displays a vertical bar chart for each data point, representing its silhouette coefficient. The height of each bar represents the silhouette coefficient, and the color of the bar represents the assigned cluster. The plot also includes a dotted line representing the average silhouette coefficient for all data points in the cluster.

By examining the silhouette plot for each value of k, one can determine the optimal number of clusters for the given dataset. The optimal number of clusters is the value of k that maximizes the average silhouette coefficient across all data points. A higher average silhouette coefficient indicates a better clustering result.
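The selection rule can be sketched as follows, assuming Euclidean distance and the usual definition of the silhouette coefficient, (b - a) / max(a, b), where a is the mean distance to the point's own cluster and b is the smallest mean distance to any other cluster. The helper names, toy points, and hand-written candidate labelings are assumptions for illustration; in the real analysis the labelings would come from running the clustering algorithm for each k.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_silhouette(points, labels, distance=euclidean):
    """Average silhouette coefficient over all points for one labeling."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        # Guard against identical points, where both a and b are 0.
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (invented) and hypothetical labelings for k = 2 and k = 3.
points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}
scores = {k: mean_silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Splitting the left-hand group into two clusters lowers the average, because the points on either side of the artificial border end up close to a neighbouring cluster. The two-cluster labeling therefore scores higher.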

Conclusion

In conclusion, the k-means and hierarchical clustering analyses provide insights into the similarities and differences between Nike and other leading shoe brands in the global footwear market based on customer reviews. The analyses suggest that Nike is often clustered together with other popular brands, indicating that it is not necessarily the only top-ranked brand in the market. Other factors such as pricing, availability, and brand reputation may also play a significant role in determining a brand's success. Overall, these clustering techniques offer a valuable tool for understanding the complex relationships between different shoe brands and their customers.