Clustering
Overview
Clustering is an unsupervised machine-learning technique that groups similar data points together based on their features or characteristics. In this analysis, it can be used to identify distinct groups of customers or market segments based on their preferences and attitudes towards Nike and other leading shoe brands. Two popular clustering methods are k-means clustering and hierarchical clustering.
K-means clustering is an iterative algorithm that partitions the data into a fixed number of clusters, k, by minimizing the sum of squared distances between each data point and its assigned centroid. It starts from randomly initialized centroids and then alternates two steps: assigning each data point to its closest centroid, and recomputing each centroid as the mean of the points assigned to it. The process stops when the centroids no longer move or the maximum number of iterations is reached.
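The following is a minimal sketch of k-means in plain Python, included only to make the iteration concrete. It assumes Euclidean distance and picks the initial centroids by sampling k of the data points; the function names, the seed parameter, and the toy data are illustrative and are not taken from the project notebook.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of numeric points."""
    rng = random.Random(seed)
    # Random initialization: use k of the data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Toy data (made up): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random initialization, so only the grouping itself is meaningful. On real data, k-means can also settle in a poor local optimum, which is why it is commonly run several times with different starting centroids.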
Hierarchical clustering is another method that builds a hierarchical representation of the data, usually drawn as a dendrogram. It can be either agglomerative, starting with each data point in its own cluster and repeatedly merging the closest pair of clusters until all the data points are in a single cluster, or divisive, starting with all the data points in a single cluster and recursively splitting them.
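The agglomerative variant can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's implementation: it assumes average linkage (the distance between two clusters is the mean pairwise distance between their members) and Euclidean distance by default, and the function names are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerative(points, distance=euclidean):
    """Average-linkage agglomerative clustering.

    Returns the merges in order as (cluster_id_a, cluster_id_b, distance).
    Original points have ids 0..n-1; each merge creates the next free id.
    """
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # Find the pair of clusters with the smallest average distance.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                total = sum(
                    distance(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge the closest pair into a new cluster.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges


# Toy data (made up): two tight pairs that are far from each other.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
for a, b, d in agglomerative(points):
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The sequence of merges and their distances is the information a dendrogram displays. This brute-force search recomputes every pairwise distance on each pass, so it is only suitable for small examples.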
Cosine similarity is commonly used as the basis for a distance measure in hierarchical clustering, particularly when working with text data. It measures the cosine of the angle between two vectors and is calculated by dividing the dot product of the vectors by the product of their magnitudes. The distance between two data points is then calculated as 1 minus the cosine similarity of the vectors representing those data points.
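As a small worked example, cosine similarity and the corresponding distance can be computed as follows. This is a plain-Python sketch with hypothetical function names; it treats the similarity as undefined when either vector has zero length, which is an assumption about how to handle that edge case.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)


# Word-count vectors (made up) for three short reviews.
review_a = [2, 1, 0]
review_b = [4, 2, 0]   # same word proportions as review_a, twice as long
review_c = [0, 0, 3]   # no words in common with review_a

print(cosine_distance(review_a, review_b))  # close to 0: same direction
print(cosine_distance(review_a, review_c))  # 1.0: orthogonal vectors
```

Because the measure depends only on the angle between the vectors, a long review and a short review with the same word proportions are treated as nearly identical, which is one reason it suits text data.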
The aim of this analysis is to identify specific groups of customers with similar preferences for certain brands, price points, or features of footwear products. The analysis can also reveal potential gaps in the market that Nike or other brands can tap into by targeting these customer segments with specific marketing strategies and product features.
The analysis is expected to reveal different groups of customers with varying preferences for shoe brands and features. These groups may be based on factors such as age, gender, location, and socioeconomic status. Certain segments may prefer high-end brands with premium prices and luxury features, while others may prioritize comfort and affordability over brand recognition. Nike may also turn out to be the top-ranked brand in some customer segments while other brands dominate elsewhere. This information can be used to guide strategic decision-making by companies looking to compete in the global footwear market.
Data Preparation
Clustering is typically applied to unlabeled numeric data, meaning that the data points have no predefined labels or categories. In real-world applications, however, the data often comes in other formats, such as customer reviews or other text, which are not numeric.
To apply clustering to such data, one common approach is to use a count vectorizer to convert the text into numeric data. A count vectorizer represents text as a matrix of word counts, where each row corresponds to a document or review and each column corresponds to a unique word in the corpus. Each entry in the matrix is the number of times that word appears in that document.
Once the text has been transformed in this way, clustering algorithms such as k-means or hierarchical clustering can be applied to it to group similar customer reviews together based on their content. This makes the count vectorizer a useful tool for gaining insights from unstructured data.
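To illustrate the idea, here is a minimal count vectorizer written in plain Python. It is a sketch under simple assumptions (lowercasing, a basic regular-expression tokenizer, no stop-word removal) and is not the code used in the project; libraries such as scikit-learn provide a full-featured CountVectorizer class for real use. The function name and sample reviews are made up.

```python
import re
from collections import Counter


def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is the number of
    times vocabulary[j] appears in documents[i]."""
    # Assumed tokenizer: lowercase, keep runs of letters, digits, apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    # One column per unique word in the corpus, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Sample reviews (made up for illustration).
reviews = ["Great shoes, great price", "Comfortable shoes"]
vocabulary, matrix = count_vectorize(reviews)
print(vocabulary)  # ['comfortable', 'great', 'price', 'shoes']
print(matrix)      # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Each row of the resulting matrix is a numeric vector for one review, which is the form that distance-based clustering algorithms expect.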

Code
Hierarchical Clustering : https://github.com/poonamthakur08/Nike-Footwear-Market-Analysis/blob/main/hierarchicalClustering.rmd
K-means Clustering : https://github.com/poonamthakur08/Nike-Footwear-Market-Analysis/blob/main/KmeansClustering.ipynb
Results
Visualization
Top 10 words of clusters (k=2)

Top 10 words of clusters (k=3)

Top 10 words of clusters (k=4)
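Top-word lists such as the ones captioned above could be derived by summing the word counts of all reviews assigned to each cluster and keeping the most frequent words. The sketch below shows that approach in plain Python; it is an assumption about one reasonable way to do it, not necessarily how the linked notebook computes its lists, and the function name and toy inputs are hypothetical.

```python
from collections import Counter


def top_words_per_cluster(vocabulary, matrix, labels, n=10):
    """Map each cluster label to its n most frequent words.

    vocabulary: list of words, one per matrix column
    matrix:     word-count rows, one per document
    labels:     cluster label of each document
    """
    totals = {}
    for row, label in zip(matrix, labels):
        counter = totals.setdefault(label, Counter())
        for word, count in zip(vocabulary, row):
            if count:
                counter[word] += count
    return {
        label: [word for word, _ in counter.most_common(n)]
        for label, counter in totals.items()
    }


# Toy inputs (made up): three documents, two clusters.
vocabulary = ["comfortable", "great", "price", "shoes"]
matrix = [[0, 2, 1, 1], [1, 0, 0, 1], [2, 0, 0, 1]]
labels = [0, 1, 1]
print(top_words_per_cluster(vocabulary, matrix, labels, n=2))
```

Ranking by raw counts favors words that are common everywhere, so in practice stop words are often removed before the top words are read as a description of a cluster.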

Conclusion
In conclusion, the k-means and hierarchical clustering analyses provide insights into the similarities and differences between Nike and other leading shoe brands in the global footwear market based on customer reviews. The analyses suggest that Nike is often clustered together with other popular brands, indicating that it is not necessarily the only top-ranked brand in the market. Other factors such as pricing, availability, and brand reputation may also play a significant role in determining a brand's success. Overall, these clustering techniques offer a valuable tool for understanding the complex relationships between different shoe brands and their customers.