andrea-dagostino/simple_keyword_clusterer
A simple machine learning package to cluster keywords in higher-level groups.

Example:
- "Senior Frontend Engineer" --> "Frontend Engineer"
- "Junior Backend developer" --> "Backend developer"

Install with pip:

    pip install simple_keyword_clusterer

The algorithm will find the optimal number of clusters automatically based on the best Silhouette Score.
A weekend project that ended up being quite useful for real-world tasks. It uses TF-IDF vectorization to convert keywords from any context into vectors, which are fed to a KMeans clustering algorithm. The vectors are then reduced with Principal Component Analysis (PCA) to obtain a 2D representation.
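To make the three stages concrete, here is a minimal sketch written directly against scikit-learn. It is not the package's own code: the sample keywords, the fixed `n_clusters=2` and the exact order of the steps are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input, not taken from the project.
keywords = [
    "Senior Frontend Engineer",
    "Frontend Engineer",
    "Junior Frontend Developer",
    "Junior Backend Developer",
    "Backend Developer",
    "Senior Backend Engineer",
]

# 1. TF-IDF turns each keyword into a vector of term weights.
vectors = TfidfVectorizer().fit_transform(keywords).toarray()

# 2. KMeans groups the vectors. The cluster count is hard-coded here (assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)

# 3. PCA projects the same vectors onto two components for a 2D view.
coords = PCA(n_components=2, random_state=0).fit_transform(vectors)

for keyword, label, (x, y) in zip(keywords, labels, coords):
    print(f"cluster={label}  x={x:+.2f}  y={y:+.2f}  {keyword}")
```

Note that clustering runs on the full TF-IDF vectors in this sketch, and PCA is used only to obtain coordinates for display.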
You simply pass a list of keywords (for instance, a list of job roles) and the algorithm will return the clusters, each labelled by its most representative element.
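The text does not say how the "most representative element" is chosen. One plausible reading, sketched below with scikit-learn and NumPy, is to pick the keyword whose vector lies closest to its cluster centroid. The function name `name_clusters` and the nearest-to-centroid rule are assumptions, not the package's documented behavior.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def name_clusters(keywords, n_clusters, random_state=0):
    """Return, for each keyword, the keyword nearest to its cluster centroid.

    Hypothetical helper: the nearest-to-centroid rule is an assumption.
    """
    vectors = TfidfVectorizer().fit_transform(keywords).toarray()
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    kmeans.fit(vectors)

    names = {}
    for cluster_id in range(n_clusters):
        # Indices of the keywords assigned to this cluster.
        members = np.flatnonzero(kmeans.labels_ == cluster_id)
        # Euclidean distance from each member to the cluster centroid.
        centroid = kmeans.cluster_centers_[cluster_id]
        distances = np.linalg.norm(vectors[members] - centroid, axis=1)
        names[cluster_id] = keywords[members[np.argmin(distances)]]

    return [names[label] for label in kmeans.labels_]


job_roles = [
    "Senior Frontend Engineer",
    "Frontend Engineer",
    "Junior Frontend Developer",
    "Junior Backend Developer",
    "Backend Developer",
    "Senior Backend Engineer",
]
for role, name in zip(job_roles, name_clusters(job_roles, n_clusters=2)):
    print(f"{role!r} --> {name!r}")
```

Every returned label is itself one of the input keywords, which matches the style of the "Senior Frontend Engineer" --> "Frontend Engineer" example.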
The algorithm automatically finds the optimal number of clusters, but you can also tune it yourself by passing an additional argument in the constructor.
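Choosing the number of clusters by Silhouette Score can be sketched as a simple search: fit KMeans for each candidate count and keep the count with the highest score. The helper name `best_n_clusters`, its parameters and the search range are hypothetical. The package's actual constructor argument is not shown here.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def best_n_clusters(keywords, k_min=2, k_max=None, random_state=0):
    """Return (best_k, scores) using the silhouette score as the criterion.

    Hypothetical helper; assumes the keywords produce distinct TF-IDF vectors.
    """
    vectors = TfidfVectorizer().fit_transform(keywords).toarray()

    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    upper = len(keywords) - 1
    k_max = upper if k_max is None else min(k_max, upper)

    scores = {}
    for k in range(k_min, k_max + 1):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = kmeans.fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)

    best_k = max(scores, key=scores.get)
    return best_k, scores


roles = [
    "Senior Frontend Engineer",
    "Frontend Engineer",
    "Junior Frontend Developer",
    "Junior Backend Developer",
    "Backend Developer",
    "Senior Backend Engineer",
]
best_k, scores = best_n_clusters(roles)
print("best k:", best_k)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

A higher silhouette score means points sit closer to their own cluster than to neighbouring ones, which is why the maximum is taken.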
Goal of the project 🎯
Allow cluster extraction from unordered and unstructured textual data in a straightforward way.
- Anyone who wants to contribute and improve the codebase 🙂
The software offers the following features:
- preprocessing of keywords (stopword removal, string sanitization)
- vectorization through TF-IDF to find the most relevant words across the whole corpus of keywords
- KMeans clustering and PCA decomposition
- Outputs a Pandas DataFrame
- Annotations in the plot to identify the cluster centers
- More preprocessing options
- GridSearch on the pipeline for ad-hoc tuning
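The preprocessing step listed above (stopword removal, string sanitization) might look roughly like the following standard-library sketch. The stopword set, the `preprocess` name and the exact cleaning rules are assumptions chosen for illustration. The package may use a different stopword list and different rules.

```python
import re

# Tiny illustrative stopword set (assumption, not the package's list).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "for", "in", "at"})


def preprocess(keyword, stopwords=STOPWORDS):
    """Lowercase a keyword, strip punctuation and drop stopwords."""
    text = keyword.lower()
    # Replace anything that is not an ASCII letter, digit or whitespace.
    # This also drops accented characters, which may not suit every corpus.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = [token for token in text.split() if token not in stopwords]
    return " ".join(tokens)


print(preprocess("Head of Engineering (Berlin)"))      # head engineering berlin
print(preprocess("  Senior   Front-End Engineer! "))   # senior front end engineer
```

Cleaning the strings before vectorization keeps punctuation and filler words from becoming TF-IDF terms of their own.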