Simple Keyword Clusterer

Description 💬

A weekend project that turned out to be quite useful for real-world tasks. It uses TF-IDF vectorization to convert keywords from any domain into vectors, which are fed to a KMeans clustering algorithm. The vectors are then reduced to two dimensions with Principal Component Analysis (PCA) so the clusters can be plotted.
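To make the three stages concrete, here is a minimal sketch of the same idea written directly against scikit-learn. It is not this package's own code: the keyword list, the choice of two clusters, and all variable names are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input (assumption): a short list of job roles.
keywords = [
    "data scientist", "senior data scientist", "data analyst",
    "software engineer", "backend software engineer", "frontend engineer",
]

# 1. Convert each keyword into a TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keywords)

# 2. Cluster the TF-IDF vectors with KMeans (k=2 is fixed here for brevity).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# 3. Project the vectors to 2D with PCA for plotting.
#    PCA expects a dense array, so the sparse TF-IDF matrix is converted first.
coords = PCA(n_components=2, random_state=42).fit_transform(X.toarray())

for keyword, label, (x, y) in zip(keywords, labels, coords):
    print(f"{keyword!r}: cluster {label}, position ({x:.2f}, {y:.2f})")
```

The `coords` array holds one (x, y) pair per keyword and can be handed to any plotting library.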

You simply pass a list of keywords (for instance, a list of job roles), and the algorithm returns the clusters, each labelled by the most representative element of its group.
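One plausible way to pick the "most representative element" is to take the keyword whose vector lies closest to its cluster centroid. The sketch below works under that assumption; the package may use a different rule, and the data and names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative input (assumption).
keywords = [
    "data scientist", "senior data scientist", "data analyst",
    "software engineer", "backend software engineer", "frontend engineer",
]

# Dense TF-IDF vectors keep the distance computation below simple.
X = TfidfVectorizer().fit_transform(keywords).toarray()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Assumption: the representative of a cluster is the member keyword
# nearest (Euclidean distance) to that cluster's centroid.
cluster_names = {}
for cluster_id, center in enumerate(kmeans.cluster_centers_):
    members = np.flatnonzero(labels == cluster_id)
    distances = np.linalg.norm(X[members] - center, axis=1)
    cluster_names[cluster_id] = keywords[members[np.argmin(distances)]]

# Pair every keyword with the name of the cluster it belongs to.
named = [(kw, cluster_names[int(label)]) for kw, label in zip(keywords, labels)]
for kw, name in named:
    print(f"{kw!r} -> cluster {name!r}")
```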


The algorithm automatically finds the optimal number of clusters, but you can also set it yourself by passing an additional argument to the constructor.
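The README does not say how the number of clusters is chosen. A common approach, shown here purely as an assumption, is to fit KMeans for a range of candidate values and keep the one with the highest silhouette score. `pick_n_clusters` is a hypothetical helper, not part of this package.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def pick_n_clusters(X, k_min=2, k_max=10, random_state=42):
    """Hypothetical helper: return the k with the highest silhouette score."""
    # The silhouette score is only defined for 2 <= k <= n_samples - 1.
    k_max = min(k_max, X.shape[0] - 1)
    best_k, best_score = k_min, float("-inf")
    for k in range(k_min, k_max + 1):
        labels = KMeans(
            n_clusters=k, n_init=10, random_state=random_state
        ).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Illustrative input (assumption).
keywords = [
    "data scientist", "senior data scientist", "data analyst",
    "software engineer", "backend software engineer", "frontend engineer",
]
X = TfidfVectorizer().fit_transform(keywords)
best_k = pick_n_clusters(X)
print("chosen number of clusters:", best_k)
```

Other criteria, such as the elbow method on inertia, would slot into the same loop.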

Goal of the project 🎯

Allow cluster extraction from unordered and unstructured textual data in a straightforward way.

Team 🤼

  • Me
  • Anyone who wants to contribute and improve the codebase 🙂

Features 🦓

The software offers the following features:

  • Preprocessing of keywords (stopword removal, string sanitization)
  • Vectorization through TF-IDF to find the most relevant words across the whole corpus of keywords
  • KMeans clustering and PCA decomposition
  • Plotting
  • Output as a pandas DataFrame
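As an illustration of the preprocessing step listed above, the following sketch lowercases a keyword, strips punctuation, and drops stopwords. The stopword set and the `clean_keyword` function are assumptions for the example; the package's actual rules may differ.

```python
import re

# Tiny illustrative stopword set (assumption); a real list would be longer.
STOPWORDS = {"a", "an", "and", "of", "the", "for", "in"}


def clean_keyword(keyword):
    """Hypothetical helper: sanitize a keyword and remove stopwords."""
    text = keyword.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Split on whitespace (this also collapses repeated spaces) and filter.
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(clean_keyword("Head of Sales & Marketing!"))  # head sales marketing
```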

Upcoming features

  • Annotations in the plot to identify the cluster centers
  • More preprocessing options
  • GridSearch on the pipeline for ad-hoc tuning

Want to know more?

Drop me a line on Twitter or LinkedIn, or contact me through the form on the homepage.