Filter and Cluster

M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025


This lab filters, clusters, and visualizes the metadata of the CORD-19 dataset, using kagglehub, pandas, KeyBERT, BERTopic, and UMAP.

Summary

  • Data Download: The notebook starts by downloading the CORD-19 dataset metadata using the kagglehub library.
  • Data Filtering: The metadata is then filtered to keep only the entries with a non-null description using pandas DataFrame operations.
  • Keyword Extraction: The KeyBERT library is planned for extracting relevant keywords from the descriptions in the filtered metadata.
  • Clustering: The BERTopic library is intended to cluster the filtered entries based on the extracted keywords and semantic similarity.
  • Dimension Reduction: The UMAP library is planned for reducing the dimensionality of the clustered data for visualization.
  • Visualization: Finally, the results of the clustering and dimension reduction are visualized.
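
The download and filtering steps could be sketched as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the helper names `download_metadata` and `filter_metadata` are hypothetical, the dataset handle is supplied by the caller, the file name `metadata.csv` and the `abstract` column (standing in for the description) are assumed, and a toy DataFrame replaces the real file so the filter can be tried without a download.

```python
from pathlib import Path

import pandas as pd


def download_metadata(handle: str) -> pd.DataFrame:
    """Download a Kaggle dataset and load its metadata table.

    Assumption: the dataset contains a `metadata.csv` file at its top level.
    """
    import kagglehub  # imported lazily; only needed for the real download

    dataset_dir = Path(kagglehub.dataset_download(handle))
    return pd.read_csv(dataset_dir / "metadata.csv", low_memory=False)


def filter_metadata(metadata: pd.DataFrame, text_column: str = "abstract") -> pd.DataFrame:
    """Keep only the rows whose text column is non-null."""
    return metadata.dropna(subset=[text_column]).reset_index(drop=True)


# Toy stand-in for the real metadata (hypothetical rows).
toy_metadata = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
        "abstract": ["Spike protein binding.", None, float("nan"), "Vaccine trial results."],
    }
)

filtered = filter_metadata(toy_metadata)
print(filtered["title"].tolist())  # ['Paper A', 'Paper D']
```

On the real data, the call would look like `filter_metadata(download_metadata(handle))`, where `handle` is the Kaggle handle of the CORD-19 dataset.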

Theoretical Background

  • CORD-19 Dataset: The CORD-19 dataset is a collection of research articles related to COVID-19. The metadata contains information about each article, such as title, authors, abstract, and publication date.
  • Keyword Extraction: KeyBERT is a keyword extraction technique that uses BERT embeddings to identify the most relevant words or phrases in a document. It can be used to extract the main topics or themes of a text.
  • Topic Modeling: BERTopic is a topic modeling technique that leverages BERT embeddings and clustering algorithms to group similar documents together. It can be used to discover hidden topics in a collection of texts.
  • Dimension Reduction: UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that can be used to visualize high-dimensional data in a lower-dimensional space. It aims to preserve the local structure of the data while reducing its complexity.
  • HDBSCAN Clustering: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) performs DBSCAN over varying epsilon values and integrates the results to find the clustering that is most stable over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN) and makes it more robust to parameter selection.
  • Visualization: The visualization step aims to represent the clustered data in a human-readable format, such as a scatter plot or a network diagram. This allows users to explore the relationships between different documents and topics.
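
To make the HDBSCAN point concrete, the sketch below shows why a single epsilon can fail on data of varying density. It is an illustration of the motivation only, not HDBSCAN itself: the function `dbscan_1d` and the toy points are hypothetical, and the function relies on the fact that on one-dimensional data DBSCAN with `min_samples=2` reduces to splitting the sorted points wherever the gap exceeds epsilon and treating isolated points as noise.

```python
def dbscan_1d(points, eps):
    """DBSCAN on 1-D data with min_samples=2.

    Sorted points are split wherever the gap to the previous point exceeds
    eps; groups of one point are reported as noise.
    """
    ordered = sorted(points)
    groups, current = [], [ordered[0]]
    for value in ordered[1:]:
        if value - current[-1] <= eps:
            current.append(value)
        else:
            groups.append(current)
            current = [value]
    groups.append(current)
    clusters = [group for group in groups if len(group) >= 2]
    noise = [group[0] for group in groups if len(group) == 1]
    return clusters, noise


# Two dense clusters (spacing 1, separated by a gap of 8) and one sparse
# cluster (spacing 10). The intended answer is three clusters.
dense_a = [0, 1, 2, 3]
dense_b = [11, 12, 13, 14]
sparse = [100, 110, 120, 130]
points = dense_a + dense_b + sparse

for eps in (2, 15):
    clusters, noise = dbscan_1d(points, eps)
    print(f"eps={eps}: cluster sizes {[len(c) for c in clusters]}, noise points {len(noise)}")
```

With the small epsilon the two dense clusters are separated but the sparse one is labeled as noise; with the large epsilon the sparse cluster is found but the dense ones merge. No single epsilon yields all three clusters, which is the situation that examining a range of epsilon values is meant to handle.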

Libraries Used

  • kagglehub: For downloading the CORD-19 dataset.
  • pandas: For data manipulation and analysis.
  • KeyBERT: For keyword extraction.
  • BERTopic: For clustering.
  • UMAP: For dimension reduction.
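
The keyword extraction and clustering libraries listed above might be combined roughly as follows. This is a hedged sketch rather than the notebook's implementation: the function names `extract_keywords_per_doc` and `fit_topics` are hypothetical, all parameter values are illustrative assumptions, and the `hdbscan` package is assumed to be installed alongside BERTopic. The imports are placed inside the functions so that the block can be loaded even where the libraries are not yet installed.

```python
def extract_keywords_per_doc(docs, top_n=5):
    """Return a list of (keyword, score) pairs for each document."""
    from keybert import KeyBERT  # lazy import: requires the keybert package

    kw_model = KeyBERT()  # uses the library's default embedding model
    return kw_model.extract_keywords(
        docs,
        keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
        stop_words="english",
        top_n=top_n,
    )


def fit_topics(docs, min_cluster_size=10, seed=42):
    """Fit BERTopic with explicit UMAP and HDBSCAN components."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # UMAP reduces the document embeddings before clustering.
    umap_model = UMAP(
        n_neighbors=15,
        n_components=5,
        min_dist=0.0,
        metric="cosine",
        random_state=seed,  # fixed seed for repeatable runs
    )
    # HDBSCAN clusters the reduced embeddings; outliers get topic -1.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size,
        metric="euclidean",
        cluster_selection_method="eom",
        prediction_data=True,
    )
    topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
    topics, _probabilities = topic_model.fit_transform(docs)
    return topic_model, topics
```

Given a list of description strings `docs`, `fit_topics(docs)` returns the fitted model and one topic id per document; `topic_model.get_topic_info()` then summarizes the discovered topics as a table.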