Filter and Cluster

M.Sc course, University of Debrecen, Department of Data Science and Visualization, 2025

Colab

The labor aims to filter, cluster and visualize the CORD-19 dataset metadata. It uses various libraries and techniques for this purpose.

Summary

Data Download: The notebook starts by downloading the CORD-19 dataset metadata using the kagglehub library.
Data Filtering: The metadata is then filtered to keep only the entries with a non-null description using pandas DataFrame operations.
Keyword Extraction: The KeyBERT library is planned to be used to extract relevant keywords from the descriptions of the filtered metadata.
Clustering: The BERTTopic library is intended to be applied to cluster the metadata based on the extracted keywords and semantic similarity.
Dimension Reduction: The Umap library is planned for reducing the dimensionality of the clustered data for visualization purposes.
Visualization: Finally, the results of the clustering and dimension reduction are visualized.

Theoretical Background

CORD-19 Dataset: The CORD-19 dataset is a collection of research articles related to COVID-19. The metadata contains information about each article, such as title, authors, abstract, and publication date.
Keyword Extraction: KeyBERT is a keyword extraction technique that uses BERT embeddings to identify the most relevant words or phrases in a document. It can be used to extract the main topics or themes of a text.
Topic modelling: BERTTopic is a topic modeling technique that leverages BERT embeddings and clustering algorithms to group similar documents together. It can be used to discover hidden topics in a collection of texts.
Dimension Reduction: Umap (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that can be used to visualize high-dimensional data in a lower-dimensional space. It preserves the local structure of the data while reducing its complexity.
HDBSCAN clustering: HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
Visualization: The visualization step aims to represent the clustered data in a human-readable format, such as a scatter plot or a network diagram. This allows users to explore the relationships between different documents and topics.

Libraries Used

kagglehub: For downloading the CORD-19 dataset.
pandas: For data manipulation and analysis.
KeyBERT: For keyword extraction.
BERTTopic: For clustering.
Umap: For dimension reduction.

Róbert Lakatos

Colab

Summary

Theoretical Background

Libraries Used