Skip to the content.

What are the best practices for building datasets that change over time on Kaggle?

This is a question I have asked myself a lot over the last 8 months or so that I’ve been putting in the time to build good datasets on this platform.

In the past, maintaining evolving datasets presented somewhat of a challenge, often impeding my ability to refresh them as regularly as I would have hoped.

I used to manually re-scrape data via a notebook and then upload the resulting CSV files - a method that was not only time consuming but also inefficient.

The need to input all attributes manually via the user interface increased the likelihood of errors. This process was a significant hindrance, making me hesitant to jump into the creation of such datasets.

Thankfully, there’s a better way.

Pipeline Best Practices

The pillars for managing Kaggle datasets efficiently while maintaining comprehensive control through Python include:

  • Kaggle API
  • GitHub workflows via Actions/Secrets
  • Effective Monitoring/Error Handling
  • The Kaggle API is well documented and gives you all the functionality you’d expect to get out of it.
  • GitHub workflows allow you to schedule updates that push datasets to Kaggle. This frees me up to not have to manually update every dataset I maintain and build new ones.
  • Monitoring allows you to better identify errors in the pipeline and where exactly things went wrong. This happens to me all the time with web scraping because the websites themselves change along with the data. For testing/monitoring I’m using the Evidently framework that is meant for all aspects of data monitoring.

I’ve designed a template repository, embodying the basic structure that I’ve employed to maintain my Metallica Songs dataset on Kaggle. This framework is something I plan to continue using for future datasets. I hope you’ll explore it! 🤘

Important Links

Template Repo

Metallica Example Pipeline Repo

Metallica Song Data

Metallica Song Analysis Notebook

Evidently docs

My Evidently Discussion Post

Kaggle Post

Back to Home