How to access datasets directly from Kaggle
Mar 2, 2022
- Kaggle is one of the largest data science community platforms that provides access to various datasets, competitions, resources, and powerful tools to practice data science and machine learning.
- Kaggle allows us to use its datasets by downloading them or by using its API.
- In this article, we will look at the latter: using the API key that Kaggle provides, which can be stored anywhere on your Google Drive.
Prerequisites
To follow along with this article, you need a Kaggle account (to generate the API key) and a Google account (to use Google Colab).
Generating the API Key
To generate the Kaggle API Key, follow the given steps:
- Log in to your kaggle.com account.
- On the top right corner, you can see your profile. On clicking it, you will see an option to view Your Profile, Account Settings, or Logout. Click on the Account Settings (indicated by Gear icon).
- On your account page, you can scroll down till you see an API section. In this section, you can see a Create New API Token button. Click on it.
- You will be given a JSON file named kaggle.json that contains the API Key that is private only to your account and must not be shared.
- You need to store this file in a folder named .kaggle in your home directory (~/.kaggle/kaggle.json), as the Kaggle API client looks for it there by default.
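For reference, the downloaded token is a small JSON file with two fields, `username` and `key`. Below is a minimal sketch of a sanity check you could run on it before going further. The helper names `load_kaggle_token` and `mask` are my own illustrative choices and are not part of Kaggle's tooling.

```python
import json
from pathlib import Path


def load_kaggle_token(path):
    """Read a kaggle.json file and check that it has the expected fields.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    token = json.loads(Path(path).read_text())
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return token


def mask(secret, visible=4):
    """Hide all but the last few characters so the key is safe to print."""
    return "*" * max(len(secret) - visible, 0) + secret[-visible:]
```

Calling `mask(load_kaggle_token("kaggle.json")["key"])` lets you confirm the file is readable without showing the full key on screen.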
Setting things up
- In this article, I will showcase how to access the token through Google Drive.
- Before running the required scripts, you first need to upload your kaggle.json file to Google Drive.
- Meanwhile, you can create a new Colab notebook to follow along with this article.
- After you have uploaded the file, you need to mount your Drive storage in your new Colab notebook using the following code:
from google.colab import drive
drive.mount('/content/drive')
- You will be prompted to give access to your drive storage by selecting your account and authenticating using a key.
- Now that you have mounted your drive, we can download and import all the necessary libraries on this colab instance.
- Starting with the required libraries, we will first install the kaggle and kaggle-cli libraries using the following commands:
!pip install -q kaggle
!pip install -q kaggle-cli
- Now, you need to run the script below, which creates a folder named .kaggle in the home directory of the Colab instance, copies the kaggle.json file from your Drive into it, prints the file's contents, and restricts the permissions so that only you can read and write the kaggle.json file:
!mkdir -p ~/.kaggle
!cp "/content/drive/MyDrive/kaggle.json" ~/.kaggle/
!cat ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
- The output should be your Kaggle username and your API key. Since this output contains your key, clear it before sharing the notebook. We are now set to download the datasets.
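The same setup can also be done from Python without echoing the key into the notebook output. This is a minimal sketch using only the standard library. The function name `install_kaggle_token` is hypothetical, and the default source path assumes kaggle.json sits at the top level of My Drive, as in the commands above.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(src="/content/drive/MyDrive/kaggle.json", home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper mirroring the mkdir / cp / chmod commands.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # like `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)                   # like `cp`
    os.chmod(target, 0o600)                        # like `chmod 600`
    return target
```

Running `install_kaggle_token()` in a cell after mounting Drive should leave the token at ~/.kaggle/kaggle.json with owner-only permissions.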
Accessing a publicly available dataset
- To download a dataset, you need to copy the part of the URL after kaggle.com/, i.e. the username of the uploader followed by the name of the dataset they uploaded.
- And the required command will be in the form:
!kaggle datasets download -d username/dataset_name
- The dataset accessed in this example is available at the following URL:
https://www.kaggle.com/nicholasjhana/energy-consumption-generation-prices-and-weather
- So you need to copy:
nicholasjhana/energy-consumption-generation-prices-and-weather
- The command should look like this:
!kaggle datasets download -d nicholasjhana/energy-consumption-generation-prices-and-weather
- You can see the download progress and later check that the files are visible on the left side of your colab interface.
- But the data is in a zip file. You can extract the contents using the following command:
!unzip /content/energy-consumption-generation-prices-and-weather.zip
- You can now use the pandas library to check the data.
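As a rough sketch of that last step, the archive can also be extracted from Python and each CSV previewed with pandas. The helper names `extract_csvs` and `preview` are hypothetical, and the code makes no assumption about which files the archive contains: it simply picks up whatever ends in .csv.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, out_dir="."):
    """Extract a downloaded archive and return the CSV files it contained."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    return sorted(out_dir / name for name in names if name.lower().endswith(".csv"))


def preview(csv_path, rows=5):
    """Load only the first few rows of a CSV into a DataFrame."""
    import pandas as pd  # imported lazily so extract_csvs works without pandas

    return pd.read_csv(csv_path, nrows=rows)
```

In the notebook, looping over `extract_csvs("/content/energy-consumption-generation-prices-and-weather.zip", "/content")` and calling `preview(path)` on each result gives a quick look at the columns before loading the full files.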
Accessing a Competition dataset
- The procedure is the same, except that you first need to accept the terms and conditions of the competition on its Kaggle page.
- To download the dataset here, you need to copy the part of the URL after kaggle.com/c/, i.e. the competition name.
- And the required command will be in the form:
!kaggle competitions download -c competition_name
- The competition accessed in this example is available at: https://www.kaggle.com/c/tabular-playground-series-feb-2022
- So you need to copy: tabular-playground-series-feb-2022.
- The command should look like this:
!kaggle competitions download -c tabular-playground-series-feb-2022
- Again, the file is zipped, but you can extract it using the !unzip command.
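Copying the right part of the URL by hand is easy to get wrong, so here is a minimal sketch that derives the download command from a URL. The function `kaggle_download_command` is a hypothetical helper, not part of the Kaggle API. It assumes the two URL shapes used in this article (/owner/dataset and /c/competition) and, as a further assumption, the /datasets/... and /competitions/... forms.

```python
from urllib.parse import urlparse


def kaggle_download_command(url):
    """Build the CLI download command from a Kaggle dataset or competition URL.

    Hypothetical helper; only the URL shapes named above are handled.
    """
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Competition URLs: /c/<name> or /competitions/<name>
    if len(parts) >= 2 and parts[0] in ("c", "competitions"):
        return f"kaggle competitions download -c {parts[1]}"
    # Dataset URLs may carry a leading /datasets/ segment
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    # What remains should be <owner>/<dataset>
    if len(parts) >= 2:
        return f"kaggle datasets download -d {parts[0]}/{parts[1]}"
    raise ValueError(f"Not a dataset or competition URL: {url}")
```

In Colab, the returned string can be pasted into a cell with a leading `!`, exactly like the commands shown earlier.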
Conclusion
- And that’s actually it…
- You can access the notebook that I have created for your reference here.
- All you need to do is generate your API key and upload it to your Google Drive before running the above notebook.