
How do I upload data in Google Colab?

Published in Google Colab Data Management 7 mins read

You can upload data to Google Colab in several ways; mounting Google Drive is the recommended approach when you need persistent storage and seamless integration. Other popular methods include direct file upload from your machine, fetching from URLs, cloning GitHub repositories, and downloading via the Kaggle API.

Key Methods to Upload Data in Google Colab

Google Colab offers flexibility for data input, catering to different needs from temporary uploads to persistent cloud storage. Understanding each method helps you choose the most efficient way for your specific workflow.

1. Mounting Google Drive (Recommended for Persistent Storage)

Mounting your Google Drive directly within Colab is the most robust and widely used method, especially for datasets you intend to use repeatedly or store long-term. This leverages Google Drive's official integration with Colab, providing a permanent cloud storage solution for your files.

  • Why use it?

    • Permanent Storage: Files remain on your Google Drive even after your Colab runtime disconnects.
    • Official Integration: Designed for seamless interaction, making it reliable.
    • Easy Access: Once mounted, your Drive files appear as a local directory.
    • Collaboration: Easily share datasets via Google Drive.
  • How to Mount Google Drive:

    1. Execute the following Python code in a Colab cell:
      from google.colab import drive
      drive.mount('/content/drive')
    2. A new window or link will appear, prompting you to authenticate your Google account. Follow the steps to grant Colab access to your Google Drive.
    3. Once successfully mounted, you can access your files under the path /content/drive/MyDrive/. For example, a file named my_data.csv in your Drive's root would be at /content/drive/MyDrive/my_data.csv.

    Example:

    import pandas as pd
    
    # Mount Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    
    # Load a CSV file from your Google Drive
    df = pd.read_csv('/content/drive/MyDrive/path/to/your_data.csv')
    print("Data loaded successfully from Google Drive!")
    print(df.head())

    For more details, refer to the official Google Colab documentation.
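The mount is read-write, so you can also persist results back to Drive before the runtime shuts down. A minimal sketch (the `experiments` folder name is an assumption; on Colab you would point `output_dir` at a path under `/content/drive/MyDrive`, while the local fallback below lets the snippet run anywhere):

```python
import os
import pandas as pd

# On Colab, after drive.mount(), use a Drive path so results survive the
# runtime, e.g. output_dir = '/content/drive/MyDrive/experiments'.
output_dir = 'outputs'  # local stand-in so the sketch runs anywhere
os.makedirs(output_dir, exist_ok=True)

# Write some results to the chosen folder.
df = pd.DataFrame({'epoch': [1, 2, 3], 'loss': [0.9, 0.5, 0.3]})
path = os.path.join(output_dir, 'metrics.csv')
df.to_csv(path, index=False)
print(f'Saved {path}')
```

Anything written under `/content/drive/MyDrive/` is uploaded to your Drive and remains available in future sessions.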

2. Direct File Upload from Your Local Machine

For smaller datasets or files you only need for the current session, you can upload them directly from your computer.

  • Why use it?

    • Quick: Ideal for one-off tasks or small files.
    • Simple: No external services needed.
  • How to Upload Files Directly:

    1. Use the files module from google.colab:
      from google.colab import files
      uploaded = files.upload()
    2. After running the cell, a "Choose Files" button will appear. Click it and select the files from your local machine.
    3. The uploaded files will be available in the Colab runtime's /content/ directory.

    Example:

    from google.colab import files
    import pandas as pd
    import io
    
    uploaded = files.upload() # This will open a file picker
    
    for fn in uploaded.keys():
      print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')
      # To read a CSV file directly from the uploaded bytes:
      if fn.endswith('.csv'):
          df = pd.read_csv(io.BytesIO(uploaded[fn]))
          print("Data loaded successfully from local upload!")
          print(df.head())

    Note: Files uploaded this way are temporary and will be deleted when your Colab runtime restarts or disconnects.
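`files.upload()` also writes each uploaded file to disk in the current working directory, so you can read it by filename instead of decoding the returned bytes. A sketch (the file is created by hand here so the snippet runs outside Colab; in a real session the upload itself creates it):

```python
import pandas as pd

# On Colab, files.upload() would have written 'my_data.csv' into the
# working directory; here we create it manually so the sketch runs anywhere.
with open('my_data.csv', 'w') as f:
    f.write('name,value\na,1\nb,2\n')

# Read the uploaded file by name rather than from the returned bytes dict.
df = pd.read_csv('my_data.csv')
print(df.head())
```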

3. Fetching Data from a URL

If your data is publicly accessible via a URL (e.g., CSV, JSON, images, .txt files), you can download it directly into your Colab environment.

  • Why use it?

    • Convenient: No manual upload needed if data is online.
    • Reproducible: Easily share notebooks that fetch data directly.
  • How to Fetch Data from a URL:

    • Using pandas for tabular data:
      import pandas as pd
      url = 'https://raw.githubusercontent.com/datasets/titanic/master/data/titanic_train.csv'
      df = pd.read_csv(url)
      print("Data loaded successfully from URL!")
      print(df.head())
    • Using !wget for any file type:
      !wget -O cal_housing.tgz https://ndownloader.figshare.com/files/5976036
      !tar -xzf cal_housing.tgz
      import pandas as pd
      # This archive extracts to CaliforniaHousing/cal_housing.data, a
      # headerless CSV, so pass header=None when reading it.
      df = pd.read_csv('CaliforniaHousing/cal_housing.data', header=None)
      print("Data loaded successfully using wget!")
      print(df.head())
    • Using urllib.request for more control:
      import urllib.request
      image_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/Image_created_with_a_neural_network_dream_of_a_flower_with_a_cat_in_the_middle.jpg/1280px-Image_created_with_a_neural_network_dream_of_a_flower_with_a_cat_in_the_middle.jpg'
      urllib.request.urlretrieve(image_url, 'cat_flower.jpg')
      print("Image downloaded successfully!")
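Because session storage is wiped on restart, URL downloads tend to be repeated; within a session, a small caching helper skips the download when the file is already present. This is a sketch, not a Colab API — `fetch` is a hypothetical helper name:

```python
import os
import urllib.request

def fetch(url, filename):
    """Download url to filename, skipping the download if it already exists."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

# First call downloads; later calls in the same session return instantly, e.g.:
# fetch('https://raw.githubusercontent.com/datasets/titanic/master/data/titanic_train.csv',
#       'titanic_train.csv')
```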

4. Cloning from GitHub Repositories

For projects managed on GitHub, you can clone entire repositories directly into your Colab session.

  • Why use it?

    • Version Control: Integrate with your existing GitHub workflows.
    • Project Structure: Download all project files, scripts, and data at once.
  • How to Clone a GitHub Repository:

    1. Use the !git clone command in a Colab cell, replacing [repository_url] with your GitHub repository's HTTPS clone URL.
      !git clone [repository_url]
    2. The repository will be cloned into a new folder in your /content/ directory.

    Example:

    !git clone https://github.com/googlecolab/colabtools.git
    # Now you can navigate into the cloned directory and access its files
    !ls colabtools
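A cloned repository is just an ordinary folder, so you can locate its data files programmatically with pathlib. A sketch (it builds a stand-in directory so it runs anywhere; on Colab you would point `repo` at `Path('/content/colabtools')` or your own clone):

```python
from pathlib import Path

# Stand-in for a cloned repository; on Colab, use e.g.
# repo = Path('/content/your-repo') after !git clone.
repo = Path('demo_repo')
(repo / 'data').mkdir(parents=True, exist_ok=True)
(repo / 'data' / 'sample.csv').write_text('a,b\n1,2\n')

# Recursively find every CSV anywhere under the repository.
csv_files = sorted(repo.rglob('*.csv'))
print(csv_files)
```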

5. Using the Kaggle API

If you work extensively with datasets from Kaggle, their API provides a direct way to download data into Colab.

  • Why use it?

    • Large Datasets: Efficiently download large competition or public datasets.
    • Automation: Automate data retrieval for competitions.
  • How to Use the Kaggle API:

    1. Install the Kaggle API client:

      !pip install kaggle
    2. Upload your kaggle.json API token:

      • Go to your Kaggle account settings, click "Create New API Token" to download kaggle.json.

      • Upload this file to Colab using files.upload() (as described in method 2).

      • Move the kaggle.json file to the correct directory (~/.kaggle) and set permissions:

        from google.colab import files
        files.upload() # Upload kaggle.json
        
        !mkdir -p ~/.kaggle
        !mv kaggle.json ~/.kaggle/
        !chmod 600 ~/.kaggle/kaggle.json
    3. Download a dataset:

      !kaggle datasets download -d [dataset-owner]/[dataset-name]
      # Example: !kaggle datasets download -d ankitbansal06/retail-orders-dataset
      !unzip [dataset-name].zip # Most Kaggle datasets are zipped

      Example:

      !pip install kaggle
      from google.colab import files
      files.upload() # Upload kaggle.json from your local machine

      !mkdir -p ~/.kaggle
      !mv kaggle.json ~/.kaggle/
      !chmod 600 ~/.kaggle/kaggle.json

      !kaggle datasets download -d ankitbansal06/retail-orders-dataset
      !unzip retail-orders-dataset.zip

      import pandas as pd
      # The CSV filename inside the zip depends on the dataset; run !ls
      # after unzipping and adjust the filename below to match.
      df = pd.read_csv('Retail_Orders_Dataset.csv')
      print("Kaggle data loaded successfully!")
      print(df.head())
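As an alternative to uploading kaggle.json every session, the Kaggle API also reads credentials from the KAGGLE_USERNAME and KAGGLE_KEY environment variables, which you can set before calling the CLI. The values below are placeholders — substitute the "username" and "key" fields from your own kaggle.json:

```python
import os

# Placeholder credentials: copy "username" and "key" from your kaggle.json.
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_api_key'

# With the variables set, the CLI works without ~/.kaggle/kaggle.json:
# !kaggle datasets download -d owner/dataset-name
```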

Summary of Data Upload Methods

| Method | Description | Best For | Persistence | Ease of Use |
|---|---|---|---|---|
| Mount Google Drive | Connects your personal Google Drive to Colab. | Permanent storage, large datasets, shared files, recurring projects. | Persistent (stored on Drive) | High |
| Direct Local Upload | Uploads files from your computer via a file picker. | Small, temporary files, quick tests, one-time use. | Temporary (deleted on runtime reset) | High |
| Fetch from URL | Downloads data directly from a public web address. | Publicly available datasets, web-hosted files (CSV, JSON, images), reproducible data fetching. | Temporary (downloads to session), but easily re-downloadable | High |
| Clone GitHub Repo | Downloads an entire GitHub repository. | Projects with code and data on GitHub, version-controlled datasets, reproducible research. | Temporary (cloned to session), but easily re-cloned | Medium |
| Kaggle API | Uses the Kaggle API to download datasets directly from Kaggle. | Large Kaggle datasets, competitions, automated data acquisition. | Temporary (downloaded to session), but easily re-downloadable via API | Medium |

Choosing the right method depends on data size, permanence requirements, and where your data lives. For most scenarios, especially when you need to retain data across sessions, mounting Google Drive is the recommended approach.