
I began my Ph.D. journey in 2021, and that is when I started learning ML. At first I practiced different ML models by following tutorials on the internet. I was often confused by the different ways those tutorials loaded their datasets, because I was not sure which one suited my case. So I documented the dataset-loading mechanisms the tutorials used, and found that most of them follow only two or three.

Here I describe a few techniques for loading CSV datasets for your machine learning and deep learning models.

Method 1: Load data using basic Python modules

The first method I will discuss is a little lengthy and has two steps.

  • Download or load the raw data into RAM
  • Convert the data into a standard format such as a list or a NumPy array

Open File into RAM

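# Option 1: open the file from a remote URL
from urllib.request import urlopen
path = "path_of_the_data"
rawdata = urlopen(path)  # file-like object that reads the raw bytes from the URL

# Option 2: open the file from a local path
filename = "path_of_the_local_file"  # placeholder, replace with your own path
rawdata = open(filename, 'rt')  # r = read mode | t = text mode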
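Both snippets above give a file-like object, but `urlopen` yields bytes while `open(..., 'rt')` yields text. Below is a minimal sketch of one helper that hides that difference. The name `open_text` is hypothetical, and UTF-8 is an assumption about how the dataset is encoded.

```python
import io
from urllib.parse import urlparse
from urllib.request import urlopen


def open_text(source, encoding="utf-8"):
    """Return a text-mode file object for a URL or a local path.

    Hypothetical helper; assumes the data is encoded as UTF-8.
    """
    scheme = urlparse(source).scheme
    if scheme in ("http", "https", "ftp", "file"):
        # urlopen yields bytes, so wrap it to get decoded text lines
        return io.TextIOWrapper(urlopen(source), encoding=encoding, newline="")
    # Anything else is treated as a local path
    return open(source, "rt", encoding=encoding, newline="")
```

The returned object can be passed to `csv.reader` or `np.loadtxt` in the same way for both kinds of source, and it works in a `with` block so the file or connection is closed afterwards.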

Convert Data Into Numpy array

import numpy as np

# Option 1: convert to a NumPy array using loadtxt()
data_np = np.loadtxt(rawdata, delimiter=',')  # returns a NumPy array

# Option 2: convert to a NumPy array using csv and numpy
# (csv.reader needs text lines, so rawdata must be opened in text mode here)
import csv
csvObj = csv.reader(rawdata, delimiter=',', quoting=csv.QUOTE_NONE)
listObj = list(csvObj)              # convert the csv reader into a list of rows
data_np = np.array(listObj)         # convert the list into a NumPy array of strings
data_np = data_np.astype('float')   # convert the string array into floats
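Many CSV files start with a header row, which neither conversion handles as written. The sketch below uses a small in-memory CSV (the column names and values are made up for illustration) to show both routes skipping the header and arriving at the same float array.

```python
import csv
import io

import numpy as np

# Made-up sample data standing in for a real CSV file
csv_text = "height,weight\n1.5,60\n1.8,75.5\n"

# Route 1: np.loadtxt, skipping the header row
data_loadtxt = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

# Route 2: csv.reader, separating the header before converting
rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))
header, body = rows[0], rows[1:]
data_csv = np.array(body).astype(float)

print(header)
print(data_loadtxt.shape, data_csv.shape)
```

The `csv` route takes more lines but keeps the column names available in `header`, which `np.loadtxt` with `skiprows=1` simply discards.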

Method 2: Load data using Pandas

The second method uses the Pandas module, which is very straightforward. It has a read_csv() function that accepts either a URL or a local path and returns a pandas DataFrame.

import pandas as pd
df1 = pd.read_csv(remote_url)  # read data from a remote URL
df2 = pd.read_csv(local_path)  # read data from a local path
df3 = pd.read_csv(local_path, header=None)  # header=None because this dataset has no header row
# return the NumPy representation of the dataframe
np_data = df1.values
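The effect of `header=None` is easiest to see on a tiny example. The sketch below reads made-up values from an in-memory string instead of a real file, and uses `to_numpy()`, which the pandas documentation recommends over the `.values` attribute.

```python
import io

import pandas as pd

# Made-up sample data with no header row
csv_text = "1.5,60\n1.8,75.5\n"

# Default: the first row is consumed as column names
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is data and columns are numbered 0, 1, ...
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

np_data = df_no_header.to_numpy()
print(df_default.shape, df_no_header.shape)
```

Without `header=None`, the first data row silently becomes the column labels and the DataFrame ends up one row short, so it is worth checking the shape after loading.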

Method 3: Load data from Google Drive

While working in Google Colab, it is convenient to load data from Google Drive. Suppose your Google Drive folder has the following structure, with the dataset uploaded into dataset_folder:

My Drive
  dataset_folder

To access the dataset from dataset_folder we have to write the following code.

from google.colab import drive
import os
# Mount Google Drive
drive.mount('/content/drive')
# Set the working directory to the Google Drive folder
os.chdir(r"/content/drive/My Drive/dataset_folder")

One more point to keep in mind whenever you work with ML: if you download a dataset and feed it to your model from a local directory, the code breaks as soon as you pass it to someone else, because that path does not exist on their machine. So if the dataset is available online, don't download it to your drive and load it from there. It is good practice to load the dataset directly from the URL and process it after that.

Shuvangkar Das, Potsdam, New York
