April 29, 2017, 8:30 p.m.

Data Science: Data Loading Techniques

Example: downloading data with scikit-learn

Downloading datasets from mldata.org

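>>> # fetch_mldata was deprecated in scikit-learn 0.20 and removed in 0.22 (fetch_openml replaces it)
>>> from sklearn.datasets import fetch_mldata
>>> # custom_data_home must already hold a directory path where the download is cached
>>> mnist = fetch_mldata('MNIST original', data_home=custom_data_home)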



import pandas as pd

# Read a tab-separated file, keeping only the first three columns and naming them
r_cols = ['user_id', 'movie_id', 'rating']
ratings = pd.read_csv('c:/xxxxxxxxx/xx/xxx/xx/sample.data', sep='\t', names=r_cols, usecols=range(3))
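
The read_csv call above can be tried without the original file. The sketch below is only an illustration: the in-memory sample, its values, and the four-column layout (three kept columns plus a trailing timestamp) are assumptions, not the contents of sample.data. It also shows one way to turn the resulting DataFrame into NumPy arrays.

```python
import io

import pandas as pd

# Hypothetical stand-in for the data file: user_id, movie_id, rating, timestamp,
# separated by tabs. The values are made up.
sample = "1\t10\t4\t1000\n2\t20\t5\t1001\n3\t30\t3\t1002\n"

r_cols = ['user_id', 'movie_id', 'rating']
# usecols=range(3) keeps the first three columns and drops the timestamp.
ratings = pd.read_csv(io.StringIO(sample), sep='\t', names=r_cols, usecols=range(3))

# Convert to plain NumPy arrays, e.g. as features and target.
X = ratings[['user_id', 'movie_id']].to_numpy()
y = ratings['rating'].to_numpy()

print(ratings)
print(X.shape, y.shape)
```

Splitting into X and y here is just one possible convention; which columns count as features depends on the task.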



>>> import numpy as np
>>> data = np.loadtxt("myfile.txt")                       # myfile.txt contains 4 columns of numbers
>>> t, z = data[:, 0], data[:, 3]                         # data is a 2D numpy array
>>> t, x, y, z = np.loadtxt("myfile.txt", unpack=True)    # to unpack all columns
>>> t, z = np.loadtxt("myfile.txt", usecols=(0, 3), unpack=True)  # to select just a few columns
>>> data = np.loadtxt("myfile.txt", skiprows=7)           # to skip 7 rows from top of file
>>> data = np.loadtxt("myfile.txt", comments='!')         # use '!' as comment char instead of '#'
>>> data = np.loadtxt("myfile.txt", delimiter=';')        # use ';' as column separator instead of whitespace
>>> data = np.loadtxt("myfile.txt", dtype=int)            # file contains integers instead of floats
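
The loadtxt options above can be checked without creating myfile.txt. This is a small sketch in which an in-memory string stands in for the file; the four-column content is invented for illustration.

```python
import io

import numpy as np

# Hypothetical stand-in for myfile.txt: four whitespace-separated columns.
# The first line starts with '#', the default comment character, so it is skipped.
text = """# t x y z
0.0 1.0 2.0 3.0
1.0 4.0 5.0 6.0
2.0 7.0 8.0 9.0
"""

# Load everything as a 2D array, then slice out the first and last columns.
data = np.loadtxt(io.StringIO(text))
t, z = data[:, 0], data[:, 3]

# Load only columns 0 and 3 and unpack them directly into two 1D arrays.
t2, z2 = np.loadtxt(io.StringIO(text), usecols=(0, 3), unpack=True)

print(data.shape)
print(np.array_equal(t, t2), np.array_equal(z, z2))
```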


How to load data sets:

Here are some recommended ways to load standard columnar data into a format usable by scikit-learn:

  • pandas.io provides tools to read data from common formats including CSV, Excel, JSON and SQL. DataFrames may also be constructed from lists of tuples or dicts. Pandas handles heterogeneous data smoothly and provides tools for manipulation and conversion into a numeric array suitable for scikit-learn.
  • scipy.io specializes in binary formats often used in scientific computing contexts, such as .mat and .arff
  • numpy/routines.io (NumPy's input and output routines) for standard loading of columnar data into numpy arrays
  • scikit-learn’s datasets.load_svmlight_file for the svmlight or libSVM sparse format
  • scikit-learn’s datasets.load_files for directories of text files where the name of each directory is the name of each category and each file inside of each directory corresponds to one sample from that category
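
The two scikit-learn loaders in the list can be exercised on throwaway files. A minimal sketch, assuming scikit-learn is installed; the toy matrix, the file names, and the pos/neg category folders are made up for illustration. It writes a small svmlight file with dump_svmlight_file so that load_svmlight_file has something to read, then builds a tiny directory tree for load_files.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_files, load_svmlight_file

# Toy data, invented for this example.
X = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
y = np.array([1, -1])

with tempfile.TemporaryDirectory() as tmp:
    # svmlight / libSVM format: write a file, then read it back.
    path = os.path.join(tmp, "toy.svmlight")
    dump_svmlight_file(X, y, path, zero_based=True)
    X_loaded, y_loaded = load_svmlight_file(path, n_features=3, zero_based=True)

    # load_files: one sub-directory per category, one file per sample.
    root = os.path.join(tmp, "corpus")
    for category, text in [("neg", "bad movie"), ("pos", "great movie")]:
        os.makedirs(os.path.join(root, category))
        with open(os.path.join(root, category, "doc0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    corpus = load_files(root, encoding="utf-8")

# X_loaded is a SciPy sparse matrix; convert to dense only for small data.
print(X_loaded.toarray())
print(corpus.target_names)
```

load_svmlight_file returns a sparse matrix, which is why the sketch calls toarray() before printing; for large data sets it is usually better to keep the sparse form.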

For some miscellaneous data such as images, videos, and audio, you may wish to refer to: