Reading Data In Your Data Science Project
Working on a data science project is fun and interesting, and every data scientist brings their own style and tools to it. One thing is common to every data science project: data. Data is the bedrock of every data science project; without data there is no data science or any related field. This data can come in many formats (Excel, comma-separated values (CSV), plain text, JSON, etc.).
In this article we will consider three ways to read or load data in a project.
Python provides a built-in open() function for reading files. To use it, call open() with the file path and the mode: 'r' to read, 'w' to write, or 'a' to append. The file path tells Python where the file is, and the mode specifies what you want to do with it. If you leave the mode out, it defaults to 'r'. The general form is open(filepath, mode).
To read a file that is in the folder you run your program from (the current working directory), you don't need to give the full path; the file name and the read mode are enough. An example is open('example_data.txt', 'r').
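To read a file that is in a different folder or directory on Windows, give its full path. You can write the path with forward slashes (/), as in open("C:/mycomputer/example_data.txt"), or with backslashes inside a raw string, as in open(r"C:\mycomputer\example_data.txt"). The r prefix is only needed in the backslash form, where it stops Python from treating the backslashes as escape characters.
Note that open() returns a file object, not the contents of the file. You can store that object in a variable, f = open("C:/mycomputer/example_data.txt"), then call f.read() to get the text and f.close() when you are done.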
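As a minimal sketch of the open() workflow: the snippet below first writes a small sample file to a temporary folder (the file name and its two lines are made up for illustration), then reads it back in two common ways. It uses a with block, which is a style choice rather than something this article requires; it simply closes the file automatically.

```python
import tempfile
from pathlib import Path

# Create a small sample file so the example is self-contained.
# The file name and contents are hypothetical.
sample_path = Path(tempfile.mkdtemp()) / "example_data.txt"
sample_path.write_text("first line\nsecond line\n", encoding="utf-8")

# open() returns a file object; read() returns the whole file as one string.
with open(sample_path, "r", encoding="utf-8") as f:
    text = f.read()

# Iterating over the file object yields one line at a time.
with open(sample_path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

print(text)
print(lines)
```

Reading line by line is usually the better choice for large files, because the whole file never has to sit in memory at once.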
The second way we will consider is reading data with the pandas library. Pandas is a Python library that provides many tools for analyzing data. To read data with pandas, first import the library under an alias:
import pandas as pd
Then call the pandas reader function that matches the format of your data. Pandas has a separate reader for each format: pd.read_csv() for CSV, pd.read_excel() for Excel, pd.read_json() for JSON, and so on. An example is pd.read_csv("filename.csv").
If the file is not in the same folder or directory as your program, pass its path instead, for example pd.read_csv("path/to/filename.csv"). By default these readers return a pandas DataFrame, which you can store in a variable:
data = pd.read_csv("filename.csv")
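Here is a short sketch of reading with pandas, using two of its real readers, pd.read_csv() and pd.read_json(). The sample files, their names, and their contents are made up for illustration and are written to a temporary folder so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data, written to a temporary folder for illustration.
folder = Path(tempfile.mkdtemp())
csv_path = folder / "people.csv"
csv_path.write_text("name,age\nAda,36\nLinus,28\n", encoding="utf-8")
json_path = folder / "people.json"
json_path.write_text(
    '[{"name": "Ada", "age": 36}, {"name": "Linus", "age": 28}]',
    encoding="utf-8",
)

# Each reader returns a DataFrame (a table with named columns).
df_csv = pd.read_csv(csv_path)
df_json = pd.read_json(json_path)

print(df_csv.head())   # first rows of the table
print(df_csv.dtypes)   # column types pandas inferred from the file
```

Both calls produce the same two-column table here; only the reader changes with the file format.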
The third way is using the NumPy library, which provides the numpy.loadtxt() function for reading data from text files. NumPy stands for Numerical Python; it is a Python library for numerical computing built around arrays. Arrays are similar to lists, but all their elements share one data type and they can have several dimensions, which makes them a natural fit for matrices. To read data with numpy.loadtxt(), first import the library under an alias, import numpy as np, then call the function as np.loadtxt(). An example is np.loadtxt("filename.txt"). By default, loadtxt expects numeric values separated by whitespace; for a comma-separated file, pass delimiter=",".
You can specify the data type with the dtype argument, for example 'float' for decimal numbers (the default) or 'int' for integers: np.loadtxt("filename.txt", dtype='int'). If your data is not in the same folder as your program, pass the path of the data file instead, for example np.loadtxt("path/to/filename.txt").
The result can be stored in a variable, data = np.loadtxt("filename.txt"), and the same applies when you pass a path. Note that np.loadtxt() always returns a NumPy N-dimensional array (ndarray).
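The np.loadtxt() calls described above can be sketched as follows. The two sample files and their numbers are hypothetical and are created in a temporary folder so the example is self-contained; the delimiter argument is shown as one way to handle comma-separated values.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical numeric data: one whitespace-separated file, one comma-separated.
folder = Path(tempfile.mkdtemp())
txt_path = folder / "numbers.txt"
txt_path.write_text("1 2 3\n4 5 6\n", encoding="utf-8")
csv_path = folder / "numbers.csv"
csv_path.write_text("1,2,3\n4,5,6\n", encoding="utf-8")

# Without dtype, values are read as floats.
floats = np.loadtxt(txt_path)

# dtype=int reads the same values as integers.
ints = np.loadtxt(txt_path, dtype=int)

# delimiter="," tells loadtxt how the columns are separated.
from_csv = np.loadtxt(csv_path, delimiter=",", dtype=int)

print(floats.shape, floats.dtype)
print(ints)
print(from_csv)
```

Each result is an ndarray with two rows and three columns, so it can be used directly in array arithmetic.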
Of these three ways, the most widely used in data science projects is the pandas family of read functions (pd.read_csv() and its siblings). Most projects out there use them, and I use them all the time. Personally, I have not seen a project where the data scientist used numpy.loadtxt() to read data, but it works well when the data is purely numeric and we want to work with arrays. Sometimes data scientists use Python's open() function to read data; it is a matter of preference.