
Pandas read_csv() – How to read a csv file in Python

Much of the data you will work with is available in tabular form as CSV files, a very popular format. You can load such files into a pandas DataFrame using the read_csv function: pandas.read_csv reads a CSV file and returns it as a DataFrame.

In this article, you will learn the different features of the pandas read_csv function beyond simply loading a CSV file, and the parameters you can customize to get better output from it.

pandas.read_csv

  • Syntax: pandas.read_csv(filepath_or_buffer, sep, header, names, index_col, usecols, prefix, dtype, converters, skiprows, skipfooter, nrows, na_values, parse_dates)
  • Purpose: Read a comma-separated values (csv) file into a DataFrame. Also supports optionally iterating or breaking the file into chunks.
  • Parameters:
    • filepath_or_buffer: str, path object or file-like object. Any valid string path is acceptable; the string could also be a URL. Path object refers to os.PathLike. File-like objects need a read() method, such as a file handle (e.g. via the built-in open function) or StringIO (see the sketch after this list).
    • sep: str, (Default ‘,’) Delimiter that separates consecutive data items in a row.
    • header : int, list of int, (Default ‘infer’) Row number(s) to use as the column names, and the start of the data. The default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file.
    • names: array-like List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
    • index_col: int, str, sequence of int/str, or False, (Default None) Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int/str is given, a MultiIndex is used.
    • usecols: list-like or callable Return a subset of the columns. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
    • prefix: str Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1
    • dtype: Type name or dict of column -> type Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype.
    • converters: dict Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
    • skiprows: list-like, int or callable Line numbers to skip (0-indexed) or the number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise.
    • skipfooter: int Number of lines at bottom of the file to skip
    • nrows: int Number of rows of file to read. Useful for reading pieces of large files.
    • na_values: scalar, str, list-like, or dict Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
    • parse_dates: bool or list of int or names or list of lists or dict, (default False) If set to True, will try to parse the index, else parse the columns passed
  • Returns: DataFrame or TextParser. A comma-separated values (CSV) file is returned as a two-dimensional data structure with labeled axes. For the full list of parameters, refer to the official documentation.
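
As noted for filepath_or_buffer, read_csv also accepts URLs and file-like objects, not just paths on disk. Below is a minimal sketch that reads a CSV held in an in-memory StringIO buffer (the column names and values here are made up for illustration):

import io
import pandas as pd

# A small CSV kept in memory; any object with a read() method works the same way
csv_text = "Rank,State,Population\n1,Uttar Pradesh,199812341\n2,Maharashtra,112374333"
df = pd.read_csv(io.StringIO(csv_text))
df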

Reading CSV file

The pandas read_csv function can be used in different ways as needed, such as using custom separators or reading only selected columns/rows. These cases are covered below one after another.

Default Separator

To read a CSV file, call the pandas function read_csv() and pass the file path as input.

Step 1: Import Pandas

import pandas as pd

Step 2: Read the CSV

# Read the csv file
df = pd.read_csv("data1.csv")

# First 5 rows
df.head()
read_csv file from pandas

Different, Custom Separators

By default, a CSV is separated by commas, but other separators can be used as well. The pandas.read_csv function is not limited to reading a CSV file with the default separator (i.e. the comma). It can handle other separators such as ;, | or :. To load CSV files with such separators, pass the separator used in the file via the sep parameter.

Let’s load a file that uses | as the separator:

# Read the csv file sep='|'
df = pd.read_csv("data2.csv", sep='|')
df
Custom Separators for read_csv pandas file

Set any row as column header

Let’s see the data frame created using the read_csv pandas function without any header parameter:

# Read the csv file
df = pd.read_csv("data1.csv")
df.head()
Column header for read_csv pandas file

Row 0 of this data frame looks like a better fit for the header: it describes the figures in the table more clearly. You can make that row the header while reading the CSV by passing its row number to the header parameter.

Note: Row numbering in the file starts from 0 and includes the current column header, so the desired header row is at index 1.

# Read the csv file with header parameter
df = pd.read_csv("data1.csv", header=1)
df.head()
Column header for read_csv pandas file

Renaming column headers

While reading the CSV file, you can rename the column headers by using the names parameter, which takes a list of the new column names.

# Read the csv file with names parameter 
df = pd.read_csv("data.csv", names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
df.head()
Renaming Column header for read_csv pandas file

To avoid the old header being read in as a data row, also pass the header parameter (header=0); the old header row is then replaced by the new names.

# Read the csv file with header and names parameter 
df = pd.read_csv("data.csv", header=0, names=['Ranking', 'ST Name', 'Pop', 'NS', 'D'])
df.head()
Renaming Column header for read_csv pandas file

Loading CSV without column headers in pandas

There is a chance that the CSV file you load doesn’t have any column header. In that case, pandas will by default treat the first data row as the column header.

# Read the csv file
df = pd.read_csv("data3.csv")
df.head()
Default case without column header

To avoid any row being inferred as column header, you can specify header as None. It will force pandas to create numbered columns starting from 0.

# Read the csv file with header=None
df = pd.read_csv("data3.csv", header=None)
df.head()
Default case without column header

Adding Prefixes to numbered columns

You can also give prefixes to the numbered column headers using the prefix parameter of the pandas read_csv function.

# Read the csv file with header=None and prefix=column_
df = pd.read_csv("data3.csv", header=None, prefix='column_')
df.head()

Set any column(s) as Index

By default, pandas adds a numeric index to the data frame loaded from the CSV file. You can control this behavior and use any column of your CSV as the index by using the index_col parameter.

It takes the name (or position) of the column that should become the index.

Case 1: Making one column as index

# Read the csv file with 'Rank' as index
df = pd.read_csv("data.csv", index_col='Rank')
df.head()

Case 2: Making multiple columns as index

To use two or more columns as the index, pass them as a list.

# Read the csv file with 'Rank' and 'Date' as index
df = pd.read_csv("data.csv", index_col=['Rank', 'Date'])
df.head()

Selecting columns while reading CSV

In practice, not all the columns of a CSV file are important. You could drop the unnecessary columns after loading the file, but if you know which columns you need beforehand, you can save space and time by loading only those.

The usecols parameter takes the list of columns you want to load into your data frame.

Selecting columns using list

# Read the csv file with 'Rank', 'Date' and  'Population' columns (list)
df = pd.read_csv("data.csv", usecols=['Rank', 'Date', 'Population'])
df.head()
Selecting column for read_csv pandas file

Selecting columns using callable functions

The usecols parameter can also take a callable function. The callable is evaluated against each column name, and only the columns for which it returns True are selected.

# Read the csv file with columns where length of column name > 10
df = pd.read_csv("data.csv", usecols=lambda x: len(x)>10)
df.head()
Selecting column for read_csv pandas file

Selecting/skipping rows while reading CSV

You can skip or select a specific number of rows from the dataset using the pandas.read_csv function. There are 3 parameters that can do this task: nrows, skiprows and skipfooter.

All of them have different functions. Let’s discuss each of them separately.

A. nrows: This parameter allows you to control how many rows you want to load from the CSV file. It takes an integer specifying row count.

# Read the csv file with 5 rows
df = pd.read_csv("data.csv", nrows=5)
df
Selecting rows for read_csv pandas file

B. skiprows: This parameter allows you to skip rows from the beginning of the file.

Skiprows by specifying row indices

# Read the csv file with first row skipped
df = pd.read_csv("data.csv", skiprows=1)
df.head()
Selecting rows for read_csv pandas file

Skiprows by using callback function

The skiprows parameter can also take a callable function as input, which is evaluated against the row indices. The callable is checked for every row index to decide whether that row should be skipped or not.

# Read the csv file with odd rows skipped
df = pd.read_csv("data.csv", skiprows=lambda x: x%2!=0)
df.head()
Selecting rows for read_csv pandas file

C. skipfooter: This parameter allows you to skip rows from the end of the file. (Note that skipfooter is only supported by the Python parsing engine, so pandas may emit a warning and fall back from the default C engine.)

# Read the csv file with 1 row skipped from the end
df = pd.read_csv("data.csv", skipfooter=1)
df.tail()
Selecting rows for read_csv pandas file

Changing the data type of columns

You can specify the data types of columns while reading the CSV file. The dtype parameter takes a dictionary mapping column names to their data types. To assign the data types, you can import them from the numpy package and map them to the appropriate columns.

Data Type of Rank before change

# Read the csv file 
df = pd.read_csv("data.csv")

# Display datatype of Rank
df.Rank.dtypes
dtype('int64')

Data Type of Rank after change

# import numpy
import numpy as np

# Read the csv file with data type specified for Rank.
df = pd.read_csv("data.csv", dtype={'Rank':np.int8})

# Display datatype of rank
df.Rank.dtypes
dtype('int8')

Parse Dates while reading CSV

Datetime values are crucial for data analysis. You can convert a column to a datetime type while reading the CSV in two ways:

Method 1. Make the desired column as an index and pass parse_dates=True

# Read the csv file with 'Date' as index and parse_dates=True
df = pd.read_csv("data.csv", index_col='Date', parse_dates=True, nrows=5)

# Display index
df.index
DatetimeIndex(['2021-02-25', '2021-04-14', '2021-02-19', '2021-02-24',
               '2021-02-13'],
              dtype='datetime64[ns]', name='Date', freq=None)

Method 2. Pass desired column in parse_dates as list

# Read the csv file with parse_dates=['Date']
df = pd.read_csv("data.csv", parse_dates=['Date'], nrows=5)

# Display datatypes of columns
df.dtypes
Rank                           int64
State                         object
Population                    object
National Share (%)            object
Date                  datetime64[ns]
dtype: object

Adding more NaN values

The pandas library recognizes many common missing-value markers out of the box. But in many cases the data contains missing values in forms that are not in the default pandas NA values list: it doesn’t understand ‘missing’, ‘not found’, or ‘not available’ as missing values.

So you need to tell pandas to treat them as missing. To do this, use the na_values parameter, which takes a list of such values.

Loading CSV without specifying na_values

# Read the csv file
df = pd.read_csv("data.csv", nrows=5)
df
Adding NaN values

Loading CSV with specifying na_values

# Read the csv file with 'missing' as na_values
df = pd.read_csv("data.csv", na_values=['missing'], nrows=5)
df
Adding NaN values
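
As mentioned in the parameter list, na_values can also take a dict to specify per-column NA markers. A small sketch, assuming you only want ‘missing’ treated as NaN in the Population column of the same dataset:

# Treat 'missing' as NaN only in the Population column
df = pd.read_csv("data.csv",
                 na_values={'Population': ['missing']},
                 nrows=5)
df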

Convert values of the column while reading CSV

You can transform, modify, or convert the values of the columns of the CSV file while loading the CSV itself. This can be done by using the converters parameter, which takes a dictionary whose keys are the column names and whose values are the functions to be applied to those columns.

Let’s convert the comma-separated values (e.g. 19,98,12,341) in the Population column of the dataset to integer values (199812341) while reading the CSV.

# Function which converts a comma-separated value to an integer ('missing' becomes -1)
toInt = lambda x: int(x.replace(',', '')) if x!='missing' else -1

# Read the csv file 
df = pd.read_csv("data.csv", converters={'Population': toInt})
df.head()

Practical Tips

  • Before loading the CSV file into a pandas data frame, always take a quick look at the file. It will help you estimate which columns you should import and what data types your columns should have.
  • You should also look out for the total row count of the dataset. A system with 4 GB RAM may not be able to load 7-8 million rows at once; in that case, read the file in chunks (see the sketch below).
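
If the file is too large to fit in memory, read_csv can iterate over it in chunks via the chunksize parameter (mentioned in the purpose above). A minimal sketch, assuming the same data.csv file; the chunk size of 100,000 rows is arbitrary:

# Read 100,000 rows at a time instead of loading the whole file at once
total_rows = 0
for chunk in pd.read_csv("data.csv", chunksize=100000):
    # each chunk is a DataFrame; process it here
    total_rows += len(chunk)
print(total_rows)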

Test your knowledge

Q1: You cannot load files with the $ separator using the pandas read_csv function. True or False?

Answer: False. You can use the sep parameter of read_csv to specify the separator.
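
For example (the file name here is hypothetical):

# '$' is passed to sep just like any other single-character delimiter
df = pd.read_csv("dollar_separated.csv", sep='$')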

Q2: What is the use of the converters parameter in the read_csv function?

Answer: The converters parameter is used to modify the values of columns while loading the CSV.

Q3: How will you make pandas recognize that a particular column is datetime type?

Answer: By using the parse_dates parameter.
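
For example, with the Date column of the dataset used above:

# 'Date' is parsed as a datetime64[ns] column
df = pd.read_csv("data.csv", parse_dates=['Date'])
df.dtypes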

Q4: A dataset contains the missing-value markers no, not available, and '-100'. How will you specify them as missing values for pandas to correctly interpret them? (Assume CSV file name: example1.csv)

Answer: By using the na_values parameter.

import pandas as pd

df = pd.read_csv("example1.csv", na_values=['no', 'not available', '-100'])

Q5: How would you read a CSV file where,

  1. The heading of the columns is in the 3rd row (numbered from 1).
  2. The last 5 lines of the file have garbage text and should be avoided.
  3. Only columns whose names start with a vowel should be included. Assume the names are one word only.

(CSV file name: example2.csv)

Answer:

import pandas as pd

colnameWithVowels = lambda x: x.lower()[0] in ['a', 'e', 'i', 'o', 'u']

df = pd.read_csv("example2.csv", usecols=colnameWithVowels, header=2, skipfooter=5)

The article was contributed by Kaustubh G and Shrivarsheni
