Let’s understand how to select columns in pandas using various useful method.
While working on data science projects, you generally get a big dump of data. But you may need to work with only a subset of columns from the data. This selection of necessary data is done to make the tasks simpler. Additionally, the computation speeds also gets improved while working with a smaller data set. There are various ways to select columns from data frame including loc
, iloc
, filter
,etc. This article covers all the methods with examples.
Create a sample data set
Start by creating a simple pandas dataframe as shown below. It has 6 columns.
# Import packages
import pandas as pd
# Create a dataframe
df = pd.DataFrame(
data= [['HR', 'Orange', 'Wheat', 30, 165, 4.6],
['DL', 'Purple', 'Flour', 2, 70, 8.3],
['MH', 'Red', 'Mango', 12, 120, 9.0],
['AS', 'Black', 'Apple', 4, 80, 3.3],
['GJ', 'Blue', 'Milk', 32, 180, 1.8],
['KL', 'Green', 'Melon', 33, 172, 9.5],
['PB', 'Magenta', 'Beans', 69, 150, 2.2]],
columns=['State', 'Color', 'Food', 'Average Age', 'Average Height', 'Score']
)
df

1. Basic Column Selection
One of the most basic ways in pandas to select columns from dataframe is by passing the list of columns to the dataframe object indexing operator.
# Selecting columns by passing a list of desired columns
df[['Color', 'Score']]

2. Column selection using column list
The dataframe_name.columns
returns the list of all the columns in the dataframe. You can use this as one of the ways of accessing multiple columns in pandas. You need to pass the modified list of columns in the dataframe indexing operator. This method can be performed in two ways:
A. Passing sliced column list
You can slice out the desired column names from the list of all the columns returned by dataframe_name.columns
by indexing.
# slicing df.columns to access the last two columns of the dataframe
df[df.columns[-2:]]

B. Searching for desired columns using isin()
The isin()
fucntion is used to check if an element is present in the list. Here , the isin()
function can be used to check if a column name is present in df.columns
list as well as desired columns list. This helps in selecting multiple columns from the list of all columns. The function returns a boolean array, where True
value denotes that the column name was present in both lists.
# Creating the boolean mask
booleanMask = df.columns.isin(['State', 'Food'])
# saving the selected columns
selectedCols = df.columns[booleanMask]
# selecting the desired columns
df[selectedCols]

3. Using .loc and .iloc to select columns by name or position
The Pandas .loc
and .iloc
indexers can be used with pandas dataframe to narrow down a large dataframe into a small dataframe. These indexers can be used for both row and column selection.
A. Using .loc
indexer
The loc
takes column names or lists of columns and returns a row or dataframe. The indexer takes both rows and column slicing. Therefore, to select only columns from the dataframe, you can leave row slicing as :
, which will pick all the rows of the dataframe.
For column slicing, you can either pass the name of the column to be selected.
# .loc single column selection
df.loc[:, 'Food' ]
0 Wheat
1 Flour
2 Mango
3 Apple
4 Milk
5 Melon
6 Beans
Name: Food, dtype: object
If you want to select multiple columns, pass a list of columns.
# .loc multiple column selection
df.loc[:, ['State', 'Food'] ]

B. Using .iloc
indexer
The iloc
indexer is similar to .loc
indexer. The only difference in iloc
is that it takes indices of columns or indices lists of columns and returns a row or dataframe.
Here also, you can leave row slicing as :
to select all rows and for column slicing, you can either pass the indices of the column or a list of indices of the columns to be selected.
# .iloc single column selection
df.iloc[:, 0]
0 HR
1 DL
2 MH
3 AS
4 GJ
5 KL
6 PB
Name: State, dtype: object
# .iloc single column selection
df.iloc[:, [0, 2, 3] ]

4. Using filter methods to select columns that contains certain words.
You can use the filter function of the pandas dataframe to select columns containing a specified string in column names.
The parameter like
of the .filter
function defines this specific string. If a column name contains the string specified, that column will be selected and dataframe will be returned.
# selecting columns where column name contains 'Average' string
df.filter(like='Average')

5. Pandas Select columns based on their data type
Pandas dataframe has the function select_dtypes
, which has an include
parameter. Specify the datatype of the columns which you want select using this parameter. This can be useful to you if you want to select only specific data type columns from the dataframe.
# selecting integer valued columns
df.select_dtypes(include=['int64'])

Practical Tips
- You can select columns based on regular expressions rules too. Use
regex
parameter in filter function to get columns where column names evaluate to True on given expression. - Selecting columns based on data types can prove useful while performing Exploratory Data Analysis. You will have different data types segregated which will eliminate the process of manually creating the list of columns for different data types.
If you would like to learn more about functions of pandas, you can go through the exercises.
Test your knowledge
Q1: Which of the following is the correct way to select multiple columns in pandas dataframe?
a) df['Col1', 'Col2']
b) df[['Col1', 'Col2']]
c) df[['Col1'], 'Col2']
Answer: Only B option
Q2: How to select all the categorical columns (dtype=category
)?
Answer: Using pandas dataframe function select_dtypes()
and include
parameter.
dataframe_name.select_dtypes(include=['category'])
Q3: How to select all the columns where the column name contains the word final
?
Answer: Using pandas dataframe function filter
and like
parameter.
dataframe_name.filter(like='final')
The article was contributed by Kaustubh and Shrivarsheni.