Let’s understand how to select columns in pandas using various useful method.
While working on data science projects, you generally get a big dump of data. But you may need to work with only a subset of columns from the data. This selection of necessary data is done to make the tasks simpler. Additionally, the computation speeds also gets improved while working with a smaller data set. There are various ways to select columns from data frame including
filter ,etc. This article covers all the methods with examples.
Create a sample data set
Start by creating a simple pandas dataframe as shown below. It has 6 columns.
# Import packages import pandas as pd # Create a dataframe df = pd.DataFrame( data= [['HR', 'Orange', 'Wheat', 30, 165, 4.6], ['DL', 'Purple', 'Flour', 2, 70, 8.3], ['MH', 'Red', 'Mango', 12, 120, 9.0], ['AS', 'Black', 'Apple', 4, 80, 3.3], ['GJ', 'Blue', 'Milk', 32, 180, 1.8], ['KL', 'Green', 'Melon', 33, 172, 9.5], ['PB', 'Magenta', 'Beans', 69, 150, 2.2]], columns=['State', 'Color', 'Food', 'Average Age', 'Average Height', 'Score'] ) df
1. Basic Column Selection
One of the most basic ways in pandas to select columns from dataframe is by passing the list of columns to the dataframe object indexing operator.
# Selecting columns by passing a list of desired columns df[['Color', 'Score']]
2. Column selection using column list
dataframe_name.columns returns the list of all the columns in the dataframe. You can use this as one of the ways of accessing multiple columns in pandas. You need to pass the modified list of columns in the dataframe indexing operator. This method can be performed in two ways:
A. Passing sliced column list
You can slice out the desired column names from the list of all the columns returned by
dataframe_name.columns by indexing.
# slicing df.columns to access the last two columns of the dataframe df[df.columns[-2:]]
B. Searching for desired columns using
isin() fucntion is used to check if an element is present in the list. Here
, the isin() function can be used to check if a column name is present in
df.columns list as well as desired columns list. This helps in selecting multiple columns from the list of all columns. The function returns a boolean array, where
True value denotes that the column name was present in both lists.
# Creating the boolean mask booleanMask = df.columns.isin(['State', 'Food']) # saving the selected columns selectedCols = df.columns[booleanMask] # selecting the desired columns df[selectedCols]
3. Using .loc and .iloc to select columns by name or position
.iloc indexers can be used with pandas dataframe to narrow down a large dataframe into a small dataframe. These indexers can be used for both row and column selection.
The loc takes column names or lists of columns and returns a row or dataframe. The indexer takes both rows and column slicing. Therefore, to select only columns from the dataframe, you can leave row slicing as
:, which will pick all the rows of the dataframe.
For column slicing, you can either pass the name of the column to be selected.
# .loc single column selection df.loc[:, 'Food' ]
0 Wheat 1 Flour 2 Mango 3 Apple 4 Milk 5 Melon 6 Beans Name: Food, dtype: object
If you want to select multiple columns, pass a list of columns.
# .loc multiple column selection df.loc[:, ['State', 'Food'] ]
The iloc indexer is similar to
.loc indexer. The only difference in
iloc is that it takes indices of columns or indices lists of columns and returns a row or dataframe.
Here also, you can leave row slicing as
: to select all rows and for column slicing, you can either pass the indices of the column or a list of indices of the columns to be selected.
# .iloc single column selection df.iloc[:, 0]
0 HR 1 DL 2 MH 3 AS 4 GJ 5 KL 6 PB Name: State, dtype: object
# .iloc single column selection df.iloc[:, [0, 2, 3] ]
4. Using filter methods to select columns that contains certain words.
You can use the filter function of the pandas dataframe to select columns containing a specified string in column names.
like of the
.filter function defines this specific string. If a column name contains the string specified, that column will be selected and dataframe will be returned.
# selecting columns where column name contains 'Average' string df.filter(like='Average')
5. Pandas Select columns based on their data type
Pandas dataframe has the function
select_dtypes, which has an
include parameter. Specify the datatype of the columns which you want select using this parameter. This can be useful to you if you want to select only specific data type columns from the dataframe.
# selecting integer valued columns df.select_dtypes(include=['int64'])
- You can select columns based on regular expressions rules too. Use
regexparameter in filter function to get columns where column names evaluate to True on given expression.
- Selecting columns based on data types can prove useful while performing Exploratory Data Analysis. You will have different data types segregated which will eliminate the process of manually creating the list of columns for different data types.
If you would like to learn more about functions of pandas, you can go through the exercises.
Test your knowledge
Q1: Which of the following is the correct way to select multiple columns in pandas dataframe?
Answer: Only B option
Q2: How to select all the categorical columns (dtype=
Answer: Using pandas dataframe function
Q3: How to select all the columns where the column name contains the word
Answer: Using pandas dataframe function
The article was contributed by Kaustubh and Shrivarsheni.