Pandas Select Columns

Let’s understand how to select columns in pandas using various useful method.

 

While working on data science projects, you generally get a big dump of data. But you may need to work with only a subset of columns from the data. This selection of necessary data is done to make the tasks simpler. Additionally, the computation speeds also gets improved while working with a smaller data set. There are various ways to select columns from data frame including loc, iloc, filter ,etc. This article covers all the methods with examples.

Create a sample data set

Start by creating a simple pandas dataframe as shown below. It has 6 columns.

# Import packages
import pandas as pd

# Create a dataframe
df = pd.DataFrame(
                   data= [['HR', 'Orange', 'Wheat', 30, 165, 4.6],
                        ['DL', 'Purple', 'Flour', 2, 70, 8.3],
                        ['MH', 'Red', 'Mango', 12, 120, 9.0],
                        ['AS', 'Black', 'Apple', 4, 80, 3.3],
                        ['GJ', 'Blue', 'Milk', 32, 180, 1.8],
                        ['KL', 'Green', 'Melon', 33, 172, 9.5],
                        ['PB', 'Magenta', 'Beans', 69, 150, 2.2]],

                    columns=['State', 'Color', 'Food', 'Average Age', 'Average Height', 'Score']
                )
df
Creating Dataframe

1. Basic Column Selection

One of the most basic ways in pandas to select columns from dataframe is by passing the list of columns to the dataframe object indexing operator.

# Selecting columns by passing a list of desired columns
df[['Color', 'Score']]
Pandas Select Column

 

2. Column selection using column list

The dataframe_name.columns returns the list of all the columns in the dataframe. You can use this as one of the ways of accessing multiple columns in pandas. You need to pass the modified list of columns in the dataframe indexing operator. This method can be performed in two ways:

A. Passing sliced column list

You can slice out the desired column names from the list of all the columns returned by dataframe_name.columns by indexing.

# slicing df.columns to access the last two columns of the dataframe

df[df.columns[-2:]]
Pandas Select Column using list

B. Searching for desired columns using isin()

The isin() fucntion is used to check if an element is present in the list. Here , the isin() function can be used to check if a column name is present in df.columns list as well as desired columns list. This helps in selecting multiple columns from the list of all columns. The function returns a boolean array, where True value denotes that the column name was present in both lists.

# Creating the boolean mask
booleanMask = df.columns.isin(['State', 'Food'])

# saving the selected columns 
selectedCols = df.columns[booleanMask]

# selecting the desired columns
df[selectedCols]
Pandas Select Column using isin function

3. Using .loc and .iloc to select columns by name or position

The Pandas .loc and .iloc indexers can be used with pandas dataframe to narrow down a large dataframe into a small dataframe. These indexers can be used for both row and column selection.

A. Using .loc indexer

The loc takes column names or lists of columns and returns a row or dataframe. The indexer takes both rows and column slicing. Therefore, to select only columns from the dataframe, you can leave row slicing as :, which will pick all the rows of the dataframe.

For column slicing, you can either pass the name of the column to be selected.

Get Free Complete Python Course

Facing the same situation like everyone else?

Build your data science career with a globally recognised, industry-approved qualification. Get the mindset, the confidence and the skills that make Data Scientist so valuable.

Logo

Get Free Complete Python Course

Build your data science career with a globally recognised, industry-approved qualification. Get the mindset, the confidence and the skills that make Data Scientist so valuable.

# .loc single column selection 
df.loc[:, 'Food' ]
0    Wheat
1    Flour
2    Mango
3    Apple
4     Milk
5    Melon
6    Beans
Name: Food, dtype: object

If you want to select multiple columns, pass a list of columns.

# .loc multiple column selection
df.loc[:, ['State', 'Food'] ]
Pandas Select Column using loc

B. Using .iloc indexer

The iloc indexer is similar to .loc indexer. The only difference in iloc is that it takes indices of columns or indices lists of columns and returns a row or dataframe.

Here also, you can leave row slicing as : to select all rows and for column slicing, you can either pass the indices of the column or a list of indices of the columns to be selected.

# .iloc single column selection 
df.iloc[:, 0]
0    HR
1    DL
2    MH
3    AS
4    GJ
5    KL
6    PB
Name: State, dtype: object
# .iloc single column selection 
df.iloc[:, [0, 2, 3] ]
Pandas Select Column using iloc

4. Using filter methods to select columns that contains certain words.

You can use the filter function of the pandas dataframe to select columns containing a specified string in column names.

The parameter like of the .filter function defines this specific string. If a column name contains the string specified, that column will be selected and dataframe will be returned.

# selecting columns where column name contains 'Average' string
df.filter(like='Average')

5. Pandas Select columns based on their data type

Pandas dataframe has the function select_dtypes, which has an include parameter. Specify the datatype of the columns which you want select using this parameter. This can be useful to you if you want to select only specific data type columns from the dataframe.

# selecting integer valued columns
df.select_dtypes(include=['int64'])
Selection based on data type

Practical Tips

  • You can select columns based on regular expressions rules too. Use regex parameter in filter function to get columns where column names evaluate to True on given expression.
  • Selecting columns based on data types can prove useful while performing Exploratory Data Analysis. You will have different data types segregated which will eliminate the process of manually creating the list of columns for different data types.

If you would like to learn more about functions of pandas, you can go through the exercises.

Test your knowledge

Q1: Which of the following is the correct way to select multiple columns in pandas dataframe?

a) df['Col1', 'Col2']

b) df[['Col1', 'Col2']]

c) df[['Col1'], 'Col2']

Answer:

Answer: Only B option

Q2: How to select all the categorical columns (dtype=category)?

Answer:

Answer: Using pandas dataframe function select_dtypes() and include parameter.

dataframe_name.select_dtypes(include=['category'])

Q3: How to select all the columns where the column name contains the word final?

Answer:

Answer: Using pandas dataframe function filter and like parameter.

dataframe_name.filter(like='final')

 

The article was contributed by Kaustubh and Shrivarsheni.

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science

Machine Learning A-Z™: Hands-On Python & R In Data Science