How to create a Pandas Dataframe in Python

In pandas, the DataFrame is the primary data structure for holding tabular data. You can create one with the DataFrame constructor, pandas.DataFrame(), or by importing data directly from various data sources.

Tabular datasets stored in external databases or in files of various formats, such as .csv or Excel files, can be read into Python as a DataFrame using the pandas library.

In this article, you will see different ways of making a DataFrame or loading existing tabular datasets in the form of a DataFrame.

pandas.DataFrame

Syntax

    • pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

Purpose

    • To create a two-dimensional, spreadsheet-like data structure for storing data in a tabular format

Parameters

    • data
      • NumPy array, iterable (such as a list), dictionary or DataFrame (default: None). It is used to populate the rows and columns of the DataFrame.
    • index
      • Index or array-like (default: None). It specifies the labels used to mark and identify each row of the dataset. If no index is specified and the data carries none of its own, integers from zero to one less than the total number of rows are used as the index.
    • columns
      • Index or array-like (default: None). It specifies the column (feature) names of the dataset. If the columns parameter is not specified and the data carries no column names of its own, integers from zero to one less than the total number of features are used as the column names.
    • dtype
      • dtype (default: None). It forces all values in the DataFrame to be converted to the specified dtype. If this parameter is not specified, pandas infers the data type of each feature from the values it contains.
    • copy
      • Boolean or None (default: None). It controls whether the data is copied from the inputs. With the default of None, dictionary input is copied, while DataFrame or two-dimensional NumPy array input is not.

Returns

    • A two-dimensional data structure containing data in a tabular format i.e., rows and columns
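As a rough illustration of how the index, columns and dtype parameters described above fit together, here is a minimal sketch. The numbers and labels are made-up sample values chosen only for this example.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate the parameters
df = pd.DataFrame(
    data=[[1, 2], [3, 4], [5, 6]],
    index=['a', 'b', 'c'],   # row labels instead of the default 0, 1, 2
    columns=['x', 'y'],      # column labels instead of the default 0, 1
    dtype='float64',         # convert every value to a float
)

print(df)
print(df.dtypes)
```

Leaving out index, columns and dtype would give the default integer labels and inferred integer columns instead.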

Creating a basic single column Pandas DataFrame

A basic DataFrame can be made by using a list.

# Create a single column dataframe
import pandas as pd
data = ['India', 'China', 'United States', 'Pakistan', 'Indonesia']

df = pd.DataFrame(data)

df
Creating basic dataframe

This creates a DataFrame with a default column name (0) and default index labels (0, 1, 2, 3, ...).
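If the default column name 0 is not what you want, one option is to pass a name through the columns parameter. The sketch below reuses the same list; the column name 'Country' is simply a label picked for this example.

```python
import pandas as pd

data = ['India', 'China', 'United States', 'Pakistan', 'Indonesia']

# Give the single column a descriptive name instead of the default 0
df = pd.DataFrame(data, columns=['Country'])

print(df.columns.tolist())
print(df.shape)
```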

Making a DataFrame from a dictionary of lists

A pandas DataFrame can be created from a dictionary in which the keys are the column names and each value is an array or list of that column's feature values. All of the lists must have the same length.

This dictionary is then passed to the data parameter of the DataFrame constructor.

# Create a dictionary where the keys are the feature names and the values are a list of the feature values
data_dict = {'Country': ['India', 'China', 'United States', 'Pakistan', 'Indonesia'],
             'Population': [1393409038, 1444216107, 332129157, 225199937, 276361783],
             'Currency': ['Indian Rupee', 'Renminbi', 'US Dollar', 'Pakistani Rupee', 'Indonesian Rupiah']}

df = pd.DataFrame(data=data_dict)

df
Making dataframe from list

 

Making a DataFrame from a list of lists

A list of lists means a list in which each element itself is a list. Each element in such a list forms a row of the DataFrame.
Therefore, the number of rows of the Pandas DataFrame is equal to the number of elements of the outer list.

# Create a list of lists where each inner list is a row of the DataFrame
data_list = [['India', 1393409038, 'Indian Rupee'],
             ['China', 1444216107, 'Renminbi'],
             ['United States', 332129157, 'US Dollar'],
             ['Pakistan', 225199937, 'Pakistani Rupee'],
             ['Indonesia', 276361783, 'Indonesian Rupiah']]


df = pd.DataFrame(data=data_list,
                  columns=['Country', 'Population', 'Currency'])

df
Making dataframe from lists of lists

The elements of each inner list in data_list are the values of the different features for one row.

Also, note that the column names have been passed as a list to the columns parameter.

The inner lists of data_list are always treated as rows of the DataFrame.
Therefore, when making a DataFrame from a list of lists, if the columns parameter is not specified, integers from zero to one less than the total number of columns are used as the column names.

For example:

# Create dataframe from a list of lists
data_list = [['India', 1393409038, 'Indian Rupee'],
             ['China', 1444216107, 'Renminbi'],
             ['United States', 332129157, 'US Dollar'],
             ['Pakistan', 225199937, 'Pakistani Rupee'],
             ['Indonesia', 276361783, 'Indonesian Rupiah']]


df = pd.DataFrame(data=data_list)

df
Making dataframe from lists of lists

Making a DataFrame from a list of dictionaries

A list of dictionaries means a list in which each element is a dictionary.
In the dictionary, the keys are the column names and the values are the corresponding column values.

# Create a list of dictionaries where the keys are the column names and each value is that column's value for one row
list_of_dicts = [
    {'Country': 'India', 'Population': 1393409038, 'Currency': 'Indian Rupee'},
    {'Country': 'China', 'Population': 1444216107, 'Currency': 'Renminbi'},
    {'Country': 'United States', 'Population': 332129157, 'Currency': 'US Dollar'},
    {'Country': 'Pakistan', 'Population': 225199937, 'Currency': 'Pakistani Rupee'},
    {'Country': 'Indonesia', 'Population': 276361783, 'Currency': 'Indonesian Rupiah'},
]


df = pd.DataFrame(list_of_dicts)

df
Making dataframe from dictionaries
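One behaviour worth knowing with a list of dictionaries is what happens when a dictionary is missing a key. The short sketch below, which uses a trimmed-down version of the data above with one key deliberately left out, suggests that pandas fills the gap with a missing value rather than raising an error.

```python
import pandas as pd

records = [
    {'Country': 'India', 'Population': 1393409038, 'Currency': 'Indian Rupee'},
    {'Country': 'China', 'Population': 1444216107},  # 'Currency' left out on purpose
]

df = pd.DataFrame(records)

print(df)
# Shows which rows have no value in the Currency column
print(df['Currency'].isna().tolist())
```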

Making a DataFrame from a Numpy array

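A two-dimensional NumPy array can also be used for creating a DataFrame. It is similar to a list of lists: there is an outer array, and the inner arrays form the rows of the DataFrame. Note that a NumPy array holds a single data type, so mixing strings and integers as in the example below stores every value, including the population figures, as text.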

# Create a numpy array where each inner array is a row of the DataFrame

import numpy as np

data_nparray = np.array([['India', 1393409038, 'Indian Rupee'],
                         ['China', 1444216107, 'Renminbi'],
                         ['United States', 332129157, 'US Dollar'],
                         ['Pakistan', 225199937, 'Pakistani Rupee'],
                         ['Indonesia', 276361783, 'Indonesian Rupiah']])


df = pd.DataFrame(data=data_nparray)
df
Making dataframe from numpy array

To set the column names, pass a list of names to the columns parameter, just as shown in the previous section.

# Create dataframe with user specified column names
data_nparray = np.array([['India', 1393409038, 'Indian Rupee'],
                         ['China', 1444216107, 'Renminbi'],
                         ['United States', 332129157, 'US Dollar'],
                         ['Pakistan', 225199937, 'Pakistani Rupee'],
                         ['Indonesia', 276361783, 'Indonesian Rupiah']])


df = pd.DataFrame(data=data_nparray,
                  columns=['Country', 'Population', 'Currency'])
df
Making dataframe from numpy array
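Because a NumPy array stores a single data type, an array that mixes strings and integers keeps everything as text, so the Population column is not numeric after loading. The following sketch, using a two-row subset of the data, shows one possible way to check this and convert the column back to integers with astype.

```python
import numpy as np
import pandas as pd

data_nparray = np.array([['India', 1393409038, 'Indian Rupee'],
                         ['China', 1444216107, 'Renminbi']])

# 'U' means a Unicode string array: the integers were converted to text
print(data_nparray.dtype.kind)

df = pd.DataFrame(data_nparray, columns=['Country', 'Population', 'Currency'])

# Convert the text values in Population back to integers
df['Population'] = df['Population'].astype('int64')

print(df.dtypes)
```

Building the DataFrame from a dictionary of lists or a list of lists avoids this conversion step, since each column then keeps its own type.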

Alternatively, you can make a dictionary of NumPy arrays in which the keys are the column names and the value for each key is an inner array holding that column's feature values.

# Create a numpy array where each inner array is a list of values of a particular feature
data_array = np.array(
    [['India', 'China', 'United States', 'Pakistan', 'Indonesia'],
     [1393409038, 1444216107, 332129157, 225199937, 276361783],
     ['Indian Rupee', 'Renminbi', 'US Dollar', 'Pakistani Rupee', 'Indonesian Rupiah']])

# Create a dictionary where the keys are the column names and each element of data_array is the feature value.
dict_array = {
    'Country': data_array[0],
    'Population': data_array[1],
    'Currency': data_array[2]}

# Create the DataFrame
df = pd.DataFrame(dict_array)
df
Making dataframe from numpy array

Making a DataFrame using the zip function

The built-in zip function combines multiple iterables into a single iterator of tuples, which can then be passed to the pandas.DataFrame constructor. Each tuple becomes one row of the DataFrame.

# Create the countries list(1st object)
countries = ['India', 'China', 'United States', 'Pakistan', 'Indonesia']

# Create the population list(2nd object)
population = [1393409038, 1444216107, 332129157, 225199937, 276361783]

# Create the currency list (3rd object)
currency = ['Indian Rupee', 'Renminbi', 'US Dollar',
            'Pakistani Rupee', 'Indonesian Rupiah']

# Zip the three objects
data_zipped = zip(countries, population, currency)

# Pass the zipped object as the data parameter and mention the column names explicitly
df = pd.DataFrame(data_zipped, columns=['Country', 'Population', 'Currency'])

df
Making dataframe using zip
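A zip object is an iterator, so it can only be read once. If the zipped rows are needed more than once, for example to inspect them before building the DataFrame, one option is to materialize them into a list first. This is a small sketch with shortened lists, not the only way to do it.

```python
import pandas as pd

countries = ['India', 'China']
population = [1393409038, 1444216107]

data_zipped = zip(countries, population)

rows = list(data_zipped)      # reads the iterator and stores the row tuples
leftover = list(data_zipped)  # empty, because the iterator is already used up

df = pd.DataFrame(rows, columns=['Country', 'Population'])

print(rows)
print(leftover)
print(df)
```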

Making Indexed Pandas DataFrames

A pandas DataFrame with a pre-defined index can be made by passing a list of index labels to the index parameter.

# Create the DataFrame
data_dict = {'Country': ['India', 'China', 'United States', 'Pakistan', 'Indonesia'],
             'Population': [1393409038, 1444216107, 332129157, 225199937, 276361783],
             'Currency': ['Indian Rupee', 'Renminbi', 'US Dollar', 'Pakistani Rupee', 'Indonesian Rupiah']}

# Make the list of indices
indices = ['Ind', 'Chi', 'US', 'Pak', 'Indo']

# Pass the indices to the index parameter
df = pd.DataFrame(data=data_dict, index=indices)

df
Making indexed dataframe
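A custom index is mainly useful for looking rows up by label. As a brief sketch on a two-row subset of the data, the loc accessor can fetch a row or a single value using the labels passed to the index parameter.

```python
import pandas as pd

data_dict = {'Country': ['India', 'China'],
             'Population': [1393409038, 1444216107]}

df = pd.DataFrame(data=data_dict, index=['Ind', 'Chi'])

# Select a single value by row label and column name
print(df.loc['Chi', 'Country'])

# Select a whole row by its label
print(df.loc['Ind'])
```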

Making a new DataFrame from existing DataFrames

pandas.concat

You can also make new DataFrames from existing DataFrames using the pandas.concat function. The DataFrames can be joined, or concatenated, either vertically or horizontally as required.

Joining two DataFrames horizontally

You can join two DataFrames horizontally by setting the value of the axis parameter to 1.

# -- Joining Horizontally
# Create 1st DataFrame
countries = ['India', 'China', 'United States', 'Pakistan', 'Indonesia']

df1 = pd.DataFrame(countries, columns=['Country'])

# Create 2nd DataFrame

df2_data = {'Population': [1393409038, 1444216107, 332129157, 225199937, 276361783],
            'Currency': ['Indian Rupee', 'Renminbi', 'US Dollar', 'Pakistani Rupee', 'Indonesian Rupiah']}

df2 = pd.DataFrame(df2_data)

# Join the two DataFrames horizontally by setting the axis value equal to 1
df_joined = pd.concat([df1, df2], axis=1)

df_joined
Horizontally joining two dataframes

Joining two DataFrames vertically

You can also join two DataFrames vertically, if they have the same column names, by setting the value of the axis parameter to 0.

# -- Joining Vertically
# Create the 1st DataFrame
df_top_data = {'Country': ['India', 'China', 'United States'],
               'Population': [1393409038, 1444216107, 332129157],
               'Currency': ['Indian Rupee', 'Renminbi', 'US Dollar']}

df_top = pd.DataFrame(df_top_data)

# Create the 2nd DataFrame
df_bottom_data = {'Country': ['Pakistan', 'Indonesia'], 'Population': [
    225199937, 276361783], 'Currency': ['Pakistani Rupee', 'Indonesian Rupiah']}

df_bottom = pd.DataFrame(df_bottom_data)

# Join the two DataFrames vertically by setting the axis value equal to 0
df_joined = pd.concat([df_top, df_bottom], axis=0)

df_joined
Vertically joining two dataframes
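When DataFrames are joined vertically, each one keeps its own row labels, so the combined index can contain repeated values. If a fresh 0 to n-1 index is preferred, the ignore_index parameter of pandas.concat can be set to True, as in this minimal sketch with single-column DataFrames.

```python
import pandas as pd

df_top = pd.DataFrame({'Country': ['India', 'China', 'United States']})
df_bottom = pd.DataFrame({'Country': ['Pakistan', 'Indonesia']})

# Default behaviour: the original row labels are kept
kept = pd.concat([df_top, df_bottom], axis=0)

# ignore_index=True renumbers the rows of the result from zero
renumbered = pd.concat([df_top, df_bottom], axis=0, ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```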

Making pandas DataFrames from text files

The pandas.read_csv function is one of the most popular functions used to read external text files.

Even though the name of the function says 'csv', it can read other delimited text files as well. Such files are often exported from different databases, so they can come in different formats (.csv, .txt, etc.) or encodings (utf-8, ascii, etc.). The pandas.read_csv function offers a number of parameters which can be configured for reading and parsing the files as required.

You can find the entire list of parameters, as well as what they do, in the pandas documentation for read_csv.

Now, you will see how to load a dataset using the read_csv function:

# Enter the path where the file is located
df = pd.read_csv(filepath_or_buffer='D:/PERSONAL/DATASETS/household_power_consumption.txt')
df
Making dataframes from text files

This does not seem right. The DataFrame has not been loaded properly, as all the values of each row are present in a single column.

This is because the default character by which read_csv separates the values of different columns in a row is a comma (,).
Most likely, pandas did not find any commas while reading the file, so all the values of each row were placed in a single column.

If you look closely at the data in the rows, you will see that the values are separated by a semicolon (;). Therefore, in this case, you need to set the sep parameter to ';'.

# Define the sep parameter
df = pd.read_csv(filepath_or_buffer='D:/PERSONAL/DATASETS/household_power_consumption.txt', sep=';')
df
Making dataframes from text files

Now, you can see that the different feature values in the rows have been placed under the columns in an appropriate manner.
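The effect of the sep parameter can be reproduced without the file itself. The sketch below reads a small semicolon-separated string through io.StringIO; the three column names and the two rows are made-up stand-ins, not the actual contents of household_power_consumption.txt.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated text standing in for the real file
raw = (
    "Date;Time;Voltage\n"
    "16/12/2006;17:24:00;234.84\n"
    "16/12/2006;17:25:00;233.63\n"
)

# Default separator (comma): each line ends up in one column
df_wrong = pd.read_csv(io.StringIO(raw))

# Semicolon separator: the values are split into three columns
df_right = pd.read_csv(io.StringIO(raw), sep=';')

print(df_wrong.shape)
print(df_right.shape)
print(df_right.columns.tolist())
```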

Making pandas DataFrames from different types of files

The pandas library also has a variety of functions for reading other file types and data sources, in addition to text files, and loading them as DataFrames, such as:

  1. read_excel: For reading Excel (.xlsx, .xls) and OpenDocument (.odf, .ods, .odt) files.
  2. read_parquet: For reading Apache Parquet files.
  3. read_orc: For reading ORC (.orc) files.
  4. read_spss: For reading SPSS (.sav) files.
  5. read_stata: For reading Stata (.dta) files.
  6. read_sql: For executing SQL queries on tables in databases.
  7. read_gbq: For executing SQL queries on tables stored in Google BigQuery (deprecated in recent pandas versions in favor of the pandas-gbq package).
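As one illustration of these readers, the sketch below loads the result of a SQL query into a DataFrame with pandas.read_sql_query. It uses an in-memory SQLite database from Python's standard library so that it runs without any external server; the table name and its two columns are assumptions made for this example.

```python
import sqlite3
import pandas as pd

# Build a small in-memory database (hypothetical table and columns)
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE countries (country TEXT, population INTEGER)')
conn.executemany(
    'INSERT INTO countries VALUES (?, ?)',
    [('India', 1393409038), ('China', 1444216107)],
)
conn.commit()

# Run a query and load the result as a DataFrame
df = pd.read_sql_query(
    'SELECT country, population FROM countries ORDER BY population DESC',
    conn,
)
conn.close()

print(df)
```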

Practical Tips

  1. Keep in mind that when making a DataFrame from a list of lists, each inner list is a row, so its elements are the values of different features.
    When making a DataFrame from a dictionary, each key is a column name and its value is the array of that feature's values.
  2. The chunksize parameter of the read_csv function is very useful for datasets that are too big to fit into memory. When chunksize is set, pandas loads only chunksize rows at a time, so each chunk can be processed before the next one is loaded.
  3. Try to inspect the dataset manually before loading it into Python. This can give you an idea of the delimiter used, or of any rows at the beginning or the end of the dataset which should be ignored while loading it.
  4. If you wish to use column names that are different from the ones present in the dataset, pass them as a list to the names parameter. The original column names will then appear as the first row of data, but this row can be ignored by setting skiprows=1 (or header=0).
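Tips 2 and 4 can be tried out on a small in-memory example. The sketch below is only illustrative: the CSV text is made up, and a chunksize of 2 is far smaller than anything you would use on a real large file.

```python
import io
import pandas as pd

# Hypothetical CSV content with its own header row
raw = (
    "country,population\n"
    "India,1393409038\n"
    "China,1444216107\n"
    "United States,332129157\n"
)

# Tip 2: read two rows at a time and accumulate a running total
total = 0
for chunk in pd.read_csv(io.StringIO(raw), chunksize=2):
    total += chunk['population'].sum()

# Tip 4: replace the header found in the file with custom column names
df = pd.read_csv(io.StringIO(raw), names=['Country', 'Population'], header=0)

print(total)
print(df.columns.tolist())
```

With chunksize set, read_csv returns an iterator of DataFrames rather than a single DataFrame, which is why the result is consumed in a loop.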

Test Your Knowledge

Q1: Which parameter is used to pass a custom list of column names in the read_csv function?

Answer: Use the names parameter. If the file already has a header row, also set `header=0` so that the existing header is replaced.

Q2: You have a dataset which does not fit into memory. How will you load such a dataset in Python?

Answer: Use the chunksize parameter of read_csv to load only a certain number of rows at a time.

Q3: Write the code to make the DataFrame shown using a numpy array without explicitly passing a list of column names:

Pandas Dataframe Python - Question 3

Answer:

data_array = np.array(
    [['India', 'China',],
     ['Indian Rupee', 'Renminbi']])

dict_array = {
    'Country': data_array[0],
    'Currency': data_array[1]}

df = pd.DataFrame(dict_array)

df

Q4: Complete the following line of code to ignore 100 rows from the bottom of the dataset:

df = pd.read_csv(filepath_or_buffer='household_power_consumption.txt',sep=';')

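Answer:

# skipfooter is not supported by the default C parser, so the Python engine is requested explicitly
df = pd.read_csv(filepath_or_buffer='household_power_consumption.txt', sep=';', skipfooter=100, engine='python')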

Q5: Suppose you have the following dataset named ‘countries.csv’:

Pandas Dataframe Python - Question 5

Write the code for loading the file in Python. Make sure that the 'NA' values are recognized as missing values by pandas.

Answer:

df = pd.read_csv('countries.csv', na_values='NA')

 

This article was contributed by Shreyansh.
