Pandas Sample – Randomly Sample Rows From Dataframe

Use the pandas.DataFrame.sample() method from the pandas library to randomly select rows from a DataFrame.

Randomly selecting rows can be useful for inspecting the values of a DataFrame.

In this article, you will learn about the different configurations of this method for randomly selecting rows from a DataFrame followed by a few practical tips for using this method for different purposes.

Create a DataFrame

# Make a DataFrame
import pandas as pd

# Create the data of the DataFrame as a dictionary
data_df = {'Name': ['OpenCV', 'Tensorflow', 'Matlab', 'CUDA', 'Theano', 'Keras', 'GPUImage', 'YOLO', 'BoofCV'],
           'Created By': ['Gary Bradski', 'Google Brain', 'Cleve Moler', 'Ian Buck', 'MILA',
                          'Francois Chollet', 'Brad Larson', 'Joseph Redmon', 'Peter Abeles'],
           'Written in': ['C++', 'Python', 'C++', 'C++', 'Python', 'Python', 'C', 'C', 'Java']}

# Create the DataFrame from the dictionary
df = pd.DataFrame(data_df)
df

To learn more about creating and loading pandas DataFrames, click here.

pandas.DataFrame.sample()

Syntax: DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None)
Purpose: To return a random sample of rows or columns of a DataFrame.
Parameters:
- n:Int (default: None). It is used to specify the number of randomly selected rows or columns to be returned from the DataFrame.
- frac:Float (default: None). It is used to specify the number of rows to be returned as a fraction of the total number of rows in the DataFrame. Although it can be considered as an alternative to the n parameter, it cannot be used with n at the same time.
- replace:Boolean (default: False). It is used to specify if the same row or column can be returned more than once or not.
- weights:String or array (default: None). It is used to add bias to the selected rows or columns so that they will have a higher chance of getting returned by the method.
- random_state:Int or array or BitGenerator or numpy.random.RandomState (default: None). It is used to specify the seed value which will be used in the random number generator.
- axis:0 or 1 (default: None). It is used to specify the orientation along which the objects are to be returned. Pass the value 0 to this parameter to return randomly selected rows or pass the value 1 to return randomly selected columns.
Returns: A pandas Series or DataFrame, depending on the type of the calling object.
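
As a small illustration of the axis parameter and the return type described above, the sketch below samples columns instead of rows and then samples from a single column. The DataFrame toy and its values are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, used only for this illustration
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# axis=1 samples columns; every row is kept
cols = toy.sample(n=2, axis=1, random_state=0)
print(cols.shape)  # (3, 2)

# Calling sample() on a Series returns a Series
s = toy['a'].sample(n=2, random_state=0)
print(type(s).__name__)  # Series
```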

Using the DataFrame.sample() method

You can call the DataFrame.sample() method without passing any parameters. In that case, every parameter takes its default value and a single randomly selected row of the DataFrame is returned.

# Use the DataFrame.sample() method to return a single randomly selected row
df.sample()

Using the n parameter

The default configuration of the DataFrame.sample() method returns only a single row. To return multiple rows, you can use the n parameter to specify the number of rows to be returned.

# Return three randomly selected rows from the DataFrame
df.sample(n=3)

Using the frac parameter

Using the frac parameter, you can specify the number of rows to be returned as a fraction of the total number of rows present in the DataFrame.

# Return 30% of the total number of rows from the DataFrame
df.sample(frac=0.3)
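
To make the effect of frac more concrete, here is a minimal sketch on a made-up nine-row DataFrame (toy is a hypothetical name). It shows how a fractional row count is rounded, that frac=1 can be used to shuffle all rows, and that n and frac cannot be passed together.

```python
import pandas as pd

# Hypothetical nine-row DataFrame, used only for this illustration
toy = pd.DataFrame({'x': range(9)})

# 0.3 * 9 = 2.7, which pandas rounds to 3 rows
part = toy.sample(frac=0.3, random_state=0)
print(len(part))  # 3

# frac=1 returns every row exactly once, in random order
shuffled = toy.sample(frac=1, random_state=0)
print(sorted(shuffled['x']) == list(range(9)))  # True

# n and frac cannot be combined
try:
    toy.sample(n=2, frac=0.3)
except ValueError as err:
    print('ValueError:', err)
```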

Using the replace parameter

With the help of this parameter, you can return the same row more than once. The default value of this parameter is False, which means the same row cannot be selected more than once. Set its value to True to allow duplicate rows.

# Sample three rows with replacement, so the same row can appear more than once
df.sample(n=3, replace=True, random_state=2)
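
The difference between the two settings can be checked with a short sketch like the one below. The five-row DataFrame toy is invented for the example.

```python
import pandas as pd

# Hypothetical five-row DataFrame, used only for this illustration
toy = pd.DataFrame({'x': range(5)})

# Without replacement, every sampled row is distinct
unique_rows = toy.sample(n=5)
print(unique_rows.index.is_unique)  # True

# Asking for more rows than exist fails unless replace=True
try:
    toy.sample(n=6)
    failed = False
except ValueError:
    failed = True
print(failed)  # True

# Six draws from five rows must contain at least one duplicate
more = toy.sample(n=6, replace=True)
print(len(more), more.index.is_unique)  # 6 False
```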

Using the weights parameter

The DataFrame.sample() method returns different rows each time it is called. However, if you want certain rows to have a higher chance of getting returned, you can use the weights parameter to specify the probability of those rows getting returned.

# Add bias to those rows which should be returned more frequently than the others
bias = [15, 10, 0.5, 0.55, 0.4, 0.2, 0.1, 0.6, 8]
df.sample(n=2, weights=bias)

As you can see, the first, second, and the last row have been assigned higher weights than the other rows. This means that those rows will have a higher chance of being returned each time this method is called.
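
A rough way to see the bias at work is to draw many rows with replacement and count how often each one appears. The sketch below uses a made-up DataFrame toy with a hypothetical weight column w; it also shows that weights accepts a column name as well as a list.

```python
import pandas as pd

# Hypothetical data: column 'w' holds the sampling weights
toy = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                    'w': [5, 1, 0, 0]})

# Rows with weight 0 are never chosen; on each draw 'a' is five times as likely as 'b'
picks = toy.sample(n=1000, replace=True, weights='w', random_state=0)
print(picks['name'].value_counts())

# An explicit list of weights behaves the same way as the column name
same = toy.sample(n=1000, replace=True, weights=[5, 1, 0, 0], random_state=0)
print(picks.equals(same))
```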

Using the random_state parameter

You can use the random_state parameter to ensure that the same rows are returned each time the method is called.

# Ensure that the same three rows are repeated each time the method is called
df.sample(n=3, random_state=0)
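
Reproducibility can be verified by comparing two calls that use the same seed. The following is a minimal sketch on a made-up 100-row DataFrame named toy.

```python
import pandas as pd

# Hypothetical 100-row DataFrame, used only for this illustration
toy = pd.DataFrame({'x': range(100)})

# Two calls with the same seed select the same rows in the same order
first = toy.sample(n=3, random_state=0)
second = toy.sample(n=3, random_state=0)
print(first.equals(second))  # True
```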

Practical Tips

When you pass True to the replace parameter, you can return more rows than the DataFrame contains, though some of them will be duplicates.

print('Rows and columns present in the DataFrame:', df.shape)

df.sample(n=15, replace=True)

Rows and columns present in the DataFrame: (9, 3)

While using the weights parameter, you can assign weights greater than 1 to the rows; pandas normalizes the weights so that they sum to 1.
The random_state parameter is useful if you want to share your code with someone else and ensure that the outputs are reproducible.
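
Because the weights are normalized, only their relative sizes matter. The sketch below, which assumes a made-up nine-row DataFrame toy and reuses the bias values from earlier, doubles every weight and checks that the same rows come back for the same seed.

```python
import pandas as pd

# Hypothetical nine-row DataFrame, used only for this illustration
toy = pd.DataFrame({'x': range(9)})

bias = [15, 10, 0.5, 0.55, 0.4, 0.2, 0.1, 0.6, 8]
doubled = [2 * w for w in bias]  # same proportions, different scale

# After normalization both weight lists describe the same probabilities
a = toy.sample(n=2, weights=bias, random_state=0)
b = toy.sample(n=2, weights=doubled, random_state=0)
print(a.equals(b))  # True
```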

Test Your Knowledge

Q1: The frac parameter is used to return a fraction of the total rows of a DataFrame after randomly selecting them. True or False?

Answer: True

Q2: What is the difference between the function of the weights parameter and the random_state parameter?

Answer: The random_state parameter ensures that the output is the same each time the DataFrame.sample() method is called. The weights parameter increases the chance that rows with higher weights are selected, but it does not guarantee that those rows are returned every time the method is called.

Q3: Write the code to return any three randomly selected rows from the DataFrame df. Ensure that the same rows are returned each time the method is called.

Answer: df.sample(n=3, random_state=0)

Q4: You have a DataFrame df that has 5 rows and 4 columns. Write the code to randomly return 10 rows from the DataFrame. The returned rows do not have to be necessarily unique.

Answer: df.sample(n=10, replace=True)

Q5: Write the code to return 47% of all the rows in the DataFrame df.

Answer: df.sample(frac=0.47)

The article was contributed by Shreyansh B and Shri Varsheni
