The data.table package in R is super fast when it comes to handling data. It has a syntax that reduces keystrokes while making R code easier to read. These set of exercises are designed to help you to oil your data brain through solving data manipulation exercises. Related post: 101 Python datatable Exercises (pydatatable)
101 R data.table Exercises. Photo by Ela Abbau.
1. How to install data.table and check the version?
# Install the data.table package
install.packages("data.table")
# Load the library
library(data.table)
# Check version
print(packageVersion("data.table"))
2. How to create a data.table from lists (vectors) ?
Difficulty Level: L1 Question : Create a pandas series from each of the items below: a list, numpy and a dictionary Input
list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")
Show Solution# Inputs
list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")
# Solution
DT <- data.table(V1= list_1, V2 =list_2, V3 = list_3 )
print(DT)
V1 V2 V3
1: a 1 aa
2: b 2 bb
3: c 3 cc
4: d 4 dd
3. How to import csv data files as data.table ?
Difficulty Level: L1 Question : Import BostonHousing dataset as a data.table. Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
print(head(DT))
crim zn indus chas nox rm age dis rad tax ptratio b lstat
1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7
4. How to import only ‘n’ rows from a csv file to create a data.table?
Difficiulty Level: L1 Question : Import first 50 row of BostonHousing dataset as a data.table. Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', nrows = 50)
print(head(DT))
crim zn indus chas nox rm age dis rad tax ptratio b lstat
1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
2: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
3: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
4: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
5: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
6: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7
5. How to import only specified columns from a csv file?
Difficulty Level: L1 Question : Import only ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a data.table. Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', select = c('crim', 'medv'))
print(head(DT))
crim medv
1: 0.00632 24.0
2: 0.02731 21.6
3: 0.02729 34.7
4: 0.03237 33.4
5: 0.06905 36.2
6: 0.02985 28.7
6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table
Difficulty Level: L2 Question : Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
Show SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
# number of rows and columns
print(NROW(DT))
print(NCOL(DT))
# datatypes
print(sapply(DT,class))
# summary statistics
summary(DT)
[1] 506
[1] 14
crim zn indus chas nox rm age dis
"numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric"
rad tax ptratio b lstat medv
"integer" "integer" "numeric" "numeric" "numeric" "numeric"
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio b
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
7. How to extract the row and column number of a particular cell with given criterion?
Difficulty Level: L2 Question : Which manufacturer, model and type has the highest Price
? What is the row and column number of the cell with the highest Price
value? Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show SolutionFree Time Series Project Template
Do you want learn how to approach projects across different domains with Time Series?
Get started with your first Time Series Industry Project and Learn how to use and implement algorithms like ARIMA, SARIMA, SARIMAX, Simple Exponential Smoothing and Holt-Winters.

Do you want learn how to approach projects across different domains with Time Series?
Get started with your first Time Series Industry Project and Learn how to use and implement algorithms like ARIMA, SARIMA, SARIMAX, Simple Exponential Smoothing and Holt-Winters.
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
# Get Manufacturer with highest price
print(DT[Price == max(Price, na.rm = TRUE)][,list(Manufacturer, Model, Type)])
# Get row number
DT[, .(rownum=.I, Price)][Price == max(Price, na.rm = TRUE), rownum]
Manufacturer Model Type
1: Mercedes-Benz 300E Midsize
59
8. How to rename a specific columns in a data.table?
Difficulty Level: Easy Question : Rename the column Model
as New_Model
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
print(head(setnames(DT, old = "Model", new = "New_Model")[]))
Manufacturer New_Model Type Min.Price Price Max.Price MPG.city
1: Acura Integra Small 12.9 15.9 18.8 25
2: <NA> Legend Midsize 29.2 33.9 38.7 18
3: Audi 90 Compact 25.9 29.1 32.3 20
4: Audi 100 Midsize NA 37.7 44.6 19
5: BMW 535i Midsize NA 30.0 NA 22
6: Buick Century Midsize 14.2 15.7 17.3 22
MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower
1: 31 None Front 4 1.8 140
2: 25 Driver & Passenger Front 6 3.2 200
3: 26 Driver only Front 6 2.8 172
4: 26 Driver & Passenger <NA> 6 NA 172
5: 30 <NA> Rear 4 3.5 208
6: 31 Driver only <NA> 4 2.2 110
RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1: 6300 2890 Yes 13.2 5 177
2: 5500 2335 Yes 18.0 5 195
3: 5500 2280 Yes 16.9 5 180
4: 5500 2535 <NA> 21.1 6 193
5: 5700 2545 Yes 21.1 4 186
6: 5200 2565 No 16.4 6 189
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
1: 102 68 37 26.5 NA 2705 non-USA
2: 115 71 38 30.0 15 3560 non-USA
3: 102 67 37 28.0 14 3375 non-USA
4: 106 NA 37 31.0 17 3405 non-USA
5: 109 69 39 27.0 13 3640 non-USA
6: 105 69 41 28.0 16 NA USA
Make
1: Acura Integra
2: Acura Legend
3: Audi 90
4: Audi 100
5: BMW 535i
6: Buick Century
9. How to check if a data.table has any missing values?
Difficulty Level: L1 Question : Check if DT
has any missing values. Return TRUE
is present else FALSE
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
anyNA(DT)
TRUE
10. How to check for missing values in a data.table?
Difficulty Level: L1 Question : Get the location of missing values in DT
. Replace the value with TRUE
is missing else FALSE
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(...truncated...)
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
print(head(is.na(DT)))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE TRUE FALSE TRUE FALSE FALSE FALSE
[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] TRUE FALSE FALSE FALSE FALSE TRUE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
Turn.circle Rear.seat.room Luggage.room Weight Origin Make
[1,] FALSE FALSE TRUE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE TRUE FALSE FALSE
11. How to get the row and column positions of missing values in a data.table?
Difficulty Level: L2 Question: Get the row and column positions of missing values in DT. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
[1] "(1, 24)" "(2, 1)" "(4, 4)" "(4, 10)" "(4, 12)" "(4, 16)" "(4, 21)"
[8] "(5, 4)" "(5, 6)" "(5, 9)" "(6, 10)" "(6, 25)" "(7, 6)" "(7, 15)"
[15] "(7, 17)" "(9, 21)" "(10, 15)" "(11, 24)" "(12, 9)" "(12, 10)" "(12, 13)"
[22] "(13, 11)" "(13, 24)" "(14, 11)" "(14, 17)" "(15, 7)" "(15, 13)" "(15, 19)"
[29] "(15, 22)" "(16, 24)" "(17, 13)" "(17, 24)" "(19, 23)" "(19, 24)" "(19, 26)"
[36] "(20, 1)" "(20, 22)" "(23, 17)" "(23, 23)" "(24, 15)" "(24, 24)" .. (truncated)
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
output <- c() # init output vector
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
if(is.na(curr_value)){
position <- paste0('(', paste(i, j, sep=', '), ')')
output <- c(output, position)
}
}
}
11. How to count the number of missing values in each column of data.table?
Difficulty Level: L2 Question : Count the number of missing values in each column of DT
Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
for (i in colnames(DT)){
print(paste("Number of missing values in column", i , "is", sum(is.na(DT[,..i]))))
}
[1] "Number of missing values in column Manufacturer is 4"
[1] "Number of missing values in column Model is 1"
[1] "Number of missing values in column Type is 3"
[1] "Number of missing values in column Min.Price is 7"
[1] "Number of missing values in column Price is 2"
[1] "Number of missing values in column Max.Price is 5"
[1] "Number of missing values in column MPG.city is 9"
[1] "Number of missing values in column MPG.highway is 2"
[1] "Number of missing values in column AirBags is 6"
[1] "Number of missing values in column DriveTrain is 7"
[1] "Number of missing values in column Cylinders is 5"
[1] "Number of missing values in column EngineSize is 2"
[1] "Number of missing values in column Horsepower is 7"
[1] "Number of missing values in column RPM is 3"
[1] "Number of missing values in column Rev.per.mile is 6"
[1] "Number of missing values in column Man.trans.avail is 5"
[1] "Number of missing values in column Fuel.tank.capacity is 8"
[1] "Number of missing values in column Passengers is 2"
[1] "Number of missing values in column Length is 4"
[1] "Number of missing values in column Wheelbase is 1"
[1] "Number of missing values in column Width is 6"
[1] "Number of missing values in column Turn.circle is 5"
[1] "Number of missing values in column Rear.seat.room is 4"
[1] "Number of missing values in column Luggage.room is 19"
[1] "Number of missing values in column Weight is 7"
[1] "Number of missing values in column Origin is 5"
[1] "Number of missing values in column Make is 3"
12. How to replace missing values of data.table with 0?
Difficulty Level: L2 Question : Replace all missing values in DT
with 0. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
DT[is.na(DT)] <- 0
print(head(DT[, 1:8]))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Acura Integra Small 12.9 15.9 18.8 25 31
2: 0 Legend Midsize 29.2 33.9 38.7 18 25
3: Audi 90 Compact 25.9 29.1 32.3 20 26
4: Audi 100 Midsize 0.0 37.7 44.6 19 26
5: BMW 535i Midsize 0.0 30.0 0.0 22 30
6: Buick Century Midsize 14.2 15.7 17.3 22 31
13. How to replace missing values of numeric columns with the mean?
Difficulty Level: L2 Question : Replace missing values of numeric columns with their respective mean. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
replace_with_mean <- function(x){ifelse(is.na(x),mean(x, na.rm = TRUE),x)}
output <- DT[,lapply(.SD, replace_with_mean)]
print(head(output))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Acura Integra Small 12.9000 15.9 18.80000 25 31
2: <NA> Legend Midsize 29.2000 33.9 38.70000 18 25
3: Audi 90 Compact 25.9000 29.1 32.30000 20 26
4: Audi 100 Midsize 17.1186 37.7 44.60000 19 26
5: BMW 535i Midsize 17.1186 30.0 21.45909 22 30
6: Buick Century Midsize 14.2000 15.7 17.30000 22 31
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: None Front 4 1.800000 140 6300
2: Driver & Passenger Front 6 3.200000 200 5500
3: Driver only Front 6 2.800000 172 5500
4: Driver & Passenger <NA> 6 2.658242 172 5500
5: <NA> Rear 4 3.500000 208 5700
6: Driver only <NA> 4 2.200000 110 5200
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: 2890 Yes 13.2 5 177 102
2: 2335 Yes 18.0 5 195 115
3: 2280 Yes 16.9 5 180 102
4: 2535 <NA> 21.1 6 193 106
5: 2545 Yes 21.1 4 186 109
6: 2565 No 16.4 6 189 105
Width Turn.circle Rear.seat.room Luggage.room Weight Origin
1: 68.00000 37 26.5 13.98649 2705.000 non-USA
2: 71.00000 38 30.0 15.00000 3560.000 non-USA
3: 67.00000 37 28.0 14.00000 3375.000 non-USA
4: 69.44828 37 31.0 17.00000 3405.000 non-USA
5: 69.00000 39 27.0 13.00000 3640.000 non-USA
6: 69.00000 41 28.0 16.00000 3104.593 USA
Make
1: Acura Integra
2: Acura Legend
3: Audi 90
4: Audi 100
5: BMW 535i
6: Buick Century
14. How to select a specific column from a data.table as a list (vector)?
Difficulty Level: L1 Question : Get the column Model
in DT
as a list. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Solution
[1] "Integra" "Legend" "90" "100"
[5] "535i" "Century" "LeSabre" "Roadmaster"
[9] "Riviera" "DeVille" "Seville" "Cavalier"
[13] "Corsica" "Camaro" "Lumina" "Lumina_APV"
[17] "Astro" "Caprice" "Corvette" "Concorde"
[21] "LeBaron" "Imperial" (...truncated...)
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT <- DT[, Model]
print(DT)
[1] "Integra" "Legend" "90" "100"
[5] "535i" "Century" "LeSabre" "Roadmaster"
[9] "Riviera" "DeVille" "Seville" "Cavalier"
[13] "Corsica" "Camaro" "Lumina" "Lumina_APV"
[17] "Astro" "Caprice" "Corvette" "Concorde"
[21] "LeBaron" "Imperial" "Colt" "Shadow"
[25] "Spirit" "Caravan" "Dynasty" "Stealth"
[29] "Summit" "Vision" "Festiva" "Escort"
[33] "Tempo" "Mustang" "Probe" "Aerostar"
[37] "Taurus" "Crown_Victoria" "Metro" "Storm"
[41] "Prelude" "Civic" "Accord" "Excel"
[45] "Elantra" "Scoupe" "Sonata" "Q45"
[49] "ES300" "SC300" "Continental" "Town_Car"
[53] "323" "Protege" "626" "MPV"
[57] "RX-7" "190E" "300E" NA
[61] "Cougar" "Mirage" "Diamante" "Sentra"
[65] "Altima" "Quest" "Maxima" "Achieva"
[69] "Cutlass_Ciera" "Silhouette" "Eighty-Eight" "Laser"
[73] "LeMans" "Sunbird" "Firebird" "Grand_Prix"
[77] "Bonneville" "900" "SL" "Justy"
[81] "Loyale" "Legacy" "Swift" "Tercel"
[85] "Celica" "Camry" "Previa" "Fox"
[89] "Eurovan" "Passat" "Corrado" "240"
[93] "850"
15. How to select a specific column from a data.table as a data.table?
Difficulty Level: L2 Question : Get the column Model
in DT
as a data.table (rather than as a list). Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Solution
Model
1: Integra
2: Legend
3: 90
4: 100
5: 535i
6: Century
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT <- DT[, list(Model)]
print(head(DT))
Model
1: Integra
2: Legend
3: 90
4: 100
5: 535i
6: Century
16. How to sort a data.table based on a given column?
Difficulty Level: L1 Question: Reorder the rows as per Price
column of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT <- DT[order(Price)]
print(head(DT))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Ford Festiva Small 6.9 7.4 7.9 31 33
2: Hyundai Excel Small 6.8 8.0 9.2 29 33
3: Mazda 323 Small 7.4 8.3 9.1 29 37
4: Geo Metro Small 6.7 8.4 10.0 46 50
5: Subaru Justy Small 7.3 8.4 9.5 33 37
6: Suzuki Swift <NA> 7.3 8.6 NA 39 43
AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile
1: <NA> Front 4 1.3 63 5000 3150
2: None Front 4 1.5 81 5500 2710
3: None Front 4 1.6 82 5000 2370
4: None Front 3 1.0 55 5700 3755
5: None 4WD 3 1.2 73 5600 2875
6: None Front 3 1.3 70 6000 3360
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1: Yes 10.0 4 141 90 63
2: Yes 11.9 5 168 94 63
3: Yes 13.2 4 164 97 66
4: <NA> 10.6 4 151 93 63
5: Yes 9.2 4 146 90 60
6: Yes 10.6 4 161 93 NA
Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 33 26.0 12 1845 USA Ford Festiva
2: 35 26.0 11 2345 non-USA Hyundai Excel
3: 34 27.0 16 2325 non-USA Mazda 323
4: 34 27.5 10 1695 non-USA Geo Metro
5: 32 23.5 10 2045 non-USA Subaru Justy
6: 34 27.5 10 1965 non-USA Suzuki Swift
17. How to sort a data.table based on two or more columns?
Difficulty Level: L1 Question : Reorder the rows as per ascending order of Price
column and descending order of Max.Price
column of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT <- DT[order(Price, -Max.Price)]
print(head(DT))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Ford Festiva Small 6.9 7.4 7.9 31 33
2: Hyundai Excel Small 6.8 8.0 9.2 29 33
3: Mazda 323 Small 7.4 8.3 9.1 29 37
4: Geo Metro Small 6.7 8.4 10.0 46 50
5: Subaru Justy Small 7.3 8.4 9.5 33 37
6: Suzuki Swift <NA> 7.3 8.6 NA 39 43
AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile
1: <NA> Front 4 1.3 63 5000 3150
2: None Front 4 1.5 81 5500 2710
3: None Front 4 1.6 82 5000 2370
4: None Front 3 1.0 55 5700 3755
5: None 4WD 3 1.2 73 5600 2875
6: None Front 3 1.3 70 6000 3360
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1: Yes 10.0 4 141 90 63
2: Yes 11.9 5 168 94 63
3: Yes 13.2 4 164 97 66
4: <NA> 10.6 4 151 93 63
5: Yes 9.2 4 146 90 60
6: Yes 10.6 4 161 93 NA
Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 33 26.0 12 1845 USA Ford Festiva
2: 35 26.0 11 2345 non-USA Hyundai Excel
3: 34 27.0 16 2325 non-USA Mazda 323
4: 34 27.5 10 1695 non-USA Geo Metro
5: 32 23.5 10 2045 non-USA Subaru Justy
6: 34 27.5 10 1965 non-USA Suzuki Swift
18. How to filter out rows from a data.table that satisfy multiple conditions?
Difficulty Level: L2 Question : Filter out the rows having Manufacturer as Ford and Price less than 30 Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output :
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Ford Festiva Small 6.9 7.4 7.9 31 33
2: Ford Escort Small 8.4 10.1 11.9 23 30
3: Ford Tempo Compact 10.4 11.3 12.2 22 27
4: Ford Mustang Sporty 10.8 15.9 21.0 22 29
5: Ford Probe Sporty 12.8 14.0 15.2 NA 30
6: Ford Aerostar Van 14.5 19.9 25.3 15 20
AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile
1: <NA> Front 4 1.3 63 5000 3150
2: None Front 4 1.8 127 6500 2410
3: None Front 4 2.3 96 4200 2805
4: Driver only Rear 4 2.3 105 4600 2285
5: Driver only Front 4 2.0 115 5500 2340
6: Driver only 4WD 6 3.0 145 4800 2080
(...truncated...)
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
sol <- DT[Manufacturer == "Ford" & Price < 30]
print(head(sol))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Ford Festiva Small 6.9 7.4 7.9 31 33
2: Ford Escort Small 8.4 10.1 11.9 23 30
3: Ford Tempo Compact 10.4 11.3 12.2 22 27
4: Ford Mustang Sporty 10.8 15.9 21.0 22 29
5: Ford Probe Sporty 12.8 14.0 15.2 NA 30
6: Ford Aerostar Van 14.5 19.9 25.3 15 20
AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile
1: <NA> Front 4 1.3 63 5000 3150
2: None Front 4 1.8 127 6500 2410
3: None Front 4 2.3 96 4200 2805
4: Driver only Rear 4 2.3 105 4600 2285
5: Driver only Front 4 2.0 115 5500 2340
6: Driver only 4WD 6 3.0 145 4800 2080
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1: Yes 10.0 4 141 90 63
2: <NA> 13.2 5 171 98 67
3: Yes 15.9 5 177 100 68
4: Yes 15.4 4 180 101 68
5: Yes 15.5 4 179 103 70
6: Yes 21.0 7 176 119 72
Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 33 26.0 12 1845 USA Ford Festiva
2: 36 28.0 12 2530 USA Ford Escort
3: NA 27.5 13 NA USA <NA>
4: 40 24.0 12 NA USA Ford Mustang
5: 38 23.0 18 2710 USA Ford Probe
6: 45 30.0 NA 3735 USA Ford Aerostar
19. How to select multiple columns from data.table and rename them?
Difficulty Level: L2 Question : Get columns ‘Manufacturer’ & ‘Type’ from data.table and rename them to ‘MANUFACTURER’ & ‘TYPE’ Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
MANUFACTURER TYPE
1: Acura Small
2: <NA> Midsize
3: Audi Compact
4: Audi Midsize
5: BMW Midsize
6: Buick Midsize
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
sol <- DT[, .(MANUFACTURER = Manufacturer, TYPE = Type)]
print(head(sol))
MANUFACTURER TYPE
1: Acura Small
2: <NA> Midsize
3: Audi Compact
4: Audi Midsize
5: BMW Midsize
6: Buick Midsize
20. How to get mean of rows that satisfy a condition from a given column?
Difficulty Level: L2 Question : Get mean of all the vehicles having Price
greater than 30
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output :
38.6833
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[Price > 30, mean(Price, na.rm = TRUE)]
output
38.6833333333333
21. How to remove only specified columns from a data.table?
Difficulty Level: L2 Question : Create new data.table by removing ‘Manufacturer’ & ‘Type’ columns. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, !c("Manufacturer", "Type")]
print(head(output))
Model Min.Price Price Max.Price MPG.city MPG.highway AirBags
1: Integra 12.9 15.9 18.8 25 31 None
2: Legend 29.2 33.9 38.7 18 25 Driver & Passenger
3: 90 25.9 29.1 32.3 20 26 Driver only
4: 100 NA 37.7 44.6 19 26 Driver & Passenger
5: 535i NA 30.0 NA 22 30 <NA>
6: Century 14.2 15.7 17.3 22 31 Driver only
DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail
1: Front 4 1.8 140 6300 2890 Yes
2: Front 6 3.2 200 5500 2335 Yes
3: Front 6 2.8 172 5500 2280 Yes
4: <NA> 6 NA 172 5500 2535 <NA>
5: Rear 4 3.5 208 5700 2545 Yes
6: <NA> 4 2.2 110 5200 2565 No
Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle
1: 13.2 5 177 102 68 37
2: 18.0 5 195 115 71 38
3: 16.9 5 180 102 67 37
4: 21.1 6 193 106 NA 37
5: 21.1 4 186 109 69 39
6: 16.4 6 189 105 69 41
Rear.seat.room Luggage.room Weight Origin Make
1: 26.5 NA 2705 non-USA Acura Integra
2: 30.0 15 3560 non-USA Acura Legend
3: 28.0 14 3375 non-USA Audi 90
4: 31.0 17 3405 non-USA Audi 100
5: 27.0 13 3640 non-USA BMW 535i
6: 28.0 16 NA USA Buick Century
22. How to calculate count of each unique value in a given column?
Difficulty Level: L2 Question : Calculate number of records for every Manufacturer
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer N
1: Acura 1
2: <NA> 4
3: Audi 2
4: BMW 1
5: Buick 4
6: Cadillac 2
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, .(.N), by = .(Manufacturer)]
print(head(output))
Manufacturer N
1: Acura 1
2: <NA> 4
3: Audi 2
4: BMW 1
5: Buick 4
6: Cadillac 2
23. How to find mean of a column grouped by multiple columns?
Difficulty Level: L2 Question : Calculate mean Price
for every Manufacturer
, Type
combination in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Type V1
1: Acura Small 15.90000
2: <NA> Midsize 31.93333
3: Audi Compact 29.10000
4: Audi Midsize 37.70000
5: BMW Midsize 30.00000
6: Buick Midsize 21.00000
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)]
print(head(output))
Manufacturer Type V1
1: Acura Small 15.90000
2: <NA> Midsize 31.93333
3: Audi Compact 29.10000
4: Audi Midsize 37.70000
5: BMW Midsize 30.00000
6: Buick Midsize 21.00000
24. How to find mean of column grouped by multiple columns and sort by grouped columns?
Difficulty Level: L2 Question : Calculate mean Price
for every Manufacturer
, Type
combination and arrange it by the keys in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Type V1
1: Acura Small 15.90
2: Audi Compact 29.10
3: Audi Midsize 37.70
4: BMW Midsize 30.00
5: Buick Large 22.25
6: Buick Midsize 21.00
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), keyby = .(Manufacturer, Type)]
#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)][order(Manufacturer, Type)]
print(head(output))
Manufacturer Type V1
1: Acura Small 15.90
2: Audi Compact 29.10
3: Audi Midsize 37.70
4: BMW Midsize 30.00
5: Buick Large 22.25
6: Buick Midsize 21.00
25. How to select all numeric columns from data.table?
Difficulty Level: L2 Question : Select numeric columns from DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM
1: 12.9 15.9 18.8 25 31 1.8 140 6300
2: 29.2 33.9 38.7 18 25 3.2 200 5500
3: 25.9 29.1 32.3 20 26 2.8 172 5500
4: NA 37.7 44.6 19 26 NA 172 5500
5: NA 30.0 NA 22 30 3.5 208 5700
6: 14.2 15.7 17.3 22 31 2.2 110 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1: 2890 13.2 5 177 102 68
2: 2335 18.0 5 195 115 71
3: 2280 16.9 5 180 102 67
4: 2535 21.1 6 193 106 NA
5: 2545 21.1 4 186 109 69
6: 2565 16.4 6 189 105 69
(...truncated...)
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution 1
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
print(head(output))
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM
1: 12.9 15.9 18.8 25 31 1.8 140 6300
2: 29.2 33.9 38.7 18 25 3.2 200 5500
3: 25.9 29.1 32.3 20 26 2.8 172 5500
4: NA 37.7 44.6 19 26 NA 172 5500
5: NA 30.0 NA 22 30 3.5 208 5700
6: 14.2 15.7 17.3 22 31 2.2 110 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1: 2890 13.2 5 177 102 68
2: 2335 18.0 5 195 115 71
3: 2280 16.9 5 180 102 67
4: 2535 21.1 6 193 106 NA
5: 2545 21.1 4 186 109 69
6: 2565 16.4 6 189 105 69
Turn.circle Rear.seat.room Luggage.room Weight
1: 37 26.5 NA 2705
2: 38 30.0 15 3560
3: 37 28.0 14 3375
4: 37 31.0 17 3405
5: 39 27.0 13 3640
6: 41 28.0 16 NA
26. How to calculate mean of all numeric columns for every Manufacturer?
Difficulty Level: L3 Question : Calculate mean of all numeric columns for every Manufacturer Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize
1: Acura 12.900 15.900 18.80 25.0 31.00 1.800
2: <NA> 26.775 28.550 30.30 19.0 26.00 2.975
3: Audi NA 33.400 38.45 19.5 26.00 NA
4: BMW NA 30.000 NA 22.0 30.00 3.500
5: Buick 20.750 21.625 NA 19.0 27.75 3.875
6: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750
(...truncated...)
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- cbind(output, DT[, list(Manufacturer)])
output <- output[, lapply(.SD, mean), by = Manufacturer]
print(head(output))
Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize
1: Acura 12.900 15.900 18.80 25.0 31.00 1.800
2: <NA> 26.775 28.550 30.30 19.0 26.00 2.975
3: Audi NA 33.400 38.45 19.5 26.00 NA
4: BMW NA 30.000 NA 22.0 30.00 3.500
5: Buick 20.750 21.625 NA 19.0 27.75 3.875
6: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750
Horsepower RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase
1: 140.0 6300 2890.0 13.2 5.00 177.00 102.00
2: 186.5 5750 NA NA 5.00 193.25 109.75
3: 172.0 5500 2407.5 19.0 5.50 186.50 104.00
4: 208.0 5700 2545.0 21.1 4.00 186.00 109.00
5: 157.5 4700 NA NA 5.75 200.75 110.00
6: 247.5 5050 NA 19.0 5.50 205.00 112.50
Width Turn.circle Rear.seat.room Luggage.room Weight
1: 68.00 37.00 26.500 NA 2705.00
2: 71.25 NA 29.000 13.5 3458.75
3: NA 37.00 29.500 15.5 3390.00
4: 69.00 39.00 27.000 13.0 3640.00
5: NA 42.25 28.875 17.0 NA
6: 73.50 43.50 33.000 NA 3777.50
27. How to get first 3 rows for every Manufacturer?
Difficulty Level: L3 Question : Get first 3
rows for every Manufacturer
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Acura Integra Small 12.9 15.9 18.8 25 31
2: Audi 90 Compact 25.9 29.1 32.3 20 26
3: Audi 100 Midsize NA 37.7 44.6 19 26
4: BMW 535i Midsize NA 30.0 NA 22 30
5: Buick Century Midsize 14.2 15.7 17.3 22 31
6: Buick LeSabre Large 19.9 20.8 NA 19 28
(...truncated...)
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[!is.na(Manufacturer), head(.SD, 3), by = Manufacturer]
print(head(output))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: Acura Integra Small 12.9 15.9 18.8 25 31
2: Audi 90 Compact 25.9 29.1 32.3 20 26
3: Audi 100 Midsize NA 37.7 44.6 19 26
4: BMW 535i Midsize NA 30.0 NA 22 30
5: Buick Century Midsize 14.2 15.7 17.3 22 31
6: Buick LeSabre Large 19.9 20.8 NA 19 28
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: None Front 4 1.8 140 6300
2: Driver only Front 6 2.8 172 5500
3: Driver & Passenger <NA> 6 NA 172 5500
4: <NA> Rear 4 3.5 208 5700
5: Driver only <NA> 4 2.2 110 5200
6: Driver only Front 6 3.8 170 4800
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: 2890 Yes 13.2 5 177 102
2: 2280 Yes 16.9 5 180 102
3: 2535 <NA> 21.1 6 193 106
4: 2545 Yes 21.1 4 186 109
5: 2565 No 16.4 6 189 105
6: NA No NA 6 200 111
Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 68 37 26.5 NA 2705 non-USA Acura Integra
2: 67 37 28.0 14 3375 non-USA Audi 90
3: NA 37 31.0 17 3405 non-USA Audi 100
4: 69 39 27.0 13 3640 non-USA BMW 535i
5: 69 41 28.0 16 NA USA Buick Century
6: 74 42 30.5 17 3470 USA Buick LeSabre
28. How to remove specific column from data.table?
Difficulty Level: L2 Question : Remove column Model
from DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT[, Model := NULL]
print(head(DT))
Manufacturer Type Min.Price Price Max.Price MPG.city MPG.highway
1: Acura Small 12.9 15.9 18.8 25 31
2: <NA> Midsize 29.2 33.9 38.7 18 25
3: Audi Compact 25.9 29.1 32.3 20 26
4: Audi Midsize NA 37.7 44.6 19 26
5: BMW Midsize NA 30.0 NA 22 30
6: Buick Midsize 14.2 15.7 17.3 22 31
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: None Front 4 1.8 140 6300
2: Driver & Passenger Front 6 3.2 200 5500
3: Driver only Front 6 2.8 172 5500
4: Driver & Passenger <NA> 6 NA 172 5500
5: <NA> Rear 4 3.5 208 5700
6: Driver only <NA> 4 2.2 110 5200
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: 2890 Yes 13.2 5 177 102
2: 2335 Yes 18.0 5 195 115
3: 2280 Yes 16.9 5 180 102
4: 2535 <NA> 21.1 6 193 106
5: 2545 Yes 21.1 4 186 109
6: 2565 No 16.4 6 189 105
Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 68 37 26.5 NA 2705 non-USA Acura Integra
2: 71 38 30.0 15 3560 non-USA Acura Legend
3: 67 37 28.0 14 3375 non-USA Audi 90
4: NA 37 31.0 17 3405 non-USA Audi 100
5: 69 39 27.0 13 3640 non-USA BMW 535i
6: 69 41 28.0 16 NA USA Buick Century
29. How to set key to data.table?
Difficulty Level: L1 Question : Set column AirBags
as key to the data.table Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
setkey(DT,AirBags)
30. How to get all rows where AirBags
are at 'Driver & Passenger'
in DT
?
Difficulty Level: L2 Question : Get all rows where AirBags
are Driver & Passenger
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
setkey(DT,AirBags)
print(head(DT["Driver & Passenger"]))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: <NA> Legend Midsize 29.2 33.9 38.7 18 25
2: Audi 100 Midsize NA 37.7 44.6 19 26
3: Cadillac Seville Midsize 37.5 40.1 42.7 16 25
4: Chevrolet Camaro Sporty 13.4 15.1 16.8 19 28
5: <NA> Concorde Large 18.4 18.4 18.4 20 28
6: Chrysler LeBaron Compact 14.5 15.8 17.1 23 28
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: Driver & Passenger Front 6 3.2 200 5500
2: Driver & Passenger <NA> 6 NA 172 5500
3: Driver & Passenger Front 8 4.6 295 6000
4: Driver & Passenger Rear <NA> 3.4 160 4600
5: Driver & Passenger Front 6 3.3 153 5300
6: Driver & Passenger Front 4 3.0 141 5000
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: 2335 Yes 18.0 5 195 115
2: 2535 <NA> 21.1 6 193 106
3: 1985 No 20.0 5 204 111
4: 1805 Yes NA 4 193 101
5: 1990 No 18.0 6 203 113
6: 2090 No 16.0 6 183 104
Width Turn.circle Rear.seat.room Luggage.room Weight Origin
1: 71 38 30.0 15 3560 non-USA
2: NA 37 31.0 17 3405 non-USA
3: 74 44 31.0 NA 3935 USA
4: 74 43 25.0 13 3240 USA
5: 74 NA 31.0 15 3515 USA
6: 68 41 30.5 14 3085 USA
Make
1: Acura Legend
2: Audi 100
3: Cadillac Seville
4: Chevrolet Camaro
5: Chrylser Concorde
6: Chrysler LeBaron
31. How to get all rows where AirBags
are either 'Driver & Passenger'
or 'No data'
?
Difficulty Level: L2 Question : Get all rows where AirBags
are Driver & Passenger
& No data
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Show Solution#Input
library(data.table)
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
setkey(DT,AirBags)
print(DT[c("Driver & Passenger", "No data")])
Manufacturer Model Type Min.Price Price Max.Price MPG.city
1: <NA> Legend Midsize 29.2 33.9 38.7 18
2: Audi 100 Midsize NA 37.7 44.6 19
3: Cadillac Seville Midsize 37.5 40.1 42.7 16
4: Chevrolet Camaro Sporty 13.4 15.1 16.8 19
5: <NA> Concorde Large 18.4 18.4 18.4 20
6: Chrysler LeBaron Compact 14.5 15.8 17.1 23
7: Eagle Vision Large NA 19.3 21.2 20
8: Honda Prelude Sporty 17.0 19.8 22.7 24
9: Honda Accord Compact 13.8 17.5 21.2 24
10: <NA> SC300 Midsize 34.7 35.2 35.6 18
11: Lincoln Continental Midsize 33.3 34.3 35.3 17
12: Lincoln Town_Car Large 34.4 36.1 37.8 18
13: Mercedes-Benz 300E Midsize 43.8 61.9 80.0 19
14: Pontiac Firebird <NA> 14.0 17.7 21.4 19
15: Pontiac Bonneville Large 19.4 24.4 29.4 19
16: <NA> 850 Midsize 24.8 26.7 28.5 20
17: <NA> <NA> <NA> NA NA NA NA
MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower
1: 25 Driver & Passenger Front 6 3.2 200
2: 26 Driver & Passenger <NA> 6 NA 172
3: 25 Driver & Passenger Front 8 4.6 295
4: 28 Driver & Passenger Rear <NA> 3.4 160
5: 28 Driver & Passenger Front 6 3.3 153
6: 28 Driver & Passenger Front 4 3.0 141
7: 28 Driver & Passenger Front 6 3.5 NA
8: 31 Driver & Passenger Front 4 2.3 160
9: 31 Driver & Passenger Front 4 2.2 140
10: 23 Driver & Passenger Rear 6 3.0 225
11: 26 Driver & Passenger <NA> 6 3.8 160
12: 26 Driver & Passenger Rear 8 4.6 210
13: 25 Driver & Passenger Rear 6 3.2 217
14: 28 Driver & Passenger Rear 6 NA 160
15: 28 Driver & Passenger Front 6 3.8 170
16: 28 Driver & Passenger Front 5 2.4 168
17: NA No data <NA> <NA> NA NA
RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1: 5500 2335 Yes 18.0 5 195
2: 5500 2535 <NA> 21.1 6 193
3: 6000 1985 No 20.0 5 204
4: 4600 1805 Yes NA 4 193
5: 5300 1990 No 18.0 6 203
6: 5000 2090 No 16.0 6 183
7: 5800 1980 No 18.0 6 202
8: 5800 2855 Yes NA 4 175
9: 5600 NA Yes 17.0 4 185
10: 6000 2510 Yes NA 4 191
11: 4400 1835 No 18.4 6 205
12: 4600 1840 No 20.0 NA 219
13: 5500 2220 No 18.5 5 NA
14: 4600 1805 Yes 15.5 4 196
15: 4800 1565 No 18.0 6 177
16: 6200 NA Yes 19.3 5 184
17: NA NA <NA> NA NA NA
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
1: 115 71 38 30.0 15 3560 non-USA
2: 106 NA 37 31.0 17 3405 non-USA
3: 111 74 44 31.0 NA 3935 USA
4: 101 74 43 25.0 13 3240 USA
5: 113 74 NA 31.0 15 3515 USA
6: 104 68 41 30.5 14 3085 USA
7: 113 74 40 30.0 15 3490 USA
8: 100 70 39 23.5 8 2865 non-USA
9: 107 67 41 28.0 14 3040 non-USA
10: 106 71 39 25.0 9 3515 non-USA
11: 109 73 42 30.0 19 3695 USA
12: 117 77 45 31.5 22 4055 USA
13: 110 69 37 NA 15 3525 non-USA
14: 101 75 43 25.0 13 3240 USA
15: 111 74 43 30.5 18 3495 USA
16: 105 69 38 30.0 15 3245 non-USA
17: NA NA NA NA NA NA <NA>
Make
1: Acura Legend
2: Audi 100
3: Cadillac Seville
4: Chevrolet Camaro
5: Chrylser Concorde
6: Chrysler LeBaron
7: Eagle Vision
8: Honda Prelude
9: Honda Accord
10: Lexus SC300
11: Lincoln Continental
12: Lincoln Town_Car
13: Mercedes-Benz 300E
14: Pontiac Firebird
15: Pontiac Bonneville
16: Volvo 850
17: <NA>
33. How to get last and 2nd last (penultimate) row of data.table?
Difficulty Level: L2 Question 1 : Get penultimate row of DT
. Question 2 : Get last row of DT
. Input:
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output:
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: <NA> 850 Midsize 24.8 26.7 28.5 20 28
2: Volvo 240 Compact 21.8 22.7 23.5 21 28
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: Driver & Passenger Front 5 2.4 168 6200
2: Driver only Rear <NA> 2.3 114 5400
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: NA Yes 19.3 5 184 105
2: 2215 Yes 15.8 5 190 104
Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 69 38 30.0 15 3245 non-USA Volvo 850
2: 67 37 29.5 14 2985 non-USA Volvo 240
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
output <- DT[c(.N, .N - 1)]
print(output)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway
1: <NA> 850 Midsize 24.8 26.7 28.5 20 28
2: Volvo 240 Compact 21.8 22.7 23.5 21 28
AirBags DriveTrain Cylinders EngineSize Horsepower RPM
1: Driver & Passenger Front 5 2.4 168 6200
2: Driver only Rear <NA> 2.3 114 5400
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1: NA Yes 19.3 5 184 105
2: 2215 Yes 15.8 5 190 104
Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make
1: 69 38 30.0 15 3245 non-USA Volvo 850
2: 67 37 29.5 14 2985 non-USA Volvo 240
34. How to reorder columns of data.table?
Difficulty Level: L2 Question : Reorder columns of DT
as V2 , V1, V3
Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )
Desired Output
V2 V1 V3
1: 1 a aa
2: 2 b bb
3: 3 c cc
4: 4 d dd
Show Solution# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )
#Solution
output <- setcolorder(DT,c("V2","V1","V3"))
print(output)
V2 V1 V3
1: 1 a aa
2: 2 b bb
3: 3 c cc
4: 4 d dd
35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?
Difficulty Level: L2 Question : Select column Manufacturer
and compute standard deviation of column Price
and return a single value that gets recycled. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output:
Manufacturer Sd.Price
1: Acura 9.72428
2: <NA> 9.72428
3: Audi 9.72428
4: Audi 9.72428
5: BMW 9.72428
6: Buick 9.72428
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[,.(Manufacturer, Sd.Price = sd(Price, na.rm = TRUE))]
print(head(output))
Manufacturer Sd.Price
1: Acura 9.72428
2: <NA> 9.72428
3: Audi 9.72428
4: Audi 9.72428
5: BMW 9.72428
6: Buick 9.72428
36. How to use row indexing, column indexing and group by expressions all together?
Difficulty Level: L2 Question : Calculate sum of Weight
of cars under every Manufacturer
but exclude all 'Midsize'
Type
cars. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer V1
1: Acura 2705
2: Audi 3375
3: Buick 7575
4: Cadillac 3620
5: Chevrolet 23545
6: <NA> 3515
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output = DT[Type !="Midsize", sum(Weight), by=Manufacturer]
print(head(output))
Manufacturer V1
1: Acura 2705
2: Audi 3375
3: Buick 7575
4: Cadillac 3620
5: Chevrolet 23545
6: <NA> 3515
37. How to get the row number of the nth largest value in a column?
Difficulty Level: L2 Question : Find the row position of the 5th largest value of column Price
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
DT <- DT[order(Price)]
fifth_largest_price = DT[5, Price]
output = DT[Price == fifth_largest_price, which = TRUE]
output
- 4
- 5
38. How to find and cap outliers from a data.table column?
Difficulty Level: L2 Question: Replace all values of Length
column in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Solution
DT[Length > quantile(Length, .95, na.rm=T), Length := round(quantile(DT$Length, .95, na.rm=T))]
DT[Length < quantile(Length, .05, na.rm=T), Length := round(quantile(DT$Length, .05, na.rm=T))]
39. How to swap two column values based on a condition in another column of a data.table?
Difficulty Level: L2 Question : Swap column V1 & V3 for all the rows where V2 = 2. Input
DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )
Desired Output:
V1 V2 V3
1: a 1 aa
2: bb 2 b
3: cc 2 c
4: d 3 dd
Show Solution# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
#Solution
DT[V2=="2", c("V3", "V1") := .(V1, V3)]
print(DT)
V1 V2 V3
1: a 1 aa
2: bb 2 b
3: cc 2 c
4: d 3 dd
40. How to swap two rows of a dataframe?
Difficulty Level: L2 Questions : Swap rows 1 and 2 in DT
. Input
DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )
Desired Output
V1 V2 V3
1: b 2 bb
2: a 1 aa
3: c 2 cc
4: d 3 dd
Show Solution# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
# Solution
swap_rows <- function(dt, i1, i2){
i1_row <- DT[i1,]
i2_row <- DT[i2,]
DT[i1,] <- i2_row
DT[i2,] <- i1_row
return(DT)
}
output <- swap_rows(DT, 1,2)
print(output)
V1 V2 V3
1: b 2 bb
2: a 1 aa
3: c 2 cc
4: d 3 dd
41. How to reverse the rows of a data.table?
Difficulty Level: L2 Question : Reverse all the rows of DT
so that topmost row goes to bottom. Input
DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )
Desired Output
V1 V2 V3
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"
Show Solution# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
# Solution
output <- do.call(cbind, lapply(DT, rev))
print(output)
V1 V2 V3
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"
42. How to create one-hot encodings of a categorical variable (dummy variables)?
Difficulty Level: L2 Question : Get one-hot encodings for columns Gender
& Degree
in the data.table DT
and append it as columns. Input
DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )
Gender E_Id Degree
1: M 1 UG
2: M 2 PG
3: F 3 PhD
4: F 4 UG
Desired Output
E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1: 1 0 1 0 0 1
2: 2 0 1 1 0 0
3: 3 1 0 0 1 0
4: 4 1 0 0 0 1
Show Solution# Input
DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )
# Solution
output <- dcast(melt(DT, id.vars='E_Id'), E_Id ~ variable + value, fun = length)
print(output)
E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1: 1 0 1 0 0 1
2: 2 0 1 1 0 0
3: 3 1 0 0 1 0
4: 4 1 0 0 0 1
‘
43. How to find median value in each numeric column in R data.table?
Difficulty Level: L2 Question : Find row-wise sum values of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM
1: 14.6 17.7 19.15 21 28 2.3 140 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1: 2360 16.5 5 181 103 69
Turn.circle Rear.seat.room Luggage.room Weight
1: 39 27.5 14 3085
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- output[, lapply(.SD, median, na.rm = TRUE)]
print(head(output))
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM
1: 14.6 17.7 19.15 21 28 2.3 140 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1: 2360 16.5 5 181 103 69
Turn.circle Rear.seat.room Luggage.room Weight
1: 39 27.5 14 3085
44. How to compute correlation score of each column against other columns?
Difficulty Level: L2 Question : Compute correlation score of each column against other columns. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
print(cor(output, use = "na.or.complete"))
45. How to create a column containing the minimum by maximum of each row?
Difficulty Level: L2 Question : Compute the minimum-by-maximum for every row of df
.
DT <- data.table(V1 = runif(10, 1, 100),
V2 = runif(10, 1, 80),
V3 = runif(10, 1, 100))
Show Solution# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
#Solution
DT[,max_min:=do.call(pmin,.SD)/do.call(pmax,.SD)]
print(head(DT))
# Solution 2
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
DT[, max_min2:=apply(.SD, 1, min)/apply(.SD, 1, max)]
print(head(DT))
V1 V2 V3 max_min
1: 16.33154 66.08863 41.38638 0.2471157
2: 87.79467 38.57875 38.50150 0.4385404
3: 40.95019 19.47167 15.03562 0.3671685
4: 12.40790 17.68716 34.95050 0.3550136
5: 34.15262 24.52459 71.04999 0.3451738
6: 60.22314 60.22538 10.79905 0.1793106
46. How to scale
all columns in a dataframe?
Difficulty Level: L2 Question : Normalize all columns in DT
. Without using external packages. Input
DT <- data.table(V1= runif(10, 1, 100),
V2 =runif(10, 1, 80),
V3 = runif(10, 1, 100))
Show Solution# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
# Solution
output <- DT[, lapply(.SD, scale)]
print(output)
V1.V1 V2.V1 V3.V1
1: 0.9884659 -1.2595111 0.09593436
2: 0.1217149 -0.8867426 -0.31629923
3: 0.5872020 0.1443417 0.70839114
4: -0.5178531 0.5737876 1.12962574
5: 1.0221688 -0.9475154 -0.63975308
6: -1.5091412 1.2276058 1.11944272
7: -0.9203772 1.0074295 -0.33182208
8: -0.8005338 1.3745140 -1.41247525
9: 1.5281054 -0.9386551 -1.45771901
10: -0.4997516 -0.2952545 1.10467468
47. How to compute the correlation of each column with the succeeding column?
Difficulty Level: L2 Compute the correlation of each column of df
with its succeeding column. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Price
Min.Price 0.9705018
Max.Price
Price 0.9802519
MPG.city
Max.Price -0.5377239
MPG.highway
MPG.city 0.9479157
EngineSize
MPG.highway -0.6259651
Horsepower
EngineSize 0.7313269
RPM
Horsepower 0.04736874
Rev.per.mile
RPM 0.4623314
(...truncated...)
Show Solution#Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))] # numeric columns
for (i in 1:(ncol(output) - 1)){
a <- i + 1
print(cor(output[,..i], output[,..a], use = "na.or.complete"))
}
Price
Min.Price 0.9705018
Max.Price
Price 0.9802519
MPG.city
Max.Price -0.5377239
MPG.highway
MPG.city 0.9479157
EngineSize
MPG.highway -0.6259651
Horsepower
EngineSize 0.7313269
RPM
Horsepower 0.04736874
Rev.per.mile
RPM 0.4623314
Fuel.tank.capacity
Rev.per.mile -0.6318497
Passengers
Fuel.tank.capacity 0.4965661
Length
Passengers 0.4737477
Wheelbase
Length 0.8275146
Width
Wheelbase 0.8127431
Turn.circle
Width 0.8170549
Rear.seat.room
Turn.circle 0.4738118
Luggage.room
Rear.seat.room 0.668847
Weight
Luggage.room 0.6425643
48. How to replace both the diagonals of data.table with 0?
Difficulty Level: L2 Replace both values in both diagonals of DT
with 0. Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))
V1 V2 V3 V4
1: 97.451642 65.393451 6.496714 27.80014
2: 56.475427 5.234317 80.551154 10.86617
3: 65.965884 18.954913 70.968490 25.92509
4: 3.723218 26.590925 75.723006 42.22132
Desired output
V1 V2 V3 V4
1: 0.00000 65.39345 6.496714 0.00000
2: 56.47543 0.00000 0.000000 10.86617
3: 65.96588 0.00000 0.000000 25.92509
4: 0.00000 26.59093 75.723006 0.00000
Show Solution# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))
# Solution
for (i in 1:nrow(DT)){
DT[i,i] <- 0
DT[nrow(DT) - i + 1, i] <- 0
}
print(DT)
V1 V2 V3 V4
1: 0.00000 72.08959 12.63875 0.000000
2: 17.61880 0.00000 0.00000 1.696741
3: 82.20107 0.00000 0.00000 40.321027
4: 0.00000 40.47473 71.60260 0.000000
49. How to join two dataframes by 2 columns so they have only the common rows?
Difficulty Level: L2 Join dataframes df1
and df2
by ‘fruit-pazham’ and ‘weight-kilo’. Input
DT1 = data.table(fruit = c('apple', 'banana', 'orange'),
weight = c('high', 'medium', 'low'),
price= c(1,2,3))
DT2 = data.table(pazham = c('apple', 'orange', 'pine'),
kilo= c('high', 'low', 'high'),
price = c(1,2,3))
Desired Output
fruit weight price_left price_right
1: apple high 1 1
2: orange low 3 2
Show Solution# Input
DT1 = data.table(fruit = c('apple', 'banana', 'orange'),
weight = c('high', 'medium', 'low'),
price= c(1,2,3))
DT2 = data.table(pazham = c('apple', 'orange', 'pine'),
kilo= c('high', 'low', 'high'),
price = c(1,2,3))
# Solution
output <- merge(DT1, DT2,
by.x = c('fruit', 'weight'),
by.y = c('pazham', 'kilo'),
all = FALSE,
suffixes = c("_left", "_right"))
print(output)
fruit weight price_left price_right
1: apple high 1 1
2: orange low 3 2
‘
50. How to get the positions where values of two columns of data.table
match?
Difficulty Level: L2 Question : Get the positions where values of two columns of DT
match. Input
DT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango),
fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))
Desired Output
1 3
Show Solution# Input
DT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango'),
fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))
# Solution
which(DT$fruit1 == DT$fruit2)
- 1
- 3
‘
51. How to create lags and leads of a column in a data.table
?
Difficulty Level: L2 Create two new columns in df
, one of which is a lag 1 (shift column a
down by 1 row) of column ‘a’ and the other is a lead 1 (shift column b
up by 1 row). Input
DT <- data.table(V1 = runif(4, 1, 100),
V2 = runif(4, 1, 80),
V3 = runif(4, 1, 100),
V4 = runif(4,1, 50))
Desired Output
V1 V2 V3 V4 lagged_column lead_column
1: 90.41034 79.62526 57.525005 16.88133 NA 13.19427
2: 13.19427 29.20243 3.962821 35.72731 90.41034 13.79110
3: 13.79110 15.68368 78.933011 18.43108 13.19427 86.76293
4: 86.76293 33.66214 6.706639 43.40251 13.79110 NA
Show Solution# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))
# Solution
DT[, lagged_column := shift(V1, type = 'lag')]
DT[, lead_column := shift(V1, type = 'lead')]
print(DT)
V1 V2 V3 V4 lagged_column lead_column
1: 51.50006 74.050716 27.618570 16.68658 NA 82.29783
2: 82.29783 63.969537 9.054432 1.16761 51.50006 78.36425
3: 78.36425 63.937995 85.047398 21.39628 82.29783 63.64181
4: 63.64181 2.482478 10.060054 27.33597 78.36425 NA
52. How to get the frequency of unique values in the entire dataframe?
Difficulty Level: L2 Question : Get the frequency of unique values in the entire dataframe DT
. Input
DT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))
Desired Output
fruit number_of_distinct_orders
1: apple 2
2: orange 1
3: banana 1
Show Solution# Input
DT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))
# Solution
output <- DT
output[,flag := 1]
print(output[, .(number_of_distinct_orders = length(flag)), by = fruit])
fruit number_of_distinct_orders
1: apple 2
2: orange 1
3: banana 1
53. How to convert a table from long to wide format using R data.table?
Question : Create a pivot table out of DT
keeping ID & Month
as key. Difficulty Level: L2 Input
n <- 5
DT <- data.table(
ID = sample(1:20, n, replace=TRUE),
Month = sample(1:12, n, replace=TRUE),
Category = c("Drinks", "Food", "Drinks", "Food", "Food"),
Expenses = runif(n),
key = c('ID', 'Month')
)
ID Month Category Expenses
1: 4 3 Drinks 0.9921504
2: 4 4 Food 0.8073523
3: 8 10 Food 0.5533336
4: 14 8 Food 0.2588098
5: 20 4 Drinks 0.2012480
Desired Output
ID Month Drinks Food
1: 4 3 0.9921504 0.0000000
2: 4 4 0.0000000 0.8073523
3: 8 10 0.0000000 0.5533336
4: 14 8 0.0000000 0.2588098
5: 20 4 0.2012480 0.0000000
Show Solution# Input
n <- 5
DT <- data.table(
ID=sample(1:20, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category= c("Drinks", "Food", "Drinks", "Food", "Food"),
Expenses=runif(n),
key=c('ID', 'Month')
)
# Solution
output <- dcast.data.table(DT, ID + Month ~ Category, fun=sum, value.var='Expenses')
print(output)
ID Month Drinks Food
1: 7 6 0.0000000 0.75928350
2: 7 10 0.5103192 0.00000000
3: 9 3 0.1320670 0.00000000
4: 10 2 0.0000000 0.07450386
5: 18 7 0.0000000 0.23840221
54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?
Difficulty Level: L2 Question : Create a bootstrap sample of length 5 from DT
. Input
DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80),
V3 = runif(20, 1, 100),
V4 = runif(20,1, 50))
Show Solution# Input
DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80),
V3 = runif(20, 1, 100),
V4 = runif(20,1, 50))
# Solution
n <- 5
row_number <- sample(1:20, n, replace=TRUE)
print(DT[row_number,])
V1 V2 V3 V4
1: 3.213551 25.49768 30.01245 5.81102
2: 28.313228 63.25563 89.86874 30.28312
3: 37.163717 41.17688 98.72226 43.52485
4: 91.656742 60.68913 9.93234 21.66074
5: 51.661261 26.40149 28.44038 27.73880
55. How to bin a numeric column with user custom intervals to form a categorical (text) column?
Difficulty Level: L2 In DT, create a new column (cat_col
, that has 'fail'
if the average of the first two columns is less than 40. Else, it contains 'pass'
. Input
set.seed(100)
DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80))
Desired Ouput
V1 V2 cat_col
1: 31.468845 43.32908 fail
2: 26.509578 57.15350 pass
3: 55.679921 43.52955 pass
4: 6.581932 60.16881 fail
5: 47.386379 34.18801 pass
6: 48.893303 14.54220 fail
(...truncated...)
Show Solution# Input
set.seed(100)
DT <- data.table(V1= runif(20, 1, 100),
V2 =runif(20, 1, 80))
# Solution
DT[, cat_col := ifelse((V1 + V2)/2 < 40, 'fail' , 'pass')]
print(DT)
V1 V2 cat_col
1: 31.468845 43.32908 fail
2: 26.509578 57.15350 pass
3: 55.679921 43.52955 pass
4: 6.581932 60.16881 fail
5: 47.386379 34.18801 pass
6: 48.893303 14.54220 fail
7: 81.427859 61.85383 pass
8: 37.661733 70.67433 pass
9: 55.109301 44.37864 pass
10: 17.855943 22.94018 fail
11: 62.874651 39.57617 pass
12: 88.334386 74.35190 pass
13: 28.755030 28.54667 fail
14: 40.450302 76.37846 pass
15: 76.492557 55.92666 pass
16: 67.233150 71.26683 pass
17: 21.256604 15.25217 fail
18: 36.394960 50.72188 pass
19: 36.588036 79.17557 pass
20: 69.338762 11.29282 pass
56. How to convert a data.table
from wide to long format?
Difficulty Level: L2 Question : Convert DT
from wide to long format by keeping ID & Month
as key. Input
n <- 5
DT <- data.table(
ID = sample(1:20, n, replace=TRUE),
Month = sample(1:12, n, replace=TRUE),
Drinks = sample(10:20,n, replace = TRUE),
Food = sample(10:20,n, replace = TRUE),
Starters = sample(5:15,n, replace = TRUE)
)
ID Month Drinks Food Starters
1: 1 12 17 14 13
2: 14 2 12 12 11
3: 12 7 12 15 11
4: 19 1 16 15 15
5: 15 11 17 10 10
Desired Output
ID Month variable value
1: 14 4 Drinks 12
2: 4 11 Drinks 18
3: 19 5 Drinks 17
4: 19 7 Drinks 15
5: 14 4 Drinks 11
6: 14 4 Food 14
(...truncated...)
Show Solution# Input
n <- 5
DT <- data.table(
ID=sample(1:20, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Drinks= sample(10:20,n, replace = TRUE),
Food = sample(10:20,n, replace = TRUE),
Starters = sample(5:15,n, replace = TRUE)
)
# Solution
output = melt(DT, id.vars = c("ID", "Month"),
measure.vars = c("Drinks", "Food", "Starters"))
print(head(output))
ID Month variable value
1: 14 4 Drinks 12
2: 4 11 Drinks 18
3: 19 5 Drinks 17
4: 19 7 Drinks 15
5: 14 4 Drinks 11
6: 14 4 Food 14
57. How to create a dataframe that contains the first 3 lags of a time series data?
Difficulty Level: L2 Question : Append a new column in DT
with first three lag of column V1
. Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1 = round(runif(10, 1, 80),2))
Desired Output
Date V1 lag1 lag2 lag3
1: 2011-01-01 48.92 NA NA NA
2: 2011-01-02 65.45 48.92 NA NA
3: 2011-01-03 67.63 65.45 48.92 NA
4: 2011-01-04 63.26 67.63 65.45 48.92
5: 2011-01-05 2.50 63.26 67.63 65.45
6: 2011-01-06 56.15 2.50 63.26 67.63
7: 2011-01-07 65.39 56.15 2.50 63.26
8: 2011-01-08 45.90 65.39 56.15 2.50
Show Solution# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1 = round(runif(10, 1, 80),2))
# Solution
DT[, `:=`(lag1 = shift(V1, 1, type = 'lag'),
lag2 = shift(V1, 2, type = 'lag'),
lag3 = shift(V1, 3, type = 'lag'))]
print(head(DT, 8))
Date V1 lag1 lag2 lag3
1: 2011-01-01 8.05 NA NA NA
2: 2011-01-02 13.85 8.05 NA NA
3: 2011-01-03 3.11 13.85 8.05 NA
4: 2011-01-04 57.06 3.11 13.85 8.05
5: 2011-01-05 61.12 57.06 3.11 13.85
6: 2011-01-06 68.74 61.12 57.06 3.11
7: 2011-01-07 35.54 68.74 61.12 57.06
8: 2011-01-08 33.95 35.54 68.74 61.12
58. How to shuffle rows of a dataframe?
Difficulty Level: L2 Question : Shuffle rows of DT
. Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1 = runif(10, 1, 80),
V2 = runif(10, 1, 100),
V3 = runif(10,1, 50))
Show Solution# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by = 1),
V1 = runif(10, 1, 80),
V2 = runif(10, 1, 100),
V3 = runif(10,1, 50))
# Solution
set.seed(1000)
print(DT[sample(nrow(DT), replace = T),])
Date V1 V2 V3
1: 2011-01-04 35.76134 2.61564 17.215551
2: 2011-01-06 54.37578 56.66843 15.841530
3: 2011-01-03 59.85359 21.81431 32.227574
4: 2011-01-08 39.23481 74.79349 33.457133
5: 2011-01-03 59.85359 21.81431 32.227574
6: 2011-01-02 54.02029 34.80660 7.503086
7: 2011-01-06 54.37578 56.66843 15.841530
8: 2011-01-06 54.37578 56.66843 15.841530
9: 2011-01-06 54.37578 56.66843 15.841530
10: 2011-01-01 76.45873 27.03600 16.913088
59. How to select rows between two dates in a dataframe?
Difficulty Level: L1 Question : Select all rows between dates 2011-01-05
& 2011-01-08
in DT
. Input
set.seed(100)
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1 = runif(10, 1, 80),
V2 = runif(10, 1, 100),
V3 = runif(10,1, 50))
Desired Output
Date V1 V2 V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489 5.073762
Show Solution# Input
DT <- data.table(Date= seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 =runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))
# Solution
output <- DT[(Date > '2011-01-05') & (Date < '2011-01-08'),]
print(output)
Date V1 V2 V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489 5.073762