# 101 R data.table Exercises

The data.table package in R is super fast when it comes to handling data. It has a syntax that reduces keystrokes while making R code easier to read. These set of exercises are designed to help you to oil your data brain through solving data manipulation exercises. Related post: 101 Python datatable Exercises (pydatatable)

101 R data.table Exercises. Photo by Ela Abbau.

## 1. How to install data.table and check the version?

# Install the data.table package
install.packages("data.table")

library(data.table)

# Check version
print(packageVersion("data.table"))


## 2. How to create a data.table from lists (vectors) ?

Difficulty Level: L1 Question : Create a pandas series from each of the items below: a list, numpy and a dictionary Input

list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")

Show Solution
# Inputs
list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")

# Solution
DT <- data.table(V1= list_1, V2 =list_2, V3 = list_3 )
print(DT)

   V1 V2 V3
1:  a  1 aa
2:  b  2 bb
3:  c  3 cc
4:  d  4 dd


## 3. How to import csv data files as data.table ?

Difficulty Level: L1 Question : Import BostonHousing dataset as a data.table. Show Solution

# Solution

      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
1: 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2: 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3: 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4: 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5: 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6: 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7


## 4. How to import only ‘n’ rows from a csv file to create a data.table?

Difficiulty Level: L1 Question : Import first 50 row of BostonHousing dataset as a data.table. Show Solution

# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', nrows = 50)

      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
1: 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2: 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3: 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4: 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5: 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6: 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7


## 5. How to import only specified columns from a csv file?

Difficulty Level: L1 Question : Import only ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a data.table. Show Solution

# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', select = c('crim', 'medv'))

      crim medv
1: 0.00632 24.0
2: 0.02731 21.6
3: 0.02729 34.7
4: 0.03237 33.4
5: 0.06905 36.2
6: 0.02985 28.7


## 6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table

Difficulty Level: L2 Question : Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

Show Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

#  number of rows and columns
print(NROW(DT))
print(NCOL(DT))

# datatypes
print(sapply(DT,class))

# summary statistics
summary(DT)

[1] 506
[1] 14
crim        zn     indus      chas       nox        rm       age       dis
"numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric"
rad       tax   ptratio         b     lstat      medv
"integer" "integer" "numeric" "numeric" "numeric" "numeric"

crim                zn             indus            chas
Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
nox               rm             age              dis
Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
Median : 5.000   Median :330.0   Median :19.05   Median :391.44
Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
lstat            medv
Min.   : 1.73   Min.   : 5.00
1st Qu.: 6.95   1st Qu.:17.02
Median :11.36   Median :21.20
Mean   :12.65   Mean   :22.53
3rd Qu.:16.95   3rd Qu.:25.00
Max.   :37.97   Max.   :50.00


## 7. How to extract the row and column number of a particular cell with given criterion?

Difficulty Level: L2 Question : Which manufacturer, model and type has the highest Price? What is the row and column number of the cell with the highest Price value? Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
# Get Manufacturer with highest price
print(DT[Price == max(Price, na.rm = TRUE)][,list(Manufacturer, Model, Type)])

# Get row number
DT[, .(rownum=.I, Price)][Price == max(Price, na.rm = TRUE), rownum]

    Manufacturer Model    Type
1: Mercedes-Benz  300E Midsize


59

## 8. How to rename a specific columns in a data.table?

Difficulty Level: Easy Question : Rename the column Model as New_Model in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

print(head(setnames(DT, old = "Model", new = "New_Model")[]))

   Manufacturer New_Model    Type Min.Price Price Max.Price MPG.city
1:        Acura   Integra   Small      12.9  15.9      18.8       25
2:         <NA>    Legend Midsize      29.2  33.9      38.7       18
3:         Audi        90 Compact      25.9  29.1      32.3       20
4:         Audi       100 Midsize        NA  37.7      44.6       19
5:          BMW      535i Midsize        NA  30.0        NA       22
6:        Buick   Century Midsize      14.2  15.7      17.3       22
MPG.highway            AirBags DriveTrain Cylinders EngineSize Horsepower
1:          31               None      Front         4        1.8        140
2:          25 Driver & Passenger      Front         6        3.2        200
3:          26        Driver only      Front         6        2.8        172
4:          26 Driver & Passenger       <NA>         6         NA        172
5:          30               <NA>       Rear         4        3.5        208
6:          31        Driver only       <NA>         4        2.2        110
RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1: 6300         2890             Yes               13.2          5    177
2: 5500         2335             Yes               18.0          5    195
3: 5500         2280             Yes               16.9          5    180
4: 5500         2535            <NA>               21.1          6    193
5: 5700         2545             Yes               21.1          4    186
6: 5200         2565              No               16.4          6    189
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
1:       102    68          37           26.5           NA   2705 non-USA
2:       115    71          38           30.0           15   3560 non-USA
3:       102    67          37           28.0           14   3375 non-USA
4:       106    NA          37           31.0           17   3405 non-USA
5:       109    69          39           27.0           13   3640 non-USA
6:       105    69          41           28.0           16     NA     USA
Make
1: Acura Integra
2:  Acura Legend
3:       Audi 90
4:      Audi 100
5:      BMW 535i
6: Buick Century


## 9. How to check if a data.table has any missing values?

Difficulty Level: L1 Question : Check if DT has any missing values. Return TRUE is present else FALSE. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
anyNA(DT)


TRUE

## 10. How to check for missing values in a data.table?

Difficulty Level: L1 Question : Get the location of missing values in DT. Replace the value with TRUE is missing else FALSE. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

     Manufacturer Model  Type Min.Price Price Max.Price MPG.city MPG.highway
[1,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[2,]         TRUE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[3,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[4,]        FALSE FALSE FALSE      TRUE FALSE     FALSE    FALSE       FALSE
[5,]        FALSE FALSE FALSE      TRUE FALSE      TRUE    FALSE       FALSE
[6,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
(...truncated...)

Show Solution
# Input

# Solution

     Manufacturer Model  Type Min.Price Price Max.Price MPG.city MPG.highway
[1,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[2,]         TRUE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[3,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[4,]        FALSE FALSE FALSE      TRUE FALSE     FALSE    FALSE       FALSE
[5,]        FALSE FALSE FALSE      TRUE FALSE      TRUE    FALSE       FALSE
[6,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
AirBags DriveTrain Cylinders EngineSize Horsepower   RPM Rev.per.mile
[1,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[2,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[3,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[4,]   FALSE       TRUE     FALSE       TRUE      FALSE FALSE        FALSE
[5,]    TRUE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[6,]   FALSE       TRUE     FALSE      FALSE      FALSE FALSE        FALSE
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
[1,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[2,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[3,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[4,]            TRUE              FALSE      FALSE  FALSE     FALSE  TRUE
[5,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[6,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
Turn.circle Rear.seat.room Luggage.room Weight Origin  Make
[1,]       FALSE          FALSE         TRUE  FALSE  FALSE FALSE
[2,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[3,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[4,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[5,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[6,]       FALSE          FALSE        FALSE   TRUE  FALSE FALSE


## 11. How to get the row and column positions of missing values in a data.table?

Difficulty Level: L2 Question: Get the row and column positions of missing values in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

[1] "(1, 24)"  "(2, 1)"   "(4, 4)"   "(4, 10)"  "(4, 12)"  "(4, 16)"  "(4, 21)"
[8] "(5, 4)"   "(5, 6)"   "(5, 9)"   "(6, 10)"  "(6, 25)"  "(7, 6)"   "(7, 15)"
[15] "(7, 17)"  "(9, 21)"  "(10, 15)" "(11, 24)" "(12, 9)"  "(12, 10)" "(12, 13)"
[22] "(13, 11)" "(13, 24)" "(14, 11)" "(14, 17)" "(15, 7)"  "(15, 13)" "(15, 19)"
[29] "(15, 22)" "(16, 24)" "(17, 13)" "(17, 24)" "(19, 23)" "(19, 24)" "(19, 26)"
[36] "(20, 1)"  "(20, 22)" "(23, 17)" "(23, 23)" "(24, 15)" "(24, 24)" .. (truncated)

Show Solution
# Input

# Solution
output <- c()  # init output vector

for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
if(is.na(curr_value)){
position <- paste0('(', paste(i, j, sep=', '), ')')
output   <- c(output, position)
}
}
}


## 11. How to count the number of missing values in each column of data.table?

Difficulty Level: L2 Question : Count the number of missing values in each column of DT Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
for (i in colnames(DT)){
print(paste("Number of missing values in column", i , "is", sum(is.na(DT[,..i]))))
}

[1] "Number of missing values in column Manufacturer is 4"
[1] "Number of missing values in column Model is 1"
[1] "Number of missing values in column Type is 3"
[1] "Number of missing values in column Min.Price is 7"
[1] "Number of missing values in column Price is 2"
[1] "Number of missing values in column Max.Price is 5"
[1] "Number of missing values in column MPG.city is 9"
[1] "Number of missing values in column MPG.highway is 2"
[1] "Number of missing values in column AirBags is 6"
[1] "Number of missing values in column DriveTrain is 7"
[1] "Number of missing values in column Cylinders is 5"
[1] "Number of missing values in column EngineSize is 2"
[1] "Number of missing values in column Horsepower is 7"
[1] "Number of missing values in column RPM is 3"
[1] "Number of missing values in column Rev.per.mile is 6"
[1] "Number of missing values in column Man.trans.avail is 5"
[1] "Number of missing values in column Fuel.tank.capacity is 8"
[1] "Number of missing values in column Passengers is 2"
[1] "Number of missing values in column Length is 4"
[1] "Number of missing values in column Wheelbase is 1"
[1] "Number of missing values in column Width is 6"
[1] "Number of missing values in column Turn.circle is 5"
[1] "Number of missing values in column Rear.seat.room is 4"
[1] "Number of missing values in column Luggage.room is 19"
[1] "Number of missing values in column Weight is 7"
[1] "Number of missing values in column Origin is 5"
[1] "Number of missing values in column Make is 3"


## 12. How to replace missing values of data.table with 0?

Difficulty Level: L2 Question : Replace all missing values in DT with 0. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
DT[is.na(DT)] <- 0

   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:            0  Legend Midsize      29.2  33.9      38.7       18          25
3:         Audi      90 Compact      25.9  29.1      32.3       20          26
4:         Audi     100 Midsize       0.0  37.7      44.6       19          26
5:          BMW    535i Midsize       0.0  30.0       0.0       22          30
6:        Buick Century Midsize      14.2  15.7      17.3       22          31


## 13. How to replace missing values of numeric columns with the mean?

Difficulty Level: L2 Question : Replace missing values of numeric columns with their respective mean. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
replace_with_mean <- function(x){ifelse(is.na(x),mean(x, na.rm = TRUE),x)}
output <- DT[,lapply(.SD, replace_with_mean)]


Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small   12.9000  15.9  18.80000       25          31
2:         <NA>  Legend Midsize   29.2000  33.9  38.70000       18          25
3:         Audi      90 Compact   25.9000  29.1  32.30000       20          26
4:         Audi     100 Midsize   17.1186  37.7  44.60000       19          26
5:          BMW    535i Midsize   17.1186  30.0  21.45909       22          30
6:        Buick Century Midsize   14.2000  15.7  17.30000       22          31
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4   1.800000        140 6300
2: Driver & Passenger      Front         6   3.200000        200 5500
3:        Driver only      Front         6   2.800000        172 5500
4: Driver & Passenger       <NA>         6   2.658242        172 5500
5:               <NA>       Rear         4   3.500000        208 5700
6:        Driver only       <NA>         4   2.200000        110 5200
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2335             Yes               18.0          5    195       115
3:         2280             Yes               16.9          5    180       102
4:         2535            <NA>               21.1          6    193       106
5:         2545             Yes               21.1          4    186       109
6:         2565              No               16.4          6    189       105
Width Turn.circle Rear.seat.room Luggage.room   Weight  Origin
1: 68.00000          37           26.5     13.98649 2705.000 non-USA
2: 71.00000          38           30.0     15.00000 3560.000 non-USA
3: 67.00000          37           28.0     14.00000 3375.000 non-USA
4: 69.44828          37           31.0     17.00000 3405.000 non-USA
5: 69.00000          39           27.0     13.00000 3640.000 non-USA
6: 69.00000          41           28.0     16.00000 3104.593     USA
Make
1: Acura Integra
2:  Acura Legend
3:       Audi 90
4:      Audi 100
5:      BMW 535i
6: Buick Century


## 14. How to select a specific column from a data.table as a list (vector)?

Difficulty Level: L1 Question : Get the column Model in DT as a list. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Solution

[1] "Integra"        "Legend"         "90"             "100"
[9] "Riviera"        "DeVille"        "Seville"        "Cavalier"
[13] "Corsica"        "Camaro"         "Lumina"         "Lumina_APV"
[17] "Astro"          "Caprice"        "Corvette"       "Concorde"
[21] "LeBaron"        "Imperial"       (...truncated...)

Show Solution
# Input

#Solution
DT <- DT[, Model]
print(DT)

 [1] "Integra"        "Legend"         "90"             "100"
[9] "Riviera"        "DeVille"        "Seville"        "Cavalier"
[13] "Corsica"        "Camaro"         "Lumina"         "Lumina_APV"
[17] "Astro"          "Caprice"        "Corvette"       "Concorde"
[25] "Spirit"         "Caravan"        "Dynasty"        "Stealth"
[29] "Summit"         "Vision"         "Festiva"        "Escort"
[33] "Tempo"          "Mustang"        "Probe"          "Aerostar"
[37] "Taurus"         "Crown_Victoria" "Metro"          "Storm"
[41] "Prelude"        "Civic"          "Accord"         "Excel"
[45] "Elantra"        "Scoupe"         "Sonata"         "Q45"
[49] "ES300"          "SC300"          "Continental"    "Town_Car"
[53] "323"            "Protege"        "626"            "MPV"
[57] "RX-7"           "190E"           "300E"           NA
[61] "Cougar"         "Mirage"         "Diamante"       "Sentra"
[65] "Altima"         "Quest"          "Maxima"         "Achieva"
[69] "Cutlass_Ciera"  "Silhouette"     "Eighty-Eight"   "Laser"
[73] "LeMans"         "Sunbird"        "Firebird"       "Grand_Prix"
[77] "Bonneville"     "900"            "SL"             "Justy"
[81] "Loyale"         "Legacy"         "Swift"          "Tercel"
[85] "Celica"         "Camry"          "Previa"         "Fox"
[93] "850"


## 15. How to select a specific column from a data.table as a data.table?

Difficulty Level: L2 Question : Get the column Model in DT as a data.table (rather than as a list). Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Solution

     Model
1: Integra
2:  Legend
3:      90
4:     100
5:    535i
6: Century

Show Solution
# Input

#Solution
DT <- DT[, list(Model)]

     Model
1: Integra
2:  Legend
3:      90
4:     100
5:    535i
6: Century


## 16. How to sort a data.table based on a given column?

Difficulty Level: L1 Question: Reorder the rows as per Price column of DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

#Solution
DT <- DT[order(Price)]

   Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford Festiva Small       6.9   7.4       7.9       31          33
2:      Hyundai   Excel Small       6.8   8.0       9.2       29          33
3:        Mazda     323 Small       7.4   8.3       9.1       29          37
4:          Geo   Metro Small       6.7   8.4      10.0       46          50
5:       Subaru   Justy Small       7.3   8.4       9.5       33          37
6:       Suzuki   Swift  <NA>       7.3   8.6        NA       39          43
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:    <NA>      Front         4        1.3         63 5000         3150
2:    None      Front         4        1.5         81 5500         2710
3:    None      Front         4        1.6         82 5000         2370
4:    None      Front         3        1.0         55 5700         3755
5:    None        4WD         3        1.2         73 5600         2875
6:    None      Front         3        1.3         70 6000         3360
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:             Yes               11.9          5    168        94    63
3:             Yes               13.2          4    164        97    66
4:            <NA>               10.6          4    151        93    63
5:             Yes                9.2          4    146        90    60
6:             Yes               10.6          4    161        93    NA
Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:          33           26.0           12   1845     USA  Ford Festiva
2:          35           26.0           11   2345 non-USA Hyundai Excel
3:          34           27.0           16   2325 non-USA     Mazda 323
4:          34           27.5           10   1695 non-USA     Geo Metro
5:          32           23.5           10   2045 non-USA  Subaru Justy
6:          34           27.5           10   1965 non-USA  Suzuki Swift


## 17. How to sort a data.table based on two or more columns?

Difficulty Level: L1 Question : Reorder the rows as per ascending order of Price column and descending order of Max.Price column of DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

#Solution
DT <- DT[order(Price, -Max.Price)]

   Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford Festiva Small       6.9   7.4       7.9       31          33
2:      Hyundai   Excel Small       6.8   8.0       9.2       29          33
3:        Mazda     323 Small       7.4   8.3       9.1       29          37
4:          Geo   Metro Small       6.7   8.4      10.0       46          50
5:       Subaru   Justy Small       7.3   8.4       9.5       33          37
6:       Suzuki   Swift  <NA>       7.3   8.6        NA       39          43
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:    <NA>      Front         4        1.3         63 5000         3150
2:    None      Front         4        1.5         81 5500         2710
3:    None      Front         4        1.6         82 5000         2370
4:    None      Front         3        1.0         55 5700         3755
5:    None        4WD         3        1.2         73 5600         2875
6:    None      Front         3        1.3         70 6000         3360
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:             Yes               11.9          5    168        94    63
3:             Yes               13.2          4    164        97    66
4:            <NA>               10.6          4    151        93    63
5:             Yes                9.2          4    146        90    60
6:             Yes               10.6          4    161        93    NA
Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:          33           26.0           12   1845     USA  Ford Festiva
2:          35           26.0           11   2345 non-USA Hyundai Excel
3:          34           27.0           16   2325 non-USA     Mazda 323
4:          34           27.5           10   1695 non-USA     Geo Metro
5:          32           23.5           10   2045 non-USA  Subaru Justy
6:          34           27.5           10   1965 non-USA  Suzuki Swift


## 18. How to filter out rows from a data.table that satisfy multiple conditions?

Difficulty Level: L2 Question : Filter out the rows having Manufacturer as Ford and Price less than 30 Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output :

   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford  Festiva   Small       6.9   7.4       7.9       31          33
2:         Ford   Escort   Small       8.4  10.1      11.9       23          30
3:         Ford    Tempo Compact      10.4  11.3      12.2       22          27
4:         Ford  Mustang  Sporty      10.8  15.9      21.0       22          29
5:         Ford    Probe  Sporty      12.8  14.0      15.2       NA          30
6:         Ford Aerostar     Van      14.5  19.9      25.3       15          20
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:        <NA>      Front         4        1.3         63 5000         3150
2:        None      Front         4        1.8        127 6500         2410
3:        None      Front         4        2.3         96 4200         2805
4: Driver only       Rear         4        2.3        105 4600         2285
5: Driver only      Front         4        2.0        115 5500         2340
6: Driver only        4WD         6        3.0        145 4800         2080
(...truncated...)

Show Solution
#Input

#Solution
sol <- DT[Manufacturer == "Ford" & Price < 30]

   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford  Festiva   Small       6.9   7.4       7.9       31          33
2:         Ford   Escort   Small       8.4  10.1      11.9       23          30
3:         Ford    Tempo Compact      10.4  11.3      12.2       22          27
4:         Ford  Mustang  Sporty      10.8  15.9      21.0       22          29
5:         Ford    Probe  Sporty      12.8  14.0      15.2       NA          30
6:         Ford Aerostar     Van      14.5  19.9      25.3       15          20
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:        <NA>      Front         4        1.3         63 5000         3150
2:        None      Front         4        1.8        127 6500         2410
3:        None      Front         4        2.3         96 4200         2805
4: Driver only       Rear         4        2.3        105 4600         2285
5: Driver only      Front         4        2.0        115 5500         2340
6: Driver only        4WD         6        3.0        145 4800         2080
Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:            <NA>               13.2          5    171        98    67
3:             Yes               15.9          5    177       100    68
4:             Yes               15.4          4    180       101    68
5:             Yes               15.5          4    179       103    70
6:             Yes               21.0          7    176       119    72
Turn.circle Rear.seat.room Luggage.room Weight Origin          Make
1:          33           26.0           12   1845    USA  Ford Festiva
2:          36           28.0           12   2530    USA   Ford Escort
3:          NA           27.5           13     NA    USA          <NA>
4:          40           24.0           12     NA    USA  Ford Mustang
5:          38           23.0           18   2710    USA    Ford Probe
6:          45           30.0           NA   3735    USA Ford Aerostar


## 19. How to select multiple columns from data.table and rename them?

Difficulty Level: L2 Question : Get columns ‘Manufacturer’ & ‘Type’ from data.table and rename them to ‘MANUFACTURER’ & ‘TYPE’ Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   MANUFACTURER    TYPE
1:        Acura   Small
2:         <NA> Midsize
3:         Audi Compact
4:         Audi Midsize
5:          BMW Midsize
6:        Buick Midsize

Show Solution
#Input

#Solution
sol <- DT[, .(MANUFACTURER = Manufacturer, TYPE = Type)]

   MANUFACTURER    TYPE
1:        Acura   Small
2:         <NA> Midsize
3:         Audi Compact
4:         Audi Midsize
5:          BMW Midsize
6:        Buick Midsize


## 20. How to get mean of rows that satisfy a condition from a given column?

Difficulty Level: L2 Question : Get mean of all the vehicles having Price greater than 30. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output :

38.6833

Show Solution
#Input

#Solution
output <- DT[Price > 30, mean(Price, na.rm = TRUE)]
output


38.6833333333333

## 21. How to remove only specified columns from a data.table?

Difficulty Level: L2 Question : Create new data.table by removing ‘Manufacturer’ & ‘Type’ columns. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
#Input

#Solution
output <- DT[, !c("Manufacturer", "Type")]

     Model Min.Price Price Max.Price MPG.city MPG.highway            AirBags
1: Integra      12.9  15.9      18.8       25          31               None
2:  Legend      29.2  33.9      38.7       18          25 Driver & Passenger
3:      90      25.9  29.1      32.3       20          26        Driver only
4:     100        NA  37.7      44.6       19          26 Driver & Passenger
5:    535i        NA  30.0        NA       22          30               <NA>
6: Century      14.2  15.7      17.3       22          31        Driver only
DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail
1:      Front         4        1.8        140 6300         2890             Yes
2:      Front         6        3.2        200 5500         2335             Yes
3:      Front         6        2.8        172 5500         2280             Yes
4:       <NA>         6         NA        172 5500         2535            <NA>
5:       Rear         4        3.5        208 5700         2545             Yes
6:       <NA>         4        2.2        110 5200         2565              No
Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle
1:               13.2          5    177       102    68          37
2:               18.0          5    195       115    71          38
3:               16.9          5    180       102    67          37
4:               21.1          6    193       106    NA          37
5:               21.1          4    186       109    69          39
6:               16.4          6    189       105    69          41
Rear.seat.room Luggage.room Weight  Origin          Make
1:           26.5           NA   2705 non-USA Acura Integra
2:           30.0           15   3560 non-USA  Acura Legend
3:           28.0           14   3375 non-USA       Audi 90
4:           31.0           17   3405 non-USA      Audi 100
5:           27.0           13   3640 non-USA      BMW 535i
6:           28.0           16     NA     USA Buick Century


## 22. How to calculate count of each unique value in a given column?

Difficulty Level: L2 Question : Calculate number of records for every Manufacturer in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer N
1:        Acura 1
2:         <NA> 4
3:         Audi 2
4:          BMW 1
5:        Buick 4

Show Solution
#Input

#Solution
output <- DT[, .(.N), by = .(Manufacturer)]

   Manufacturer N
1:        Acura 1
2:         <NA> 4
3:         Audi 2
4:          BMW 1
5:        Buick 4


## 23. How to find mean of a column grouped by multiple columns?

Difficulty Level: L2 Question : Calculate mean Price for every Manufacturer, Type combination in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer    Type       V1
1:        Acura   Small 15.90000
2:         <NA> Midsize 31.93333
3:         Audi Compact 29.10000
4:         Audi Midsize 37.70000
5:          BMW Midsize 30.00000
6:        Buick Midsize 21.00000

Show Solution
#Input

#Solution
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)]

   Manufacturer    Type       V1
1:        Acura   Small 15.90000
2:         <NA> Midsize 31.93333
3:         Audi Compact 29.10000
4:         Audi Midsize 37.70000
5:          BMW Midsize 30.00000
6:        Buick Midsize 21.00000


## 24. How to find mean of column grouped by multiple columns and sort by grouped columns?

Difficulty Level: L2 Question : Calculate mean Price for every Manufacturer, Type combination and arrange it by the keys in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer    Type    V1
1:        Acura   Small 15.90
2:         Audi Compact 29.10
3:         Audi Midsize 37.70
4:          BMW Midsize 30.00
5:        Buick   Large 22.25
6:        Buick Midsize 21.00

Show Solution
#Input

#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), keyby = .(Manufacturer, Type)]

#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)][order(Manufacturer, Type)]

   Manufacturer    Type    V1
1:        Acura   Small 15.90
2:         Audi Compact 29.10
3:         Audi Midsize 37.70
4:          BMW Midsize 30.00
5:        Buick   Large 22.25
6:        Buick Midsize 21.00


## 25. How to select all numeric columns from data.table?

Difficulty Level: L2 Question : Select numeric columns from DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      12.9  15.9      18.8       25          31        1.8        140 6300
2:      29.2  33.9      38.7       18          25        3.2        200 5500
3:      25.9  29.1      32.3       20          26        2.8        172 5500
4:        NA  37.7      44.6       19          26         NA        172 5500
5:        NA  30.0        NA       22          30        3.5        208 5700
6:      14.2  15.7      17.3       22          31        2.2        110 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2890               13.2          5    177       102    68
2:         2335               18.0          5    195       115    71
3:         2280               16.9          5    180       102    67
4:         2535               21.1          6    193       106    NA
5:         2545               21.1          4    186       109    69
6:         2565               16.4          6    189       105    69
(...truncated...)

Show Solution
# Input

# Solution 1
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      12.9  15.9      18.8       25          31        1.8        140 6300
2:      29.2  33.9      38.7       18          25        3.2        200 5500
3:      25.9  29.1      32.3       20          26        2.8        172 5500
4:        NA  37.7      44.6       19          26         NA        172 5500
5:        NA  30.0        NA       22          30        3.5        208 5700
6:      14.2  15.7      17.3       22          31        2.2        110 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2890               13.2          5    177       102    68
2:         2335               18.0          5    195       115    71
3:         2280               16.9          5    180       102    67
4:         2535               21.1          6    193       106    NA
5:         2545               21.1          4    186       109    69
6:         2565               16.4          6    189       105    69
Turn.circle Rear.seat.room Luggage.room Weight
1:          37           26.5           NA   2705
2:          38           30.0           15   3560
3:          37           28.0           14   3375
4:          37           31.0           17   3405
5:          39           27.0           13   3640
6:          41           28.0           16     NA


## 26. How to calculate mean of all numeric columns for every Manufacturer?

Difficulty Level: L3 Question : Calculate mean of all numeric columns for every Manufacturer Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
1:        Acura    12.900 15.900     18.80     25.0       31.00      1.800
2:         <NA>    26.775 28.550     30.30     19.0       26.00      2.975
3:         Audi        NA 33.400     38.45     19.5       26.00         NA
4:          BMW        NA 30.000        NA     22.0       30.00      3.500
5:        Buick    20.750 21.625        NA     19.0       27.75      3.875
6:     Cadillac    35.250 37.400     39.50     16.0       25.00      4.750
(...truncated...)

Show Solution
#Input

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- cbind(output, DT[, list(Manufacturer)])
output <- output[, lapply(.SD, mean), by = Manufacturer]

   Manufacturer Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
1:        Acura    12.900 15.900     18.80     25.0       31.00      1.800
2:         <NA>    26.775 28.550     30.30     19.0       26.00      2.975
3:         Audi        NA 33.400     38.45     19.5       26.00         NA
4:          BMW        NA 30.000        NA     22.0       30.00      3.500
5:        Buick    20.750 21.625        NA     19.0       27.75      3.875
6:     Cadillac    35.250 37.400     39.50     16.0       25.00      4.750
Horsepower  RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase
1:      140.0 6300       2890.0               13.2       5.00 177.00    102.00
2:      186.5 5750           NA                 NA       5.00 193.25    109.75
3:      172.0 5500       2407.5               19.0       5.50 186.50    104.00
4:      208.0 5700       2545.0               21.1       4.00 186.00    109.00
5:      157.5 4700           NA                 NA       5.75 200.75    110.00
6:      247.5 5050           NA               19.0       5.50 205.00    112.50
Width Turn.circle Rear.seat.room Luggage.room  Weight
1: 68.00       37.00         26.500           NA 2705.00
2: 71.25          NA         29.000         13.5 3458.75
3:    NA       37.00         29.500         15.5 3390.00
4: 69.00       39.00         27.000         13.0 3640.00
5:    NA       42.25         28.875         17.0      NA
6: 73.50       43.50         33.000           NA 3777.50


## 27. How to get first 3 rows for every Manufacturer?

Difficulty Level: L3 Question : Get first 3 rows for every Manufacturer in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:         Audi      90 Compact      25.9  29.1      32.3       20          26
3:         Audi     100 Midsize        NA  37.7      44.6       19          26
4:          BMW    535i Midsize        NA  30.0        NA       22          30
5:        Buick Century Midsize      14.2  15.7      17.3       22          31
6:        Buick LeSabre   Large      19.9  20.8        NA       19          28
(...truncated...)

Show Solution
#Input

#Solution
output <- DT[!is.na(Manufacturer), head(.SD, 3), by = Manufacturer]

   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:         Audi      90 Compact      25.9  29.1      32.3       20          26
3:         Audi     100 Midsize        NA  37.7      44.6       19          26
4:          BMW    535i Midsize        NA  30.0        NA       22          30
5:        Buick Century Midsize      14.2  15.7      17.3       22          31
6:        Buick LeSabre   Large      19.9  20.8        NA       19          28
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4        1.8        140 6300
2:        Driver only      Front         6        2.8        172 5500
3: Driver & Passenger       <NA>         6         NA        172 5500
4:               <NA>       Rear         4        3.5        208 5700
5:        Driver only       <NA>         4        2.2        110 5200
6:        Driver only      Front         6        3.8        170 4800
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2280             Yes               16.9          5    180       102
3:         2535            <NA>               21.1          6    193       106
4:         2545             Yes               21.1          4    186       109
5:         2565              No               16.4          6    189       105
6:           NA              No                 NA          6    200       111
Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:    68          37           26.5           NA   2705 non-USA Acura Integra
2:    67          37           28.0           14   3375 non-USA       Audi 90
3:    NA          37           31.0           17   3405 non-USA      Audi 100
4:    69          39           27.0           13   3640 non-USA      BMW 535i
5:    69          41           28.0           16     NA     USA Buick Century
6:    74          42           30.5           17   3470     USA Buick LeSabre


## 28. How to remove specific column from data.table?

Difficulty Level: L2 Question : Remove column Model from DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
#Input

#Solution
DT[, Model := NULL]

   Manufacturer    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura   Small      12.9  15.9      18.8       25          31
2:         <NA> Midsize      29.2  33.9      38.7       18          25
3:         Audi Compact      25.9  29.1      32.3       20          26
4:         Audi Midsize        NA  37.7      44.6       19          26
5:          BMW Midsize        NA  30.0        NA       22          30
6:        Buick Midsize      14.2  15.7      17.3       22          31
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4        1.8        140 6300
2: Driver & Passenger      Front         6        3.2        200 5500
3:        Driver only      Front         6        2.8        172 5500
4: Driver & Passenger       <NA>         6         NA        172 5500
5:               <NA>       Rear         4        3.5        208 5700
6:        Driver only       <NA>         4        2.2        110 5200
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2335             Yes               18.0          5    195       115
3:         2280             Yes               16.9          5    180       102
4:         2535            <NA>               21.1          6    193       106
5:         2545             Yes               21.1          4    186       109
6:         2565              No               16.4          6    189       105
Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:    68          37           26.5           NA   2705 non-USA Acura Integra
2:    71          38           30.0           15   3560 non-USA  Acura Legend
3:    67          37           28.0           14   3375 non-USA       Audi 90
4:    NA          37           31.0           17   3405 non-USA      Audi 100
5:    69          39           27.0           13   3640 non-USA      BMW 535i
6:    69          41           28.0           16     NA     USA Buick Century


## 29. How to set key to data.table?

Difficulty Level: L1 Question : Set column AirBags as key to the data.table Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
#Input

#Solution
setkey(DT,AirBags)


## 30. How to get all rows where AirBags are at 'Driver & Passenger' in DT?

Difficulty Level: L2 Question : Get all rows where AirBags are Driver & Passenger in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
#Input

#Solution
setkey(DT,AirBags)

   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   Legend Midsize      29.2  33.9      38.7       18          25
2:         Audi      100 Midsize        NA  37.7      44.6       19          26
3:     Cadillac  Seville Midsize      37.5  40.1      42.7       16          25
4:    Chevrolet   Camaro  Sporty      13.4  15.1      16.8       19          28
5:         <NA> Concorde   Large      18.4  18.4      18.4       20          28
6:     Chrysler  LeBaron Compact      14.5  15.8      17.1       23          28
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         6        3.2        200 5500
2: Driver & Passenger       <NA>         6         NA        172 5500
3: Driver & Passenger      Front         8        4.6        295 6000
4: Driver & Passenger       Rear      <NA>        3.4        160 4600
5: Driver & Passenger      Front         6        3.3        153 5300
6: Driver & Passenger      Front         4        3.0        141 5000
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2335             Yes               18.0          5    195       115
2:         2535            <NA>               21.1          6    193       106
3:         1985              No               20.0          5    204       111
4:         1805             Yes                 NA          4    193       101
5:         1990              No               18.0          6    203       113
6:         2090              No               16.0          6    183       104
Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
1:    71          38           30.0           15   3560 non-USA
2:    NA          37           31.0           17   3405 non-USA
3:    74          44           31.0           NA   3935     USA
4:    74          43           25.0           13   3240     USA
5:    74          NA           31.0           15   3515     USA
6:    68          41           30.5           14   3085     USA
Make
1:      Acura Legend
2:          Audi 100
4:  Chevrolet Camaro
5: Chrylser Concorde
6:  Chrysler LeBaron


## 31. How to get all rows where AirBags are either 'Driver & Passenger' or 'No data'?

Difficulty Level: L2 Question : Get all rows where AirBags are Driver & Passenger & No data. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output



Show Solution
#Input
library(data.table)

#Solution
setkey(DT,AirBags)
print(DT[c("Driver & Passenger", "No data")])

     Manufacturer       Model    Type Min.Price Price Max.Price MPG.city
1:          <NA>      Legend Midsize      29.2  33.9      38.7       18
2:          Audi         100 Midsize        NA  37.7      44.6       19
3:      Cadillac     Seville Midsize      37.5  40.1      42.7       16
4:     Chevrolet      Camaro  Sporty      13.4  15.1      16.8       19
5:          <NA>    Concorde   Large      18.4  18.4      18.4       20
6:      Chrysler     LeBaron Compact      14.5  15.8      17.1       23
7:         Eagle      Vision   Large        NA  19.3      21.2       20
8:         Honda     Prelude  Sporty      17.0  19.8      22.7       24
9:         Honda      Accord Compact      13.8  17.5      21.2       24
10:          <NA>       SC300 Midsize      34.7  35.2      35.6       18
11:       Lincoln Continental Midsize      33.3  34.3      35.3       17
12:       Lincoln    Town_Car   Large      34.4  36.1      37.8       18
13: Mercedes-Benz        300E Midsize      43.8  61.9      80.0       19
14:       Pontiac    Firebird    <NA>      14.0  17.7      21.4       19
15:       Pontiac  Bonneville   Large      19.4  24.4      29.4       19
16:          <NA>         850 Midsize      24.8  26.7      28.5       20
17:          <NA>        <NA>    <NA>        NA    NA        NA       NA
MPG.highway            AirBags DriveTrain Cylinders EngineSize Horsepower
1:          25 Driver & Passenger      Front         6        3.2        200
2:          26 Driver & Passenger       <NA>         6         NA        172
3:          25 Driver & Passenger      Front         8        4.6        295
4:          28 Driver & Passenger       Rear      <NA>        3.4        160
5:          28 Driver & Passenger      Front         6        3.3        153
6:          28 Driver & Passenger      Front         4        3.0        141
7:          28 Driver & Passenger      Front         6        3.5         NA
8:          31 Driver & Passenger      Front         4        2.3        160
9:          31 Driver & Passenger      Front         4        2.2        140
10:          23 Driver & Passenger       Rear         6        3.0        225
11:          26 Driver & Passenger       <NA>         6        3.8        160
12:          26 Driver & Passenger       Rear         8        4.6        210
13:          25 Driver & Passenger       Rear         6        3.2        217
14:          28 Driver & Passenger       Rear         6         NA        160
15:          28 Driver & Passenger      Front         6        3.8        170
16:          28 Driver & Passenger      Front         5        2.4        168
17:          NA            No data       <NA>      <NA>         NA         NA
RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1: 5500         2335             Yes               18.0          5    195
2: 5500         2535            <NA>               21.1          6    193
3: 6000         1985              No               20.0          5    204
4: 4600         1805             Yes                 NA          4    193
5: 5300         1990              No               18.0          6    203
6: 5000         2090              No               16.0          6    183
7: 5800         1980              No               18.0          6    202
8: 5800         2855             Yes                 NA          4    175
9: 5600           NA             Yes               17.0          4    185
10: 6000         2510             Yes                 NA          4    191
11: 4400         1835              No               18.4          6    205
12: 4600         1840              No               20.0         NA    219
13: 5500         2220              No               18.5          5     NA
14: 4600         1805             Yes               15.5          4    196
15: 4800         1565              No               18.0          6    177
16: 6200           NA             Yes               19.3          5    184
17:   NA           NA            <NA>                 NA         NA     NA
Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
1:       115    71          38           30.0           15   3560 non-USA
2:       106    NA          37           31.0           17   3405 non-USA
3:       111    74          44           31.0           NA   3935     USA
4:       101    74          43           25.0           13   3240     USA
5:       113    74          NA           31.0           15   3515     USA
6:       104    68          41           30.5           14   3085     USA
7:       113    74          40           30.0           15   3490     USA
8:       100    70          39           23.5            8   2865 non-USA
9:       107    67          41           28.0           14   3040 non-USA
10:       106    71          39           25.0            9   3515 non-USA
11:       109    73          42           30.0           19   3695     USA
12:       117    77          45           31.5           22   4055     USA
13:       110    69          37             NA           15   3525 non-USA
14:       101    75          43           25.0           13   3240     USA
15:       111    74          43           30.5           18   3495     USA
16:       105    69          38           30.0           15   3245 non-USA
17:        NA    NA          NA             NA           NA     NA    <NA>
Make
1:        Acura Legend
2:            Audi 100
4:    Chevrolet Camaro
5:   Chrylser Concorde
6:    Chrysler LeBaron
7:        Eagle Vision
8:       Honda Prelude
9:        Honda Accord
10:         Lexus SC300
11: Lincoln Continental
12:    Lincoln Town_Car
13:  Mercedes-Benz 300E
14:    Pontiac Firebird
15:  Pontiac Bonneville
16:           Volvo 850
17:                <NA>


## 33. How to get last and 2nd last (penultimate) row of data.table?

Difficulty Level: L2 Question 1 : Get penultimate row of DT. Question 2 : Get last row of DT. Input:

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output:

   Manufacturer Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   850 Midsize      24.8  26.7      28.5       20          28
2:        Volvo   240 Compact      21.8  22.7      23.5       21          28
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         5        2.4        168 6200
2:        Driver only       Rear      <NA>        2.3        114 5400
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:           NA             Yes               19.3          5    184       105
2:         2215             Yes               15.8          5    190       104
Width Turn.circle Rear.seat.room Luggage.room Weight  Origin      Make
1:    69          38           30.0           15   3245 non-USA Volvo 850
2:    67          37           29.5           14   2985 non-USA Volvo 240

Show Solution
#Input

# Solution
output <- DT[c(.N, .N - 1)]
print(output)

   Manufacturer Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   850 Midsize      24.8  26.7      28.5       20          28
2:        Volvo   240 Compact      21.8  22.7      23.5       21          28
AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         5        2.4        168 6200
2:        Driver only       Rear      <NA>        2.3        114 5400
Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:           NA             Yes               19.3          5    184       105
2:         2215             Yes               15.8          5    190       104
Width Turn.circle Rear.seat.room Luggage.room Weight  Origin      Make
1:    69          38           30.0           15   3245 non-USA Volvo 850
2:    67          37           29.5           14   2985 non-USA Volvo 240


## 34. How to reorder columns of data.table?

Difficulty Level: L2 Question : Reorder columns of DT as V2 , V1, V3 Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )


Desired Output

   V2 V1 V3
1:  1  a aa
2:  2  b bb
3:  3  c cc
4:  4  d dd

Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )

#Solution
output <- setcolorder(DT,c("V2","V1","V3"))
print(output)

   V2 V1 V3
1:  1  a aa
2:  2  b bb
3:  3  c cc
4:  4  d dd


## 35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?

Difficulty Level: L2 Question : Select column Manufacturer and compute standard deviation of column Price and return a single value that gets recycled. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output:

   Manufacturer Sd.Price
1:        Acura  9.72428
2:         <NA>  9.72428
3:         Audi  9.72428
4:         Audi  9.72428
5:          BMW  9.72428
6:        Buick  9.72428

Show Solution
# Input

#Solution
output <- DT[,.(Manufacturer, Sd.Price = sd(Price, na.rm = TRUE))]

   Manufacturer Sd.Price
1:        Acura  9.72428
2:         <NA>  9.72428
3:         Audi  9.72428
4:         Audi  9.72428
5:          BMW  9.72428
6:        Buick  9.72428


## 36. How to use row indexing, column indexing and group by expressions all together?

Difficulty Level: L2 Question : Calculate sum of Weight of cars under every Manufacturer but exclude all 'Midsize' Type cars. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Manufacturer    V1
1:        Acura  2705
2:         Audi  3375
3:        Buick  7575
5:    Chevrolet 23545
6:         <NA>  3515

Show Solution
# Input

#Solution
output = DT[Type !="Midsize", sum(Weight), by=Manufacturer]

   Manufacturer    V1
1:        Acura  2705
2:         Audi  3375
3:        Buick  7575
5:    Chevrolet 23545
6:         <NA>  3515


## 37. How to get the row number of the nth largest value in a column?

Difficulty Level: L2 Question : Find the row position of the 5th largest value of column Price in DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

#Solution
DT <- DT[order(Price)]
fifth_largest_price = DT[5, Price]
output = DT[Price == fifth_largest_price, which = TRUE]
output

1. 4
2. 5

## 38. How to find and cap outliers from a data.table column?

Difficulty Level: L2 Question: Replace all values of Length column in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
# Input

# Solution
DT[Length > quantile(Length, .95, na.rm=T), Length := round(quantile(DT$Length, .95, na.rm=T))] DT[Length < quantile(Length, .05, na.rm=T), Length := round(quantile(DT$Length, .05, na.rm=T))]


## 39. How to swap two column values based on a condition in another column of a data.table?

Difficulty Level: L2 Question : Swap column V1 & V3 for all the rows where V2 = 2. Input

DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )


Desired Output:

   V1 V2 V3
1:  a  1 aa
2: bb  2  b
3: cc  2  c
4:  d  3 dd

Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

#Solution
DT[V2=="2", c("V3", "V1") := .(V1, V3)]
print(DT)

   V1 V2 V3
1:  a  1 aa
2: bb  2  b
3: cc  2  c
4:  d  3 dd


## 40. How to swap two rows of a dataframe?

Difficulty Level: L2 Questions : Swap rows 1 and 2 in DT. Input

DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )


Desired Output

   V1 V2 V3
1:  b  2 bb
2:  a  1 aa
3:  c  2 cc
4:  d  3 dd

Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

# Solution
swap_rows <- function(dt, i1, i2){
i1_row <- DT[i1,]
i2_row <- DT[i2,]
DT[i1,] <- i2_row
DT[i2,] <- i1_row
return(DT)
}

output <- swap_rows(DT, 1,2)
print(output)

   V1 V2 V3
1:  b  2 bb
2:  a  1 aa
3:  c  2 cc
4:  d  3 dd


## 41. How to reverse the rows of a data.table?

Difficulty Level: L2 Question : Reverse all the rows of DT so that topmost row goes to bottom. Input

DT <- data.table(V1= c("a","b","c","d"),
V2 =c("1","2","2","3"),
V3 = c("aa","bb","cc","dd") )


Desired Output

     V1  V2  V3
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"

Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

# Solution
output <- do.call(cbind, lapply(DT, rev))
print(output)

     V1  V2  V3
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"


## 42. How to create one-hot encodings of a categorical variable (dummy variables)?

Difficulty Level: L2 Question : Get one-hot encodings for columns Gender & Degree in the data.table DT and append it as columns. Input

DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )
Gender E_Id Degree
1:      M    1     UG
2:      M    2     PG
3:      F    3    PhD
4:      F    4     UG


Desired Output

   E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1:    1        0        1         0          0         1
2:    2        0        1         1          0         0
3:    3        1        0         0          1         0
4:    4        1        0         0          0         1

Show Solution
# Input
DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )

# Solution
output <- dcast(melt(DT, id.vars='E_Id'), E_Id ~ variable + value, fun = length)
print(output)

   E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1:    1        0        1         0          0         1
2:    2        0        1         1          0         0
3:    3        1        0         0          1         0
4:    4        1        0         0          0         1


## 43. How to find median value in each numeric column in R data.table?

Difficulty Level: L2 Question : Find row-wise sum values of DT. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      14.6  17.7     19.15       21          28        2.3        140 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2360               16.5          5    181       103    69
Turn.circle Rear.seat.room Luggage.room Weight
1:          39           27.5           14   3085

Show Solution
#Input

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- output[, lapply(.SD, median, na.rm = TRUE)]

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      14.6  17.7     19.15       21          28        2.3        140 5200
Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2360               16.5          5    181       103    69
Turn.circle Rear.seat.room Luggage.room Weight
1:          39           27.5           14   3085


## 44. How to compute correlation score of each column against other columns?

Difficulty Level: L2 Question : Compute correlation score of each column against other columns. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution
#Input

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
print(cor(output, use = "na.or.complete"))


## 45. How to create a column containing the minimum by maximum of each row?

Difficulty Level: L2 Question : Compute the minimum-by-maximum for every row of df.

DT <- data.table(V1 = runif(10, 1, 100),
V2 = runif(10, 1, 80),
V3 = runif(10, 1, 100))

Show Solution
# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))

#Solution
DT[,max_min:=do.call(pmin,.SD)/do.call(pmax,.SD)]

# Solution 2
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
DT[, max_min2:=apply(.SD, 1, min)/apply(.SD, 1, max)]

         V1       V2       V3   max_min
1: 16.33154 66.08863 41.38638 0.2471157
2: 87.79467 38.57875 38.50150 0.4385404
3: 40.95019 19.47167 15.03562 0.3671685
4: 12.40790 17.68716 34.95050 0.3550136
5: 34.15262 24.52459 71.04999 0.3451738
6: 60.22314 60.22538 10.79905 0.1793106


## 46. How to scale all columns in a dataframe?

Difficulty Level: L2 Question : Normalize all columns in DT. Without using external packages. Input

DT <- data.table(V1= runif(10, 1, 100),
V2 =runif(10, 1, 80),
V3 = runif(10, 1, 100))

Show Solution
# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))

# Solution
output <- DT[, lapply(.SD, scale)]
print(output)

         V1.V1      V2.V1       V3.V1
1:  0.9884659 -1.2595111  0.09593436
2:  0.1217149 -0.8867426 -0.31629923
3:  0.5872020  0.1443417  0.70839114
4: -0.5178531  0.5737876  1.12962574
5:  1.0221688 -0.9475154 -0.63975308
6: -1.5091412  1.2276058  1.11944272
7: -0.9203772  1.0074295 -0.33182208
8: -0.8005338  1.3745140 -1.41247525
9:  1.5281054 -0.9386551 -1.45771901
10: -0.4997516 -0.2952545  1.10467468


## 47. How to compute the correlation of each column with the succeeding column?

Difficulty Level: L2 Compute the correlation of each column of df with its succeeding column. Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')


Desired Output

                 Price
Min.Price 0.9705018
Max.Price
Price 0.9802519
MPG.city
Max.Price -0.5377239
MPG.highway
MPG.city   0.9479157
EngineSize
MPG.highway -0.6259651
Horsepower
EngineSize  0.7313269
RPM
Horsepower 0.04736874
Rev.per.mile
RPM    0.4623314
(...truncated...)

Show Solution
#Input

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]  # numeric columns

for (i in 1:(ncol(output) - 1)){
a <- i + 1
print(cor(output[,..i], output[,..a], use = "na.or.complete"))
}

              Price
Min.Price 0.9705018
Max.Price
Price 0.9802519
MPG.city
Max.Price -0.5377239
MPG.highway
MPG.city   0.9479157
EngineSize
MPG.highway -0.6259651
Horsepower
EngineSize  0.7313269
RPM
Horsepower 0.04736874
Rev.per.mile
RPM    0.4623314
Fuel.tank.capacity
Rev.per.mile         -0.6318497
Passengers
Fuel.tank.capacity  0.4965661
Length
Passengers 0.4737477
Wheelbase
Length 0.8275146
Width
Wheelbase 0.8127431
Turn.circle
Width   0.8170549
Rear.seat.room
Turn.circle      0.4738118
Luggage.room
Rear.seat.room     0.668847
Weight
Luggage.room 0.6425643


## 48. How to replace both the diagonals of data.table with 0?

Difficulty Level: L2 Replace both values in both diagonals of DT with 0. Input

DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

V1        V2        V3       V4
1: 97.451642 65.393451  6.496714 27.80014
2: 56.475427  5.234317 80.551154 10.86617
3: 65.965884 18.954913 70.968490 25.92509
4:  3.723218 26.590925 75.723006 42.22132


Desired output

         V1       V2        V3       V4
1:  0.00000 65.39345  6.496714  0.00000
2: 56.47543  0.00000  0.000000 10.86617
3: 65.96588  0.00000  0.000000 25.92509
4:  0.00000 26.59093 75.723006  0.00000

Show Solution
# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

# Solution
for (i in 1:nrow(DT)){
DT[i,i] <- 0
DT[nrow(DT) - i + 1, i] <- 0
}
print(DT)

         V1       V2       V3        V4
1:  0.00000 72.08959 12.63875  0.000000
2: 17.61880  0.00000  0.00000  1.696741
3: 82.20107  0.00000  0.00000 40.321027
4:  0.00000 40.47473 71.60260  0.000000


## 49. How to join two dataframes by 2 columns so they have only the common rows?

Difficulty Level: L2 Join dataframes df1 and df2 by ‘fruit-pazham’ and ‘weight-kilo’. Input

DT1 = data.table(fruit =  c('apple', 'banana', 'orange'),
weight =  c('high', 'medium', 'low'),
price=  c(1,2,3))

DT2 = data.table(pazham =  c('apple', 'orange', 'pine'),
kilo= c('high', 'low', 'high'),
price =  c(1,2,3))


Desired Output

    fruit weight price_left price_right
1:  apple   high          1           1
2: orange    low          3           2

Show Solution
# Input
DT1 = data.table(fruit =  c('apple', 'banana', 'orange'),
weight =  c('high', 'medium', 'low'),
price=  c(1,2,3))

DT2 = data.table(pazham =  c('apple', 'orange', 'pine'),
kilo= c('high', 'low', 'high'),
price =  c(1,2,3))

# Solution
output <- merge(DT1, DT2,
by.x = c('fruit', 'weight'),
by.y = c('pazham', 'kilo'),
all = FALSE,
suffixes = c("_left", "_right"))

print(output)

    fruit weight price_left price_right
1:  apple   high          1           1
2: orange    low          3           2


## 50. How to get the positions where values of two columns of data.table match?

Difficulty Level: L2 Question : Get the positions where values of two columns of DT match. Input

DT <- data.table(fruit1 =  c('apple', 'orange', 'banana', 'mango),
fruit2 =  c('apple', 'grapes', 'banana', 'jackfruit'))


Desired Output

1 3

Show Solution
# Input
DT <- data.table(fruit1 =  c('apple', 'orange', 'banana', 'mango'),
fruit2 =  c('apple', 'grapes', 'banana', 'jackfruit'))

# Solution
which(DT$fruit1 == DT$fruit2)

1. 1
2. 3

## 51. How to create lags and leads of a column in a data.table?

Difficulty Level: L2 Create two new columns in df, one of which is a lag 1 (shift column a down by 1 row) of column ‘a’ and the other is a lead 1 (shift column b up by 1 row). Input

DT <- data.table(V1 = runif(4, 1, 100),
V2 = runif(4, 1, 80),
V3 = runif(4, 1, 100),
V4 = runif(4,1, 50))


Desired Output

         V1       V2        V3       V4 lagged_column lead_column
1: 90.41034 79.62526 57.525005 16.88133            NA    13.19427
2: 13.19427 29.20243  3.962821 35.72731      90.41034    13.79110
3: 13.79110 15.68368 78.933011 18.43108      13.19427    86.76293
4: 86.76293 33.66214  6.706639 43.40251      13.79110          NA

Show Solution
# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

# Solution
DT[, lagged_column := shift(V1, type = 'lag')]
print(DT)

         V1        V2        V3       V4 lagged_column lead_column
1: 51.50006 74.050716 27.618570 16.68658            NA    82.29783
2: 82.29783 63.969537  9.054432  1.16761      51.50006    78.36425
3: 78.36425 63.937995 85.047398 21.39628      82.29783    63.64181
4: 63.64181  2.482478 10.060054 27.33597      78.36425          NA


## 52. How to get the frequency of unique values in the entire dataframe?

Difficulty Level: L2 Question : Get the frequency of unique values in the entire dataframe DT. Input

DT <- data.table(fruit =  c('apple', 'orange', 'banana', 'apple'))


Desired Output

    fruit number_of_distinct_orders
1:  apple                         2
2: orange                         1
3: banana                         1

Show Solution
# Input
DT <- data.table(fruit =  c('apple', 'orange', 'banana', 'apple'))

# Solution
output <- DT
output[,flag := 1]
print(output[, .(number_of_distinct_orders = length(flag)), by = fruit])

    fruit number_of_distinct_orders
1:  apple                         2
2: orange                         1
3: banana                         1


## 53. How to convert a table from long to wide format using R data.table?

Question : Create a pivot table out of DT keeping ID & Month as key. Difficulty Level: L2 Input

n <- 5
DT <- data.table(
ID       = sample(1:20, n, replace=TRUE),
Month    = sample(1:12, n, replace=TRUE),
Category = c("Drinks", "Food", "Drinks", "Food", "Food"),
Expenses = runif(n),
key      = c('ID', 'Month')
)

   ID Month Category  Expenses
1:  4     3   Drinks 0.9921504
2:  4     4     Food 0.8073523
3:  8    10     Food 0.5533336
4: 14     8     Food 0.2588098
5: 20     4   Drinks 0.2012480


Desired Output

   ID Month    Drinks      Food
1:  4     3 0.9921504 0.0000000
2:  4     4 0.0000000 0.8073523
3:  8    10 0.0000000 0.5533336
4: 14     8 0.0000000 0.2588098
5: 20     4 0.2012480 0.0000000

Show Solution
# Input
n <- 5
DT <- data.table(
ID=sample(1:20, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Category= c("Drinks", "Food", "Drinks", "Food", "Food"),
Expenses=runif(n),
key=c('ID', 'Month')
)

# Solution
output <- dcast.data.table(DT, ID + Month ~ Category, fun=sum, value.var='Expenses')
print(output)

   ID Month    Drinks       Food
1:  7     6 0.0000000 0.75928350
2:  7    10 0.5103192 0.00000000
3:  9     3 0.1320670 0.00000000
4: 10     2 0.0000000 0.07450386
5: 18     7 0.0000000 0.23840221


## 54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?

Difficulty Level: L2 Question : Create a bootstrap sample of length 5 from DT. Input

DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80),
V3 = runif(20, 1, 100),
V4 = runif(20,1, 50))

Show Solution
# Input
DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80),
V3 = runif(20, 1, 100),
V4 = runif(20,1, 50))

# Solution
n <- 5
row_number <- sample(1:20, n, replace=TRUE)
print(DT[row_number,])

          V1       V2       V3       V4
1:  3.213551 25.49768 30.01245  5.81102
2: 28.313228 63.25563 89.86874 30.28312
3: 37.163717 41.17688 98.72226 43.52485
4: 91.656742 60.68913  9.93234 21.66074
5: 51.661261 26.40149 28.44038 27.73880


## 55. How to bin a numeric column with user custom intervals to form a categorical (text) column?

Difficulty Level: L2 In DT, create a new column (cat_col, that has 'fail' if the average of the first two columns is less than 40. Else, it contains 'pass'. Input

set.seed(100)
DT <- data.table(V1 = runif(20, 1, 100),
V2 = runif(20, 1, 80))


Desired Ouput

           V1       V2 cat_col
1: 31.468845 43.32908    fail
2: 26.509578 57.15350    pass
3: 55.679921 43.52955    pass
4:  6.581932 60.16881    fail
5: 47.386379 34.18801    pass
6: 48.893303 14.54220    fail
(...truncated...)

Show Solution
# Input
set.seed(100)
DT <- data.table(V1= runif(20, 1, 100),
V2 =runif(20, 1, 80))

# Solution
DT[, cat_col := ifelse((V1 + V2)/2 < 40, 'fail' , 'pass')]
print(DT)

           V1       V2 cat_col
1: 31.468845 43.32908    fail
2: 26.509578 57.15350    pass
3: 55.679921 43.52955    pass
4:  6.581932 60.16881    fail
5: 47.386379 34.18801    pass
6: 48.893303 14.54220    fail
7: 81.427859 61.85383    pass
8: 37.661733 70.67433    pass
9: 55.109301 44.37864    pass
10: 17.855943 22.94018    fail
11: 62.874651 39.57617    pass
12: 88.334386 74.35190    pass
13: 28.755030 28.54667    fail
14: 40.450302 76.37846    pass
15: 76.492557 55.92666    pass
16: 67.233150 71.26683    pass
17: 21.256604 15.25217    fail
18: 36.394960 50.72188    pass
19: 36.588036 79.17557    pass
20: 69.338762 11.29282    pass


## 56. How to convert a data.table from wide to long format?

Difficulty Level: L2 Question : Convert DT from wide to long format by keeping ID & Month as key. Input

n <- 5
DT <- data.table(
ID       = sample(1:20, n, replace=TRUE),
Month    = sample(1:12, n, replace=TRUE),
Drinks   = sample(10:20,n, replace = TRUE),
Food     = sample(10:20,n, replace = TRUE),
Starters = sample(5:15,n, replace = TRUE)
)

ID Month Drinks Food Starters
1:  1    12     17   14       13
2: 14     2     12   12       11
3: 12     7     12   15       11
4: 19     1     16   15       15
5: 15    11     17   10       10


Desired Output

   ID Month variable value
1: 14     4   Drinks    12
2:  4    11   Drinks    18
3: 19     5   Drinks    17
4: 19     7   Drinks    15
5: 14     4   Drinks    11
6: 14     4     Food    14
(...truncated...)

Show Solution
# Input
n <- 5
DT <- data.table(
ID=sample(1:20, n, replace=TRUE),
Month=sample(1:12, n, replace=TRUE),
Drinks= sample(10:20,n, replace = TRUE),
Food = sample(10:20,n, replace = TRUE),
Starters = sample(5:15,n, replace = TRUE)
)

# Solution
output = melt(DT, id.vars = c("ID", "Month"),
measure.vars = c("Drinks", "Food", "Starters"))

   ID Month variable value
1: 14     4   Drinks    12
2:  4    11   Drinks    18
3: 19     5   Drinks    17
4: 19     7   Drinks    15
5: 14     4   Drinks    11
6: 14     4     Food    14


## 57. How to create a dataframe that contains the first 3 lags of a time series data?

Difficulty Level: L2 Question : Append a new column in DT with first three lag of column V1. Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1   = round(runif(10, 1, 80),2))


Desired Output

         Date    V1  lag1  lag2  lag3
1: 2011-01-01 48.92    NA    NA    NA
2: 2011-01-02 65.45 48.92    NA    NA
3: 2011-01-03 67.63 65.45 48.92    NA
4: 2011-01-04 63.26 67.63 65.45 48.92
5: 2011-01-05  2.50 63.26 67.63 65.45
6: 2011-01-06 56.15  2.50 63.26 67.63
7: 2011-01-07 65.39 56.15  2.50 63.26
8: 2011-01-08 45.90 65.39 56.15  2.50

Show Solution
# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1   = round(runif(10, 1, 80),2))

# Solution
DT[, :=(lag1 = shift(V1, 1, type = 'lag'),
lag2 = shift(V1, 2, type = 'lag'),
lag3 = shift(V1, 3, type = 'lag'))]


         Date    V1  lag1  lag2  lag3
1: 2011-01-01  8.05    NA    NA    NA
2: 2011-01-02 13.85  8.05    NA    NA
3: 2011-01-03  3.11 13.85  8.05    NA
4: 2011-01-04 57.06  3.11 13.85  8.05
5: 2011-01-05 61.12 57.06  3.11 13.85
6: 2011-01-06 68.74 61.12 57.06  3.11
7: 2011-01-07 35.54 68.74 61.12 57.06
8: 2011-01-08 33.95 35.54 68.74 61.12


## 58. How to shuffle rows of a dataframe?

Difficulty Level: L2 Question : Shuffle rows of DT. Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1   = runif(10, 1, 80),
V2   = runif(10, 1, 100),
V3   = runif(10,1, 50))

Show Solution
# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by = 1),
V1   = runif(10, 1, 80),
V2   = runif(10, 1, 100),
V3   = runif(10,1, 50))

# Solution
set.seed(1000)
print(DT[sample(nrow(DT), replace = T),])

          Date       V1       V2        V3
1: 2011-01-04 35.76134  2.61564 17.215551
2: 2011-01-06 54.37578 56.66843 15.841530
3: 2011-01-03 59.85359 21.81431 32.227574
4: 2011-01-08 39.23481 74.79349 33.457133
5: 2011-01-03 59.85359 21.81431 32.227574
6: 2011-01-02 54.02029 34.80660  7.503086
7: 2011-01-06 54.37578 56.66843 15.841530
8: 2011-01-06 54.37578 56.66843 15.841530
9: 2011-01-06 54.37578 56.66843 15.841530
10: 2011-01-01 76.45873 27.03600 16.913088


## 59. How to select rows between two dates in a dataframe?

Difficulty Level: L1 Question : Select all rows between dates 2011-01-05 & 2011-01-08 in DT. Input

set.seed(100)
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1),
V1   = runif(10, 1, 80),
V2   = runif(10, 1, 100),
V3   = runif(10,1, 50))


Desired Output

         Date       V1       V2        V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489  5.073762

Show Solution
# Input
DT <- data.table(Date= seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 =runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))

# Solution
output <- DT[(Date > '2011-01-05') & (Date < '2011-01-08'),]
print(output)

         Date       V1       V2        V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489  5.073762

To be continued . . .

Course Preview