101 R data.table Exercises

The data.table package in R is super fast when it comes to handling data. It has a syntax that reduces keystrokes while making R code easier to read. These set of exercises are designed to help you to oil your data brain through solving data manipulation exercises.

Related post: 101 Python datatable Exercises (pydatatable)

101 R data.table Exercises. Photo by Ela Abbau.

1. How to install data.table and check the version?

# Install the data.table package
install.packages("data.table")

# Load the library
library(data.table)

# Check version
print(packageVersion("data.table"))

2. How to create a data.table from lists (vectors) ?

Difficulty Level: L1

Question : Create a pandas series from each of the items below: a list, numpy and a dictionary

Input

list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")
Show Solution
# Inputs
list_1 <- c("a","b","c","d")
list_2 <- c("1","2","3","4")
list_3 <- c("aa","bb","cc","dd")

# Solution
DT <- data.table(V1= list_1, V2 =list_2, V3 = list_3 )
print(DT)
   V1 V2 V3
1:  a  1 aa
2:  b  2 bb
3:  c  3 cc
4:  d  4 dd

3. How to import csv data files as data.table ?

Difficulty Level: L1

Question : Import BostonHousing dataset as a data.table.

Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
print(head(DT))
      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
1: 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2: 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3: 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4: 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5: 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6: 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
   medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7

4. How to import only ‘n’ rows from a csv file to create a data.table?

Difficiulty Level: L1

Question : Import first 50 row of BostonHousing dataset as a data.table.

Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', nrows = 50)
print(head(DT))
      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
1: 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2: 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3: 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4: 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5: 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6: 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
   medv
1: 24.0
2: 21.6
3: 34.7
4: 33.4
5: 36.2
6: 28.7

5. How to import only specified columns from a csv file?

Difficulty Level: L1

Question : Import only ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a data.table.

Show Solution
# Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', select = c('crim', 'medv'))
print(head(DT))
      crim medv
1: 0.00632 24.0
2: 0.02731 21.6
3: 0.02729 34.7
4: 0.03237 33.4
5: 0.06905 36.2
6: 0.02985 28.7

6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table

Difficulty Level: L2

Question : Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset.

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
Show Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

#  number of rows and columns
print(NROW(DT))
print(NCOL(DT))

# datatypes
print(sapply(DT,class))

# summary statistics
summary(DT)
[1] 506
[1] 14
     crim        zn     indus      chas       nox        rm       age       dis 
"numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric" 
      rad       tax   ptratio         b     lstat      medv 
"integer" "integer" "numeric" "numeric" "numeric" "numeric" 



      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
 1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
 Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
 Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
 3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio            b         
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
 Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
 Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat            medv      
 Min.   : 1.73   Min.   : 5.00  
 1st Qu.: 6.95   1st Qu.:17.02  
 Median :11.36   Median :21.20  
 Mean   :12.65   Mean   :22.53  
 3rd Qu.:16.95   3rd Qu.:25.00  
 Max.   :37.97   Max.   :50.00  

7. How to extract the row and column number of a particular cell with given criterion?

Difficulty Level: L2

Question : Which manufacturer, model and type has the highest Price? What is the row and column number of the cell with the highest Price value?

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
# Get Manufacturer with highest price
print(DT[Price == max(Price, na.rm = TRUE)][,list(Manufacturer, Model, Type)])

# Get row number
DT[, .(rownum=.I, Price)][Price == max(Price, na.rm = TRUE), rownum]
    Manufacturer Model    Type
1: Mercedes-Benz  300E Midsize

59

8. How to rename a specific columns in a data.table?

Difficulty Level: Easy

Question : Rename the column Model as New_Model in DT.

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

print(head(setnames(DT, old = "Model", new = "New_Model")[]))
   Manufacturer New_Model    Type Min.Price Price Max.Price MPG.city
1:        Acura   Integra   Small      12.9  15.9      18.8       25
2:         <NA>    Legend Midsize      29.2  33.9      38.7       18
3:         Audi        90 Compact      25.9  29.1      32.3       20
4:         Audi       100 Midsize        NA  37.7      44.6       19
5:          BMW      535i Midsize        NA  30.0        NA       22
6:        Buick   Century Midsize      14.2  15.7      17.3       22
   MPG.highway            AirBags DriveTrain Cylinders EngineSize Horsepower
1:          31               None      Front         4        1.8        140
2:          25 Driver & Passenger      Front         6        3.2        200
3:          26        Driver only      Front         6        2.8        172
4:          26 Driver & Passenger       <NA>         6         NA        172
5:          30               <NA>       Rear         4        3.5        208
6:          31        Driver only       <NA>         4        2.2        110
    RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
1: 6300         2890             Yes               13.2          5    177
2: 5500         2335             Yes               18.0          5    195
3: 5500         2280             Yes               16.9          5    180
4: 5500         2535            <NA>               21.1          6    193
5: 5700         2545             Yes               21.1          4    186
6: 5200         2565              No               16.4          6    189
   Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
1:       102    68          37           26.5           NA   2705 non-USA
2:       115    71          38           30.0           15   3560 non-USA
3:       102    67          37           28.0           14   3375 non-USA
4:       106    NA          37           31.0           17   3405 non-USA
5:       109    69          39           27.0           13   3640 non-USA
6:       105    69          41           28.0           16     NA     USA
            Make
1: Acura Integra
2:  Acura Legend
3:       Audi 90
4:      Audi 100
5:      BMW 535i
6: Buick Century

9. How to check if a data.table has any missing values?

Difficulty Level: L1

Question : Check if DT has any missing values. Return TRUE is present else FALSE.

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
anyNA(DT)

TRUE

10. How to check for missing values in a data.table?

Difficulty Level: L1

Question : Get the location of missing values in DT. Replace the value with TRUE is missing else FALSE.

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

     Manufacturer Model  Type Min.Price Price Max.Price MPG.city MPG.highway
[1,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[2,]         TRUE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[3,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[4,]        FALSE FALSE FALSE      TRUE FALSE     FALSE    FALSE       FALSE
[5,]        FALSE FALSE FALSE      TRUE FALSE      TRUE    FALSE       FALSE
[6,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
(...truncated...)
Show Solution
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
print(head(is.na(DT)))
     Manufacturer Model  Type Min.Price Price Max.Price MPG.city MPG.highway
[1,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[2,]         TRUE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[3,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
[4,]        FALSE FALSE FALSE      TRUE FALSE     FALSE    FALSE       FALSE
[5,]        FALSE FALSE FALSE      TRUE FALSE      TRUE    FALSE       FALSE
[6,]        FALSE FALSE FALSE     FALSE FALSE     FALSE    FALSE       FALSE
     AirBags DriveTrain Cylinders EngineSize Horsepower   RPM Rev.per.mile
[1,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[2,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[3,]   FALSE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[4,]   FALSE       TRUE     FALSE       TRUE      FALSE FALSE        FALSE
[5,]    TRUE      FALSE     FALSE      FALSE      FALSE FALSE        FALSE
[6,]   FALSE       TRUE     FALSE      FALSE      FALSE FALSE        FALSE
     Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
[1,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[2,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[3,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[4,]            TRUE              FALSE      FALSE  FALSE     FALSE  TRUE
[5,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
[6,]           FALSE              FALSE      FALSE  FALSE     FALSE FALSE
     Turn.circle Rear.seat.room Luggage.room Weight Origin  Make
[1,]       FALSE          FALSE         TRUE  FALSE  FALSE FALSE
[2,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[3,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[4,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[5,]       FALSE          FALSE        FALSE  FALSE  FALSE FALSE
[6,]       FALSE          FALSE        FALSE   TRUE  FALSE FALSE

11. How to get the row and column positions of missing values in a data.table?

Difficulty Level: L2

Question: Get the row and column positions of missing values in DT.

Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

[1] "(1, 24)"  "(2, 1)"   "(4, 4)"   "(4, 10)"  "(4, 12)"  "(4, 16)"  "(4, 21)" 
[8] "(5, 4)"   "(5, 6)"   "(5, 9)"   "(6, 10)"  "(6, 25)"  "(7, 6)"   "(7, 15)" 
[15] "(7, 17)"  "(9, 21)"  "(10, 15)" "(11, 24)" "(12, 9)"  "(12, 10)" "(12, 13)"
[22] "(13, 11)" "(13, 24)" "(14, 11)" "(14, 17)" "(15, 7)"  "(15, 13)" "(15, 19)"
[29] "(15, 22)" "(16, 24)" "(17, 13)" "(17, 24)" "(19, 23)" "(19, 24)" "(19, 26)"
[36] "(20, 1)"  "(20, 22)" "(23, 17)" "(23, 23)" "(24, 15)" "(24, 24)" .. (truncated)
Show Solution
# Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
output <- c()  # init output vector

for(i in 1:NROW(DT)){
  for(j in 1:NCOL(DT)){
    curr_value <- DT[i, j,with=F][[1]]
    if(is.na(curr_value)){
      position <- paste0('(', paste(i, j, sep=', '), ')')
      output   <- c(output, position)
    }
  }
}

11. How to count the number of missing values in each column of data.table?

Difficulty Level: L2

Question : Count the number of missing values in each column of DT

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
for (i in colnames(DT)){
    print(paste("Number of missing values in column", i , "is", sum(is.na(DT[,..i]))))
}
[1] "Number of missing values in column Manufacturer is 4"
[1] "Number of missing values in column Model is 1"
[1] "Number of missing values in column Type is 3"
[1] "Number of missing values in column Min.Price is 7"
[1] "Number of missing values in column Price is 2"
[1] "Number of missing values in column Max.Price is 5"
[1] "Number of missing values in column MPG.city is 9"
[1] "Number of missing values in column MPG.highway is 2"
[1] "Number of missing values in column AirBags is 6"
[1] "Number of missing values in column DriveTrain is 7"
[1] "Number of missing values in column Cylinders is 5"
[1] "Number of missing values in column EngineSize is 2"
[1] "Number of missing values in column Horsepower is 7"
[1] "Number of missing values in column RPM is 3"
[1] "Number of missing values in column Rev.per.mile is 6"
[1] "Number of missing values in column Man.trans.avail is 5"
[1] "Number of missing values in column Fuel.tank.capacity is 8"
[1] "Number of missing values in column Passengers is 2"
[1] "Number of missing values in column Length is 4"
[1] "Number of missing values in column Wheelbase is 1"
[1] "Number of missing values in column Width is 6"
[1] "Number of missing values in column Turn.circle is 5"
[1] "Number of missing values in column Rear.seat.room is 4"
[1] "Number of missing values in column Luggage.room is 19"
[1] "Number of missing values in column Weight is 7"
[1] "Number of missing values in column Origin is 5"
[1] "Number of missing values in column Make is 3"

12. How to replace missing values of data.table with 0?

Difficulty Level: L2

Question : Replace all missing values in DT with 0.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
DT[is.na(DT)] <- 0
print(head(DT[, 1:8]))
   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:            0  Legend Midsize      29.2  33.9      38.7       18          25
3:         Audi      90 Compact      25.9  29.1      32.3       20          26
4:         Audi     100 Midsize       0.0  37.7      44.6       19          26
5:          BMW    535i Midsize       0.0  30.0       0.0       22          30
6:        Buick Century Midsize      14.2  15.7      17.3       22          31

13. How to replace missing values of numeric columns with the mean?

Difficulty Level: L2

Question : Replace missing values of numeric columns with their respective mean.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
replace_with_mean <- function(x){ifelse(is.na(x),mean(x, na.rm = TRUE),x)}
output <- DT[,lapply(.SD, replace_with_mean)]
print(head(output))

   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small   12.9000  15.9  18.80000       25          31
2:         <NA>  Legend Midsize   29.2000  33.9  38.70000       18          25
3:         Audi      90 Compact   25.9000  29.1  32.30000       20          26
4:         Audi     100 Midsize   17.1186  37.7  44.60000       19          26
5:          BMW    535i Midsize   17.1186  30.0  21.45909       22          30
6:        Buick Century Midsize   14.2000  15.7  17.30000       22          31
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4   1.800000        140 6300
2: Driver & Passenger      Front         6   3.200000        200 5500
3:        Driver only      Front         6   2.800000        172 5500
4: Driver & Passenger       <NA>         6   2.658242        172 5500
5:               <NA>       Rear         4   3.500000        208 5700
6:        Driver only       <NA>         4   2.200000        110 5200
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2335             Yes               18.0          5    195       115
3:         2280             Yes               16.9          5    180       102
4:         2535            <NA>               21.1          6    193       106
5:         2545             Yes               21.1          4    186       109
6:         2565              No               16.4          6    189       105
      Width Turn.circle Rear.seat.room Luggage.room   Weight  Origin
1: 68.00000          37           26.5     13.98649 2705.000 non-USA
2: 71.00000          38           30.0     15.00000 3560.000 non-USA
3: 67.00000          37           28.0     14.00000 3375.000 non-USA
4: 69.44828          37           31.0     17.00000 3405.000 non-USA
5: 69.00000          39           27.0     13.00000 3640.000 non-USA
6: 69.00000          41           28.0     16.00000 3104.593     USA
            Make
1: Acura Integra
2:  Acura Legend
3:       Audi 90
4:      Audi 100
5:      BMW 535i
6: Buick Century

14. How to select a specific column from a data.table as a list (vector)?

Difficulty Level: L1

Question : Get the column Model in DT as a list.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Solution

[1] "Integra"        "Legend"         "90"             "100"           
 [5] "535i"           "Century"        "LeSabre"        "Roadmaster"    
 [9] "Riviera"        "DeVille"        "Seville"        "Cavalier"      
[13] "Corsica"        "Camaro"         "Lumina"         "Lumina_APV"    
[17] "Astro"          "Caprice"        "Corvette"       "Concorde"      
[21] "LeBaron"        "Imperial"       (...truncated...) 
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
DT <- DT[, Model]
print(DT)
 [1] "Integra"        "Legend"         "90"             "100"           
 [5] "535i"           "Century"        "LeSabre"        "Roadmaster"    
 [9] "Riviera"        "DeVille"        "Seville"        "Cavalier"      
[13] "Corsica"        "Camaro"         "Lumina"         "Lumina_APV"    
[17] "Astro"          "Caprice"        "Corvette"       "Concorde"      
[21] "LeBaron"        "Imperial"       "Colt"           "Shadow"        
[25] "Spirit"         "Caravan"        "Dynasty"        "Stealth"       
[29] "Summit"         "Vision"         "Festiva"        "Escort"        
[33] "Tempo"          "Mustang"        "Probe"          "Aerostar"      
[37] "Taurus"         "Crown_Victoria" "Metro"          "Storm"         
[41] "Prelude"        "Civic"          "Accord"         "Excel"         
[45] "Elantra"        "Scoupe"         "Sonata"         "Q45"           
[49] "ES300"          "SC300"          "Continental"    "Town_Car"      
[53] "323"            "Protege"        "626"            "MPV"           
[57] "RX-7"           "190E"           "300E"           NA              
[61] "Cougar"         "Mirage"         "Diamante"       "Sentra"        
[65] "Altima"         "Quest"          "Maxima"         "Achieva"       
[69] "Cutlass_Ciera"  "Silhouette"     "Eighty-Eight"   "Laser"         
[73] "LeMans"         "Sunbird"        "Firebird"       "Grand_Prix"    
[77] "Bonneville"     "900"            "SL"             "Justy"         
[81] "Loyale"         "Legacy"         "Swift"          "Tercel"        
[85] "Celica"         "Camry"          "Previa"         "Fox"           
[89] "Eurovan"        "Passat"         "Corrado"        "240"           
[93] "850"           

15. How to select a specific column from a data.table as a data.table?

Difficulty Level: L2

Question : Get the column Model in DT as a data.table (rather than as a list).

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Solution

     Model
1: Integra
2:  Legend
3:      90
4:     100
5:    535i
6: Century
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
DT <- DT[, list(Model)]
print(head(DT))
     Model
1: Integra
2:  Legend
3:      90
4:     100
5:    535i
6: Century

16. How to sort a data.table based on a given column?

Difficulty Level: L1

Question: Reorder the rows as per Price column of DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
DT <- DT[order(Price)]
print(head(DT))
   Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford Festiva Small       6.9   7.4       7.9       31          33
2:      Hyundai   Excel Small       6.8   8.0       9.2       29          33
3:        Mazda     323 Small       7.4   8.3       9.1       29          37
4:          Geo   Metro Small       6.7   8.4      10.0       46          50
5:       Subaru   Justy Small       7.3   8.4       9.5       33          37
6:       Suzuki   Swift  <NA>       7.3   8.6        NA       39          43
   AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:    <NA>      Front         4        1.3         63 5000         3150
2:    None      Front         4        1.5         81 5500         2710
3:    None      Front         4        1.6         82 5000         2370
4:    None      Front         3        1.0         55 5700         3755
5:    None        4WD         3        1.2         73 5600         2875
6:    None      Front         3        1.3         70 6000         3360
   Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:             Yes               11.9          5    168        94    63
3:             Yes               13.2          4    164        97    66
4:            <NA>               10.6          4    151        93    63
5:             Yes                9.2          4    146        90    60
6:             Yes               10.6          4    161        93    NA
   Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:          33           26.0           12   1845     USA  Ford Festiva
2:          35           26.0           11   2345 non-USA Hyundai Excel
3:          34           27.0           16   2325 non-USA     Mazda 323
4:          34           27.5           10   1695 non-USA     Geo Metro
5:          32           23.5           10   2045 non-USA  Subaru Justy
6:          34           27.5           10   1965 non-USA  Suzuki Swift

17. How to sort a data.table based on two or more columns?

Difficulty Level: L1

Question : Reorder the rows as per ascending order of Price column and descending order of Max.Price column of DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
DT <- DT[order(Price, -Max.Price)]
print(head(DT))
   Manufacturer   Model  Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford Festiva Small       6.9   7.4       7.9       31          33
2:      Hyundai   Excel Small       6.8   8.0       9.2       29          33
3:        Mazda     323 Small       7.4   8.3       9.1       29          37
4:          Geo   Metro Small       6.7   8.4      10.0       46          50
5:       Subaru   Justy Small       7.3   8.4       9.5       33          37
6:       Suzuki   Swift  <NA>       7.3   8.6        NA       39          43
   AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:    <NA>      Front         4        1.3         63 5000         3150
2:    None      Front         4        1.5         81 5500         2710
3:    None      Front         4        1.6         82 5000         2370
4:    None      Front         3        1.0         55 5700         3755
5:    None        4WD         3        1.2         73 5600         2875
6:    None      Front         3        1.3         70 6000         3360
   Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:             Yes               11.9          5    168        94    63
3:             Yes               13.2          4    164        97    66
4:            <NA>               10.6          4    151        93    63
5:             Yes                9.2          4    146        90    60
6:             Yes               10.6          4    161        93    NA
   Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:          33           26.0           12   1845     USA  Ford Festiva
2:          35           26.0           11   2345 non-USA Hyundai Excel
3:          34           27.0           16   2325 non-USA     Mazda 323
4:          34           27.5           10   1695 non-USA     Geo Metro
5:          32           23.5           10   2045 non-USA  Subaru Justy
6:          34           27.5           10   1965 non-USA  Suzuki Swift

18. How to filter out rows from a data.table that satisfy multiple conditions?

Difficulty Level: L2

Question : Filter out the rows having Manufacturer as Ford and Price less than 30

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output :

   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford  Festiva   Small       6.9   7.4       7.9       31          33
2:         Ford   Escort   Small       8.4  10.1      11.9       23          30
3:         Ford    Tempo Compact      10.4  11.3      12.2       22          27
4:         Ford  Mustang  Sporty      10.8  15.9      21.0       22          29
5:         Ford    Probe  Sporty      12.8  14.0      15.2       NA          30
6:         Ford Aerostar     Van      14.5  19.9      25.3       15          20
       AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:        <NA>      Front         4        1.3         63 5000         3150
2:        None      Front         4        1.8        127 6500         2410
3:        None      Front         4        2.3         96 4200         2805
4: Driver only       Rear         4        2.3        105 4600         2285
5: Driver only      Front         4        2.0        115 5500         2340
6: Driver only        4WD         6        3.0        145 4800         2080
(...truncated...)
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
sol <- DT[Manufacturer == "Ford" & Price < 30]
print(head(sol))
   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         Ford  Festiva   Small       6.9   7.4       7.9       31          33
2:         Ford   Escort   Small       8.4  10.1      11.9       23          30
3:         Ford    Tempo Compact      10.4  11.3      12.2       22          27
4:         Ford  Mustang  Sporty      10.8  15.9      21.0       22          29
5:         Ford    Probe  Sporty      12.8  14.0      15.2       NA          30
6:         Ford Aerostar     Van      14.5  19.9      25.3       15          20
       AirBags DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile
1:        <NA>      Front         4        1.3         63 5000         3150
2:        None      Front         4        1.8        127 6500         2410
3:        None      Front         4        2.3         96 4200         2805
4: Driver only       Rear         4        2.3        105 4600         2285
5: Driver only      Front         4        2.0        115 5500         2340
6: Driver only        4WD         6        3.0        145 4800         2080
   Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width
1:             Yes               10.0          4    141        90    63
2:            <NA>               13.2          5    171        98    67
3:             Yes               15.9          5    177       100    68
4:             Yes               15.4          4    180       101    68
5:             Yes               15.5          4    179       103    70
6:             Yes               21.0          7    176       119    72
   Turn.circle Rear.seat.room Luggage.room Weight Origin          Make
1:          33           26.0           12   1845    USA  Ford Festiva
2:          36           28.0           12   2530    USA   Ford Escort
3:          NA           27.5           13     NA    USA          <NA>
4:          40           24.0           12     NA    USA  Ford Mustang
5:          38           23.0           18   2710    USA    Ford Probe
6:          45           30.0           NA   3735    USA Ford Aerostar

19. How to select multiple columns from data.table and rename them?

Difficulty Level: L2

Question : Get columns ‘Manufacturer’ & ‘Type’ from data.table and rename them to ‘MANUFACTURER’ & ‘TYPE’

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   MANUFACTURER    TYPE
1:        Acura   Small
2:         <NA> Midsize
3:         Audi Compact
4:         Audi Midsize
5:          BMW Midsize
6:        Buick Midsize
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
sol <- DT[, .(MANUFACTURER = Manufacturer, TYPE = Type)]
print(head(sol))
   MANUFACTURER    TYPE
1:        Acura   Small
2:         <NA> Midsize
3:         Audi Compact
4:         Audi Midsize
5:          BMW Midsize
6:        Buick Midsize

20. How to get mean of rows that satisfy a condition from a given column?

Difficulty Level: L2

Question : Get mean of all the vehicles having Price greater than 30.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output :

38.6833
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[Price > 30, mean(Price, na.rm = TRUE)]
output

38.6833333333333

21. How to remove only specified columns from a data.table?

Difficulty Level: L2

Question : Create new data.table by removing ‘Manufacturer’ & ‘Type’ columns.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
output <- DT[, !c("Manufacturer", "Type")]
print(head(output))
     Model Min.Price Price Max.Price MPG.city MPG.highway            AirBags
1: Integra      12.9  15.9      18.8       25          31               None
2:  Legend      29.2  33.9      38.7       18          25 Driver & Passenger
3:      90      25.9  29.1      32.3       20          26        Driver only
4:     100        NA  37.7      44.6       19          26 Driver & Passenger
5:    535i        NA  30.0        NA       22          30               <NA>
6: Century      14.2  15.7      17.3       22          31        Driver only
   DriveTrain Cylinders EngineSize Horsepower  RPM Rev.per.mile Man.trans.avail
1:      Front         4        1.8        140 6300         2890             Yes
2:      Front         6        3.2        200 5500         2335             Yes
3:      Front         6        2.8        172 5500         2280             Yes
4:       <NA>         6         NA        172 5500         2535            <NA>
5:       Rear         4        3.5        208 5700         2545             Yes
6:       <NA>         4        2.2        110 5200         2565              No
   Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle
1:               13.2          5    177       102    68          37
2:               18.0          5    195       115    71          38
3:               16.9          5    180       102    67          37
4:               21.1          6    193       106    NA          37
5:               21.1          4    186       109    69          39
6:               16.4          6    189       105    69          41
   Rear.seat.room Luggage.room Weight  Origin          Make
1:           26.5           NA   2705 non-USA Acura Integra
2:           30.0           15   3560 non-USA  Acura Legend
3:           28.0           14   3375 non-USA       Audi 90
4:           31.0           17   3405 non-USA      Audi 100
5:           27.0           13   3640 non-USA      BMW 535i
6:           28.0           16     NA     USA Buick Century

22. How to calculate count of each unique value in a given column?

Difficulty Level: L2

Question : Calculate number of records for every Manufacturer in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer N
1:        Acura 1
2:         <NA> 4
3:         Audi 2
4:          BMW 1
5:        Buick 4
6:     Cadillac 2
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
output <- DT[, .(.N), by = .(Manufacturer)]
print(head(output))
   Manufacturer N
1:        Acura 1
2:         <NA> 4
3:         Audi 2
4:          BMW 1
5:        Buick 4
6:     Cadillac 2

23. How to find mean of a column grouped by multiple columns?

Difficulty Level: L2

Question : Calculate mean Price for every Manufacturer, Type combination in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer    Type       V1
1:        Acura   Small 15.90000
2:         <NA> Midsize 31.93333
3:         Audi Compact 29.10000
4:         Audi Midsize 37.70000
5:          BMW Midsize 30.00000
6:        Buick Midsize 21.00000
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)]
print(head(output))
   Manufacturer    Type       V1
1:        Acura   Small 15.90000
2:         <NA> Midsize 31.93333
3:         Audi Compact 29.10000
4:         Audi Midsize 37.70000
5:          BMW Midsize 30.00000
6:        Buick Midsize 21.00000

24. How to find mean of column grouped by multiple columns and sort by grouped columns?

Difficulty Level: L2

Question : Calculate mean Price for every Manufacturer, Type combination and arrange it by the keys in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer    Type    V1
1:        Acura   Small 15.90
2:         Audi Compact 29.10
3:         Audi Midsize 37.70
4:          BMW Midsize 30.00
5:        Buick   Large 22.25
6:        Buick Midsize 21.00
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), keyby = .(Manufacturer, Type)]

#Solution 1
output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)][order(Manufacturer, Type)]
print(head(output))
   Manufacturer    Type    V1
1:        Acura   Small 15.90
2:         Audi Compact 29.10
3:         Audi Midsize 37.70
4:          BMW Midsize 30.00
5:        Buick   Large 22.25
6:        Buick Midsize 21.00

25. How to select all numeric columns from data.table?

Difficulty Level: L2

Question : Select numeric columns from DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      12.9  15.9      18.8       25          31        1.8        140 6300
2:      29.2  33.9      38.7       18          25        3.2        200 5500
3:      25.9  29.1      32.3       20          26        2.8        172 5500
4:        NA  37.7      44.6       19          26         NA        172 5500
5:        NA  30.0        NA       22          30        3.5        208 5700
6:      14.2  15.7      17.3       22          31        2.2        110 5200
   Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2890               13.2          5    177       102    68
2:         2335               18.0          5    195       115    71
3:         2280               16.9          5    180       102    67
4:         2535               21.1          6    193       106    NA
5:         2545               21.1          4    186       109    69
6:         2565               16.4          6    189       105    69
(...truncated...)
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution 1
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
print(head(output))
   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      12.9  15.9      18.8       25          31        1.8        140 6300
2:      29.2  33.9      38.7       18          25        3.2        200 5500
3:      25.9  29.1      32.3       20          26        2.8        172 5500
4:        NA  37.7      44.6       19          26         NA        172 5500
5:        NA  30.0        NA       22          30        3.5        208 5700
6:      14.2  15.7      17.3       22          31        2.2        110 5200
   Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2890               13.2          5    177       102    68
2:         2335               18.0          5    195       115    71
3:         2280               16.9          5    180       102    67
4:         2535               21.1          6    193       106    NA
5:         2545               21.1          4    186       109    69
6:         2565               16.4          6    189       105    69
   Turn.circle Rear.seat.room Luggage.room Weight
1:          37           26.5           NA   2705
2:          38           30.0           15   3560
3:          37           28.0           14   3375
4:          37           31.0           17   3405
5:          39           27.0           13   3640
6:          41           28.0           16     NA

26. How to calculate mean of all numeric columns for every Manufacturer?

Difficulty Level: L3

Question : Calculate mean of all numeric columns for every Manufacturer

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
1:        Acura    12.900 15.900     18.80     25.0       31.00      1.800
2:         <NA>    26.775 28.550     30.30     19.0       26.00      2.975
3:         Audi        NA 33.400     38.45     19.5       26.00         NA
4:          BMW        NA 30.000        NA     22.0       30.00      3.500
5:        Buick    20.750 21.625        NA     19.0       27.75      3.875
6:     Cadillac    35.250 37.400     39.50     16.0       25.00      4.750
(...truncated...)
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- cbind(output, DT[, list(Manufacturer)])
output <- output[, lapply(.SD, mean), by = Manufacturer]
print(head(output))
   Manufacturer Min.Price  Price Max.Price MPG.city MPG.highway EngineSize
1:        Acura    12.900 15.900     18.80     25.0       31.00      1.800
2:         <NA>    26.775 28.550     30.30     19.0       26.00      2.975
3:         Audi        NA 33.400     38.45     19.5       26.00         NA
4:          BMW        NA 30.000        NA     22.0       30.00      3.500
5:        Buick    20.750 21.625        NA     19.0       27.75      3.875
6:     Cadillac    35.250 37.400     39.50     16.0       25.00      4.750
   Horsepower  RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase
1:      140.0 6300       2890.0               13.2       5.00 177.00    102.00
2:      186.5 5750           NA                 NA       5.00 193.25    109.75
3:      172.0 5500       2407.5               19.0       5.50 186.50    104.00
4:      208.0 5700       2545.0               21.1       4.00 186.00    109.00
5:      157.5 4700           NA                 NA       5.75 200.75    110.00
6:      247.5 5050           NA               19.0       5.50 205.00    112.50
   Width Turn.circle Rear.seat.room Luggage.room  Weight
1: 68.00       37.00         26.500           NA 2705.00
2: 71.25          NA         29.000         13.5 3458.75
3:    NA       37.00         29.500         15.5 3390.00
4: 69.00       39.00         27.000         13.0 3640.00
5:    NA       42.25         28.875         17.0      NA
6: 73.50       43.50         33.000           NA 3777.50

27. How to get first 3 rows for every Manufacturer?

Difficulty Level: L3

Question : Get first 3 rows for every Manufacturer in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:         Audi      90 Compact      25.9  29.1      32.3       20          26
3:         Audi     100 Midsize        NA  37.7      44.6       19          26
4:          BMW    535i Midsize        NA  30.0        NA       22          30
5:        Buick Century Midsize      14.2  15.7      17.3       22          31
6:        Buick LeSabre   Large      19.9  20.8        NA       19          28
(...truncated...)
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
output <- DT[!is.na(Manufacturer), head(.SD, 3), by = Manufacturer]
print(head(output))
   Manufacturer   Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura Integra   Small      12.9  15.9      18.8       25          31
2:         Audi      90 Compact      25.9  29.1      32.3       20          26
3:         Audi     100 Midsize        NA  37.7      44.6       19          26
4:          BMW    535i Midsize        NA  30.0        NA       22          30
5:        Buick Century Midsize      14.2  15.7      17.3       22          31
6:        Buick LeSabre   Large      19.9  20.8        NA       19          28
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4        1.8        140 6300
2:        Driver only      Front         6        2.8        172 5500
3: Driver & Passenger       <NA>         6         NA        172 5500
4:               <NA>       Rear         4        3.5        208 5700
5:        Driver only       <NA>         4        2.2        110 5200
6:        Driver only      Front         6        3.8        170 4800
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2280             Yes               16.9          5    180       102
3:         2535            <NA>               21.1          6    193       106
4:         2545             Yes               21.1          4    186       109
5:         2565              No               16.4          6    189       105
6:           NA              No                 NA          6    200       111
   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:    68          37           26.5           NA   2705 non-USA Acura Integra
2:    67          37           28.0           14   3375 non-USA       Audi 90
3:    NA          37           31.0           17   3405 non-USA      Audi 100
4:    69          39           27.0           13   3640 non-USA      BMW 535i
5:    69          41           28.0           16     NA     USA Buick Century
6:    74          42           30.5           17   3470     USA Buick LeSabre

28. How to remove specific column from data.table?

Difficulty Level: L2

Question : Remove column Model from DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
DT[, Model := NULL]
print(head(DT))
   Manufacturer    Type Min.Price Price Max.Price MPG.city MPG.highway
1:        Acura   Small      12.9  15.9      18.8       25          31
2:         <NA> Midsize      29.2  33.9      38.7       18          25
3:         Audi Compact      25.9  29.1      32.3       20          26
4:         Audi Midsize        NA  37.7      44.6       19          26
5:          BMW Midsize        NA  30.0        NA       22          30
6:        Buick Midsize      14.2  15.7      17.3       22          31
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1:               None      Front         4        1.8        140 6300
2: Driver & Passenger      Front         6        3.2        200 5500
3:        Driver only      Front         6        2.8        172 5500
4: Driver & Passenger       <NA>         6         NA        172 5500
5:               <NA>       Rear         4        3.5        208 5700
6:        Driver only       <NA>         4        2.2        110 5200
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2890             Yes               13.2          5    177       102
2:         2335             Yes               18.0          5    195       115
3:         2280             Yes               16.9          5    180       102
4:         2535            <NA>               21.1          6    193       106
5:         2545             Yes               21.1          4    186       109
6:         2565              No               16.4          6    189       105
   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin          Make
1:    68          37           26.5           NA   2705 non-USA Acura Integra
2:    71          38           30.0           15   3560 non-USA  Acura Legend
3:    67          37           28.0           14   3375 non-USA       Audi 90
4:    NA          37           31.0           17   3405 non-USA      Audi 100
5:    69          39           27.0           13   3640 non-USA      BMW 535i
6:    69          41           28.0           16     NA     USA Buick Century

29. How to set key to data.table?

Difficulty Level: L1

Question : Set column AirBags as key to the data.table

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
setkey(DT,AirBags)

30. How to get all rows where AirBags are at 'Driver & Passenger' in DT?

Difficulty Level: L2

Question : Get all rows where AirBags are Driver & Passenger in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
setkey(DT,AirBags)
print(head(DT["Driver & Passenger"]))
   Manufacturer    Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   Legend Midsize      29.2  33.9      38.7       18          25
2:         Audi      100 Midsize        NA  37.7      44.6       19          26
3:     Cadillac  Seville Midsize      37.5  40.1      42.7       16          25
4:    Chevrolet   Camaro  Sporty      13.4  15.1      16.8       19          28
5:         <NA> Concorde   Large      18.4  18.4      18.4       20          28
6:     Chrysler  LeBaron Compact      14.5  15.8      17.1       23          28
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         6        3.2        200 5500
2: Driver & Passenger       <NA>         6         NA        172 5500
3: Driver & Passenger      Front         8        4.6        295 6000
4: Driver & Passenger       Rear      <NA>        3.4        160 4600
5: Driver & Passenger      Front         6        3.3        153 5300
6: Driver & Passenger      Front         4        3.0        141 5000
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:         2335             Yes               18.0          5    195       115
2:         2535            <NA>               21.1          6    193       106
3:         1985              No               20.0          5    204       111
4:         1805             Yes                 NA          4    193       101
5:         1990              No               18.0          6    203       113
6:         2090              No               16.0          6    183       104
   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
1:    71          38           30.0           15   3560 non-USA
2:    NA          37           31.0           17   3405 non-USA
3:    74          44           31.0           NA   3935     USA
4:    74          43           25.0           13   3240     USA
5:    74          NA           31.0           15   3515     USA
6:    68          41           30.5           14   3085     USA
                Make
1:      Acura Legend
2:          Audi 100
3:  Cadillac Seville
4:  Chevrolet Camaro
5: Chrylser Concorde
6:  Chrysler LeBaron

31. How to get all rows where AirBags are either 'Driver & Passenger' or 'No data'?

Difficulty Level: L2

Question : Get all rows where AirBags are Driver & Passenger & No data.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output


Show Solution
#Input
library(data.table)
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution 
setkey(DT,AirBags)
print(DT[c("Driver & Passenger", "No data")])
     Manufacturer       Model    Type Min.Price Price Max.Price MPG.city
 1:          <NA>      Legend Midsize      29.2  33.9      38.7       18
 2:          Audi         100 Midsize        NA  37.7      44.6       19
 3:      Cadillac     Seville Midsize      37.5  40.1      42.7       16
 4:     Chevrolet      Camaro  Sporty      13.4  15.1      16.8       19
 5:          <NA>    Concorde   Large      18.4  18.4      18.4       20
 6:      Chrysler     LeBaron Compact      14.5  15.8      17.1       23
 7:         Eagle      Vision   Large        NA  19.3      21.2       20
 8:         Honda     Prelude  Sporty      17.0  19.8      22.7       24
 9:         Honda      Accord Compact      13.8  17.5      21.2       24
10:          <NA>       SC300 Midsize      34.7  35.2      35.6       18
11:       Lincoln Continental Midsize      33.3  34.3      35.3       17
12:       Lincoln    Town_Car   Large      34.4  36.1      37.8       18
13: Mercedes-Benz        300E Midsize      43.8  61.9      80.0       19
14:       Pontiac    Firebird    <NA>      14.0  17.7      21.4       19
15:       Pontiac  Bonneville   Large      19.4  24.4      29.4       19
16:          <NA>         850 Midsize      24.8  26.7      28.5       20
17:          <NA>        <NA>    <NA>        NA    NA        NA       NA
    MPG.highway            AirBags DriveTrain Cylinders EngineSize Horsepower
 1:          25 Driver & Passenger      Front         6        3.2        200
 2:          26 Driver & Passenger       <NA>         6         NA        172
 3:          25 Driver & Passenger      Front         8        4.6        295
 4:          28 Driver & Passenger       Rear      <NA>        3.4        160
 5:          28 Driver & Passenger      Front         6        3.3        153
 6:          28 Driver & Passenger      Front         4        3.0        141
 7:          28 Driver & Passenger      Front         6        3.5         NA
 8:          31 Driver & Passenger      Front         4        2.3        160
 9:          31 Driver & Passenger      Front         4        2.2        140
10:          23 Driver & Passenger       Rear         6        3.0        225
11:          26 Driver & Passenger       <NA>         6        3.8        160
12:          26 Driver & Passenger       Rear         8        4.6        210
13:          25 Driver & Passenger       Rear         6        3.2        217
14:          28 Driver & Passenger       Rear         6         NA        160
15:          28 Driver & Passenger      Front         6        3.8        170
16:          28 Driver & Passenger      Front         5        2.4        168
17:          NA            No data       <NA>      <NA>         NA         NA
     RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length
 1: 5500         2335             Yes               18.0          5    195
 2: 5500         2535            <NA>               21.1          6    193
 3: 6000         1985              No               20.0          5    204
 4: 4600         1805             Yes                 NA          4    193
 5: 5300         1990              No               18.0          6    203
 6: 5000         2090              No               16.0          6    183
 7: 5800         1980              No               18.0          6    202
 8: 5800         2855             Yes                 NA          4    175
 9: 5600           NA             Yes               17.0          4    185
10: 6000         2510             Yes                 NA          4    191
11: 4400         1835              No               18.4          6    205
12: 4600         1840              No               20.0         NA    219
13: 5500         2220              No               18.5          5     NA
14: 4600         1805             Yes               15.5          4    196
15: 4800         1565              No               18.0          6    177
16: 6200           NA             Yes               19.3          5    184
17:   NA           NA            <NA>                 NA         NA     NA
    Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight  Origin
 1:       115    71          38           30.0           15   3560 non-USA
 2:       106    NA          37           31.0           17   3405 non-USA
 3:       111    74          44           31.0           NA   3935     USA
 4:       101    74          43           25.0           13   3240     USA
 5:       113    74          NA           31.0           15   3515     USA
 6:       104    68          41           30.5           14   3085     USA
 7:       113    74          40           30.0           15   3490     USA
 8:       100    70          39           23.5            8   2865 non-USA
 9:       107    67          41           28.0           14   3040 non-USA
10:       106    71          39           25.0            9   3515 non-USA
11:       109    73          42           30.0           19   3695     USA
12:       117    77          45           31.5           22   4055     USA
13:       110    69          37             NA           15   3525 non-USA
14:       101    75          43           25.0           13   3240     USA
15:       111    74          43           30.5           18   3495     USA
16:       105    69          38           30.0           15   3245 non-USA
17:        NA    NA          NA             NA           NA     NA    <NA>
                   Make
 1:        Acura Legend
 2:            Audi 100
 3:    Cadillac Seville
 4:    Chevrolet Camaro
 5:   Chrylser Concorde
 6:    Chrysler LeBaron
 7:        Eagle Vision
 8:       Honda Prelude
 9:        Honda Accord
10:         Lexus SC300
11: Lincoln Continental
12:    Lincoln Town_Car
13:  Mercedes-Benz 300E
14:    Pontiac Firebird
15:  Pontiac Bonneville
16:           Volvo 850
17:                <NA>

33. How to get last and 2nd last (penultimate) row of data.table?

Difficulty Level: L2

Question 1 : Get penultimate row of DT.

Question 2 : Get last row of DT.

Input:

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output:

   Manufacturer Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   850 Midsize      24.8  26.7      28.5       20          28
2:        Volvo   240 Compact      21.8  22.7      23.5       21          28
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         5        2.4        168 6200
2:        Driver only       Rear      <NA>        2.3        114 5400
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:           NA             Yes               19.3          5    184       105
2:         2215             Yes               15.8          5    190       104
   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin      Make
1:    69          38           30.0           15   3245 non-USA Volvo 850
2:    67          37           29.5           14   2985 non-USA Volvo 240
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
output <- DT[c(.N, .N - 1)]
print(output)
   Manufacturer Model    Type Min.Price Price Max.Price MPG.city MPG.highway
1:         <NA>   850 Midsize      24.8  26.7      28.5       20          28
2:        Volvo   240 Compact      21.8  22.7      23.5       21          28
              AirBags DriveTrain Cylinders EngineSize Horsepower  RPM
1: Driver & Passenger      Front         5        2.4        168 6200
2:        Driver only       Rear      <NA>        2.3        114 5400
   Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase
1:           NA             Yes               19.3          5    184       105
2:         2215             Yes               15.8          5    190       104
   Width Turn.circle Rear.seat.room Luggage.room Weight  Origin      Make
1:    69          38           30.0           15   3245 non-USA Volvo 850
2:    67          37           29.5           14   2985 non-USA Volvo 240

34. How to reorder columns of data.table?

Difficulty Level: L2

Question : Reorder columns of DT as V2 , V1, V3

Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )

Desired Output

   V2 V1 V3
1:  1  a aa
2:  2  b bb
3:  3  c cc
4:  4  d dd
Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )

#Solution
output <- setcolorder(DT,c("V2","V1","V3"))
print(output)
   V2 V1 V3
1:  1  a aa
2:  2  b bb
3:  3  c cc
4:  4  d dd

35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?

Difficulty Level: L2

Question : Select column Manufacturer and compute standard deviation of column Price and return a single value that gets recycled.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output:

   Manufacturer Sd.Price
1:        Acura  9.72428
2:         <NA>  9.72428
3:         Audi  9.72428
4:         Audi  9.72428
5:          BMW  9.72428
6:        Buick  9.72428
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[,.(Manufacturer, Sd.Price = sd(Price, na.rm = TRUE))]
print(head(output))
   Manufacturer Sd.Price
1:        Acura  9.72428
2:         <NA>  9.72428
3:         Audi  9.72428
4:         Audi  9.72428
5:          BMW  9.72428
6:        Buick  9.72428

36. How to use row indexing, column indexing and group by expressions all together?

Difficulty Level: L2

Question : Calculate sum of Weight of cars under every Manufacturer but exclude all 'Midsize' Type cars.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Manufacturer    V1
1:        Acura  2705
2:         Audi  3375
3:        Buick  7575
4:     Cadillac  3620
5:    Chevrolet 23545
6:         <NA>  3515
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output = DT[Type !="Midsize", sum(Weight), by=Manufacturer]
print(head(output))
   Manufacturer    V1
1:        Acura  2705
2:         Audi  3375
3:        Buick  7575
4:     Cadillac  3620
5:    Chevrolet 23545
6:         <NA>  3515

37. How to get the row number of the nth largest value in a column?

Difficulty Level: L2

Question : Find the row position of the 5th largest value of column Price in DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
DT <- DT[order(Price)]
fifth_largest_price = DT[5, Price]
output = DT[Price == fifth_largest_price, which = TRUE]
output
  1. 4
  2. 5

38. How to find and cap outliers from a data.table column?

Difficulty Level: L2

Question: Replace all values of Length column in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

# Solution
DT[Length > quantile(Length, .95, na.rm=T), Length := round(quantile(DT$Length, .95, na.rm=T))]
DT[Length < quantile(Length, .05, na.rm=T), Length := round(quantile(DT$Length, .05, na.rm=T))]

39. How to swap two column values based on a condition in another column of a data.table?

Difficulty Level: L2

Question : Swap column V1 & V3 for all the rows where V2 = 2.

Input

DT <- data.table(V1= c("a","b","c","d"), 
                 V2 =c("1","2","2","3"), 
                 V3 = c("aa","bb","cc","dd") )

Desired Output:

   V1 V2 V3
1:  a  1 aa
2: bb  2  b
3: cc  2  c
4:  d  3 dd
Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

#Solution
DT[V2=="2", c("V3", "V1") := .(V1, V3)]
print(DT)
   V1 V2 V3
1:  a  1 aa
2: bb  2  b
3: cc  2  c
4:  d  3 dd

40. How to swap two rows of a dataframe?

Difficulty Level: L2

Questions : Swap rows 1 and 2 in DT.

Input

DT <- data.table(V1= c("a","b","c","d"), 
                 V2 =c("1","2","2","3"), 
                 V3 = c("aa","bb","cc","dd") )

Desired Output

   V1 V2 V3
1:  b  2 bb
2:  a  1 aa
3:  c  2 cc
4:  d  3 dd
Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

# Solution
swap_rows <- function(dt, i1, i2){
    i1_row <- DT[i1,]
    i2_row <- DT[i2,]
    DT[i1,] <- i2_row
    DT[i2,] <- i1_row
    return(DT)
}

output <- swap_rows(DT, 1,2)
print(output)
   V1 V2 V3
1:  b  2 bb
2:  a  1 aa
3:  c  2 cc
4:  d  3 dd

41. How to reverse the rows of a data.table?

Difficulty Level: L2

Question : Reverse all the rows of DT so that topmost row goes to bottom.

Input

DT <- data.table(V1= c("a","b","c","d"), 
                 V2 =c("1","2","2","3"), 
                 V3 = c("aa","bb","cc","dd") )

Desired Output

     V1  V2  V3  
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"
Show Solution
# Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

# Solution
output <- do.call(cbind, lapply(DT, rev))
print(output)
     V1  V2  V3  
[1,] "d" "3" "dd"
[2,] "c" "2" "cc"
[3,] "b" "2" "bb"
[4,] "a" "1" "aa"

42. How to create one-hot encodings of a categorical variable (dummy variables)?

Difficulty Level: L2

Question : Get one-hot encodings for columns Gender & Degree in the data.table DT and append it as columns.

Input

DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )
   Gender E_Id Degree
1:      M    1     UG
2:      M    2     PG
3:      F    3    PhD
4:      F    4     UG

Desired Output

   E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1:    1        0        1         0          0         1
2:    2        0        1         1          0         0
3:    3        1        0         0          1         0
4:    4        1        0         0          0         1
Show Solution
# Input
DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )

# Solution
output <- dcast(melt(DT, id.vars='E_Id'), E_Id ~ variable + value, fun = length)
print(output)
   E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG
1:    1        0        1         0          0         1
2:    2        0        1         1          0         0
3:    3        1        0         0          1         0
4:    4        1        0         0          0         1


43. How to find median value in each numeric column in R data.table?

Difficulty Level: L2

Question : Find row-wise sum values of DT.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      14.6  17.7     19.15       21          28        2.3        140 5200
   Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2360               16.5          5    181       103    69
   Turn.circle Rear.seat.room Luggage.room Weight
1:          39           27.5           14   3085
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
output <- output[, lapply(.SD, median, na.rm = TRUE)]
print(head(output))
   Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower  RPM
1:      14.6  17.7     19.15       21          28        2.3        140 5200
   Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width
1:         2360               16.5          5    181       103    69
   Turn.circle Rear.seat.room Luggage.room Weight
1:          39           27.5           14   3085

44. How to compute correlation score of each column against other columns?

Difficulty Level: L2

Question : Compute correlation score of each column against other columns.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]
print(cor(output, use = "na.or.complete"))

45. How to create a column containing the minimum by maximum of each row?

Difficulty Level: L2

Question : Compute the minimum-by-maximum for every row of df.

DT <- data.table(V1 = runif(10, 1, 100), 
                 V2 = runif(10, 1, 80), 
                 V3 = runif(10, 1, 100))
Show Solution
# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))

#Solution
DT[,max_min:=do.call(pmin,.SD)/do.call(pmax,.SD)]
print(head(DT))

# Solution 2
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
DT[, max_min2:=apply(.SD, 1, min)/apply(.SD, 1, max)]
print(head(DT))
         V1       V2       V3   max_min
1: 16.33154 66.08863 41.38638 0.2471157
2: 87.79467 38.57875 38.50150 0.4385404
3: 40.95019 19.47167 15.03562 0.3671685
4: 12.40790 17.68716 34.95050 0.3550136
5: 34.15262 24.52459 71.04999 0.3451738
6: 60.22314 60.22538 10.79905 0.1793106

46. How to scale all columns in a dataframe?

Difficulty Level: L2

Question : Normalize all columns in DT.

Without using external packages.

Input

DT <- data.table(V1= runif(10, 1, 100), 
                 V2 =runif(10, 1, 80), 
                 V3 = runif(10, 1, 100))
Show Solution
# Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))

# Solution
output <- DT[, lapply(.SD, scale)]
print(output)
         V1.V1      V2.V1       V3.V1
 1:  0.9884659 -1.2595111  0.09593436
 2:  0.1217149 -0.8867426 -0.31629923
 3:  0.5872020  0.1443417  0.70839114
 4: -0.5178531  0.5737876  1.12962574
 5:  1.0221688 -0.9475154 -0.63975308
 6: -1.5091412  1.2276058  1.11944272
 7: -0.9203772  1.0074295 -0.33182208
 8: -0.8005338  1.3745140 -1.41247525
 9:  1.5281054 -0.9386551 -1.45771901
10: -0.4997516 -0.2952545  1.10467468

47. How to compute the correlation of each column with the succeeding column?

Difficulty Level: L2

Compute the correlation of each column of df with its succeeding column.

Input

DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

                 Price
   Min.Price 0.9705018
             Max.Price 
       Price 0.9802519
              MPG.city
  Max.Price -0.5377239
           MPG.highway
  MPG.city   0.9479157
            EngineSize
MPG.highway -0.6259651
            Horsepower 
 EngineSize  0.7313269
                   RPM
 Horsepower 0.04736874
          Rev.per.mile
      RPM    0.4623314
(...truncated...)
Show Solution
#Input
DT <-  fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

#Solution
output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]  # numeric columns

for (i in 1:(ncol(output) - 1)){
    a <- i + 1
    print(cor(output[,..i], output[,..a], use = "na.or.complete"))
}
              Price
Min.Price 0.9705018
      Max.Price
Price 0.9802519
            MPG.city
Max.Price -0.5377239
         MPG.highway
MPG.city   0.9479157
            EngineSize
MPG.highway -0.6259651
           Horsepower
EngineSize  0.7313269
                  RPM
Horsepower 0.04736874
    Rev.per.mile
RPM    0.4623314
             Fuel.tank.capacity
Rev.per.mile         -0.6318497
                   Passengers
Fuel.tank.capacity  0.4965661
              Length
Passengers 0.4737477
       Wheelbase
Length 0.8275146
              Width
Wheelbase 0.8127431
      Turn.circle
Width   0.8170549
            Rear.seat.room
Turn.circle      0.4738118
               Luggage.room
Rear.seat.room     0.668847
                Weight
Luggage.room 0.6425643

48. How to replace both the diagonals of data.table with 0?

Difficulty Level: L2

Replace both values in both diagonals of DT with 0.

Input

DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

          V1        V2        V3       V4
1: 97.451642 65.393451  6.496714 27.80014
2: 56.475427  5.234317 80.551154 10.86617
3: 65.965884 18.954913 70.968490 25.92509
4:  3.723218 26.590925 75.723006 42.22132

Desired output

         V1       V2        V3       V4
1:  0.00000 65.39345  6.496714  0.00000
2: 56.47543  0.00000  0.000000 10.86617
3: 65.96588  0.00000  0.000000 25.92509
4:  0.00000 26.59093 75.723006  0.00000
Show Solution
# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

# Solution
for (i in 1:nrow(DT)){
    DT[i,i] <- 0
    DT[nrow(DT) - i + 1, i] <- 0
}
print(DT)
         V1       V2       V3        V4
1:  0.00000 72.08959 12.63875  0.000000
2: 17.61880  0.00000  0.00000  1.696741
3: 82.20107  0.00000  0.00000 40.321027
4:  0.00000 40.47473 71.60260  0.000000

49. How to join two dataframes by 2 columns so they have only the common rows?

Difficulty Level: L2

Join dataframes df1 and df2 by ‘fruit-pazham’ and ‘weight-kilo’.

Input

DT1 = data.table(fruit =  c('apple', 'banana', 'orange'),
                    weight =  c('high', 'medium', 'low'),
                    price=  c(1,2,3))

DT2 = data.table(pazham =  c('apple', 'orange', 'pine'),
                    kilo= c('high', 'low', 'high'),
                    price =  c(1,2,3))

Desired Output

    fruit weight price_left price_right
1:  apple   high          1           1
2: orange    low          3           2
Show Solution
# Input
DT1 = data.table(fruit =  c('apple', 'banana', 'orange'),
                    weight =  c('high', 'medium', 'low'),
                    price=  c(1,2,3))

DT2 = data.table(pazham =  c('apple', 'orange', 'pine'),
                    kilo= c('high', 'low', 'high'),
                    price =  c(1,2,3))

# Solution
output <- merge(DT1, DT2, 
                  by.x = c('fruit', 'weight'), 
                  by.y = c('pazham', 'kilo'), 
                  all = FALSE, 
                  suffixes = c("_left", "_right"))

print(output)
    fruit weight price_left price_right
1:  apple   high          1           1
2: orange    low          3           2


50. How to get the positions where values of two columns of data.table match?

Difficulty Level: L2

Question : Get the positions where values of two columns of DT match.

Input

DT <- data.table(fruit1 =  c('apple', 'orange', 'banana', 'mango),
                 fruit2 =  c('apple', 'grapes', 'banana', 'jackfruit'))

Desired Output

1 3
Show Solution
# Input
DT <- data.table(fruit1 =  c('apple', 'orange', 'banana', 'mango'),
                    fruit2 =  c('apple', 'grapes', 'banana', 'jackfruit'))

# Solution
which(DT$fruit1 == DT$fruit2)
  1. 1
  2. 3


51. How to create lags and leads of a column in a data.table?

Difficulty Level: L2

Create two new columns in df, one of which is a lag 1 (shift column a down by 1 row) of column ‘a’ and the other is a lead 1 (shift column b up by 1 row).

Input

DT <- data.table(V1 = runif(4, 1, 100), 
                 V2 = runif(4, 1, 80), 
                 V3 = runif(4, 1, 100), 
                 V4 = runif(4,1, 50))

Desired Output

         V1       V2        V3       V4 lagged_column lead_column
1: 90.41034 79.62526 57.525005 16.88133            NA    13.19427
2: 13.19427 29.20243  3.962821 35.72731      90.41034    13.79110
3: 13.79110 15.68368 78.933011 18.43108      13.19427    86.76293
4: 86.76293 33.66214  6.706639 43.40251      13.79110          NA
Show Solution
# Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

# Solution
DT[, lagged_column := shift(V1, type = 'lag')]
DT[, lead_column := shift(V1, type = 'lead')]
print(DT)
         V1        V2        V3       V4 lagged_column lead_column
1: 51.50006 74.050716 27.618570 16.68658            NA    82.29783
2: 82.29783 63.969537  9.054432  1.16761      51.50006    78.36425
3: 78.36425 63.937995 85.047398 21.39628      82.29783    63.64181
4: 63.64181  2.482478 10.060054 27.33597      78.36425          NA

52. How to get the frequency of unique values in the entire dataframe?

Difficulty Level: L2

Question : Get the frequency of unique values in the entire dataframe DT.

Input

DT <- data.table(fruit =  c('apple', 'orange', 'banana', 'apple'))

Desired Output

    fruit number_of_distinct_orders
1:  apple                         2
2: orange                         1
3: banana                         1
Show Solution
# Input
DT <- data.table(fruit =  c('apple', 'orange', 'banana', 'apple'))

# Solution
output <- DT
output[,flag := 1]
print(output[, .(number_of_distinct_orders = length(flag)), by = fruit])
    fruit number_of_distinct_orders
1:  apple                         2
2: orange                         1
3: banana                         1

53. How to convert a table from long to wide format using R data.table?

Question : Create a pivot table out of DT keeping ID & Month as key.

Difficulty Level: L2

Input

n <- 5
DT <- data.table(  
    ID       = sample(1:20, n, replace=TRUE), 
    Month    = sample(1:12, n, replace=TRUE),
    Category = c("Drinks", "Food", "Drinks", "Food", "Food"),
    Expenses = runif(n),
    key      = c('ID', 'Month')
)
   ID Month Category  Expenses
1:  4     3   Drinks 0.9921504
2:  4     4     Food 0.8073523
3:  8    10     Food 0.5533336
4: 14     8     Food 0.2588098
5: 20     4   Drinks 0.2012480

Desired Output

   ID Month    Drinks      Food
1:  4     3 0.9921504 0.0000000
2:  4     4 0.0000000 0.8073523
3:  8    10 0.0000000 0.5533336
4: 14     8 0.0000000 0.2588098
5: 20     4 0.2012480 0.0000000
Show Solution
# Input
n <- 5
DT <- data.table(
  ID=sample(1:20, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Category= c("Drinks", "Food", "Drinks", "Food", "Food"),
  Expenses=runif(n),
  key=c('ID', 'Month')
)

# Solution
output <- dcast.data.table(DT, ID + Month ~ Category, fun=sum, value.var='Expenses')
print(output)
   ID Month    Drinks       Food
1:  7     6 0.0000000 0.75928350
2:  7    10 0.5103192 0.00000000
3:  9     3 0.1320670 0.00000000
4: 10     2 0.0000000 0.07450386
5: 18     7 0.0000000 0.23840221

54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?

Difficulty Level: L2

Question : Create a bootstrap sample of length 5 from DT.

Input

DT <- data.table(V1 = runif(20, 1, 100), 
                 V2 = runif(20, 1, 80), 
                 V3 = runif(20, 1, 100), 
                 V4 = runif(20,1, 50))
Show Solution
# Input
DT <- data.table(V1 = runif(20, 1, 100), 
                 V2 = runif(20, 1, 80), 
                 V3 = runif(20, 1, 100), 
                 V4 = runif(20,1, 50))

# Solution
n <- 5
row_number <- sample(1:20, n, replace=TRUE)
print(DT[row_number,])
          V1       V2       V3       V4
1:  3.213551 25.49768 30.01245  5.81102
2: 28.313228 63.25563 89.86874 30.28312
3: 37.163717 41.17688 98.72226 43.52485
4: 91.656742 60.68913  9.93234 21.66074
5: 51.661261 26.40149 28.44038 27.73880

55. How to bin a numeric column with user custom intervals to form a categorical (text) column?

Difficulty Level: L2

In DT, create a new column (cat_col, that has 'fail' if the average of the first two columns is less than 40. Else, it contains 'pass'.

Input

set.seed(100)
DT <- data.table(V1 = runif(20, 1, 100), 
                 V2 = runif(20, 1, 80))

Desired Ouput

           V1       V2 cat_col
 1: 31.468845 43.32908    fail
 2: 26.509578 57.15350    pass
 3: 55.679921 43.52955    pass
 4:  6.581932 60.16881    fail
 5: 47.386379 34.18801    pass
 6: 48.893303 14.54220    fail
(...truncated...)
Show Solution
# Input
set.seed(100)
DT <- data.table(V1= runif(20, 1, 100), 
                 V2 =runif(20, 1, 80))

# Solution
DT[, cat_col := ifelse((V1 + V2)/2 < 40, 'fail' , 'pass')]
print(DT)
           V1       V2 cat_col
 1: 31.468845 43.32908    fail
 2: 26.509578 57.15350    pass
 3: 55.679921 43.52955    pass
 4:  6.581932 60.16881    fail
 5: 47.386379 34.18801    pass
 6: 48.893303 14.54220    fail
 7: 81.427859 61.85383    pass
 8: 37.661733 70.67433    pass
 9: 55.109301 44.37864    pass
10: 17.855943 22.94018    fail
11: 62.874651 39.57617    pass
12: 88.334386 74.35190    pass
13: 28.755030 28.54667    fail
14: 40.450302 76.37846    pass
15: 76.492557 55.92666    pass
16: 67.233150 71.26683    pass
17: 21.256604 15.25217    fail
18: 36.394960 50.72188    pass
19: 36.588036 79.17557    pass
20: 69.338762 11.29282    pass

56. How to convert a data.table from wide to long format?

Difficulty Level: L2

Question : Convert DT from wide to long format by keeping ID & Month as key.

Input

n <- 5
DT <- data.table(
  ID       = sample(1:20, n, replace=TRUE), 
  Month    = sample(1:12, n, replace=TRUE),
  Drinks   = sample(10:20,n, replace = TRUE),
  Food     = sample(10:20,n, replace = TRUE),
  Starters = sample(5:15,n, replace = TRUE)   
)

   ID Month Drinks Food Starters
1:  1    12     17   14       13
2: 14     2     12   12       11
3: 12     7     12   15       11
4: 19     1     16   15       15
5: 15    11     17   10       10

Desired Output

   ID Month variable value
1: 14     4   Drinks    12
2:  4    11   Drinks    18
3: 19     5   Drinks    17
4: 19     7   Drinks    15
5: 14     4   Drinks    11
6: 14     4     Food    14
(...truncated...)
Show Solution
# Input
n <- 5
DT <- data.table(
  ID=sample(1:20, n, replace=TRUE), 
  Month=sample(1:12, n, replace=TRUE),
  Drinks= sample(10:20,n, replace = TRUE),
  Food = sample(10:20,n, replace = TRUE),
  Starters = sample(5:15,n, replace = TRUE)   
)

# Solution
output = melt(DT, id.vars = c("ID", "Month"),
                measure.vars = c("Drinks", "Food", "Starters"))
print(head(output))
   ID Month variable value
1: 14     4   Drinks    12
2:  4    11   Drinks    18
3: 19     5   Drinks    17
4: 19     7   Drinks    15
5: 14     4   Drinks    11
6: 14     4     Food    14

57. How to create a dataframe that contains the first 3 lags of a time series data?

Difficulty Level: L2

Question : Append a new column in DT with first three lag of column V1.

Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), 
                 V1   = round(runif(10, 1, 80),2))

Desired Output

         Date    V1  lag1  lag2  lag3
1: 2011-01-01 48.92    NA    NA    NA
2: 2011-01-02 65.45 48.92    NA    NA
3: 2011-01-03 67.63 65.45 48.92    NA
4: 2011-01-04 63.26 67.63 65.45 48.92
5: 2011-01-05  2.50 63.26 67.63 65.45
6: 2011-01-06 56.15  2.50 63.26 67.63
7: 2011-01-07 65.39 56.15  2.50 63.26
8: 2011-01-08 45.90 65.39 56.15  2.50
Show Solution
# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), 
                 V1   = round(runif(10, 1, 80),2))

# Solution
DT[, `:=`(lag1 = shift(V1, 1, type = 'lag'),
          lag2 = shift(V1, 2, type = 'lag'),
          lag3 = shift(V1, 3, type = 'lag'))]

print(head(DT, 8))
         Date    V1  lag1  lag2  lag3
1: 2011-01-01  8.05    NA    NA    NA
2: 2011-01-02 13.85  8.05    NA    NA
3: 2011-01-03  3.11 13.85  8.05    NA
4: 2011-01-04 57.06  3.11 13.85  8.05
5: 2011-01-05 61.12 57.06  3.11 13.85
6: 2011-01-06 68.74 61.12 57.06  3.11
7: 2011-01-07 35.54 68.74 61.12 57.06
8: 2011-01-08 33.95 35.54 68.74 61.12

58. How to shuffle rows of a dataframe?

Difficulty Level: L2

Question : Shuffle rows of DT.

Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), 
                 V1   = runif(10, 1, 80), 
                 V2   = runif(10, 1, 100), 
                 V3   = runif(10,1, 50))
Show Solution
# Input
DT <- data.table(Date = seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by = 1), 
                 V1   = runif(10, 1, 80), 
                 V2   = runif(10, 1, 100), 
                 V3   = runif(10,1, 50))

# Solution
set.seed(1000)
print(DT[sample(nrow(DT), replace = T),])
          Date       V1       V2        V3
 1: 2011-01-04 35.76134  2.61564 17.215551
 2: 2011-01-06 54.37578 56.66843 15.841530
 3: 2011-01-03 59.85359 21.81431 32.227574
 4: 2011-01-08 39.23481 74.79349 33.457133
 5: 2011-01-03 59.85359 21.81431 32.227574
 6: 2011-01-02 54.02029 34.80660  7.503086
 7: 2011-01-06 54.37578 56.66843 15.841530
 8: 2011-01-06 54.37578 56.66843 15.841530
 9: 2011-01-06 54.37578 56.66843 15.841530
10: 2011-01-01 76.45873 27.03600 16.913088

59. How to select rows between two dates in a dataframe?

Difficulty Level: L1

Question : Select all rows between dates 2011-01-05 & 2011-01-08 in DT.

Input

set.seed(100)
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), 
                 V1   = runif(10, 1, 80), 
                 V2   = runif(10, 1, 100), 
                 V3   = runif(10,1, 50))

Desired Output

         Date       V1       V2        V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489  5.073762
Show Solution
# Input
DT <- data.table(Date= seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 =runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))

# Solution
output <- DT[(Date > '2011-01-05') & (Date < '2011-01-08'),]
print(output)
         Date       V1       V2        V3
1: 2011-01-06 28.39314 36.99609 29.814114
2: 2011-01-07 52.53458 33.46489  5.073762

To be continued . . .