Casting tables with dcast and rowid

4 minute read

In many cases we need to reshape the tables we are working with, that is, change their format from wide to long (melt) and vice versa (cast). It’s usually a trivial operation but sometimes it requires a little more thinking. Let’s see a simplified example of a real problem.

Data generation

A totally random table with 50 rows is generated with the following columns:

  • id: Random numbers between 1 and 30 (with replacement).
  • phone: Random numbers between 600000000 and 699999999 (without replacement).
  • scoring: Random decimal numbers between 1 and 10 (without replacement).
  • pwd: Random strings with length 8.

The idea behind this data generation is to get an unknown number of rows with the same id with an unknown number of repetitions.

# Load packages
library(data.table)
library(knitr)

# Fixing the seed for reproducible results
set.seed(13)

# Random table generation
myDT <- data.table(
  id = sample(1:30, 50, replace = T),
  phone = sample(600000000:699999999, 50),
  scoring = runif(50, 1, 10),
  pwd = sapply(
    1:50,
    FUN = function(x) paste0(sample(letters, 8, replace = T), collapse = '')
    )
  )

# Order by id
setorder(myDT, id)

# Show data
kable(myDT, caption = 'Random Table', format = 'html')
Random Table
id phone scoring pwd
1 645403893 1.020973 klcapxnp
1 619413048 2.683202 iohatbab
2 609713628 7.300181 yudqfjmg
2 641256896 6.117944 tuimsqil
3 699784180 6.601338 daymvjqa
3 640456925 4.971228 jcvgzbuj
3 699936658 2.814284 bkkutjau
4 659143790 9.478907 xvdzpylw
5 627234317 1.931959 fmpcamnx
5 660916649 1.253950 dctpbtam
7 693736060 2.075693 hnpuolaj
8 639782493 6.433026 rwmwqvdu
10 619850570 8.600371 jzdkusva
10 680929713 6.954972 zeqxntgc
11 645335202 2.682323 lojmtrvs
11 629926642 9.379491 cxqadrdj
12 607223895 7.975297 avxympqg
13 653996293 2.607843 nnuqxigs
13 645038697 8.760344 mcwduuiw
14 646393146 1.607134 lkowclpa
14 694678826 8.862866 bxhelfjs
16 616062260 6.175765 ncukcyzz
17 662680472 5.231123 mkdaflur
17 629480466 2.209104 zdnpwuia
17 671447129 8.875030 oyftrcbp
18 614526543 3.632207 uijzspbs
18 635067768 8.161684 owrevrzb
18 608546595 5.084333 afrcytbl
19 693548878 1.149800 zoqoetuz
19 623388830 2.247059 kthwcoef
19 692795692 7.412554 juaelcsd
20 649672445 4.446254 liutuwqf
20 664961449 9.614372 rnrndysi
20 602632286 3.724153 rqidgdol
21 642990005 9.319300 ytcojema
21 636961722 6.555269 tnsletpt
21 684581198 7.147596 cqnoitbj
22 655936305 6.468607 diypikrs
22 692048681 5.884008 towijsgw
23 634100247 2.145143 ovkziixa
23 631965985 4.648430 owyiszmh
25 642384037 2.570212 yocaenac
26 639233975 7.821415 jabxdxab
27 623326281 6.333832 rcmnymfb
27 657307628 1.835592 ljxframj
27 680172458 2.802887 mxcokckt
28 694408637 3.547176 wsxxhocd
28 632219643 8.953089 xhnvzwae
29 602132896 4.271840 hycucpcz
30 615307379 5.279515 dvhbrnim

Change to wide format:

The goal is to have just one row per id without losing information.

The difficulty here is that the number of columns to add is not constant, varying according to the number of occurrences of each identifier. Of course, it is a problem that can be solved in many ways, especially with such a small table like this. The idea is to show how thanks to the power of R it can be solved in a general way in just one command call:

# Generate wide format table grouping by unique id
df <- dcast(
  myDT,
  formula = id ~ rowid(id),
  value.var = c('phone', 'scoring', 'pwd')
  )

# Show new table
kable(df, caption = 'Wide Format Table', format = 'html')
Wide Format Table
id phone_1 phone_2 phone_3 scoring_1 scoring_2 scoring_3 pwd_1 pwd_2 pwd_3
1 645403893 619413048 NA 1.020973 2.683202 NA klcapxnp iohatbab NA
2 609713628 641256896 NA 7.300181 6.117944 NA yudqfjmg tuimsqil NA
3 699784180 640456925 699936658 6.601338 4.971228 2.814284 daymvjqa jcvgzbuj bkkutjau
4 659143790 NA NA 9.478907 NA NA xvdzpylw NA NA
5 627234317 660916649 NA 1.931959 1.253950 NA fmpcamnx dctpbtam NA
7 693736060 NA NA 2.075693 NA NA hnpuolaj NA NA
8 639782493 NA NA 6.433026 NA NA rwmwqvdu NA NA
10 619850570 680929713 NA 8.600371 6.954972 NA jzdkusva zeqxntgc NA
11 645335202 629926642 NA 2.682323 9.379491 NA lojmtrvs cxqadrdj NA
12 607223895 NA NA 7.975297 NA NA avxympqg NA NA
13 653996293 645038697 NA 2.607843 8.760344 NA nnuqxigs mcwduuiw NA
14 646393146 694678826 NA 1.607134 8.862866 NA lkowclpa bxhelfjs NA
16 616062260 NA NA 6.175765 NA NA ncukcyzz NA NA
17 662680472 629480466 671447129 5.231123 2.209104 8.875030 mkdaflur zdnpwuia oyftrcbp
18 614526543 635067768 608546595 3.632207 8.161684 5.084333 uijzspbs owrevrzb afrcytbl
19 693548878 623388830 692795692 1.149800 2.247059 7.412554 zoqoetuz kthwcoef juaelcsd
20 649672445 664961449 602632286 4.446254 9.614372 3.724153 liutuwqf rnrndysi rqidgdol
21 642990005 636961722 684581198 9.319300 6.555269 7.147596 ytcojema tnsletpt cqnoitbj
22 655936305 692048681 NA 6.468607 5.884008 NA diypikrs towijsgw NA
23 634100247 631965985 NA 2.145143 4.648430 NA ovkziixa owyiszmh NA
25 642384037 NA NA 2.570212 NA NA yocaenac NA NA
26 639233975 NA NA 7.821415 NA NA jabxdxab NA NA
27 623326281 657307628 680172458 6.333832 1.835592 2.802887 rcmnymfb ljxframj mxcokckt
28 694408637 632219643 NA 3.547176 8.953089 NA wsxxhocd xhnvzwae NA
29 602132896 NA NA 4.271840 NA NA hycucpcz NA NA
30 615307379 NA NA 5.279515 NA NA dvhbrnim NA NA

Using dcast with the collaboration of rowid casts the table grouping by id and forces uniqueness without the data being aggregated.

I don’t know if there is an easy way to achieve this with traditional tools like SQL or a spreadsheet, but I’m afraid that a little more work would be involved.

Leave a comment