Casting tables with dcast
and rowid
In many cases we need to reshape the tables we are working with, that is, change their format from wide to long (melt) and vice versa (cast). It’s usually a trivial operation but sometimes it requires a little more thinking. Let’s see a simplified example of a real problem.
Data generation
A totally random table with 50 rows is generated with the following columns:
- id: Random numbers between 1 and 30 (with replacement).
- phone: Random numbers between 600000000 and 699999999 (without replacement).
- scoring: Random decimal numbers between 1 and 10 (without replacement).
- pwd: Random strings with length 8.
The idea behind this data generation is to get an unknown number of rows with the same id with an unknown number of repetitions.
# Load packages
library(data.table)
library(knitr)
# Fixing the seed for reproducible results
set.seed(13)
# Random table generation
myDT <- data.table(
id = sample(1:30, 50, replace = T),
phone = sample(600000000:699999999, 50),
scoring = runif(50, 1, 10),
pwd = sapply(
1:50,
FUN = function(x) paste0(sample(letters, 8, replace = T), collapse = '')
)
)
# Order by id
setorder(myDT, id)
# Show data
kable(myDT, caption = 'Random Table', format = 'html')
id | phone | scoring | pwd |
---|---|---|---|
1 | 645403893 | 1.020973 | klcapxnp |
1 | 619413048 | 2.683202 | iohatbab |
2 | 609713628 | 7.300181 | yudqfjmg |
2 | 641256896 | 6.117944 | tuimsqil |
3 | 699784180 | 6.601338 | daymvjqa |
3 | 640456925 | 4.971228 | jcvgzbuj |
3 | 699936658 | 2.814284 | bkkutjau |
4 | 659143790 | 9.478907 | xvdzpylw |
5 | 627234317 | 1.931959 | fmpcamnx |
5 | 660916649 | 1.253950 | dctpbtam |
7 | 693736060 | 2.075693 | hnpuolaj |
8 | 639782493 | 6.433026 | rwmwqvdu |
10 | 619850570 | 8.600371 | jzdkusva |
10 | 680929713 | 6.954972 | zeqxntgc |
11 | 645335202 | 2.682323 | lojmtrvs |
11 | 629926642 | 9.379491 | cxqadrdj |
12 | 607223895 | 7.975297 | avxympqg |
13 | 653996293 | 2.607843 | nnuqxigs |
13 | 645038697 | 8.760344 | mcwduuiw |
14 | 646393146 | 1.607134 | lkowclpa |
14 | 694678826 | 8.862866 | bxhelfjs |
16 | 616062260 | 6.175765 | ncukcyzz |
17 | 662680472 | 5.231123 | mkdaflur |
17 | 629480466 | 2.209104 | zdnpwuia |
17 | 671447129 | 8.875030 | oyftrcbp |
18 | 614526543 | 3.632207 | uijzspbs |
18 | 635067768 | 8.161684 | owrevrzb |
18 | 608546595 | 5.084333 | afrcytbl |
19 | 693548878 | 1.149800 | zoqoetuz |
19 | 623388830 | 2.247059 | kthwcoef |
19 | 692795692 | 7.412554 | juaelcsd |
20 | 649672445 | 4.446254 | liutuwqf |
20 | 664961449 | 9.614372 | rnrndysi |
20 | 602632286 | 3.724153 | rqidgdol |
21 | 642990005 | 9.319300 | ytcojema |
21 | 636961722 | 6.555269 | tnsletpt |
21 | 684581198 | 7.147596 | cqnoitbj |
22 | 655936305 | 6.468607 | diypikrs |
22 | 692048681 | 5.884008 | towijsgw |
23 | 634100247 | 2.145143 | ovkziixa |
23 | 631965985 | 4.648430 | owyiszmh |
25 | 642384037 | 2.570212 | yocaenac |
26 | 639233975 | 7.821415 | jabxdxab |
27 | 623326281 | 6.333832 | rcmnymfb |
27 | 657307628 | 1.835592 | ljxframj |
27 | 680172458 | 2.802887 | mxcokckt |
28 | 694408637 | 3.547176 | wsxxhocd |
28 | 632219643 | 8.953089 | xhnvzwae |
29 | 602132896 | 4.271840 | hycucpcz |
30 | 615307379 | 5.279515 | dvhbrnim |
Change to wide format:
The goal is to have just one row per id without losing information.
The difficulty here is that the number of columns to add is not constant, varying according to the number of occurrences of each identifier. Of course, it is a problem that can be solved in many ways, especially with such a small table like this. The idea is to show how thanks to the power of R it can be solved in a general way in just one command call:
# Generate wide format table grouping by unique id
df <- dcast(
myDT,
formula = id ~ rowid(id),
value.var = c('phone', 'scoring', 'pwd')
)
# Show new table
kable(df, caption = 'Wide Format Table', format = 'html')
id | phone_1 | phone_2 | phone_3 | scoring_1 | scoring_2 | scoring_3 | pwd_1 | pwd_2 | pwd_3 |
---|---|---|---|---|---|---|---|---|---|
1 | 645403893 | 619413048 | NA | 1.020973 | 2.683202 | NA | klcapxnp | iohatbab | NA |
2 | 609713628 | 641256896 | NA | 7.300181 | 6.117944 | NA | yudqfjmg | tuimsqil | NA |
3 | 699784180 | 640456925 | 699936658 | 6.601338 | 4.971228 | 2.814284 | daymvjqa | jcvgzbuj | bkkutjau |
4 | 659143790 | NA | NA | 9.478907 | NA | NA | xvdzpylw | NA | NA |
5 | 627234317 | 660916649 | NA | 1.931959 | 1.253950 | NA | fmpcamnx | dctpbtam | NA |
7 | 693736060 | NA | NA | 2.075693 | NA | NA | hnpuolaj | NA | NA |
8 | 639782493 | NA | NA | 6.433026 | NA | NA | rwmwqvdu | NA | NA |
10 | 619850570 | 680929713 | NA | 8.600371 | 6.954972 | NA | jzdkusva | zeqxntgc | NA |
11 | 645335202 | 629926642 | NA | 2.682323 | 9.379491 | NA | lojmtrvs | cxqadrdj | NA |
12 | 607223895 | NA | NA | 7.975297 | NA | NA | avxympqg | NA | NA |
13 | 653996293 | 645038697 | NA | 2.607843 | 8.760344 | NA | nnuqxigs | mcwduuiw | NA |
14 | 646393146 | 694678826 | NA | 1.607134 | 8.862866 | NA | lkowclpa | bxhelfjs | NA |
16 | 616062260 | NA | NA | 6.175765 | NA | NA | ncukcyzz | NA | NA |
17 | 662680472 | 629480466 | 671447129 | 5.231123 | 2.209104 | 8.875030 | mkdaflur | zdnpwuia | oyftrcbp |
18 | 614526543 | 635067768 | 608546595 | 3.632207 | 8.161684 | 5.084333 | uijzspbs | owrevrzb | afrcytbl |
19 | 693548878 | 623388830 | 692795692 | 1.149800 | 2.247059 | 7.412554 | zoqoetuz | kthwcoef | juaelcsd |
20 | 649672445 | 664961449 | 602632286 | 4.446254 | 9.614372 | 3.724153 | liutuwqf | rnrndysi | rqidgdol |
21 | 642990005 | 636961722 | 684581198 | 9.319300 | 6.555269 | 7.147596 | ytcojema | tnsletpt | cqnoitbj |
22 | 655936305 | 692048681 | NA | 6.468607 | 5.884008 | NA | diypikrs | towijsgw | NA |
23 | 634100247 | 631965985 | NA | 2.145143 | 4.648430 | NA | ovkziixa | owyiszmh | NA |
25 | 642384037 | NA | NA | 2.570212 | NA | NA | yocaenac | NA | NA |
26 | 639233975 | NA | NA | 7.821415 | NA | NA | jabxdxab | NA | NA |
27 | 623326281 | 657307628 | 680172458 | 6.333832 | 1.835592 | 2.802887 | rcmnymfb | ljxframj | mxcokckt |
28 | 694408637 | 632219643 | NA | 3.547176 | 8.953089 | NA | wsxxhocd | xhnvzwae | NA |
29 | 602132896 | NA | NA | 4.271840 | NA | NA | hycucpcz | NA | NA |
30 | 615307379 | NA | NA | 5.279515 | NA | NA | dvhbrnim | NA | NA |
Using dcast
with the collaboration of rowid
casts the table grouping by id and forces uniqueness without the data being aggregated.
I don’t know if there is an easy way to achieve this with traditional tools like SQL or a spreadsheet, but I’m afraid that a little more work would be involved.
Leave a comment