Strange Behavior When Doing a Grouped Filter And Removing Missing Values #481

Closed
jrisi256 opened this issue Dec 6, 2024 · 2 comments
Labels
bug an unexpected problem or unintended behavior

jrisi256 commented Dec 6, 2024

Hi! I am not sure if the behavior I encountered was expected or not, or if it is documented anywhere. But I was certainly not expecting it! And it seems a little counter-intuitive to me if this is supposed to be the correct behavior.

Basically, when running a grouped filter on a column with missing values, dtplyr returns an extra row of all NAs for each row where the value is missing.

library(dtplyr)
library(dplyr)
 
example_df <- tibble(id = c(1, 1, 1, 1), value = c(NA, NA, 0, 1))
 
dplyr_fltr <- example_df %>% group_by(id) %>% filter(value == min(value, na.rm = TRUE))
dplyr_fltr
# A tibble: 1 × 2
# Groups:   id [1]
     id value
  <dbl> <dbl>
1     1     0
 
dtplyr_fltr <- example_df %>% lazy_dt() %>% group_by(id) %>% filter(value == min(value, na.rm = TRUE)) %>% as_tibble()
dtplyr_fltr
# A tibble: 3 × 2
     id value
  <dbl> <dbl>
1    NA    NA
2    NA    NA
3     1     0

I have noticed the issue arises when using group_by() specifically. In other words, if I remove the group_by(), the code returns a table which matches the output from dplyr.

dtplyr_fltr_working <- example_df %>% lazy_dt() %>% filter(value == min(value, na.rm = TRUE)) %>% as_tibble()
dtplyr_fltr_working
# A tibble: 1 × 2
     id value
  <dbl> <dbl>
1     1     0

I wound up including extra clauses in my filter statement to get the correct behavior:

dtplyr_groupby_fltr_working <-
    example_df %>%
    lazy_dt() %>%
    group_by(id) %>%
    filter(all(is.na(value)) | (value == min(value, na.rm = TRUE) & !is.na(value))) %>%
    as_tibble()
dtplyr_groupby_fltr_working
# A tibble: 1 × 2
     id value
  <dbl> <dbl>
1     1     0

However, it seems like it would be better to fix dtplyr so that its output matches dplyr's. Alternatively, the documentation could explain why dtplyr and dplyr differ here. My apologies, again, if this has been discussed somewhere already. If you could link to that discussion, that'd be very useful!

@markfairbanks markfairbanks added the bug an unexpected problem or unintended behavior label Dec 6, 2024
markfairbanks (Collaborator) commented

Thanks for reporting, we'll take a look.

Looks like this is happening because we use the .I trick when filtering by group. When the filter condition evaluates to NA for a row, .I[condition] returns NA for that row, and subsetting a data.table with NA row indices produces rows filled with NAs.

library(data.table)  # needed for data.table() below
library(dtplyr)
library(dplyr)

example_df <- data.table(id = c(1, 1, 1, 1), value = c(NA, NA, 0, 1))

example_df %>%
  lazy_dt() %>%
  filter(value == 0, .by = id)
#> Source: local data table [3 x 2]
#> Call:   `_DT1`[`_DT1`[, .I[value == 0], by = .(id)]$V1]
#> 
#>      id value
#>   <dbl> <dbl>
#> 1    NA    NA
#> 2    NA    NA
#> 3     1     0
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

example_df[c(NA, NA, 3)]
#>       id value
#>    <num> <num>
#> 1:    NA    NA
#> 2:    NA    NA
#> 3:     1     0
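
For illustration, here is a minimal data.table-only sketch of the mechanism described above, together with one possible way to avoid the NA indices by wrapping the condition in which(). This is an assumption about how the pattern could be guarded, not necessarily the fix adopted in dtplyr; the names dt, idx_raw, idx_safe and res are hypothetical.

```r
library(data.table)

dt <- data.table(id = c(1, 1, 1, 1), value = c(NA, NA, 0, 1))

# Same pattern as the generated call above: the condition is NA for the
# first two rows, so .I[...] yields NA indices for them.
idx_raw <- dt[, .I[value == 0], by = id]$V1
print(idx_raw)

# which() keeps only the positions where the condition is TRUE and drops
# NA, so no NA indices reach the outer subset.
idx_safe <- dt[, .I[which(value == 0)], by = id]$V1
res <- dt[idx_safe]
print(res)
```

With which(), rows whose condition is NA are dropped rather than turned into all-NA rows, which is the behavior dplyr's filter() shows in the original report.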

markfairbanks (Collaborator) commented

Just noticed this is a duplicate of #474. I'm going to close this, but feel free to track progress there.
