Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect records returned with subnational2 filters? #11

Open
steffilazerte opened this issue Oct 28, 2020 · 19 comments
Open

Incorrect records returned with subnational2 filters? #11

steffilazerte opened this issue Oct 28, 2020 · 19 comments
Assignees
Labels

Comments

@steffilazerte
Copy link
Member

@pmorrill perhaps something for you?

When I send a query to list_collections with the filter specifying the Winnipeg subnational2 code:

filter={"subnational2":["CA.MB.11"]}

I get 42,222,856 records, which is way more than the 1,533,348 records I get when I use the filter specifying all of Manitoba:

filter={"statProv":["MB"]}

only

Am I specifying the subnational2 code incorrectly? If I am, would it be better to return an error than all the records?

@pmorrill
Copy link

I am not sure how this works, in the db procedures. @denislepage or @cjardine-bsc may have insight? I can take a look at my own code tomorrow, to verify it is doing what it can.

@cjardine-bsc
Copy link

yup, something wonky going on there for sure. There should only be at max 481k records in CA.MB.11

I'm realizing that I don't actually know what proc your API is using. I'll check the documentation.

@cjardine-bsc
Copy link

Hmm, that level of detail on the DB side isn't covered

https://github.com/BirdStudiesCanada/NatureCountsAPI/blob/master/NatureCounts_API_v1.00.md

Paul, can you let me know what proc you're using? Merci!

@pmorrill
Copy link

I'll track this in the morning, and send any helpful details that I can find!

@denislepage
Copy link
Member

It's not a proc in this case, but a query built from DataRequest in Java.

The subnational2_code field in the bmde_data is set based on the coordinates when the data is added, so it's a pretty simple rule. As CJ points out, there are 481k records in Winnipeg for all projects and levels. There's likely either a boolean condition that is wrong in DataRequest (but there hasn't been any changes on that for ages), or the parameter is incorrectly set from the API.

What other filters were you using?

@steffilazerte
Copy link
Member Author

No other filters

@pmorrill
Copy link

Something is wrong, yes. If I request CA.MB.11 filter on MONARCHWATCH, the record count returns 24, via this query:

select collection, count(*) as nrecords from bscakn.dbo.bmde_data WITH(NOLOCK) WHERE collection in ('MONARCHWATCH') AND bmde_status_level >= 2 AND subnational2_code in ('CA.MB.11') group by collection

But if I request CA.MB filter on MONMARCHWATCH, the record count returns 145665, via this query (note there is no subnational2_code in the query at all:

select collection, count(*) as nrecords from bscakn.dbo.bmde_data WITH(NOLOCK) WHERE collection in ('MONARCHWATCH') AND bmde_status_level >= 2 group by collection

In both cases, the code itself is added to the DataRequest instance via a call to DataRequest#addFilterShapes(String[] s).

@denislepage why is it not recognizing the CA.MB filter? Is it not legit?

@steffilazerte
Copy link
Member Author

I think that's actually the opposite of my problem, I was finding that the subnational2 queries were returning way more records than the statprov = "MB" query...

@pmorrill
Copy link

The answer seems to be in this regex, found on 863 of DataRequest:

else if (filterShapes[i].matches("^[A-Z][A-Z]\\.[A-Z][A-Z]\\.[A-Z0-9][A-Z0-9][0-9]{0,1}$")) subnat2.add(filterShapes[i]);

This regex only recognizes the form CA.MB.11 and not CA.MB.

For the moment, I will switch to setting this parameter using DataRequest#setFilterSubnat2(..) directly. But you might want to investigate, Denis.

@pmorrill
Copy link

Yes - your problem is exhibiting opposite result, but we need to fix this and then re-run your queries. The regex in question might be the cause, anyway. Can you run your tests just using MONARCHWATCH? I cannot sum data the way you do, so give me some specific collection results so it's easier to debug.

@steffilazerte
Copy link
Member Author

Sure, I get 24 records with statprov = "MB" and 145665 records with subnational2 = "CA.MB.11" (I also get 145665 with no region filter, just the collections filter).

@pmorrill
Copy link

I experimented more with this and there is a problem. The filterSubnat2 variable in DataRequest appears to be treated as a numeric value. That does not work with our handling of the subnat2 parameter (which is a string). Therefore, I cannot try to set the filterSubnat2 variable directly, and am reverting to using the original DataRequest#addFilterShapes call.

The end result: absolutely nothing is solved. Time for @denislepage to take a look

@pmorrill
Copy link

pmorrill commented Oct 29, 2020

One other note: Steffi's results may be as expected. I am not sure that the 'subnat2' parameter matches anything other than a literal string. So if a collection record is attached to CA.MB.11 it will not also be attached to CA.MB automatically. This would lead to reduction in record count as seen by steffi.

Again: need input from Denis.

@denislepage
Copy link
Member

denislepage commented Oct 29, 2020 via email

@steffilazerte
Copy link
Member Author

Wow! Sounds like fun! In the R package, we use subnational2 but since users don't see what I send to the API, that part can be changed as needed.

The R package never sends more than one region filter, which can be: country OR statprov OR subnational2 OR iba OR bcr OR utm_squares OR bbox.

@pmorrill
Copy link

pmorrill commented Nov 2, 2020

This has nothing to do with naming of the variables that you pass, Steffi. I interpret your variable (subNat2) with no problem. Don't change that.

@denislepage : if I pass the subnat2 code into DataRequest#setFilterSubnat2(), I get the following query back from DataRequest#getSQLToCountResults():

select collection, count(*) as nrecords from bscakn.dbo.bmde_data WITH(NOLOCK) WHERE collection in ('MONARCHWATCH') AND subnational2_code in (CA.MB.11) AND bmde_status_level >= 2 group by collection

You can see that this will not work (the varchar is not being handled as a String). Look at line 830 in DataRequest to see why this happens: the third parameter is set to false, so the filter is treated as numeric. I mentioned this above.

For this reason, I am using addFilterShapes(), and it does work fine for proper subnat2 strings such as 'CA.MB.11'. It generates the following sql (which is perfectly acceptable):

select collection, count(*) as nrecords from bscakn.dbo.bmde_data WITH(NOLOCK) WHERE collection in ('MONARCHWATCH') AND bmde_status_level >= 2 AND subnational2_code in ('CA.MB.11') group by collection

@pmorrill
Copy link

pmorrill commented Nov 2, 2020

Ach! Sorry - I take it back (must be Monday)! The problem IS the variable name. @steffilazerte , you need to use subNat2 as the variable name for it to be recognized in the API. When I do that I get the following record counts:

subNat2 = "CA.MB.11' => 481304
statProv = "MB" => 3777382

This seems ok.

However: @denislepage I still stand by the 2nd part of the previous statement: setFilterSubnat2() is not working as it should.

@steffilazerte
Copy link
Member Author

Good to know!

@denislepage
Copy link
Member

denislepage commented Nov 2, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants