-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect records returned with subnational2 filters? #11
Comments
I am not sure how this works, in the db procedures. @denislepage or @cjardine-bsc may have insight? I can take a look at my own code tomorrow, to verify it is doing what it can. |
yup, something wonky going on there for sure. There should only be at max 481k records in CA.MB.11 I'm realizing that I don't actually know what proc your API is using. I'll check the documentation. |
Hmm, that level of detail on the DB side isn't covered https://github.com/BirdStudiesCanada/NatureCountsAPI/blob/master/NatureCounts_API_v1.00.md Paul, can you let me know what proc you're using? Merci! |
I'll track this in the morning, and send any helpful details that I can find! |
It's not a proc in this case, but a query built from DataRequest in Java. The subnational2_code field in the bmde_data is set based on the coordinates when the data is added, so it's a pretty simple rule. As CJ points out, there are 481k records in Winnipeg for all projects and levels. There's likely either a boolean condition that is wrong in DataRequest (but there hasn't been any changes on that for ages), or the parameter is incorrectly set from the API. What other filters were you using? |
No other filters |
Something is wrong, yes. If I request CA.MB.11 filter on MONARCHWATCH, the record count returns 24, via this query:
But if I request CA.MB filter on MONMARCHWATCH, the record count returns 145665, via this query (note there is no subnational2_code in the query at all:
In both cases, the code itself is added to the DataRequest instance via a call to DataRequest#addFilterShapes(String[] s). @denislepage why is it not recognizing the CA.MB filter? Is it not legit? |
I think that's actually the opposite of my problem, I was finding that the subnational2 queries were returning way more records than the statprov = "MB" query... |
The answer seems to be in this regex, found on 863 of DataRequest:
This regex only recognizes the form CA.MB.11 and not CA.MB. For the moment, I will switch to setting this parameter using DataRequest#setFilterSubnat2(..) directly. But you might want to investigate, Denis. |
Yes - your problem is exhibiting opposite result, but we need to fix this and then re-run your queries. The regex in question might be the cause, anyway. Can you run your tests just using MONARCHWATCH? I cannot sum data the way you do, so give me some specific collection results so it's easier to debug. |
Sure, I get 24 records with statprov = "MB" and 145665 records with subnational2 = "CA.MB.11" (I also get 145665 with no region filter, just the collections filter). |
I experimented more with this and there is a problem. The filterSubnat2 variable in DataRequest appears to be treated as a numeric value. That does not work with our handling of the subnat2 parameter (which is a string). Therefore, I cannot try to set the filterSubnat2 variable directly, and am reverting to using the original DataRequest#addFilterShapes call. The end result: absolutely nothing is solved. Time for @denislepage to take a look |
One other note: Steffi's results may be as expected. I am not sure that the 'subnat2' parameter matches anything other than a literal string. So if a collection record is attached to CA.MB.11 it will not also be attached to CA.MB automatically. This would lead to reduction in record count as seen by steffi. Again: need input from Denis. |
I found 2 issues:
1) Steffi is passing subnational2, but in APIJsonParams.buildAknDataRequest, you are looking for subNat2. And, to complicate things further, the reference entry point for subnat2 (metadata/subnat2) returns a field called subnat2_code. The field name in the bmde_data table itself is subnational2_code, and that may be the field name that people see when they download data(?). I don’t think I can come up with yet another way to refer to this concept that isn’t a fruit or an animal name. Seems like the API expects subNat2, for now. If users are not really seeing these other variants, it may not matter as much that they are not consistent. The bmde_field would be the most problematic to change, though we can rename it in the API outputs if necessary.
2) The other issue is that filterShapes in DataRequests is not used for this. You should instead use setFilterSubnat2 and pass a string array. I don’t think a regex is used in this case to validate the entry, but invalid entries will return no data.
Filter shapes is a different thing that is mostly useful for data summaries that isn’t relevant for the API(at least yet). DataRequests object can also be used to build summary queries (e.g. weekly or annual bar charts).
If you want to see an actual application of filter shapes, you can look at this data summary page that generates barcharts, for instance, but that is not immediately relevant (queries using advanced options would rely on shapes and send a slower summary query to bmde_data instead of a cached data summary): https://www.birdscanada.org/birdmon/default/datasummaries.jsp
As Paul says, if you mix filter fields of different levels (subnat2, province and country), it will use the finest of the 3 levels and ignore the rest to make the query more efficient (CA.MB.11 would not make sense with anything other than MB as province, or CA as the country, and not including the field leads to faster results). But, because we are currently using the wrong field, the query doesn’t include the subnat2 field, but also ignores the province.
I didn’t quite get the references to subnat2 being numeric.
|
Wow! Sounds like fun! In the R package, we use The R package never sends more than one region filter, which can be: country OR statprov OR subnational2 OR iba OR bcr OR utm_squares OR bbox. |
This has nothing to do with naming of the variables that you pass, Steffi. I interpret your variable (subNat2) with no problem. Don't change that. @denislepage : if I pass the subnat2 code into DataRequest#setFilterSubnat2(), I get the following query back from DataRequest#getSQLToCountResults():
You can see that this will not work (the varchar is not being handled as a String). Look at line 830 in DataRequest to see why this happens: the third parameter is set to false, so the filter is treated as numeric. I mentioned this above. For this reason, I am using addFilterShapes(), and it does work fine for proper subnat2 strings such as 'CA.MB.11'. It generates the following sql (which is perfectly acceptable):
|
Ach! Sorry - I take it back (must be Monday)! The problem IS the variable name. @steffilazerte , you need to use subNat2 as the variable name for it to be recognized in the API. When I do that I get the following record counts: subNat2 = "CA.MB.11' => 481304 This seems ok. However: @denislepage I still stand by the 2nd part of the previous statement: setFilterSubnat2() is not working as it should. |
Good to know! |
Ah. Now I get the reference to numeric. I will fix that, but sounds like the shape filter has the same effect, so if that works, you can leave as is.
Is there still an issue left?
|
@pmorrill perhaps something for you?
When I send a query to list_collections with the filter specifying the Winnipeg subnational2 code:
filter={"subnational2":["CA.MB.11"]}
I get 42,222,856 records, which is way more than the 1,533,348 records I get when I use the filter specifying all of Manitoba:
filter={"statProv":["MB"]}
only
Am I specifying the subnational2 code incorrectly? If I am, would it be better to return an error than all the records?
The text was updated successfully, but these errors were encountered: