Script the installation (and updates) of NCBI taxonomy data #14
Comments
From @jparham email on Dec 9, 2014, at 12:57 AM:
The process of importing newer NCBI data is not automatic, but it’s pretty straightforward for someone with moderate “sysadmin” skills. The NCBI “dump” files are large, so the process takes a little time to complete. It would also be prudent to back up the FCDB data beforehand as a precaution, at least until this has become routine. NCBI updates could be done as part of annual maintenance, or we can try to script it to be entirely hands-off. (There's already a placeholder for this in the FCDB’s admin dashboard.) I thought that this data was slow to change, but @hlapp explained that NCBI updates are actually quite frequent (sometimes daily). To me, this means the frequency of updates to FCDB is more of an editorial decision.
This is of course the other interesting question: Does an NCBI update create more work, or screw up the data in FCDB? This should not be an issue. By design, our calibrated nodes “float” along with changes to NCBI, so updating the NCBI taxonomy then running these tasks in the admin dashboard should bring everything nicely up to date:
This assumes that NCBI identifiers are never discarded, which I understand to be the case. After the updates above, the system will show subtle differences:
Interesting questions, and important ones too. Most NCBI taxonomy updates will be irrelevant to FCD because most reflect relatively low-level relationships, groups without good fossil records, minor changes that affect only one or two taxa, or simple changes in rank. Only major changes will really need to be updated in FCD: ones where relationships of groups with a good fossil record are radically reorganized because of improvements in phylogenetic understanding (whales moving into Artiodactyla, for example). These big changes are probably also the ones most likely to break FCD scripts. Would it be worth experimenting with the robustness of the code? We could manually alter an NCBI data file and import it into a test version of FCD.
Yes, the old dev site at http://fossils.ibang.com/ has older calibrations and test data, but we can use it to test the process and resulting changes.
I am all for a test run with the old dev site. Shall we proceed? It would be good to see how everything reacts before we come to the point of needing to do it for real.
Of course, this won't necessarily help us understand how future changes to the underlying hierarchy may affect node pinning. I.e., we could get a false "ok" signal.
We could get a false "ok", of course, but it's about the only way to test it that I know of.
I think we should do the test. But before we do, we should check whether any of the changed parts of the NCBI taxonomy in the comparison involve calibrations that we have in the database. One other option, which I mentioned before, is freezing it. A third option would be to optimistically go ahead, let the NCBI hierarchy change, and roll it back if there is a problem. I kind of favor this option, if rolling it back would not be too difficult.
Sorry, I didn't mean to gloss over this. It's certainly an option, and guarantees a minimum of surprises. Naturally, some searches may suffer if someone is expecting the site to have the latest NCBI taxonomy.
This is fairly easy to do, provided
In this case, I'd recommend you keep a sysadmin in the loop, possibly treating this as a planned, annual maintenance operation as suggested in #55. In principle, it could be scripted from the admin dashboard, but in practice I wouldn't be comfortable with an easy, "full auto" version of this.
What I was thinking for the test is to purposefully change parts of the taxonomy that are relevant to calibrations that have been entered. That said, I don't mind Jim's option of simply freezing it. It will still meet most FCD purposes even if the agreement isn't perfect.
David, my apologies, I didn't realize that you meant to purposefully change parts of the taxonomy; that makes sense. Please see what JimA says above about backing it up. Is this something that is reasonable moving forward? If so, then we can accept the updates and just revert if there is an issue. That would be better than a freeze. But if it is not easy to revert, then a freeze would be best.
Yes, based on the SQL script I used to set up this table, it looks like we can simply modify the
FYI, I'm working on a test today. This will include "before" and "after" scans of all calibrated nodes, tracing the lineage of each pinned node. This should generate a watch list of calibrations that need review in the new taxonomy.
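For anyone following along, here is a rough sketch of what a "before" and "after" lineage scan could look like. This is not the code used in the test; it assumes (hypothetically) that each taxonomy version has already been loaded into a plain child-to-parent dict and that the pinned taxids for each calibration are available as a dict. The names `lineage`, `watch_list`, `old_parents`, `new_parents` and `pinned` are made up for illustration.

```python
def lineage(taxid, parent_of):
    """Return the chain of taxids from `taxid` up to the root.

    `parent_of` maps child taxid -> parent taxid. In NCBI's nodes.dmp the
    root (taxid 1) lists itself as its own parent, so we stop there.
    A taxid missing from the map yields an empty lineage.
    """
    path = []
    seen = set()  # guards against accidental cycles in bad data
    while taxid in parent_of and taxid not in seen:
        seen.add(taxid)
        path.append(taxid)
        parent = parent_of[taxid]
        if parent == taxid:
            break
        taxid = parent
    return path


def watch_list(pinned, old_parents, new_parents):
    """Map calibration id -> pinned taxids whose lineage differs between versions."""
    report = {}
    for calibration_id, taxids in pinned.items():
        changed = [
            t for t in taxids
            if lineage(t, old_parents) != lineage(t, new_parents)
        ]
        if changed:
            report[calibration_id] = changed
    return report


# Toy data: taxid 3 moves from under 2 to directly under the root.
old_parents = {1: 1, 2: 1, 3: 2, 4: 2}
new_parents = {1: 1, 2: 1, 3: 1, 4: 2}
pinned = {215: [3], 218: [4]}
print(watch_list(pinned, old_parents, new_parents))
```

Because a taxid that is missing from the new map gets an empty lineage, deleted nodes would also land on the watch list under this approach.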
UPDATE: I've been testing an NCBI update on the old dev site (fossils.ibang.com) with mixed results. As described above, I can generate a report of all calibrations that need review, but there are others that need immediate repairs before the Browse feature can be made to work with the new taxonomy.
I was mistaken. While these identifiers are not re-used, it's actually very common for their nodes to be removed from the database. From the original NCBI Taxonomy database paper:
And sure enough, in my test update we have a few calibrations that were pinned to taxids that have since been deleted. I'm working now on a report that will flag these calibrations as they're discovered. The fix is actually straightforward: just edit each offending calibration and refresh its entries in section "4. Locate this calibration within the NCBI tree". Re-entering the same taxon names (or sometimes picking a new substitute) sets things right. More notes to come as I learn more, but suffice it to say that updating the NCBI taxonomy will not be a fully automatic process, and will almost always require some curation time.
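As an illustration of the kind of report described here, the sketch below flags calibrations pinned to deleted taxids. It assumes the deleted ids come from the `delnodes.dmp` file in the NCBI taxdump (one taxid per line, followed by a tab and a `|`), and that pinned taxids are available per calibration as a dict. The function names and the `pinned` structure are hypothetical, not FCDB's actual schema.

```python
def read_deleted_taxids(lines):
    """Parse delnodes.dmp-style lines (e.g. '12345\t|') into a set of ints."""
    deleted = set()
    for line in lines:
        field = line.split("|")[0].strip()
        if field:  # skip blank lines
            deleted.add(int(field))
    return deleted


def calibrations_with_deleted_pins(pinned, deleted):
    """Map calibration id -> sorted pinned taxids that no longer exist."""
    report = {}
    for calibration_id, taxids in pinned.items():
        gone = sorted(set(taxids) & deleted)
        if gone:
            report[calibration_id] = gone
    return report


# Toy data; in practice the lines would come from open("delnodes.dmp").
deleted = read_deleted_taxids(["100\t|\n", "200\t|\n"])
pinned = {215: [100, 5], 218: [6]}
print(calibrations_with_deleted_pins(pinned, deleted))
```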
I've updated the three calibrations (215, 218, 220) on fossils.ibang.com that needed revisions to the calibrated node location, and now it seems all's well. This required a bit of jumping around (rebuilding the different tables in the Admin Dashboard, sometimes more than once, then re-testing in the MySQL interactive console). The final report flags almost all the calibrations in the system as needing review. In each case, the "pinned" nodes that tie each node to the NCBI taxonomy have seen changes in their NCBI lineage, which means they could in principle show up in a new place in the Browse view, or in different clades in Search. This was an unusually long lag between NCBI taxonomy versions, but it suggests once again that there's a lot to review when we update the taxonomy.
Thanks, Jim (A). This may mean that we want to follow Jim P's suggestion to simply freeze the taxonomy until there are major upgrades to the system.
Agreed, I see no other way.
By "upgrades to the system", do you mean changes in the NCBI taxonomy, or new features in FCDB? Because there's not much more we can do technically to overcome the need for review. OK, there's one possible improvement: NCBI provides information about merged nodes as well as deleted nodes; in the case of a merge, we could probably trace these and re-pin to the new (merged) node. But that was not an issue in this NCBI update. Also, keep in mind that most (or even all) of the calibrations needing review are probably just fine in the new taxonomy. It's more of a sanity check. You might even decide to go ahead with NCBI updates, re-pin the nodes whose NCBI targets were deleted, and postpone further work pending user complaints. I suppose the bottom line is: does all this work result in a noticeably better site? The quickest way to judge this is to compare the Browse results on fossils.ibang.com (latest NCBI) versus those on fossilcalibrations.org (early-2013 NCBI).
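To make the merged-node idea concrete, here is a minimal sketch of how re-pinning might work. It assumes the merge information comes from the taxdump's `merged.dmp` file (old taxid and new taxid, separated by `|`), and it follows chains of merges defensively even though the file may already point straight at the current id. `read_merged_taxids` and `repin` are hypothetical helper names, not existing FCDB code.

```python
def read_merged_taxids(lines):
    """Parse merged.dmp-style lines ('old\t|\tnew\t|') into {old: new}."""
    merged = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            merged[int(fields[0])] = int(fields[1])
    return merged


def repin(taxid, merged):
    """Follow merges from `taxid` to its surviving id (unchanged if never merged)."""
    seen = set()  # guards against a cycle in malformed data
    while taxid in merged and taxid not in seen:
        seen.add(taxid)
        taxid = merged[taxid]
    return taxid


# Toy data; in practice the lines would come from open("merged.dmp").
merged = read_merged_taxids(["12\t|\t74109\t|\n"])
print(repin(12, merged))   # old id resolves to the merged node
print(repin(500, merged))  # untouched ids pass through
```

A re-pinned node would presumably still belong on the review list, since its lineage may differ from the old one.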
Addresses #14, with a manual solution for now.
While it's all fresh in my mind, I've gone ahead and documented the tools and methods used in this test. This process requires a moderately skilled sysadmin and at least one subject-matter expert, and should be accompanied by a window of planned downtime ⌚ and a fresh cup of coffee. ☕
This is currently only documented in the script db/database-migration-002.sql (link). Ultimately, it should be available from the site's Admin Dashboard page as an easily repeated task. For now, we should offer a script to download and install the latest NCBI taxonomy, then update the db's internal timestamp for this admin task.
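As a starting point for discussion, the download half of such a script might look something like the sketch below. It only fetches and unpacks the dump files; loading them into MySQL and updating the timestamp are left out because those steps depend on db/database-migration-002.sql, which is not reproduced here. The URL is, to my knowledge, NCBI's public location for the taxdump archive, but treat it (and the helper names and the list of wanted files) as assumptions to verify.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# Assumed location of the public NCBI taxonomy dump; verify before relying on it.
TAXDUMP_URL = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz"
# Files this sketch cares about; the real import may need others.
WANTED = ("nodes.dmp", "names.dmp", "merged.dmp", "delnodes.dmp")


def download_taxdump(dest_dir, url=TAXDUMP_URL):
    """Download the taxdump archive into dest_dir and return its path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / "taxdump.tar.gz"
    urllib.request.urlretrieve(url, str(archive))
    return archive


def extract_dump_files(archive, dest_dir, wanted=WANTED):
    """Extract only the wanted .dmp files; return the paths actually written."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive, "r:gz") as tar:
        for name in wanted:
            try:
                member = tar.getmember(name)
            except KeyError:
                continue  # file not present in this archive
            source = tar.extractfile(member)
            if source is None:
                continue  # not a regular file
            target = dest / name
            with open(target, "wb") as out:
                shutil.copyfileobj(source, out)  # stream; names.dmp is large
            extracted.append(target)
    return extracted


if __name__ == "__main__":
    archive = download_taxdump("ncbi_taxdump")
    for path in extract_dump_files(archive, "ncbi_taxdump"):
        print("extracted", path)
```

Copying members out by name, rather than calling `extractall`, keeps unexpected paths in the archive from being written to disk.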