-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Automatic threshold for peaks-over-threshold #268
Conversation
Sync `develop` & `master`
This reverts commit 4477580.
@ssolson I cleaned this up quite a bit and made a Matlab minimal example form the code I got from Seongho. He unfortunately left which makes figuring some of this out more difficult. The problem is that the solution I get with my Python implementation is slightly different from what I get with the original Matlab code. I think the issue might be that I am not accounting for all the metrics they used for "independent storms". I would appreciate some help if anyone else can take a look at this. The failing errors don't seem to be related to this PR. |
Carlos thank you. The errors are unrelated to this PR and are fixed in #241 which should be reviewed and approved soon. I took a bit of a look at his today. Running the tests I see that I get the same answer as you have specified but I was wondering how big is the difference from the Matlab version? I started looking into the generalized pareto distribution in scipy vs matlab via the docs but I was ultimately hindered by not being able to run the Matlab version without getting Matlab installed on my machine. It seems like the idea would be to step-by-step compare the python vs Matlab output to see where the difference arises. @akeeste are you setup to run Matlab and do you have time to take a look into this further? |
@ssolson I'm set-up for this. I will review this and debug next week. |
@cmichelenstrofer I made some progress today and noting where I'm at for reference. I noticed a couple things, but I'm still trying to figure out how much they affect the end result, given the input Hs used in your working example. I'll refer to Seongho's original code as the 'matlab' version and your refactored code as the 'python version'
I'll continue debugging next week |
I'm still digging into the settings of various functions, but I can confirm that the functions for percentile, Generalized Pareto distribution fitting, and probability plotting functions all give slightly different results between their respective MATLAB and SciPy versions. I'll see if there are optional settings we can use to converge the results Updates: PercentileMATLAB's The Hazen method requires NumPy>1.22.0. For older versions (1.22.0 > Numpy > 1.9), the 'higher' linear method is very close, but not identical to MATLAB's percentile. Generalized Pareto distribution fittingMATLAB and SciPy are giving different distribution parameters when the excess is fit to a Generalized Pareto Distribution. Both MATLAB's GeneralizedParetoDistribution (description) and SciPy's stats.genpareto have an identical definition. MATLAB's fit and SciPy's fit both estimate the distribution parameters using Maximum Likelihood Estimation and assume that the location/threshold is zero. A zero threshold is correct given that both versions of the code already use the excess (peaks-threshold) to estimate the tail distribution. MATLAB fitting does not take any other options. SciPy takes initial guesses for shape, scale, etc but that outcome seems unpredictable. I'm not sure what else can be done to make the distribution fitting converge. There are not many points in the tail. In this example, even the 90th percentile only has 9 points. This will only decrease as the percentile is iteratively increased. Example with excess defined using a 90th percentile threshold:
Probability plottingMy stats background begins to fail here. I'm not confident in how the correlations are each defined, but when given the same distribution parameters the results are nearly identical.
Example, given the resultant MATLAB distribution above:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmichelenstrofer Suggestions to improve a couple minor issues with the storm peak calculation. Otherwise, the storm peak calculation seems to pull all of the same independent storm peaks that the original matlab code does.
The comparison is closer. I think the remaining discrepancy is from the distribution fitting described above. Comparison after these changes:
Matlab: 99.53%, 1.041 m
Python: 99.13%, 1.032 m
Co-authored-by: Adam Keester <[email protected]>
Co-authored-by: Adam Keester <[email protected]>
Co-authored-by: Adam Keester <[email protected]>
Co-authored-by: Adam Keester <[email protected]>
@akeeste this is great! Thanks for looking into this in so much detail! I think the results are basically identical. The very tiny difference is from the techniques to fit the distribution. Do you think it is ready merge? |
@cmichelenstrofer Sounds good. Can you first remove the MATLAB working example and updated the test result to 99.13% and 1.032m. After that we can merge |
Seems like I could commit the suggestion directly to the branch. I mistyped last time, the percentile still returns as a fraction. I updated the test to compare to 0.9913. The test should pass now |
Update threshold test value one more time
Overall, I'm happy with this PR as is. I have a few suggestions, neither required but I think would round out the PR nicely.
Lastly, I noticed the use of a nested function. Historically, we've kept these functions separate. While nesting does highlight that the inner function is only needed by the parent function, it can also make the code appear more cluttered. I'm not opposed to the current structure, but I wanted to initiate a brief discussion on this topic. What are your thoughts? |
To wrap up this PR I
I leave the submodule-level docstring and any other optional developments for future PRs. All tests have passed. Merging this PR |
# MHKiT v0.8.0 We're excited to announce the release of MHKiT v0.8.0, which brings a host of new features, enhancements, and bug fixes across various modules, ensuring compatibility with Python 3.10 and 3.11, and introducing full xarray support for more flexible data handling. Significant updates in the Wave and DOLfYN modules improve functionality and extend capabilities. ## Python 3.10 & 3.11 Support MHKiT now supports python 3.10 and 3.11. Support for 3.12 will follow in the next minor update. - #240 ## Wave Module ### Enhancements: **Automatic Threshold Calculation for Peaks-Over-Threshold**: We've introduced a new feature that automatically calculates the "best" threshold for identifying significant wave events. This method, originally developed by Neary, V. S., et al. in their 2020 study, has now been translated from Matlab to Python, enhancing our existing peaks-over-threshold functionality. **Wave Heights Analysis**: A new function, `wave_heights`, has been added to extract the heights of individual waves from a time series. This function uses zero up-crossing analysis to accurately measure wave heights, improving upon our previous methods which only provided the maximum value between up-crossings. **Enhanced Zero Crossing Analysis**: Building on the above, the zero crossing code previously embedded in `global_peaks` has been isolated into a helper function. This modular approach not only refines the codebase but also supports new functionalities such as calculating wave heights, zero crossing periods, and identifying crests. ### Bug Fixes: **Contour Sampling Error in Wave Contours**: A bug identified in `mhkit.wave.contours.samples_contour` has been resolved. The issue occurred when period samples defined using the maximum period resulted in values outside the interpolation range of the contour data. This was corrected by ensuring that all sampling points are within the interpolation range and adjusting the contour data selection process accordingly. - #268 - #252 - #278 ## Xarray Support MHKiT functions now fully support the use of xarray for passing and returning data. - #279 - #282 - #285 - #302 - #310 ## DOLfYN Thanks to the many user contributions and users of MHKiT the DOLFYN module include a significant number of enhancements and bug fixes. ### Enhancements: **Altimeter Support**: Enhanced the Nortek Signature Reader to add capability for reading ADCP dual profile configurations. **Data Handling Improvements**: Introduced logic to skip messy header data that can accumulate during measurements collected via Nortek software on PCs and Macs. **Instrument Noise Subtraction**: Added a function to subtract instrument noise from turbulence intensity estimation using RMS calculations, providing results that differ by approximately 1% from the existing standard deviation-based "I" property. **Improved File Handling**: Updates for RDI files to handle changing "number of cells" and variable "cell sizes," which are now bin-averaged into the largest cell size. ### Bug Fixes: **Power Spectra Calculation**: Fixed a bug where a given noise value was not being subtracted from the power spectra, and noise was inadvertently added as an input to dissipation rate calculations. **Improved Header Handling**: Allowed RDI reader to skip junk headers effectively. **Nortek Reader C Types Update**: Adjusted C types in the Nortek reader to handle below-zero water temperatures and to allow pitch and roll values to go negative. - #280 - #289 - #290 - #292 - #293 - #294 - #299 ## River & Tidal: D3D Added limits to `variable_interpolation` and added 3 array input capability to `create_points` - #271 ## Developer Experience ### Black formatting Black formatting is now enforced on all MHKiT files. This ensures consistent formatting across the MHKiT package. - #281 ### Linting & Type Hints MHKiT is in the process of enforcing pylint and adding type hints to all functions. Currently this has been achieved and is enforced in the Loads and Power modules. - #288 - #296 ### CI/CD This release introduces significant reduction in testing time for development. This is achieved by reducing the number of tests for pulls against the develop branch and only running hindcast test when changes are made to it. A bug in the hindcast CI was fixed which only ran on changes to the hindcast tests instead of the hindcast module. Additionally the wave and wind hindcast needed to be separated in 2 jobs due to the excessive time taken to run a wind cache. This created a number of follow on PRs around solidifying the logic of these job. A special case for Python 3.8, pip, and Mac OS was added to use homebrew to install NetCDF and HDF5 to get tests to pass. - #241 - #270 - #306 - #311 - #317 - #318 - #319 - #320 - #324 ### Clean Up MHKiT fixed an implementation error where functions used assert instead of built in errors for type and value checking. Additionally these PRs removed unused files, fixed typos, and created an argument which allows users to run CDIP API calls silently. - #276 - #272 - #273
replaces #214.
MHKiT already has a peaks-over-threshold method, which takes in threshold as one of the inputs. This method here is to automatically calculate the "best" threshold. It is from a paper by Vince and Seongho:
Neary, V. S., S. Ahn, B. E. Seng, M. N. Allahdadi, T. Wang, Z. Yang and R. He (2020). "Characterization of Extreme Wave Conditions for Wave Energy Converter Design and Project Risk Assessment.” J. Mar. Sci. Eng. 2020, 8(4), 289; https://doi.org/10.3390/jmse8040289.
They had it in Matlab and I converted it to Python here.