Speed up wave.resource module #352
Conversation
This reverts commit a2d5f61.
…aframes and 2+ var datasets
@ssolson This PR is ready for review. Tests should pass now with some modifications to the type handling functions, and an appropriate …
Thanks @ssolson. I'll merge that in here and fix a couple of minor items with some examples.
@akeeste TODO:
@ssolson this PR is now ready for review and all tests are passing. I tightened up the timing on the environmental contours, 3 extreme response, and PacWave examples. A straightforward test case on the difference in computational expense is using a wave resource function (e.g. …)
@akeeste overall this addresses the issue. Thanks for putting this together. I have just a couple of questions and a few minor clean-up items.
```diff
@@ -442,11 +440,9 @@ def test_mler_export_time_series(self):
     mler["WaveSpectrum"] = self.mler["Norm_Spec"].values
     mler["Phase"] = self.mler["phase"].values
     k = resource.wave_number(wave_freq, 70)
-    k = k.fillna(0)
+    np.nan_to_num(k, 0)
```
This returns `k`, so it would need to be:

```python
k = nan_to_num(k, 0)
```

However, `k` has no nans, so I don't think this is needed.
The zero frequency in `wave_freq` results in a nan value. The call to `np.nan_to_num()` updates the input `k` in my testing, which is useful if `k` is not a numpy array, as the input data is both updated and retains its type. If it's clearer for this particular test, I can update to redefine `k`.
This is not a limiting factor to getting the PR through, but are you saying `np.nan_to_num` modifies `k` in place when you run it? The docs say the function returns the modified array, and that was my experience when I paused the code here: https://numpy.org/doc/2.0/reference/generated/numpy.nan_to_num.html
@ssolson you are correct. I didn't look at the docstring closely enough when converting pandas `fillna` to numpy `nan_to_num`. `np.nan_to_num(k, 0)` is setting `copy` to False and changing `k` in place, instead of specifying the fill value, which already defaults to 0. I'll update this line so that we're using the function correctly.
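To make the pitfall concrete, here is a plain-numpy illustration (standalone example, not MHKiT code) of the difference between the positional and keyword forms of `np.nan_to_num`:

```python
import numpy as np

# The second positional argument of np.nan_to_num is `copy`, not the
# fill value, so nan_to_num(k, 0) means copy=False: k is modified in place.
k = np.array([np.nan, 1.0, 2.0])
np.nan_to_num(k, 0)
assert k[0] == 0.0  # k was mutated in place

# The clearer, copying form names the fill value explicitly and
# assigns the returned array.
k2 = np.array([np.nan, 1.0, 2.0])
k2 = np.nan_to_num(k2, nan=0.0)
assert k2[0] == 0.0
```

Both forms end up filling NaNs with 0 because `nan` already defaults to 0.0; the keyword spelling just makes the intent explicit and avoids the in-place surprise.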
```diff
@@ -95,7 +95,8 @@ def test_kfromw(self):
     expected = self.valdata1[i]["k"]
     k = wave.resource.wave_number(f, h, rho)
-    calculated = k.loc[:, "k"].values
+    # calculated = k.loc[:, "k"].values
```
delete
mhkit/utils/type_handling.py
Outdated
```python
# Rename to "variable" to match how multiple Dataset variables get converted into a DataArray dimension
data = xr.DataArray(data)
if data.dims[1] == "dim_1":
    # Slight chance their is already a name for the columns
```
their => there
mhkit/wave/performance.py
Outdated
```diff
+    LM: pandas DataFrame, xarray DatArray, or xarray Dataset
         Capture length
-    JM: pandas DataFrame or xarray Dataset
+    JM: pandas DataFrame, xarray DatArray, or xarray Dataset
         Wave energy flux
-    frequency: pandas DataFrame or xarray Dataset
+    frequency: pandas DataFrame, xarray DatArray, or xarray Dataset
```
xarray "DataArray", not "DatArray"
```diff
@@ -87,26 +86,24 @@ def elevation_spectrum(
             + "temporal spacing for eta."
         )

-    S = xr.Dataset()
-    for var in eta.data_vars:
```
This for loop allowed users to process multiple wave heights in a DataFrame, etc. Removing it means the user can only process one eta at a time.

Does this update require us to remove this functionality? Would it make sense to be able to parse each variable in a Dataset into a DataArray? Or to have the user create any needed loop outside of the function for simplicity?

E.g. the following test will fail. Can we make this work?
```python
def test_elevation_spectrum_multiple_variables(self):
    time = np.linspace(0, 100, 1000)
    eta1 = np.sin(2 * np.pi * 0.1 * time)
    eta2 = np.sin(2 * np.pi * 0.2 * time)
    eta3 = np.sin(2 * np.pi * 0.3 * time)
    eta_dataset = xr.Dataset({
        'eta1': (['time'], eta1),
        'eta2': (['time'], eta2),
        'eta3': (['time'], eta3)
    }, coords={'time': time})
    sample_rate = 10
    nnft = 256
    spectra = wave.resource.elevation_spectrum(
        eta_dataset, sample_rate, nnft
    )
```
If so, let's finish out this test and add it to the test suite.
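For what it's worth, the "loop over each variable" behavior requested here can be sketched in plain numpy, independent of xarray. `column_spectra` below is a hypothetical helper, not MHKiT's `elevation_spectrum`, and the unscaled periodogram is only a stand-in for the real spectral estimate; the point is just that each variable (column) is transformed independently:

```python
import numpy as np

def column_spectra(eta, sample_rate, nnft):
    """Compute a one-sided FFT-based spectrum for each column of eta
    independently, mirroring the per-variable loop discussed above.
    eta: 2-D array, time along axis 0, one column per variable."""
    eta = np.asarray(eta, dtype=float)
    freqs = np.fft.rfftfreq(nnft, d=1.0 / sample_rate)
    spectra = {}
    for i in range(eta.shape[1]):
        seg = eta[:nnft, i]                       # one variable at a time
        amp = np.fft.rfft(seg, n=nnft)
        spectra[f"eta{i + 1}"] = np.abs(amp) ** 2 / (nnft * sample_rate)
    return freqs, spectra

# Each column gets its own spectrum, as in the Dataset test above.
time = np.linspace(0, 100, 1000)
eta = np.column_stack([np.sin(2 * np.pi * 0.1 * time),
                       np.sin(2 * np.pi * 0.2 * time)])
freqs, spectra = column_spectra(eta, sample_rate=10, nnft=256)
```

The spectral peaks land near 0.1 Hz and 0.2 Hz respectively, matching the input sines.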
Users can still input Datasets, but right now all variables must have the same dimensions; the input is converted to a DataArray up front, and the function then returns a DataArray.

I'll look at reinstating a Dataset/DataFrame loop so that those types are returned.
I added this loop again. Our previous slowdown of large pandas DataFrames --> xarray Datasets could occur in these two functions now, but I don't think that is a typical use case. For example, in the case described in #331, it's unlikely a user would have thousands of different wave elevation time series and convert them all to wave spectra (likewise for thousands of distinct spectra being converted to elevation time series). If that case does come up, the slowdown should not be due to our implementation but to the large quantity of data involved.
Adam, the last thing I was checking this morning was adding the above test to the testing suite. Currently, though, when I pass a Dataset I am getting a DataFrame back. I think the idea was for the function to return the same type as the user passes?
Right now if a user inputs a multivariate type (Dataset, DataFrame), they will get a multivariate type back, but whether it's a Dataset or DataFrame is still controlled by the `to_pandas` flag, which is consistent with other functions containing this flag.

It might be worth making a more complex default for `to_pandas`, where a user could specify True/False, and if they don't specify, the function defaults to returning the type (xarray/pandas) that they input. However, this should be a separate PR that addresses all instances of this flag for consistency.
Apologies. That was an oversight on my part on the `to_pandas` parameter.

I still want to add the above test to the testing suite: akeeste#6
```diff
 omega = xr.DataArray(
     data=2 * np.pi * f, dims=frequency_dimension, coords={frequency_dimension: f}
 )

-eta = xr.Dataset()
-for var in S.data_vars:
```
Same as above: this removed the ability to iterate over multiple columns, but we still accept Datasets and multi-column pandas.
mhkit/wave/resource.py
Outdated
```diff
@@ -1153,7 +1164,7 @@ def wave_number(f, h, rho=1025, g=9.80665, to_pandas=True):
     """
     if isinstance(f, (int, float)):
         f = np.asarray([f])
-    f = convert_to_dataarray(f)
+    # f = convert_to_dataarray(f)
```
delete
…levation_spectrum
@ssolson I addressed all your comments and again allowed datasets into …
* matplotlib >=3.8
* remove debug
All tests passed. Merging!
@ssolson This is a follow-up to my other wave PRs and resolves #331. Handling the various edge cases robustly in pure numpy is difficult, so I want to first resolve #331 by using DataArrays throughout the wave resource functions instead of Datasets.
Similar to Ryan's testing mentioned in #331, I found that using DataArrays/pandas gives a 1000x speed-up vs Datasets for very large input data. This should restore MHKiT's speed to its previous state. Using a pure numpy base would give an additional 5-10x speed-up over DataArrays, but I think the current work with DataArrays will:
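For context on `wave_number` itself, the function at the center of this PR's timing comparisons: it solves the linear dispersion relation (2*pi*f)**2 = g*k*tanh(k*h) for the wave number k. The Newton iteration below is an illustrative, numpy-only sketch (not MHKiT's implementation, which also handles the type conversions discussed above):

```python
import numpy as np

def wave_number_sketch(f, h, g=9.80665, tol=1e-10, maxiter=100):
    """Solve (2*pi*f)**2 = g*k*tanh(k*h) for k by Newton's method,
    starting from the deep-water guess k = omega**2 / g."""
    omega = 2 * np.pi * np.asarray(f, dtype=float)
    k = omega**2 / g
    for _ in range(maxiter):
        t = np.tanh(k * h)
        residual = g * k * t - omega**2
        slope = g * t + g * k * h * (1.0 - t**2)  # d/dk of g*k*tanh(k*h)
        step = residual / slope
        k = k - step
        if np.all(np.abs(step) < tol):
            break
    return k

# In deep water (k*h large), tanh -> 1 and k approaches omega**2 / g.
k_deep = wave_number_sketch(0.1, h=1000.0)
```

Because the whole iteration is vectorized numpy, it runs on a scalar frequency or a large frequency array alike, which is the kind of workload where the DataArray-vs-Dataset overhead discussed in this PR shows up.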