
Speed up wave.resource module #352

Merged 40 commits on Oct 22, 2024

Conversation

akeeste
Contributor

@akeeste akeeste commented Sep 17, 2024

@ssolson This is a follow-up to my other wave PRs and resolves #331. Handling the various edge cases robustly in pure numpy is difficult, so I want to first resolve #331 by using DataArrays throughout the wave resource functions instead of Datasets.

Similar to Ryan's testing mentioned in #331, I found that using DataArrays/pandas gives a 1000x speedup vs Datasets for very large input data. This should restore MHKiT's speed to its previous state. Using a pure numpy base would give an additional 5-10x speedup over DataArrays, but I think the current work with DataArrays will:

  • be sufficient for our users
  • be easier to develop with
  • be easier to handle edge cases
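The speed gap being described can be illustrated with a small, self-contained timing sketch (illustrative only, not MHKiT code; the spectral-moment computation and sizes here are made up): a vectorized operation on one 2-D DataArray versus the same operation looped over many single-spectrum Dataset variables.

```python
import time

import numpy as np
import xarray as xr

# Hypothetical data: many spectra indexed by time and frequency.
n_spectra, n_freq = 500, 100
freq = np.linspace(0.01, 1.0, n_freq)
data = np.random.rand(n_spectra, n_freq)

da = xr.DataArray(
    data,
    dims=["time", "frequency"],
    coords={"time": np.arange(n_spectra), "frequency": freq},
)
ds = da.to_dataset(dim="time")  # one Dataset variable per spectrum

t0 = time.perf_counter()
m0_da = da.integrate("frequency")  # zeroth moment, all spectra at once
t_da = time.perf_counter() - t0

t0 = time.perf_counter()
m0_ds = {v: ds[v].integrate("frequency") for v in ds.data_vars}
t_ds = time.perf_counter() - t0

print(f"DataArray: {t_da:.4f}s, Dataset loop: {t_ds:.4f}s")
```

The per-variable overhead of the Dataset loop grows with the number of spectra, which is where the large-input slowdown comes from.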

@akeeste akeeste requested a review from ssolson September 17, 2024 19:07
@akeeste akeeste marked this pull request as ready for review October 2, 2024 18:34
@akeeste akeeste marked this pull request as draft October 2, 2024 18:44
@akeeste akeeste marked this pull request as ready for review October 10, 2024 17:24
@akeeste
Contributor Author

akeeste commented Oct 10, 2024

@ssolson This PR is ready for review. Tests should pass now.

With some modifications to the type handling functions, and an appropriate frequency_dimension input if required, the wave.resource functions should handle pandas Series, pandas DataFrames, and xarray DataArrays regardless of input shape, dimension names, dimension order, etc. I largely moved away from converting data to xarray Datasets because they are slow and more difficult to work with.
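The kind of type normalization being described could be sketched like this (a hypothetical helper, not the actual MHKiT convert_to_dataarray implementation): funnel every supported input type into a DataArray up front so the rest of the function only deals with one type.

```python
import pandas as pd
import xarray as xr

# Hypothetical sketch: normalize supported inputs to a DataArray.
def to_dataarray(data):
    if isinstance(data, xr.DataArray):
        return data
    if isinstance(data, (pd.Series, pd.DataFrame)):
        # xarray picks up dims from the pandas index/columns
        # (or names them dim_0/dim_1 when unnamed).
        return xr.DataArray(data)
    raise TypeError(f"Unsupported input type: {type(data)}")
```

A frequency_dimension argument would then only be needed when the dimension names carried over from the input are ambiguous.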

@akeeste
Contributor Author

akeeste commented Oct 10, 2024

mhkit.loads.graphics is unchanged, so I'm not sure why the pylint loads test is now failing on the number of positional arguments there. This branch is up to date with develop.

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

Thanks @ssolson. I'll merge that in here and fix a couple minor items with some examples.

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

@akeeste TODO:

  • fix a couple last issues with example notebooks
  • reduce the runtime of the example notebooks to restore the previous speed benchmark

@akeeste
Contributor Author

akeeste commented Oct 14, 2024

@ssolson this PR is now ready for review and all tests are passing. I tightened up the timing on the environmental contours, 3 extreme response, and PacWave examples.

A straightforward test case for the difference in computational expense is running a wave resource function (e.g. energy_period) on a year of NDBC spectral data, or repeating Ryan's script in #331.

@ssolson ssolson added wave module Clean Up Improve code consistency and readability labels Oct 16, 2024
Contributor

@ssolson ssolson left a comment


@akeeste overall this addresses the issue. Thanks for putting this together. I have just a couple questions and a few minor clean up items.

@@ -442,11 +440,9 @@ def test_mler_export_time_series(self):
mler["WaveSpectrum"] = self.mler["Norm_Spec"].values
mler["Phase"] = self.mler["phase"].values
k = resource.wave_number(wave_freq, 70)
k = k.fillna(0)
np.nan_to_num(k, 0)
Contributor


This returns k so it would need to be:
k = np.nan_to_num(k, 0)

However, k has no NaNs, so I don't think this is needed.

Contributor Author


The zero frequency in wave_freq results in a NaN value. The call to np.nan_to_num() updates the input k in my testing, which is useful if k is not a numpy array, as the input data is both updated and retains its type. If it's clearer for this particular test, I can update it to redefine k.

Contributor

@ssolson ssolson Oct 18, 2024


This is not a limiting factor to getting the PR through, but are you saying np.nan_to_num modifies k in place when you run it?

The docs say the function returns the modified array and that was my experience when I paused the code here:
https://numpy.org/doc/2.0/reference/generated/numpy.nan_to_num.html

Contributor Author


@ssolson you are correct. I didn't look at the docstring closely enough when converting pandas fillna to numpy nan_to_num. np.nan_to_num(k, 0) is setting copy to False and changing k in place, instead of specifying the fill value, which already defaults to 0. I'll update this line so that we're using the function correctly.
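The pitfall being discussed is easy to demonstrate in plain NumPy, independent of MHKiT: the second positional argument of np.nan_to_num is copy, not the fill value.

```python
import numpy as np

# np.nan_to_num(x, copy=True, nan=0.0, ...): the SECOND positional
# argument is `copy`, not the fill value.
k = np.array([np.nan, 1.0, 2.0])
np.nan_to_num(k, 0)  # copy=0 (falsy) -> modifies k in place, NaN -> 0.0
print(k)  # [0. 1. 2.]

# Intended, explicit usage: request a new array with the fill value named.
k2 = np.array([np.nan, 1.0, 2.0])
k2 = np.nan_to_num(k2, nan=0.0)
print(k2)  # [0. 1. 2.]
```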

@@ -95,7 +95,8 @@ def test_kfromw(self):

expected = self.valdata1[i]["k"]
k = wave.resource.wave_number(f, h, rho)
calculated = k.loc[:, "k"].values
# calculated = k.loc[:, "k"].values
Contributor


delete

# Rename to "variable" to match how multiple Dataset variables get converted into a DataArray dimension
data = xr.DataArray(data)
if data.dims[1] == "dim_1":
# Slight chance their is already a name for the columns
Contributor


their => there

Comment on lines 309 to 313
LM: pandas DataFrame, xarray DatArray, or xarray Dataset
Capture length
JM: pandas DataFrame or xarray Dataset
JM: pandas DataFrame, xarray DatArray, or xarray Dataset
Wave energy flux
frequency: pandas DataFrame or xarray Dataset
frequency: pandas DataFrame, xarray DatArray, or xarray Dataset
Contributor


xarray "DataArray", not "DatArray"

@@ -87,26 +86,24 @@ def elevation_spectrum(
+ "temporal spacing for eta."
)

S = xr.Dataset()
for var in eta.data_vars:
Contributor


This for loop allowed users to process multiple wave heights in a DataFrame, etc. Removing it means the user can only process one eta at a time.

Does this update require us to remove this functionality? Would it make sense to be able to parse each variable in a Dataset into a DataArray? Or to have the user create any needed loop outside of the function for simplicity?

E.g. the following test will fail. Can we make this work?

    def test_elevation_spectrum_multiple_variables(self):
        time = np.linspace(0, 100, 1000)
        eta1 = np.sin(2 * np.pi * 0.1 * time)  
        eta2 = np.sin(2 * np.pi * 0.2 * time)  
        eta3 = np.sin(2 * np.pi * 0.3 * time)  
        
        eta_dataset = xr.Dataset({
            'eta1': (['time'], eta1),
            'eta2': (['time'], eta2),
            'eta3': (['time'], eta3)
        }, coords={'time': time})
        
        sample_rate = 10  
        nnft = 256 
        
        spectra = wave.resource.elevation_spectrum(
            eta_dataset, sample_rate, nnft
        )

If so, let's finish out this test and add it to the test suite.

Contributor Author


Users can still input Datasets, but right now all variables must have the same dimensions because the Dataset is converted to a DataArray up front. The function then returns a DataArray.

I'll look at reinstating a dataset/dataframe loop so that those types are returned

Contributor Author


I added this loop again. Our previous slowdown from converting large pandas DataFrames to xarray Datasets could now occur in these two functions, but I don't think that is a typical use case. For example, in the case described in #331, it's unlikely a user would have thousands of different wave elevation time series and convert them all to wave spectra (likewise for thousands of distinct spectra being converted to elevation time series). If that case does come up, the slowdown should not be due to our implementation but to the large quantity of data involved.

Contributor


Adam the last thing I was checking this morning was adding the above test to the testing suite.

Currently, though, when I pass a Dataset I am getting a DataFrame back.

I think the idea was for the function to return the same type as the user passes?

Contributor Author


Right now if a user inputs a multivariate type (Dataset, DataFrame), they will get a multivariate type back, but whether it's a Dataset or DataFrame is still controlled by the to_pandas flag, which is consistent with other functions containing this flag.

It might be worth making a more complex default for to_pandas, where a user could specify True/False, and if they don't specify, the function defaults to returning the type (xarray/pandas) that they input. However, this should be a separate PR that addresses all instances of this flag for consistency.
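The more complex default being proposed could look roughly like this (a hypothetical sketch, not the current MHKiT API; format_output is an invented name): a tri-state to_pandas where None means "return the same family of type that the caller passed in".

```python
import pandas as pd
import xarray as xr

# Hypothetical sketch of a tri-state to_pandas default.
def format_output(result, original_input, to_pandas=None):
    if to_pandas is None:
        # Mirror the caller's input family: pandas in -> pandas out.
        to_pandas = isinstance(original_input, (pd.Series, pd.DataFrame))
    return result.to_pandas() if to_pandas else result

da = xr.DataArray([1.0, 2.0], dims="index")
print(type(format_output(da, pd.Series([0.0, 0.0]))))  # pandas Series
print(type(format_output(da, da)))                     # xarray DataArray
```

Explicit True/False would keep the current behavior, so existing callers would be unaffected.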

Contributor


Apologies. That was an oversight on my part on the to_pandas parameter.

I still want to add the above test to the testing suite akeeste#6


omega = xr.DataArray(
data=2 * np.pi * f, dims=frequency_dimension, coords={frequency_dimension: f}
)

eta = xr.Dataset()
for var in S.data_vars:
Contributor


Same as above: this removed the ability to iterate over multiple columns, but we still accept Datasets and multi-column pandas DataFrames.

@@ -1153,7 +1164,7 @@ def wave_number(f, h, rho=1025, g=9.80665, to_pandas=True):
"""
if isinstance(f, (int, float)):
f = np.asarray([f])
f = convert_to_dataarray(f)
# f = convert_to_dataarray(f)
Contributor


delete

@akeeste
Contributor Author

akeeste commented Oct 17, 2024

@ssolson I addressed all your comments and again allowed Datasets into wave.resource.surface_elevation and wave.resource.elevation_spectrum.

akeeste and others added 2 commits October 21, 2024 09:38
@akeeste
Contributor Author

akeeste commented Oct 22, 2024

@ssolson akeeste#6 is merged. Anything else you want to address in this PR?

@ssolson
Contributor

ssolson commented Oct 22, 2024

@ssolson akeeste#6 is merged. Anything else you want to address in this PR?

Thanks Adam. All good to merge assuming the tests pass.

@akeeste
Contributor Author

akeeste commented Oct 22, 2024

All tests passed. Merging!

@akeeste akeeste merged commit 6c2d013 into MHKiT-Software:develop Oct 22, 2024
41 checks passed
@akeeste akeeste deleted the speed_up_wave branch October 22, 2024 15:15
@ssolson ssolson mentioned this pull request Dec 4, 2024