Xarray Strategy causing significant MHKiT computation times (MHKiT slow) #331
Comments
Thanks Ryan. I found the same, and we have discussed it in #327. The issue stems from our approach to xarray and from not testing the run time of functions. I am addressing this through the testing of notebooks, and we will be timing them going forward in #330. With respect to xarray, we are adjusting our strategy, moving from a simplistic everything-in-xarray approach to a tailored one because of the slowness this introduced. @akeeste This issue can serve as the foundation for our xarray strategy. |
I think that the best concept is to write your functions to accept
# %%
import xarray as xr
from mhkit.wave.resource import significant_wave_height
import numpy as np
import timeit
import matplotlib.pyplot as plt
# %% [markdown]
# # Load data
# %%
ds = xr.open_dataset('cape_hatteras_download_spectra_2000.nc',)
ds = ds.rename({'time_index': 'time'})
ds['frequency'].attrs['units'] = 'Hz'
ds['frequency'].attrs['long_name'] = 'Frequency'
ds['direction'] = ds['direction']
ds['direction'].attrs['units'] = 'rad'
ds['direction'].attrs['long_name'] = 'Direction'
ds.attrs['gid'] = ds.gid.item()
ds = ds.drop_vars('gid').squeeze()
ds = ds.to_array()
ds = ds.drop_vars('variable').squeeze()
ds = ds / (1025*9.81)
dso = ds.integrate('direction')
dso.attrs['units'] = 'm$^2$/Hz'
dso.attrs['long_name'] = 'Spectral density'
dso.name = 'S'
dso
# %% [markdown]
# # Timing
# %%
time = {}
n = 20
# %% [markdown]
# ## Using MHKiT
# %%
time['mhkit'] = timeit.timeit(
    lambda: significant_wave_height(dso.to_pandas().transpose()), number=n)/n
# %% [markdown]
# ## Using numpy
# %%
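def moment(S,f, order=0):
    # Frequency bin widths; the first bin is taken to span from 0 to f[0]
    df = np.insert(np.diff(f), 0, f[0])
    # Spectral moment of the given order, summed over the frequency axis (axis 1)
    m = np.sum(df*S*f**order, axis=1)
    return m
def sig_wave_height(S,f):
    # Significant wave height from the zeroth moment: 4*sqrt(m0)
    return 4*np.sqrt(moment(S,f, 0))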
# %%
time['numpy'] = timeit.timeit(
    lambda: sig_wave_height(dso, dso['frequency']), number=n)/n
# %%
time
# %%
time['mhkit']/time['numpy']
# %% [markdown]
# # Check that they agree
# %%
significant_wave_height(dso.to_pandas().transpose()).to_numpy().squeeze() - sig_wave_height(dso, dso['frequency'])
|
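As a quick sanity check of the numpy functions above, here is a minimal sketch on synthetic data where the answer is known in closed form. It assumes plain numpy arrays of shape (time, frequency) rather than the xarray objects used in the script; the flat spectrum and the frequency grid are made up for illustration and do not come from the Cape Hatteras file.

```python
import numpy as np


def moment(S, f, order=0):
    # Frequency bin widths; the first bin is taken to span from 0 to f[0]
    df = np.insert(np.diff(f), 0, f[0])
    # Sum over the frequency axis (axis 1), leaving one value per time step
    return np.sum(df * S * f**order, axis=1)


def sig_wave_height(S, f):
    # Significant wave height from the zeroth spectral moment
    return 4 * np.sqrt(moment(S, f, 0))


# Synthetic input (illustrative only): 3 time steps, flat spectrum S = 1 m^2/Hz
f = np.array([0.1, 0.2, 0.3, 0.4])
S = np.ones((3, f.size))

hs = sig_wave_height(S, f)

# With S = 1 the zeroth moment is just the sum of bin widths, i.e. f[-1]
expected = 4 * np.sqrt(f[-1])
print(hs, expected)
```

Because the whole (time, frequency) array is handled in one vectorized call, there is no per-time-step conversion or looping involved.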
Agreed, the added complexity of xarray made things a lot slower than anticipated. I generally like this approach of simplifying internal functionality while maintaining flexible IO for the user's chosen data types. |
I've been working through this issue more today, using the significant wave height function as an example. In this case, the core problem is that a large pandas DataFrame (with a frequency dimension and variables across time) is being converted to a Dataset with one dimension and 8000+ variables, instead of a Dataset with one variable (or a DataArray) with two dimensions (frequency x time). This is the default behavior when using native xarray/pandas functions to convert DataFrames to Datasets. Converting back and forth with thousands of xr.Dataset variables is slow, and so is applying mathematical functions to xr.Datasets with thousands of variables. It is likely similar to looping through all 8000+ time stamps and converting each one, or applying some mathematical function to each one, instead of treating them as a tensor. A lot of historical data ends up like this in MHKiT (time x frequency, etc.). Often it should not actually contain multiple variables but should instead have multiple dimensions. In my testing I saw the following increases in computational expense versus the pure numpy implementation above:
My proposed solutions, pending more rigorous testing and assessment of other functions:
|
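The DataFrame-to-Dataset behavior described in the comment above can be reproduced on a small synthetic example. This is only a sketch: the sizes, the column labels `t0`, `t1`, ... and the variable name `S` are hypothetical, and it is not the conversion code MHKiT itself uses. It contrasts pandas' default `DataFrame.to_xarray()` with building a two-dimensional `xr.DataArray` by hand.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small stand-in for a spectra table: rows are frequencies, columns are time stamps
n_freq, n_time = 5, 4
freq = np.linspace(0.05, 0.25, n_freq)
cols = [f"t{i}" for i in range(n_time)]  # hypothetical time labels
df = pd.DataFrame(
    np.ones((n_freq, n_time)),
    index=pd.Index(freq, name="frequency"),
    columns=cols,
)

# Default conversion: every column becomes its own 1-D variable
ds = df.to_xarray()

# Alternative: a single 2-D array with explicit (frequency, time) dimensions
da = xr.DataArray(
    df.to_numpy(),
    coords={"frequency": freq, "time": cols},
    dims=("frequency", "time"),
    name="S",
)

print(len(ds.data_vars), "variables over dims", set(ds.dims))
print("DataArray dims:", da.dims, "shape:", da.shape)
```

With 8000+ time stamps the first form carries 8000+ separate variables, whereas the second stays a single array that numpy-style operations can treat as one tensor.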
@ryancoe an immediate fix in your case: MHKiT now lets you pass xarray objects directly to any MHKiT function instead of converting to pandas first. This should speed things up across MHKiT immediately. Just specify frequency_dimension or time_dimension as required:
Edit: I forgot that I changed a couple of lines locally to
and
|
This issue should be resolved robustly and throughout the entire wave module with the addition of #352. Let us know if this comes up again. |
Not a bug, but I was trying to analyze 30 years of hourly data with xarray, dask, and mhkit and having a lot of trouble getting things to run in a timely manner. I found that by simply re-writing some of the functions from mhkit in pure numpy, I got speed improvements on the order of 3000x. Looking at the mhkit code, I think this can only be due to converting between types?
mhkit.wave.resource.significant_wave_height
numpy
cape_hatteras_download_spectra_2000.nc.zip