The QuadratiK package is implemented in both R and Python. It provides a comprehensive set of goodness-of-fit tests and a clustering technique based on kernel-based quadratic distances, together with algorithms for generating random samples from Poisson kernel-based distributions (PKBD). It includes:
- Goodness-of-Fit Tests: The software implements one-, two-, and k-sample tests for goodness of fit, offering an efficient and mathematically sound way to assess the fit of probability distributions. Expanded capabilities include tests for uniformity on the d-dimensional sphere based on Poisson kernel densities. The tests are particularly useful for large, high-dimensional datasets where assessing the fit of probability models is of interest. Specifically, we offer tests for normality, as well as two- and k-sample tests for the equality of two or more distributions, i.e. $H_0: F_1 = F_2$ and $H_0: F_1 = \ldots = F_k$, respectively. The proposed tests perform well in terms of level and power for contiguous alternatives, heavy-tailed distributions, and in higher dimensions.
- Poisson Kernel-based Distribution (PKBD): The package also includes functionality for generating random samples from a PKBD and computing its density value. A short guide on PKBD is included in the User Guide. For more details, please see Golzy and Markatou (2020) and Sablica et al. (2023).
- Clustering Algorithm for Spherical Data: The package incorporates a unique clustering algorithm specifically tailored for spherical data. This algorithm leverages a mixture of Poisson-kernel-based densities on the sphere, enabling effective clustering of spherical data or data that has been spherically transformed. This facilitates the uncovering of underlying patterns and relationships in the data. The clustering algorithm is especially useful in the presence of noise in the data and the presence of non-negligible overlap between clusters.
- Additional Features: Alongside these functionalities, the software includes additional graphical functions, aiding users in validating cluster results as well as visualizing and representing clustering results. This enhances the interpretability and usability of the analysis.
- User Interface: We also provide a dashboard application built using Streamlit, allowing users to access the methods implemented in the package without the need for programming.
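To make the two-sample hypothesis $H_0: F_1 = F_2$ concrete, here is a minimal NumPy sketch of a generic kernel-based two-sample test. It is an illustration only and does not use the QuadratiK API or reproduce its algorithm: the Gaussian kernel, the bandwidth `h`, the uncentered V-statistic, the permutation scheme, and all function names are assumptions made for this example.

```python
import numpy as np


def gaussian_kernel(a, b, h):
    """Gaussian kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 h^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * h**2))


def two_sample_statistic(x, y, h):
    """V-statistic estimate of a kernel-based quadratic distance between samples."""
    return (
        gaussian_kernel(x, x, h).mean()
        + gaussian_kernel(y, y, h).mean()
        - 2.0 * gaussian_kernel(x, y, h).mean()
    )


def permutation_test(x, y, h=1.0, n_perm=200, seed=0):
    """Return the observed statistic and a permutation p-value (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    observed = two_sample_statistic(x, y, h)
    pooled = np.vstack([x, y])
    n = len(x)
    count = 0
    for _ in range(n_perm):
        # Under H_0 the group labels are exchangeable, so reshuffle them.
        perm = rng.permutation(len(pooled))
        stat = two_sample_statistic(pooled[perm[:n]], pooled[perm[n:]], h)
        count += stat >= observed
    p_value = (count + 1) / (n_perm + 1)
    return observed, p_value


rng = np.random.default_rng(42)
x_same = rng.normal(size=(100, 3))
y_same = rng.normal(size=(100, 3))            # same distribution as x_same
y_shift = rng.normal(loc=1.0, size=(100, 3))  # mean-shifted alternative

stat_null, p_null = permutation_test(x_same, y_same)
stat_alt, p_alt = permutation_test(x_same, y_shift)
print(f"same distribution: statistic={stat_null:.4f}, p-value={p_null:.3f}")
print(f"shifted mean:      statistic={stat_alt:.4f}, p-value={p_alt:.3f}")
```

The package itself should be preferred for real analyses. This sketch only shows the general shape of such a test: a kernel-based distance between samples, calibrated by resampling.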
The R implementation can be found on CRAN and the corresponding GitHub repository is available here.
Giovanni Saraceno <[email protected]>, Marianthi Markatou <[email protected]>, Raktim Mukhopadhyay <[email protected]>, Mojgan Golzy <[email protected]>
Maintainer: Raktim Mukhopadhyay <[email protected]>
The documentation is hosted on Read the Docs at https://quadratik.readthedocs.io/en/latest/
The package can be installed from PyPI using pip install QuadratiK
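After installing, one way to confirm that the distribution is visible to the current interpreter is to query its metadata with the standard library. This is a small sketch; the helper name `installed_version` is hypothetical and not part of QuadratiK.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("QuadratiK")
if version is None:
    print("QuadratiK is not installed in this environment")
else:
    print(f"QuadratiK version: {version}")
```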
- QuadratiK Examples: A collection of basic examples that demonstrate how to use the core functionalities of the QuadratiK package. Ideal for new users to get started quickly.
- An Introduction to Poisson Kernel-Based distributions: A short introduction to the Poisson Kernel-Based distributions.
- Random sampling from the Poisson kernel-based density: Learn how to generate random samples from the Poisson kernel-based density and visualize the results.
- Usage Instructions for Dashboard Application: Step-by-step instructions on how to set up and use the QuadratiK dashboard application. This guide helps you interactively explore and analyze data using the dashboard's features.
You can also execute the examples on Binder.
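For readers who want a feel for the PKBD before opening the guides, the sketch below evaluates the density $f(x \mid \mu, \rho) = (1 - \rho^2) / (\omega_d \lVert x - \rho\mu \rVert^d)$ on the unit sphere in $\mathbb{R}^d$, where $\omega_d$ is the surface area of that sphere, and draws samples with a naive rejection sampler that uses uniform proposals. The function names are hypothetical, and the sampler is a simple stand-in chosen for clarity; it is not the algorithm implemented in the package or the one described by Sablica et al. (2023), and it becomes inefficient as $\rho$ approaches 1 or $d$ grows.

```python
import math

import numpy as np


def pkbd_density(x, mu, rho):
    """PKBD density (1 - rho^2) / (omega_d * ||x - rho * mu||^d) at unit vectors x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    # Surface area of the unit sphere in R^d.
    omega_d = 2.0 * math.pi ** (d / 2) / math.gamma(d / 2)
    dist = np.linalg.norm(x - rho * mu, axis=1)
    return (1.0 - rho**2) / (omega_d * dist**d)


def sample_uniform_sphere(n, d, rng):
    """Uniform points on the unit sphere, obtained by normalizing Gaussian vectors."""
    z = rng.normal(size=(n, d))
    return z / np.linalg.norm(z, axis=1, keepdims=True)


def sample_pkbd_naive(n, mu, rho, seed=0):
    """Naive rejection sampler with uniform proposals (illustrative only)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[0]
    accepted, total = [], 0
    while total < n:
        proposals = sample_uniform_sphere(2 * n, d, rng)
        dist = np.linalg.norm(proposals - rho * mu, axis=1)
        # Density ratio f(x) / f(mu); the density is maximal at x = mu.
        accept_prob = ((1.0 - rho) / dist) ** d
        keep = proposals[rng.uniform(size=len(proposals)) < accept_prob]
        accepted.append(keep)
        total += len(keep)
    return np.vstack(accepted)[:n]


mu = np.array([0.0, 0.0, 1.0])  # unit-norm location parameter
rho = 0.5                       # concentration parameter, 0 <= rho < 1
samples = sample_pkbd_naive(5000, mu, rho, seed=1)
sample_mean = samples.mean(axis=0)  # expected to be close to rho * mu

# Sanity check on the circle (d = 2): the density should integrate to 1.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
circle_integral = pkbd_density(circle, np.array([1.0, 0.0]), 0.7).sum() * (
    2.0 * np.pi / len(theta)
)
print("sample mean:", np.round(sample_mean, 3))
print("integral over the circle:", round(float(circle_integral), 6))
```

Because the Poisson kernel reproduces harmonic functions, the mean of a PKBD sample should be close to $\rho\mu$, which gives a quick check on any sampler.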
To install the development version of QuadratiK, you will need to download the code files from the master branch of the GitHub repository. Keep in mind that the development version may contain bugs or unstable features. For the latest stable release, we recommend installing via pip or downloading a release from GitHub.
To clone the master branch from GitHub, use the following command:
git clone https://github.com/rmj3197/QuadratiK.git
QuadratiK uses the Poetry package manager for dependency management and installation. If you don't have Poetry installed, you can install it by following the instructions in the Poetry Documentation.
We strongly recommend creating a new virtual environment to isolate the QuadratiK installation and its dependencies from your system-wide Python environment. You can create a virtual environment using venv, virtualenv, or any other virtual environment manager of your choice. For example, using venv:
python3 -m venv quadratik-env
source quadratik-env/bin/activate # On Windows: quadratik-env\Scripts\activate
After installation, you can activate the Poetry-managed virtual environment by running:
poetry shell
This ensures that any commands you run are executed within the isolated environment.
Please note that if you manage your own virtual environment externally, you do not need to use poetry shell, since that environment is already activated and provides the correct Python interpreter.
After setting up your virtual environment and cloning the repository, navigate to the QuadratiK directory:
cd QuadratiK
You can install the project dependencies and set up the development environment by running:
poetry install
This command will install the dependencies specified in pyproject.toml and the package, and set up the project for development.
To verify that everything is set up correctly, you can run the project's test suite. This will help ensure that the development environment is correctly configured:
poetry run pytest
This command uses Poetry to run pytest within the virtual environment, executing all the tests defined in the project.
- If you encounter any issues during installation or while using the development version, please report them on the GitHub Issues page.
- To keep your development environment up to date, you can periodically pull the latest changes from the master branch and run poetry update to update dependencies.
To contribute to QuadratiK, please follow the contribution guidelines provided in the repository.
The code of conduct can be found at Code of Conduct.
This project uses the GPL-3.0 license, with a full version of the license included in the repository.
Below is a list of packages in R and Python that provide functionalities related to Goodness-of-Fit testing. Please note that this list is not exhaustive. We also would like to point out that while these packages deal with goodness-of-fit in general, none encodes the methodology and algorithms that are present in our software. Furthermore, our software incorporates a clustering algorithm for data that reside on the d-dimensional sphere that is especially useful in the presence of noise in the data and the presence of non-negligible overlap between clusters. Functions that can be used to generate data from PKBDs are also provided.
- stats: Contains the Kolmogorov-Smirnov test, performed using the ks.test function.
- goftest: Includes the Cramér-von Mises test.
- goft: Provides the Anderson-Darling test.
- vsgoftest: Performs GoF tests for various distributions (uniform, normal, lognormal, exponential, gamma, Weibull, Pareto, Fisher, Laplace, and Beta) based on Shannon entropy and the Kullback-Leibler divergence.
- GoFKernel: Contains an implementation of Fan's test.
- GSAR: Implements graph-based ranking strategies for univariate and high-dimensional multivariate two-sample GoF tests. Includes the univariate run-based test, two-sample Kolmogorov-Smirnov test, and a modified Kolmogorov-Smirnov test for scale alternatives.
- crossmatch: Provides a two-sample test based on interpoint distances.
- energy: Offers a collection of test statistics for multivariate inference based on energy statistics.
- kernlab: Includes an implementation of the Maximum Mean Discrepancy (MMD) test statistic using kernel mean embedding properties.
- kSamples: Contains several nonparametric rank score $k$-sample tests, including the Kruskal-Wallis test, van der Waerden scores, normal scores, and the Anderson-Darling test.
- coin: Provides permutation tests tailored against location and scale alternatives, and for survival distributions.
- circular: Offers tests for data represented as points on the surface of a unit hypersphere, including Rayleigh's test, Rao's Spacing test, Kuiper's test, and Watson's test of uniformity.
- CircNNTSR: Provides a test for uniformity based on nonnegative trigonometric sums.
- sphunif: Contains a collection of Sobolev tests and other nonparametric tests for uniformity on the sphere.
- scipy: Includes a number of goodness-of-fit (GoF) tests, such as the Kolmogorov-Smirnov test, Cramér-von Mises test, and Anderson-Darling test. For more details, please see the SciPy Statistical Functions documentation.
- hyppo: Offers implementations of various goodness-of-fit (GoF) testing methods, such as the Maximum Mean Discrepancy (MMD) and Energy statistics for $k$-sample testing. For more information, visit the Hyppo Documentation.
If you use this package, please consider citing it using the following entry:
@misc{saraceno2024goodnessoffitclusteringsphericaldata,
title={Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python},
author={Giovanni Saraceno and Marianthi Markatou and Raktim Mukhopadhyay and Mojgan Golzy},
year={2024},
eprint={2402.02290},
archivePrefix={arXiv},
primaryClass={stat.CO},
url={https://arxiv.org/abs/2402.02290},
}
This work has been supported by the Kaleida Health Foundation and the National Science Foundation.
Saraceno G., Markatou M., Mukhopadhyay R., Golzy M. (2024). Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python. arXiv preprint arXiv:2402.02290.
Ding Y., Markatou M., Saraceno G. (2023). “Poisson Kernel-Based Tests for Uniformity on the d-Dimensional Sphere.” Statistica Sinica. DOI: 10.5705/ss.202022.0347.
Golzy M. & Markatou M. (2020) Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling, Journal of Computational and Graphical Statistics, 29:4, 758-770, DOI: 10.1080/10618600.2020.1740713.
Sablica, L., Hornik, K., & Leydold, J. (2023). Efficient sampling from the PKBD distribution. Electronic Journal of Statistics, 17(2), 2180-2209.
Markatou, M., & Saraceno, G. (2024). A unified framework for multivariate two-sample and k-sample kernel-based quadratic distance goodness-of-fit tests. DOI: 10.48550/arXiv.2407.16374v1