pg_index_stats

Lightweight extension for PostgreSQL that generates extended statistics based on index definitions. It introduces dependency of the statistics on the corresponding index.

According to the postgres docs it is impractical to compute multivariate statistics automatically. Our conjecture here is that index structure reflects that a specific set of columns and extensions is most frequently used for extracting data, and it is critical to build optimal query plans when combinations of these columns are involved.

Interface

Integer GUC pg_index_stats.columns_limit - number of first columns of an index which will be involved in extended statistics creation (default 5). Set its value to 0 if you want to pause generation of new statistics.
Function pg_index_stats_build(idxname, mode DEFAULT 'mcv, distinct') - manually create extended statistics on an expression defined by formula of the index idxname.
Function pg_index_stats_remove() - remove all previously automatically generated statistics.
Function pg_index_stats_rebuild() - remove old and create new extended statistics over non-system indexes existed in the database.

Installation

Download or git clone source code
setup environment variables to reference desired installation. Something like following:

export USE_PGXS=1
export PATH=/usr/pgsql-16/bin:/usr/pgsql-16/share:$PATH

In the extension folder execute:

make
make install

Usage

After installation, load the extension statically (see shared_preload_libraries), dynamically (by the LOAD command), or execute the CREATE EXTENSION pg_index_stats statement. When loaded, the extension will create extended statistics on any non-system index definition. Use the \dX command to observe extended statistics in the database. If you want to build extended statistics over the database defined before the extension loading, use the pg_index_stats_rebuild routine. For example:

CREATE TABLE test(x integer, y integer);
CREATE INDEX ON test (x,y);
CREATE EXTENSION pg_index_stats;
SELECT pg_index_stats_rebuild();

The function pg_index_stats_rebuild will pass over all the database non-system tables and build extended statistics according to the definition of each index containing more than one column.

Notes

Each created statistics depends on the index and the pg_index_stats extension. Hence, dropping an index you remove corresponding auto-generated extended statistics. Dropping pg_index_stats extension you will remove all auto-generated statistics in the database.
Although multivariate case is trivial (it will be used by the core natively after an ANALYZE finished), univariate one (histogram and MCV on the ROW()) isn't used by the core and we should invent something - can we implement some code under the get_relation_stats_hook and/or get_index_stats_hook ?
For clarity, the extension adds to auto-generated statistics comments likewise 'pg_index_stats - multivariate statistics'. To explore all generated statistics an user can execute simple query like:

SELECT s.oid,s.stxname,d.description
FROM  pg_statistic_ext s JOIN pg_description d ON (s.oid = d.objoid)
WHERE description LIKE 'pg_index_stats%';

Research

The main goal of this work is to understand the real benefits of extended statistics. Here we have two directions:

Is it possible to make it more compact? Adding new statistics we can have different relationships: duplicated, overlapping, intersecting, including. So, we should try to look into all these cases and find the best solution to optimize storage and ANALYZE time.
How to use extended statistics. Do we need additional core tweaks to make it more effective?
What about histograms on the set of columns? Partially, answer can be found there: https://www.postgresql.org/message-id/flat/CAN_hQmsYDhPMK4bSJHPw3RGnpKqPbx2Uk_7HRV%2BDz9F5H7dj1A%40mail.gmail.com and https://www.postgresql.org/message-id/flat/5460244C.8080109%40fuzzy.cz#4fcda52e1dc956e0631056ef2ab12949

TODO

In the case of indexes intersection, combine their multivariate statistics into some meta statistics.
? Restrict the shared library activity by databases where the extension was created.
? Extend modes: maybe user wants only ndistincts or relatively lightweight column dependencies?

Second Thoughts

Could we introduce some automatization here? For example, generate functional dependencies only for cases when real dependency factor on columns more than a predefined value?
As I can see, univariate statistics on a ROW(Index Tuple Descriptor) look cheaper and contain a whole set of columns covered by histogram and MCV. So, when the user creates an index because he knows he would use queries with clauses utilizing the index, it would be more profitable to use such statistics. Unfortunately, core PostgreSQL doesn't allow estimations on a group of columns; it is possible only for extended statistics. So, univariate statistics could be utilized only in PostgreSQL forks for now.

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
expected		expected
sql		sql
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pg_index_stats--0.1.sql		pg_index_stats--0.1.sql
pg_index_stats.c		pg_index_stats.c
pg_index_stats.control		pg_index_stats.control
pg_index_stats.h		pg_index_stats.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

pg_index_stats

Interface

Installation

Usage

Notes

Research

TODO

Second Thoughts

About

Releases

Packages

Languages

License

danolivo/pg_index_stats

Folders and files

Latest commit

History

Repository files navigation

pg_index_stats

Interface

Installation

Usage

Notes

Research

TODO

Second Thoughts

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages