
Add postgres dev #71

Draft: gclaramunt wants to merge 18 commits into develop

Conversation

gclaramunt (Contributor)

No description provided.

gclaramunt marked this pull request as draft on December 17, 2024 at 23:25
sql/snapshot.sql Outdated
subheight int8 NOT NULL,
last_snapshot_hash text NULL,
created_at timestamp NULL,
updated_at timestamp NULL,
Member

Why do we need the updated_at field? AFAIK all data here are immutable and, once stored, never updated 🤔

Member

That's not necessarily true. OpenSearch indexes are sometimes edited in unusual cases, for example. The only overhead for updated_at is the trigger when a row is actually updated, so it doesn't really impact performance significantly. It makes the database a lot more human-readable in case of any issues, though.

I prefer to have them on all tables, but it's up for debate if there's a good reason to remove those columns.

Member

> OpenSearch indexes are sometimes edited in unusual cases, for example.

@AlexBrandes could you elaborate on that? What is an example case?

Member

You would know better than I would, but haven't you guys had to edit or reindex in certain cases? There have been sporadic issues with balances (I know some of those were issues between two indexes). There was also an older case I remember with a broken chain that had to be fixed.

I can also think of other cases where we might update a row instead of deleting and re-inserting it. For example, we might decide that we really need all historical state proofs from all snapshots to be returned through the API. Updating the existing rows would be a better strategy than deleting/re-inserting everything.

I guess the important questions are:

  • Will this field impact performance enough that we should avoid it? imo no
  • Will this field be useful enough for working with the DB that it's worth including? imo yes

@marcinwadon (Member), Dec 18, 2024

AFAIK all those cases were about data that had not been indexed. Since we ingest blockchain data, which is immutable, we should never update data already indexed in the database (for example, an exchange has confirmed a transaction and then you update its balance or parent hash, or whatever). AFAIK the block explorer, since it's centralized, works as the source of truth for the whole network: cluster nodes fetch data from the block explorer when performing a rollback. IMO this data should never be modified.

Member

So it would be a rare occurrence for a record to be updated; in general, I agree.

What's the argument against including the column, though? Performance?

gclaramunt (Contributor, Author)

Both arguments make sense. Adding the updated_at column later is not free, but if we're careful we can minimize the impact, so I don't think we need to commit right now :)
(Since updates are rare, maybe we can also consider a separate table.)
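A minimal sketch of that separate-table idea (table and column names here are illustrative, not from this PR):

CREATE TABLE row_updates (
    table_name text        NOT NULL,  -- which table the updated row lives in
    row_key    text        NOT NULL,  -- e.g. the hash / PK of the updated row
    updated_at timestamptz NOT NULL DEFAULT now()
);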

sql/snapshot.sql Outdated
amount int8 NOT NULL,
fee int8 NOT NULL,
block_hash text NOT NULL,
transaction_original_id int4 NULL,
Member

What is this field? 🤔

Comment on lines 92 to 102
object Configuration {
  import pureconfig.ConfigSource
  import pureconfig.generic.auto._

  case class DbConfig(host: String, port: Int, user: String, password: Option[String], database: String, maxSessions: Int)

  lazy val dbConfig: DbConfig = ConfigSource
    .default
    .at("snapshotStreaming.db")
    .loadOrThrow[DbConfig]

Member

Can we use the pattern that we already have configured in the project? It would be great not to have two different ways of reading config, so either use the previous one or refactor the previous one to use pureconfig instead (but as a separate PR) 🤔

gclaramunt (Contributor, Author)

This is a WIP, and the main goal is to validate the approach and the library. Once it reaches a mature state, I'll unify it with the rest of the project. I favor PureConfig because it's less boilerplate.

Member

OK, me too, but at the end of the day I'd opt for consistency, so either this or that.

build.sbt (resolved)
@AlexBrandes (Member) left a comment

Overall Notes:

  • Table names should be plural (global_snapshots, dag_balance_changes, etc.)
  • Think about what indexes will be needed based on the existing endpoint definitions.
  • Tables that still need to be added:
    • metagraph balances
    • blocks
    • DAG rewards + metagraph rewards
    • FeeTransactions
    • Metagraphs


-- DROP TABLE global_snapshot;

CREATE TABLE global_snapshot (
Member

Notes:

  • hash can be VARCHAR
  • subheight can be int4
  • last_snapshot_hash can be VARCHAR
  • We need to include fields from the current snapshot schema here as nullable fields:
    • epochProgress
    • metagraphSnapshotCount
  • We should have a unique index/constraint on hash
  • last_snapshot_hash should FK to hash on this table
  • created_at / updated_at - need DEFAULT NOW() to set the initial value, and then I think updated_at needs a trigger to update it whenever the row changes. We should consider timestamptz instead of timestamp here, but maybe it doesn't matter. (A DDL sketch follows below.)
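A minimal DDL sketch of what these notes imply; the epoch_progress and metagraph_snapshot_count names and types are assumptions, and the column list is not complete:

CREATE TABLE global_snapshot (
    ordinal                  int8        NOT NULL,
    hash                     varchar     NOT NULL,
    subheight                int4        NOT NULL,
    last_snapshot_hash       varchar     NULL,
    epoch_progress           int8        NULL, -- assumed mapping of epochProgress
    metagraph_snapshot_count int4        NULL, -- assumed mapping of metagraphSnapshotCount
    created_at               timestamptz NOT NULL DEFAULT now(),
    updated_at               timestamptz NOT NULL DEFAULT now(),
    CONSTRAINT global_snapshot_pkey PRIMARY KEY (ordinal),
    CONSTRAINT global_snapshot_hash_key UNIQUE (hash),
    CONSTRAINT global_snapshot_last_snapshot_hash_fk
        FOREIGN KEY (last_snapshot_hash) REFERENCES global_snapshot (hash)
);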

Member

Still need to add epochProgress - also add version and proofs as a separate relation.

sql/snapshot.sql (resolved)

-- DROP TABLE dag_transaction;

CREATE TABLE dag_transaction (
Member

Notes:

  • hash can be VARCHAR
  • global_ordinal - nice, let's use this format throughout. We need an index on this column to fetch transactions per snapshot.
  • block - is this supposed to be block_hash? Can be VARCHAR.
  • source - this is a wallet address. Should be VARCHAR.
  • destination - also a wallet address
  • block_hash - what's the difference between this and block?
  • transaction_original_id - does this FK to a separate table? I'm not really sure why we return this from the API, but we'll need to continue returning it for backwards compatibility. We could store it as JSON here (sketched below), since it's never queried separately from this record, and it's a nested record, so storing it separately would mean multiple tables.
  • created_at / updated_at - same as above
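That store-as-JSON idea could look like this; the column name is illustrative:

ALTER TABLE dag_transaction
    ADD COLUMN transaction_original jsonb NULL;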

Member

There are a few other missing columns here if you check against the existing BE API output: salt, parent, etc.

https://be-mainnet.constellationnetwork.io/transactions?limit=10

Member

Still missing the columns in the above comment.

sql/snapshot.sql Outdated

-- DROP TABLE metagraph_snapshot;

CREATE TABLE metagraph_snapshot (
Member

Notes:

  • metagraph_id should be VARCHAR
  • ordinal - should we make this metagraph_ordinal? Or just ordinal for metagraph snapshots and global_ordinal for global snapshots? We should be consistent across all tables.
  • global_ordinal - I think this should be NOT NULL. Is there a case where this could be NULL?
  • hash - VARCHAR
  • subheight - probably int4, since this doesn't get big
  • last_snapshot_hash - VARCHAR
  • created_at / updated_at - same as above
  • The global_snapshot_fk constraint references the metagraph ordinal, but we need 2 FKs here anyway: global snapshot and metagraph snapshot.
  • Need epochProgress here also, and probably other CurrencySnapshot fields (fee, definitely). Take a look at the output in an actual snapshot.

sql/snapshot.sql (resolved)
@gclaramunt (Contributor, Author)

Regarding varchar vs. text, I don't have a preference; the underlying datatype is the same. Since the size is fixed, should we consider char(SIZE)?

@AlexBrandes (Member)

> Regarding varchar vs. text, I don't have a preference; the underlying datatype is the same. Since the size is fixed, should we consider char(SIZE)?

Ah, I didn't know this. That's not the case in MySQL, for example.

I think TEXT is fine then. I feel like we will run into edge cases we can't support if we force an exact length with char(SIZE).

Comment on lines 89 to 108
val SplitSnapshot(snapshot, blocks,
  transactions,
  balances,
  currSnapshot,
  currIncrementalSnapshots,
  currBlocks,
  currTransactions,
  currFeeTransactions,
  currBalances) = splitSnapshot
val parallelRequests = updateParallelRequests(
  blocks,
  transactions,
  balances,
  currSnapshot,
  currIncrementalSnapshots,
  currBlocks,
  currTransactions,
  currFeeTransactions,
  currBalances
).grouped(config.bulkSize).toList
Member

Maybe you can move the whole split snapshot into the updateParallelRequests function?

Comment on lines 53 to 56
(currSnapshot, currIncrementalSnapshots, currBlocks, currTransactions, currFeeTransactions, currBalances) =
mappedCurrencyData
} yield {
val (snapshot, blocks, transactions, balances) = mappedGlobalData
Member

Why is the currency data in the for comprehension while the global data is in the yield? IMO only the SplitSnapshot should be in the yield. Or, even better, create a function in the SplitSnapshot companion object that receives the mapped global data and the mapped currency data and produces the split snapshot. Then the whole for comprehension could look like:

(globalMapper.mapGlobalSnapshot(...), currencyMapper.mapCurrencySnapshots(...))
  .mapN(SplitSnapshot(_, _))


case class UpdateRequests(
  sequentialRequests: Seq[Seq[UpdateRequest]],
  parallelRequests: List[List[UpdateRequest]]
)

case class SplitSnapshot(
  snapshot: Snapshot,
  blocks: Seq[Block],
  txs: List[Transaction],
  balances: Seq[AddressBalance],
  cdSnapshots: Seq[CurrencyData[Snapshot]],
  cdCcySnapshots: Seq[CurrencyData[CurrencySnapshot]],
  cdBlocks: Seq[CurrencyData[Block]],
  cdTxs: Seq[CurrencyData[Transaction]],
  cdFeeTxs: Seq[CurrencyData[FeeTransaction]],
  cdBalances: Seq[CurrencyData[AddressBalance]]
)
Member

What is the use case for SplitSnapshot? 🤔 I see you create the class and then destructure it immediately after 🤔

@AlexBrandes (Member) left a comment

  • I think we need an addresses table that all the different transaction tables FK to (source/destination).


-- DROP TABLE global_snapshot;

CREATE TABLE global_snapshot (
Member

Still need to add epochProgress - also add version and proofs as a separate relation.

snapshot_hash varchar NOT NULL,
snapshot_ordinal int8 NOT NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
Member

I'll just comment here but it applies to all the tables. You'll need an auto-update trigger to support updated_at.
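A common Postgres sketch of such a trigger; the function and trigger names are illustrative, and EXECUTE FUNCTION assumes Postgres 11+ (older versions use EXECUTE PROCEDURE):

CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
    -- Stamp the row on every UPDATE, regardless of which columns changed.
    NEW.updated_at := now();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- One trigger per table, e.g. for global_snapshot:
CREATE TRIGGER global_snapshot_set_updated_at
    BEFORE UPDATE ON global_snapshot
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();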

CONSTRAINT global_snapshot_pkey PRIMARY KEY (ordinal)
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT global_snapshot_pk PRIMARY KEY (hash)
Member

ordinal needs to be a unique constraint
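For example (the constraint name is illustrative):

ALTER TABLE global_snapshot
    ADD CONSTRAINT global_snapshot_ordinal_key UNIQUE (ordinal);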

staking_address varchar NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT metagraph_snapshot_pk PRIMARY KEY (metagraph_id, hash)
Member

(metagraph_id, ordinal) should be a unique constraint here also.

@@ -42,13 +84,13 @@ CREATE INDEX global_snapshot_hash_idx ON public.global_snapshot USING btree (hash);
-- DROP TABLE dag_balance_change;

CREATE TABLE dag_balance_change (
hash varchar NOT NULL,
ordinal int8 NOT NULL,
snapshot_hash varchar NOT NULL,
Member

Even if snapshot_hash is the FK to snapshots here, we need global_ordinal because we need to be able to query this by address and ordinal, like this:

SELECT balance
FROM dag_balance_changes
WHERE global_ordinal < :snapshot_ordinal AND address = :address
ORDER BY global_ordinal DESC
LIMIT 1
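A composite index sketch that would make this lookup efficient; the index name is illustrative, and the table name matches the singular form used in the PR's DDL:

CREATE INDEX dag_balance_change_address_global_ordinal_idx
    ON dag_balance_change (address, global_ordinal DESC);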

CONSTRAINT dag_balance_change_global_snapshot_fk FOREIGN KEY (ordinal) REFERENCES global_snapshot(ordinal)
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT dag_balance_change_pk PRIMARY KEY (snapshot_hash, address),
Member

Suggested change
CONSTRAINT dag_balance_change_pk PRIMARY KEY (snapshot_hash, address),
CONSTRAINT dag_balance_change_pk PRIMARY KEY (address, global_ordinal),

snapshot_hash varchar NOT NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT dag_block_pk PRIMARY KEY (hash),
CONSTRAINT dag_block_block_fk FOREIGN KEY (hash) REFERENCES block(hash),
Member

Is this an FK to the parent abstract_table? I think we can probably remove it.

hash varchar NOT NULL,
block varchar NULL,
source_addr varchar NOT NULL,
CREATE TABLE dag_reward_transaction (
destination_addr varchar NOT NULL,
Member

I would call this one destination, to match the transaction schema.

global_snapshot_hash varchar NOT NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT reward_transaction_global_snapshot_fk FOREIGN KEY (global_snapshot_hash) REFERENCES global_snapshot(hash)
Member

This table has no PK.
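If no natural key fits, one hypothetical option is a surrogate identity column (assumes Postgres 10+):

ALTER TABLE dag_reward_transaction
    ADD COLUMN id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY;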


-- DROP TABLE dag_transaction;

CREATE TABLE dag_transaction (
Member

Still missing the columns in the above comment.

@AlexBrandes (Member) left a comment

Sorry, hit submit on the last review too quickly. Here are a few more comments.

created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT dag_transaction_pk PRIMARY KEY (hash),
CONSTRAINT dag_transaction_abstract_transaction_fk FOREIGN KEY (hash) REFERENCES abstract_transaction(hash),
Member

Does it need an FK to the abstract table? I'm not familiar with how that works.

metagraph_id varchar NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT fee_transaction_pkey PRIMARY KEY (hash),
Member

I think this needs to be at least (metagraph_id, hash). We should double-check that the hash is unique to the metagraph also. It's generated from the data, but I don't remember how it's made unique.

Comment on lines +159 to +160
metagraph_snapshot_hash varchar NULL,
metagraph_id varchar NULL,
Member

Suggested change
metagraph_snapshot_hash varchar NULL,
metagraph_id varchar NULL,
metagraph_snapshot_hash varchar NOT NULL,
metagraph_id varchar NOT NULL,


CREATE TABLE metagraph_balance_change (
metagraph_id varchar NOT NULL,
metagraph_hash varchar NOT NULL,
Member

Same issue as the dag_balance_change table. We need ordinal here so it can be queried, unless the intent is to join to the metagraph_snapshots table. That seems like it might be too inefficient to search against, though.

balance int8 NULL,
created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT metagraph_balance_change_metagraph_snapshot_fk FOREIGN KEY (metagraph_id,metagraph_hash) REFERENCES metagraph_snapshot(metagraph_id,hash)
Member

This table has no PK


-- DROP TABLE metagraph_block;

CREATE TABLE metagraph_block (
Member

This table has no PK.


-- DROP TABLE metagraph_reward_transaction;

CREATE TABLE metagraph_reward_transaction (
Member

This table has no PK.


-- DROP TABLE metagraph_snapshots_to_global_snapshot;

CREATE TABLE metagraph_snapshots_to_global_snapshot (
Member

I think this works - metagraph_id and metagraph_snapshot_hash have to be NOT NULL, and the FK to metagraph_snapshot would need to be removed, because these rows would be inserted before the snapshot records exist. This table does feel a little bit messy, though, if it doesn't have FK constraints.

It might be easier to just store the metagraph_snapshot_count, or a JSONB field with an array of metagraph snapshot records, on the global snapshot. They would only be used while upload of the metagraph snapshots is pending.
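A sketch of that JSONB alternative; the column name is illustrative:

ALTER TABLE global_snapshot
    ADD COLUMN pending_metagraph_snapshots jsonb NULL;
-- e.g. [{"metagraph_id": "...", "hash": "..."}], cleared once the
-- corresponding metagraph_snapshot rows have been ingested.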

created_at timestamp DEFAULT now() NOT NULL,
updated_at timestamp DEFAULT now() NOT NULL,
CONSTRAINT metagraph_snapshots_to_global_snapshot_global_snapshot_fk FOREIGN KEY (global_snapshot_hash) REFERENCES global_snapshot(hash),
CONSTRAINT metagraph_snapshots_to_global_snapshot_metagraph_snapshot_fk FOREIGN KEY (metagraph_id,metagraph_snapshot_hash) REFERENCES metagraph_snapshot(metagraph_id,hash)
Member

This table has no PK.
