Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(mater): add unixfs wrapping support #690

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

pete-eiger
Copy link
Contributor

PR: Add UnixFS Wrapping Support

Description

This PR adds the ability for users to decide whether to wrap file contents in raw blocks or UnixFS DAG nodes when converting to a CARv2. Specifically:

  1. WrapMode enum (Raw or UnixFS) is introduced.
  2. CLI updated: A --wrap option allows the user to choose how data is packaged in the CAR.
  3. Balanced‐tree builder: A new stream_balanced_tree_unixfs function creates proper UnixFS leaves and parent nodes, ensuring the DAG is navigable by retrieval clients.
  4. Raw flow preserved: If --wrap=raw is chosen, it uses stream_balanced_tree with raw leaves.

All relevant tests now pass, including a new check verifying that the UnixFS mode indeed produces the expected number of leaf blocks (matching the total chunk count) and parent nodes.

Important points for reviewers

  • UnixFS DAG creation: Handled by stream_balanced_tree_unixfs, which encodes each chunk as a DAG‐PB File node.
  • Raw mode: Leaves remain raw blocks, handled by the original stream_balanced_tree.
  • CLI: A new ConvertConfig uses WrapMode to drive the logic in create_filestore.
  • Tests: The test test_filestore_unixfs_dag_structure confirms a correct UnixFS DAG with equal numbers of chunks and leaf blocks.

Checklist

  • Important points for reviewers?
    The code merges neatly with existing balanced‐tree logic. Tests confirm that both raw and UnixFS modes work properly.
  • Description of the change?
    Yes, see above.
  • Follow‐ups?
    None; this PR completes the UnixFS wrapping feature.
  • Has it been tested?
    Yes, new tests pass locally (test_filestore_unixfs_dag_structure).
  • Any alternative approaches considered?
    A manual parent node builder was considered, but leveraging the existing balanced‐tree approach is cleaner.
  • Documented new APIs?
    The WrapMode enum is documented and the CLI’s --wrap flag is explained in the help text.

@pete-eiger pete-eiger requested a review from a team January 21, 2025 17:36
@pete-eiger pete-eiger self-assigned this Jan 21, 2025
@pete-eiger pete-eiger added the ready for review Review is needed label Jan 21, 2025
@pete-eiger pete-eiger linked an issue Jan 21, 2025 that may be closed by this pull request
Comment on lines +52 to +67
/// Determines how content is encoded in the CAR file.
#[derive(clap::ValueEnum, Clone, Debug)]
pub enum WrapMode {
/// Store content directly in the CAR file without additional metadata.
/// This is the most space-efficient option but provides minimal metadata.
Raw,
/// Wrap content in UnixFS format which includes file metadata and DAG structure.
/// This is compatible with IPFS and provides richer metadata about the content.
UnixFS,
}

impl Default for WrapMode {
fn default() -> Self {
Self::Raw
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defaults for unit-type enums can be derived too:

Suggested change
/// Determines how content is encoded in the CAR file.
#[derive(clap::ValueEnum, Clone, Debug)]
pub enum WrapMode {
/// Store content directly in the CAR file without additional metadata.
/// This is the most space-efficient option but provides minimal metadata.
Raw,
/// Wrap content in UnixFS format which includes file metadata and DAG structure.
/// This is compatible with IPFS and provides richer metadata about the content.
UnixFS,
}
impl Default for WrapMode {
fn default() -> Self {
Self::Raw
}
}
/// Determines how content is encoded in the CAR file.
#[derive(clap::ValueEnum, Clone, Debug, Default)]
pub enum WrapMode {
/// Store content directly in the CAR file without additional metadata.
/// This is the most space-efficient option but provides minimal metadata.
#[default]
Raw,
/// Wrap content in UnixFS format which includes file metadata and DAG structure.
/// This is compatible with IPFS and provides richer metadata about the content.
UnixFS,
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no other wrapping. Having an enum here is overengineering

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Also lets have a wrapped mode by default. If the user wants the raw they should provide a flag.

written += output_file.write(&contents).await?;

if let Ok((cid, contents)) = self.read_block().await {
if contents.len() == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember when implementing this that this break condition cause the output file to miss some data. Did you check that the output file contains all the data?

@jmg-duarte jmg-duarte changed the title Feat/675/mater convert wrap contents in unixfs dag feat(mater): add unixfs wrapping support Jan 22, 2025
Copy link
Member

@cernicc cernicc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job on figuring this out 💪 I've added some comments and questions. The tests are currently not compiling.

Heads up. If you want to rerun the ci you have to retag the pr with the ready for review. The ci doesn't run automatically on each commit.

@@ -1,4 +1,4 @@
use std::collections::BTreeSet;
use std::{collections::BTreeSet, time::Duration};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Konrad is currently working on fixing the maat. Do you think we should remove the changes so that he won't have conflicts when merging? The thing is that there are more changes needed to fix the tests. There are not only compile time problems.

Comment on lines +52 to +67
/// Determines how content is encoded in the CAR file.
#[derive(clap::ValueEnum, Clone, Debug)]
pub enum WrapMode {
/// Store content directly in the CAR file without additional metadata.
/// This is the most space-efficient option but provides minimal metadata.
Raw,
/// Wrap content in UnixFS format which includes file metadata and DAG structure.
/// This is compatible with IPFS and provides richer metadata about the content.
UnixFS,
}

impl Default for WrapMode {
fn default() -> Self {
Self::Raw
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Also lets have a wrapped mode by default. If the user wants the raw they should provide a flag.

}
}
}

impl Default for Config {
fn default() -> Self {
Self::Balanced {
chunk_size: DEFAULT_BLOCK_SIZE,
tree_width: DEFAULT_TREE_WIDTH,
chunk_size: 256 * 1024, // 256KiB
Copy link
Member

@cernicc cernicc Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reason of removing the DEFAULT_BLOCK_SIZE and DEFAULT_TREE_WIDTH? I just want to see if there is something to learn here :D

tree_width: DEFAULT_TREE_WIDTH,
chunk_size: 256 * 1024, // 256KiB
tree_width: 174, // Default from ipfs
wrap_mode: WrapMode::Raw // Default to raw like go-car
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go-car creates a wrapped car by default


let mater_config = match config.wrap_mode {
WrapMode::Raw => Config::default(),
WrapMode::UnixFS => Config::Balanced {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How were the chunk_size and tree_width here determined?

} // otherwise, the buffer is not full, so we don't do a thing
}
// If we reach EOF but still have a partial chunk, yield it
if read_bytes == 0 && buf.len() > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this a bug?

entries.push(entry);
position += writer.write_block(&node_cid, &node_bytes).await?;

if let Some(existing_position) = written_blocks.get(&node_cid) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change was needed because of deduplication? Was that a bug?

@@ -29,6 +29,7 @@ thiserror.workspace = true
tokio = { workspace = true, features = ["fs", "macros", "rt-multi-thread"] }
tokio-stream.workspace = true
tokio-util = { workspace = true, features = ["io"] }
clap = { workspace = true, features = ["derive"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should hide the clap behind a feature flag. This was one of the reasons why we split the cli and lib.

@@ -21,7 +26,7 @@ impl<R> Reader<R> {

impl<R> Reader<R>
where
R: AsyncRead + Unpin,
R: AsyncRead + Unpin + AsyncSeek,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Seek needed?

.route("/upload/:cid", put(upload))
.route(
"/upload/:cid",
put(upload as fn(State<Arc<StorageServerState>>, Path<String>, Request<Body>) -> _)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

Copy link
Contributor

@jmg-duarte jmg-duarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work. The PR description would be better if it explained how the new code achieves the functionality since there's a lack of comments on the code for that specific purpose

@@ -29,6 +29,7 @@ thiserror.workspace = true
tokio = { workspace = true, features = ["fs", "macros", "rt-multi-thread"] }
tokio-stream.workspace = true
tokio-util = { workspace = true, features = ["io"] }
clap = { workspace = true, features = ["derive"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be removed. The mater crate has nothing to do with the CLI

Comment on lines +37 to +38
#[arg(long, value_enum, default_value = "raw")]
wrap: WrapMode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's either wrapped or not. This is a bool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this file related to mater?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this file related to mater?

Comment on lines +44 to +45
chunk_size: 256 * 1024, // 256KiB
tree_width: 174, // Default from ipfs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 - const
2 - why are you changing the existing value?

Comment on lines +372 to +374
let total_file_size: u64 = children.iter().map(|c| c.1.raw_data_length).sum();
let total_encoded: u64 = children.iter().map(|c| c.1.encoded_data_length).sum();
let blocksizes: Vec<_> = children.iter().map(|c| c.1.raw_data_length).collect();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're performing 3 loops when you could do just 1

};
let encoded = DagPbCodec::encode_to_vec(&pb_node)?;
let mh = generate_multihash::<Sha256,_>(&encoded);
let cid = Cid::new_v1(DAG_PB_CODE, mh);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why v1?

Comment on lines +390 to +396
let pb_links = children.into_iter().map(|(child_cid, link_info)| {
PbLink {
cid: child_cid,
name: Some("".to_string()),
size: Some(link_info.encoded_data_length),
}
}).collect::<Vec<_>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also add this to the previous loop issue

@@ -1,10 +1,9 @@
mod blockstore;
mod file;
mod filestore;
pub mod filestore;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you exposing the whole module?

Comment on lines +70 to +77
pub struct ConvertConfig {
/// How the content should be wrapped in the CAR file.
/// See [`WrapMode`] for details.
pub wrap_mode: WrapMode,
/// Whether to overwrite existing files at the output path.
/// If false, will error if the output file already exists.
pub overwrite: bool,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This structure adds more mental overhead than it solves issues. They're two parameters, it's ok to use them explicitly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for review Review is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: Mater convert, wrap contents in unixfs dag
4 participants