Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Abstraction for Adapting Various Model Objects to Tool Collection Inputs #19359

Open
jmchilton opened this issue Dec 20, 2024 · 1 comment

Comments

@jmchilton
Copy link
Member

jmchilton commented Dec 20, 2024

Motivation

I wrote this issue up as a very esoteric diary entry to myself and maybe Marius and Wolfgang 😆. But there was no motivation to provide context so I'm going to put a little motivation here at the top. The drive is easier to see in context - tool authors should have it a lot easier.

Mappers in the IUC currently have a wild number of conditional options to deal with different ways the data may be structured - but that is silly - it isn't a tool concern at all. If the mapper doesn't have a "pooled" mode there should just be one input not a complex conditional:

<param type="data_collection" type="paired_or_unpaired" label="Reads" />

If there is a pooled option, the conditional should be wildly simpler for the end user and for the tool author than the current conditionals in the wild:

<conditional name="reads">
    <param type="boolean" label="Pooled Analysis?" name="pooled" /> 
    <when value="true">
       <param type="data_collection" type="list,list:paired,list:paired_or_unpaired" label="Reads" />
    </when>
    <when value="false">
       <param type="data_collection" type="paired_or_unpaired" label="Reads" />
    </when>
</conditional>

From there any sort of issue about how to prepare or assemble the data should be the job of the tool form. If people don't have collections they should be able to just use datasets and the tool form should adapt them just at runtime for the tool code and not have complex model objects sitting in the history.

Full Issue

Required for #6063.

I was originally convinced collections would require all sorts of tool form trickery to promote single datasets to lists or group things based on filters or whatever. But we've gotten by with relatively few tricks thanks mostly to collection operations and being quite explicit about data manipulations. And within the realm of magic we do... I think it is essentially all reduce collections on to less structured things (doing a collection reduction over a multi-data input for instance or mapping a collection of a single data input). There is nowhere we have needed to promote or adapt other things to collections. But the CWL branch makes heavy use of such magic and has been using a concept I called EphemeralCollections. The approach was fairly naive and unstructured and we've grown a lot in terms of schema and type checking since that I started working with that idea.

I've been implementing #6063 for the sample sheet work (my current implementation for these mixed paired/unpaired collections is at jmchilton@89b74fe). For this work I needed to be able to promote single datasets in a list into singleton collections of type paired_or_unpaired with a single input. I think the expected (and useful) semantics of mapping a plain list of datasets over a collection input with collection_type paired_or_unpaired is pretty intuitive and natural. But it does sort of require some adapting of datasets (or in this case single DCEs) into collection-looking things. These "collections" make sense to the tool code and are a much better approach than torturing basic.py with even more wild overloading of existing object types but they don't belong in the database as model objects - they can exist just in the realm of tool evaluation and users never want to see these byproducts in the UI presumably. This is exactly what I tried to do with EphemeralCollections the better part of a decade ago. This work gave me a chance to redo EphemeralCollections with more structure. I called the idea CollectionAdapters in that commit.

So far I've just adapted lists for mapping over paired_or_unpaired collection inputs, so the the API didn't need a new syntax. But I think there are a few other cases we want to adapt things to collection inputs where we would need an API syntax I think.

  • One should be able to pass a single dataset to a paired_or_unpaired input. Reviewing the tools in the IUC - all the language assumes everything is mapping over a list of pairs and paired inputs are not used outside of that context. I think this also is reflected in Anton's dislike of the paired collection creator. So I wasn't going to block the paired_or_unpaired work on this feature... but I would like to know we have the CollectionAdapter idea right outside the one single adapter I've implemented and this would be the next natural place to use it. My idea for the API syntax would be something like {src: "CollectionAdapter", type: "PromoteDatasetToCollectionAdapter", collection_type: "paired_or_unpaired", adapting: {src: hda, id: "<encoded_id>"}}
  • This would immediately give rise to another use case - one of the advantages of a multi-data input versus a collection list input is that you can send single datasets to the multi-data input. One can imagine using the same adapter to allow a dataset to be passed to the list input: {src: "CollectionAdapter", type: "PromoteDatasetToCollectionAdapter", collection_type: "list", adapting: {src: hda, id: "<encoded_id>"}}

I think I should stop at that if I implement this idea further because each trick is potentially complexity we have to add to the workflow editor and workflow extraction code, but some other obvious adapters are floating out there that could really simplify tool authoring.

  • {src: "CollectionAdapter", type: "PromoteDatasetsToPairedAdapter", adapting: {forward: {src: hda, id: "<encoded_id>"}, reverse: {src: hda, id: "<encoded_id>"}} would allow tool authors to just specify a single paired_or_unpaired input and let the tool form render options for individual pairs, two dataset combos, or single datasets. That would so simplify dozens of tools and help us control language and UI around collections so we can make it more clear what is happening.
  • We cannot send a set of individual datasets to collection list input currently but the semantics would be clear and the operation reasonable - one could imagine doing that with {src: "CollectionAdapter", type: "PromoteDatasetsToListAdapter", adapting: [{src: hda, id: "<encoded_id>"}*]}
  • We cannot map a list over a list input or a multi-data input - the semantics are clear but the tool is always going to consume the collection in its entirety and not allow mapping over that input. Again the semantics of how that would work are clear and the operation is reasonable so it would make good sense to implement I think. The syntax might be something like: {src: "CollectionAdapter", type: "WrapEachElementInAListAdapter", adapting: {src: hdca, id: "<encoded_id>"}}. Promoting each list element to its own list list and making the list a list:list effectively for the tool code would cause Galaxy's natural map/reduce semantics to implement the desired operation.

We've gotten by 10 years with a fairly "dumb" tool form but reviewing tools that work with collections in tools-iuc - we've pushed a lot of this complexity onto tool authors and tools. Tool authors have less versatility in how to adapt to and document these complexities. This is what people like @jxtx and @bgruening said would happen and if I recall correctly they both wanted a bit more magic so tools could be simpler but I didn't have the time or creativity to make progress back then. Maybe with collection adapters and tool form enhancements we can start reclaiming some of that complexity back into Galaxy and develop better visual languages for conveying these operations to users and let tool authors just worry about tool intricacies and not collection semantics so much.

@mvdbeek
Copy link
Member

mvdbeek commented Dec 20, 2024

That looks awesome 🔥, the branch makes so much more sense than the ephemeral collections, I'm looking forward to merging the SE/PE versions of workflows and tool inputs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants