Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standard file symbol syntax #238

Open
varungandhi-src opened this issue Apr 23, 2024 · 4 comments
Open

Standard file symbol syntax #238

varungandhi-src opened this issue Apr 23, 2024 · 4 comments

Comments

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Apr 23, 2024

(Converted from Slack thread)

I was thinking maybe we should just go ahead and add a standard "file symbol" syntax to SCIP so that you can emit occurrences which are references to a file rather than some fake occurrence within the file.

For some languages, there are no package declarations at the top of a file, and we want to be able to emit references to files for import statements. In the past, we've relied on hacks, which don't work when the client tries to do something principled. https://github.com/sourcegraph/sourcegraph/issues/59733#issuecomment-1945520000

@varungandhi-src
Copy link
Contributor Author

Questions from @kritzcreek

I think being able to reference files makes sense. There are a lot of languages that use files as meaningful units (most common would be implicit module).
How'd you represent these symbols? Relative paths to the index root? Would the indexers need to emit definition occurrences for all files. just the ones referenced, or would the file symbols have special resolution logic?
When introducing the features I'd maybe add a guide line along the lines of:
"Where possible we should prefer linking to syntax (think module Data.List where in Haskell, rather than targeting the file)"

@varungandhi-src
Copy link
Contributor Author

Relative paths to the index root?

Yes, relative, syntactically normalized (no . or ..). This is the current syntax for global symbols

<scheme> ' ' <package> ' ' (<descriptor>)+

We should reuse the prefix bit, but the last bit needs to not match any descriptor.

Identifiers cannot be empty, which means that we can maybe we can use some syntax like:

<scheme> ' ' <package> ' ' file://<relative_path>

Based on the grammar, I think this syntax would be illegal, because the sequence :// is not allowed. It is also easy to understand.

We retain the <package> part because you should be able to refer to files in other repos.

Would the indexers need to emit definition occurrences for all files. just the ones referenced

I think having files be referencable in SCIP is a useful concept, and in terms of extra storage, it wouldn't be that much, it's one string per document. So I propose the following:

Add two fields to Document: file_symbol_info: SymbolInformation and file_occurrence.

  • If one wants to add "file-level docs" for the file, they get attached to file_symbol_info.
  • If one wants to specify a real source range for the file's occurrence to land (e.g. <Name> in module <Name> in Haskell), they can attach it to the file_occurrence.

As part of Document normalization:

  • If doc.file_symbol_info?.symbol? is unset, we implicitly construct one with the right name
  • If doc.file_occurrence?.symbol? is unset, we implicitly set it with the right name.
  • If doc.file_occurrence?.range? is unset, we implicitly set it to [0, 0, 0] so as to not break other code relying on ranges set for occurrences (not 100% sure about this, feels a bit hacky... - need to be careful that we don't panic or silently drop these elsewhere during Document normalization or insertion)

would the file symbols have special resolution logic?

Hmm, resolution-wise, I don't think we should need anything special at all (maybe I'm missing something -- do you think this would be necessary?). The extra sauce will be needed in client-side code nav.

  • On 'Go to definition', it needs to not highlight any source range if the file_occurrence.range is zero-length.
  • It needs to make sure that it provides some ways of doing 'Find references' for files which do not a proper source range for the file_occurrence.

When introducing the features I'd maybe add a guide line along the lines of:

Yeah, definitely, agreed. We've added examples in the past in scip.proto to guide indexer authors.

@varungandhi-src
Copy link
Contributor Author

Follow-up from @kritzcreek

syntactically normalized

Queue the support requests for symlinked setups, or directory mapping like tsconfig paths etc...
I agree syntactically normalized is the best we can do, but I'm still a bit worried people might interpret "files are supported" to mean "file systems are supported"

and in terms of extra storage, it wouldn't be that much

We should definitely keep an eye on this, paths can be relatively long and strings are expensive. I'd hope gzip would compress the repetition in path prefixes relatively well, but we'll have to check.

Hmm, resolution-wise, I don't think we should need anything special at all (maybe I'm missing something -- do you think this would be necessary?)

I was thinking about this in terms of the codeintel backend. If the frontend doesn't "decode" the symbol to request a file (and maybe it can't anyway because of doc.file_occurence), we'd need to add these new file symbols to the codeintel_scip_symbols table, right?

@varungandhi-src
Copy link
Contributor Author

Queue the support requests for symlinked setups, or directory mapping like tsconfig paths etc...
I agree syntactically normalized is the best we can do, but I'm still a bit worried people might interpret "files are supported" to mean "file systems are supported"

We already have the rules written here. https://github.com/sourcegraph/scip/blob/main/scip.proto#L79-L88

We haven't received any complaints about it so far.

and in terms of extra storage, it wouldn't be that much
We should definitely keep an eye on this, paths can be relatively long and strings are expensive

Import statements are generally much fewer than other occurrences in most code (PureScript doesn't count as mainstream 😛). I think we really should not worry about this.

I was thinking about this in terms of the codeintel backend. If the frontend doesn't "decode" the symbol to request a file (and maybe it can't anyway because of doc.file_occurence), we'd need to add these new file symbols to the codeintel_scip_symbols table, right?

My point was, the "special-ness" of these symbols would be around normalization (which happens during ingestion) mainly, and special affordances needed in the UI. Storage (at rest) and retrieval/lookup (in the backend) should be unaffected because they're "just" symbols.

In my mind, "resolution" ~ lookup. Since the storage would be like other symbols, resolution/lookup would also be similar.

@linear linear bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant