Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Winch: implement fpu to int conversions for aarch64 #9889

Merged
merged 8 commits into from
Jan 7, 2025

Conversation

MarinPostma
Copy link
Contributor

This PR implements FPU to int conversions for winch, with the unsigned_truncate and signed_truncate methods.

@MarinPostma MarinPostma requested review from a team as code owners December 21, 2024 16:35
@MarinPostma MarinPostma requested review from cfallin and alexcrichton and removed request for a team December 21, 2024 16:35
@MarinPostma
Copy link
Contributor Author

one more in the #8321 series

@github-actions github-actions bot added the winch Winch issues or pull requests label Dec 21, 2024
Copy link

Subscribe to Label Action

cc @saulecabrera

This issue or pull request has been labeled: "winch"

Thus the following users have been cc'd because of the following labels:

  • saulecabrera: winch

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@MarinPostma MarinPostma changed the title Winch: implement fpu to int conversions Winch: implement fpu to int conversions for aarch64 Dec 21, 2024
@saulecabrera
Copy link
Member

I can help reviewing this one.

@saulecabrera saulecabrera requested review from saulecabrera and removed request for cfallin and alexcrichton January 1, 2025 17:47
Copy link
Member

@saulecabrera saulecabrera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. Left some comments inline.

As a general comment, the emit_* naming convention is not standard across the assembler implementations (as noted and fixed in #9918), so I´d suggest dropping the emit_ prefix in the assembler methods introduced in this PR.

}

/// Emit the `fcmp` instruction with `rn` and `rm` as source registers.
fn emit_fpu_cmp(&mut self, rn: Reg, rm: Reg, size: OperandSize) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we don´t need to duplicate this method, fcmp already exists.

Comment on lines 1140 to 1141
/// Load the max value for an integer of size out_size, as a floating-point
/// of size `in-size`, into register `rd`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Load the max value for an integer of size out_size, as a floating-point
/// of size `in-size`, into register `rd`.
/// Load the max value for an integer of size out_size, as a floating-point
/// of size `in_size`, into register `rd`.

use OperandSize::*;

match in_size {
OperandSize::S32 => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I believe that here and in the next match arm below you can omit the OperandSize:: prefix given that you're already importing it above (e.g., use OperandSize::*)

use OperandSize::*;

match in_size {
OperandSize::S32 => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment here, I think it should be possible to omit the OperandSize prefix in both match arms.

// - check bounds
self.emit_check_nan(src, src_size);

self.emit_min_fp_value(signed, src_size, dst_size, Writable::from_reg(tmp_reg));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're writing to tmp_reg, I think this should be defined in the signature, similar to dst: tmp_reg: Writable<Reg>

src_size,
dst_size,
kind,
// TODO: strangely, this register is provided in unsigned_truncate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I´d suggest avoiding passing a placeholder register here. The temp register that is passed to unsigned_truncate is an allocatable register, therefore there's very little chance of it getting unintentionally clobbered. Instead having a single implementation for both cases, I´d suggest breaking down the implementation in the assembler into two methods, one for each case (signed/unsigned), then in each implementation you can decide to use the scratch register, immediately before emission, without the risk of unintentionally clobbering it. This is similar to how the x64 assembler implementation works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand, similarly to how x86 implementation, you would like to have two methods, fpu_to_sint and fpu_to_uint in asm, and also, those methods should be passed a temporary register in masm, rather than using the provided one in the case of unsigned_truncate?

I have a few questions/remarks:

  • what is the purpose of the temp reg in the unsigned variant? Why must this registry be allocated rather than used as a temp register? Is the caller expecting to find something in there? It is not explicitly passed as a writable register, but the x64 implementation implicitly converts it to one.
  • While splitting the convert_ methods for x64 makes sense to me since they correspond to a single instruction, we must emit a bunch of checks that are aware of the sign in the case of aarch64. Do you suggest duplicating all the checks in signed and unsigned variants? Or do we have two methods in ASM that call the current implementation with the signed argument set accordingly?
  • I'm sorry if that sounds very ignorant, but why do we have to worry about clobbering a tmp register anyway? Can't it, by definition, be used for any purpose, and one should not have too strong an assumption about what it may contain, especially when handing it over to a callee?
  • I'm surely missing something, but it seems to me that splitting the convert method into a signed and unsigned variant, and the use of tmp regs are orthogonal things, so I want to make sure I understand what you want correctly :)

Copy link
Member

@saulecabrera saulecabrera Jan 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of the allocatable temporary register is to provide an extra
register to aid in the emission of the instruction sequence needed for the
unsigned case; in x64, 4 registers are required for the instruction sequence:
1 for the destination and 3 temporary registers. Since in the case of x64 the
instruction sequence is completely handled by Cranelift, there's no risk of
unintentionally clobbering any of the registers passed to it, the unintentional
clobbering risk is particularly important in the case of scratch registers in Winch: even
though they can be used for any purpose, one important detail about them is that
they are not tracked by Winch's regalloc therefore they must be used sparingly
and with extreme caution: a misuse could result in unintentional clobbering and
therefore subtle bugs. Ideally, the live range of the scratch registers should be
as short as possible to avoid potential bugs.

In this particular case the reason why I am suggesting improving the usage of
the scratch register is to minimize the risk of unintentional clobbering of such
registers: once a scratch register is passed as a parameter, its live range is
extended and the risk of misuse is greater. A concrete example in this case of
a potential risk of unintentional clobbering is passing the scratch register to
fpu_to_int and then using the float scratch register in any of the methods
needed for the instruction sequence needed to fulfill the truncate instructions
(this is not the case today, but by passing the float scratch register around as
if it were any other register, it becomes harder to reason and audit in the
future).

There are some alternatives that I've thought about to improve the
handling/usage of the scratch registers to address some of the challenges
exposed in the previous paragraph. We could discuss those in a separate
issue/PR (e.g., relying on the type system to identify/audit scratch register usage and/or
providing exclusive access to the scratch registers, similar to what we have for
the allocatable registers). Care needs to be taken here to ensure that none of
the solutions explored affect compilation performance.

Taking a deeper look at your changes for the unsigned case, only 3 registers are
needed here compared to the x64 implementation which demands 4. This means that
you can effectively ignore the temporary register defined in the Masm
unsigned_truncate signature and only rely on the float scratch register.

Taking this into account, my concrete suggestion to move forward here is to:

  • Redefine the unsigned_truncate signature to be:
fn unsigned_truncate(
    &mut self,
    context: &mut CodeGenContext<Emission>,
    src_size: OperandSize,
    dst_size: OperandSize,
    kind: TruncKind
) -> Result<()>;

This allows maximum flexibility to encode the ISA differences and more
importantly reduces register pressure in cases where only 3 registers are
needed, reducing unnecessary register pressure where possible is important to
minimize memory traffic.

This is similar to how we handle the div instruction, given that the
requirements vary between ISAs.

  • Redefine fpu_to_int to be:
fn fpu_to_int(
    &mut self,
    dst: Writable<Reg>,
    src: Reg,
    src_size: OperandSize,
    dst_size: OperandSize,
    kind: TruncKind,
    signed: bool
) { ... };

Avoiding passing the scratch registers as params and shortening their live
range as much as possible by using them explicitly in each of the helper
functions that you have to fulfill the instruction sequence.


I believe that with this approach, my concerns will be addressed and there won't
be a need to split the signed/unsigned implementations (initially I was confused
with the TODO comment in the code and had assumed that the implementation was
making use of a scratch register that wasn't actually needed, although that's
not the case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the detailed writeup! This makes things a lot clearer. I'll make the changes 👍

@MarinPostma
Copy link
Contributor Author

@saulecabrera I addressed most of the review comments but left a few questions

@MarinPostma
Copy link
Contributor Author

@saulecabrera I have made the requested changes

Copy link
Member

@saulecabrera saulecabrera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for all your patience iterating on this. Left two minor comments and then we should be able to land this one.

let dst_ty = match dst_size {
OperandSize::S32 => WasmValType::I32,
OperandSize::S64 => WasmValType::I64,
_ => bail!("invalid conversion size"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you can use CodeGenError::unexpected_operand_size() instead. https://github.com/bytecodealliance/wasmtime/blob/main/winch/codegen/src/codegen/error.rs#L139

let dst_ty = match dst_size {
OperandSize::S32 => WasmValType::I32,
OperandSize::S64 => WasmValType::I64,
_ => bail!("invalid conversion size"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar here, you can use the appropriate error constructor.

@MarinPostma
Copy link
Contributor Author

It should be good now @saulecabrera

Copy link
Member

@saulecabrera saulecabrera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@saulecabrera saulecabrera added this pull request to the merge queue Jan 7, 2025
Merged via the queue into bytecodealliance:main with commit f309bfd Jan 7, 2025
39 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
winch Winch issues or pull requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants