Switch to plurimath gem #149

Open
4 tasks
skalee opened this issue Mar 15, 2021 · 17 comments

Comments

@skalee
Contributor

skalee commented Mar 15, 2021

In order to reduce code duplication in projects, extract logic to another gem. It looks like the most up-to-date version is here: https://github.com/metanorma/stepmod-utils/blob/728bd50bf609afd6c7ef0a6848f45a8419a57819/lib/stepmod/utils/html_to_asciimath.rb.

  • Extract HTML math conversion to a separate gem.
  • Use that gem in glossarist/iev-data.
  • Transfer relevant open issues from this project to plurimath/html2math.
  • Consider making some performance improvements in plurimath/html2math.

Extracted from #144:

@skalee we have copied the 'fake math conversion' code to here:
https://github.com/metanorma/stepmod-utils/blob/728bd50bf609afd6c7ef0a6848f45a8419a57819/lib/stepmod/utils/html_to_asciimath.rb

And this is probably the time to extract this 'fake math conversion' functionality to a separate gem under the Plurimath umbrella. Can you help with that? Thanks.

@skalee
Contributor Author

skalee commented Mar 19, 2021

@ronaldtse If you happen to have a test suite, or a technical description of the input format, that would be very helpful.

@skalee
Contributor Author

skalee commented Mar 19, 2021

I'll do my best, but it won't be very reliable. For example https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-02-13 — I can probably detect and convert j<i>b</i>, but I won't detect lone j which also happens in the definition.

@ronaldtse
Member

@skalee unfortunately I don't have a set of compiled examples. There are definitely enough examples from the IEV, and I know that @w00lf has encountered some formulas that required work on top of the original code, perhaps he has some specs/examples to provide.

@skalee
Contributor Author

skalee commented Mar 20, 2021

Thanks! I've extracted some from IEV. If @w00lf has a set of troublesome examples, it would be great to check them too.

@skalee
Contributor Author

skalee commented Mar 27, 2021

@ronaldtse Short follow-up:

I'm doing fine with converting HTML math expressions to AsciiMath. It's certainly doable and I've already developed a tool which supports many features they use in IEV.

The difficult part is telling HTML math apart from rich text. It's easy for a human but not necessarily for a computer. Detecting numbers isn't reliable, because they may be used in different contexts. Detecting operators isn't reliable, because a minus can be confused with a dash. Detecting <i> isn't reliable, because this tag isn't used exclusively for math (e.g. in https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=845-32-051). And so on. I believe that I can invent some heuristics and I'm working on that now, but they may fail to detect even some of the simplest formulas.

But perhaps it isn't needed at all? Perhaps we can keep HTML math as rich text, and IEC will gradually convert them to formulas during their ongoing work on these concepts? I know that it will take years. The question is if they really need anything more than that. And we need rich text conversion from HTML to AsciiDoc anyway.

@ronaldtse
Member

But perhaps it isn't needed at all? Perhaps we can keep HTML math as rich text, and IEC will gradually convert them to formulas during their ongoing work on these concepts? I know that it will take years. The question is if they really need anything more than that. And we need rich text conversion from HTML to AsciiDoc anyway.

We have the following agreement with the IEV team on semantic enrichment:

  • We will try to convert the math as much as possible, but we cannot guarantee that all math is converted
  • The IEV team will convert the formulas from the current format when we provide them with the converted formulas

Given that it is very difficult to bring semantic enrichment to 100%, I think best effort is acceptable.

We have to further consider that any "units" used in the IEV should also be converted into semantic units, i.e. UnitsML.

For now let's delegate the decision on what "good enough" in math means here to you, since you are knee deep in this 😉

@skalee
Contributor Author

skalee commented Mar 27, 2021

Then I guess heuristics will do.

@skalee
Contributor Author

skalee commented Mar 27, 2021

I'm pretty sure that some concepts need to be fixed, otherwise we'll end up with nasty false positives. One example is https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=102-03-30, this fragment precisely: forefinger(<b><i>V</i></b>). Because there is no space before (, it looks like a function call. I hope that I'll be able to provide a list of required fixes at some point in the future.

@ronaldtse
Member

In this case the heuristic could know that “forefinger” is too long for a math symbol, but it’s by no means a great rule. Let us also report this to IEC.

@skalee
Contributor Author

skalee commented Mar 27, 2021

Length checks will not work. There are formulas which would be broken this way, for example this one in https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=112-01-13:

dim(refractive index <i>n</i> = <i>c</i><sub>0</sub>/<i>c</i>) = (LT<sup>–1</sup>)<sup>0</sup>
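
To illustrate why a length check cuts both ways, here is a minimal Ruby sketch. The helper names and the threshold of 5 letters are hypothetical and not taken from any existing converter; the rule simply rejects any fragment that contains a long word once the tags are stripped.

```ruby
# Hypothetical length-based rule: a fragment counts as math only if none of
# its words is longer than max_word_length letters (threshold is assumed).
def strip_tags(html)
  html.gsub(/<[^>]+>/, "")
end

def longest_word_length(html)
  strip_tags(html).scan(/[[:alpha:]]+/).map(&:length).max || 0
end

def plausible_math?(html, max_word_length: 5)
  longest_word_length(html) <= max_word_length
end

prose   = "forefinger(<b><i>V</i></b>)"
formula = "dim(refractive index <i>n</i> = <i>c</i><sub>0</sub>/<i>c</i>) " \
          "= (LT<sup>–1</sup>)<sup>0</sup>"

plausible_math?(prose)   # rejected because of "forefinger" (intended)
plausible_math?(formula) # also rejected, because of "refractive" (not intended)
```

The same rule that filters out the false positive also discards the genuine formula, which is the problem described above.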

@skalee
Contributor Author

skalee commented Mar 27, 2021

@ronaldtse What should we do with lone Greek letters which aren't part of longer mathematical formulas, as in the following example (https://www.electropedia.org/iev/iev.nsf/display?openform&ievref=103-07-03)?

the angular frequency is <i>&omega;</i>.

  1. Should they be converted to stem:[omega]?
     the angular frequency is stem:[omega].
  2. Should they be left in HTML entity syntax, which is recognized in AsciiDoc (see https://docs.asciidoctor.org/asciidoc/latest/subs/replacements/)?
     the angular frequency is _&omega;_.
  3. Or should they be converted to a regular Greek letter?
     the angular frequency is _ω_.
  4. Or is there some other way to handle them?

@ronaldtse
Member

dim(refractive index n = c0/c) = (LT–1)0"

True, this probably will require manual conversion.

What to do with lone Greek letters which aren't part of longer mathematical formulas

They should be converted to normal stem:[xxx]. In the future we may further enrich them.
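
A minimal Ruby sketch of that conversion for lone italic Greek entities follows. The entity list is an illustrative subset, and the constant and method names are hypothetical; a real converter would need the full entity set and would have to cope with nested tags.

```ruby
# Illustrative subset of HTML entity names that are also AsciiMath symbols.
GREEK_ENTITIES = %w[alpha beta gamma delta omega].freeze

# Matches a single Greek entity wrapped in <i>...</i>.
LONE_GREEK = /<i>&(#{GREEK_ENTITIES.join('|')});<\/i>/

def convert_lone_greek(html)
  html.gsub(LONE_GREEK) { "stem:[#{Regexp.last_match(1)}]" }
end

convert_lone_greek("the angular frequency is <i>&omega;</i>.")
# => "the angular frequency is stem:[omega]."
```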

@skalee
Contributor Author

skalee commented Mar 29, 2021

@ronaldtse What should we do if a given formula cannot be represented in AsciiMath, typically due to unsupported symbols? For example in 103-03-01:

Note 3 to entry: Notation H(<i>x</i>) is also used. Notation &thetasym;(<i>t</i>) is used for the unit step function of time. Notation &upsih;(<i>x</i>) has also been used.

HTML      LaTeX       AsciiMath
thetasym  vartheta    vartheta
upsih     varUpsilon  ???

Fall back to MathML, perhaps? Use LaTeX or Unicode? Or do you have a better idea? Opening a feature request in AsciiMath may work too in the long run.
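
One way to keep this question open in code is to have the entity lookup report unsupported symbols instead of guessing. The sketch below only encodes the two table rows above; the hash, the method name, and the shape of the return value are hypothetical.

```ruby
# Entity name => AsciiMath symbol; nil marks the "???" row from the table.
ENTITY_TO_ASCIIMATH = {
  "thetasym" => "vartheta",
  "upsih"    => nil
}.freeze

# Returns the AsciiMath symbol when one is known; otherwise flags the entity
# so the caller can pick a fallback notation (MathML, LaTeX, or Unicode).
def asciimath_for_entity(name)
  symbol = ENTITY_TO_ASCIIMATH[name]
  if symbol
    { notation: :asciimath, value: symbol }
  else
    { notation: :unsupported, value: "&#{name};" }
  end
end

asciimath_for_entity("thetasym") # => { notation: :asciimath, value: "vartheta" }
asciimath_for_entity("upsih")    # => { notation: :unsupported, value: "&upsih;" }
```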

@skalee
Contributor Author

skalee commented Mar 29, 2021

Perhaps a better question would be: Given that AsciiMath is generally preferred but unsuitable for more complicated formulas, which syntax should be supplemental: LaTeX or MathML?

@skalee
Contributor Author

skalee commented Apr 3, 2021

Short follow-up. The current plan is:

  1. Convert HTML math to AsciiMath with our converter.
  2. Convert AsciiMath to MathML with AsciiMath gem.
  3. Convert it back to AsciiMath.

While it sounds odd, there is a rationale for that.

ad 1. HTML math is sequential in its nature, AsciiMath is sequential too, MathML is more structural. It's far easier to convert HTML math to AsciiMath than to MathML. My almost-complete-converter to AsciiMath is simpler and smaller than my work-in-progress-converter to MathML.

ad 2. However, AsciiMath does not support some of the features used in MathML, especially special characters which need to be written in Unicode rather than using their English names composed of ASCII characters. That's why some HTML math formulas cannot be represented as AsciiMath in the easy-to-edit form. Or maybe I'm wrong and stem:[ϒ] is okay — but this is not English Y nor Greek upsilon, this is "ϒ Greek upsilon with hook symbol".

ad 3. However, AsciiMath is easier for users, and we want to have AsciiMath when possible. That's why we'll try to convert it back to AsciiMath and use some other notation when it's impossible.
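
The three steps can be sketched in Ruby as a small pipeline. The converters are passed in as callables because their concrete APIs are not shown in this thread; the method name, the keyword arguments, the choice of MathML as the fallback notation, and the assumption that an impossible back-conversion raises an error are all hypothetical.

```ruby
# Round-trip pipeline: HTML math -> AsciiMath -> MathML -> AsciiMath.
# Falls back to the intermediate MathML when step 3 raises (assumed behaviour).
def normalize_math(html, html_to_asciimath:, asciimath_to_mathml:, mathml_to_asciimath:)
  asciimath = html_to_asciimath.call(html)        # step 1
  mathml    = asciimath_to_mathml.call(asciimath) # step 2
  begin
    { notation: :asciimath, value: mathml_to_asciimath.call(mathml) } # step 3
  rescue StandardError
    { notation: :mathml, value: mathml }
  end
end

# Placeholder converters, only good enough to exercise the control flow.
strip_html = ->(html) { html.gsub(/<[^>]+>/, "") }
to_mathml  = ->(am) { "<math><mi>#{am}</mi></math>" }
to_ascii   = ->(ml) { ml[/<mi>([a-z]+)<\/mi>/, 1] || raise(ArgumentError, "unsupported") }

normalize_math("<i>a</i>",
               html_to_asciimath: strip_html,
               asciimath_to_mathml: to_mathml,
               mathml_to_asciimath: to_ascii)
# => { notation: :asciimath, value: "a" }
```

Injecting the steps keeps the sketch independent of any particular gem and makes the fallback path easy to test.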

@ronaldtse
Member

@skalee I fully agree with these statements. Steps 2-3 will normalize the AsciiMath, so it’s good.

skalee pinned this issue on Apr 13, 2021
@ronaldtse
Member

This task will depend on the plurimath gem: plurimath/plurimath#2

ronaldtse changed the title from "Switch to plurimath/html2math" to "Switch to plurimath gem" on Nov 14, 2024