Support workflow to facilitate translation of Metanorma documents #349

Open
ronaldtse opened this issue Jan 12, 2022 · 17 comments
Labels
enhancement New feature or request on-hold

Comments

@ronaldtse
Contributor

OGC wishes to produce a Japanese translation of the CityGML 2.0 document encoded in Metanorma. (metanorma/ogc-citygml2#1).

I thought about it, and the following workflow makes the most sense. The challenge is to translate only "content", not syntax.

  1. Metanorma parses the English into a document tree.
  2. Use the Google Translate API (Ruby client) to translate while respecting structure,
  • i.e. individually translate sections / sentences along structural boundaries (without breaking links, etc).
  3. Create the Metanorma source files for the translated document, perhaps in an interleaved or identically structured manner, to produce a side-by-side document.

This is a preliminary workflow that nonetheless requires some thinking to realize.
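
The workflow above could be prototyped roughly as follows. This is only a sketch under assumptions: the input is namespace-free XML, the block element names (`p`, `title`) and the `lang` attribute on the duplicated element are illustrative, and `fake_translate` is a hypothetical stand-in for whatever machine-translation client ends up being used. Each block is translated as one unit, inline markup included, and the translated twin is inserted right after the original so the two can be read side by side.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def fake_translate(markup: str) -> str:
    """Hypothetical stand-in for a real machine-translation call."""
    return "[ja] " + markup


def inner_xml(elem):
    """Serialise the content of elem (text plus inline children) as one string."""
    children = "".join(ET.tostring(child, encoding="unicode") for child in elem)
    return escape(elem.text or "") + children


def translated_twin(elem, translate, lang):
    """Build a translated copy of elem; the id is not copied, to avoid duplicates."""
    # Assumes the translator returns well-formed markup; a real one may not.
    twin = ET.fromstring(f"<{elem.tag}>{translate(inner_xml(elem))}</{elem.tag}>")
    twin.set("lang", lang)
    twin.tail = elem.tail
    return twin


def interleave(root, translate, block_tags=("p", "title"), lang="ja"):
    """Insert a translated twin directly after every block element."""
    for parent in list(root.iter()):
        # Walk children backwards so insertions do not shift pending indexes.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag in block_tags:
                parent.insert(index + 1, translated_twin(child, translate, lang))
    return root


doc = ET.fromstring(
    '<clause id="a"><title>Scope</title>'
    '<p id="p1">See <xref target="b"/> for <em>details</em>.</p></clause>'
)
interleave(doc, fake_translate)
print(ET.tostring(doc, encoding="unicode"))
```

Link targets survive because they live in attributes that the translator is expected to pass through untouched; whether a given translation service actually does so would need to be checked.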

@ronaldtse ronaldtse added the enhancement New feature or request label Jan 12, 2022
@ronaldtse ronaldtse moved this to Triage in Metanorma Jan 12, 2022
@opoudjis opoudjis moved this from Triage to In Progress in Metanorma Jan 13, 2022
@opoudjis
Contributor

Professional translators do start with automated tools, it is true; but (a) they don't finish with them, and (b) they want a workflow where they can see both. They're going to want to use their workflow tools, you can't just dump gibberish Japanese in an XML document and have them sort it out later in ASCII.

We can, per what you suggest, insert Japanese for cleanup as duplications of context clauses. But this is a very big ask, and you should be talking to the professional translator who's actually going to do this, to work out reasonable tool support.

@opoudjis
Contributor

And someone else is going to have to work on this.

The text interleaving would be done by duplicating tags, using the @lang attribute and (for things like titles, which should only appear once) the lang:[] macro. (https://www.metanorma.org/author/topics/languages/)

@ronaldtse
Contributor Author

> They're going to want to use their workflow tools

"Professional" or not, this is something that is needed by someone who is translating the language. The label "professional" is a distraction.

In this case here, we are talking about standards authoring. The workflow tool for authoring standards documents by "professional standards authors" is Metanorma.

The Japanese author should be able to use Metanorma to:

  1. Start with machine-translation without losing existing structure
  2. Refine the translation
  3. Publish the Japanese translation

@opoudjis
Contributor

opoudjis commented Jan 17, 2022

Ronald, you are not understanding what I am saying.

Professional translation is emphatically NOT a distraction, if the workflow is of a translator using machine translation as a starting point to do bulk translation. Such translators will use a translation workbench tool, such as (to pick the first instance I've googled) https://www.memsource.com/translation-software/. Such a tool will include memorised custom equivalents that the translator has keyed in, templates, technical dictionaries, and whatever else the translator has put in place to make their life easier.

A professional translator's environment is going to be that workbench. That is the environment they are going to work in. Metanorma is NOT a translation environment, it is an authoring tool, and their translation environment is going to have to integrate with Metanorma, in some way you will need to work out.

What you are proposing is to drop machine translation into Metanorma XML outside of the translator's workbench tool, and make them do all their refinements manually. I am telling you, professional translators will not find that adequate: you will be taking them away from their shortcuts and their technical dictionaries, which are normally integrated into their editor.

So you will need to investigate further how translators go about translating marked-up documents while preserving markup in their tools. I think it is quite likely that this is a solved problem for their workbench tools; and if it is a solved problem, that is all the more reason for us to use the existing tools' way of solving the problem, rather than imposing our own solution on them. I think building our own solution is going to duplicate existing effort, and do a bad job of it, producing something that such translators cannot use.

And that is why I make a point of saying "professional" translators, translators that routinely use translation workbench tools. A non-professional translator, a subject matter expert for example, will quite happily follow the workflow you propose, of refining a machine translation manually, since they don't have existing workbench tools; they'll be quite happy, for that matter, to eyeball original and machine translated target in two separate windows of an editor, rather than a more integrated environment, where they could do things like mouseover words to get dictionary lookup. And for all I know, OGC may be translating their documents in such an ad hoc way.

But if an SDO employs a professional translator, using translation tools, to do translations, then Metanorma will need to integrate with their workflow. And:

  • "Start with machine-translation without losing existing structure" --- their tools likely already can do that.
  • "Refine the translation" --- they will want to keep doing that within their existing environment.

@opoudjis
Contributor

OK, given that the workflow envisioned is not one of a professional translator using a workbench:

  • Automated translation of XML source may preserve inline tags and attributes in formatting; that's not guaranteed, and if it does not, we may need to postprocess the XML. It would be much simpler to do the translation of asciidoctor source.
  • Automated translation of asciidoctor source will still potentially distort inline markup.
  • We should do automated translations one block at a time. (Delimited in Asciidoctor by blank lines.) We cannot translate things in inline markup separate from their context (by default): words marked up in boldface for example are still part of sentences.
  • We should output the original in comments next to the output translated block, so that the translator can see what the original is in place, and fix things. (Both markup, and text.)
  • Sourcecode ([source]) should not be translated. This should include inline text in monospace.
  • Anchors and cross-references should not be translated; we hope they will be untranslatable, but we can't assume that. So any altered <<x>> and [[x]] needs to be restored in automatic translation. (In <<x,y>>, the y should be translated: it is rendered text. In <<x,clause=n,y>>, the clause=n should not be translated, it is formatting.)
  • Table cells need to be processed one cell at a time.
  • Bibliographies should not be translated.
  • Document headers should not be translated. (The title should, but it's not worth trying to parse the document header.)
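
A rough sketch of how several of these rules might fit together, offered as an illustration rather than a proposed implementation. Assumptions: the first block is the document header, source blocks use `----` delimiters, the `@@n@@` placeholder format is arbitrary (a real translation service may well mangle it), and `translate` is any callable supplied by the caller. Table cells and bibliographies are not handled here.

```python
import re

# Anchors, xref targets (with optional locality such as clause=n), inline monospace.
PROTECTED = re.compile(
    r"\[\[[^\]]+\]\]"                                # [[anchor]]
    r"|<<[^,>]+(?:,[a-z]+=[^,>]+)*(?=[,>])"          # <<x  or  <<x,clause=n
    r"|`[^`\n]+`"                                    # `monospace`
)


def split_blocks(source):
    """Split on blank lines, keeping ---- delimited blocks in one piece."""
    blocks, current, fence = [], [], None
    for line in source.splitlines():
        if fence is None and re.fullmatch(r"-{4,}", line):
            fence = line                 # opening delimiter
            current.append(line)
        elif fence is not None:
            current.append(line)         # inside a delimited block, blank lines stay
            if line == fence:
                fence = None             # closing delimiter
        elif not line.strip():
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def protect(text):
    """Swap protected markup for numbered placeholders before translation."""
    kept = []

    def stash(match):
        kept.append(match.group(0))
        return f"@@{len(kept) - 1}@@"

    return PROTECTED.sub(stash, text), kept


def restore(text, kept):
    """Put the original markup back in place of the placeholders."""
    return re.sub(r"@@(\d+)@@", lambda m: kept[int(m.group(1))], text)


def is_source_block(block):
    return any(line.startswith("[source") for line in block.splitlines())


def translate_document(source, translate):
    out = []
    for index, block in enumerate(split_blocks(source)):
        if index == 0 or is_source_block(block):
            out.append(block)            # header and source code pass through
            continue
        masked, kept = protect(block)
        translated = restore(translate(masked), kept)
        original = "\n".join(f"// {line}" for line in block.splitlines())
        out.append(f"{original}\n{translated}")
    return "\n\n".join(out) + "\n"
```

With `str.upper` standing in for the translator, `See <<annex-a,clause=2,the annex>>` comes back as `SEE <<annex-a,clause=2,THE ANNEX>>`, preceded by the original line as a `//` comment.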

@opoudjis opoudjis moved this from In Progress to On hold in Metanorma Jan 19, 2022
@ronaldtse
Contributor Author

> Automated translation of XML source may preserve inline tags and attributes in formatting; that's not guaranteed, and if it does not, we may need to postprocess the XML. It would be much simpler to do the translation of asciidoctor source.

I respectfully disagree:

  1. We do not have a proper AsciiDoc parser that provides a parse tree suitable for translation. Any regex hack would just make the flow more fragile than it needs to be.
  2. The XML source is the only source of truth for the Metanorma document. Remember that the model-based standards code only unrolls content in the XML source, not the AsciiDoc source.

i.e. We should use the Metanorma Semantic XML for translation purposes.

@ronaldtse
Contributor Author

The talk about "professional translators" is irrelevant to our task at hand right now.

Here are the facts:

  1. The Japanese translation will be published in Metanorma.
  2. The English source document is available in Metanorma AsciiDoc and XML.
  3. The Japanese translator wishes for some machine translation assistance to start with.

We just have to do whatever possible with these.

@opoudjis opoudjis assigned opoudjis and unassigned opoudjis Jan 19, 2022
@opoudjis
Contributor

Google will skip HTML but not non-HTML XML markup (behaviour varies between languages).

Serialising the Asciidoctor parse tree into pseudo-HTML is itself a major venture, requiring a new parser, and the Asciidoctor parse tree cannot be relied on as stable.

The alternative is likely going to be quite lossy: source Asciidoctor > source Metanorma XML > source Metanorma Pseudo-HTML (substituting arbitrary HTML tags for Metanorma tags) > translated Metanorma Pseudo-HTML > translated Metanorma XML > translated Asciidoctor

Indeed, it'll be lossy enough that any translator is going to need to have two text windows side by side, source Asciidoctor, and output Asciidoctor --- and they're going to have to do a lot of repair of the latter copying from the former. If the document is clean (not much markup), this might be good enough. It's not a given that it will be.
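
To make the pseudo-HTML step concrete, here is a minimal sketch of one possible substitution, assuming namespace-free XML with no namespaced attributes. Every element becomes a `span`; the `data-tag` and `data-mn-` attribute names are invented for this example and are not anything Metanorma defines. The round trip is lossless only if the translation service returns well-formed markup with those attributes intact, which is exactly the part that cannot be taken for granted.

```python
import xml.etree.ElementTree as ET

PREFIX = "data-mn-"  # hypothetical prefix for stashed original attributes


def to_pseudo_html(root):
    """Rename every element to <span>, stashing its tag and attributes."""
    for elem in root.iter():
        stashed = {PREFIX + name: value for name, value in elem.attrib.items()}
        stashed["data-tag"] = elem.tag
        elem.attrib.clear()
        elem.attrib.update(stashed)
        elem.tag = "span"
    return ET.tostring(root, encoding="unicode")


def from_pseudo_html(markup):
    """Reverse to_pseudo_html on (translated) markup."""
    root = ET.fromstring(markup)
    for elem in root.iter():
        # Raises KeyError if the translator dropped the stashed tag name.
        tag = elem.attrib.pop("data-tag")
        original = {
            name[len(PREFIX):]: value
            for name, value in elem.attrib.items()
            if name.startswith(PREFIX)
        }
        elem.attrib.clear()
        elem.attrib.update(original)
        elem.tag = tag
    return root


source = '<p id="p1">See <xref target="a1">Annex A</xref> for <em>details</em>.</p>'
html = to_pseudo_html(ET.fromstring(source))
# html would be sent to the translation service at this point.
restored = from_pseudo_html(html)
print(ET.tostring(restored, encoding="unicode"))
```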

In XML, the provisos above become:

  • Content of <sourcecode> should not be translated.
  • Content of <code> should not be translated.
  • In the case of <xref>, content can be translated: the anchors and anchor cross-references will be segregated as markup.
  • <td> and <th> content needs to be translated in isolation.
  • <references> should not be translated.
  • <bibdata> should not be translated (with the possible exception of <title> and <abstract>).
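
These rules can be expressed as a tree walk that decides, for each text slot, whether it is eligible for translation. The sketch below assumes namespace-free XML and treats every `title` or `abstract` anywhere under `bibdata` as an exception, which may be broader than intended. It selects individual text and tail strings, so no string ever spans two table cells; grouping the slots back into per-block requests, so that sentences are translated in context, is left out.

```python
import xml.etree.ElementTree as ET

SKIP = {"sourcecode", "code", "references", "bibdata"}
BIBDATA_EXCEPTIONS = {"title", "abstract"}


def text_slots(elem, skip=False, in_bibdata=False):
    """Yield (element, 'text' | 'tail') pairs whose strings may be translated."""
    if elem.tag in SKIP:
        skip = True
        in_bibdata = in_bibdata or elem.tag == "bibdata"
    elif in_bibdata and elem.tag in BIBDATA_EXCEPTIONS:
        skip = False
    if not skip and (elem.text or "").strip():
        yield elem, "text"
    for child in elem:
        yield from text_slots(child, skip, in_bibdata)
        # A child's tail is part of this element's content, so this element's
        # skip state applies: text after an inline <code> is still translated.
        if not skip and (child.tail or "").strip():
            yield child, "tail"


def translate_in_place(root, translate):
    """Apply translate to every eligible slot; attributes are never touched."""
    for elem, slot in list(text_slots(root)):
        setattr(elem, slot, translate(getattr(elem, slot)))
    return root
```

Because only element text and tails are rewritten, `target` attributes on `<xref>` stay as they are while the rendered link text is translated.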

@opoudjis opoudjis removed their assignment Jul 5, 2022
@opoudjis
Contributor

opoudjis commented Jul 5, 2022

Unassigning myself, I won't have time to do this, and I've outlined what needs doing

@ronaldtse
Contributor Author

I found that LibreTranslate is a pretty good model that can be run locally.

@opoudjis
Contributor

opoudjis commented Nov 6, 2023

DeepL also

@ghobona

ghobona commented Nov 6, 2023

Discussed with OGC Staff on 2023-11-06.

More research needed before identifying a path forward.

@opoudjis
Contributor

opoudjis commented Feb 15, 2024

Processing the input text in Asciidoctor format using coradoc is a more effective way forward.

@ghobona

ghobona commented Mar 19, 2024

@opoudjis to look into providing an example from some prior work.

@opoudjis
Contributor

The work is the samples of metanorma-jis that we have done, just to show that we support i18n for Japanese. The documents are jis-z-5999 and jis-z-8301-2019. Gobe would like to show these to his Japanese colleagues as proof of concept, but only if they are public documents. @ronaldtse please clarify status of documents.

@opoudjis
Contributor

Compiled an OGC standard using the JIS flavour of Metanorma with Japanese language for metalanguage, and sent to @ghobona as proof of concept.

@opoudjis
Contributor

Deprioritise, will depend on new infrastructure emerging, including translation tools.

Projects
Status: On hold