Output NISO STS XML format from an ISO OBP HTML #7

Open
2 tasks
ronaldtse opened this issue Apr 8, 2024 · 4 comments · May be fixed by #6
Labels
enhancement New feature or request

Comments

@ronaldtse
Contributor

ronaldtse commented Apr 8, 2024

From:

The ISO OBP HTML is actually rendered from data of an XML format called "NISO STS" (the ISO flavor of it).

Instead of just the HTML output, we also want to output the NISO XML format.

Use case

Some ISO authors have to start documents from the ISO website as they are unable to obtain the STS files.

On the ISO OBP, informative content, such as vocabulary, is freely available. The best approach is to give these authors an automated way to extract this data.

Mechanism

The steps shall be as follows:

  1. Download the index.html for a particular URN
  2. Convert index.html into an STS XML document using the code in PR #6 ("Add obp2sts command and related code to convert OBP HTML to STS XML"), then use the new sts gem to write it out as STS XML.

This Ruby script only partially parses index.html; it is not yet complete. It is provided in:

CLI:

$ bundle exec exe/obp-access -o output iso:std:iso:5598:ed-3:v1:en
$ bundle exec exe/obp2sts output/index.html

=> writes out:

  • output/index.html.xml: STS XML file generated by obp2sts
  • output/index.html.sts.xml: STS XML file generated by the sts gem given output/index.html.xml as input

Library:

stshtml = Obp::StsHtml.new('output/index.html')
stsxmltext = stshtml.clean.to_xml
sts = Sts::NisoSts::Standard.from_xml(stsxmltext)
puts sts.to_xml(pretty: true)

Work to be done

  • obp2sts: Ensure that the StsHtml class completely converts all content from HTML to STS
  • obp2sts: Ensure resulting STS file from StsHtml#to_xml is properly parseable by the sts gem (main branch)
@ronaldtse ronaldtse added the enhancement New feature or request label Apr 8, 2024
@roberthopman

roberthopman commented May 22, 2024

With ruby 3.3.0 on branch rt-obp2sts, running bundle exec exe/obp-access -o output iso:std:iso:5598:ed-3:v1:en and then bundle exec exe/obp2sts output/index.html produces an index.html.sts.xml containing a single line: <standard/>. Is this the expected output at the moment? @ronaldtse

@ronaldtse
Contributor Author

ronaldtse commented Jun 22, 2024

@roberthopman sorry for the delay in replying!

one line with <standard/>

No, it is supposed to provide content. Right now, the output is incorrect. The task is to fix the output.

So there are 3 steps:

  1. The input is output/index.html, which is the raw HTML fetched using the obp-access command. It is correct.
  2. The intermediary file is output/index.html.xml, which is "supposed" to be parseable by the sts gem. It does contain content, but is apparently incorrect and hence cannot be parsed by the sts gem.
  3. The final file is output/index.html.sts.xml, which is generated from output/index.html.xml. It is empty because it cannot read the intermediary file.
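To narrow down which stage fails, one option is to check each file for well-formedness and for an empty root. The following is a minimal sketch under stated assumptions: `inspect_stage` is a hypothetical helper, not part of obp-access or the sts gem, it uses REXML rather than the sts gem's own parser, and the inline XML strings are invented stand-ins for the real files.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of obp-access or the sts gem): reports
# whether an XML string is well-formed and whether its root is empty.
def inspect_stage(label, xml_text)
  root = REXML::Document.new(xml_text).root
  # An empty or whitespace-only input has no root element at all.
  return { label: label, well_formed: false, empty_root: nil } if root.nil?

  {
    label: label,
    well_formed: true,
    root: root.name,
    # "Empty" here means no child elements and no non-blank text.
    empty_root: !root.has_elements? && root.text.to_s.strip.empty?
  }
rescue REXML::ParseException
  { label: label, well_formed: false, empty_root: nil }
end

# Invented sample inputs standing in for output/index.html.xml and
# output/index.html.sts.xml.
intermediate = '<standard><front><title>Sample title</title></front></standard>'
final = '<standard/>'

puts inspect_stage('intermediate', intermediate)
puts inspect_stage('final', final)
```

A well-formed intermediate file combined with an empty final root would suggest that the problem lies in how the sts gem maps the elements, rather than in the XML syntax itself.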

@ronaldtse
Contributor Author

This is now updated in #6, which now creates a document structure using the sts gem.

It already does a reasonable transform of the HTML file into STS by declarative building.

There are a number of TODOs in the code:

  1. The mixed content elements, such as <p> and <sec>, do not fully retain their content; e.g. if there is an <std-id> or <i> inside the content, it will be lost. This is a general issue with Sts::Mapper, because I don't know how to actually use it properly. (ping @HassanAkbar )
  2. The "Terms and definitions" section needs additional treatment, see the sample document.
  3. The "Normative references" section needs treatment like the bibliography, which currently works to some extent (it does build a proper <ref-list> from ISO 5598, but I haven't tested it against other documents).
  4. Annexes are not handled right now.
  5. (Important for @HassanAkbar ) I can't get the sts.to_xml method to generate XML content except for <standard .../>. Please help.
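To illustrate the mixed-content problem in item 1, here is a small sketch that contrasts a lossy read with an order-preserving walk. It is an assumption-laden illustration only: it uses REXML instead of Sts::Mapper, the sample paragraph is invented, and `first_text_only` and `mixed_content` are hypothetical names.

```ruby
require 'rexml/document'

# Invented sample paragraph with inline <std-id> and <i> children.
SAMPLE = '<p>See <std-id>ISO 5598</std-id> for <i>vocabulary</i> terms.</p>'

# Lossy: reads only the first text node, so inline elements and any
# text after them are dropped.
def first_text_only(element)
  element.text.to_s
end

# Order-preserving: walks every child node and keeps text and elements
# in document order, recursing into nested elements.
def mixed_content(element)
  element.children.map do |node|
    case node
    when REXML::Element then [:element, node.name, mixed_content(node)]
    when REXML::Text then [:text, node.value]
    end
  end.compact
end

paragraph = REXML::Document.new(SAMPLE).root
puts first_text_only(paragraph).inspect
puts mixed_content(paragraph).inspect
```

Whatever mechanism the sts gem ends up using, the serialized output would need to keep this interleaved order of text and inline elements to round-trip such paragraphs.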

We're getting there.

@HassanAkbar
Member

@ronaldtse I have a few questions related to this

  1. There is no mapping for the <a> tag in the sts-ruby gem. Should we map these to ext-link or to something else? Some <a> tags are internal references; where should we map those?
  2. I was unable to find the mapping for entailedTerm-num in sts-ruby gem. Where should we map those?

Are there any guidelines or a mapping from all the HTML classes to sts-ruby classes? Or are there example documents that I can use as a reference for the expected output of the output/index.html.xml and output/index.html.sts.xml files?
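As a starting point for discussion on question 1, one possible convention could be sketched as follows. This is purely an assumption, not an established mapping in the sts-ruby gem: fragment-only links (href beginning with `#`) are treated as internal references and mapped to `xref` with a `rid` attribute, and everything else is mapped to `ext-link` with `xlink:href`. The function name `sts_target_for_anchor` is hypothetical.

```ruby
# Hypothetical classifier: suggests which STS element an HTML <a> might
# map to. The rule (fragment-only href => xref, anything else =>
# ext-link) is an assumption for discussion, not the sts gem's behaviour.
def sts_target_for_anchor(href)
  href = href.to_s.strip
  # Anchors without an href carry no link target to map.
  return { element: nil, attributes: {} } if href.empty?

  if href.start_with?('#')
    { element: 'xref', attributes: { 'rid' => href.delete_prefix('#') } }
  else
    { element: 'ext-link', attributes: { 'xlink:href' => href } }
  end
end

puts sts_target_for_anchor('#sec_3.1').inspect
puts sts_target_for_anchor('https://www.iso.org/').inspect
```

Relative links that point to other OBP pages would fall into the `ext-link` branch under this rule, which may or may not be what is wanted.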
