Create an analyzer that checks for simple, ignorable non-text changes #175

Open
Mr0grog opened this issue Mar 19, 2018 · 4 comments

@Mr0grog
Member

Mr0grog commented Mar 19, 2018

As a first test of everything needed to automatically rate a change’s significance and priority, let’s start with something simple that looks for changes we can pretty confidently say aren’t meaningful:

  • No changes to the page’s text (except whitespace changes and punctuation, like ')
  • Attribute changes that are not for title, alt, href, or src (any others?) are not important
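
The two rules above could be sketched roughly as follows. This is only an illustration using Python's standard library, not the project's actual analyzer: the names `is_ignorable_change` and `summarize` are hypothetical, and the set of ignored punctuation (ASCII punctuation plus curly quotes) is an assumption.

```python
import re
import string
from html.parser import HTMLParser

# Attributes treated as meaningful, per the list above.
SIGNIFICANT_ATTRIBUTES = {"title", "alt", "href", "src"}
# Assumption: ignore ASCII punctuation plus curly quotes.
IGNORED_CHARACTERS = string.punctuation + "\u2018\u2019\u201c\u201d"


class _Extractor(HTMLParser):
    """Collects page text and the values of significant attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.attributes = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        for name, value in attrs:
            if name in SIGNIFICANT_ATTRIBUTES:
                self.attributes.append((tag, name, value))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def summarize(html):
    """Return (normalized text, significant attributes) for a page."""
    parser = _Extractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps whitespace between elements from mattering.
    text = " ".join(parser.text_parts)
    text = text.translate(str.maketrans("", "", IGNORED_CHARACTERS))
    text = re.sub(r"\s+", " ", text).strip()
    return text, parser.attributes


def is_ignorable_change(old_html, new_html):
    """True if neither the text nor any significant attribute changed."""
    return summarize(old_html) == summarize(new_html)
```

With this sketch, a change to `class` or `id`, or a straight quote becoming a curly one, is reported as ignorable, while a change to an `href` or to any word is not. One known limitation: a tag inserted in the middle of a word would be treated as a text change.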

Example: https://monitoring.envirodatagov.org/page/b2b0b8cb-5e9b-4178-91c0-b8cb4466d2bd/b76dd1ab-a7aa-41d6-89f3-c45117a80dc5..2b55beed-db97-4249-b30a-600f61d94eb5

This is an easy analysis to do (and covers a lot of the kinds of changes I think we see), so it’s a good way to make sure we’ve built out:

  • a working pipeline for proactively analyzing new versions added to the DB
  • a style and format for organizing analysis code
@Mr0grog
Member Author

Mr0grog commented Aug 1, 2018

At this week’s analyst meeting, CAPTCHAs came up as another constantly changing thing that is hopefully easy to identify.

Also:

  • ASP.net postback/session data
  • Invisible form fields (would cover the above ASP.net stuff)
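
As a rough sketch of the hidden-field idea, the snippet below collects only `<input>` fields that are not `type="hidden"`, so that ASP.NET values such as `__VIEWSTATE` drop out of the comparison. The function names are hypothetical, and it only looks at `<input>` elements; fields hidden through CSS would need a different approach.

```python
from html.parser import HTMLParser


class _FieldCollector(HTMLParser):
    """Records (name, value) for every <input> that is not type="hidden"."""

    def __init__(self):
        super().__init__()
        self.visible_fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        # A valueless attribute is reported as None, so fall back to "".
        if (attributes.get("type") or "").lower() == "hidden":
            return
        self.visible_fields.append(
            (attributes.get("name"), attributes.get("value"))
        )


def visible_form_fields(html):
    """Return the non-hidden input fields found in the HTML."""
    collector = _FieldCollector()
    collector.feed(html)
    collector.close()
    return collector.visible_fields


def only_hidden_fields_differ(old_html, new_html):
    """True if the visible input fields are identical in both versions."""
    return visible_form_fields(old_html) == visible_form_fields(new_html)
```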

More far out:

  • Simple heuristics for identifying “related links” sections?
  • Allowing selectors for sections of the page to ignore as an argument?
    • To be usable, we need to add the ability to store a list of ignorable selectors in DB, but that’s separate work

We should probably turn this issue into an umbrella/epic issue for all these different ideas and pieces of work.

@Mr0grog
Member Author

Mr0grog commented Aug 30, 2018

From some BLM examples @jschell42 sent me:

  • Cache-breaking hashes/unique values in subresource URLs (e.g. for CSS, JS)

  • Changes to id, class, name attributes (and moving those attributes).

  • Changes to title attributes probably should be accounted for somehow, but a) are hard to see and b) probably aren’t a big deal (so they should only matter a tiny bit, if they matter at all).

  • Addition/removal of empty title or maybe any attribute? (Might need a special list of attributes that have meaning just by their presence, like checked.)

  • Maybe just anything that’s non-text/image?

  • Amount of textual change?

    • % of total words?
    • Simhash?
    • Zhang-shasha?
    • ?
  • <meta> modified date? e.g:

    <meta name="dcterms.modified" content="2018-06-11T11:58:04-04:00" />
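
For the cache-breaking values in subresource URLs mentioned at the top of this list, one possible approach is to normalize URLs before comparing them. The sketch below is a heuristic under stated assumptions: it drops the query string and fragment entirely and strips a hex run of 8 or more characters that sits just before a file extension. Both choices are guesses about what cache-breakers look like, and the function name is hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a cache-breaking hash is 8+ hex characters preceded by "." or "-"
# and followed by another "." (e.g. "site.3f2a9c1b.css", "app-0a1b2c3d4e5f.js").
HASH_SEGMENT = re.compile(r"[.-][0-9a-f]{8,}(?=\.)", re.IGNORECASE)


def normalize_subresource_url(url):
    """Strip likely cache-breaking values from a CSS/JS URL."""
    parts = urlsplit(url)
    path = HASH_SEGMENT.sub("", parts.path)
    # Assumption: the query string and fragment only carry cache-breakers,
    # so they are dropped. This is too aggressive for dynamic URLs.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

Two versions whose `<link>` and `<script>` URLs are equal after this normalization could then be treated as unchanged for those attributes.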

There’s definitely an interesting thing here I wasn’t thinking about before… we could make a big split in prioritization based simply on textual (+ images and such) content changes. I can see some super-useful annotation data we could display for analysts (especially in their sheets) like:

  • Did text change? y/n
  • % text change
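
A minimal sketch of those two annotations, assuming the page text has already been extracted as plain strings. It uses `difflib.SequenceMatcher` from the standard library on word lists, which is just one way to define "% text change" (the other candidates above, like Simhash, would give different numbers). The function name and the returned keys are hypothetical.

```python
import difflib


def text_change_summary(old_text, new_text):
    """Return whether the words changed and roughly what share of them did."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    # ratio() is 2 * matching words / total words across both versions,
    # so (1 - ratio) is the share of words that did not match.
    percent_changed = 100.0 * (1.0 - matcher.ratio())
    return {
        "text_changed": old_words != new_words,
        "percent_changed": percent_changed,
    }
```

Because the text is split on whitespace first, whitespace-only edits report no change.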

Some diffs for examples:

@stale

stale bot commented Mar 25, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

@stale stale bot added the stale label Mar 25, 2019
@Mr0grog Mr0grog added the pinned label Mar 25, 2019
@stale stale bot removed the stale label Mar 25, 2019
@Mr0grog
Member Author

Mr0grog commented Mar 27, 2019

Another example of something that should really be totally ignored: https://monitoring.envirodatagov.org/page/c4328d30-cada-452f-8642-4bff721f5fc2/9a448c37-9285-4107-9ffd-ea72214561a4..a8fab661-07bb-4409-92f7-f73deadf4e29 (change to class attribute)
