Slugify and truncate the default collection name #106

Open: 3 commits to be merged into base: main (showing changes from 2 commits)
14 changes: 14 additions & 0 deletions tests/__init__.py
@@ -1,3 +1,5 @@
import contextlib
import os
from typing import Any, Dict, Optional, Type

import pytest
@@ -20,3 +22,15 @@ def get_crawler(
    runner = CrawlerRunner(settings)
    crawler = runner.create_crawler(spider_cls)
    return crawler


# https://stackoverflow.com/a/34333710
@contextlib.contextmanager
def set_env(**environ):
    old_environ = dict(os.environ)
    os.environ.update(environ)
    try:
        yield
    finally:
        os.environ.clear()
        os.environ.update(old_environ)
Empty file added tests/incremental/__init__.py
Empty file.
46 changes: 44 additions & 2 deletions tests/incremental/test_collection_fp_manager.py
@@ -2,14 +2,20 @@
from unittest.mock import MagicMock, patch

import pytest
from scrapy import Spider
from scrapy.statscollectors import StatsCollector
from scrapy.utils.request import RequestFingerprinter
from scrapy.utils.test import get_crawler as _get_crawler
from twisted.internet.defer import Deferred, inlineCallbacks

from tests import get_crawler
from zyte_spider_templates._incremental.manager import CollectionsFingerprintsManager
from zyte_spider_templates._incremental.manager import (
    CollectionsFingerprintsManager,
    _get_collection_name,
)
from zyte_spider_templates.spiders.article import ArticleSpider

from .. import get_crawler, set_env


@pytest.fixture
def mock_crawler():
@@ -207,3 +213,39 @@ def test_spider_closed(mock_scrapinghub_client):
    fp_manager.save_batch = MagicMock(side_effect=fp_manager.save_batch)  # type: ignore
    fp_manager.spider_closed()
    fp_manager.save_batch.assert_called_once()


@pytest.mark.parametrize(
    ("env_vars", "settings", "spider_name", "collection_name"),
    (
        # INCREMENTAL_CRAWL_COLLECTION_NAME > SHUB_VIRTUAL_SPIDER > Spider.name
        # INCREMENTAL_CRAWL_COLLECTION_NAME is used as is; the others are
        # slugified, length-limited, and get an “_incremental” suffix.
        (
            {},
            {},
            "a A-1.α" + "a" * 2048,
            "a_A_1__" + "a" * (2048 - len("a_A_1_a_incremental")) + "_incremental",
        ),
        (
            {"SHUB_VIRTUAL_SPIDER": "a A-1.α" + "a" * 2048},
            {},
            "foo",
            "a_A_1__" + "a" * (2048 - len("a_A_1_a_incremental")) + "_incremental",
        ),
        (
            {"SHUB_VIRTUAL_SPIDER": "bar"},
            {"INCREMENTAL_CRAWL_COLLECTION_NAME": "a A-1.α" + "a" * 2048},
            "foo",
            "a A-1.α" + "a" * 2048,
        ),
    ),
)
def test_collection_name(env_vars, settings, spider_name, collection_name):
    class TestSpider(Spider):
        name = spider_name

    crawler = _get_crawler(settings_dict=settings, spidercls=TestSpider)
    crawler.spider = TestSpider()
    with set_env(**env_vars):
        assert _get_collection_name(crawler) == collection_name
17 changes: 10 additions & 7 deletions zyte_spider_templates/_incremental/manager.py
@@ -1,5 +1,6 @@
import asyncio
import logging
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Set, Tuple, Union
@@ -22,11 +23,19 @@
logger = logging.getLogger(__name__)

INCREMENTAL_SUFFIX = "_incremental"
_MAX_LENGTH = 2048 - len(INCREMENTAL_SUFFIX)
COLLECTION_API_URL = "https://storage.scrapinghub.com/collections"

THREAD_POOL_EXECUTOR = ThreadPoolExecutor(max_workers=10)


def _get_collection_name(crawler: Crawler) -> str:
    if name := crawler.settings.get("INCREMENTAL_CRAWL_COLLECTION_NAME"):
        return name
    name = get_spider_name(crawler).rstrip("_")[:_MAX_LENGTH] + INCREMENTAL_SUFFIX
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)

Contributor

This one may be problematic for another reason: all spiders whose names are entirely non-ASCII and of the same length would silently get the same collection name, and so would unintentionally reuse the same fingerprint DB.
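
To make the concern concrete, here is a minimal sketch that copies the expression from `_get_collection_name` in this diff into a standalone helper. The helper name `default_collection_name` and the sample spider names are illustrative only and are not part of the PR.

```python
import re

INCREMENTAL_SUFFIX = "_incremental"
_MAX_LENGTH = 2048 - len(INCREMENTAL_SUFFIX)


def default_collection_name(spider_name: str) -> str:
    # Same steps as the diff: strip trailing underscores, truncate,
    # append the suffix, then replace every disallowed character with "_".
    name = spider_name.rstrip("_")[:_MAX_LENGTH] + INCREMENTAL_SUFFIX
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


# Two different spider names, each made of three Greek letters.
name_a = default_collection_name("αβγ")
name_b = default_collection_name("δεζ")

# Both collapse to three underscores plus the suffix.
print(name_a, name_b, name_a == name_b)
```

Any two names that differ only in characters outside `[a-zA-Z0-9_]` collide in the same way, not just fully non-ASCII ones.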

Contributor Author
@Gallaecio, Dec 17, 2024

If we do not care about human readability of the collection names, we could replace characters with their UTF-8 hexadecimal (e.g. å → C3A5) or some other Unicode character ID.
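
A minimal sketch of the hexadecimal replacement described above, assuming the same allowed character set as the regular expression in this diff. `hex_escape` is a hypothetical helper name, not something in the PR.

```python
import re


def hex_escape(name: str) -> str:
    # Replace each disallowed character with the uppercase hexadecimal
    # form of its UTF-8 bytes, e.g. "å" (bytes C3 A5) becomes "C3A5".
    return re.sub(
        r"[^a-zA-Z0-9_]",
        lambda match: match.group().encode("utf-8").hex().upper(),
        name,
    )


print(hex_escape("å"))
print(hex_escape("αβγ"), hex_escape("δεζ"))
```

Names that previously collapsed to the same run of underscores now stay distinct. The mapping is still not collision-free, though: a spider literally named `C3A5` yields the same result as one named `å`, so a real implementation would probably need an escape marker, and the output can be several times longer than the input, which matters for the length limit.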

Otherwise, I think we are back to slugify or at least Unidecode, with the caveat that the results may change over time since we have no control over those libraries. We could pin them in zyte-spider-templates-project, but eventually some user may find this problematic.

We could also use the spider numeric ID, since collections are project-specific. We can extract it easily from the middle of the job ID. To keep backward-compatibility with 0.11, we could do this only on spiders with spider names that are invalid collection names.
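
The spider ID idea could look roughly like the sketch below. It assumes the job ID has the form `<project>/<spider>/<job>` and is passed in as a string; how the job ID is obtained at run time is left out. The function name and the `spider_` prefix are hypothetical, and the length limit is ignored for brevity.

```python
import re
from typing import Optional

INCREMENTAL_SUFFIX = "_incremental"
_VALID_NAME = re.compile(r"[a-zA-Z0-9_]+")


def collection_name_from_ids(spider_name: str, job_key: Optional[str]) -> str:
    # Valid names keep the 0.11 behavior: spider name plus suffix.
    if _VALID_NAME.fullmatch(spider_name):
        return spider_name + INCREMENTAL_SUFFIX
    if not job_key:
        raise ValueError("A job ID is needed when the spider name is invalid")
    # The spider ID is the middle component of the job ID.
    _project_id, spider_id, _job_id = job_key.split("/")
    return f"spider_{spider_id}{INCREMENTAL_SUFFIX}"


print(collection_name_from_ids("foo", None))
print(collection_name_from_ids("αβγ", "123/45/6"))
```

Because the fallback only applies to names that are not valid collection names, existing spiders with valid names would keep the collection they already use.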

Contributor

I also wonder what the restrictions on spider names are. The solution could also be to bring the formats closer together (e.g. add more validation for virtual spider names).

> Otherwise, I think we are back to slugify or at least Unidecode

Hm, using text-unidecode could be pretty stable. Unlike unidecode or python-slugify, it hasn't had a release in many, many years :)

Contributor Author
@Gallaecio, Dec 17, 2024

Oh, nice! And its maintainer feels familiar :)

Contributor

What do you think about just disabling the default name creation and requiring an explicit collection name?

Contributor Author
@Gallaecio, Dec 17, 2024

Good point. I think it is worth considering. While it is a slightly worse initial user experience, it is the most reliable approach: we would not need to worry about any long-term issues with the default name generation.

We could make it a backward-compatible change by making it required only in the UI with some JSON schema customization, and logging a deprecation error (because warnings are not visible in the UI easily) to encourage users to set it on spiders created with 0.11.

Contributor

I'm still unsure about requiring a collection name. It's reliable, but it's an additional step. I wonder how often users would face issues in practice with the current approach (unidecode), and whether it is worth an additional step.

What do you think about logging a warning if the collection name is not set explicitly, and the resulting name is mangled? With a suggestion to set it explicitly?

Contributor Author
@Gallaecio, Jan 13, 2025

> What do you think about logging a warning if the collection name is not set explicitly, and the resulting name is mangled? With a suggestion to set it explicitly?

That should help, yes. Though with the current UI, I wonder how many people will notice the warning. On the other hand, I expect far fewer people to actually run into an issue.
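
The warning discussed in this thread might be sketched as follows. This is only an illustration under the assumption that "mangled" means the sanitized name differs from the raw one; the function name and the message wording are hypothetical, and truncation is left out for brevity.

```python
import logging
import re

logger = logging.getLogger(__name__)

INCREMENTAL_SUFFIX = "_incremental"


def default_name_with_warning(spider_name: str) -> str:
    raw = spider_name + INCREMENTAL_SUFFIX
    safe = re.sub(r"[^a-zA-Z0-9_]", "_", raw)
    if safe != raw:
        # Only warn when the generated default had to be altered.
        logger.warning(
            "The default collection name %r was changed to %r. Consider "
            "setting INCREMENTAL_CRAWL_COLLECTION_NAME explicitly.",
            raw,
            safe,
        )
    return safe


print(default_name_with_warning("foo"))
print(default_name_with_warning("a b"))
```

Whether this is logged as a warning or as an error is a separate choice; the thread notes that warnings are not easy to see in the UI.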



class CollectionsFingerprintsManager:
    def __init__(self, crawler: Crawler) -> None:
        self.writer = None
@@ -37,7 +46,7 @@ def __init__(self, crawler: Crawler) -> None:
        self.batch_size = crawler.settings.getint("INCREMENTAL_CRAWL_BATCH_SIZE", 50)

        project_id = get_project_id(crawler)
        collection_name = self.get_collection_name(crawler)
        collection_name = _get_collection_name(crawler)

        self.init_collection(project_id, collection_name)
        self.api_url = f"{COLLECTION_API_URL}/{project_id}/s/{collection_name}"
@@ -51,12 +60,6 @@ def __init__(self, crawler: Crawler) -> None:

        crawler.signals.connect(self.spider_closed, signal=signals.spider_closed)

    def get_collection_name(self, crawler):
        return (
            crawler.settings.get("INCREMENTAL_CRAWL_COLLECTION_NAME")
            or f"{get_spider_name(crawler)}{INCREMENTAL_SUFFIX}"
        )

    def init_collection(self, project_id, collection_name) -> None:
        client = get_client()
        collection = client.get_project(project_id).collections.get_store(