Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(wren-ai-service): Embed the SQL pairs in MDL #1082

Merged
merged 19 commits into from
Jan 10, 2025
Merged

Conversation

paopa
Copy link
Member

@paopa paopa commented Jan 2, 2025

Changes

  1. Pipeline Simplification

    • Removed SQL intention generation using LLM
    • Simplified document storage to use direct question-SQL pairs
    • Updated pipeline configuration to remove unnecessary LLM dependency
  2. Configuration Changes

    • Added SQL_PAIRS_PATH environment variable
    • Added pairs.json configuration file support
    • Updated deployment configurations to include new file mounts
  3. Sample Management

    • Added structured JSON format for storing SQL query examples
    • Updated sample format to use question-SQL pairs
    • Improved sample integration in SQL generation prompts
  4. Testing

    • Updated existing tests to reflect new pipeline structure
    • Added test data for pairs.json
    • Temporarily skipped outdated tests pending updates

Related Changes

  • Updated docker compose configurations
  • Modified kustomization files for deployment
  • Updated configuration examples
  • Added new utility tool for MDL string conversion

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a new SQL pairs indexing pipeline for processing and storing SQL query examples
    • Introduced a tool to convert Model Definition Language (MDL) to string format
    • Enhanced SQL generation with support for sample query examples
  • Changes

    • Renamed sql_pairs_preparation pipeline to sql_pairs_indexing
    • Updated configuration files to reflect pipeline changes
    • Removed LLM-specific configuration from SQL pairs pipeline
  • Improvements

    • Streamlined SQL pairs handling across multiple service components
    • Added more flexible configuration for SQL pair processing
  • Testing

    • Updated test cases to align with new SQL pairs indexing pipeline
    • Added sample SQL pairs test data

Copy link
Contributor

coderabbitai bot commented Jan 2, 2025

Walkthrough

This pull request introduces significant changes to the Wren AI service, focusing on SQL pairs indexing and processing. The modifications span multiple files, introducing a new SqlPairs pipeline to replace the previous SqlPairsPreparation approach. Key changes include renaming pipeline stages, updating configuration files, adding new utility functions, and restructuring how SQL pairs are handled throughout the service. The changes aim to improve the flexibility and clarity of SQL pair management, with a more streamlined approach to indexing and processing.

Changes

File Path Change Summary
deployment/kustomizations/base/cm.yaml Renamed pipeline step from sql_pairs_preparation to sql_pairs_indexing
docker/config.example.yaml Updated pipeline configuration, removed llm from sql_pairs_indexing
wren-ai-service/src/config.py Added new configuration field sql_pairs_path
wren-ai-service/src/globals.py Updated service container with new SQL pairs handling
wren-ai-service/src/pipelines/indexing/ Introduced new SqlPairs pipeline, removed SqlPairsPreparation
wren-ai-service/tests/ Updated test files to match new SQL pairs indexing approach
wren-ai-service/tools/config/ Updated configuration files with new pipeline naming

Sequence Diagram

sequenceDiagram
    participant User
    participant SqlPairs
    participant DocumentConverter
    participant Embedder
    participant DocumentStore
    participant Cleaner

    User->>SqlPairs: Run with MDL string
    SqlPairs->>DocumentConverter: Convert SQL pairs to documents
    DocumentConverter-->>SqlPairs: Return documents
    SqlPairs->>Embedder: Embed documents
    Embedder-->>SqlPairs: Return embedded documents
    SqlPairs->>Cleaner: Clean SQL pairs
    Cleaner-->>SqlPairs: Return cleaned pairs
    SqlPairs->>DocumentStore: Write documents
Loading

Poem

🐰 SQL Pairs Dancing Free

In pipelines of code, a rabbit's glee
From preparation to indexing we leap
Transforming queries with magical sweep
No LLM needed, just pure SQL spree!

🔍✨

Finishing Touches

  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@paopa paopa added module/ai-service ai-service related ci/ai-service ai-service related labels Jan 2, 2025
@paopa paopa force-pushed the feat/embed-sql-pairs branch 5 times, most recently from 7bd856c to 75fdd82 Compare January 9, 2025 04:01
@paopa paopa marked this pull request as ready for review January 9, 2025 04:09
@paopa paopa force-pushed the feat/embed-sql-pairs branch from 75fdd82 to 8d9a9d4 Compare January 9, 2025 04:10
@paopa paopa requested a review from cyyeh January 9, 2025 04:22
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🔭 Outside diff range comments (3)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (2)

Line range hint 16-23: Maintain consistent parameter naming across the codebase.

The parameter id should be renamed to project_id for consistency with the changes in other files and the call to sql_pairs_cleaner.run.

 async def delete_sql_pairs(
     sql_pairs_cleaner: SqlPairsCleaner,
     sql_pair_ids: List[str],
-    id: Optional[str] = None,
+    project_id: Optional[str] = None,
 ) -> None:
-    return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=id)
+    return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=project_id)

Line range hint 44-54: Maintain consistent parameter naming in the pipeline class.

The run method still uses id parameter while the rest of the codebase has moved to project_id.

     async def run(
-        self, sql_pair_ids: List[str], id: Optional[str] = None
+        self, sql_pair_ids: List[str], project_id: Optional[str] = None
     ) -> Dict[str, Any]:
         logger.info("SQL Pairs Deletion pipeline is running...")
         return await self._pipe.execute(
             ["delete_sql_pairs"],
             inputs={
                 "sql_pair_ids": sql_pair_ids,
-                "id": id or "",
+                "project_id": project_id or "",
                 **self._components,
             },
         )
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)

Line range hint 37-37: Update test function name for consistency.

The test function name still contains "preparation" while others have been updated to "indexing".

-async def test_sql_pairs_preparation_saving_to_document_store_with_multiple_project_ids():
+async def test_sql_pairs_indexing_saving_to_document_store_with_multiple_project_ids():
🧹 Nitpick comments (12)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

184-188: Avoid hard-coded test data in the main block

In the __main__ block used for a dry run, the mdl_str is hard-coded with sample data. While this is acceptable for testing, consider accepting mdl_str as a command-line argument or reading from a file to make the script more flexible.

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)

20-25: Use consistent project IDs in tests

In the test, you use project_id="fake-id" when running sql_pairs.run, but later use id="fake-id-2" when running sql_pairs_deletion.run. This might lead to confusion as the IDs do not match, and could affect the accuracy of the test.

Consider using the same project_id throughout the test to ensure consistency:

-    await sql_pairs_deletion.run(id="fake-id-2", sql_pair_ids=["1", "2"])
+    await sql_pairs_deletion.run(id="fake-id", sql_pair_ids=["1", "2"])
wren-ai-service/tools/mdl_to_str.py (1)

24-29: Simplify string escaping using 'json.dumps'

The manual replacement of backslashes and double quotes can be error-prone. Consider using the built-in json.dumps function with the ensure_ascii=False and indent=None parameters for proper string serialization.

Replace the to_str function implementation with:

import json

def to_str(mdl: dict) -> str:
    return json.dumps(mdl, ensure_ascii=False)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1)

29-29: Track technical debt with an issue.

The TODO comment suggests removing this pipeline. This should be tracked in the issue tracker for proper follow-up.

Would you like me to create a GitHub issue to track this technical debt?

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)

Line range hint 10-34: Reduce test code duplication.

Both test functions share similar setup and assertion logic. Consider extracting common setup into fixtures and shared assertion helpers.

@pytest.fixture
async def sql_pairs_store():
    pipe_components = generate_components(settings.components)
    document_store_provider = pipe_components["sql_pairs_indexing"]["document_store_provider"]
    store = document_store_provider.get_store(
        dataset_name="sql_pairs",
        recreate_index=True,
    )
    return store, pipe_components

async def assert_documents(store, expected_count, project_id=None):
    assert await store.count_documents() == expected_count
    filters = None
    if project_id:
        filters = {
            "operator": "AND",
            "conditions": [
                {"field": "project_id", "operator": "==", "value": project_id},
            ],
        }
    documents = store.filter_documents(filters=filters)
    for document in documents:
        assert document.content, "content should not be empty"
        assert document.meta, "meta should not be empty"
        assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta"
        assert document.meta.get("sql"), "sql should be in meta"

Also applies to: 37-71


24-25: Use test constants for better maintainability.

Hard-coded MDL strings and test data paths should be moved to constants or test fixtures.

TEST_DATA = {
    'MDL_STR': '{"models": [{"properties": {"boilerplate": "test"}}]}',
    'SQL_PAIRS_PATH': "tests/data/pairs.json"
}

Also applies to: 52-53, 57-58

wren-ai-service/Justfile (2)

73-74: Add documentation for the new target.

The mdl-to-str target lacks documentation about its purpose and usage. Consider adding a comment explaining what it does and how to use it.

+# Convert MDL JSON file to a properly escaped string format
+# Usage: just mdl-to-str path/to/mdl.json
 mdl-to-str mdl_path="":
     poetry run python tools/mdl_to_str.py -p {{mdl_path}}

73-74: Add parameter validation.

The target should validate that mdl_path is provided and exists before running the Python script.

 mdl-to-str mdl_path="":
+    #!/usr/bin/env bash
+    if [ -z "{{mdl_path}}" ]; then
+        echo "Error: mdl_path is required"
+        exit 1
+    fi
+    if [ ! -f "{{mdl_path}}" ]; then
+        echo "Error: File {{mdl_path}} does not exist"
+        exit 1
+    fi
     poetry run python tools/mdl_to_str.py -p {{mdl_path}}
wren-ai-service/src/config.py (1)

57-57: Add documentation for the new configuration field.

The sql_pairs_path field should be documented to explain its purpose, expected format, and usage.

+    # Path to the JSON file containing SQL pairs for embedding in MDL
+    # The file should contain an array of SQL pair objects
     sql_pairs_path: str = Field(default="pairs.json")
wren-ai-service/src/pipelines/generation/sql_generation.py (1)

35-43: Consider adding sample validation.

While the SQL samples integration looks good, consider adding validation for the sample format to ensure each sample contains both question and sql fields.

@observe(capture_input=False)
def prompt(
    query: str,
    documents: List[str],
    exclude: List[Dict],
    text_to_sql_rules: str,
    prompt_builder: PromptBuilder,
    configuration: Configuration | None = None,
    sql_samples: List[Dict] | None = None,
) -> dict:
+    if sql_samples:
+        for sample in sql_samples:
+            if not all(key in sample for key in ['question', 'sql']):
+                raise ValueError("Each SQL sample must contain 'question' and 'sql' fields")
    return prompt_builder.run(
        query=query,
        documents=documents,
        exclude=exclude,
        text_to_sql_rules=text_to_sql_rules,
        instructions=construct_instructions(configuration),
        sql_samples=sql_samples,
        current_time=configuration.show_current_time(),
    )
wren-ai-service/src/globals.py (1)

227-228: Verify pipeline component naming consistency.

The pipeline key is "sql_pairs_preparation" but uses components from "sql_pairs_indexing". Consider aligning these names for better maintainability.

-                "sql_pairs_preparation": indexing.SqlPairs(
-                    **pipe_components["sql_pairs_indexing"],
+                "sql_pairs_indexing": indexing.SqlPairs(
+                    **pipe_components["sql_pairs_indexing"],
deployment/kustomizations/base/cm.yaml (1)

185-188: Consider documenting the pairs.json structure.

The pairs.json is initialized with an empty sample array. Consider adding:

  1. Documentation about the expected structure of the array
  2. Example entries to illustrate the format
   pairs.json: |
     {
+      # Array of SQL pairs for indexing
+      # Example format:
+      # "sample": [
+      #   {
+      #     "question": "What is the total sales?",
+      #     "sql": "SELECT SUM(sales) FROM transactions"
+      #   }
+      # ]
       "sample": []
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fa29ed and 8d9a9d4.

📒 Files selected for processing (24)
  • deployment/kustomizations/base/cm.yaml (2 hunks)
  • deployment/kustomizations/base/deploy-wren-ai-service.yaml (2 hunks)
  • docker/config.example.yaml (1 hunks)
  • docker/docker-compose-dev.yaml (1 hunks)
  • docker/docker-compose.yaml (1 hunks)
  • wren-ai-service/Justfile (1 hunks)
  • wren-ai-service/src/config.py (1 hunks)
  • wren-ai-service/src/globals.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
  • wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tests/data/pairs.json (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
  • wren-ai-service/tools/mdl_to_str.py (1 hunks)
💤 Files with no reviewable changes (1)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
✅ Files skipped from review due to trivial changes (3)
  • wren-ai-service/tests/data/pairs.json
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
  • wren-ai-service/src/pipelines/generation/utils/sql.py
🔇 Additional comments (18)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

73-76: Ensure all SQL pairs have valid 'id' fields

When creating SqlPair instances, pair.get("id") may return None if the 'id' key is missing in the pair. Since id is a required field of type str in the SqlPair model, this might raise validation errors.

Please ensure that all entries in external_pairs have an 'id' field. If 'id' might be missing, consider providing a default value or handling missing IDs appropriately.

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

5-5: Updated imports reflect changes in the pipeline structure

The import statement correctly replaces SqlPairsPreparation with SqlPairs, aligning with the updated pipeline naming conventions.


29-32: Verify deletion operation in test

After running sql_pairs_deletion.run, the test asserts that the document count in the store has decreased. Ensure that the test correctly verifies the deletion operation by checking the actual contents of the store.

As a follow-up, you can enhance the test by retrieving the remaining documents and asserting that the deleted documents are no longer present.

wren-ai-service/src/pipelines/indexing/__init__.py (3)

96-98: LGTM! Parameter rename improves clarity.

The parameter rename from id to project_id enhances code readability and maintains consistency with the project's naming conventions.


106-108: LGTM! Filter condition is correctly updated.

The filter condition correctly uses the renamed project_id parameter, maintaining the logical integrity of the document filtering.


117-117: LGTM! Module exports are properly updated.

The import and __all__ list changes correctly reflect the refactoring from SqlPairsPreparation to SqlPairs.

Also applies to: 126-126

wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1)

28-28: LGTM! Key rename improves semantic clarity.

The change from "summary" to "question" better represents the actual content being formatted and aligns with the SQL pairs domain terminology.

wren-ai-service/src/web/v1/services/semantics_preparation.py (1)

82-87: LGTM! SQL pairs pipeline task properly integrated.

The addition of "sql_pairs" to the pipeline tasks ensures proper integration with the semantics preparation process while maintaining concurrent execution through asyncio.gather().

wren-ai-service/src/pipelines/generation/sql_generation.py (1)

35-43: LGTM! SQL samples section enhances LLM guidance.

The addition of the SQL samples section in the prompt template provides valuable examples to guide the LLM in generating more accurate SQL queries.

Consider the impact on token limits when many samples are provided. Run this script to analyze the template size:

✅ Verification successful

Token limits are properly handled by the system architecture

The system is well-designed with built-in safeguards:

  • Uses GPT-4's standard 4096 token limit
  • Retrieves relevant SQL samples through a dedicated pipeline
  • Conditionally includes samples in the template
  • Has default sample limits in place
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Calculate approximate token count for the template
echo 'Analyzing template size...'
rg -A 10 'sql_generation_user_prompt_template = """' | wc -w

Length of output: 128


Script:

#!/bin/bash
# Get the complete template content
echo "=== Complete Template ==="
rg -U "sql_generation_user_prompt_template = \"\"\".*?\"\"\"" -A 50

echo -e "\n=== Token Limit Configurations ==="
rg -i "token.*limit|max.*token"

echo -e "\n=== Sample Handling Logic ==="
rg -l "sql_samples" | xargs ast-grep --pattern 'def $FUNC($$$) { $$$ sql_samples $$$ }'

Length of output: 1685


Script:

#!/bin/bash
# Look for sample selection/limiting logic
echo "=== Sample Selection Logic ==="
rg -A 10 "def.*sql_samples"

echo -e "\n=== Sample List Creation ==="
rg -B 5 -A 5 "sql_samples.*=.*"

Length of output: 9163


Script:

#!/bin/bash
# Look for sample limit configurations
echo "=== Sample Limit Configurations ==="
rg -i "sample.*limit|limit.*sample|top.*sample|max.*sample"

echo -e "\n=== SQL Pairs Retrieval Logic ==="
rg -A 10 "sql_pairs_retrieval.*run"

Length of output: 2337

wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2)

94-94: Address TODO comment before merging.

The comment indicates that the SQL pairs preparation functionality needs proper implementation. This should be addressed before merging to ensure complete functionality.

Would you like me to help implement the proper SQL pairs preparation functionality or create a GitHub issue to track this task?


96-97: Verify parameter renaming impact.

The parameter rename from id to project_id improves naming consistency, but we should verify all callers have been updated.

✅ Verification successful

Parameter rename verification complete - all callers updated

The parameter rename from 'id' to 'project_id' has been properly implemented across the codebase. All relevant callers have been updated to use the new parameter name. The remaining uses of 'id' in the codebase are for different purposes (like sql_pair.id or sql_pairs_preparation_id) and not related to this parameter rename.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining uses of the old parameter name
rg "id=" -A 5 -B 5

Length of output: 117072


Script:

#!/bin/bash
# Search for function calls with id parameter
rg "id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3

# Search for function calls with project_id parameter
rg "project_id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3

# Use ast-grep to find function calls with id parameter
ast-grep --pattern 'id=$_' -l py

# Use ast-grep to find function calls with project_id parameter 
ast-grep --pattern 'project_id=$_' -l py

Length of output: 1416


Script:

#!/bin/bash
# Search for function calls with id parameter in sql_pairs related files
rg ".*\(.*id=.*\)" --type py -g "*sql_pairs*.py" -A 3 -B 3

# Search for function calls with project_id parameter in sql_pairs related files
rg ".*\(.*project_id=.*\)" --type py -g "*sql_pairs*.py" -A 3 -B 3

# Search for any remaining uses of id parameter in the codebase
rg "\bid=(?![\w-])" --type py -g "!*test*" -A 3 -B 3

# Search for any remaining uses of project_id parameter in the codebase
rg "\bproject_id=(?![\w-])" --type py -g "!*test*" -A 3 -B 3

Length of output: 10090

wren-ai-service/src/globals.py (1)

79-82: LGTM! Well-structured pipeline integration.

The SQL pairs pipeline component is properly integrated with the settings and follows the established pattern for pipeline configuration.

docker/docker-compose-dev.yaml (1)

48-48: LGTM! Proper configuration for SQL pairs.

The environment variable and volume mapping for SQL pairs are correctly configured and follow the established pattern.

Also applies to: 53-53

deployment/kustomizations/base/deploy-wren-ai-service.yaml (1)

66-67: LGTM! Consistent deployment configuration.

The environment variable and config-volume configuration for SQL pairs are properly set up and maintain consistency with the docker-compose configuration.

Also applies to: 77-78

docker/docker-compose.yaml (1)

59-59: LGTM! Environment variable and volume mapping for SQL pairs configuration.

The configuration follows Docker best practices by properly mapping the configuration file and setting up the corresponding environment variable.

Also applies to: 64-64

docker/config.example.yaml (1)

Line range hint 107-115: Verify the pipeline dependencies and configuration.

The SQL pairs pipeline components are well-structured with appropriate dependencies. However, let's verify the pipeline configuration usage across the codebase.

✅ Verification successful

Old references to sql_pairs_preparation exist but don't impact functionality

The pipeline components are correctly configured and used throughout the codebase. The references to sql_pairs_preparation only exist in test files and service names, not in the actual pipeline configuration or usage. No changes are required.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the pipeline configuration usage

# Check for pipeline component references
rg -A 3 "sql_pairs_(indexing|deletion|retrieval)"

# Check for any potential old references to sql_pairs_preparation
rg "sql_pairs_preparation"

Length of output: 20474

🧰 Tools
🪛 yamllint (1.35.1)

[error] 112-112: trailing spaces

(trailing-spaces)

wren-ai-service/tools/config/config.example.yaml (1)

Line range hint 125-133: Configuration is consistent with docker/config.example.yaml

🧰 Tools
🪛 yamllint (1.35.1)

[error] 130-130: trailing spaces

(trailing-spaces)

wren-ai-service/tools/config/config.full.yaml (1)

Line range hint 144-152: Verify the removal of LLM dependency from sql_pairs_indexing.

The LLM dependency has been removed from the sql_pairs_indexing pipeline while being retained in sql_pairs_retrieval. Please confirm if this architectural change is intentional and doesn't impact the indexing functionality.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 149-149: trailing spaces

(trailing-spaces)

wren-ai-service/src/pipelines/indexing/sql_pairs.py Outdated Show resolved Hide resolved
wren-ai-service/src/pipelines/indexing/sql_pairs.py Outdated Show resolved Hide resolved
wren-ai-service/tools/mdl_to_str.py Outdated Show resolved Hide resolved
@paopa paopa force-pushed the feat/embed-sql-pairs branch from 8d9a9d4 to b165de2 Compare January 9, 2025 05:46
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)

20-25: Consider using a test fixture for the pairs.json path.

Instead of hardcoding the test data path, consider:

  1. Moving it to a test fixture or constant
  2. Using a temporary file for isolation
    This would make the tests more maintainable and prevent potential issues with file locations in different environments.
+# At the top of the file
+TEST_SQL_PAIRS_PATH = "tests/data/pairs.json"

-        **pipe_components["sql_pairs_indexing"], sql_pairs_path="tests/data/pairs.json"
+        **pipe_components["sql_pairs_indexing"], sql_pairs_path=TEST_SQL_PAIRS_PATH
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)

Line range hint 10-34: Enhance test coverage for SQL pairs indexing.

While the test verifies basic document storage, consider adding assertions for:

  1. MDL string integration
  2. Project ID in metadata
  3. Specific content from pairs.json
     assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta"
     assert document.meta.get("sql"), "sql should be in meta"
+    assert document.meta.get("project_id") == "fake-id", "project_id should match"
+    # Verify MDL integration
+    assert document.meta.get("mdl") is not None, "mdl should be in meta"

Line range hint 51-71: Enhance multi-project ID test coverage.

The test could be more thorough in verifying the indexed documents:

  1. Add assertions for both project IDs
  2. Verify that documents are correctly associated with their respective projects
  3. Check that MDL strings are properly stored
     assert len(documents) == 2
+    # Verify documents for second project
+    documents_2 = store.filter_documents(
+        filters={
+            "operator": "AND",
+            "conditions": [
+                {"field": "project_id", "operator": "==", "value": "fake-id-2"},
+            ],
+        }
+    )
+    assert len(documents_2) == 2
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d9a9d4 and b165de2.

📒 Files selected for processing (24)
  • deployment/kustomizations/base/cm.yaml (2 hunks)
  • deployment/kustomizations/base/deploy-wren-ai-service.yaml (2 hunks)
  • docker/config.example.yaml (1 hunks)
  • docker/docker-compose-dev.yaml (1 hunks)
  • docker/docker-compose.yaml (1 hunks)
  • wren-ai-service/Justfile (1 hunks)
  • wren-ai-service/src/config.py (1 hunks)
  • wren-ai-service/src/globals.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
  • wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tests/data/pairs.json (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
  • wren-ai-service/tools/mdl_to_str.py (1 hunks)
💤 Files with no reviewable changes (1)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
🚧 Files skipped from review as they are similar to previous changes (20)
  • wren-ai-service/src/config.py
  • deployment/kustomizations/base/deploy-wren-ai-service.yaml
  • wren-ai-service/src/pipelines/generation/utils/sql.py
  • wren-ai-service/Justfile
  • wren-ai-service/tests/data/pairs.json
  • wren-ai-service/tools/config/config.example.yaml
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
  • docker/config.example.yaml
  • deployment/kustomizations/base/cm.yaml
  • wren-ai-service/src/web/v1/services/semantics_preparation.py
  • docker/docker-compose.yaml
  • wren-ai-service/tools/config/config.full.yaml
  • docker/docker-compose-dev.yaml
  • wren-ai-service/tools/mdl_to_str.py
  • wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
  • wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
  • wren-ai-service/src/pipelines/generation/sql_generation.py
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
  • wren-ai-service/src/pipelines/indexing/init.py
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py
🔇 Additional comments (4)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

5-5: LGTM! Component renaming reflects the functionality better.

The change from "preparation" to "indexing" better represents the actual functionality of storing and managing SQL pairs.

Also applies to: 13-13


29-32: 🛠️ Refactor suggestion

Enhance test robustness for SQL pair deletion.

The test uses hardcoded SQL pair IDs without verifying their existence in pairs.json. Consider:

  1. Reading actual IDs from the test data file
  2. Adding assertions to verify the specific documents were deleted

Let's verify the content of pairs.json:

wren-ai-service/src/globals.py (2)

79-82: LGTM! Clean integration of SQL pairs indexing.

The addition of the SQL pairs pipeline with configuration from settings is well-structured and consistent with the overall changes.


227-228: Consider renaming the preparation service for consistency.

While the component has been updated to use indexing.SqlPairs, the service class name SqlPairsPreparationService still contains "preparation". Consider renaming it to maintain consistency with the new indexing terminology.

-        sql_pairs_preparation_service=SqlPairsPreparationService(
+        sql_pairs_indexing_service=SqlPairsIndexingService(

Note: This change would require updating the service class name and its usages across the codebase.

Let's check for other occurrences:

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)

Line range hint 38-71: Consider adding negative test cases.

While the test covers multiple project IDs well, consider adding test cases for:

  • Invalid MDL string format
  • Non-existent project IDs
  • Empty or malformed SQL pairs file

Line range hint 1-71: Consider separating test data setup.

To improve test maintainability, consider:

  1. Moving the test data setup to a fixture
  2. Creating a dedicated test data factory for generating MDL strings
  3. Using parametrized tests for different SQL pair scenarios
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b165de2 and 13f4ff0.

📒 Files selected for processing (2)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: pytest
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (3)

5-5: LGTM! Import and component changes align with the pipeline refactoring.

The changes from SqlPairsPreparation to SqlPairs and the component renaming to "sql_pairs_indexing" maintain consistency with the broader architectural changes.

Also applies to: 13-13


10-10: Function name updated for consistency.

The function name has been updated from "preparation" to "indexing" as previously suggested.


20-25: Verify the test data file existence and format.

The test relies on tests/data/pairs.json. Let's verify its existence and format.

✅ Verification successful

Test data file exists and is properly formatted

The file exists at "wren-ai-service/tests/data/pairs.json" with valid JSON structure containing appropriate test data for SQL pairs testing.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if the test data file exists and validate its JSON structure
if [ -f "tests/data/pairs.json" ]; then
    echo "File exists. Validating JSON structure..."
    jq '.' tests/data/pairs.json
else
    echo "Error: tests/data/pairs.json not found"
    exit 1
fi

Length of output: 130


Script:

#!/bin/bash
# Search for any pairs.json files in the repository
echo "Searching for pairs.json files:"
fd pairs.json

echo -e "\nSearching for any JSON files in tests directory:"
fd -e json . tests/

echo -e "\nSearching for other test files that might use pairs.json:"
rg -l "pairs.json" tests/

Length of output: 559


Script:

#!/bin/bash
# Validate the JSON structure of the found file
echo "Validating JSON structure of the found file:"
jq '.' wren-ai-service/tests/data/pairs.json

Length of output: 372

Copy link
Member

@cyyeh cyyeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

docker/docker-compose.yaml Outdated Show resolved Hide resolved
@paopa paopa force-pushed the feat/embed-sql-pairs branch from 13f4ff0 to 81f13bf Compare January 10, 2025 03:41
Copy link
Member

@cyyeh cyyeh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)

38-58: Update documentation strings in test functions

The docstrings or comments within the test functions still refer to the old sql_pairs_preparation naming. Consider updating any internal documentation to match the new naming convention for clarity.

Apply this diff to update the comments:

# Existing comments or docstrings, if any

Note: Since the provided code does not include comments, ensure that any existing comments are updated accordingly.

deployment/kustomizations/base/cm.yaml (1)

184-184: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML best practices.

🧰 Tools
🪛 yamllint (1.35.1)

[error] 184-184: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13f4ff0 and 81f13bf.

📒 Files selected for processing (21)
  • deployment/kustomizations/base/cm.yaml (2 hunks)
  • docker/config.example.yaml (1 hunks)
  • wren-ai-service/Justfile (1 hunks)
  • wren-ai-service/src/config.py (1 hunks)
  • wren-ai-service/src/globals.py (2 hunks)
  • wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
  • wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
  • wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
  • wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tests/data/pairs.json (1 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
  • wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
  • wren-ai-service/tools/config/config.example.yaml (1 hunks)
  • wren-ai-service/tools/config/config.full.yaml (1 hunks)
  • wren-ai-service/tools/mdl_to_str.py (1 hunks)
💤 Files with no reviewable changes (1)
  • wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py
🚧 Files skipped from review as they are similar to previous changes (15)
  • wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
  • docker/config.example.yaml
  • wren-ai-service/tools/config/config.example.yaml
  • wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
  • wren-ai-service/src/web/v1/services/semantics_preparation.py
  • wren-ai-service/Justfile
  • wren-ai-service/tests/data/pairs.json
  • wren-ai-service/tools/mdl_to_str.py
  • wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
  • wren-ai-service/src/pipelines/generation/utils/sql.py
  • wren-ai-service/tools/config/config.full.yaml
  • wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
  • wren-ai-service/src/pipelines/generation/sql_generation.py
  • wren-ai-service/src/config.py
  • wren-ai-service/src/pipelines/indexing/init.py
🧰 Additional context used
🪛 yamllint (1.35.1)
deployment/kustomizations/base/cm.yaml

[error] 184-184: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: pytest
  • GitHub Check: Analyze (go)
🔇 Additional comments (7)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

54-57: ⚠️ Potential issue

Add exception handling for JSON parsing errors in boilerplates function

The function boilerplates parses mdl_str without handling potential JSON decoding exceptions. If mdl_str contains invalid JSON, orjson.loads will raise an exception, which could cause the pipeline to fail unexpectedly.

Apply this diff to handle JSON parsing errors gracefully:

def boilerplates(
    mdl_str: str,
) -> Set[str]:
+   try:
        mdl = orjson.loads(mdl_str)
+   except orjson.JSONDecodeError as e:
+       logger.error(f"Error parsing MDL JSON string: {e}")
+       return set()

Likely invalid or redundant comment.

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

20-25: Update test to match new SqlPairs initialization parameters

The instantiation of SqlPairs now includes sql_pairs_path, but the project_id parameter is passed during the run method call. Ensure that the test accurately reflects the new initialization and method signatures.


29-32: Verify SQL pair IDs correspond to the correct project_id

In the deletion test, SQL pair IDs ["1", "2"] are used with id="fake-id-2", but the documents were added with project_id="fake-id". Ensure that the deletion call uses the correct project_id to accurately test the deletion functionality.

Consider updating the deletion call to match the project_id used during insertion:

await sql_pairs_deletion.run(
-   id="fake-id-2",
+   id="fake-id",
    sql_pair_ids=["1", "2"]
)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)

10-25: Ensure consistent naming of test functions

The test function name has been updated to test_sql_pairs_indexing_saving_to_document_store, which improves consistency with the new SqlPairs class. Good job on updating the naming to reflect the changes.

deployment/kustomizations/base/cm.yaml (1)

155-157: LGTM! SQL pairs pipeline configuration is well-structured.

The new pipeline components are properly configured with appropriate dependencies, and the removal of the LLM dependency from sql_pairs_indexing aligns with the PR objective of simplifying the pipeline.

wren-ai-service/src/globals.py (2)

79-82: LGTM! SQL pairs integration is properly configured.

The new SQL pairs pipeline is well-integrated into the semantics preparation service with proper dependency injection and configuration.


227-228: Verify sql_pairs_path configuration.

Ensure that the sql_pairs_path setting is properly configured in the environment and the path exists in the deployment.

Comment on lines +71 to +77
return [
SqlPair(
id=pair.get("id"),
question=pair.get("question"),
sql=pair.get("sql"),
)
for boilerplate in boilerplates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Ensure id field is not missing in SqlPair instances

In the sql_pairs function, the id field is retrieved using pair.get("id"), which may return None if the key is missing. Since SqlPair requires id to be a string, a missing id could lead to validation errors or unintended behavior.

Apply this diff to enforce the presence of the id field:

def sql_pairs(
    boilerplates: Set[str],
    external_pairs: Dict[str, Any],
) -> List[SqlPair]:
    return [
        SqlPair(
-           id=pair.get("id"),
+           id=pair["id"],
            question=pair.get("question"),
            sql=pair.get("sql"),
        )
        for boilerplate in boilerplates
        if boilerplate in external_pairs
        for pair in external_pairs[boilerplate]
    ]

Alternatively, provide a default id or handle cases where id might be missing to prevent potential errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return [
SqlPair(
id=pair.get("id"),
question=pair.get("question"),
sql=pair.get("sql"),
)
for boilerplate in boilerplates
return [
SqlPair(
id=pair["id"],
question=pair.get("question"),
sql=pair.get("sql"),
)
for boilerplate in boilerplates

@cyyeh cyyeh merged commit 186faa7 into main Jan 10, 2025
10 checks passed
@cyyeh cyyeh deleted the feat/embed-sql-pairs branch January 10, 2025 07:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/ai-service ai-service related module/ai-service ai-service related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants