feat(wren-ai-service): Embed the SQL pairs in MDL #1082

paopa · 2025-01-02T09:56:56Z

Changes

Pipeline Simplification
- Removed SQL intention generation using LLM
- Simplified document storage to use direct question-SQL pairs
- Updated pipeline configuration to remove unnecessary LLM dependency
Configuration Changes
- Added SQL_PAIRS_PATH environment variable
- Added pairs.json configuration file support
- Updated deployment configurations to include new file mounts
Sample Management
- Added structured JSON format for storing SQL query examples
- Updated sample format to use question-SQL pairs
- Improved sample integration in SQL generation prompts
Testing
- Updated existing tests to reflect new pipeline structure
- Added test data for pairs.json
- Temporarily skipped outdated tests pending updates

Related Changes

Updated docker compose configurations
Modified kustomization files for deployment
Updated configuration examples
Added new utility tool for MDL string conversion

Summary by CodeRabbit

Release Notes

New Features
- Added a new SQL pairs indexing pipeline for processing and storing SQL query examples
- Introduced a tool to convert Model Definition Language (MDL) to string format
- Enhanced SQL generation with support for sample query examples
Changes
- Renamed sql_pairs_preparation pipeline to sql_pairs_indexing
- Updated configuration files to reflect pipeline changes
- Removed LLM-specific configuration from SQL pairs pipeline
Improvements
- Streamlined SQL pairs handling across multiple service components
- Added more flexible configuration for SQL pair processing
Testing
- Updated test cases to align with new SQL pairs indexing pipeline
- Added sample SQL pairs test data

coderabbitai · 2025-01-02T09:57:03Z

Walkthrough

This pull request introduces significant changes to the Wren AI service, focusing on SQL pairs indexing and processing. The modifications span multiple files, introducing a new SqlPairs pipeline to replace the previous SqlPairsPreparation approach. Key changes include renaming pipeline stages, updating configuration files, adding new utility functions, and restructuring how SQL pairs are handled throughout the service. The changes aim to improve the flexibility and clarity of SQL pair management, with a more streamlined approach to indexing and processing.

Changes

File Path	Change Summary
`deployment/kustomizations/base/cm.yaml`	Renamed pipeline step from `sql_pairs_preparation` to `sql_pairs_indexing`
`docker/config.example.yaml`	Updated pipeline configuration, removed `llm` from `sql_pairs_indexing`
`wren-ai-service/src/config.py`	Added new configuration field `sql_pairs_path`
`wren-ai-service/src/globals.py`	Updated service container with new SQL pairs handling
`wren-ai-service/src/pipelines/indexing/`	Introduced new `SqlPairs` pipeline, removed `SqlPairsPreparation`
`wren-ai-service/tests/`	Updated test files to match new SQL pairs indexing approach
`wren-ai-service/tools/config/`	Updated configuration files with new pipeline naming

Sequence Diagram

sequenceDiagram
    participant User
    participant SqlPairs
    participant DocumentConverter
    participant Embedder
    participant DocumentStore
    participant Cleaner

    User->>SqlPairs: Run with MDL string
    SqlPairs->>DocumentConverter: Convert SQL pairs to documents
    DocumentConverter-->>SqlPairs: Return documents
    SqlPairs->>Embedder: Embed documents
    Embedder-->>SqlPairs: Return embedded documents
    SqlPairs->>Cleaner: Clean SQL pairs
    Cleaner-->>SqlPairs: Return cleaned pairs
    SqlPairs->>DocumentStore: Write documents

Poem

🐰 SQL Pairs Dancing Free

In pipelines of code, a rabbit's glee
From preparation to indexing we leap
Transforming queries with magical sweep
No LLM needed, just pure SQL spree!

🔍✨

Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 3

🔭 Outside diff range comments (3)

wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (2)
Line range hint 16-23: Maintain consistent parameter naming across the codebase.

The parameter id should be renamed to project_id for consistency with the changes in other files and the call to sql_pairs_cleaner.run.
 async def delete_sql_pairs(
     sql_pairs_cleaner: SqlPairsCleaner,
     sql_pair_ids: List[str],
-    id: Optional[str] = None,
+    project_id: Optional[str] = None,
 ) -> None:
-    return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=id)
+    return await sql_pairs_cleaner.run(sql_pair_ids=sql_pair_ids, project_id=project_id)
Line range hint 44-54: Maintain consistent parameter naming in the pipeline class.

The run method still uses id parameter while the rest of the codebase has moved to project_id.
     async def run(
-        self, sql_pair_ids: List[str], id: Optional[str] = None
+        self, sql_pair_ids: List[str], project_id: Optional[str] = None
     ) -> Dict[str, Any]:
         logger.info("SQL Pairs Deletion pipeline is running...")
         return await self._pipe.execute(
             ["delete_sql_pairs"],
             inputs={
                 "sql_pair_ids": sql_pair_ids,
-                "id": id or "",
+                "project_id": project_id or "",
                 **self._components,
             },
         )
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)
Line range hint 37-37: Update test function name for consistency.

The test function name still contains "preparation" while others have been updated to "indexing".
-async def test_sql_pairs_preparation_saving_to_document_store_with_multiple_project_ids():
+async def test_sql_pairs_indexing_saving_to_document_store_with_multiple_project_ids():

🧹 Nitpick comments (12)

wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

184-188: Avoid hard-coded test data in the main block

In the __main__ block used for a dry run, the mdl_str is hard-coded with sample data. While this is acceptable for testing, consider accepting mdl_str as a command-line argument or reading from a file to make the script more flexible.
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)
20-25: Use consistent project IDs in tests

In the test, you use project_id="fake-id" when running sql_pairs.run, but later use id="fake-id-2" when running sql_pairs_deletion.run. This might lead to confusion as the IDs do not match, and could affect the accuracy of the test.

Consider using the same project_id throughout the test to ensure consistency:
-    await sql_pairs_deletion.run(id="fake-id-2", sql_pair_ids=["1", "2"])
+    await sql_pairs_deletion.run(id="fake-id", sql_pair_ids=["1", "2"])
wren-ai-service/tools/mdl_to_str.py (1)
24-29: Simplify string escaping using 'json.dumps'

The manual replacement of backslashes and double quotes can be error-prone. Consider using the built-in json.dumps function with the ensure_ascii=False and indent=None parameters for proper string serialization.

Replace the to_str function implementation with:
import json

def to_str(mdl: dict) -> str:
    return json.dumps(mdl, ensure_ascii=False)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1)

29-29: Track technical debt with an issue.

The TODO comment suggests removing this pipeline. This should be tracked in the issue tracker for proper follow-up.

Would you like me to create a GitHub issue to track this technical debt?
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)
Line range hint 10-34: Reduce test code duplication.

Both test functions share similar setup and assertion logic. Consider extracting common setup into fixtures and shared assertion helpers.
@pytest.fixture
async def sql_pairs_store():
    pipe_components = generate_components(settings.components)
    document_store_provider = pipe_components["sql_pairs_indexing"]["document_store_provider"]
    store = document_store_provider.get_store(
        dataset_name="sql_pairs",
        recreate_index=True,
    )
    return store, pipe_components

async def assert_documents(store, expected_count, project_id=None):
    assert await store.count_documents() == expected_count
    filters = None
    if project_id:
        filters = {
            "operator": "AND",
            "conditions": [
                {"field": "project_id", "operator": "==", "value": project_id},
            ],
        }
    documents = store.filter_documents(filters=filters)
    for document in documents:
        assert document.content, "content should not be empty"
        assert document.meta, "meta should not be empty"
        assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta"
        assert document.meta.get("sql"), "sql should be in meta"
Also applies to: 37-71

24-25: Use test constants for better maintainability.

Hard-coded MDL strings and test data paths should be moved to constants or test fixtures.
TEST_DATA = {
    'MDL_STR': '{"models": [{"properties": {"boilerplate": "test"}}]}',
    'SQL_PAIRS_PATH': "tests/data/pairs.json"
}
Also applies to: 52-53, 57-58
wren-ai-service/Justfile (2)
73-74: Add documentation for the new target.

The mdl-to-str target lacks documentation about its purpose and usage. Consider adding a comment explaining what it does and how to use it.
+# Convert MDL JSON file to a properly escaped string format
+# Usage: just mdl-to-str path/to/mdl.json
 mdl-to-str mdl_path="":
     poetry run python tools/mdl_to_str.py -p {{mdl_path}}
73-74: Add parameter validation.

The target should validate that mdl_path is provided and exists before running the Python script.
 mdl-to-str mdl_path="":
+    #!/usr/bin/env bash
+    if [ -z "{{mdl_path}}" ]; then
+        echo "Error: mdl_path is required"
+        exit 1
+    fi
+    if [ ! -f "{{mdl_path}}" ]; then
+        echo "Error: File {{mdl_path}} does not exist"
+        exit 1
+    fi
     poetry run python tools/mdl_to_str.py -p {{mdl_path}}
wren-ai-service/src/config.py (1)
57-57: Add documentation for the new configuration field.

The sql_pairs_path field should be documented to explain its purpose, expected format, and usage.
+    # Path to the JSON file containing SQL pairs for embedding in MDL
+    # The file should contain an array of SQL pair objects
     sql_pairs_path: str = Field(default="pairs.json")
wren-ai-service/src/pipelines/generation/sql_generation.py (1)
35-43: Consider adding sample validation.

While the SQL samples integration looks good, consider adding validation for the sample format to ensure each sample contains both question and sql fields.
@observe(capture_input=False)
def prompt(
    query: str,
    documents: List[str],
    exclude: List[Dict],
    text_to_sql_rules: str,
    prompt_builder: PromptBuilder,
    configuration: Configuration | None = None,
    sql_samples: List[Dict] | None = None,
) -> dict:
+    if sql_samples:
+        for sample in sql_samples:
+            if not all(key in sample for key in ['question', 'sql']):
+                raise ValueError("Each SQL sample must contain 'question' and 'sql' fields")
    return prompt_builder.run(
        query=query,
        documents=documents,
        exclude=exclude,
        text_to_sql_rules=text_to_sql_rules,
        instructions=construct_instructions(configuration),
        sql_samples=sql_samples,
        current_time=configuration.show_current_time(),
    )
wren-ai-service/src/globals.py (1)
227-228: Verify pipeline component naming consistency.

The pipeline key is "sql_pairs_preparation" but uses components from "sql_pairs_indexing". Consider aligning these names for better maintainability.
-                "sql_pairs_preparation": indexing.SqlPairs(
-                    **pipe_components["sql_pairs_indexing"],
+                "sql_pairs_indexing": indexing.SqlPairs(
+                    **pipe_components["sql_pairs_indexing"],
deployment/kustomizations/base/cm.yaml (1)
185-188: Consider documenting the pairs.json structure.

The pairs.json is initialized with an empty sample array. Consider adding:

Documentation about the expected structure of the array

Example entries to illustrate the format
   pairs.json: |
     {
+      # Array of SQL pairs for indexing
+      # Example format:
+      # "sample": [
+      #   {
+      #     "question": "What is the total sales?",
+      #     "sql": "SELECT SUM(sales) FROM transactions"
+      #   }
+      # ]
       "sample": []
     }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5fa29ed and 8d9a9d4.

📒 Files selected for processing (24)

deployment/kustomizations/base/cm.yaml (2 hunks)
deployment/kustomizations/base/deploy-wren-ai-service.yaml (2 hunks)
docker/config.example.yaml (1 hunks)
docker/docker-compose-dev.yaml (1 hunks)
docker/docker-compose.yaml (1 hunks)
wren-ai-service/Justfile (1 hunks)
wren-ai-service/src/config.py (1 hunks)
wren-ai-service/src/globals.py (2 hunks)
wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
wren-ai-service/tests/data/pairs.json (1 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
wren-ai-service/tools/config/config.example.yaml (1 hunks)
wren-ai-service/tools/config/config.full.yaml (1 hunks)
wren-ai-service/tools/mdl_to_str.py (1 hunks)

💤 Files with no reviewable changes (1)

wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py

✅ Files skipped from review due to trivial changes (3)

wren-ai-service/tests/data/pairs.json
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
wren-ai-service/src/pipelines/generation/utils/sql.py

🔇 Additional comments (18)

wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

73-76: Ensure all SQL pairs have valid 'id' fields

When creating SqlPair instances, pair.get("id") may return None if the 'id' key is missing in the pair. Since id is a required field of type str in the SqlPair model, this might raise validation errors.

Please ensure that all entries in external_pairs have an 'id' field. If 'id' might be missing, consider providing a default value or handling missing IDs appropriately.

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

5-5: Updated imports reflect changes in the pipeline structure

The import statement correctly replaces SqlPairsPreparation with SqlPairs, aligning with the updated pipeline naming conventions.

29-32: Verify deletion operation in test

After running sql_pairs_deletion.run, the test asserts that the document count in the store has decreased. Ensure that the test correctly verifies the deletion operation by checking the actual contents of the store.

As a follow-up, you can enhance the test by retrieving the remaining documents and asserting that the deleted documents are no longer present.

wren-ai-service/src/pipelines/indexing/__init__.py (3)

96-98: LGTM! Parameter rename improves clarity.

The parameter rename from id to project_id enhances code readability and maintains consistency with the project's naming conventions.

106-108: LGTM! Filter condition is correctly updated.

The filter condition correctly uses the renamed project_id parameter, maintaining the logical integrity of the document filtering.

117-117: LGTM! Module exports are properly updated.

The import and __all__ list changes correctly reflect the refactoring from SqlPairsPreparation to SqlPairs.

Also applies to: 126-126

wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1)

28-28: LGTM! Key rename improves semantic clarity.

The change from "summary" to "question" better represents the actual content being formatted and aligns with the SQL pairs domain terminology.

wren-ai-service/src/web/v1/services/semantics_preparation.py (1)

82-87: LGTM! SQL pairs pipeline task properly integrated.

The addition of "sql_pairs" to the pipeline tasks ensures proper integration with the semantics preparation process while maintaining concurrent execution through asyncio.gather().
wren-ai-service/src/pipelines/generation/sql_generation.py (1)

35-43: LGTM! SQL samples section enhances LLM guidance.

The addition of the SQL samples section in the prompt template provides valuable examples to guide the LLM in generating more accurate SQL queries.

Consider the impact on token limits when many samples are provided. Run this script to analyze the template size:

✅ Verification successful

Token limits are properly handled by the system architecture

The system is well-designed with built-in safeguards:

Uses GPT-4's standard 4096 token limit

Retrieves relevant SQL samples through a dedicated pipeline

Conditionally includes samples in the template

Has default sample limits in place
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Calculate approximate token count for the template
echo 'Analyzing template size...'
rg -A 10 'sql_generation_user_prompt_template = """' | wc -w
Length of output: 128

Script:
#!/bin/bash
# Get the complete template content
echo "=== Complete Template ==="
rg -U "sql_generation_user_prompt_template = \"\"\".*?\"\"\"" -A 50

echo -e "\n=== Token Limit Configurations ==="
rg -i "token.*limit|max.*token"

echo -e "\n=== Sample Handling Logic ==="
rg -l "sql_samples" | xargs ast-grep --pattern 'def $FUNC($$$) { $$$ sql_samples $$$ }'
Length of output: 1685

Script:
#!/bin/bash
# Look for sample selection/limiting logic
echo "=== Sample Selection Logic ==="
rg -A 10 "def.*sql_samples"

echo -e "\n=== Sample List Creation ==="
rg -B 5 -A 5 "sql_samples.*=.*"
Length of output: 9163

Script:
#!/bin/bash
# Look for sample limit configurations
echo "=== Sample Limit Configurations ==="
rg -i "sample.*limit|limit.*sample|top.*sample|max.*sample"

echo -e "\n=== SQL Pairs Retrieval Logic ==="
rg -A 10 "sql_pairs_retrieval.*run"
Length of output: 2337
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2)

94-94: Address TODO comment before merging.

The comment indicates that the SQL pairs preparation functionality needs proper implementation. This should be addressed before merging to ensure complete functionality.

Would you like me to help implement the proper SQL pairs preparation functionality or create a GitHub issue to track this task?

96-97: Verify parameter renaming impact.

The parameter rename from id to project_id improves naming consistency, but we should verify all callers have been updated.

✅ Verification successful

Parameter rename verification complete - all callers updated

The parameter rename from 'id' to 'project_id' has been properly implemented across the codebase. All relevant callers have been updated to use the new parameter name. The remaining uses of 'id' in the codebase are for different purposes (like sql_pair.id or sql_pairs_preparation_id) and not related to this parameter rename.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Search for any remaining uses of the old parameter name
rg "id=" -A 5 -B 5
Length of output: 117072

Script:
#!/bin/bash
# Search for function calls with id parameter
rg "id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3

# Search for function calls with project_id parameter
rg "project_id=[\'\"]?\w+[\'\"]?" --type py -g "!*test*" -A 3 -B 3

# Use ast-grep to find function calls with id parameter
ast-grep --pattern 'id=$_' -l py

# Use ast-grep to find function calls with project_id parameter 
ast-grep --pattern 'project_id=$_' -l py
Length of output: 1416

Script:
#!/bin/bash
# Search for function calls with id parameter in sql_pairs related files
rg ".*$.*id=.*$" --type py -g "*sql_pairs*.py" -A 3 -B 3

# Search for function calls with project_id parameter in sql_pairs related files
rg ".*$.*project_id=.*$" --type py -g "*sql_pairs*.py" -A 3 -B 3

# Search for any remaining uses of id parameter in the codebase
rg "\bid=(?![\w-])" --type py -g "!*test*" -A 3 -B 3

# Search for any remaining uses of project_id parameter in the codebase
rg "\bproject_id=(?![\w-])" --type py -g "!*test*" -A 3 -B 3
Length of output: 10090
wren-ai-service/src/globals.py (1)

79-82: LGTM! Well-structured pipeline integration.

The SQL pairs pipeline component is properly integrated with the settings and follows the established pattern for pipeline configuration.

docker/docker-compose-dev.yaml (1)

48-48: LGTM! Proper configuration for SQL pairs.

The environment variable and volume mapping for SQL pairs are correctly configured and follow the established pattern.

Also applies to: 53-53

deployment/kustomizations/base/deploy-wren-ai-service.yaml (1)

66-67: LGTM! Consistent deployment configuration.

The environment variable and config-volume configuration for SQL pairs are properly set up and maintain consistency with the docker-compose configuration.

Also applies to: 77-78

docker/docker-compose.yaml (1)

59-59: LGTM! Environment variable and volume mapping for SQL pairs configuration.

The configuration follows Docker best practices by properly mapping the configuration file and setting up the corresponding environment variable.

Also applies to: 64-64
docker/config.example.yaml (1)

Line range hint 107-115: Verify the pipeline dependencies and configuration.

The SQL pairs pipeline components are well-structured with appropriate dependencies. However, let's verify the pipeline configuration usage across the codebase.

✅ Verification successful

Old references to sql_pairs_preparation exist but don't impact functionality

The pipeline components are correctly configured and used throughout the codebase. The references to sql_pairs_preparation only exist in test files and service names, not in the actual pipeline configuration or usage. No changes are required.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify the pipeline configuration usage

# Check for pipeline component references
rg -A 3 "sql_pairs_(indexing|deletion|retrieval)"

# Check for any potential old references to sql_pairs_preparation
rg "sql_pairs_preparation"
Length of output: 20474
🧰 Tools

🪛 yamllint (1.35.1)

[error] 112-112: trailing spaces

(trailing-spaces)
wren-ai-service/tools/config/config.example.yaml (1)

Line range hint 125-133: Configuration is consistent with docker/config.example.yaml

🧰 Tools

🪛 yamllint (1.35.1)

[error] 130-130: trailing spaces

(trailing-spaces)

wren-ai-service/tools/config/config.full.yaml (1)

Line range hint 144-152: Verify the removal of LLM dependency from sql_pairs_indexing.

The LLM dependency has been removed from the sql_pairs_indexing pipeline while being retained in sql_pairs_retrieval. Please confirm if this architectural change is intentional and doesn't impact the indexing functionality.

🧰 Tools

🪛 yamllint (1.35.1)

[error] 149-149: trailing spaces

(trailing-spaces)

wren-ai-service/src/pipelines/indexing/sql_pairs.py

wren-ai-service/tools/mdl_to_str.py

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1)
20-25: Consider using a test fixture for the pairs.json path.

Instead of hardcoding the test data path, consider:

Moving it to a test fixture or constant

Using a temporary file for isolation
This would make the tests more maintainable and prevent potential issues with file locations in different environments.
+# At the top of the file
+TEST_SQL_PAIRS_PATH = "tests/data/pairs.json"

-        **pipe_components["sql_pairs_indexing"], sql_pairs_path="tests/data/pairs.json"
+        **pipe_components["sql_pairs_indexing"], sql_pairs_path=TEST_SQL_PAIRS_PATH
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)
Line range hint 10-34: Enhance test coverage for SQL pairs indexing.

While the test verifies basic document storage, consider adding assertions for:

MDL string integration

Project ID in metadata

Specific content from pairs.json
     assert document.meta.get("sql_pair_id"), "sql_pair_id should be in meta"
     assert document.meta.get("sql"), "sql should be in meta"
+    assert document.meta.get("project_id") == "fake-id", "project_id should match"
+    # Verify MDL integration
+    assert document.meta.get("mdl") is not None, "mdl should be in meta"
Line range hint 51-71: Enhance multi-project ID test coverage.

The test could be more thorough in verifying the indexed documents:

Add assertions for both project IDs

Verify that documents are correctly associated with their respective projects

Check that MDL strings are properly stored
     assert len(documents) == 2
+    # Verify documents for second project
+    documents_2 = store.filter_documents(
+        filters={
+            "operator": "AND",
+            "conditions": [
+                {"field": "project_id", "operator": "==", "value": "fake-id-2"},
+            ],
+        }
+    )
+    assert len(documents_2) == 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8d9a9d4 and b165de2.

📒 Files selected for processing (24)

deployment/kustomizations/base/cm.yaml (2 hunks)
deployment/kustomizations/base/deploy-wren-ai-service.yaml (2 hunks)
docker/config.example.yaml (1 hunks)
docker/docker-compose-dev.yaml (1 hunks)
docker/docker-compose.yaml (1 hunks)
wren-ai-service/Justfile (1 hunks)
wren-ai-service/src/config.py (1 hunks)
wren-ai-service/src/globals.py (2 hunks)
wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
wren-ai-service/tests/data/pairs.json (1 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
wren-ai-service/tools/config/config.example.yaml (1 hunks)
wren-ai-service/tools/config/config.full.yaml (1 hunks)
wren-ai-service/tools/mdl_to_str.py (1 hunks)

💤 Files with no reviewable changes (1)

wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py

🚧 Files skipped from review as they are similar to previous changes (20)

wren-ai-service/src/config.py
deployment/kustomizations/base/deploy-wren-ai-service.yaml
wren-ai-service/src/pipelines/generation/utils/sql.py
wren-ai-service/Justfile
wren-ai-service/tests/data/pairs.json
wren-ai-service/tools/config/config.example.yaml
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
docker/config.example.yaml
deployment/kustomizations/base/cm.yaml
wren-ai-service/src/web/v1/services/semantics_preparation.py
docker/docker-compose.yaml
wren-ai-service/tools/config/config.full.yaml
docker/docker-compose-dev.yaml
wren-ai-service/tools/mdl_to_str.py
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
wren-ai-service/src/pipelines/generation/sql_generation.py
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
wren-ai-service/src/pipelines/indexing/init.py
wren-ai-service/src/pipelines/indexing/sql_pairs.py

🔇 Additional comments (4)

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

5-5: LGTM! Component renaming reflects the functionality better.

The change from "preparation" to "indexing" better represents the actual functionality of storing and managing SQL pairs.

Also applies to: 13-13

29-32: 🛠️ Refactor suggestion

Enhance test robustness for SQL pair deletion.

The test uses hardcoded SQL pair IDs without verifying their existence in pairs.json. Consider:

Reading actual IDs from the test data file

Adding assertions to verify the specific documents were deleted

Let's verify the content of pairs.json:
wren-ai-service/src/globals.py (2)

79-82: LGTM! Clean integration of SQL pairs indexing.

The addition of the SQL pairs pipeline with configuration from settings is well-structured and consistent with the overall changes.

227-228: Consider renaming the preparation service for consistency.

While the component has been updated to use indexing.SqlPairs, the service class name SqlPairsPreparationService still contains "preparation". Consider renaming it to maintain consistency with the new indexing terminology.
-        sql_pairs_preparation_service=SqlPairsPreparationService(
+        sql_pairs_indexing_service=SqlPairsIndexingService(
Note: This change would require updating the service class name and its usages across the codebase.

Let's check for other occurrences:

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2)

Line range hint 38-71: Consider adding negative test cases.

While the test covers multiple project IDs well, consider adding test cases for:

Invalid MDL string format

Non-existent project IDs

Empty or malformed SQL pairs file

Line range hint 1-71: Consider separating test data setup.

To improve test maintainability, consider:

Moving the test data setup to a fixture

Creating a dedicated test data factory for generating MDL strings

Using parametrized tests for different SQL pair scenarios

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b165de2 and 13f4ff0.

📒 Files selected for processing (2)

wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

wren-ai-service/src/pipelines/indexing/sql_pairs.py

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: pytest
GitHub Check: Analyze (go)

🔇 Additional comments (3)

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (3)

5-5: LGTM! Import and component changes align with the pipeline refactoring.

The changes from SqlPairsPreparation to SqlPairs and the component renaming to "sql_pairs_indexing" maintain consistency with the broader architectural changes.

Also applies to: 13-13

10-10: Function name updated for consistency.

The function name has been updated from "preparation" to "indexing" as previously suggested.

20-25: Verify the test data file existence and format.

The test relies on tests/data/pairs.json. Let's verify its existence and format.

✅ Verification successful

Test data file exists and is properly formatted

The file exists at "wren-ai-service/tests/data/pairs.json" with valid JSON structure containing appropriate test data for SQL pairs testing.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Check if the test data file exists and validate its JSON structure
if [ -f "tests/data/pairs.json" ]; then
    echo "File exists. Validating JSON structure..."
    jq '.' tests/data/pairs.json
else
    echo "Error: tests/data/pairs.json not found"
    exit 1
fi
Length of output: 130

Script:
#!/bin/bash
# Search for any pairs.json files in the repository
echo "Searching for pairs.json files:"
fd pairs.json

echo -e "\nSearching for any JSON files in tests directory:"
fd -e json . tests/

echo -e "\nSearching for other test files that might use pairs.json:"
rg -l "pairs.json" tests/
Length of output: 559

Script:
#!/bin/bash
# Validate the JSON structure of the found file
echo "Validating JSON structure of the found file:"
jq '.' wren-ai-service/tests/data/pairs.json
Length of output: 372

cyyeh

LGTM

docker/docker-compose.yaml

cyyeh

lgtm

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)
38-58: Update documentation strings in test functions

The docstrings or comments within the test functions still refer to the old sql_pairs_preparation naming. Consider updating any internal documentation to match the new naming convention for clarity.

Apply this diff to update the comments:
# Existing comments or docstrings, if any
Note: Since the provided code does not include comments, ensure that any existing comments are updated accordingly.
deployment/kustomizations/base/cm.yaml (1)

184-184: Add newline at end of file.

Add a newline character at the end of the file to comply with YAML best practices.

🧰 Tools

🪛 yamllint (1.35.1)

[error] 184-184: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 13f4ff0 and 81f13bf.

📒 Files selected for processing (21)

deployment/kustomizations/base/cm.yaml (2 hunks)
docker/config.example.yaml (1 hunks)
wren-ai-service/Justfile (1 hunks)
wren-ai-service/src/config.py (1 hunks)
wren-ai-service/src/globals.py (2 hunks)
wren-ai-service/src/pipelines/generation/sql_generation.py (1 hunks)
wren-ai-service/src/pipelines/generation/utils/sql.py (1 hunks)
wren-ai-service/src/pipelines/indexing/__init__.py (2 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py (1 hunks)
wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py (0 hunks)
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py (1 hunks)
wren-ai-service/src/web/v1/services/semantics_preparation.py (1 hunks)
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py (2 hunks)
wren-ai-service/tests/data/pairs.json (1 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (2 hunks)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (1 hunks)
wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py (2 hunks)
wren-ai-service/tools/config/config.example.yaml (1 hunks)
wren-ai-service/tools/config/config.full.yaml (1 hunks)
wren-ai-service/tools/mdl_to_str.py (1 hunks)

💤 Files with no reviewable changes (1)

wren-ai-service/src/pipelines/indexing/sql_pairs_preparation.py

🚧 Files skipped from review as they are similar to previous changes (15)

wren-ai-service/tests/pytest/services/test_sql_pairs_preparation.py
docker/config.example.yaml
wren-ai-service/tools/config/config.example.yaml
wren-ai-service/src/pipelines/retrieval/sql_pairs_retrieval.py
wren-ai-service/src/web/v1/services/semantics_preparation.py
wren-ai-service/Justfile
wren-ai-service/tests/data/pairs.json
wren-ai-service/tools/mdl_to_str.py
wren-ai-service/src/web/v1/services/sql_pairs_preparation.py
wren-ai-service/src/pipelines/generation/utils/sql.py
wren-ai-service/tools/config/config.full.yaml
wren-ai-service/src/pipelines/indexing/sql_pairs_deletion.py
wren-ai-service/src/pipelines/generation/sql_generation.py
wren-ai-service/src/config.py
wren-ai-service/src/pipelines/indexing/init.py

🧰 Additional context used

🪛 yamllint (1.35.1)

deployment/kustomizations/base/cm.yaml

[error] 184-184: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: pytest
GitHub Check: Analyze (go)

🔇 Additional comments (7)

wren-ai-service/src/pipelines/indexing/sql_pairs.py (1)

54-57: ⚠️ Potential issue

Add exception handling for JSON parsing errors in boilerplates function

The function boilerplates parses mdl_str without handling potential JSON decoding exceptions. If mdl_str contains invalid JSON, orjson.loads will raise an exception, which could cause the pipeline to fail unexpectedly.

Apply this diff to handle JSON parsing errors gracefully:
def boilerplates(
    mdl_str: str,
) -> Set[str]:
+   try:
        mdl = orjson.loads(mdl_str)
+   except orjson.JSONDecodeError as e:
+       logger.error(f"Error parsing MDL JSON string: {e}")
+       return set()
Likely invalid or redundant comment.
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs_deletion.py (2)

20-25: Update test to match new SqlPairs initialization parameters

The instantiation of SqlPairs now includes sql_pairs_path, but the project_id parameter is passed during the run method call. Ensure that the test accurately reflects the new initialization and method signatures.

29-32: Verify SQL pair IDs correspond to the correct project_id

In the deletion test, SQL pair IDs ["1", "2"] are used with id="fake-id-2", but the documents were added with project_id="fake-id". Ensure that the deletion call uses the correct project_id to accurately test the deletion functionality.

Consider updating the deletion call to match the project_id used during insertion:
await sql_pairs_deletion.run(
-   id="fake-id-2",
+   id="fake-id",
    sql_pair_ids=["1", "2"]
)
wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py (1)

10-25: Ensure consistent naming of test functions

The test function name has been updated to test_sql_pairs_indexing_saving_to_document_store, which improves consistency with the new SqlPairs class. Good job on updating the naming to reflect the changes.

deployment/kustomizations/base/cm.yaml (1)

155-157: LGTM! SQL pairs pipeline configuration is well-structured.

The new pipeline components are properly configured with appropriate dependencies, and the removal of the LLM dependency from sql_pairs_indexing aligns with the PR objective of simplifying the pipeline.

wren-ai-service/src/globals.py (2)

79-82: LGTM! SQL pairs integration is properly configured.

The new SQL pairs pipeline is well-integrated into the semantics preparation service with proper dependency injection and configuration.

227-228: Verify sql_pairs_path configuration.

Ensure that the sql_pairs_path setting is properly configured in the environment and the path exists in the deployment.

coderabbitai · 2025-01-10T03:44:28Z

wren-ai-service/src/pipelines/indexing/sql_pairs.py

+    return [
+        SqlPair(
+            id=pair.get("id"),
+            question=pair.get("question"),
+            sql=pair.get("sql"),
+        )
+        for boilerplate in boilerplates


⚠️ Potential issue

Ensure id field is not missing in SqlPair instances

In the sql_pairs function, the id field is retrieved using pair.get("id"), which may return None if the key is missing. Since SqlPair requires id to be a string, a missing id could lead to validation errors or unintended behavior.

Apply this diff to enforce the presence of the id field:

def sql_pairs( boilerplates: Set[str], external_pairs: Dict[str, Any], ) -> List[SqlPair]: return [ SqlPair( - id=pair.get("id"), + id=pair["id"], question=pair.get("question"), sql=pair.get("sql"), ) for boilerplate in boilerplates if boilerplate in external_pairs for pair in external_pairs[boilerplate] ]

Alternatively, provide a default id or handle cases where id might be missing to prevent potential errors.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

return [

SqlPair(

id=pair.get("id"),

question=pair.get("question"),

sql=pair.get("sql"),

)

for boilerplate in boilerplates

return [

SqlPair(

id=pair["id"],

question=pair.get("question"),

sql=pair.get("sql"),

)

for boilerplate in boilerplates

paopa added module/ai-service ai-service related ci/ai-service ai-service related labels Jan 2, 2025

paopa force-pushed the feat/embed-sql-pairs branch 5 times, most recently from 7bd856c to 75fdd82 Compare January 9, 2025 04:01

paopa marked this pull request as ready for review January 9, 2025 04:09

paopa force-pushed the feat/embed-sql-pairs branch from 75fdd82 to 8d9a9d4 Compare January 9, 2025 04:10

paopa requested a review from cyyeh January 9, 2025 04:22

coderabbitai bot reviewed Jan 9, 2025

View reviewed changes

wren-ai-service/src/pipelines/indexing/sql_pairs.py Outdated Show resolved Hide resolved

wren-ai-service/src/pipelines/indexing/sql_pairs.py Outdated Show resolved Hide resolved

wren-ai-service/tools/mdl_to_str.py Outdated Show resolved Hide resolved

paopa force-pushed the feat/embed-sql-pairs branch from 8d9a9d4 to b165de2 Compare January 9, 2025 05:46

coderabbitai bot reviewed Jan 9, 2025

View reviewed changes

wren-ai-service/tests/pytest/pipelines/indexing/test_sql_pairs.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Jan 9, 2025

View reviewed changes

cyyeh approved these changes Jan 10, 2025

View reviewed changes

cyyeh reviewed Jan 10, 2025

View reviewed changes

docker/docker-compose.yaml Outdated Show resolved Hide resolved

paopa added 13 commits January 10, 2025 11:36

chore: refactor partial code

8cf291e

feat: pushdown the sql pair class to the core

f40e0bc

feat: tool to convert mdl to str for http request testing

5340ee7

feat: index sql pairs if MDL includes the key

ea99262

fix: failed test cases

e4df29b

fix: correct the behavior for document id

b7c759c

feat: prompt enhancement for SQL pairs

2819a4f

feat: use question instead of sql summary

7d2b088

feat: sql pairs from external file

6d56393

add the todo for sql pair preparation endpoint

7399656

chore: comment for deletion pipe

8cbb369

fix: test cases

05f1e8f

chore: remove unused attribute in doc

5d87164

paopa added 6 commits January 10, 2025 11:36

chore: skip some test cases for sql pair service

eb4619c

fix: test cases for sql pair and deletion

513350d

chore: mount sql pair file for each deployment

923f5ce

chore: deal import, type and error handling suggestion

254c180

chore: rename test case, and correct pipeline type

c79f907

chore: remove empty pairs file from each deployment

81f13bf

paopa force-pushed the feat/embed-sql-pairs branch from 13f4ff0 to 81f13bf Compare January 10, 2025 03:41

cyyeh approved these changes Jan 10, 2025

View reviewed changes

coderabbitai bot reviewed Jan 10, 2025

View reviewed changes

cyyeh merged commit 186faa7 into main Jan 10, 2025
10 checks passed

cyyeh deleted the feat/embed-sql-pairs branch January 10, 2025 07:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(wren-ai-service): Embed the SQL pairs in MDL #1082

feat(wren-ai-service): Embed the SQL pairs in MDL #1082

paopa commented Jan 2, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 2, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

cyyeh left a comment

cyyeh left a comment

coderabbitai bot left a comment

coderabbitai bot Jan 10, 2025

feat(wren-ai-service): Embed the SQL pairs in MDL #1082

feat(wren-ai-service): Embed the SQL pairs in MDL #1082

Conversation

paopa commented Jan 2, 2025 • edited by coderabbitai bot Loading

Changes

Related Changes

Summary by CodeRabbit

Release Notes

coderabbitai bot commented Jan 2, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram

Poem

Finishing Touches

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

cyyeh left a comment

Choose a reason for hiding this comment

cyyeh left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot Jan 10, 2025

Choose a reason for hiding this comment

paopa commented Jan 2, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 2, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)