Skip to content
This repository has been archived by the owner on Dec 9, 2024. It is now read-only.

Latest commit

 

History

History

streaming_pipeline

Streaming Pipeline

Real-time feature pipeline that:

  • ingests financial news from Alpaca
  • cleans & transforms the news documents into embeddings in real-time using Bytewax
  • stores the embeddings into the Qdrant Vector DB

The streaming pipeline is automatically deployed on an AWS EC2 machine using a CI/CD pipeline built in GitHub actions.

Table of Contents


1. Motivation

The best way to ingest real-time knowledge into an LLM without retraining the LLM too often is by using RAG.

To implement RAG at inference time, you need a vector DB always synced with the latest available data.

The role of this streaming pipeline is to listen 24/7 to available financial news from Alpaca, process the news in real-time using Bytewax, and store the news in the Qdrant Vector DB to make the information available for RAG.


architecture

2. Install

2.1. Dependencies

Main dependencies you have to install yourself:

  • Python 3.10
  • Poetry 1.5.1
  • GNU Make 4.3
  • AWS CLI 2.11.22

Installing all the other dependencies is as easy as running:

make install

When developing run:

make install_dev

Prepare credentials:

cp .env.example .env

--> and complete the .env file with your external services credentials.

2.2. Alpaca & Qdrant

Check out the Setup External Services section to see how to create API keys for them.

2.3. AWS CLI

deploy the streaming pipeline to AWS [optional]

You can deploy the streaming pipeline to AWS in 2 ways:

  1. running Make commands: install & configure your AWS CLI 2.11.22 as explained in the Setup External Services section
  2. using the GitHub Actions CI/CD pipeline: only create an account and generate a pair of credentials as explained in the Setup External Services section

3. Usage

3.1. Local

Run the streaming pipeline in real-time and production modes:

make run_real_time

To populate the vector DB, you can ingest historical data from the latest 8 days by running the streaming pipeline in batch and production modes:

make run_batch

For debugging & testing, run the streaming pipeline in real-time and development modes:

make run_real_time_dev

For debugging & testing, run the streaming pipeline in batch and development modes:

make run_batch_dev

To query the Qdrant vector DB, run the following:

make search PARAMS='--query_string "Should I invest in Tesla?"'

You can replace the --query_string with any question.

3.2. Docker

First, build the Docker image:

make build

Then, run the streaming pipeline in real-time mode inside the Docker image, as follows:

make run_real_time_docker

3.3. Deploy to AWS

3.3.1. Running Make commands

First, ensure you successfully configured your AWS CLI credentials.

After, run the following to deploy the streaming pipeline to a t2.small AWS EC2 machine:

make deploy_aws

NOTE: You can log in to the AWS console, go to the EC2s section, and you can see your machine running.

To check the state of the deployment, run:

make info_aws

To destroy the EC2 machine, run:

make undeploy_aws

3.3.2. Using the GitHub Actions CI/CD pipeline

Here, you only have to ensure you generate your AWS credentials.

Afterward, you must fork the repository and add all the credentials within your .env file into the forked repository secrets section.

Go to your forked repository -> Settings -> Secrets and variables -> Actions -> New repository secret.

Now, add all the secrets as in the image below.

GitHub Actions Secrets


Now, to automatically deploy the streaming pipeline to AWS using the GitHub Action's CI/CD pipeline, follow the next steps: Actions Tab -> Continuous Deployment (CD) | Streaming Pipeline action (on the left) -> Press "Run workflow".

GitHub Actions CD


To automatically destroy all the AWS components created earlier, you have to call another GitHub Actions workflow as follows: Actions Tab -> Destroy AWS Infrastructure -> Press "Run workflow"


In both scenarios, to see if your EC2 initialized correctly, connect to your EC2 machine and run:

cat /var/log/cloud-init-output.log

Also, to see that the streaming pipeline Docker container is running, run the following:

docker ps

You should see the streaming_pipeline docker container listed.

Note: You have to wait for ~5 minutes until everything is initialized fully.

3.4. Linting & Formatting

Check the code for linting issues:

make lint_check

Fix the code for linting issues (note that some issues can't automatically be fixed, so you might need to solve them manually):

make lint_fix

Check the code for formatting issues:

make format_check

Fix the code for formatting issues:

make format_fix