Vanna now supports Marqo as a native integration to help generate SQL using AI
What is Marqo?
Marqo is an end-to-end, multimodal vector search engine. Users can store and query unstructured data such as text, images, and code. You can see more about Marqo here . Once you've got a Marqo instance up and running, you can use it to generate SQL with the help of the Vanna Python library.
What is Vanna?
Vanna is a Python library that uses Retrieval Augmented Generation (RAG) to generate SQL queries.
Fundamentally it provides 2 high-level functions:
-
vn.train()
- This is used to store information in a vector database or vector search like Marqo. -
vn.ask()
- This will retrieve information from the vector database and use it to construct a prompt for the LLM. The LLM will then generate SQL queries based on the prompt.
How can you use it to help generate SQL?
You can give the Vanna Python library information about your database and then ask it to generate SQL queries for you. The metadata will be stored in Marqo and then used to generate SQL queries via your favorite LLM.
Start the Marqo Docker Container
Install the Vanna Python Library and the Marqo Python Library
pip install vanna
pip install marqo
Configure Vanna to use Marqo as the Vector Store, with your favorite LLM
In this case we're using OpenAI's GPT-4 model.
from vanna.marqo.marqo import Marqo_VectorStore
from vanna.openai.openai_chat import OpenAI_Chat
class MyVanna(Marqo_VectorStore, OpenAI_Chat):
def __init__(self, config=None):
Marqo_VectorStore.__init__(self, config=config) # You can pass additional Marqo configuration options like the url and model name here
OpenAI_Chat.__init__(self, config=config)
vn = MyVanna(config={'api_key': os.environ['OPENAI_API_KEY'], 'model': 'gpt-4'})
Add Training Data
An example of training data could be DDL statements, SQL queries, or any other text-based documentation you have about your database, business, or industry.
Here we'll add a simple DDL statement to the vector store.
vn.train(ddl='CREATE TABLE users (id INT, name VARCHAR(255), email VARCHAR(255), PRIMARY KEY (id))')
In reality, you'd want to add a lot more training data. You can add as much as you want. The more you add, the better the results will be. See this example notebook for how you can do this.
Ask a Question
Now we can ask a question and get the SQL, tabular results, automatically generated Plotly code, and potential follow-up questions.
vn.ask('How many users are there?')
Connect to a Database
You can also connect to a database to run the queries and get the tabular results, chart, and more. See this example notebook for how you can do this. The notebook also shows how to connect to various databases and other LLMs.
User Interface
You can use this with Streamlit or any other Python-based UI framework. We have 2 open-source streamlit apps that you can use as a starting point.