Simple Guide to Using Pinecone Vector Database
May 22, 2024

In this tutorial, we're going to dive into using the Pinecone vector database. This is a super handy tool for storing vector data and running similarity searches in machine learning applications.
I'll walk you through setting up your Pinecone account, creating an index, and integrating it with a Python script using the LangChain library. Let's get started!
What is Pinecone?
Pinecone is a high-performance vector database that's perfect for fast and scalable similarity searches. It's especially useful for applications that need to efficiently handle and query high-dimensional vectors, like the ones generated by machine learning models for tasks such as text and image embeddings.
Getting Started with Pinecone
Create an Account
First things first, head over to pinecone.io and create an account. Once you're registered, you can go ahead and create a new index.
Create an Index
- Name: Pick a name for your index.
- Dimension: Set the dimension to 1536. This matches the dimensionality of embeddings generated by OpenAI's embedding models: text-embedding-ada-002 (and, by default, text-embedding-3-small) map text into a 1536-dimensional vector space that captures its semantic meaning.
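If you'd rather create the index from code than from the web console, the official pinecone Python client offers a create_index method. Here's a minimal sketch, assuming a serverless index; the cloud, region, and index name are placeholders you should adapt:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
pc.create_index(
    name="testindex",      # any name; reuse it in the scripts below
    dimension=1536,        # matches OpenAI's embedding size
    metric="cosine",       # cosine similarity is a common choice for text
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # placeholder cloud/region
)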
API Key
Next, navigate to the API Keys section and copy your key. You'll need this in your Python script to interact with Pinecone.
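Both LangChain integrations used below pick up their credentials from environment variables. One simple setup (an assumption; you can also pass the keys explicitly) is to set PINECONE_API_KEY and OPENAI_API_KEY at the top of your script, or export them in your shell before running:

import os

os.environ["PINECONE_API_KEY"] = "your-pinecone-api-key"  # from the API Keys section
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"      # needed for the embeddings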
Let's start coding ...
Next, let's create a Python script to store document chunks as embeddings in Pinecone using the LangChain library.
LangChain is a handy library that makes it easier to work with large language models and vector databases. It offers tools for loading documents, splitting text, generating embeddings, and interacting with vector stores like Pinecone.
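Before we start coding, the libraries need to be installed. Based on the imports used below, one way to get everything (assuming pip and the standard PyPI package names) is:

pip install langchain-community langchain-openai langchain-text-splitters langchain-pinecone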
Storing Data
Let's get started with our first file, which will store the contents of a large text file in Pinecone. Create a new Python file, store.py, and start with some imports.
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_pinecone import PineconeVectorStore
We'll need to import modules to load documents, split text, create embeddings, and interface with the Pinecone vector database:
- TextLoader: Loads documents from a specified source.
- OpenAIEmbeddings: Uses OpenAI's models to generate embeddings.
- CharacterTextSplitter: Splits text documents into smaller chunks based on character count.
- PineconeVectorStore: Manages storing and retrieving vector data in Pinecone.
Now that we have our modules ready, let's move on to loading and splitting our text documents. We'll start by loading a document, then splitting it into smaller chunks for easier processing.
loader = TextLoader("grims.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
Here's what each part of the code does:
- loader: Creates an instance of TextLoader to load documents from a file named grims.txt.
- documents: Loads the content of grims.txt into memory.
- text_splitter: Sets up CharacterTextSplitter with a chunk size of 1000 characters and no overlap (chunk_overlap=0). This means each document will be split into segments of up to 1000 characters without any overlapping sections.
- docs: Splits the loaded documents into smaller parts according to the rules defined by text_splitter (a quick sanity check follows below).
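Before writing anything to Pinecone, a quick check can confirm the splitter behaved as expected:

print(f"Created {len(docs)} chunks")
print(docs[0].page_content[:200])  # preview the first 200 characters of chunk one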
Now that we've split the complete text into nice little chunks, it's time to create embeddings that large language models (LLMs) can understand. We'll then save these embeddings in Pinecone.
embeddings = OpenAIEmbeddings()
index_name = "testindex"
docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)
Here's what this code does:
- embeddings: Initializes an OpenAIEmbeddings object to use OpenAI's model for converting text to embeddings. These embeddings are vector representations that capture the semantic meaning of the text.
- index_name: Sets the name of the Pinecone index where the document vectors will be stored.
- docsearch: Utilizes the from_documents class method of PineconeVectorStore. This method processes the list of text documents (docs), converts them to embeddings using embeddings, and stores them in the specified Pinecone index (index_name).
After running the script, you can head over to the Pinecone website and check out your index. You'll see that it's been filled with the document embeddings.
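You can also verify this from code instead of the console. A small sketch, assuming the pinecone client and the PINECONE_API_KEY environment variable from earlier:

import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
print(pc.Index("testindex").describe_index_stats())  # shows the stored vector count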
Querying the Index
Now that we have our documents indexed, it's time to perform some similarity searches. To do this, we'll create a second Python script. Let's name it query.py.
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
index_name = "groqtest" # Use the exact index name you used before
embeddings = OpenAIEmbeddings()
vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings)
query = "What was the scissor-grinder singing?"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)
Here's what each part of the new script does:
- embeddings: An instance of OpenAIEmbeddings is created, configured to use OpenAI's model to generate embeddings.
- vectorstore: An instance of PineconeVectorStore is initialized with the index name and the embeddings object. This sets up the connection to the Pinecone database with the capability to use OpenAI embeddings.
- query: This is the text query for which we're seeking semantically similar documents or entries in the Pinecone database.
- docs: Calls the similarity_search method of the vectorstore object with the query. This method uses the embeddings from OpenAI to convert the query into a vector and then performs a similarity search in the Pinecone index.
When you run this script, it will:
- Convert your query into a vector using OpenAI embeddings.
- Perform a similarity search in the Pinecone index to find the most semantically similar documents to your query.
- Return and print these documents, giving you the closest matches based on the content.
This allows you to quickly and easily find relevant information from your indexed documents based on the semantic similarity of your query.
Example Result
Running the script with a sample query might return something like this:
"As he came to the next village, he saw a scissor-grinder with his wheel, working and singing,
‘O’er hill and o’er dale So happy I roam, Work light and live well, All the world is my home; Then who so blythe, so merry as I?’"
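If you want more than the single best match, similarity_search accepts a k parameter, and PineconeVectorStore also exposes similarity_search_with_score to get relevance scores alongside the documents. A short sketch:

top_docs = vectorstore.similarity_search(query, k=3)   # top 3 matches instead of the default
for doc, score in vectorstore.similarity_search_with_score(query):
    print(f"{score:.3f}  {doc.page_content[:80]}")     # score plus a short preview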
Conclusion
This tutorial has shown how to set up Pinecone, create an index, and use LangChain to store and query document embeddings. Pinecone's powerful vector database combined with OpenAI's embeddings provides a robust solution for building applications that require efficient similarity searches.