Step 1: Create database clusters
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an entry to the network access whitelist
Step 5: Add a database access user
Step 6: Connect to your local Atlas deployment or Atlas Cluster
Step 7: Retrieve text from the PDF
Step 8: Embed the PDF text locally, then insert it into MongoDB
Step 9: Check the documents in the collection
Step 10: Create vector search index
Step 11: Query using the vector search index
Step 1: Create database clusters
Step 2: Enter the database cluster information
Step 3: Wait for the cluster to deploy
Step 4: Add an entry to the network access whitelist
Step 5: Add a database access user
Step 6: Connect to your local Atlas deployment or Atlas Cluster
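Before running the later steps, it can help to confirm that the connection string and credentials actually work. The following is a minimal sketch, not the author's code: `build_atlas_uri` and `check_connection` are hypothetical helper names, and the host is the placeholder used elsewhere in this post. It percent-encodes the username and password (special characters such as `@` or `:` otherwise break the URI) and sends a `ping` command.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username: str, password: str, host: str) -> str:
    """Build a mongodb+srv URI with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"


def check_connection(uri: str) -> bool:
    """Return True if the deployment answers a ping (requires network access)."""
    # Imported lazily so build_atlas_uri can be used without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import PyMongoError

    client = MongoClient(uri, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        return True
    except PyMongoError:
        return False
    finally:
        client.close()
```

A typical call would be `check_connection(build_atlas_uri("<username>", "<password>", "internal-knowledge-base.xxxxx.mongodb.net"))`. If it returns `False`, the whitelist entry from Step 4 and the user from Step 5 are the first things to check.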
Step 7: Retrieve text from the PDF
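This post does not show how the text file read in Step 8 was produced. One possible way, sketched here under the assumption that the `pypdf` package is installed (it is not in the install list of this post), is to extract the text page by page and join it. The function names are hypothetical.

```python
from typing import Iterable, List


def extract_pages(pdf_path: str) -> List[str]:
    """Return the extracted text of each PDF page (assumes pypdf is installed)."""
    # Assumption: pypdf is not part of the original install list.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty value for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pages_to_text(pages: Iterable[str]) -> str:
    """Join page texts with newlines, skipping pages without extractable text."""
    return "\n".join(p.strip() for p in pages if p.strip())
```

The result of `pages_to_text(extract_pages("./payment.pdf"))` (a hypothetical input path) could then be written to `./txt_final/payment.txt`, the file that Step 8 reads. Scanned PDFs without a text layer would need OCR instead.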
Step 8: Embed the PDF text locally, then insert it into MongoDB
pip install sentence-transformers==2.7.0
pip install pymongo==4.7.2
pip install langchain==0.2.6
pip install pandas==2.2.0
pip install langchain-openai==0.1.20
pip install langchain-chroma==0.1.0
pip install langchain-core==0.2.26
pip install langchain-huggingface==0.0.3
pip install langchain-mongodb==0.1.4
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
print("Load the document text")
with open("./txt_final/payment.txt", "r", encoding="utf8") as file:
    data = file.read()
print("Split the text into pages")
# The site URL appears on every page of the source text, so it works as a page separator.
splits = data.split("www.iresearch.com.cn")
print("Load the embedding model")
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")
print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]
# Embed each page and store the vector alongside its text.
for split in splits:
    embedding = model.embed_query(split)
    collection.insert_one({'text_embedding': embedding, 'summary': split})
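Splitting on a separator string usually leaves empty or whitespace-only chunks (for example after the last separator), and inserting one document per round trip is slow for larger files. The sketch below is one possible variation, not the author's method: `clean_splits` and `embed_and_insert` are hypothetical helpers that drop empty chunks, embed all pages with `embed_documents`, and write them with a single `insert_many` call.

```python
from typing import List


def clean_splits(splits: List[str]) -> List[str]:
    """Strip whitespace and drop chunks that are empty after stripping."""
    return [s.strip() for s in splits if s.strip()]


def embed_and_insert(collection, model, splits: List[str]) -> int:
    """Embed all non-empty chunks and insert them in one batch.

    `collection` is a pymongo collection and `model` is an embeddings object
    exposing embed_documents(), such as HuggingFaceEmbeddings.
    Returns the number of inserted documents.
    """
    texts = clean_splits(splits)
    if not texts:
        return 0
    embeddings = model.embed_documents(texts)
    docs = [
        {"text_embedding": embedding, "summary": text}
        for embedding, text in zip(embeddings, texts)
    ]
    result = collection.insert_many(docs)
    return len(result.inserted_ids)
```

With the objects from the script above, the loop could be replaced by `embed_and_insert(collection, model, splits)`. The field names match the ones used in the rest of this post.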
Step 9: Check the documents in the collection
PDF page 3
PDF page 4
Data structure
{
"_id": "66b79fd22e6781dc9195820f",
"text_embedding": [0.019098538905382156, -0.0010181389516219497],
"summary": "Diversified development paths for third-party payment platforms Third-party payment platforms integrate into every detail of consumer life through lightweight reach...."
}
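Besides inspecting documents in the Atlas UI, the structure above can be checked in code. This is a small sketch with a hypothetical helper, `check_record`, that verifies a stored document has a numeric embedding of the expected length (1024, the `numDimensions` value used for the index in this post) and a non-empty `summary`.

```python
from numbers import Real
from typing import List


def check_record(doc: dict, dims: int = 1024) -> List[str]:
    """Return a list of problems found in one stored document (empty if none)."""
    problems = []
    embedding = doc.get("text_embedding")
    if not isinstance(embedding, list) or not all(isinstance(x, Real) for x in embedding):
        problems.append("text_embedding is not a list of numbers")
    elif len(embedding) != dims:
        problems.append(f"text_embedding has {len(embedding)} values, expected {dims}")
    summary = doc.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        problems.append("summary is missing or empty")
    return problems
```

For example, `check_record(collection.find_one())` should return an empty list for a correctly inserted page, and `collection.count_documents({})` gives the number of stored pages.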
Step 10: Create vector search index
{
"fields": [
{
"type": "vector",
"path": "text_embedding",
"numDimensions": 1024,
"similarity": "cosine"
}
]
}
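The same index definition can also be created from Python instead of the Atlas UI. The sketch below assumes PyMongo 4.7 or later (the version pinned in this post is 4.7.2), where `SearchIndexModel` accepts a `type` argument. The helper names are hypothetical, and the index name `vector_index` is the one the query in Step 11 expects.

```python
def vector_index_definition(path: str = "text_embedding",
                            dims: int = 1024,
                            similarity: str = "cosine") -> dict:
    """Build the Atlas Vector Search index definition used in this post."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": dims,
                "similarity": similarity,
            }
        ]
    }


def create_vector_index(collection, name: str = "vector_index") -> str:
    """Create the vector search index on a pymongo collection; returns its name."""
    # Imported lazily so the definition helper works without pymongo installed.
    from pymongo.operations import SearchIndexModel

    model = SearchIndexModel(
        definition=vector_index_definition(),
        name=name,
        type="vectorSearch",
    )
    return collection.create_search_index(model=model)
```

Index creation is asynchronous on the Atlas side, so queries may return nothing until the index has finished building.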
Step 11: Query using the vector search index
from pymongo import MongoClient
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch
import pprint
print("Connect to your local Atlas deployment or Atlas Cluster")
mongo_client = MongoClient("mongodb+srv://<username>:<password>@internal-knowledge-base.xxxxx.mongodb.net/")
collection = mongo_client["internal-knowledge-base"]["papers"]
model = HuggingFaceEmbeddings(model_name="BAAI/bge-large-zh-v1.5")
vector_store = MongoDBAtlasVectorSearch(
collection=collection,
embedding=model,
index_name="vector_index",
embedding_key="text_embedding",
text_key="summary"
)
query = "蚂蚁集团"  # Ant Group
results = vector_store.similarity_search(query)
pprint.pprint(results)
Result:
English version
[
Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='Ant Group-Alipay Ecological Foundation'),
Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='The competitive landscape of independent third-party payment platforms has formed, led by Alipay'),
Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='Aikan Series Monthly Inventory of Tourism Activity in Scenic Areas')
]
Chinese version
[
Document(metadata={'_id': {'$oid': '66b79fdc2e6781dc91958211'}}, page_content='蚂蚁集团-支付宝生态筑底'),
Document(metadata={'_id': {'$oid': '66b79fcd2e6781dc9195820e'}}, page_content='独立第三方支付平台竞争格局形成以支付宝为首'),
Document(metadata={'_id': {'$oid': '66b79fc32e6781dc9195820c'}}, page_content='-艾瞰系列-景区旅游活跃度盘点月报')
]
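The LangChain wrapper hides the underlying aggregation. For readers who want similarity scores or more control, the Step 11 query can also be written as a raw `$vectorSearch` pipeline. This is a sketch under assumptions: `vector_search_pipeline` is a hypothetical helper, and the `limit` and `num_candidates` defaults are illustrative values, not settings taken from this post.

```python
from typing import List, Sequence


def vector_search_pipeline(query_vector: Sequence[float],
                           limit: int = 4,
                           num_candidates: int = 100,
                           index: str = "vector_index",
                           path: str = "text_embedding") -> List[dict]:
    """Build an aggregation pipeline that runs Atlas Vector Search."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,  # candidates considered before ranking
                "limit": limit,                   # documents returned
            }
        },
        # Keep only the text and the similarity score of each match.
        {
            "$project": {
                "_id": 0,
                "summary": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With the objects from Step 11, `collection.aggregate(vector_search_pipeline(model.embed_query(query)))` yields documents containing `summary` and `score`. The query vector must come from the same embedding model that produced the stored vectors.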
Reference:
https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/
Build a PDF ingestion and Question/Answering system
https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/
How to Create Vector Embeddings
https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/local-rag/#std-label-local-rag
Build a Local RAG Implementation with Atlas Vector Search
https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/#std-label-langchain
Get Started with the LangChain Integration
https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/
Local Embeddings with HuggingFace
Editors
Danny Chan, specializing in FSI and Serverless
Kenny Chan, specializing in FSI and Machine Learning