DEV Community

Understanding LangChain's RecursiveCharacterTextSplitter

Youdiowei Eteimorde on August 12, 2023

Large language models are powerful tools with extensive capabilities; nonetheless, they grapple with a distinct limitation known as the context win...
 
scitlec

The maintainers of the LangChain documentation should link to your useful explanation.

Thanks!

Pham Minh Quang

I totally agree! The LangChain documentation just sucks.

Youdiowei Eteimorde

Thanks for your kind words 🥰 Who knows, they might.

Cristian Molina

What about the chunk_overlap param?

Youdiowei Eteimorde • Edited

The chunk_overlap parameter determines how much the chunks overlap with each other.

For example, let's split your comment into three chunks:

What about | the chunk_ | overlap param?

Now let's make neighboring chunks overlap by a few characters:

What about the | about the chunk_ | chunk_overlap param?

Without chunk overlap, your comment would have lost its meaning when split.
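
The idea can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: `chunk_with_overlap` is a hypothetical helper, not part of LangChain, and it cuts fixed-size character windows, whereas the real splitter builds its overlap from separator-delimited pieces, so its output will differ. The `chunk_size` of 12 is arbitrary.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size character windows.

    Each window starts chunk_overlap characters before the previous one ends,
    so neighboring chunks share that many characters.
    Hypothetical helper for illustration; not LangChain's algorithm.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "What about the chunk_overlap param?"
no_overlap = chunk_with_overlap(text, chunk_size=12, chunk_overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=12, chunk_overlap=5)

print(no_overlap)    # chunks share nothing
print(with_overlap)  # each chunk repeats the last 5 characters of the previous one
```

With an overlap of 0 the chunks simply concatenate back into the original text; with an overlap of 5 there are more chunks, and each one carries a little of its neighbor's context.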

Cristian Molina

Thanks! That makes sense, but what value should I use if, for instance, I need to save the texts in a vector DB later to augment a RAG?
Does it matter? If this is significant, I'd add this information to the article.
Thanks again.

Youdiowei Eteimorde

It all depends on your data and what you are trying to achieve. Augmenting LLMs with external knowledge is still in its infancy, so you can experiment with different parameters to see how your LLM performs during RAG.

James Stover

Something doesn’t quite work right: after splitting, some words throughout my text are broken apart by a space, turning each of them into two non-words. There are quite a few characters between occurrences, so it isn’t frequent, but in a large body of text these add up. I am concerned about the detrimental impact on the vector embeddings and retrieval.

Youdiowei Eteimorde

Splitting is far from perfect. Hopefully more efficient techniques will be developed.

Pahlwan Rabiul Islam

Thanks a lot

Ajeet Verma

great explanation!

Ni

You should write more posts like this!
For someone new to LangChain and text split, this post really went deep on the subject.

Thanks!

Youdiowei Eteimorde

Thank you for your nice words ☺️

RamKumar-T-R

Good write-up, with insightful details on the implementation.

Nikhil Kulkarni

This was very helpful. Thanks for the detailed explanation!

edcults

Thanks a lot for the detailed explanation. I wonder why this is not linked from or published on the LangChain blog.

Sophie du Couédic

I wonder why '.' is not one of the default separators. It seems to me that it would be an effective way to separate sentences.
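
For anyone who wants to see what a sentence separator would change, here is a rough sketch of recursive splitting in plain Python. It rests on several assumptions: `recursive_split` and `merge` are hypothetical helpers, not LangChain code; there is no overlap; and the separator is dropped at chunk boundaries, so sentence-final periods can disappear. LangChain's own splitter takes a `separators` argument, so a similar list could be tried there, though its exact output may not match this sketch.

```python
def merge(pieces: list[str], sep: str, chunk_size: int) -> list[str]:
    """Greedily join adjacent pieces with sep while the result fits in chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitting (hypothetical sketch, no overlap)."""
    # Use the first non-empty separator that occurs in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Nothing left to split on: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    finer = separators[separators.index(sep) + 1:]
    chunks, short = [], []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            short.append(part)
        else:
            # Flush the short parts, then split the long part with finer separators.
            chunks.extend(merge(short, sep, chunk_size))
            short = []
            chunks.extend(recursive_split(part, finer, chunk_size))
    chunks.extend(merge(short, sep, chunk_size))
    return chunks


text = "Chunking matters. Overlap keeps context. Separators set the cut points."
default_like = ["\n\n", "\n", " ", ""]          # same values as the defaults
sentence_aware = ["\n\n", "\n", ". ", " ", ""]  # adds a sentence separator

print(recursive_split(text, default_like, chunk_size=32))
print(recursive_split(text, sentence_aware, chunk_size=32))
```

On this sample text, the first list cuts in the middle of sentences, while the second keeps each sentence in its own chunk. Whether that helps on real data depends on how reliably ". " marks a sentence end there (abbreviations and decimals will trip it up).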

Vishal Nagda

It's really a valuable post.