Transforming texts

LangChain contains a set of tools for transforming texts.

Text splitters

LangChain contains the text_splitter module, which provides implementations of different approaches to splitting text into pieces. This can be useful, for example, for chunking in a RAG pipeline.

The following table shows the available text splitters.

| Class Name | Description | Common Use Case |
|---|---|---|
| CharacterTextSplitter | Splits text based on a specified character (e.g., "\n" or a space). | Simple, quick splitting where structural integrity is not a major concern. |
| RecursiveCharacterTextSplitter | The recommended default. Splits text based on a list of characters in a hierarchical order (e.g., ["\n\n", "\n", " "]) to maintain logical chunks. | General-purpose text, such as articles, essays, and unstructured documents. |
| TokenTextSplitter | Splits text based on the number of tokens, using a specific tokenizer (e.g., tiktoken for OpenAI models). | Preparing text to fit within a specific LLM’s context window. |
| HTMLHeaderTextSplitter | Splits HTML documents based on specified header tags (h1, h2, etc.). | Processing HTML content where you want to preserve sections defined by headers. |
| MarkdownTextSplitter | Splits Markdown documents based on Markdown syntax, such as headers and code blocks. | Processing Markdown files while keeping logical sections together. |
| SentenceTransformersTokenTextSplitter | Splits text using a tokenizer from the sentence-transformers library, based on a token count. | Working with models from the sentence-transformers library. |
| NLTKTextSplitter | Splits text into sentences using the NLTK library’s sentence tokenizer. | Splitting a document into individual sentences for fine-grained processing. |
| SpacyTextSplitter | Splits text into sentences using the spaCy library. | Similar to NLTK, but leverages spaCy for sentence boundary detection, which can be more robust for some languages. |
| SemanticChunker | A more advanced splitter that uses an embedding model to identify semantic breakpoints (topic shifts) in the text. | Creating semantically coherent chunks for more effective retrieval-augmented generation. |
| Language-specific code splitters | A family of splitters for various programming languages (e.g., PythonCodeTextSplitter, JavaScriptCodeTextSplitter). | Processing code files to keep functions, classes, and other logical blocks intact. |


After initializing the splitter, use the split_text method to split the given text. The following cell demonstrates the application of the recursive text splitter to a sample text.

from langchain.text_splitter import RecursiveCharacterTextSplitter

data = """M r. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last people
you’d expect to be involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did have a
very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent so much of her
time craning over garden fences, spying on the neighbors. The Dursleys had a
small son called Dudley and in their opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and
their greatest fear was that somebody would discover it. They didn’t think they
could bear it if anyone found out about the Potters. Mrs. Potter was Mrs.
Dursley’s sister, but they hadn’t met for several years; in fact, Mrs. Dursley
pretended she didn’t have a sister, because her sister and her good-for-nothing
husband were as unDursleyish as it was possible to be. The Dursleys shuddered
to think what the neighbors would say if the Potters arrived in the street. The
Dursleys knew that the Potters had a small son, too, but they had never even
seen him. This boy was another good reason for keeping the Potters away; they
didn’t want Dudley mixing with a child like that."""

out = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=20
).split_text(data)

for t in out:
    print(t, end="\n\n")
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say
that they were perfectly normal, thank you very much. They were the last people
you’d expect to be involved in anything strange or mysterious, because they just
didn’t hold with such nonsense.

Mr. Dursley was the director of a firm called Grunnings, which made
drills. He was a big, beefy man with hardly any neck, although he did have a
very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the
usual amount of neck, which came in very useful as she spent so much of her
time craning over garden fences, spying on the neighbors. The Dursleys had a
small son called Dudley and in their opinion there was no finer boy anywhere.

The Dursleys had everything they wanted, but they also had a secret, and
their greatest fear was that somebody would discover it. They didn’t think they
could bear it if anyone found out about the Potters. Mrs. Potter was Mrs.
Dursley’s sister, but they hadn’t met for several years; in fact, Mrs. Dursley
pretended she didn’t have a sister, because her sister and her good-for-nothing
husband were as unDursleyish as it was possible to be. The Dursleys shuddered

to think what the neighbors would say if the Potters arrived in the street. The
Dursleys knew that the Potters had a small son, too, but they had never even
seen him. This boy was another good reason for keeping the Potters away; they
didn’t want Dudley mixing with a child like that.
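
The table above also lists token-based splitters. The following is a minimal sketch (not part of the original example) that applies TokenTextSplitter to the same data string. It assumes the tiktoken package is installed, and the chunk_size and chunk_overlap values are arbitrary; note that they are counted in tokens rather than characters.

from langchain.text_splitter import TokenTextSplitter

# Illustrative parameters only: sizes here are measured in tokens, not characters
token_splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=10)
token_chunks = token_splitter.split_text(data)
print(len(token_chunks))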

Embeddings

LangChain provides interfaces for interacting with embedding models. The core class here is langchain_core.embeddings.Embeddings; see its API reference for details.

The following table shows the classes that implement the embedding model interface for different providers.

| Class Name | Package |
|---|---|
| Embeddings | langchain_core.embeddings |
| OpenAIEmbeddings | langchain_openai |
| AzureOpenAIEmbeddings | langchain_openai |
| HuggingFaceEmbeddings | langchain_community.embeddings.huggingface |
| GoogleGenerativeAIEmbeddings | langchain_google_genai |
| GoogleVertexAIEmbeddings | langchain_google_vertexai |
| CohereEmbeddings | langchain_cohere |
| OllamaEmbeddings | langchain_ollama |
| VoyageEmbeddings | langchain_voyageai |
| JinaEmbeddings | langchain_community.embeddings.jina |
| FakeEmbeddings | langchain_core.embeddings.fake |
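
All of these classes implement the same interface: embed_documents for a list of texts and embed_query for a single text. Below is a minimal, purely illustrative sketch of a custom implementation of that interface; the class name and the constant vectors it returns are made up for demonstration and have no practical value.

from langchain_core.embeddings import Embeddings

class ConstantEmbeddings(Embeddings):
    """Illustrative only: returns a zero vector of a fixed size for every input."""

    def __init__(self, size: int = 8):
        self.size = size

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # One vector per input document
        return [[0.0] * self.size for _ in texts]

    def embed_query(self, text: str) -> list[float]:
        # A single vector for the query
        return [0.0] * self.size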


Consider the following example that uses OllamaEmbeddings. It uses Ollama to run the model, so an Ollama server is expected to be available.

from langchain_ollama import OllamaEmbeddings
embedder = OllamaEmbeddings(model="all-minilm")

Use the embed_documents method to obtain the embeddings for a given list of strings.

embeddings = embedder.embed_documents(
    ["Test embeddings", "some more complex text"]
)
type(embeddings)
list

An embedding is provided for each of the given documents.

len(embeddings)
2

The dimensionality of the embeddings depends on the model used.

len(embeddings[0])
384
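
There is also the embed_query method, which embeds a single string and is typically used for search queries in retrieval. Below is a minimal sketch reusing the embedder from above; the query text is arbitrary, and the resulting vector has the same dimensionality as the document embeddings.

# Embed a single query string instead of a list of documents
query_embedding = embedder.embed_query("Who are the Dursleys?")
len(query_embedding)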