Character Text Splitter

Category: Splitter

Purpose

To split data into chunks of a given size using a separator, so that each chunk stays within the token limit of the embedding model.

Description

The Character Text Splitter component splits data into chunks of a given size, measured in characters, using a separator. Data needs to be chunked when its token count exceeds the limit imposed by the embedding model used for the upserting activity. In the given activity, dataFomTextFile is configured to be chunked with the chunk size set to 4000 characters.
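As a rough illustration of the relationship between character counts and token limits, the sketch below estimates whether a piece of text needs chunking. It assumes about 4 characters per token (a common rule of thumb for English text, not a property of this component) and a hypothetical model limit of 8191 tokens; the function names are illustrative only. For exact counts, use the tokenizer that belongs to your embedding model.

```python
import math

# Assumption: rough rule of thumb for English text, not an exact tokenizer.
CHARS_PER_TOKEN = 4
# Assumption: hypothetical embedding-model limit; check your model's documentation.
TOKEN_LIMIT = 8191


def estimate_tokens(text: str) -> int:
    """Estimate the token count of text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def needs_chunking(text: str, token_limit: int = TOKEN_LIMIT) -> bool:
    """Return True when the estimated token count exceeds the limit."""
    return estimate_tokens(text) > token_limit
```

Under these assumptions, a 4000-character chunk corresponds to roughly 1000 tokens, which leaves a wide margin below the hypothetical limit.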

Property

Data Type

Description

General

Display Name

String

Display name of the component.

Enable Pause

-NA-

Option to pause the activity related to the component during job execution when a pause control signal is received from the Cockpit application. The checkbox is selected by default.

Input

Chunk Overlap*

Int32

Represents the number of characters by which two consecutive chunks overlap. The overlap carries context from the end of one chunk into the start of the next.

Chunk Size*

Int32

Represents the number of characters that define the size of each chunk.

Note: Choose a chunk size whose text stays within the token limit imposed by the embedding model that is to be used for upserting and other AI/ML activities.

Separator*

String

One or more characters used to split the data into chunks.

For example, using a newline character ("\n") as a separator, you can chunk data of a document such that each chunk represents a paragraph within the document.

Text*

String

The data that needs to be chunked.

For large data, you can save the data in a file, read it into a variable (using one of the components available in the Designer Toolbox), and then assign the variable to the Text property.

Misc

Enable Bookmark

-NA-

Option to set a bookmark.

Is Reserved

-NA-

Option to disable data tracing related to the component.

Output

Result*

ITextSplitter

An object that contains the generated chunks.

The object should be assigned to the Splitter property of an upserting component. (See Vector Store.)

Note: Properties marked with an asterisk (*) are mandatory.
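To show how Chunk Size, Chunk Overlap, and Separator interact, here is a minimal sketch of a separator-based character splitter. It is not the component's implementation: the function name split_text and its edge-case behavior (for example, keeping a single piece whole even if it is longer than the chunk size) are assumptions made for illustration.

```python
def split_text(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text on separator, then merge pieces into chunks of at most
    chunk_size characters, repeating up to chunk_overlap trailing characters
    (in whole pieces) at the start of the next chunk."""
    if not separator:
        raise ValueError("separator must not be empty")
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")

    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current: list[str] = []  # pieces that make up the chunk being built

    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            # The chunk is full: emit it, then keep only the trailing pieces
            # that fit in the overlap and still leave room for the new piece.
            chunks.append(separator.join(current))
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # Assumption: a piece longer than chunk_size is kept whole.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "alpha\nbeta\ngamma\ndelta"
print(split_text(sample, "\n", chunk_size=11, chunk_overlap=5))
# ['alpha\nbeta', 'beta\ngamma', 'gamma\ndelta']
```

In this sketch the overlap is applied in whole pieces, so each chunk begins with the last piece of the previous chunk when that piece fits within the overlap. Setting chunk_overlap to 0 yields chunks that share no content.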
