Character Text Splitter
Category: Splitter
Purpose
To split data into chunks of a given size using a separator.
Description
The Character Text Splitter component splits data into chunks of a given size using a separator. Data must be chunked when its token count exceeds the token limit imposed by the embedding model used for the upserting activity. In the given activity, dataFomTextFile is configured to be chunked with the chunk size set to 4000 characters.
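The relationship between character counts and token limits can be illustrated with a rough estimate. The following is a minimal sketch, not part of the component: it assumes an average of about 4 characters per token, which is only a rule of thumb for English text and varies by tokenizer. The names estimated_tokens, needs_chunking, and token_limit are hypothetical. For exact counts, use the tokenizer of the embedding model itself.

```python
import math

# Assumption: roughly 4 characters per token on average for English text.
# The real ratio depends on the tokenizer used by the embedding model.
CHARS_PER_TOKEN = 4.0


def estimated_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate derived from the character count."""
    return math.ceil(len(text) / chars_per_token)


def needs_chunking(text: str, token_limit: int,
                   chars_per_token: float = CHARS_PER_TOKEN) -> bool:
    """Return True if the estimated token count exceeds the model's limit."""
    return estimated_tokens(text, chars_per_token) > token_limit


if __name__ == "__main__":
    sample = "a" * 4000  # same length as the 4000-character chunk size above
    print(estimated_tokens(sample))
    print(needs_chunking(sample, token_limit=512))
```

Under this assumption, a 4000-character chunk corresponds to an estimated 1000 tokens, so it would fit a model with a limit above that figure but not one with a lower limit.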
Property | Data Type | Description |
General | | |
Display Name | String | Display name of the component. |
Enable Pause | -NA- | Option to pause the activity associated with the component during job execution when a pause control signal is received from the Cockpit application. The checkbox is selected by default. |
Input | | |
Chunk Overlap* | Int32 | The number of characters by which two consecutive chunks overlap. The overlap gives the second chunk context from the end of the first. |
Chunk Size* | Int32 | The number of characters that defines the size of each chunk. Note: The chunk size in characters should correspond to a token count that is within the limit imposed by the embedding model that is to be used for upserting and other AI/ML activities. |
Separator* | String | A single character or a set of characters used to chunk the data. For example, using a newline character ("\n") as the separator, you can chunk a document so that each chunk represents a paragraph of the document. |
Text* | String | The data that needs to be chunked. For large data, you can save the data in a file, read it into a variable (using one of the components available in the Designer Toolbox), and then assign the variable to the Text property. |
Misc | | |
Enable Bookmark | -NA- | Option to set a bookmark. |
Is Reserved | -NA- | Option to disable data tracing related to the component. |
Output | | |
Result* | ITextSplitter | An object that contains the generated chunks. The object should be assigned to the Splitter property of an upserting component. (See Vector Store.) |
Note: Property names marked with an asterisk (*) are mandatory.
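How Separator, Chunk Size, and Chunk Overlap interact can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the component's actual implementation, which is not described here. The function split_text is hypothetical. It assumes that the text is first split on the separator, that the pieces are merged greedily until the chunk size would be exceeded, and that trailing pieces of the previous chunk are carried into the next chunk as long as they fit within the overlap.

```python
def split_text(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text on separator and merge pieces into chunks of at most chunk_size characters.

    Assumption: overlap is applied at piece granularity. A single piece longer
    than chunk_size is emitted as one oversized chunk rather than being cut.
    """
    if not separator:
        raise ValueError("separator must not be empty")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    sep_len = len(separator)
    pieces = [p for p in text.split(separator) if p]  # drop empty pieces

    chunks: list[str] = []
    current: list[str] = []  # pieces of the chunk being built
    current_len = 0          # length of separator.join(current)

    for piece in pieces:
        extra = len(piece) + (sep_len if current else 0)
        if current and current_len + extra > chunk_size:
            # The chunk is full: emit it, then keep only the trailing pieces
            # that fit in the overlap and still leave room for the new piece.
            chunks.append(separator.join(current))
            while current and (
                current_len > chunk_overlap
                or current_len + sep_len + len(piece) > chunk_size
            ):
                removed = current.pop(0)
                current_len -= len(removed) + (sep_len if current else 0)
        current.append(piece)
        current_len += len(piece) + (sep_len if len(current) > 1 else 0)

    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    document = "aaaa\nbbbb\ncccc\ndddd"
    for chunk in split_text(document, "\n", chunk_size=9, chunk_overlap=4):
        print(repr(chunk))
```

With a chunk size of 9 and an overlap of 4, each chunk in this sketch repeats the last line of the previous chunk. With an overlap of 0, the same input yields two chunks that share no text.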