
Comparative Study of Text Chunking Techniques

— Method Selection and Chunk Size Optimization —

Go Kyono — CEO, TechJapan LLC

Abstract: In Retrieval-Augmented Generation (RAG) systems leveraging Large Language Models (LLMs), text chunking is a critical preprocessing step that determines retrieval accuracy and generation quality. This paper organizes and compares the characteristics of major chunking methods, and reports findings on how chunk size selection affects system performance. Furthermore, it describes the CPO (Community-based Pruning Optimization) graph optimization approach developed by the author, including benchmark results and its application to improving Claude Code's context utilization efficiency.

1. Introduction

With the practical adoption of LLMs in recent years, RAG (Retrieval-Augmented Generation) architectures that retrieve and reference external knowledge while generating responses have become widely adopted. RAG performance heavily depends on retrieval accuracy, which is determined by the quality of text chunks stored in the index.

Chunking refers to the process of splitting lengthy documents into semantically appropriate units. The quality of this segmentation directly affects precision and recall during retrieval, and consequently the accuracy and comprehensiveness of LLM-generated responses. Despite this importance, systematic comparative studies of chunking methods remain insufficient, and practitioners often rely on rules of thumb for method selection.

This paper surveys representative chunking methods, compares their characteristics, and organizes findings related to chunk size optimization. The findings presented here are derived from the author's research and development activities; specific implementation details are omitted.

2. Overview and Comparison of Chunking Methods

Text chunking methods can be broadly classified into the following five categories based on their segmentation criteria. The overview and characteristics of each method are described below.

2.1 Fixed-size Chunking

The most basic method, which mechanically splits text at a predetermined character or token count. Its advantages are extremely simple implementation and fast processing speed. However, it has a fundamental drawback: splits can occur mid-sentence or mid-paragraph, easily compromising semantic coherence.
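As a minimal illustration of this mechanical splitting (character-based here; a token-based variant would swap in a tokenizer), fixed-size chunking can be sketched as:

```python
def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into mechanical fixed-size character chunks.

    Note: splits can land mid-sentence or mid-word, which is exactly
    the semantic-coherence drawback described above.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The last chunk is simply whatever remains, so it may be shorter than `chunk_size`.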

2.2 Fixed-size with Overlap

An improvement on fixed-size chunking that introduces overlap regions between adjacent chunks. The overlap preserves contextual information around chunk boundaries, tending to improve retrieval accuracy compared to plain fixed-size chunking. However, there is a trade-off in increased index size due to overlapping portions. Typically, overlap is set at 10-20% of the chunk size.
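The overlap variant only changes the stride between chunk start positions. A minimal character-based sketch (the 100-character overlap in the example below corresponds to 20% of the chunk size):

```python
def overlapping_chunks(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Fixed-size chunks whose start positions advance by chunk_size - overlap,
    so each chunk repeats the tail of the previous one."""
    step = chunk_size - overlap  # default 75/500 = 15% overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The repeated region is what preserves context across boundaries, at the cost of a proportionally larger index.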

2.3 Semantic Chunking

A method that uses embedding models to calculate semantic similarity between sentences or paragraphs, splitting at semantic boundaries. Cosine similarity between embedding vectors of adjacent text fragments is computed, and points where similarity falls below a threshold are designated as chunk boundaries. Since semantically coherent chunks are generated, improved retrieval accuracy can be expected. However, the embedding computation cost significantly increases processing time compared to fixed-size methods.
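The boundary-detection logic can be sketched as follows. This is a simplified illustration: `embed` is a caller-supplied embedding function (in practice a model such as a sentence-transformer), and the `threshold` value is an assumption to be tuned per corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_chunks(sentences, embed, threshold=0.5):
    """Open a new chunk wherever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: chunk boundary
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

The per-sentence embedding calls are the dominant cost, which is why this method is slower than the fixed-size family.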

2.4 Recursive Chunking

A method that applies multiple delimiters in order of priority, splitting text in stages. It first attempts to split by paragraph breaks (double newlines); if resulting chunks exceed the target size, it then splits by sentence boundaries (periods), and if necessary, by word boundaries, recursively applying finer granularity. This balanced method respects the natural structure of documents while keeping chunk sizes within a target range.
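The staged fallback described above can be sketched as a short recursion. This is a simplified version (production splitters such as LangChain's also merge small adjacent pieces back up toward the target size, which is omitted here):

```python
def recursive_chunks(text: str, max_size: int = 500,
                     separators=("\n\n", ". ", " ")) -> list[str]:
    """Split by the coarsest separator first; recurse with finer
    separators only for pieces that still exceed max_size."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_size:
            if piece.strip():
                chunks.append(piece)
        else:
            chunks.extend(recursive_chunks(piece, max_size, rest))
    return chunks
```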

2.5 Document Structure-based Chunking

In structured documents such as Markdown, HTML, or PDF, this method uses document structure elements like headings, sections, and lists as splitting criteria. Since segmentation follows the logical structure of the document, each chunk has high semantic completeness. However, it is difficult to apply when input documents lack clear structure (e.g., plain text), and it depends on the document format.
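For the Markdown case, a minimal section splitter that uses ATX headings as boundaries might look like this (HTML and PDF would need format-specific parsers, which is the format dependency noted above):

```python
import re

def markdown_section_chunks(md: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading-led section.

    Text before the first heading becomes a chunk with heading=None.
    """
    chunks, heading, lines = [], None, []
    for line in md.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if lines:
                chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
            heading, lines = m.group(2), []
        else:
            lines.append(line)
    if lines or heading is not None:
        chunks.append({"heading": heading, "text": "\n".join(lines).strip()})
    return chunks
```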

2.6 Method Comparison

Table 1 presents a comparison of the above methods based on key evaluation criteria.

Table 1: Comparison of Chunking Methods
Method | Semantic Coherence | Processing Speed | Ease of Implementation | Size Uniformity | Format Dependency
Fixed-size | Low | Very fast | Easy | High | None
Fixed-size with overlap | Slightly low | Fast | Easy | High | None
Semantic | High | Slow | Moderate | Low | None
Recursive | Medium-High | Fast | Moderate | Medium | Low
Document structure-based | High | Fast | Moderate-High | Low | High

As the comparison table clearly shows, each method has inherent trade-offs, and no single universal method exists. Semantic chunking is advantageous when semantic coherence is prioritized, but increased processing costs are unavoidable. Recursive chunking is a practical choice when seeking a balance between processing speed and semantic quality.

3. Considerations on Optimal Chunk Size

Chunk size is just as important as the choice of chunking method: it affects both retrieval accuracy and generation quality, and therefore requires careful consideration.

3.1 Impact of Chunk Size

Chunk size configuration can be understood as a problem of balancing the following two competing requirements.

Small Chunks (~256 tokens)

  • Finer retrieval granularity tends to improve Precision
  • Suitable for extracting specific facts and numerical data
  • Risk of insufficient context for LLM to generate accurate responses
  • Increased index size and retrieval cost due to higher chunk count

Large Chunks (1024+ tokens)

  • Rich contextual information tends to stabilize generation quality
  • Fewer chunks result in more efficient indexing
  • Risk of irrelevant information inclusion, decreasing Recall
  • Pressure on LLM's context window

3.2 Experimental Findings

In the author's research and development, RAG system performance was evaluated across various chunk sizes on text data from multiple domains. The key findings are reported below.

  1. 256-512 tokens as a general recommended range: In many use cases, the 256-512 token range demonstrated the best balance between retrieval accuracy and generation quality. This range approximates the length of typical paragraphs, making it easier to maintain semantic completeness.
  2. Optimal values vary by domain characteristics: It was confirmed that optimal chunk size differs depending on the document domain, such as technical documents, legal documents, or general articles. In domains where specialized terminology definitions and statutory citations are frequent, slightly larger chunk sizes (512-768 tokens) were effective.
  3. Correlation with question granularity: It was observed that smaller chunks are advantageous for factoid-type questions (those asking for specific facts), while larger chunks are favorable for summary/explanatory questions.
  4. Effect of overlap: Setting overlap equivalent to 10-15% of the chunk size was confirmed to mitigate information loss at chunk boundaries. However, overlap exceeding 20% showed performance degradation due to increased redundancy.
Note: The above findings are results from the author's experimental environment. Optimal values may vary depending on the embedding model, LLM, evaluation dataset, and other factors. Benchmark evaluation using target domain data is recommended for production deployment.

3.3 Framework for Chunk Size Selection

  1. Set a baseline: Start with approximately 512 tokens as the baseline.
  2. Analyze use cases: Analyze the expected question granularity and document characteristics, adjusting size as needed.
  3. Apply overlap: Set overlap at 10-15% of the chunk size.
  4. Evaluate and iterate: Use representative query sets to evaluate retrieval accuracy and generation quality, fine-tuning parameters accordingly.
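The starting point of this framework (steps 1-3) can be encoded as a small heuristic. This is a sketch of the paper's own recommendations, not a universal rule; the specific adjustment factors below are illustrative assumptions and should be replaced by the evaluation loop in step 4:

```python
def suggest_chunk_params(domain: str = "general", question_style: str = "mixed") -> dict:
    """Heuristic starting point following Section 3: 512-token baseline,
    larger chunks for citation-heavy domains, smaller for factoid queries,
    overlap within the recommended 10-15% band."""
    size = 512  # baseline (step 1)
    if domain in ("legal", "technical"):
        size = 640  # midpoint of the 512-768 range found effective for these domains
    if question_style == "factoid":
        size = max(256, size // 2)  # factoid questions favor smaller chunks
    elif question_style == "summary":
        size = min(1024, int(size * 1.5))  # summary questions favor larger chunks
    overlap = int(size * 0.125)  # 12.5%, inside the 10-15% recommendation (step 3)
    return {"chunk_size": size, "overlap": overlap}
```

Whatever this returns should be treated only as the initial configuration for the evaluate-and-iterate loop.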

4. Graph Optimization Approach with CPO

The chunking methods described in previous chapters all focus on text "segmentation." However, simply storing chunked content in a search index leaves issues of redundancy and retrieval noise unresolved. This chapter introduces CPO (Community-based Pruning Optimization), a graph optimization approach developed by the author. CPO applies mathematically principled optimization to post-chunking entity groups, achieving substantial improvements in search quality and token efficiency.

4.1 Limitations of Conventional Methods

Conventional regex-based syntactic extraction for entity management has the following fundamental limitations.

Table 2: Limitations of Conventional Syntactic Extraction
Process | Method | Limitation
Entity extraction | Regex pattern matching | Misses semantic relationships
Classification | File path keyword matching | Rigid classification via fixed rules
Confidence scoring | Docstring presence / parameter count | Ignores usage frequency and references
Categorization | Manually defined fixed categories | Divergence from actual code semantics

In other words, conventional methods can extract "what exists" but struggle to grasp "what is important," "what is similar," and "what forms a group" — the semantic relationships between entities.

4.2 The Four Mathematical Principles of CPO

CPO achieves structural optimization of chunk/entity groups by combining the following four mathematical principles.

4.2.1 Louvain Community Detection — Automatic Category Discovery

A similarity graph is constructed from chunk embedding vectors, and Louvain algorithm community detection is applied. This makes it possible to "discover" categories rather than "define" them manually. For example, semantically cohesive groups such as authentication function clusters, UI component clusters, and data transformation clusters are automatically extracted.
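A minimal sketch of this step using NetworkX's Louvain implementation is shown below. The similarity threshold of 0.7 for edge creation is an illustrative assumption (Section 4.2.4 describes how CPO determines it automatically), and the cosine helper stands in for whatever embedding pipeline produces the vectors:

```python
import itertools
import math
import networkx as nx

def detect_categories(embeddings: dict, threshold: float = 0.7):
    """Build a similarity graph over chunk embeddings and run Louvain
    community detection to discover categories automatically."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    g = nx.Graph()
    g.add_nodes_from(embeddings)
    for a, b in itertools.combinations(embeddings, 2):
        sim = cosine(embeddings[a], embeddings[b])
        if sim >= threshold:
            g.add_edge(a, b, weight=sim)  # only sufficiently similar chunks are linked
    return nx.community.louvain_communities(g, weight="weight", seed=0)
```

Each returned community corresponds to one discovered category, such as the authentication or UI clusters mentioned above.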

4.2.2 Nash Equilibrium — Redundant Entity Removal

In codebases spanning multiple projects, similar functions (e.g., formatDate(), handleError()) exist in large quantities as duplicates. Pruning informed by the Nash equilibrium concept classifies representative entities as active (retained for search) and redundant duplicates as pruned (removed as search noise). This simultaneously achieves improved retrieval accuracy and reduced index size.
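The author's Nash-equilibrium formulation is not published in detail; as a rough illustration of the active/pruned outcome it produces, a greedy stand-in that keeps one highest-utility representative per near-duplicate group can be written as follows (`group_of` and `utility` are hypothetical inputs, e.g. community membership and reference counts):

```python
def prune_duplicates(entities, group_of, utility):
    """Greedy stand-in for Nash-style pruning: within each near-duplicate
    group, keep the highest-utility entity as 'active' and mark the rest
    'pruned' so they no longer add search noise."""
    best = {}
    for e in entities:
        g = group_of[e]
        if g not in best or utility[e] > utility[best[g]]:
            best[g] = e
    keep = set(best.values())
    return {e: ("active" if e in keep else "pruned") for e in entities}
```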

4.2.3 Stackelberg Hierarchical Structure — Multi-granularity Retrieval

The required information granularity varies by use case. Based on Stackelberg game theory, chunks are structured into three hierarchical levels.

Table 3: Stackelberg Hierarchical Structure
Level | Purpose | Example
Level 2 (Summary) | Overview / decision-making | "Auth module: 15 functions, OAuth+JWT"
Level 1 (Representative) | Pattern recommendation / design reference | "Representative JWT token verification pattern"
Level 0 (Detail) | Specific code reference | verifyJWT(token, secret) implementation

4.2.4 Percolation Theory — Automatic Category Boundary Determination

Boundaries between categories like "authentication" and "API" tend to be ambiguous when defined manually. By applying percolation theory, the similarity threshold θ is automatically determined, mathematically guaranteeing category cohesion. This prevents semantically unnatural category divisions.
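One common percolation-style criterion is to sweep candidate thresholds and locate the point where the similarity graph's giant component breaks apart. The sketch below (the ">half of all nodes" definition of a giant component is an illustrative assumption, not necessarily CPO's exact formulation) picks the largest θ at which the graph still percolates:

```python
def percolation_threshold(nodes, sims, candidates):
    """Return the largest candidate θ at which the similarity graph still
    has a giant connected component (> half of all nodes)."""
    def giant_fraction(theta):
        # Union-find over edges whose similarity is at least theta.
        parent = {n: n for n in nodes}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        for (a, b), s in sims.items():
            if s >= theta:
                parent[find(a)] = find(b)
        sizes = {}
        for n in nodes:
            r = find(n)
            sizes[r] = sizes.get(r, 0) + 1
        return max(sizes.values()) / len(nodes)

    viable = [t for t in candidates if giant_fraction(t) > 0.5]
    return max(viable) if viable else min(candidates)
```

Above the returned θ the graph fragments into small components, so that value marks the phase transition used as the category-boundary threshold.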

5. Benchmark Results

Table 4 shows the changes in key metrics before and after applying CPO graph optimization.

Table 4: Performance Comparison Before and After CPO
Metric | Conventional (Pre-CPO) | Post-CPO
Entity count | ~3,000 (flat structure) | ~1,500 active + summary nodes
Categorization | 5 manually defined categories (partially unimplemented) | Auto-discovered via Louvain
Search method | Flat query search only | Hierarchical search (overview → representative → detail)
Redundant entities | Duplicates across 12 projects | Removed via Nash equilibrium
Category boundaries | Manually defined (ambiguous) | Auto-determined via percolation theory

The approximately 50% reduction in entity count results from removing redundant chunks — a structural optimization that does not entail information loss. In fact, improved retrieval accuracy was confirmed through the removal of search noise. Additionally, the introduction of hierarchical search enabled information retrieval at appropriate granularity levels according to the use case.

Note: The mathematical principles of CPO apply not only to code chunks but also to knowledge entities (ontologies). For ontologies, where semantic relationships between entities are more explicit, community detection accuracy is expected to be even higher.

6. Novelty Analysis — Comparison with Prior Work

A comprehensive survey of existing papers and prior research was conducted for each component of CPO. This chapter reports the novelty assessment for each component and the distance from the closest existing research.

6.1 Novelty Assessment by Component

Table 5: Novelty Assessment of CPO Components
Component | Novelty | Prior Research Status
Congestion game for chunk pruning | ★★★ Fully novel | No precedent in IR/NLP. Closest: "Pruning as a Game" (arXiv 2512, 2025.12), but for NN weight pruning
Percolation theory for automatic θ determination | ★★★ Fully novel | No literature on using phase-transition points of embedding similarity graphs as thresholds
Stackelberg hierarchical optimization | ★★★ Fully novel | No precedent for a leader-follower formulation between detail and summary levels
Louvain for chunk clustering | ★★ Partially novel | GraphRAG (Microsoft) applies Leiden to knowledge graphs; application to embedding similarity graphs is novel
LLM-free RAPTOR alternative | ★★ Partially novel | Extractive summarization itself is classical; positioning as a RAPTOR alternative is novel
Integrated pipeline | ★★★ Fully novel | No system integrating all five elements exists in the literature

6.2 Distance from Closest Existing Research

The positional relationship between CPO and related existing research is shown below. "Game theory × Information Retrieval" is an emerging field where papers began appearing in late 2025, and this research is the first to apply it to chunk optimization.

CPO (ours)
|
+-- Distance: Far ------- GraphRAG (Microsoft, 2024)
|                         Leiden on KGs -> LLM summarization. Neither embedding graphs nor game theory
|
+-- Distance: Far ------- RAPTOR (Stanford, 2024)
|                         GMM + LLM summarization. No graphs, no game theory
|
+-- Distance: Moderate -- Pruning as a Game (arXiv 2512, 2025)
|                         Congestion game for NN weight pruning. Different domain (NN vs text)
|
+-- Distance: Moderate -- Game-Theoretic Vector Search (arXiv 2508, 2025)
                          Zero-sum game for latent space compression. Concerns vector dimensions, not chunks

6.3 Application Domains

While CPO is based on mathematically rigorous formulations, its applications are highly practical. Its mathematical guarantees make results explainable, which is a strong advantage during technology selection.

Internal Knowledge Bases

Integration and redundancy removal of knowledge spanning multiple projects. Most effective in environments managing knowledge across 12+ projects in a unified manner.

Large-scale Codebase RAG

Format-agnostic design enables application regardless of programming language or framework. Directly improves code search accuracy.

Document-intensive Environments

Redundancy removal without LLM costs. Contributes to improved search accuracy in workplaces with large accumulated document collections.

Edge / On-premise Environments

No LLM required, eliminating API charges. Processing pipelines can be built entirely on NPU/GPU alone.

7. Context Utilization Efficiency in Claude Code

The chunking optimization and CPO graph optimization described above directly contribute to efficient utilization of Claude Code's context window. This chapter describes efforts to improve Claude Code's practical performance through RAG systems.

7.1 Context Window Challenges

While Claude Code possesses powerful code generation and comprehension capabilities, its context window size has constraints. When working with large codebases, including all relevant code chunks in the context is not practical, making efficient selection and provision of relevant information the key to practical utility.

7.2 Context Efficiency via CPO

CPO graph optimization improves Claude Code's context utilization from the following three perspectives.

  1. Token savings through redundant chunk removal: Nash equilibrium-based pruning removes redundant entities from the index. This reduces noise in search results and decreases the token count injected into the context window. The reduction from approximately 3,000 to approximately 1,500 entities represents a substantial reduction in wasteful information in the context.
  2. Adaptive granularity selection through hierarchical search: The Stackelberg hierarchical structure enables provision of appropriately granular information to Claude Code based on task complexity. Selecting Level 2 (summary) for overview understanding and Level 0 (detail) for specific implementation improves context window utilization efficiency.
  3. Improved comprehension through semantically cohesive chunks: Optimal category segmentation via Louvain community detection and percolation theory improves the semantic coherence of chunks provided to Claude Code. This lets the LLM grasp relationships between chunks more accurately, which is expected to improve response quality.

7.3 Practical Effects

An important aspect of the CPO optimization pipeline is its ease of integration with existing systems. It can be applied simply by adding the collection-agnostic CPO graph optimizer as a downstream stage after existing chunking processes. The same pipeline is applicable not only to code chunk collections but also to ontology (knowledge entity) collections, enabling unified optimization of the entire system's knowledge management.
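The downstream-stage integration described above amounts to function composition over chunk collections. A minimal sketch (`chunker` and `optimizer` are hypothetical caller-supplied callables, e.g. any Section 2 method followed by a CPO-style optimizer):

```python
def build_index(documents, chunker, optimizer=None):
    """Chunk every document, then optionally pass the whole chunk
    collection through a downstream optimization stage. Because the
    optimizer only sees the chunk collection, it is agnostic to how
    the chunks were produced."""
    chunks = [c for doc in documents for c in chunker(doc)]
    return optimizer(chunks) if optimizer else chunks
```

The same call works unchanged whether `documents` yields code chunks or ontology entities, which is what enables unified optimization of the whole knowledge base.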

8. Discussion and Summary

8.1 Method Selection Guidelines

When selecting a chunking method, it is essential to choose an appropriate approach based on system requirements and constraints.

Prototyping / Validation

Recursive chunking is recommended. Low implementation cost with sufficient quality. Fixed-size with overlap is also a viable alternative.

High-accuracy Production

Semantic chunking, or a combination with document structure-based chunking, is recommended. Processing cost increases, but improved search quality is expected.

Structured Document-centric Systems

Document structure-based chunking should be the first choice, with recursive chunking as a fallback for documents with unclear structure.

High-speed Processing of Large Data

Fixed-size with overlap is recommended. Offers an excellent balance between processing speed and index quality.

8.2 Future Prospects

  • Adaptive chunking: An approach that dynamically switches chunk size and splitting methods based on document content and structure.
  • Hierarchical chunking: A method that generates chunks at multiple granularity levels (coarse and fine) and selects the appropriate granularity during retrieval.
  • Multimodal support: Extension of chunking methods to documents containing not only text but also tables, figures, and mathematical formulas.
  • LLM-powered chunking: A method that uses the LLM itself to estimate optimal chunk boundaries. High accuracy but with cost trade-offs.

8.3 Conclusion

This paper surveyed major text chunking methods, compared their characteristics, and reported findings on chunk size selection. Chunking is the "unsung hero" of RAG systems, and its optimization directly improves overall system quality. The author hopes this paper serves as a reference for practitioners formulating chunking strategies.

For inquiries about RAG system chunking optimization or CPO graph optimization implementation, please feel free to contact us.
