Abstract

In Retrieval-Augmented Generation (RAG) systems built on Large Language Models (LLMs), text chunking is a critical preprocessing step that determines retrieval accuracy and generation quality. This paper organizes and compares the characteristics of major chunking methods, and reports findings on how chunk size selection affects system performance. It also describes CPO (Community-based Pruning Optimization), a graph optimization approach developed by the author, including benchmark results and its application to improving Claude Code's context utilization efficiency.
1. Introduction
With the practical adoption of LLMs in recent years, RAG (Retrieval-Augmented Generation) architectures that retrieve and reference external knowledge while generating responses have become widely adopted. RAG performance heavily depends on retrieval accuracy, which is determined by the quality of text chunks stored in the index.
Chunking refers to the process of splitting lengthy documents into semantically appropriate units. The quality of this segmentation directly affects precision and recall during retrieval, and consequently the accuracy and comprehensiveness of LLM-generated responses. Despite this importance, systematic comparative studies of chunking methods remain insufficient, and practitioners often rely on rules of thumb for method selection.
This paper surveys representative chunking methods, compares their characteristics, and organizes findings related to chunk size optimization. The findings presented here are derived from the author's research and development activities; specific implementation details are omitted.
2. Overview and Comparison of Chunking Methods
Text chunking methods can be broadly classified into the following five categories based on their segmentation criteria. The overview and characteristics of each method are described below.
2.1 Fixed-size Chunking
The most basic method, which mechanically splits text at a predetermined character or token count. Its advantages are extremely simple implementation and fast processing speed. However, it has a fundamental drawback: splits can occur mid-sentence or mid-paragraph, easily compromising semantic coherence.
2.2 Fixed-size with Overlap
An improvement on fixed-size chunking that introduces overlap regions between adjacent chunks. The overlap preserves contextual information around chunk boundaries, tending to improve retrieval accuracy compared to plain fixed-size chunking. However, there is a trade-off in increased index size due to overlapping portions. Typically, overlap is set at 10-20% of the chunk size.
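Both fixed-size variants above can be sketched in a few lines. The following is a minimal illustration in which whitespace-separated words stand in for tokens; a production system would count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 256, overlap: int = 0) -> list[str]:
    """Split `text` into chunks of `chunk_size` tokens; `overlap` > 0 gives
    the overlapping variant. Whitespace splitting stands in for a real
    tokenizer, purely for illustration."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]
```

With the 10-20% guideline above, `chunk_size=256` would pair with an overlap of roughly 26-51 tokens.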
2.3 Semantic Chunking
A method that uses embedding models to calculate semantic similarity between sentences or paragraphs, splitting at semantic boundaries. Cosine similarity between embedding vectors of adjacent text fragments is computed, and points where similarity falls below a threshold are designated as chunk boundaries. Since semantically coherent chunks are generated, improved retrieval accuracy can be expected. However, the embedding computation cost significantly increases processing time compared to fixed-size methods.
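The boundary rule described above can be sketched as follows. `embed` is an assumed dependency (any function mapping a sentence to a 1-D numpy vector, such as a sentence-embedding model), and the 0.7 threshold is illustrative.

```python
import numpy as np

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Split at points where cosine similarity between adjacent sentences
    drops below `threshold`. `embed` and the threshold value are
    illustrative assumptions, not part of any specific implementation."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        sim = float(prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur)))
        if sim < threshold:  # semantic boundary: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```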
2.4 Recursive Chunking
A method that applies multiple delimiters in order of priority, splitting text in stages. It first attempts to split by paragraph breaks (double newlines); if resulting chunks exceed the target size, it then splits by sentence boundaries (periods), and if necessary, by word boundaries, recursively applying finer granularity. This balanced method respects the natural structure of documents while keeping chunk sizes within a target range.
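The staged splitting can be sketched as below, with word counts standing in for token counts. Note that real implementations typically also merge adjacent small pieces back up toward the target size, which this sketch omits.

```python
def recursive_chunks(text: str, max_tokens: int = 256,
                     seps: tuple = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first (paragraph breaks), falling
    back to sentence and then word boundaries only for pieces that still
    exceed `max_tokens` (approximated here by whitespace word count)."""
    if len(text.split()) <= max_tokens or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:  # separator absent; try the next, finer one
        return recursive_chunks(text, max_tokens, rest)
    out = []
    for p in pieces:
        out.extend(recursive_chunks(p, max_tokens, rest))
    return out
```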
2.5 Document Structure-based Chunking
In structured documents such as Markdown, HTML, or PDF, this method uses document structure elements like headings, sections, and lists as splitting criteria. Since segmentation follows the logical structure of the document, each chunk has high semantic completeness. However, it is difficult to apply when input documents lack clear structure (e.g., plain text), and it depends on the document format.
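For the Markdown case, structure-based splitting can be illustrated with a simple heading-driven splitter (a sketch; libraries that parse the full Markdown grammar handle nested structures more robustly):

```python
import re

def markdown_section_chunks(md: str) -> list[tuple[str, str]]:
    """Split a Markdown document at headings, yielding (heading, body)
    pairs. Text before the first heading gets an empty heading."""
    chunks, heading, body = [], "", []
    for line in md.splitlines():
        if re.match(r"^#{1,6}\s", line):  # ATX heading starts a new section
            if body or heading:
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    chunks.append((heading, "\n".join(body).strip()))
    return chunks
```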
2.6 Method Comparison
Table 1 presents a comparison of the above methods based on key evaluation criteria.
| Method | Semantic Coherence | Processing Speed | Ease of Implementation | Size Uniformity | Format Dependency |
|---|---|---|---|---|---|
| Fixed-size | Low | Very fast | Easy | High | None |
| Fixed-size with overlap | Slightly low | Fast | Easy | High | None |
| Semantic | High | Slow | Moderate | Low | None |
| Recursive | Medium-High | Fast | Moderate | Medium | Low |
| Document structure-based | High | Fast | Moderate-High | Low | High |
As the comparison table clearly shows, each method has inherent trade-offs, and no single universal method exists. Semantic chunking is advantageous when semantic coherence is prioritized, but increased processing costs are unavoidable. Recursive chunking is a practical choice when seeking a balance between processing speed and semantic quality.
3. Considerations on Optimal Chunk Size
No less important than the choice of chunking method is chunk size, a parameter that affects both retrieval accuracy and generation quality and therefore requires careful consideration.
3.1 Impact of Chunk Size
Chunk size configuration can be understood as a problem of balancing the following two competing requirements.
Small Chunks (~256 tokens)
- Finer retrieval granularity tends to improve Precision
- Suitable for extracting specific facts and numerical data
- Risk of insufficient context for LLM to generate accurate responses
- Increased index size and retrieval cost due to higher chunk count
Large Chunks (1024+ tokens)
- Rich contextual information tends to stabilize generation quality
- Fewer chunks result in more efficient indexing
- Irrelevant content mixed into a chunk dilutes its embedding, which can prevent the relevant chunk from being retrieved, decreasing Recall
- Pressure on LLM's context window
3.2 Experimental Findings
In the author's research and development, RAG system performance was evaluated across various chunk sizes on text data from multiple domains. The key findings are reported below.
- 256-512 tokens as a general recommended range: In many use cases, the 256-512 token range demonstrated the best balance between retrieval accuracy and generation quality. This range approximates the length of typical paragraphs, making it easier to maintain semantic completeness.
- Optimal values vary by domain characteristics: It was confirmed that optimal chunk size differs depending on the document domain, such as technical documents, legal documents, or general articles. In domains where specialized terminology definitions and statutory citations are frequent, slightly larger chunk sizes (512-768 tokens) were effective.
- Correlation with question granularity: It was observed that smaller chunks are advantageous for factoid-type questions (those asking for specific facts), while larger chunks are favorable for summary/explanatory questions.
- Effect of overlap: Setting overlap equivalent to 10-15% of the chunk size was confirmed to mitigate information loss at chunk boundaries. However, overlap exceeding 20% showed performance degradation due to increased redundancy.
Note: The above findings are results from the author's experimental environment. Optimal values may vary depending on the embedding model, LLM, evaluation dataset, and other factors. Benchmark evaluation using target domain data is recommended for production deployment.
3.3 Framework for Chunk Size Selection
- Set a baseline: Start with approximately 512 tokens as the baseline.
- Analyze use cases: Analyze the expected question granularity and document characteristics, adjusting size as needed.
- Apply overlap: Set overlap at 10-15% of the chunk size.
- Evaluate and iterate: Use representative query sets to evaluate retrieval accuracy and generation quality, fine-tuning parameters accordingly.
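The framework above can be condensed into a small helper that encodes the ranges reported in Section 3.2. The mappings below are heuristic starting points to tune, not fixed rules, and the domain and question-style labels are illustrative.

```python
def suggest_chunk_params(domain: str = "general", question_style: str = "mixed") -> dict:
    """Starting parameters following the selection framework: 512-token
    baseline, 512-768 tokens for citation-heavy domains, smaller chunks
    for factoid queries, and 10-15% overlap."""
    size = 512  # baseline
    if domain in ("legal", "technical"):
        size = 640  # middle of the 512-768 range for citation-heavy domains
    if question_style == "factoid":
        size = min(size, 384)  # factoid questions favor smaller chunks
    return {"chunk_size": size, "overlap": int(size * 0.125)}  # 12.5% overlap
```

The output then feeds the evaluate-and-iterate step: measure retrieval accuracy on a representative query set and adjust from these starting values.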
4. Graph Optimization Approach with CPO
The chunking methods described in previous chapters all focus on text "segmentation." However, simply storing chunked content in a search index leaves issues of redundancy and retrieval noise unresolved. This chapter introduces CPO (Community-based Pruning Optimization), a graph optimization approach developed by the author. CPO applies mathematically principled optimization to post-chunking entity groups, achieving substantial improvements in search quality and token efficiency.
4.1 Limitations of Conventional Methods
Conventional regex-based syntactic extraction for entity management has the following fundamental limitations.
| Process | Method | Limitation |
|---|---|---|
| Entity extraction | Regex pattern matching | Misses semantic relationships |
| Classification | File path keyword matching | Rigid classification via fixed rules |
| Confidence scoring | Docstring presence / parameter count | Ignores usage frequency and references |
| Categorization | Manually defined fixed categories | Divergence from actual code semantics |
In other words, conventional methods can extract "what exists" but struggle to grasp "what is important," "what is similar," and "what forms a group" — the semantic relationships between entities.
4.2 The Four Mathematical Principles of CPO
CPO achieves structural optimization of chunk/entity groups by combining the following four mathematical principles.
4.2.1 Louvain Community Detection — Automatic Category Discovery
A similarity graph is constructed from chunk embedding vectors, and Louvain algorithm community detection is applied. This makes it possible to "discover" categories rather than "define" them manually. For example, semantically cohesive groups such as authentication function clusters, UI component clusters, and data transformation clusters are automatically extracted.
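The graph construction and detection step can be sketched as follows, assuming the third-party `networkx` library is available. The fixed edge threshold `theta` is for illustration only; as described in Section 4.2.4, CPO determines it automatically via percolation analysis.

```python
import numpy as np
import networkx as nx

def louvain_categories(vectors: np.ndarray, theta: float = 0.5) -> list[set]:
    """Build a cosine-similarity graph over chunk embeddings and apply
    Louvain community detection, so categories are discovered rather
    than manually defined."""
    norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = norms @ norms.T
    g = nx.Graph()
    g.add_nodes_from(range(len(vectors)))
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if sims[i, j] >= theta:  # keep only sufficiently similar pairs
                g.add_edge(i, j, weight=float(sims[i, j]))
    return nx.community.louvain_communities(g, weight="weight", seed=0)
```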
4.2.2 Nash Equilibrium — Redundant Entity Removal
In codebases spanning multiple projects, similar functions (e.g., formatDate(), handleError()) exist in large quantities as duplicates. Pruning informed by the Nash equilibrium concept classifies representative entities as active (retained for search) and redundant duplicates as pruned (removed as search noise). This simultaneously achieves improved retrieval accuracy and reduced index size.
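The full equilibrium formulation is not given here, but the resulting active/pruned split can be conveyed with a simplified greedy stand-in: within a community, an entity is pruned when it is nearly identical to one already retained. The 0.95 duplicate threshold and the greedy order are illustrative assumptions.

```python
import numpy as np

def prune_duplicates(vectors: np.ndarray, community: list[int],
                     dup_threshold: float = 0.95) -> tuple[list[int], list[int]]:
    """Greedy stand-in for equilibrium-based pruning: mark an entity
    'pruned' when its cosine similarity to an already-retained entity
    reaches `dup_threshold`; otherwise keep it 'active'."""
    norms = {i: vectors[i] / np.linalg.norm(vectors[i]) for i in community}
    active, pruned = [], []
    for i in sorted(community):
        if any(float(norms[i] @ norms[k]) >= dup_threshold for k in active):
            pruned.append(i)  # near-duplicate of a retained entity
        else:
            active.append(i)
    return active, pruned
```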
4.2.3 Stackelberg Hierarchical Structure — Multi-granularity Retrieval
The required information granularity varies by use case. Based on Stackelberg game theory, chunks are structured into three hierarchical levels.
| Level | Purpose | Example |
|---|---|---|
| Level 2 (Summary) | Overview / decision-making | "Auth module: 15 functions, OAuth+JWT" |
| Level 1 (Representative) | Pattern recommendation / design reference | "Representative JWT token verification pattern" |
| Level 0 (Detail) | Specific code reference | verifyJWT(token, secret) implementation |
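The three levels can be represented as a small tree, with retrieval drilling down from summary to detail. The structure below is a minimal sketch; the field names are illustrative, not from the source.

```python
from dataclasses import dataclass, field

@dataclass
class HierNode:
    level: int  # 2 = summary, 1 = representative, 0 = detail
    text: str
    children: list = field(default_factory=list)

def retrieve(node: HierNode, want_level: int) -> list[str]:
    """Walk down from a summary node and return every text at the
    requested granularity level."""
    if node.level == want_level:
        return [node.text]
    return [t for child in node.children for t in retrieve(child, want_level)]
```

An overview query would request level 2 and receive only the summary; a code-reference query would request level 0 and receive the underlying implementations.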
4.2.4 Percolation Theory — Automatic Category Boundary Determination
Boundaries between categories like "authentication" and "API" tend to be ambiguous when defined manually. By applying percolation theory, the similarity threshold θ is automatically determined, mathematically guaranteeing category cohesion. This prevents semantically unnatural category divisions.
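One simple way to realize this idea is to sweep the threshold upward and detect where the similarity graph's giant component breaks apart. The sketch below uses that criterion as a proxy for the percolation transition point; the exact criterion used in CPO is not specified here.

```python
import numpy as np

def percolation_theta(vectors: np.ndarray, thetas=None) -> float:
    """Return the first threshold at which the largest connected component
    of the similarity graph no longer spans all nodes -- an illustrative
    proxy for the percolation phase-transition point."""
    if thetas is None:
        thetas = np.linspace(0.30, 0.95, 14)
    n = len(vectors)
    norms = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = norms @ norms.T
    for theta in thetas:
        parent = list(range(n))  # union-find over similarity edges
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path compression
                x = parent[x]
            return x
        for i in range(n):
            for j in range(i + 1, n):
                if sims[i, j] >= theta:
                    parent[find(i)] = find(j)
        roots = [find(i) for i in range(n)]
        if max(roots.count(r) for r in set(roots)) < n:
            return float(theta)
    return float(thetas[-1])
```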
5. Benchmark Results
Table 4 shows the changes in key metrics before and after applying CPO graph optimization.
| Metric | Conventional (Pre-CPO) | Post-CPO |
|---|---|---|
| Entity count | ~3,000 (flat structure) | ~1,500 active + summary nodes |
| Categorization | 5 manually defined categories (partially unimplemented) | Auto-discovered via Louvain |
| Search method | Flat query search only | Hierarchical search (overview → representative → detail) |
| Redundant entities | Duplicates across 12 projects | Removed via Nash equilibrium |
| Category boundaries | Manually defined (ambiguous) | Auto-determined via percolation theory |
The approximately 50% reduction in entity count results from removing redundant chunks — a structural optimization that does not entail information loss. In fact, improved retrieval accuracy was confirmed through the removal of search noise. Additionally, the introduction of hierarchical search enabled information retrieval at appropriate granularity levels according to the use case.
Note: The mathematical principles of CPO are applicable not only to code chunks but also to knowledge entities (ontology) using the same principles. For ontologies, where semantic relationships between entities are more explicit, community detection accuracy is expected to be even higher.
6. Novelty Analysis — Comparison with Prior Work
A survey of prior research was conducted for each component of CPO. This chapter reports the novelty assessment for each component and its distance from the closest existing work.
6.1 Novelty Assessment by Component
| Component | Novelty | Prior Research Status |
|---|---|---|
| Congestion game for chunk pruning | ★★★ Fully novel | No precedent in IR/NLP. Closest: "Pruning as a Game" (arXiv 2512, 2025.12) but for NN weight pruning |
| Percolation theory for auto θ determination | ★★★ Fully novel | No literature on using phase transition points of embedding similarity graphs as thresholds |
| Stackelberg for hierarchical optimization | ★★★ Fully novel | No precedent for leader-follower formulation between detail/summary |
| Louvain for chunk clustering | ★★☆ Partially novel | GraphRAG (MS) applies Leiden to KGs. Application to embedding similarity graphs is novel |
| LLM-free RAPTOR alternative | ★★☆ Partially novel | Extractive summarization itself is classical. Positioning as RAPTOR alternative is novel |
| Integrated pipeline | ★★★ Fully novel | No system integrating all 5 elements exists in literature |
6.2 Distance from Closest Existing Research
CPO occupies a distinct position relative to the related research surveyed above. "Game theory × Information Retrieval" is an emerging field in which papers began appearing in late 2025, and this research is the first to apply it to chunk optimization.
6.3 Application Domains
While CPO is based on mathematically rigorous formulations, its applications are highly practical. Because its results can be explained in terms of mathematical guarantees, it also offers a strong advantage during technology selection.
Internal Knowledge Bases
Integration and redundancy removal of knowledge spanning multiple projects. Most effective in environments managing knowledge across 12+ projects in a unified manner.
Large-scale Codebase RAG
Format-agnostic design enables application regardless of programming language or framework. Directly improves code search accuracy.
Document-intensive Environments
Redundancy removal without LLM costs. Contributes to improved search accuracy in workplaces with large accumulated document collections.
Edge / On-premise Environments
No LLM required, eliminating API charges. Processing pipelines can be built entirely on NPU/GPU alone.
7. Context Utilization Efficiency in Claude Code
The chunking optimization and CPO graph optimization described above directly contribute to efficient utilization of Claude Code's context window. This chapter describes efforts to improve Claude Code's practical performance through RAG systems.
7.1 Context Window Challenges
While Claude Code possesses powerful code generation and comprehension capabilities, its context window size has constraints. When working with large codebases, including all relevant code chunks in the context is not practical, making efficient selection and provision of relevant information the key to practical utility.
7.2 Context Efficiency via CPO
CPO graph optimization improves Claude Code's context utilization from the following three perspectives.
- Token savings through redundant chunk removal: Nash equilibrium-based pruning removes redundant entities from the index. This reduces noise in search results and decreases the token count injected into the context window. The reduction from approximately 3,000 to approximately 1,500 entities represents a substantial reduction in wasteful information in the context.
- Adaptive granularity selection through hierarchical search: The Stackelberg hierarchical structure enables provision of appropriately granular information to Claude Code based on task complexity. Selecting Level 2 (summary) for overview understanding and Level 0 (detail) for specific implementation improves context window utilization efficiency.
- Improved comprehension through semantically cohesive chunks: Category segmentation via Louvain community detection and percolation theory improves the semantic coherence of the chunks provided to Claude Code. This helps the LLM grasp inter-chunk relationships more accurately, which is expected to improve response quality.
7.3 Practical Effects
An important aspect of the CPO optimization pipeline is its ease of integration with existing systems. It can be applied simply by adding the collection-agnostic CPO graph optimizer as a downstream stage after existing chunking processes. The same pipeline is applicable not only to code chunk collections but also to ontology (knowledge entity) collections, enabling unified optimization of the entire system's knowledge management.
8. Discussion and Summary
8.1 Method Selection Guidelines
When selecting a chunking method, it is essential to choose an appropriate approach based on system requirements and constraints.
Prototyping / Validation
Recursive chunking is recommended. Low implementation cost with sufficient quality. Fixed-size with overlap is also a viable alternative.
High-accuracy Production
Semantic chunking, or a combination with document structure-based chunking, is recommended. Processing cost increases, but improved search quality is expected.
Structured Document-centric Systems
Document structure-based chunking should be the first choice, with recursive chunking as a fallback for documents with unclear structure.
High-speed Processing of Large Data
Fixed-size with overlap is recommended. Offers an excellent balance between processing speed and index quality.
8.2 Future Prospects
- Adaptive chunking: An approach that dynamically switches chunk size and splitting methods based on document content and structure.
- Hierarchical chunking: A method that generates chunks at multiple granularity levels (coarse and fine) and selects the appropriate granularity during retrieval.
- Multimodal support: Extension of chunking methods to documents containing not only text but also tables, figures, and mathematical formulas.
- LLM-powered chunking: A method that uses the LLM itself to estimate optimal chunk boundaries. High accuracy but with cost trade-offs.
8.3 Conclusion
This paper surveyed major text chunking methods, compared their characteristics, and reported findings on chunk size selection. Chunking is the "unsung hero" of RAG systems, and its optimization directly improves overall system quality. The author hopes this paper serves as a reference for practitioners formulating chunking strategies.
For inquiries about RAG system chunking optimization or CPO graph optimization implementation, please feel free to contact us.