Where to Buy
Amazon USAmazon UK
Amazon DE
Amazon IT
Amazon ES
Amazon PL
Amazon CA
Amazon AU
Amazon India
Apple Books (EPUB)
Read Sample (PDF)
Pay for the PDF and I'll send it to you by e-mail: Pay via Paypal and get PDF for $9.99 FOR CARD PAYMENTS, PLEASE SEND ME A MESSAGE AS WELL. PAYPAL MAY NOTIFY ME OF THE PAYMENT WITH SOME DELAY.
Inside Apache Solr and Lucene: Algorithms and Engineering Deep Dive
Rauf Aliev
How can you navigate the complex trade-offs between speed, memory consumption, and disk I/O when handling terabyte-scale data and thousands of concurrent users? This book dives deep into the core of Apache Solr and Lucene, offering answers from a system engineer's perspective. It explores the architectural decisions, data structures, and algorithms that enable these world-class search platforms to deliver exceptional performance and scalability, providing a blueprint for designing high-performance systems.
The insights in this book extend beyond the Solr and Lucene ecosystem. By using these platforms as a masterclass in pragmatic engineering, it offers valuable lessons for building any complex, data-intensive application. Their open-source codebases are a treasure trove of battle-tested solutions to universal challenges in concurrency, data partitioning, and distributed coordination. This book provides a curated tour of that treasure, distilling years of development and thousands of lines of code into core principles and patterns. It offers a unique opportunity to learn from the architectural choices of systems designed for immense scale and load, delivering invaluable lessons for system architects and engineers tasked with building resilient, high-performance software.
Table of Contents
- 1. Introduction to Solr and Lucene Internals
-
2. The Index
- 2.1. Structure of the Inverted Index
- 2.2. Indexing Beyond Text
- 2.3. Compression techniques: variable-byte encoding, frame-of-reference, delta encoding
- 2.4. Segment immutability and merging
- 2.5. On-Disk Formats
-
3. Indexing Pipeline: From Documents to Index
- 3.1. Document Ingestion
- 3.2. Algorithms for Term Extraction and Field Indexing
- 3.3. Engineering optimizations
- 3.4. Handling updates and deletes
-
4. Query Parsing and Execution
- 4.1. Query parser architecture: from user input to Lucene query objects
- 4.2. Boolean Query Processing
- 4.3. Intersection Algorithms
- 4.4. Disjunction (OR) and Exclusion (NOT): Union and Difference Operations
-
5. Relevance Scoring and Ranking
-
5.1. Scoring Models
- 5.1.1. TF-IDF (ClassicSimilarity)
- 5.1.2. BM25 (DefaultSimilarity since Lucene 6.0)
- 5.1.3. Custom Scoring Implementations
- 5.1.4. Augmenting Scores with FeatureField
- 5.1.5. Advanced Ranking with Learning to Rank (LTR)
- 5.1.6. A Glimpse into the Models: Pointwise, Pairwise, and Listwise
- 5.1.7. Integration: The Re-ranking Query
- 5.2. Engineering Details
- 5.3. Field Weighting and Boosting
- 5.4. Scoring in Distributed Systems: Challenges of Consistent Scoring Across Shards
-
5.1. Scoring Models
-
6. Pagination and Result Retrieval
- 6.1. Standard Pagination: Top-K Collection with Priority Queues (Min-Heap)
- 6.2. Deep Paging Challenges: Performance Bottlenecks and Memory Usage
- 6.3. Cursor-Based Pagination
- 6.4. Engineering Trade-Offs
- 6.5. Highlighting and Query-Biased Summarization
-
7. Faceting and Aggregations
- 7.1. Facet Computation: Algorithms for Counting and Grouping
- 7.2. Field-Based vs. Query-Based Faceting: Implementation Differences
- 7.3. Distributed Faceting: Merging Facet Counts Across Shards
- 7.4. Performance Optimizations: Caching, Pre-Computed Facets, and Doc Values
-
8. SolrCloud: Distributed Search Engineering
- 8.1. Sharding and Replication: Data Partitioning and Fault Tolerance
- 8.2. ZooKeeper’s Role: Coordination, Configuration, and Leader Election
- 8.3. Distributed Query Execution: Scatter-Gather, Coordinator Overhead, and Load Balancing
- 8.4. Consistency vs. Performance: Trade-Offs in Replication Strategies
-
9. Performance Optimizations and Caching
- 9.1. Query Caching: Filter Cache, Query Result Cache, and Document Cache
- 9.2. Index-Time Optimizations: Doc Values, Stored Fields, and Norms
- 9.3. JVM Tuning: Garbage Collection, Heap Management, and Memory-Mapped I/O
- 9.4. Hardware Considerations: SSDs vs. HDDs, CPU Vectorization (SIMD)