Does Cassandra Have Secondary Index?

Yes, Cassandra does have secondary indexes. These indexes provide a powerful mechanism to query data efficiently using columns that are not part of a table's primary key.

Understanding Cassandra Secondary Indexes

Secondary indexes in Apache Cassandra are a crucial feature for enabling flexible data retrieval. Their primary purpose is to allow users to filter a table for data stored in non-primary key columns. Without a secondary index, queries against non-primary key columns require a full table scan, which is highly inefficient and often not allowed in production Cassandra clusters for performance reasons.

Cassandra's architecture is built for distributed operations, and this applies to how it handles secondary indexes as well. When you create a secondary index on a column, Cassandra internally creates a hidden, distributed index table for that column. This index is stored and distributed across the cluster nodes, similar to how regular data is stored. Each node maintains index entries only for the data it owns, ensuring that index lookups are also distributed and scalable.

How Cassandra Stores and Distributes Indexes

When a secondary index is created and data is written or updated, Cassandra automatically manages the index entries. Here’s a brief overview:

Local Indexing: Each Cassandra node is responsible for indexing the data it stores. This means an index for a specific row is created and maintained on the same node that holds the primary copy of that row.
Distributed Nature: Because data is partitioned and distributed across multiple nodes, the secondary index itself is also distributed. When a query uses a secondary index, Cassandra coordinates the lookup across the relevant nodes to retrieve the desired data.
Asynchronous Updates: Index updates generally happen asynchronously with data writes, ensuring that write operations remain fast. However, reads might need to wait for eventual consistency of the index.

When to Use Cassandra Secondary Indexes

Secondary indexes are beneficial in specific scenarios where you need to query data based on attributes other than your primary key.

Ad-Hoc Queries: Useful for flexible querying when the exact access patterns aren't known beforehand or when your application needs to support various filtering options.
Low Cardinality Columns: Columns with a limited number of unique values (e.g., status - active, inactive, pending). Indexing these can be efficient for finding all rows with a particular status.
Specific Lookup Patterns: When you frequently need to retrieve data where a non-primary key column has a precise value (e.g., WHERE email = '[email protected]').

Example Use Cases:

Finding all users from a specific city.
Retrieving all products with a category of 'Electronics'.
Filtering orders by their status ('shipped', 'delivered', 'pending').

Important Considerations for Secondary Indexes

While powerful, secondary indexes are not a silver bullet and come with performance implications. Careful consideration is crucial to avoid performance bottlenecks.

Consideration	Description
High Cardinality	Indexing columns with many unique values (e.g., timestamps, long text descriptions) can lead to performance issues during reads, as queries might need to hit many nodes. Queries on high-cardinality indexed columns can be inefficient due to distributed lookups.
Write Performance	Every write to an indexed column also incurs an additional write to the corresponding index. While generally fast, a high volume of writes across many indexed columns can add overhead.
Query Performance	Queries using secondary indexes can still be slower than queries using the primary key because they often involve coordinating requests across multiple nodes. Avoid queries that return very large result sets.
Equality Checks	Secondary indexes are most effective for equality (`=`) checks. They generally do not perform well for range queries (`>`, `<`, `LIKE`) unless combined with other primary key components.
Index Size	Each index consumes disk space. Excessive indexing, especially on high-cardinality columns, can lead to significant storage overhead.

Practical Example

Let's illustrate how to create and use a secondary index in Cassandra.

Suppose you have a users table:

CREATE TABLE users (
    user_id UUID PRIMARY KEY,
    username TEXT,
    email TEXT,
    city TEXT,
    created_date TIMESTAMP
);

You frequently need to find users by their city. Without an index, this query would be inefficient:

SELECT * FROM users WHERE city = 'New York'; // This query will likely fail or be very slow without an index

To enable efficient queries on the city column, you can create a secondary index:

CREATE INDEX ON users (city);

Once the index is built (which happens automatically in the background), you can now efficiently query users by city:

SELECT * FROM users WHERE city = 'New York';

This query will now leverage the secondary index, allowing Cassandra to quickly locate the relevant data across the cluster.