07-30-2023, 12:42 PM
We store events in multiple tables depending on category.
Each event have an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate at max in 7 events.
Hence the partition will hold max 7 rows.
We will have 30-50 BILLIONS of rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark.
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
- TTL of X years (performs well, but TTL needs to be known before hand, 8 extra bytes for each column)
- NO delete - simply ignore the problem (somebody else's problem :0)
- Rate limited single row delete (do complete table scan and potentially billions of delete statements)
- Split the table to multiple tables -> "CREATE TABLE eventlookup**YYYY**". Once a year is not needed, simply drop it. (Problem is every read should potentially query all tables)
Is there any other approaches we can consider ?
Is there a design decision we can make now ( we are not in production yet) that will mitigate the future problem?
Each event have an id but contains multiple subelements.
We have a lookup table to find events using the subelement_id.
Each subelement can participate at max in 7 events.
Hence the partition will hold max 7 rows.
We will have 30-50 BILLIONS of rows in eventlookup over a period of 5 years.
CREATE TABLE eventlookup (
subelement_id text,
recordtime timeuuid,
event_id text,
PRIMARY KEY ((subelement_id), recordtime)
)
Problem: How do we delete old data once we reach the 5 (or some other number) year mark.
We want to purge the "tail" at some specific intervals, say every week or month.
Approaches investigated so far:
- TTL of X years (performs well, but TTL needs to be known before hand, 8 extra bytes for each column)
- NO delete - simply ignore the problem (somebody else's problem :0)
- Rate limited single row delete (do complete table scan and potentially billions of delete statements)
- Split the table to multiple tables -> "CREATE TABLE eventlookup**YYYY**". Once a year is not needed, simply drop it. (Problem is every read should potentially query all tables)
Is there any other approaches we can consider ?
Is there a design decision we can make now ( we are not in production yet) that will mitigate the future problem?