ScyllaDB implementation: Lists in Medium’s feature store, Part 2

2 days ago 7

Learn how we’re using ScyllaDB to build a fast and scalable data layer for lists in our feature store. Photo by Mathew Schwartz on Unsplashtl;dr This is the second installment of our series on the new list feature type in Medium’s feature store, which describes how we implemented the lists data layer using ScyllaDB. See Part 1: Laying the foundations, the previous installment, for an introduction to the feature store and the list feature type. We’ll first introduce ScyllaDB, a database designed for high performance and scalability, and go over the ScyllaDB concepts needed to understand the rest of this article: clusters, nodes, tables, rows, columns, partitions, secondary indexes, and time-to-live. Then we’ll describe the ScyllaDB data model we’ve defined for lists in order to support the significant load that will come from list operations. This data model is based on a single table storing all items for all the lists managed by the feature store. The primary key of this table is designed so that most list operations can rely on a single, efficient ScyllaDB query. Finally we’ll go over each identified list operation and detail the ScyllaDB query/queries implementing it, and describe how these queries are efficient thanks to our data model. Let’s get started! About ScyllaDB ScyllaDB is a database designed for use cases that require high throughput, low latency, and high scalability, which makes it well-suited for use as the data layer of Medium’s feature store. In this section we’ll cover a few ScyllaDB concepts that you must be aware of in order to understand the rest of this article. Please refer to the following documentations to learn more: ScyllaDB User Guide covers a wide range of topics, including data modeling, queries, features, etc. ScyllaDB Architecture goes deep into how ScyllaDB achieves its performance goals. Fundamental concepts A cluster is the highest-level structure of a ScyllaDB database. It is a collection of interconnected nodes, each of which is a physical or virtual server storing and managing portions of the cluster’s data in order to provide high availability, scalability, performance, and fault tolerance. A table is a two-dimensional data structure comprised of rows and columns. On the surface this is similar to tables in relational SQL databases like MySQL or PostgreSQL. One important difference, however, is that with ScyllaDB you must know your access patterns upfront in order to define a data model that’s suited for these access patterns, as ScyllaDB doesn’t support ad-hoc queries in the same way as relational databases. In particular, defining the right primary key (see below) is central to enabling efficient queries on a table. A row represents a single record in a table. A row is a collection of one or more columns storing data related to the record. For instance, one row of the orders table could represent an order made by the “Alice” user on 2024–09–03 while another could represent an order made by the “Bob” user on 2024–08–30. Each row in a table is uniquely identified by its primary key (see below), which is composed of one or more of the row’s columns. A column stores a single piece of data in a row. Each column in a table has an associated name and data type (BOOLEAN, INT, TEXT, etc.; full list here) specified when the table was created. The data type of a given column is the same for all the rows with a value for that column. For instance, the orders table might have the following columns: A user_id column of type TEXT. A date column of type TIMESTAMP. A value column of type FLOAT. (The table likely would have some more columns, like the product that was ordered, for instance. We’ve omitted them here for readability.) The row for an order made by the “Alice” user on Sep. 3, 2024 might then have the following column values: user_id:"Alice", date:2024-09-03, and value:43.21. The following schemas illustrate what we’ve covered so far: The fundamental ScyllaDB concepts and their relationshipsSample rows and columns for the ‘orders’ tablePartitions and primary/partition/clustering keys A table is divided into multiple partitions, where each partition is mapped to a specific node of the cluster via consistent hashing. Each row of the table is stored in one of the table’s partitions, according to its partition key (see below). This allows a ScyllaDB table to store large quantities of data (think terabytes) in a scalable and efficient manner. One of the goals when defining a ScyllaDB data model is to spread your data as evenly as possible across the partitions of the table, so that requests to the table are balanced across the nodes of the cluster. The partitions of a table are spread across the nodes of the clusterThe primary key of a row is used to uniquely identify the row within its table. It is composed of two parts: The partition key is used to determine the partition the row belongs to. It is mandatory and composed of one or more columns. All rows in a given partition share the same partition key value. For instance, the partition key of the orders table might be composed of the user_id column; this way all orders made by a given user are stored in the same partition, and we can expect that user_id values are diverse enough to allow data to be spread evenly across partitions. The clustering key is used to sort the rows within their partition. It is optional and composed of zero (when absent), one, or more columns. For instance, the clustering key of the orders table might be composed of the date column, so that we can efficiently retrieve the orders of a given user between two dates. A sample partition for the ‘orders’ tableSecondary indexes By default, ScyllaDB rejects queries based on a column that isn’t part of the table’s primary key, as these queries don’t have predictable performance (worst case is that they would end up scanning the whole table). A secondary index is an additional data structure attached to a table that enables efficient querying on a non-primary key column of that table. Once a secondary index has been created, ScyllaDB transparently updates this index each time a row with a value for the indexed column is inserted into, updated, or deleted from the table. ScyllaDB supports two types of secondary indexes: A global secondary index (GSI) is managed independently of its base table. Its data is distributed across potentially different nodes from the ones storing the table’s data. GSIs are used to query data independently from the table’s partition key. A local secondary index (LSI) is an index sharing the same partition key as its table, which results in the index data being stored in the same nodes as its table. An LSI is more efficient than a GSI for queries relying on the table’s partition key in addition to a non-primary key column, as such queries can be executed by a single node. For instance, a GSI on the orders table’s value column would allow queries like “fetch all orders with a given value” to be executed efficiently. And an LSI on the table’s partition key plus the value column would be appropriate for queries like “fetch all orders for a given user with a given value”. A sample secondary index on the ‘value’ column of the ‘orders’ tableTime-to-live (TTL) In any given row, each column that is not part of the table’s primary key can have an associated time-to-live (TTL), defined as the time, in seconds, until the column value expires. The TTL of a column value can be set when the value is inserted or updated. Once the TTL of a column value reaches zero, ScyllaDB automatically deletes that value from the row. If all non-primary key column values in a row have been deleted via their TTL, then ScyllaDB also automatically deletes the row itself. TTLs are especially useful when dealing with huge amounts of data in a table: by setting a TTL on each row, we help reduce the size (and thus the price) of the storage used by the table, which might otherwise grow out of control over the long run with data that we don’t actually care about. ScyllaDB data model for lists Now that we have a better understanding of ScyllaDB, let’s dive into the data model we’re using for lists. This section builds up on the notions defined in Part 1: Laying the foundations; please refer to it if you need a refresher on these. Goals and requirements The goal here is to define a ScyllaDB data model with which to store list items and that makes it possible to implement list operations in an efficient and scalable way. In particular, the data model must be able to support the significant load that will come from the Get List Items and Add List Items operations; as we’ve seen in part 1, we expect 100k to 1 million items to be fetched each second for Get List Items and 10k to 100k items to be stored each second for Add List Items. The analysis of our use cases for lists also identified the following requirement: the data model must allow a given list to hold multiple items with the same timestamp as long as these items all have distinct values. This is needed for instance by the story_presented list feature for the user entity, which lists the IDs of the stories that were presented (via a story preview for instance) to a given user. There are various UI surfaces throughout Medium presenting multiple stories at once (the homepage for instance); clients can thus report multiple story presentations with the same timestamp. The ‘list_items’ table Our ScyllaDB data model for lists relies on the following list_items table: CREATE TABLE list_items ( feature_key TEXT, entity_id TEXT, item_key TEXT, value BLOB, PRIMARY KEY ((feature_key, entity_id), item_key))WITH CLUSTERING ORDER BY (item_key DESC) AND DEFAULT_TIME_TO_LIVE = $defaultTTL; This table stores all the list items for all the list features managed by the feature store. It has the following columns: feature_key: The key identifying the list feature, resulting from the concatenation of the list’s entity type, feature name, and feature version: "$entityType#$featureName|$featureVersion". entity_id: The ID of the entity the list belongs to. item_key: The key identifying the item in the list. See below for details. value: The bytes representation of the value of the list item. The meaning of these bytes is opaque to the data model. The partition key of the table is composed of the feature_key and entity_id columns, so that all the items of a given list are stored in the same ScyllaDB partition. The item_key column is used as the table’s clustering key, which, along with the CLUSTERING ORDER option, enables efficient retrieval of list items in reverse-chronological order. The structure of the ‘list_items’ tableThe ‘item_key’ column The item_key column of the list_items table holds the key identifying the item in the list it belongs to. This key is a concatenated string comprised of the following parts: The first part of the key is the string representation, in base 10, of the item’s timestamp as a number of nanoseconds since the Unix epoch. This string is left-padded with zeroes if needed in order to reach 19 characters. (19 digits for Unix timestamps in nanoseconds cover up to November 11, 2286, which should hopefully be enough for our use cases). The second part is the one-character separator ‘#’. The third and last part is the MD5 hash of the item’s value, encoded into Base64. For instance, the following item_key value: 1724949845430000000#cUEirgoj2hOWvLLGgJ8hrQ== identifies a list item with a timestamp set to August 29, 2024 at 16:44:05 UTC and a value which MD5 hash is cUEirgoj2hOWvLLGgJ8hrQ== when encoded into Base64. The format of item_key values is designed so that items in a list are sorted following the order of their timestamp (thanks to the clustering key of the list_items table). The Get List Items operation (see below) relies on this to discard items with a timestamp older than its minTimestamp parameter; this is achieved by adding a WHERE item_key >= $minTimestampStr clause, where $minTimestampStr is the string representation, as defined above, of minTimestamp. As the item_key column is part of the table’s primary key, this format also also allows multiple items to have the same timestamp as long as the items all have distinct values — a key requirement for our system. The first part of the item_key column for these items will all have the same value (their timestamp); their last part (the MD5 hash of their value) will always be distinct. In the pseudo-code samples below we’ll use the following functions related to the item_key column: buildItemKeyTimestamp takes an item timestamp on input and returns on output the string representation of this timestamp for item_key values, as defined above. buildItemKey takes an item on input and returns on output the item_key value for this item, also as defined above. The ‘list_items_by_value’ local secondary index Our ScyllaDB data model also relies on a list_items_by_value local secondary index (LSI) on the list_items table: CREATE INDEX list_items_by_valueON list_items((feature_key, entity_id), value); This index on the value column is used to implement the Remove List Items with Value operation (see details on this operation below). Using a local secondary index rather than a global one ensures that the data for that index is stored on the same node as the data for the base table, which is expected to be faster than relying on a global secondary index, as noted in the ScyllaDB documentation. Handling TTLs for list items Each row of the list_items table has an associated time-to-live (TTL) value set to the duration between the expiration date of the associated item and the time at which the row was inserted into the table. An item’s expiration date is equal to the timestamp of the item shifted by the TTL of its list feature. ScyllaDB will automatically delete the rows with an expired TTL, thus ensuring that the size of the list_items table does not grow out of control. How TTL helps us control storage usage during and after the lifetime of a list featureThe list_items table is also configured with a default TTL value (represented by $defaultTTL in the CREATE statement above). This default TTL acts as a safeguard preventing us from inadvertently inserting rows without a TTL (when running backfill operations for instance), which might then end up living forever in the table. Side notes on the ScyllaDB data model We collaborated with the team at ScyllaDB to define this data model. One of their solutions architects reviewed the first draft of the list_items table definition and provided useful feedback on the following topics: There is no advantage to using short column names with ScyllaDB. As we’ll see in part 3 of this series, this is different with DynamoDB. Using a single feature_key column concatenating the entity type, feature name, and feature version is expected to be slightly faster than using distinct entity_type, feature_name, and feature_version columns. The list_items_by_value local secondary index (LSI) will be more efficient than a plain scan of the whole table as long as we’re fetching less than 60% of the table rows via the LSI. If we go past this 60% threshold then a scan might be more appropriate. Our data model uses both automatic row deletion via the time-to-live (TTL) feature and explicit row deletion via DELETE queries. In some circumstances this can lead to bad performance. ScyllaDB offered to work together with us to tune table compaction once we had enough production data to work with. Implementing list operations with ScyllaDB Let’s now review how each list operation is implemented on top of the ScyllaDB data model for lists we’ve just defined. See Part 1: Laying the foundations for more details on each of these operations and on their expected call rates. Metadata-related operations The Create List Feature and Create List Feature Version operations are fully handled by the existing metadata layer of the feature store; they are independent from the lists data model, so we won’t talk about them here. The Delete List Feature and Delete List Feature Version operations also primarily concern the metadata layer of the feature store. Ideally these operations would initiate the deletion of the associated items from the list_items table (i.e. all items for all lists for all entities for a given feature name and optional version). These item deletions have however no efficient implementation in our data model, so the Delete List Feature and Delete List Feature Version operations do not initiate them. This is an accepted limitation: the time-to-live associated with each item in the list_items table ensures that items from deleted lists will be automatically expunged from the table at some point. Get List Items This operation retrieves items in a given list. It has the following input parameters: entityType: The type of the entity the list belongs to. entityID: The identifier of the entity the list belongs to. featureName: The name of the list feature. featureVersion: The version of the list feature (optional, defaults to ""). minTimestamp: The lower bound for the timestamp of list items to retrieve. limit: The maximum number of list items to retrieve. On output, this operation returns the items of the list identified by the entityType, entityID, featureName, and featureVersion parameters. Items are returned in the reverse chronological order of their timestamps, starting with the item with highest timestamp. Items with a timestamp older than minTimestamp are discarded. And the operation returns at most limit items. The ScyllaDB query for the Get List Items operation is as follows (where $param placeholders are expanded to the value of the corresponding param input parameter): SELECT value, item_keyFROM list_itemsWHERE feature_key = '$entityType#$featureName|$featureVersion' AND entity_id = $entityID AND item_key >= ${buildItemKeyTimestamp(minTimestamp)}ORDER BY item_key DESCLIMIT $limit; The partition key of the list_items table (feature_key + entity_id) ensures that this query hits a single ScyllaDB partition for maximum efficiency. And the table’s clustering key (item_key, which value starts with the item’s timestamp) is used to both (a) sort items following the reverse chronological order of their timestamps, taking advantage of the “CLUSTERING ORDER” table option (see the list_items table definition above), and (b) discard items older than minTimestamp. Sample Get List Items operationAdd List Items This operation adds a set of items to a given list. It has the following input parameters: entityType: The type of the entity the list belongs to. entityID: The identifier of the entity the list belongs to. featureName: The name of the list feature. featureVersion: The version of the list feature (optional, defaults to ""). items: The set of items to add. Each item is described by its value and timestamp. The Add List Items operation returns nothing. It is implemented as follows: BEGIN BATCH -- for each $item in $items: INSERT INTO list_items(feature_key, entity_id, item_key, value) VALUES ( '$entityType#$featureName|$featureVersion', $entityID, ${buildItemKey(item)}, ${item.Value}, ) USING TTL ${item.Timestamp + ttl - now};APPLY BATCH; Each item is added to the table with an INSERT statement. All the INSERT statements are sent and executed through a single BATCH statement. The batch is performed as logged (the default behavior) to ensure that either all insertions complete or none do; this atomic behavior is made possible by the fact that all insertions are targeting the same partition (defined by the feature_key and entity_id columns). Sample Add List Items operationRemove List Items with Value This operation removes all items with a given value in a list. It has the following input parameters: entityType: The type of the entity the list belongs to. entityID: The identifier of the entity the list belongs to. featureName: The name of the list feature. featureVersion: The version of the list feature (optional, defaults to ""). value: The value to remove. The Remove List Items with Value operation returns nothing. Unlike all other list operations, which are implemented with a single request to ScyllaDB, the Remove List Items with Value operation has to go through two steps in sequence. The first step runs the following ScyllaDB query to fetch the item_key values of the items to delete: SELECT item_keyFROM list_itemsWHERE feature_key = '$entityType#$featureName|$featureVersion' AND entity_id = $entityID AND value = $value; This first query relies on the list_items_by_value local secondary index on the list_items table to efficiently identify the items with the specified value. Using this index is expected to be faster than scanning the table as the query is highly selective: the number of list items with the specified value should be very low compared to the total number of list items. Once the item_key values have been retrieved, the second step runs batches of DELETE statements to delete the items: -- for each batch of item_key values:DELETE FROM list_itemsWHERE feature_key = '$entityType#$featureName|$featureVersion' AND entity_id = $entityID AND item_key IN ($itemKeyBatch); Each batch contains at most 10 item_key values. This approach is a middle ground between (a) running a single DELETE statement with all the (unbounded) item_key values in the “WHERE … IN …” clause and (b) running one DELETE statement per item to delete. Both (a) and (b) are expected to have performances that degrade non-linearly as the number of items to delete increases. Sample Remove List Items with Value operationRemove All List Items This operation removes all items in a list. It has the following input parameters: entityType: The type of the entity the list belongs to. entityID: The identifier of the entity the list belongs to. featureName: The name of the list feature. featureVersion: The version of the list feature (optional, defaults to ""). The Remove All List Items operation returns nothing. It is implemented with a single DELETE statement relying on the partition key of the list_items table (feature_key + entity_id): DELETE FROM list_itemsWHERE feature_key = '$entityType#$featureName|$featureVersion' AND entity_id = $entityID; As all the items are in the same partition, ScyllaDB is able to perform the deletion atomically: either all the rows (i.e. the whole partition) will be deleted, or none will. Sample Remove All List Items operationRecap and next steps Thank you for making it this far! We’ve covered quite a lot of ground, so let’s recap. ScyllaDB is a database designed for use cases requiring high throughput, low latency, and high scalability. In order to enable efficient queries, a ScyllaDB table must define a primary key that is appropriate for the table’s data access patterns. A primary key is composed of a partition key, which defines how the rows in the table are mapped to partitions, and an optional clustering key, which defines how rows are sorted within a partition. Data access patterns relying on a non-primary key column can be made efficient by using secondary indexes. Column values in a ScyllaDB table can have an associated time-to-live (TTL), which is a duration after which ScyllaDB will automatically delete the value (and, potentially, the row it belongs to). The ScyllaDB data model we’ve defined for lists in Medium’s feature store relies on a single list_items table storing all the items for all the lists managed by the feature store. Each row of this table stores the data for a single list item. The table’s partition key identifies the list each item belongs to; its values are built from the list’s entity type, feature name, feature version, and entity ID. As a result, all the items of a given list are stored in the same ScyllaDB partition, which enables efficient execution of queries on that list. The table’s clustering key is designed with the following goals: (a) enable efficient retrieval of list items in reverse chronological order and (b) support the requirement stating that a given list can hold multiple items with the same timestamp as long as these items all have distinct values. The list_items_by_value secondary index enables efficient queries on the table’s value column, which isn’t part of the table’s primary key. Each row of the list_items table has an associated TTL which ensures that the size of the table does not grow out of control with data that we don’t care about anymore. One accepted limitation of the data model is that it provides no efficient way to delete all the items associated with a given list feature or list feature version (for the Delete List Feature and Delete List Feature Version operations). The mandatory row TTLs provide a mitigation strategy by ensuring that these list items will be automatically deleted from the table at some point. List operations are relatively straightforward to implement, thanks to the data model we’ve defined. All operations except Remove List Items with Value perform a single request to the ScyllaDB cluster. Get List Items relies on the table’s partition key to identify the list and on the table’s clustering key to fetch items in reverse-chronological order with a lower bound on the item timestamps. Add List Items atomically inserts the requested list items into the table via a batch statement. Remove List Items with Value is the only operation that cannot be implemented with a single request; it first fetches the clustering key values of the items to delete, relying on the list_items_by_value secondary index, then sends batched deletion requests for these items. Remove All List Items requests all the rows with a given partition key value to be deleted, which ScyllaDB is able to perform atomically. The next installment in the series will describe how we also used DynamoDB to implement the data layer for list features. As we’ll see, there are numerous similarities and some subtle differences between ScyllaDB and DynamoDB. Stay tuned to learn more! ScyllaDB implementation: Lists in Medium’s feature store, Part 2 was originally published in Medium Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.


View Entire Post

Read Entire Article