chunkedseq
container library for large in-memory data sets
|
This documentation covers essential concepts that are needed to implement custom data structures out of various instantiations of the chunkedseq structure. Just like the Finger Tree of Hinze and Patterson, the chunkedseq can be instantiated in certain ways to yield asymptotically efficient data structures, such as associative maps, priority queues, weighted sequences, interval trees, etc. A summary of these ideas that is presented in greater detail can be find in the original publication on finger trees and in a blog post.
In this tutorial, we present the key mechanism for building customized data structures: monoid-cached measurement. We show how to use monoid-cached measurements to implement a powerful form of split operation that affects chunkedseq containers. Using this split operation, we then show how to apply our cached measurement scheme to build two new data structures:
Let \(S\) denote the type of the items contained by the chunkedseq container and \(T\) the type of the cached measurements. For example, suppose that we want to define a weighted chunkedseq container of std::string
s for which the weights have type weight_type
. Then we have: \(S = \mathtt{std::string}\) and \(T = \mathtt{weight\_type}\). How exactly are cached measurements obtained? The following two methods are the ones that are used by our C++ package.
A measure function is a function \(m\) that is provided by the client; the function takes a single item and returns a single measure value:
\[ \begin{array}{rcl} m(s) & : & S \rightarrow T \end{array} \]
Suppose we want to use our measurement to represent the number of items that are stored in the container. We call this measure the size measure. The measure of any individual item always equals one:
\[ \begin{array}{rclll} \mathtt{size}(s) & : & S \rightarrow \mathtt{long} & = & 1 \end{array} \]
The string-size measurement assigns to each item the weight equal to the number of characters in the given string.
\[ \begin{array}{rclll} \mathtt{string\_size}(str) & : & \mathtt{string} \rightarrow \mathtt{long} & = & str.\mathtt{size}() \end{array} \]
Sometimes it is convenient to have the ability to compute, all at once, the combined measure of a group of items that is referenced by a given "basic" segment. For this reason, we require that, in addition to \(m\), each measurement scheme provides a segment-wise measure operation, namely \(\vec{m}\), which takes the pair of pointer arguments \(begin\) and \(end\) which correspond to a basic segment, and returns a single measured value.
\[ \begin{array}{rcl} \vec{m}(begin, end) & : & (S^\mathtt{*}, S^\mathtt{*}) \rightarrow T \end{array} \]
The first and second arguments correspond to the range in memory defined by the segment \((begin, end]\). The value returned by \(\vec{m}(begin, end)\) should equal the sum of the values \(m(\mathtt{*}p)\) for each pointer \(p\) in the range \((begin, end]\).
This operation is simply \(\vec{m}(begin, end) = |end-begin|\), where our segment is defined by the sequence of items represented by the range of pointers \((begin, end]\).
The measure descriptor is the name that we give to the C++ class that describes a given measurement scheme. This interface exports deinitions of the following types:
Type | Description |
---|---|
value_type | type \(S\) of items stored in the container |
measured_type | type \(T\) of item-measure values |
And this interface exports definitions of the following methods:
Members | Description |
---|---|
measured_type operator()(const value_type& v) | returns \(m(\mathtt{v})\) |
measured_type operator()(const value_type* begin, const value_type* end) | returns \(\vec{m}(\mathtt{begin}, \mathtt{end})\) |
Our first kind of measurement is one that does nothing except make fresh values whose type is the same as the type of the second template argument of the class.
The trivial measurement is useful in situations where cached measurements are not needed by the client of the chunkedseq. Trivial measurements have the advantage of being (almost) zero overhead annotations.
This kind of measurement is useful for maintaining fast access to the count of the number of items stored in the container.
This technique allows the client to supply to the internals of the chunkedseq container an arbitrary weight function. This client-supplied weight function is passed to the following class by the third template argument.
Often it is useful to combine meaurements in various configurations. For this purpose, we define the measured pair, which is just a structure that has space for two values of two given measured types, namely Measured1
and Measured2
.
The combiner measurement just combines the measurement strategies of two given measures by pairing measured values.
Recall that a monoid is an algebraic structure that consists of a set \(T\), an associative binary operation \(\oplus\) and an identity element \(\mathbf{I}\). That is, \((T, \oplus, \mathbf{I})\) is a monoid if:
Examples of monoids include the following:
A group is a closely related algebraic structure. Any monoid is also a group if the monoid has an inverse operation \(\ominus\):
We require that the descriptor export a binding to the type of the measured values that are related by the algebra.
Type | Description |
---|---|
value_type | type of measured values \(T\) to be related by the algebra |
We require that the descriptor export the following members. If has_inverse
is false, then it should be safe to assume that the inverse(x)
operation is never called.
Static members | Description |
---|---|
const bool has_inverse | true , iff the algebra is a group |
value_type identity() | returns \(\mathbf{I}\) |
value_type combine(value_type x, value_type y) | returns x \(\oplus\) y |
value_type inverse(value_type x) | returns \(\ominus\) x |
The trivial algebra does nothing except construct new identity elements.
The algebra that we use for integers is a group in which the identity element is zero, the plus operator is integer addition, and the minus operator is integer negation.
Just like with the measurement descriptor, an algebra descriptor can be created by combining two given algebra descriptors pairwise.
A scan is an iterated reduction that maps to each item \(v_i\) in a given sequences of items \(S = [v_1, v_2, \ldots, v_n]\) a single measured value \(c_i = \mathbf{I} \oplus m(v_1) \oplus m(v_2) \oplus \ldots \oplus m(v_i)\), where \(m(v)\) is a given measure function. For example, consider the "size" (i.e., weight-one) scan, which is specified by the use of a particular measure function: \(m(v) = 1\). Observe that the size scan gives the positions of the items in the sequence, thereby enabling us later on to index and to split the chunkedseq at a given position.
For convenience, we define scan formally as follows. The operator returns the combined measured values of the items in the range of positions \([i, j)\) in the given sequence \(s\).
\[ \begin{array}{rclr} M_{i,j} & : & \mathtt{Sequence}(S) \rightarrow T \\ M_{i,i}(s) & = & \mathbf{I} & \\ M_{i,j}(s) & = & \oplus_{k = i}^j m(s_k) & \mathrm{if} \, i < j \end{array} \]
The cached value of an internal tree node \(k\) in the chunkedseq structure is computed by \(M_{i,j}(s)\), where \(s = [v_i, \ldots, v_j]\) represents a subsequence of values contained in the chunks of the subtree below node \(k\). When this reduction is performed by the internal operations of the chunkedseq, this expression is broken up into a set of subexpressions, for example: \(((m(v_i) \oplus m(v_{i+1})) \oplus (m(v_{i+2}) \oplus m(v_{i+3}) \oplus (m(v_{i+4}) \oplus m(v_{i+5}))) ... \oplus m(v_j))\). The partitioning into subexpressions and the order in which the subexpressions are combined depends on the particular shape of the underlying chunkedseq. Moreover, the particular shape is determined uniquely by the history of update operations that created the finger tree. As such, we could build two chunkedseqs by, for example, using different sequences of push and pop operations and end up with two different chunkedseq structures that represent the same sequence of items. Even though the two chunkedseqs represent the same sequence, the cached measurements of the two chunkedseqs are combined up to the root of the chunkedseq by two different partitionings of combining operations. However, if \(\oplus\) is associative, it does not matter: regardless of how the expression are broken up, the cached measurement at the root of the chunkedseq is guaranteed to be the same for any two chunkedseqs that represent the same sequence of items. Commutativity is not necessary, however, because the ordering of the items of the sequence is respected by the combining operations performed by the chunkedseq.
Suppose we have a cached measurement \(C = M_{i,j}(s)\) , where \(s = [v_i, \ldots, v_j]\) represents a subsequence of values contained in the same chunk somewhere inside our chunkedseq structure. Now, suppose that we wish to remove the first item from our sequence of measurements, namely \(v_i\). On the one hand, without an inverse operation, and assuming that we have not cached partial sums of \(C\), the only way to compute the new cached value is to recompute \((m(v_{i+1}) \oplus ... \oplus m(v_j))\). On the other hand, if the inverse operation is cheap, it may be much more efficient to instead compute \(\ominus m(v_i) \oplus C\).
Therefore, it should be clear that using the inverse operation can greatly improve efficiency in situations where the combined cached measurement of a group of items needs to be recomputed on a regular basis. For example, the same situation is triggered by the pop operations of the chunks stored inside the chunkedseq structure. On the one hand, by using inverse, each pop operation requires only a few additional operations to reset the cached measured value of the chunk. On the other, if inverse is not available, each pop operation requires recomputing the combined measure of the chunk, which although constant time, takes time proportion with the chunk size, which can be a fairly large fixed constant, such as 512 items. As such, internally, the chunkedseq operations use inverse operations whenever permitted by the algebra (i.e., when the algebra is identified as a group) but otherwise fall back to the most general strategy when the algebra is just a monoid.
The cached-measurement policy binds both the measurement scheme and the algebra for a given instantiation of chunkedseq. For example, the following are cached-measurement policies:
The interface exports four key components: the type of the items in the container, the type of the measured values, the measure function to gather the measurements, and the algebra to combine measured values.
Type | Description |
---|---|
measure_type | type of the measure descriptor |
algebra_type | type of the algebra descriptor |
value_type | type \(S\) of items to be stored in the container |
measured_type | type \(T\) of measured values |
size_type | size_t |
The only additional function that is required by the policy is a swap operation.
Static members | Description |
---|---|
void swap(measured_type& x, measured_type& y) | exchanges the values of x and y |
This trivial cached measurement is, by itself, completely inert: no computation is required to maintain cached values and only a minimum of space is required to store cached measurements on internal tree nodes of the chunkedseq.
In our implementation, we use this cached measurement policy to maintain the size information of the container. The size()
methods of the different chunkedseq containers obtain the size information by referencing values cached inside the tree by this policy.
Arbitrary weights can be maintained using a slight generalization of the size
measurement above.
Using the same combiner pattern we alredy presented for measures and algebras, we can use the following template class to build combinations of any two given cached-measurement policies.
Logically, the split operation on a chunkedseq container divides the underlying sequence into two pieces, leaving the first piece in the container targeted by the split and moving the other piece to another given container. The position at which the split occurs is determined by a search process that is guided by a predicate function. What carries out the search process? That job is the job of the internals of the chunkedseq class; the client is responsible only to provide the predicate function that is used by the search process. Formally, a predicate function is simply a function \(p\) which takes a measured value and returns either true
or false
.
\[ \begin{array}{rcl} p(m) & : & T \rightarrow \mathtt{bool} \end{array} \]
The search process guarantees that the position at which the split occurs is the position \(i\) in the target sequence, \(s = [v_1, \ldots, v_i, \ldots v_n]\), at which the value returned by \(p(M_{0,i}(s))\) first switches from false to true. The first part of the split equals \([v_1, \ldots, v_{i-1}]\) and the second \([v_i, \ldots, v_n]\).
In our C++ package, we represent predicate functions as classes which export the following public method.
Members | Description |
---|---|
bool operator()(measured_type m) | returns \(p(\mathtt{m})\) |
Let us first consider the small example which is given already for the weighted container. The action performed by the example program is to divide a given sequence of strings so that the first piece of the split contains approximately half of the even-length strings and the second piece the second half. In our example code (see the page linked above), we assign to each item a certain weight, which is defined formally as follows: if the length of the given string is an even number, return a 1; else, return a 0.
\[ \begin{array}{rcccl} m(str) & : & \mathtt{string} \rightarrow \mathtt{int} & = & \left\{ \begin{array}{l l} 1 & \quad \mathrm{if}\, str.\mathtt{size()}\, \mathrm{is\, an\, even\, number}\\ 0 & \quad \mathrm{otherwise} \end{array} \right. \end{array} \]
Let \(n\) denote the number of even-length strings in our source sequence. Then, the following predicate function delivers the exact split that we want.
\[ \begin{array}{rcccl} p(m) & : & int \rightarrow \mathtt{bool} & = & m \geq n/2 \end{array} \]
Let \(s\) denote the sequence of strings (i.e., ["Let's", "divide", "this", "string", "into", "two", "pieces"]
that we want to split. The following table shows the logical states of the split process.
\(i\) | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
\(v_i\) | Let's | divide | this | string | into | two | pieces |
\(m(v_i)\) | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
\(M_{0,i}(s)\) | 0 | 1 | 2 | 3 | 4 | 4 | 5 |
\(p(M_{0,i}(s))\) | false | false | false | true | true | true | true |
Our final example combines all of the elements of cached measurement to yield an asymptotically efficient implementation of associative maps. The idea behind the implementation is to represent the map internally by a chunkedseq container of key-value pairs. The key to efficiency is that the items in the chunkedseq are stored in descending order. When key-value pairs are logically added to the map, the key-value pair is physically added to a position in the underlying sequence so that the descending order is maintained. The insertion and removal of key-value pairs is achieved by splitting by a certain predicate function which we will see later. At a high level, what is happening is a kind of binary search that navigates the underlying chunkedseq structure, guided by carefully chosen cached key values that annotate the interior nodes of the chunkedseq.
Our implementation uses optional values, which are values that logically either contain a value of a given type or contain nothing at all. The concept is similar to that of the null pointer, except that the optional value applies to any given type, not just pointers.
Observe that our class implements the "less-than" operator: <
. Our implementation of this operator lifts any implementation of the same operator at the type Item
to the space of our option<Item>
: that is, our operator treats the empty (i.e., nullary) optional value as the smallest optional value. Otherwise, our the comparison used by our operator is the implementation already defined for the given type, Item
, if such an implementation is available.
The type of value returned by the measure function (i.e., measured_type
) is the optional key value, that is, a value of type option<key_type>
. The measure function simply extracts the smallest key value from the key-value pairs that it has at hand and packages them as an optional key value.
The monoid uses for its identity element the nullary optional key. The combining operator takes two optional key values and of the two returns either the smallest one or the nullary optional key value.
The cache measurement policy combines the measurement and monoid descriptors in a straightforward fashion.
The associative map class maintains the underlying sorted sequence of key-value pairs in the field called seq
. The method called upper
is the method that is employed by the class to maintain the invariant on the descending order of the keys. This method returns either the position of the first key that is greater than the given key, or the position of one past the end of the sequence if the given key is the greatest key.
As is typical of STL style, the indexing operator is used by the structure to handle both insertions and lookups. The operator works by first searching in its underlying sequence for the key referenced by its parameter; if found, the operator updates the value component of the corresponding key-value pair. Otherwise, the operator creates a new position in the sequence to put the given key by calling the insert
method of seq
at the appropriate position.