Last week, a new paper dropped on the problem of colored k-mer indexing. It’s called Fulgor: A fast and compact k-mer index for large-scale matching and color queries, by Jason Fan and others.
As a person who is deeply invested in this particular problem, reading this paper was an exciting journey. My tool Themisto is cited throughout the paper, and the benchmarks are run against Themisto, so I think I have some very relevant expertise to comment on the paper. In this post, I give my thoughts.
But first, let me set the context for the paper with some background.
The paper presents an index for the color k-mer indexing problem. The problem is as follows. Given a database of reference sequences, build an index that can quickly return the identifiers of reference sequences containing a given query k-mer. The reference sequence identifiers also called colors, hence the name of the problem.
This is basically an inverted index specialized to k-mers. The challenge is to design the index so that it takes advantage of redundancies that are typical to genomic databases, and works well with streaming queries that query all overlapping k-mers of a query sequence in succession.
Most solutions to the problem factor the index into two largely independent parts: (1) Lookup: a dictionary mapping the distinct k-mers of the database into integers, and (2) Colors: a dictionary mapping the integer identifiers of k-mers to sets of reference sequence identifiers having those k-mers. Tools such as VARI, Mantis, Bifrost, Metagraph, Themisto (my tool) and now the newly introduced Fulgor all work like this.
There are very good data structures for the first part of this indexing scheme. Any hash table will do, but there are also more space-efficient hashing schemes specialized to k-mers, such as BLight and SSHash. In addition to these, there is another family of solutions that is based on variants of the Burrows-Wheeler transform (see dBG-FM, BOSS and SBWT). Both hashing and BWT-based solutions can get down to just ~2-10 bits per k-mer of space while still supporting fast lookups. The BWT-based indexes are typically smaller, but hashing-based solutions tend to have faster queries due to better memory locality.
I think that the k-mer lookup part of the index has been studied quite thoroughly, leaving only small constant coefficient improvements left to discover. The big improvements to be had are in the second part of the indexing scheme.
The second part is essentially a compressed representation of the color matrix of the data. In the color matrix, the rows represent distinct k-mers of the data, and the columns the distinct reference sequences. The value at cell (i,j) in the color matrix is 1 if k-mer i is found in reference with color j. Otherwise, cell (i,j) contains a zero.
This matrix is extremely compressible. For starters, it’s very sparse. Besides being sparse, the columns of related reference sequences are very similar to each other. The recent paper on Phylogenetic compression by Břinda and others (see my blog post about it here), showed that the information in the color matrix for datasets at the scale of terabytes can be compressed to just a few tens of gigabytes. We’re talking multiple orders of magnitude here.
While the paper of Břinda and others is very impressive, they currently need to decompress batches of the index in order to support queries. Ideally, a compressed database also supports fast queries directly on the compressed data. Fulgor is the latest addition to the array of tools with this goal in mind.
Fulgor uses the SSHash k-mer hash dictionary for the first part of the index. As a side product, the SSHash function also returns the unitig containing the query k-mer. Unitigs here are maximal non-branching paths in the colored de Bruijn graph of the input, such that the colors of all k-mers in the unitig are the same. The second part of the index maps the unitigs to the color sets. The unitigs are permuted in memory such that unitigs having the same colors are adjacent in memory, which is exploited to save space.
Like in Themisto and Bifrost, the distinct color sets are represented using a hybrid encoding. Fulgor encodes color sets having between 25%-75% of the color present as bit maps, and the rest using delta-gap coding. The figure below from the paper illustrates the index.
Comparison against Themisto
The authors compare Fulgor to my colored de Bruijn graph tool Themisto. Two datasets were used: 150k Salmonella genomes, and 31k genomes from a heterogenous collection of gut bacteria. Fulgor is about half the size of Themisto on the former, and one quarter on the latter.
Where does the size difference come from? First, the k-mer-to-color-set mapping of Themisto takes more space than that of Fulgor. Thanks to the unitig reordering trick, Fulgor needs only 1 + o(1) bits per unitig for this mapping, which is almost negligible in the grand scheme of things. In contrast, Themisto, on the gut dataset, is reported to use about 91GB to store this mapping, which is about 65% of the reported total index size.
Themisto uses a lot of space here because in the default configuration, Themisto stores one integer for each k-mer, such that the stored integer points to the color set of that k-mer. The integers are represented using log2(number of distinct color sets) bits each. Unlike Fulgor, Themisto also indexes both k-mer orientations (forward and reverse complement), whereas Fulgor can get away with indexing just one orientation while still supporting queries for both.
But! Themisto also has an option to sparsify this mapping, storing integers only for every d nodes in each unitig. This feature is not enabled by default (i.e. d = 1), but I have found that the queries are hardly any slower if a larger value of d is used. The authors do not experiment with higher values of d. Clearly I should set a higher value for d in the default configuration of Themisto!
This is not the only reason for the difference in size however. The hybrid encoding of the color sets in Fulgor is slightly more advanced than the one in Themisto:
– Themisto encodes sparse sets as simple integer arrays for fast scanning, whereas Fulgor uses Elias delta coding of the gaps, which can be more efficient if the gaps are small on average.
– In Themisto, very dense sets are encoded as bitmaps, whereas in those cases Fulgor will take the complement of the bitmap and encode it as sparse.
The second point especially makes a big difference in the Salmonella dataset, where the authors observe that 50% of the encoded integers belong the color sets that are at least 90% dense.
All of the techniques used to represent the color sets in Fulgor could be implemented in Themisto as well. Themisto also has an option to use Roaring bitmaps to represent the color sets, which I believe has a separate compression mode for very dense sets, like Fulgor does. However, the trick where the unitigs are permuted in memory to compress the color mapping is not available for Themisto because Themisto is not based on explicitly stored unitigs. Adopting that trick would require swapping the SBWT index of Themisto for some kind of an index for the unitigs (minimizers, FM-index…), which would be a major rewrite.
The authors benchmark the queries on two kinds of workloads: one with a lot of hits, and one with only a few hits. Fulgor is found 3-5 times faster. That sounds believable to me, since SSHash has better cache properties than SBWT.
There is one caveat mentioned in the experiments: index loading time is included in the query time. This hurts Themisto more than Fulgor because Themisto indexes are larger.
This is a common methodological snag I’ve seen in many papers. While the time taken to load the index can be a real issue for the end user, I think that when comparing data structures, it would make the most sense to assume that the data structures are already loaded into memory before the timer is started – otherwise index loading time may dominate the experiment, which could obscure the query efficiency of the data structure.
Measuring the actual query time might not be so easy to do however, as it may require diving into the source code of the tools to insert a timer to the point where the queries actually begin. To make this easier, I’ve long ago made Themisto print the current time when the queries start. I encourage other tool developers to do the same.
This issue might make big difference in the experiments of the Fulgor paper. In the experiments, the gut bacteria index took 139GB, so if it is loaded into memory at a rate of, say 1GB/s, then it takes more than two minutes to load. The total time of Themisto on this dataset was just 2min 58s on the high-hit workload and 2min 45s on the low-hit workload, so it is possible that the index loading time took a significant fraction of the total time. It’s hard to judge this from the manuscript alone.
Rest of the paper
The paper also contains experiments on different flavors of pseudoalignment algorithms, including skipping heuristics like those used in Kallisto. While this is interesting work, this post is already long enough so I won’t cover those.
Congratulations to the authors for this work! It seems I need to get back to work to keep Themisto competitive. Since Fulgor uses tricks that are specific to the unitig minimizer index, we would need to introduce tricks that are specific to the SBWT.
The fundamental difference between the two indexing approaches is that the unitig index groups together k-mers that are adjacent in the text, whereas the SBWT groups together k-mers that are adjacent in the lexicographic order. Techniques that crucially depend on the lexicographic order could potentially set Themisto apart from Fulgor.
Acknowledgements: Thank you to ChatGPT for coming up with the title of the post.
Conflicts of interest: I have a paper in submission with the fourth author Giulio Pibiri.