Secure indexes
A primer on Bloom filters
Bloom filters are a data structure which encodes a set of items, with some special properties.
- Items can never be removed.
- The structure is very memory-efficient.
- There can be false positives when testing for presence of an element in the set.
- There can never be a false negative however.
Bloom filters rely on a hash function to work. It will hash elements which are added to the structure and hash elements which are tested against the set. At a lower level, the set only contains hashes of items.
We can use Bloom filters to produce efficient search indexes: make the bloom filter containing all words from a document; do the same for every document you want to be able to search, and now instead of scanning each document you can just test your search query against each filter to know which documents (likely) contain your search terms.
Secure indexes
I came across secure indexes as I was researching how to bring full-text search to end-to-end encrypted documents stored on a remote server. Bloom filters would be undesirable in that case because they leak information about the document’s contents. For example if we were encrypting invoices and I wanted to know whether companies Foo and Bar work together I could try to find invoices matching both “Foo” and “Bar” and deduce with some level of confidence that they are partners or not, depending on how many documents match.
A naïve approach to secure indexes is to see them as Bloom filters where the hash function also depends on a secret. In effect we can build secure indexes simply by replacing the hash function from a Bloom filter implementation by an HMAC function. Provided that the secret is identical for all documents and known only to the client, we can implement efficient search over encrypted documents with this construct. To perform a search the user only needs to generate the “trapdoor” for the search terms. This happens to be identical to a secure index containing all words from the search query. With the trapdoor, the server can iterate over all secure indexes, returning a positive answer for each index which contains all bits from the trapdoor. Such a mechanism has the following properties:
- The server knows nothing of the contents of documents and indices besides an approximation of the amount of distinct words.
- The server knows nothing of the search terms, besides an approximation of the amount of distinct words.
- Comparing the search terms with a secure index is nothing more than a bitwise-and followed by a comparison.