Large customers labeled many thousands of continuously reported metrics and wanted the ability to search for specific quantities by partially remembered names. We set up an Elasticsearch cluster but found we had to condition the inbound traffic to fit this database's expectations.
We wanted new names to appear within a minute of their first occurrence in the data stream.
We chose a Bloom filter for storage efficiency. We could tolerate false positives at some rate but not false negatives.
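The appeal is that a Bloom filter answers "have we seen this name?" with a fixed-size bit array and a few hash probes: it may say "probably yes" for a name it never saw, but it never says "no" for one it did. Here is a minimal sketch of the idea in Java; the sizing and hashing choices are illustrative, not what we actually shipped.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch: k hash probes into an m-bit array.
// False positives are possible; false negatives are not.
class NameFilter {
    private final BitSet bits;
    private final int m;      // number of bits
    private final int k;      // number of hash probes

    NameFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k probe positions from two base hashes (Kirsch-Mitzenmacher style).
    private int probe(String name, int i) {
        int h1 = name.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String name) {
        for (int i = 0; i < k; i++) bits.set(probe(name, i));
    }

    // True means "probably seen"; false means "definitely never seen".
    boolean mightContain(String name) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(probe(name, i))) return false;
        }
        return true;
    }
}
```

A real deployment would use a stronger hash than String.hashCode(), but the shape of the trade is the same.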
We deployed our own work in a cluster and devised a mechanism for repartitioning the filter. This also supported warm start on recovery.
# Architecture
A founding engineer worked out the microservice architecture and provided the first implementation in Java. I found it hard to see all of its features from the source code alone. Once familiar with the threads and objects, I asked for a whiteboard explanation and asked a lot of "why" questions. The mechanisms by which back-pressure propagated were typical of its subtle design.
My last question was "how did you get to know all of this?" It wasn't his first service, he said. He had learned by hard experience.
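I can't reproduce his design from memory, but the back-pressure pattern we traced on the whiteboard is a familiar one: bounded queues between stages, so that when a downstream stage falls behind, its queue fills and the pressure propagates to the producer. A rough sketch of that general pattern, not his code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Back-pressure via a bounded queue between pipeline stages:
// when the consumer lags, offer() starts failing and the producer
// must slow down, shed load, or push the pressure further upstream.
class Stage {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

    // Producer side: returns false when the stage is saturated.
    boolean submit(String metricName) throws InterruptedException {
        return queue.offer(metricName, 50, TimeUnit.MILLISECONDS);
    }

    // Consumer side: runs on the stage's own thread.
    void drain() throws InterruptedException {
        while (true) {
            String name = queue.take();
            process(name);
        }
    }

    private void process(String name) {
        // filter lookup, Elasticsearch write, etc.
    }
}
```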
# Modeling
We had enough measurements to know there were scaling issues. I can't remember how Bloom filters came to our attention, but the analysis quoted online, such as Wikipedia's, made them look promising. We set out to confirm this before proceeding.
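The standard analysis relates false positives to filter size: with n names stored in m bits using k hash functions,

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\ln 2,
\qquad
m \approx -\,\frac{n \ln p}{(\ln 2)^{2}}
```

A commonly quoted point on that curve is roughly 10 bits per name with 7 hashes for a false-positive rate near 1%. Numbers of that shape, not these exact ones, are what made the approach look promising for our volumes.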
I selected an implementation and ran artificial data against it. We varied the workload statistics and became familiar with response times by feeding our modeled measurements into our own product, just as we would soon be monitoring this work in production.
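The harness itself was unremarkable; something like the following, reusing the NameFilter sketch above, with the novelty rate and name counts standing in for whatever mix we were exercising that day:

```java
import java.util.Random;

// Rough shape of the modeling harness: synthesize metric names with a tunable
// ratio of repeats to novel names, then time filter lookups under that mix.
// Sizes and rates here are illustrative, not our actual workload.
class WorkloadModel {
    public static void main(String[] args) {
        NameFilter filter = new NameFilter(1 << 24, 7); // ~16M bits, 7 probes
        Random random = new Random(42);
        double noveltyRate = 0.01;   // fraction of traffic carrying a new name
        int knownNames = 100_000;

        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            String name = random.nextDouble() < noveltyRate
                    ? "metric.new." + i
                    : "metric.known." + random.nextInt(knownNames);
            if (!filter.mightContain(name)) {
                filter.add(name);   // in the real path this would also trigger an index write
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("1M lookups in " + elapsedMs + " ms");
    }
}
```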
# Deployment
We were working during a period when any production modification had to be approved and scheduled. For us that meant about one change a day.
We tapped production data flows and directed a variable share of this through our filter and into a newly provisioned Elasticsearch cluster. We would prove this worked reliably before offering the query capability to clients. Each day we designed a new overnight test. We often checked in from home to be sure what we saw matched our model.
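The share was a dial we turned up with each scheduled change. One way to do it, shown only to fix ideas since I no longer have the routing code: hash the metric name into a bucket and compare against the current share, so a given name is consistently in or out of the experiment.

```java
// Illustrative only: route a configurable share of the tapped stream through
// the new filter path. Hashing on the metric name keeps the decision stable
// per name across the whole run.
class TrafficSplit {
    private volatile double share = 0.05;   // dialed up as confidence grew

    void setShare(double share) { this.share = share; }

    boolean inExperiment(String metricName) {
        int bucket = Math.floorMod(metricName.hashCode(), 10_000);
        return bucket < share * 10_000;
    }
}
```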
At full load we were unable to rediscover the cache contents fast enough to meet our desired latency targets. Instead we periodically merged and wrote the cache in the Bloom format. We demonstrated we could redistribute this between new nodes for rebalancing.
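The periodic merge works because Bloom filters built with the same size and hash functions combine by OR-ing their bit arrays. In terms of the NameFilter sketch above, a merge might look like this; our actual snapshot format isn't shown.

```java
// Could be added to the NameFilter sketch above: filters built with the same
// m and k merge by OR-ing their bit arrays, which keeps a periodic
// merge-and-snapshot step cheap.
void merge(NameFilter other) {
    if (other.m != this.m || other.k != this.k) {
        throw new IllegalArgumentException("filters must share m and k");
    }
    this.bits.or(other.bits);
}
```

Writing those snapshots is also what gave us the warm start on recovery mentioned earlier.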
# Epilog
My colleague, Jason, was later tapped to lead similar technology in the next generation product. I drifted to higher level architectural modeling as a service to other architects.
This work, now in production, was successfully transferred to the product group that would rightfully own it. Following their "data driven" commitment they ran their own performance tests and found our work wholly satisfactory.
This began a period of microservice conversion and a corresponding growth in the architecture team. I remember a conversation with one engineer who felt the architects didn't have the real experience to justify their role. Then he asked, "what have you written that runs at our production volume?" Fortunately I had this story to tell and knew enough details to satisfy him.