Using more of the database instead of sidekiq workers

In this post I’ll talk about an approach that simplifies things a little when using sidekiq1 workers to process small updates on database tables, such as counters.

The other day at atos6, while investigating why our service was struggling and running under pressure, I found that one of our sidekiq workers was causing the outage for two reasons: it was eating up the database connection pool and it ran hundreds of times per day. You might think that isn’t much to take a service down, but it was running concurrently with other users’ requests, and this worker was doing one thing: updating three counter cache columns, kept for statistics and to measure how our customers were evolving, on a table with very high access in our system.

When the outage occurred, I removed from our queue the small workers I didn’t care so much about; some of them update a latitude and longitude, and that task can be called through a rake task which runs periodically. Another measure was to increase the database connection pool and wait for the queue to flush.
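
As an aside, moving that kind of work out of the queue can be as simple as a small rake task. Here is a minimal sketch of what I mean, assuming a hypothetical Customer model with an update_coordinates! method; the names are illustrative, not our real code:

# lib/tasks/geo.rake
namespace :geo do
  desc "Refresh latitude/longitude for customers missing coordinates"
  task refresh_coordinates: :environment do
    # find_each loads records in batches, so a periodic run stays cheap
    Customer.where(latitude: nil).find_each do |customer|
      customer.update_coordinates!
    end
  end
end

Scheduled with cron (or whenever you schedule rake tasks), this does the same job without holding a sidekiq thread and a database connection during busy hours.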

After the recovery, I took some notes about why the outage happened and took some screenshots from Scout Apm2, which we’ve been using for quite some time.

Scout Apm dashboard showing a lot of request allocations

If you take a look at the screenshot, the pattern repeats over time, but on that day, besides the worker being called hundreds of times, other heavy jobs were running too and our service was busy (as it normally is between Wednesday and Sunday), with users pulling reports, updating data and so on.

Well, to decide on a possible refactoring, or maybe just decrease the priority of that worker so it wouldn’t slow down our service anymore, I collected logs from at least 2 months and processed them to get a snapshot of how many times that worker was called. Not surprisingly, it was in our top 5 (called 39522 times).
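
The counting itself doesn’t need anything fancy. A rough sketch of the idea, assuming plain-text log files under log/archive and a worker named CountersUpdateWorker (both names illustrative):

# Count how many log lines mention the worker across the archived files
count = Dir.glob("log/archive/*.log").sum do |path|
  File.foreach(path).count { |line| line.include?("CountersUpdateWorker") }
end
puts "CountersUpdateWorker appeared #{count} times"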

With my hands full of data, I started to analyze the feature that, in the end, fires the workers. The service object that handles all the logic to update those counters was causing a lot of N+1 queries, because it receives one id from the worker and calls:

class CountersUpdateBusiness
  def initialize(customer_id)
    @customer = Customer.find(customer_id)
  end

  # ...

  private

  attr_reader :customer
end

And along the code, it calls something like customer.reports_count = customer.reports.count or customer.active_reports_count = customer.reports.active.count. For the first example, a simple belongs_to :customer, counter_cache: :reports_count (since we’re using rails and activerecord) should suffice, but the last one can’t be accomplished with the same approach, so I thought about writing a SQL trigger3 to update the counter after creating or updating records scoped by the active flag.
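
To make that concrete, here is a minimal sketch of both approaches, assuming a Report model that belongs to Customer, a boolean active column and a PostgreSQL (11+) database; the class, function and trigger names (and the migration version) are illustrative, not the exact code we shipped:

# 1) The simple counter, handled entirely by ActiveRecord's counter_cache
class Report < ApplicationRecord
  belongs_to :customer, counter_cache: :reports_count
  scope :active, -> { where(active: true) }
end

# 2) The scoped counter (active reports only), kept by a SQL trigger
#    created from a migration
class AddActiveReportsCountTrigger < ActiveRecord::Migration[6.0]
  def up
    execute <<~SQL
      CREATE OR REPLACE FUNCTION refresh_active_reports_count() RETURNS trigger AS $$
      DECLARE
        target_customer_id bigint;
      BEGIN
        -- OLD is the only row available on DELETE, NEW otherwise
        IF TG_OP = 'DELETE' THEN
          target_customer_id := OLD.customer_id;
        ELSE
          target_customer_id := NEW.customer_id;
        END IF;

        UPDATE customers
           SET active_reports_count = (
             SELECT COUNT(*)
               FROM reports
              WHERE reports.customer_id = target_customer_id
                AND reports.active
           )
         WHERE customers.id = target_customer_id;

        RETURN NULL;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER refresh_active_reports_count
      AFTER INSERT OR UPDATE OR DELETE ON reports
      FOR EACH ROW EXECUTE FUNCTION refresh_active_reports_count();
    SQL
  end

  def down
    execute <<~SQL
      DROP TRIGGER IF EXISTS refresh_active_reports_count ON reports;
      DROP FUNCTION IF EXISTS refresh_active_reports_count();
    SQL
  end
end

The nice part of the trigger is that the counter stays correct no matter which code path touches the reports table, so there is no worker to enqueue at all.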

In the process, I refactored and simplified a lot of code; to take a peek, see the screenshot below:

Refactor after removing the worker

Conclusion

That turned out to be a good piece of work: I simplified the update of the counters, left the sidekiq queues to handle heavy workers asynchronously (email deliveries, changing data in bulk and so on) and stopped them from competing with each other, which could have contributed to another outage.