Memcached server profiling with mk-query-digest

Introduction

Monitoring Memcached through hit rate, evicitions sometime is not enough to find what is really happening.
Maatkit have released a Perl toolkit for MySQL and PosteGreSQL, but one tool on this toolkit can also profile Memcached traffic.
This article will focus on mk-query-digest, the Memcached traffic profiler.

Dumping tcp traffic with tcpdump

Before running tcpdump, be aware that it take ~35% load of a one vCPU server here and the output file increase in size at a rate of 300 MB / min, but this may vary, depending on your cached data and requests rate.

You can also tcpdump directly from cache server to get all requests issued to this server, or from an Apache server (In case you only get memcached request asked by this particular server).

Installation

To start profiling we first have to capture memcached network traffic with tcpdump, in case you don’t have it installed, you can grab it from www.tcpdump.org or using your system installation prefered method.

Using tcpdump

To start the tcpdump, use the command below :

For more information about tcpdump filters, take a look at this documentation
Server will respond if everything is fine :

To stop tcpdump, hit CTRL + C, result will show like this :

And were done with the tcpdump. Let’s profile !

Memcached profiling

Although mk-query-digest is designed to be as efficient as possible, it can use a lot of CPU and memory, so I prefer to move log output on our log analysis server. After log file is moved, we can install mk-query-digest.

Download the latest release of mk-query-digest

Simply execute

or

We also need to enable execution of this perl script

Profiling with mk-query-digest

Then we can start profiling with the help of http://www.maatkit.org/doc/mk-query-digest.html

During profiling, mk-query-digest will inform us that it is always running (A nice addition of a recent release)

Note that if you CTRL + C before the end, mk-query-digest will also put display the results of data he has already computed

Results by Memcached server response code

First command is to show all memcached response code from tcpdump

After some time result will be displayed.

mk-query-digest_global.png

What it tell us :

General & time stats

mk-query-digest_global_overall.png
First line show us that during 331.83k memcached queries, there was 4 unique results for the -group-by res (VALUE, INTERRUPTED, NOT_FOUND, STORED), and that during the time interval it was issued 780.87 query per second with a 1.42x concurrency.
Lines below show us time taken to respond to theses commands (min, max, avg, 95%, …)

Commands, miss & errors

mk-query-digest_global_commands.png

Finally, we see what commands were issued to memcached, and what errors takes place (that is why % sum are above 100%) but we have to distingish two set of datas :

Memc_get tell us logically that 96% of 331K commands where get, 3% were set commands, and 4 commands where replace, this finally lead us to 100%
Memc_miss is for any command that result in a NOT_FOUND, in this case, 3% of all command tried to access to a nonexistent key.
Memc_error refers to retrieval (gets) commands that were INTERRUPTED, 56 of the 308K gets were in this case.

Memcached results

mk-query-digest_global_results.png
The columns should be fairly self-explanatory: rank, query ID (Not in use here), response time sum and percentage of total, number of calls and response time per call, and results (asked by our -group-by res)

First observation

As we can deduce from above screen, it seems that this server have a pretty good hit ratio (Only 3% miss) and every NOT_FOUND seems to be followed by a STORED (A miss in a get command followed by a set), and everything is fast. But we found 56 INTERRUPTED (0.0001%) that takes alone 24.8% of the time taken to execute 270K commands. Next we have to find what is wrong with these 0.0001% of total command that goes INTERRUPTED.

Following results : VALUE result

After this summary, every result is detailed, first the VALUE result

mk-query-digest_value_global.png

Let’s see what each part say :

General & time stats

mk-query-digest_value_overall.png
This part is similar to the general time & stats screen, you’ve got here time stats (total, max, avg, 95%, …), hosts that asked for values with percent repartition, size of return values (total, max, avg, …), and distribution of commands (I think the two set commands is a misread by mk-query-digest)

Query time distribution

mk-query-digest_value_query_time.png
This is particulary interesting logarithmic chart of time clustering, it show distribution of time of requests execution. As we see, most of them are under 100us, and a small part of them are a bit slow.
You have to check this charts, high response time must be avoid.

Looking a overall stats, it seem one request take 200s to execute, it can be a lot of things, client closing connection, network problem, in my case, 314s total - 200s max = 114 s to execute and send 308K get request seem to be pretty good, i will not investigate this more, but will watch it regulary. I will make another profiling to see what’s happening after explaining INTERRUPTED.

INTERRUPTED

mk-query-digest_interrupted_global.png
Time chart show that INTERRUPTED requests take lot of time, that one object is 1023.86kBytes (limit is 1024), but key print can’t show us if one particular key is the cause.
So we have to change our mk-query-digest parameters to investigate further

Memcached result INTERRUPTED, grouped by keys

What I want is to see what key retreival were interrupted, to see if a particular key is broken or something similar.

Building the mk-query-digest query

Mk-query-digest support PERL regular expressions and give acces to EventAttributes (see this documentation).
Do not hesistate to test your own commands, if something is wrong, mk-query-digest will tell it when you launch it.
So with some tweak, we now have a request that will ask to only show INTERRUPTED events grouped by key to find what iis messed up.

Result

mk-query-digest_interrupted_bykey.png
We now see that one particular key take 116 sec, the others are bellow 1 sec.
By using phpMemcachedAdmin, to make a get request with theses key, nothing seemed wrong, values were returned very quickly, so my first clue about connection closed seems right.

Another thing i note is that using hash for memcached key is a good idea, but our developers need to put some ID before, like projectA_aee2ff …., because looking a theses key, i can’t see what it refers to.

Looking for big return values

We can also look for response time of big values, or find values that will go beyond the 1MBytes Memcached limit.

Results

I only get one result for this particular request mk-query-digest_bigvalues.png

This key is exactly at the Memcached maximum object size, and must be watched and/or reduced in size.

Useful Options

-save-results
Save results to the specified file.

-timeline
Show a timeline of events.

mk-query-digest_timeline.png

-watch-server
This option tells mk-query-digest which server IP address and port (like “10.0.0.1:3306”) to watch when parsing tcpdump

Resources

Tcpdump homepage : www.tcpdump.org
Tcpdump filters explanation : http://www.cs.ucr.edu/~marios/ethereal-tcpdump.pdf
Maatkit homepage : http://www.maatkit.org/
Mk-query-digest homepage : http://www.maatkit.org/doc/mk-query-digest.html

Other Memcached Resources

You can find a lot of Memcached resources on this Memcached resources page.

Leave a Reply