Nhse o34 orkv.i58 refactor #59

Open
wants to merge 2 commits into
base: openriak-3.4

Conversation

martinsumner

#58

Changes:

  • Updates the cache so that the timestamp stored with the cached stats is the time the stats were returned, not the time they were requested. This protects against a backlog at the cache in which every queued request requires its own stats fetch because the fetch is taking longer than 1s. The cache time is also now configurable through an environment variable.
  • Refactors the riak_kv_status:aliases/0 function to loop more efficiently, without the multitude of calls to orddict:append/3 that feature in the eprof profile of the stats call. The change still requests the values for a list of datapoints under each key, rather than making separate fetches, to avoid, for example, multiple function calls to get vnodeq stats. The change relies on exometer_alias:prefix_foldl/3 being an ordered fold.
  • Removes the sys monitor count from the standard stats altogether, as the cost of producing it far outweighs the benefit. The function that produces the stat is moved to the riak_kv_util module, alongside other troubleshooting functions, should someone wish to call it manually. Known timeouts in production have been confirmed to be directly related to this stat.
  • Increases the default timeout on the call to riak_kv_http_cache, makes it configurable within the query, and reports a friendlier error back to the operator when the timeout hits (rather than an Erlang stack trace).
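
A minimal sketch of the first change, to show why the position of the timestamp matters. This is not the PR's code: the module name, the `{Timestamp, Stats}` state shape, and the injected `NowFun` and `FetchFun` are assumptions made so the example runs on its own.

```erlang
-module(stats_cache_sketch).
-export([check_cache/4]).

%% State is {Then, Stats}, with Then in seconds.
%% NowFun and FetchFun are hypothetical stand-ins for the clock and
%% for the real stats fetch.
-spec check_cache({integer(), term()}, non_neg_integer(),
                  fun(() -> integer()), fun(() -> term())) ->
    {integer(), term()}.
check_cache({Then, _Stats} = State, CacheSeconds, NowFun, FetchFun) ->
    case NowFun() - Then < CacheSeconds of
        true ->
            %% Cached stats are still fresh, so reuse them.
            State;
        false ->
            Stats = FetchFun(),
            %% Timestamp taken AFTER the fetch has returned, so requests
            %% that queued up during a slow fetch still hit the cache.
            {NowFun(), Stats}
    end.
```

If the timestamp were read before the fetch, a fetch lasting longer than the cache time would leave the cache already expired for every request waiting behind it.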

Optimisations:

- Take the timestamp for the cache after get_stats/0 has returned, so that if get_stats/0 takes > 1s, any requests in the queue for riak_kv_http_cache will still use the cache.

- Refactor riak_kv_status:aliases/0 to use simple lists rather than orddict.

- Remove the sys_monitor_count altogether; it is simply too expensive. It is available as a riak_kv_util module function instead, for the experienced operator.

The stats call currently takes more than 5s on some systems, and the timeout then dumps an unhelpful crash dump on the user. Make the default timeout longer, allow the operator to pass the timeout, and return a more operator-friendly error when a timeout occurs.
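
One way the operator-friendly timeout handling could look, as a sketch only. The wrapper name `safe_call/3` and its return shape are hypothetical; the only library behaviour relied on is that `gen_server:call/3` exits with `{timeout, _}` when no reply arrives in time.

```erlang
-module(stats_timeout_sketch).
-export([safe_call/3]).

%% Hypothetical wrapper: Server, Request and the return shape are
%% illustrative and not taken from the PR.
-spec safe_call(pid() | atom(), term(), non_neg_integer()) ->
    {ok, term()} | {error, timeout}.
safe_call(Server, Request, TimeoutMs) ->
    try gen_server:call(Server, Request, TimeoutMs) of
        Reply -> {ok, Reply}
    catch
        %% Turn the exit into a value the HTTP layer can report,
        %% instead of letting a stack trace reach the operator.
        exit:{timeout, _} -> {error, timeout}
    end.
```

The caller can then map `{error, timeout}` to a readable response body.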
@martinsumner
Author

OpenRiak/riak_test#35


start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

get_stats() ->
    gen_server:call(?MODULE, get_stats).
get_stats(Timeout) ->

It would be nice to have either a type or a comment stating the unit in which the Timeout is given:
-spec get_stats(Milliseconds :: non_neg_integer()) -> ..... or similar

check_cache(#st{ts = Then} = S) ->
    CacheTime = application:get_env(riak_kv, http_stats_cache_seconds, 1),

One could be generous and make the configurable time milliseconds instead of coarse seconds. That would help if one ever wants to test these timeouts in a QuickCheck-like fashion.
Since this is a new parameter, setting the default to 1000 and having it provided in milliseconds does not harm backward compatibility.
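
A sketch of what the millisecond variant might look like. The parameter name `http_stats_cache_milliseconds` is hypothetical (the PR as posted reads `http_stats_cache_seconds` with a default of 1), and the expiry comparison is an assumption about how the value would be used.

```erlang
-module(cache_ms_sketch).
-export([cache_expired/2]).

%% http_stats_cache_milliseconds is a hypothetical parameter name.
%% ThenMs and NowMs are expected to come from the same clock, for
%% example erlang:monotonic_time(millisecond).
-spec cache_expired(integer(), integer()) -> boolean().
cache_expired(ThenMs, NowMs) ->
    CacheMs = application:get_env(riak_kv, http_stats_cache_milliseconds, 1000),
    NowMs - ThenMs >= CacheMs.
```

Keeping the clock values as arguments makes the expiry rule easy to exercise with generated timestamps in a property test.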

produce_body/2,
pretty_print/2
]).
-export([get_stats/0]).

-define(TIMEOUT, 30000).

with comment:

Suggested change
-define(TIMEOUT, 30000).
-define(TIMEOUT, 30000). %% in milliseconds

{true,
    wrq:append_to_resp_body(
        io_lib:format(
            "Bad timeout value ~0p",

Add the explanation "should be provided in milliseconds".
There is an argument for letting 0 be a valid value for timeout, basically stating, "asap".
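
Both suggestions could be combined roughly as follows. This is a sketch with hypothetical names (`parse_timeout/1`, the `{ok, Ms} | {error, Msg}` return shape); how the PR actually extracts and validates the query value is not shown in this conversation.

```erlang
-module(timeout_parse_sketch).
-export([parse_timeout/1]).

%% Hypothetical helper: parse a timeout query value given as a string.
%% 0 is accepted and means "reply as soon as possible".
-spec parse_timeout(string()) -> {ok, non_neg_integer()} | {error, string()}.
parse_timeout(Str) ->
    try list_to_integer(Str) of
        Ms when Ms >= 0 -> {ok, Ms};
        _Negative -> {error, bad_timeout(Str)}
    catch
        %% list_to_integer/1 raises badarg for non-numeric input.
        error:badarg -> {error, bad_timeout(Str)}
    end.

bad_timeout(Str) ->
    lists:flatten(
        io_lib:format(
            "Bad timeout value ~0p, should be provided in milliseconds",
            [Str])).
```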


%% @spec pretty_print(webmachine:wrq(), context()) ->
%%          {string(), webmachine:wrq(), context()}
%% @doc Format the response JSON object in a "pretty-printed" style.
pretty_print(RD1, C1=#ctx{}) ->
    {Json, RD2, C2} = produce_body(RD1, C1),
    {json_pp:print(binary_to_list(list_to_binary(Json))), RD2, C2}.

Was the reason for this that Json can be a deep list, and it is flattened this way to make json_pp:print work?
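
For reference, a small sketch of what that round trip does to a deep list. Whether Json really is a deep list here is the open question above, so the input in the usage note is purely illustrative.

```erlang
-module(iolist_flatten_sketch).
-export([to_flat_string/1]).

%% Flatten an iolist (nested lists, possibly containing binaries)
%% into a flat list of bytes, using the same round trip as
%% pretty_print/2 in the diff.
-spec to_flat_string(iolist()) -> string().
to_flat_string(IoList) ->
    binary_to_list(list_to_binary(IoList)).
```

For example, `to_flat_string(["{", [<<"\"a\"">>, $:, "1"], "}"])` yields the flat string `{"a":1}`. `lists:flatten/1` alone would not be enough if the list contains binaries, since it leaves them embedded as binary elements.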

AllStats =
    exometer_alias:prefix_foldl(
        <<>>,
        fun(Alias, Entry, DP, Acc) -> [{Entry, {DP, Alias}}|Acc] end,

Later on we assume that these entries are sorted, so that entries with the same key end up next to each other, right?
It would be good to state this assumption in a comment. Otherwise it looks odd that when
Entry == PrevEntry the new entry is "randomly" discarded.
A comment would explain this for future developers.
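
To illustrate the assumption, here is a sketch of grouping with plain lists instead of orddict. It is not the PR's implementation: the module and function names are hypothetical, and it only shows why the input must arrive with equal entries adjacent.

```erlang
-module(alias_group_sketch).
-export([group/1]).

%% Group [{Entry, DPAlias}] into [{Entry, [DPAlias]}].
%% ASSUMPTION: equal Entry keys are adjacent in the input, as they
%% would be if the fold that produced the list is ordered.
-spec group([{term(), term()}]) -> [{term(), [term()]}].
group(Pairs) ->
    lists:reverse(lists:foldl(fun add/2, [], Pairs)).

%% Same Entry as the previous element: extend the current group.
add({Entry, DPAlias}, [{Entry, Acc} | Rest]) ->
    [{Entry, [DPAlias | Acc]} | Rest];
%% Different Entry: start a new group.
add({Entry, DPAlias}, Groups) ->
    [{Entry, [DPAlias]} | Groups].
```

With unordered input such as `[{a, x}, {b, y}, {a, z}]`, the key `a` ends up in two separate groups, which is the failure mode a comment in the real code should warn about.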


2 participants