Just_J

StatsD and [Insight]


From reading the forum and scouring the Internet, it appears that rippled monitoring information is a bit scarce - so I'm hoping someone might have a super-secret document out there.

I have an implementation of StatsD running and am aggregating/normalizing the metrics via some custom Python code for our two nodes and validators.
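
For reference, the export side of this is driven by the [insight] stanza in rippled.cfg - a minimal sketch, where the address and prefix are placeholders and the keys follow the bundled example config as I understand it:

    [insight]
    server=statsd
    address=127.0.0.1:8125
    prefix=validator01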

So far, so good. BUT ...

What would really be great is a document outlining the operational metrics rippled emits to the [insight] server, along with the scale and unit of each value. I'm sure at some point in the development cycle a schema was created - right?

If not, I will obviously just go to the logs and derive/interpret what I need manually ... but it would be nice to have a manual!

 


Quite a while ago I wrote a wiki article about monitoring rippled, but I guess by now it is a bit outdated. These days monitoring is done via Prometheus, not Graphite or StatsD.

There are not many metrics exported, and they seem quite straightforward anyway, but in general I'm afraid you'll have a hard time reliably monitoring what is actually going on with a rippled server and whether it is currently in a failure state.


Oh I read your article @Sukrim :)

I was hoping someone at Ripple had a metrics schema handy ... but it sounds like perhaps monitoring information has either not been developed, or not been made available for public usage.

 

7 hours ago, Sukrim said:

Quite a while ago I wrote a wiki article about monitoring rippled, but I guess by now it is a bit outdated. These days monitoring is done via Prometheus, not Graphite or StatsD.

There are not many metrics exported, and they seem quite straightforward anyway, but in general I'm afraid you'll have a hard time reliably monitoring what is actually going on with a rippled server and whether it is currently in a failure state.

Just to make sure I understand what you mean - are you saying that your own monitoring is now done through Prometheus, or that Ripple has stated they are supporting the Prometheus platform for operational metrics going forward?

I mean, what comes out of the rippled stream is what comes out of the stream ... hmmmm ... time for a packet capture
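
Actually, rather than a full capture, a tiny UDP listener would show the raw StatsD lines directly - a rough sketch, assuming the default StatsD port 8125 (point the [insight] address at it temporarily):

    # Minimal listener to inspect the raw StatsD datagrams rippled emits.
    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 8125))
    print("listening for StatsD datagrams on udp/8125 ...")
    while True:
        data, addr = sock.recvfrom(65535)
        # StatsD datagrams are plain text: one "name:value|type" entry per line
        for line in data.decode("utf-8", errors="replace").splitlines():
            print(addr[0], line)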


I mean that anyone doing monitoring with something other than Prometheus had better have a VERY good reason for doing so that is not "legacy". Ripple is sorely lacking in anything defined beyond the code (packaging, logging, deployment, monitoring...).

Unfortunately rippled is not the easiest code base to work with, and even if I took the time and effort to get my C++ up to speed and implemented a whole Prometheus endpoint in there (by the way, they wrote their own, slightly ugly "/crawl" endpoint that also exposes some metrics; it is nearly undocumented and leaks your whole peer list by default), there is no way of knowing whether it would even make it upstream. I don't want to do this on a whim, and the feature suggestions and discussions I've seen on GitHub so far were not encouraging at all.

24 minutes ago, Sukrim said:

I mean that anyone doing monitoring with something other than Prometheus had better have a VERY good reason for doing so that is not "legacy". Ripple is sorely lacking in anything defined beyond the code (packaging, logging, deployment, monitoring...).

Unfortunately rippled is not the easiest code base to work with, and even if I took the time and effort to get my C++ up to speed and implemented a whole Prometheus endpoint in there (by the way, they wrote their own, slightly ugly "/crawl" endpoint that also exposes some metrics; it is nearly undocumented and leaks your whole peer list by default), there is no way of knowing whether it would even make it upstream. I don't want to do this on a whim, and the feature suggestions and discussions I've seen on GitHub so far were not encouraging at all.

Gotcha ... thanks, @Sukrim, for your input. My C++ coding days are long behind me, so I too would not be able to take on such an effort; I only work in higher-level languages these days. I was hoping Ripple had published a schema or API beyond what is currently in the wild.

As far as the leak of IPs goes - I think you are referring to the peer crawl method ... I agree that is a problem, as no information about internal/private networking should ever be exposed, given the security implications. I hope that is fixed as early as .81 ... It would be fantastic if the IP address output of the peer crawl were limited to passing that internal IP information only to configured clustered nodes. I can see how a peer crawl that exposes internal IPs only to configured clustered nodes might be useful in the case of internal nameserver issues - but even that may not help much unless you have a large number of clustered nodes and rely solely on name resolution for peer communication outside of your Docker/Kubernetes/VM stack configuration.


I don't really understand what you are referring to by "peer crawl method". I mean the /crawl endpoint when talking to a rippled server on its peer port (usually 51235). Take some server from https://peers.ripple.com (e.g. 169.54.137.6), then call https://IP_ADDRESS:51235/crawl (e.g. https://169.54.137.6:51235/crawl), ignore the certificate error, and you'll get a list of IPs and node IDs, as well as other information about that server's direct peers.
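
For example, roughly like this in Python (the overlay/active field names are what I have seen in responses; they may differ between rippled versions):

    # Fetch the /crawl endpoint on a rippled peer port and list its direct peers.
    import requests
    import urllib3

    urllib3.disable_warnings()  # the peer port uses a self-signed certificate

    resp = requests.get("https://169.54.137.6:51235/crawl", verify=False, timeout=10)
    for peer in resp.json().get("overlay", {}).get("active", []):
        print(peer.get("ip", "n/a"), peer.get("public_key", ""), peer.get("version", ""))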


I've moved all stat collection to Prometheus and Grafana dashboards and am using the information to further hone the container/pod configurations of our validators and nodes ...
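
In case it helps anyone else: one way to get the aggregated numbers into Prometheus is a small exporter on top of the existing Python code - a rough sketch using the prometheus_client library, where the metric name, label, and port are placeholders rather than rippled's actual StatsD names:

    # Expose aggregated rippled stats on a /metrics endpoint for Prometheus to scrape.
    import time
    from prometheus_client import Gauge, start_http_server

    PEER_COUNT = Gauge("rippled_peer_count", "Connected peers reported by rippled", ["host"])

    def poll_peer_count(host):
        # stand-in for parsing the aggregated StatsD data or rippled logs
        return 10

    if __name__ == "__main__":
        start_http_server(9100)  # Prometheus then scrapes http://<host>:9100/metrics
        while True:
            PEER_COUNT.labels(host="validator01").set(poll_peer_count("validator01"))
            time.sleep(15)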

In doing so, I've come across a pattern of event timing/occurrence that has me scratching my head ...

Does anyone ( @nikb ) know the significance of, or reason behind, an interval of 2m 50s (2:50)?

It seems to be internal job processes being triggered at that period, but it seems like odd timing ... unless, of course, it's a derivative of the underlying consensus round processing.

Or it could just be something I'm inducing in my collection and parsing methods (which is probably much more likely).
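
To rule that out, I can at least check the spacing directly from the event timestamps - a quick sketch (the timestamps below are made-up stand-ins for the parsed samples):

    # Check the gaps between suspicious events; ~170s would confirm the 2m 50s pattern.
    from datetime import datetime

    stamps = ["2017-08-01T12:00:10", "2017-08-01T12:03:00", "2017-08-01T12:05:50"]
    times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S") for s in stamps]

    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    print(gaps)  # [170.0, 170.0]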

Any ideas?
