Gkrellm replacement
From Fernseher
Revision as of 21:40, 8 June 2008 by 66.188.205.220 (talk)
gkrellm/gkrellmd is the suxors.
Would be good to have a replacement in place. A couple of components would be nice.
- Logging daemons. These would sit on each host and collect data every interval and save the data to (network available and local) log files so I could keep track of how things have gone over time. Could also serve the collected data over a socket interface.
- Server side aggregator. This would sit on my webserver, get data from the logging daemons and the old log files, and make it available for a web client (or whatever) to display.
- Client side viewers. This might be an ajaxy program thingy that displays all the realtime data and allows navigation of the historical data as well.
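As a rough sketch of the "serve the collected data over a socket interface" idea, here is one way a logging daemon could hand its latest sample to the aggregator. Everything here is an assumption for illustration: the names `LatestSample`, `make_server`, and `serve_in_background` are made up, and the protocol (connect, receive one text line, connection closes) is just the simplest thing that could work.

```python
import socketserver
import threading


class LatestSample:
    """Thread-safe holder for the most recent sample line (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._line = ""

    def set(self, line):
        with self._lock:
            self._line = line

    def get(self):
        with self._lock:
            return self._line


def make_server(latest, host="127.0.0.1", port=0):
    """Build a TCP server that sends the latest sample to each client, then closes.

    port=0 lets the OS pick a free port; a real daemon would use a fixed one.
    """

    class Handler(socketserver.BaseRequestHandler):
        def handle(self):
            # One line per connection: the most recent sample, newline-terminated.
            self.request.sendall((latest.get() + "\n").encode("ascii"))

    server = socketserver.ThreadingTCPServer((host, port), Handler)
    server.daemon_threads = True
    return server


def serve_in_background(server):
    """Run the server loop in a daemon thread so sampling can continue."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread
```

The sampling loop would call `latest.set(...)` once per interval, and the server side aggregator would connect to each host to fetch the current line. A real version would probably want one stream per class of data rather than a single line.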
Things to keep track of:
- Temperatures
- fan speeds
- Voltages
- Processor speeds
- Processor load
- Number of processes
- memory utilization
- network traffic
- disk traffic
- battery state, rates, and remaining time estimates
- local date/time
- local hostname
- local uptime
- raid data
Where to find these things:
- /sys/class/hwmon/*/device/ (for temps, fan speeds, voltages)
- in#_label(I) in#_input(o)
- fan#_label(I) fan#_input(o) pwm#(o)
- temp#_label(I) temp#_input(o)
- assume X# for label, and assume nothing for pwm
- /sys/class/net/*/statistics/
- rx_bytes(o) rx_packets(o)
- tx_bytes(o) tx_packets(o)
- /sys/devices/system/cpu/*/cpufreq/ (for cpu frequencies)
- cpuinfo_max_freq(I) scaling_max_freq(I)
- cpuinfo_cur_freq(o) scaling_cur_freq(o)
- cpuinfo_min_freq(I) scaling_min_freq(I)
- /proc/acpi/thermal_zone/*/temperature
- /proc/acpi/battery/*/info
- design capacity(I)
- last full capacity(o)
- design voltage(I)
- design capacity warning(I)
- design capacity low(I)
- /proc/acpi/battery/*/state
- capacity state(o)
- charging state(o)
- present rate(o)
- remaining capacity(o)
- present voltage(o)
- `hostname`(I)
- `date`(o)
- /proc/version(I)
- /proc/loadavg(o)
- <1min> <5min> <15min> <cur>/<total> *blah*
- /proc/uptime(o)
- <up> <idle>
- /proc/mdstat(I)
- md# : active raidN <discs>
- blocks chunk size XX [4/4] UUUU
- md# : active raidN <discs>
- /proc/meminfo
- MemTotal(I)
- MemFree(o)
- Buffers(o)
- Cached(o)
- SwapCached(o)
- SwapTotal(I)
- SwapFree(o)
- /proc/diskstats(o)
- <maj> <min> <dev> x x <sectors read> x x x <sectors written> x x x x
- /proc/stat(o)
- cpu[#] <usr> <nice> <sys> <idle> *blah*
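A few of the sources above can be boiled down with very little code. The following is only a sketch under assumptions: the function names are made up, the parsers take the file contents as a string (so they can be tried without a live /proc), and processor load is computed from just the four /proc/stat columns listed above, ignoring whatever follows them.

```python
def read_sysfs_int(path):
    """Read one integer from a sysfs file such as temp1_input or rx_bytes."""
    with open(path) as f:
        return int(f.read().strip())


def parse_loadavg(text):
    """/proc/loadavg: '<1min> <5min> <15min> <cur>/<total> ...'."""
    fields = text.split()
    running, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "running": int(running),
        "processes": int(total),
    }


def parse_uptime(text):
    """/proc/uptime: '<up> <idle>', both in seconds."""
    up, idle = text.split()[:2]
    return float(up), float(idle)


def parse_meminfo(text, keys=("MemTotal", "MemFree", "Buffers", "Cached",
                              "SwapCached", "SwapTotal", "SwapFree")):
    """/proc/meminfo: 'Name:  value kB' per line; returns the values in kB."""
    wanted = set(keys)
    result = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in wanted and rest.split():
            result[name] = int(rest.split()[0])
    return result


def parse_cpu_times(text):
    """/proc/stat 'cpu[#] <usr> <nice> <sys> <idle> ...' -> {name: (busy, idle)}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            usr, nice, sys_, idle = (int(v) for v in fields[1:5])
            result[fields[0]] = (usr + nice + sys_, idle)
    return result


def cpu_load(prev, cur):
    """Fraction of time busy between two (busy, idle) snapshots."""
    busy = cur[0] - prev[0]
    idle = cur[1] - prev[1]
    total = busy + idle
    return busy / total if total > 0 else 0.0


def parse_diskstats(text):
    """/proc/diskstats -> {dev: (sectors_read, sectors_written)}.

    Lines with fewer than 14 fields are skipped rather than guessed at.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 14:
            result[fields[2]] = (int(fields[5]), int(fields[9]))
    return result
```

The counters in /proc/stat, /proc/diskstats, and the net statistics files only ever grow, so the daemon would keep the previous snapshot and log the difference per interval, as `cpu_load` does. Note that the hwmon temp#_input files report millidegrees Celsius, so a reading like 45000 means 45 degrees.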
Each class of thing should be distilled down to a small set of numbers that fit on one line per sample time, and each class gets its own log file and its own data header in the served stream. Each time the daemon starts, it should write to each log file the current date and time, the hostname, the kernel version, what it is keeping track of, and what each column of numbers means. It should also record the intended time interval between samples (though some error in the actual timing must be assumed).
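The log layout described above might look something like the sketch below. The "# " prefix for header lines, the leading timestamp column, and the function names are all assumptions made for the example, not a settled format.

```python
import time


def format_header(hostname, kernel_version, interval, columns, started=None):
    """Build the block written to a log file each time the daemon starts.

    hostname and kernel_version would come from `hostname` and /proc/version.
    """
    started = time.time() if started is None else started
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(started))
    lines = [
        "# started: " + stamp,
        "# hostname: " + hostname,
        "# kernel: " + kernel_version,
        "# interval: %g s (nominal)" % interval,
        "# columns: time " + " ".join(columns),
    ]
    return "\n".join(lines) + "\n"


def format_sample(timestamp, values):
    """One sample per line: integer Unix time followed by the numbers."""
    return "%d %s\n" % (timestamp, " ".join(str(v) for v in values))


def append_log(path, text):
    """Append to the log file, so restarts add a new header rather than truncate."""
    with open(path, "a") as f:
        f.write(text)
```

For the load average class, for example, the daemon could call `format_header(host, kernel, 5, ["load1", "load5", "load15", "processes"])` once at startup and then `format_sample(...)` every interval. Logging the actual sample time on each line sidesteps the timing error, since a reader never has to reconstruct times from the nominal interval.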