This page provides guidance and best practices for consuming CockroachDB logs to surface critical information. It outlines how to externalize logs. For details on which log messages to externalize and how to interpret them, refer to Critical Log Messages.
Types of logs
CockroachDB has two types of logs: structured and unstructured.
Structured logs
Structured logs contain detailed, machine-readable information that can be easily parsed and analyzed. They are particularly useful for:
- Notable Events: Capturing significant events in a structured format allows for easier querying and analysis. For example, job state changes or specific events like backups.
- Audit and Compliance: Structured logs are ideal for audit trails and compliance monitoring because they provide a consistent format that can be easily searched and filtered.
- Performance Monitoring: When tracking performance metrics, structured logs can help in identifying patterns and anomalies due to their detailed and consistent format.
Unstructured logs
Unstructured logs are more free-form and human-readable compared to machine-processed structured logs. They are particularly useful for:
- General Troubleshooting: They are useful for capturing a wide range of information that may not fit into a predefined structure, such as error messages or stack traces. Events not documented on Notable Events appear in log messages in an unstructured format.
Storage considerations for file sinks
Log sinks route events from specified logging channels to destinations outside CockroachDB. These destinations currently include file sinks, network sinks (Fluentd-compatible servers and HTTP servers), and the standard error stream (stderr).
With a file sink, when provisioning storage for your CockroachDB cluster, decide where the log files will be stored: either on the same volume as the main data store or on a separate volume. To determine this, measure and consider the following factors:
- How I/O-intensive your workload is.
- How many log messages are written to files on disk. This number can be:
  - Managed and configured with the logging YAML file. Log and audit only what you need (see the sketch after this list).
  - Controlled or reduced by outputting logs to a network sink.
- The operational overhead and cost of provisioning a separate volume.
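For example, a minimal logging YAML sketch that writes only selected channels to files on disk (the directory path and channel selection are illustrative assumptions, not recommendations):

```yaml
# Illustrative sketch: write only the channels you need to files on disk,
# and keep developer output on stderr.
file-defaults:
  dir: /var/log/cockroach        # assumed log directory; adjust for your deployment
sinks:
  file-groups:
    ops:
      channels: [OPS, HEALTH]    # log and audit only what you need
  stderr:
    channels: [DEV]
```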
For greater disk resilience, consider the following:
- Write logs to a network sink. However:
  - This may not be possible in your environment. You need a deployed network sink (a Fluentd-compatible or HTTP server).
  - This can create more operational overhead, because you need to manage that network sink.
  - This impacts troubleshooting with cockroach debug zip, which relies on log files written to disk. Collecting these log files is critical when submitting issues to Cockroach Labs support. There is a tradeoff between disk resilience and observability with more collected logs.
- Use the buffering option for the file sink if you do not need auditing.
- Use fine-grained auditing (Table-based SQL Audit Logging or Role-based SQL Audit Logging) to reduce I/O and potential impact during disk stalls.
Without buffering enabled, CockroachDB can hold mutexes while logging a message during a disk stall, which could impact workload stability. Therefore, reducing how frequently logs are written to disk, or enabling buffering, can increase workload resilience.
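For example, a minimal sketch of a buffered network sink (the server address, channel selection, and buffer values are illustrative assumptions):

```yaml
# Illustrative sketch: send selected channels to a Fluentd-compatible network sink
# with buffering, so log writes do not depend on the local disk.
sinks:
  fluent-servers:
    ops:
      channels: [OPS, HEALTH]
      address: 127.0.0.1:5170    # assumed address of your Fluentd-compatible server
      net: tcp
      buffering:
        max-staleness: 5s
        flush-trigger-size: 1MiB
        max-buffer-size: 50MiB
```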
Provision a separate volume for message logging
Segregating database transaction logging (the LSM in CockroachDB) from all other I/O, particularly from the small-sized, sequential, non-stop message logging, is a standard practice for databases. Ideally, the CockroachDB LSM store should be placed on a dedicated volume to ensure that no other I/O interferes with the primary storage I/O. The I/O profile of CockroachDB LSM is significantly different from the sequential, small-size append I/O of message logging, so it is crucial to prevent the latter from interfering with the former.
To achieve this, move message logging I/O to a different device. The most suitable device depends on your platform. For bare metal, the OS disk may be an appropriate location. In a cloud environment, such as AWS, you might consider using an inexpensive GP3 volume with no additional IOPS provisioned, since logging does not require much I/O. The goal is to segregate message logging not because it consumes significant I/O bandwidth, but because its I/O profile is different and could disrupt LSM I/O.
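For example, a sketch that points log files at a separate, inexpensive volume (the mount point is an assumption; choose a location appropriate to your platform):

```yaml
# Illustrative sketch: keep message logging I/O off the volume that holds the LSM store.
file-defaults:
  dir: /mnt/crdb-logs    # assumed mount point of a small, low-IOPS volume (e.g., GP3 on AWS)
```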
Regarding the storage size for message logging, you will not need much space. The amount of message logging is strictly controlled, so it will not grow uncontrollably. Therefore, it is acceptable to use the OS disk as long as it persists through power-off events. The message log files are managed by log rotation, with the maximum size of each log file and of each log file group configurable using the max-file-size and max-group-size file sink parameters, respectively. Calculate the required space based on the default settings, or adjust the settings to fit your disk capacity. Log rotation will ensure that the space is maintained.
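For example (the values shown are illustrative, not the defaults of any particular version), capping each file at 10 MiB and each file group at 100 MiB bounds the space used by a group to roughly 100 MiB:

```yaml
# Illustrative sketch: bound disk usage for log files via rotation parameters.
file-defaults:
  max-file-size: 10MiB     # rotate to a new log file after ~10 MiB
  max-group-size: 100MiB   # remove the oldest files once a file group exceeds ~100 MiB
```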
Use json format with third-party tools
When third-party tools consume logs programmatically, use the json format rather than parsing the crdb_v2 format. The JSON object is guaranteed to not contain unescaped newlines or other special characters, and the entry as a whole is followed by a newline character. This makes the format suitable for processing over a stream unambiguously.
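For example, a sketch that emits JSON for a file group intended for programmatic consumption (the group name and channel selection are illustrative):

```yaml
# Illustrative sketch: emit JSON so downstream tools can parse each entry unambiguously.
sinks:
  file-groups:
    json-logs:
      channels: all
      format: json
```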
Prioritize logs based on severity
The various log formats are guaranteed to contain a severity level for each log event.
- With the json format, if tag-style: compact is specified, use the sev field; if tag-style: verbose is specified, use the severity field.
- With the crdb_v2 format, use the first character of the prefix of each log entry. Possible values are I for INFO, W for WARNING, E for ERROR, and F for FATAL.
Use logs for performance tuning
Use the SQL_EXEC and SQL_PERF log channels to examine SQL queries and filter slow queries in order to optimize or troubleshoot performance. Refer to Performance tuning. It is also possible to log SQL statistics to Datadog, which uses the TELEMETRY channel.
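One way to keep these channels separate for analysis is to route them to their own file group; a sketch (the group name and directory are illustrative assumptions):

```yaml
# Illustrative sketch: keep query-performance logs in a dedicated file group.
sinks:
  file-groups:
    sql-perf:
      channels: [SQL_EXEC, SQL_PERF]
      dir: /var/log/cockroach/sql    # assumed path
```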
Use logs for security and compliance
Use the SESSIONS, USER_ADMIN, PRIVILEGES, and SENSITIVE_ACCESS log channels to monitor connection and authentication events, changes to user/role administration and privileges, and any queries on audited tables. Refer to Security and audit monitoring.
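For example, a sketch that collects these channels in a dedicated file group so they can be retained and reviewed separately (the group name is illustrative; see Customize buffering of log messages below for making such a sink auditable):

```yaml
# Illustrative sketch: collect security-related channels in their own file group.
sinks:
  file-groups:
    security:
      channels: [SESSIONS, USER_ADMIN, PRIVILEGES, SENSITIVE_ACCESS]
```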
Consider workload performance impact
Enabling logging of events in the following log channels may have a performance impact:
- SQL_EXEC: Logging cluster-wide executions by enabling the sql.trace.log_statement_execute cluster setting will incur considerable overhead and may have a negative performance impact.
- SQL_PERF: Setting sql.log.slow_query.latency_threshold to a non-zero time enables tracing on all queries, which impacts performance. After debugging, set the value back to 0s to disable the log.
- SESSIONS: Logging of client connection and session authentication events is enabled by the server.auth_log.sql_connections.enabled and server.auth_log.sql_sessions.enabled cluster settings, respectively. These logs perform one disk I/O per event. Enabling each setting will impact performance.
- SENSITIVE_ACCESS: Enabling Table-based SQL Audit Logging or Role-based SQL Audit Logging can negatively impact performance. Log only what you require to limit the impact on your workload. Use this channel for security purposes only.
Customize buffering of log messages
Depending on the use case, a log sink can be configured to be auditable or buffered. However, there is a tradeoff between auditing requirements and performance.
Auditable
In the case of security-related logs, use the logging YAML file to configure the log sink to be auditable, by setting auditable to true. This guarantees non-repudiability for any logs in the sink, but can incur a performance overhead and higher disk IOPS consumption. When auditable is enabled:
- exit-on-error is enabled, which stops the CockroachDB node if an error is encountered while writing to the sink. This prevents the loss of any log entries.
- buffered-writes is disabled if the sink is under file-groups.
File-based audit logging cannot coexist with the buffering configuration, so disable either buffering or auditable.
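A minimal sketch of an auditable file sink (the group name and channel selection are illustrative):

```yaml
# Illustrative sketch: an auditable file sink for security-related channels.
sinks:
  file-groups:
    security:
      channels: [SENSITIVE_ACCESS, USER_ADMIN, PRIVILEGES]
      auditable: true    # enables exit-on-error and disables buffered-writes for this file sink
```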
Buffered
Use the logging YAML file to configure buffering settings to optimize log performance; refer to Log buffering for network sinks. For example, modify the following buffering options:
- max-staleness: The maximum time logs can stay in the buffer before being flushed.
- flush-trigger-size: The size threshold that triggers a buffer flush.
- max-buffer-size: The maximum buffer size. If this limit is exceeded, new log messages are dropped until the buffer size falls below this value.
Disable buffering for specific log channels if needed, by setting buffering: NONE for a given channel.
Override default buffering settings for specific channels to ensure timely log flushing.
File-based audit logging cannot coexist with the buffering configuration, so disable either buffering or auditable.
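A sketch that combines these options on a network sink, with buffering disabled for one sink (the server address and values are illustrative assumptions):

```yaml
# Illustrative sketch: buffer most channels, but flush OPS immediately.
sinks:
  fluent-servers:
    health:
      channels: [HEALTH, SQL_PERF]
      address: 127.0.0.1:5170      # assumed Fluentd-compatible server
      buffering:
        max-staleness: 20s         # flush at least every 20 seconds
        flush-trigger-size: 1MiB   # flush once 1 MiB of messages is buffered
        max-buffer-size: 50MiB     # drop new messages if the buffer exceeds 50 MiB
    ops:
      channels: [OPS]
      address: 127.0.0.1:5170
      buffering: NONE              # disable buffering for this sink
```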
For detailed configurations and examples, refer to Configure Logs.
Use epoch timestamp
By default, the log output format json has a timestamp field that contains epoch values for backward compatibility. When sending log output to a third-party log collector, the log collector can be configured to transform the epoch values in the timestamp field into a human-readable format.
When inspecting a json formatted log file produced by CockroachDB, you can use the cockroach debug merge-logs command to convert the log into crdb-v1 format, which includes timestamps in the rfc3339 format, for example "2006-01-02T15:04:05.999999999Z".
There is an optional datetime field for the json format, which contains values in a human-readable format. However, enabling the datetime field introduces CPU overhead.