2025-02-24
a few days ago, I was tailing the nginx logs on a kubernetes ingress controller, and I needed to figure out the rate of requests going to each upstream service.
while the rate calculation is another story, I needed to parse the logs first.
the kubernetes nginx ingress controller's log format is called upstreaminfo, and it extends the "standard" combined access_log format.
log_format combined '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent"';
log_format upstreaminfo
'$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'$request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr '
'$upstream_response_length $upstream_response_time $upstream_status $req_id';
notice how upstreaminfo keeps its first part exactly the same as combined? one could assume this is simply to ensure existing parsers do NOT break.
on a side note: i've had to reset this log_format back to upstreaminfo $additional_fields because people were configuring their own fields (e.g., cache hit status) right in the middle of other fields, breaking existing queries that expected the log format to be upstreaminfo. i digress.
as you can see in the log format above, spaces are NOT a good field separator (e.g., a $request typically looks like GET /abc, $http_user_agent contains spaces, etc.).
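to see the breakage concretely, here's a quick sketch with a made-up combined-style log line (all values are illustrative); plain whitespace splitting tears the quoted request and user agent apart:

```shell
# a made-up combined-style log line: naive space-splitting mangles it
echo '127.0.0.1 - - [24/Feb/2025:10:00:00 +0000] "GET /abc HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"' |
  awk '{ print NF }'
# prints 14, instead of the 9 logical fields the combined format actually has
```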
the FPAT variable is what I learned about. instead of describing the separator between fields, it lets you describe the fields themselves as "field patterns", i.e., with "regex" :magic_spongebob_rainbow:
FPAT is specific to GNU awk (gawk), docs here. awk itself is part of POSIX, but FPAT is a gawk extension and not in the POSIX spec.
blatantly copying from the docs, assume the following input
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
FPAT = "([^,]+)|(\"[^\"]+\")"
this FPAT expression alternates between two patterns:
[^,]+: a non-empty run of non-comma characters
\"[^\"]+\": starts with ", then a non-empty run of non-" characters, ends with "
and this program should parse the entire record:
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
and this should give us
NF = 7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>
assume the same program as above, with only the FPAT variable changed as specified; each input below then produces the output shown.
FPAT = "([^,]+)|(\"[^\"]+\")"
Robbins,Arnold,"1234 ""A Pretty Street"", NE",MyTown,MyState,12345-6789,USA
NF = 8
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 ""A Pretty Street"">
$4 = < NE">
$5 = <MyTown>
$6 = <MyState>
$7 = <12345-6789>
$8 = <USA>
FPAT = "([^,]+)|(\"[^\"]+\")|(\\([^\\)]+\\))"
Robbins,Arnold,"1234 (A Pretty Street), NE",MyTown,(MyState,12345-6789),USA
NF = 6
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 (A Pretty Street), NE">
$4 = <MyTown>
$5 = <(MyState,12345-6789)>
$6 = <USA>
hmm; it seems to prefer the longest match? but how does it know that , is also a delimiter and not part of the field?
elementary. a "field pattern" replaces the "field separator": when FPAT is set, fields are whatever matches the pattern, and everything between matches is treated as separator.
FPAT = "[A-Z]+"
AAA hahaha BBB xxyyzz CCC
NF = 3
$1 = <AAA>
$2 = <BBB>
$3 = <CCC>
okay, and how about the order?
fp1 = "(\"[^\"]*\")"
fp2 = "(\\([^\\)]*\\))"
FPAT = fp1 "|" fp2
ABC"def(ghi"jkl)MNO
NF = 1
$1 = <"def(ghi">
and what if we did
FPAT = fp2 "|" fp1
still the same output.
NF = 1
$1 = <"def(ghi">
this effectively means the order of the alternation doesn't matter; the leftmost match in the input wins. and what if we have two patterns that match at the same position?
fp1 = "[AB]+"
fp2 = "[ABCD]+"
FPAT = fp1 "|" fp2
zzzAABBCCDDzzz
NF = 1
$1 = <AABBCCDD>
again, it doesn't matter if we flip to FPAT = fp2 "|" fp1. at the same starting position, it's the longest match that wins.
summarizing: the leftmost match wins, and ties at the same position go to the longest match.
wait a minute, isn't this just how posix regex matching works (leftmost-longest)? ugh. such a waste of time /s.
from this, you can derive how to parse the nginx upstreaminfo access log (and, by extension, combined). here, i am printing out the proxy_upstream_name field after parsing each line.
BEGIN {
    fp1 = "\\[[^]]*]"   # [...] bracketed fields
    fp2 = "\"[^\"]*\""  # "..." quoted fields
    fp3 = "[^ ]+"       # bare fields: a run of non-space characters
    FPAT = fp1 "|" fp2 "|" fp3
    expected_field_count = 18
}
NF != expected_field_count { next }
{
    # $2 is the literal "-" between remote_addr and remote_user
    remote_addr = $1
    remote_user = $3
    time_local = $4
    request = $5
    status = $6
    body_bytes_sent = $7
    http_referer = $8
    http_user_agent = $9
    request_length = $10
    request_time = $11
    proxy_upstream_name = $12
    proxy_alternative_upstream_name = $13
    upstream_addr = $14
    upstream_response_length = $15
    upstream_response_time = $16
    upstream_status = $17
    req_id = $18
    # additional fields
    # host=$19
    # upstream_cache_status=$20
    print proxy_upstream_name
}
here’s the stern command i used for tailing all pods’ logs.
stern --context=$cluster -n ingress-nginx $ngx_deployment_name -c controller --since=1s --only-log-lines=true --output=raw |\
rg '"(GET|HEAD|POST|PUT|PATCH|DELETE|OPTIONS)' |\
gawk -f upstreaminfo.awk