2025-02-24
a few days ago, I was tailing the nginx logs on a kubernetes ingress controller, and I needed to figure out the rate of requests going to each upstream service.
while the rate calculation is another story, I needed to parse the logs first.
the kubernetes nginx ingress controller's log format is called upstreaminfo, and it extends the "standard" combined access_log format.
log_format combined '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent"';
log_format upstreaminfo
'$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'$request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr '
'$upstream_response_length $upstream_response_time $upstream_status $req_id';
notice how upstreaminfo keeps its first part exactly the same as combined? one could assume this is simply to ensure existing parsers do NOT break.
on a side note: i've had to reset this log_format back to upstreaminfo $additional_fields because people were configuring their own fields (e.g., cache hit status) right in the middle of other fields, breaking existing queries that expected the log format to be upstreaminfo. i digress.
as you can see in the log format above, spaces are NOT a good field separator (e.g., a $request typically looks like GET /abc, $http_user_agent contains spaces, etc.).
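to see the breakage concretely, here's a quick sketch with a made-up combined-style log line (all values are illustrative); plain whitespace splitting tears the quoted request and user agent apart:

```shell
# a made-up combined-style log line: naive space-splitting mangles it
echo '127.0.0.1 - - [24/Feb/2025:10:00:00 +0000] "GET /abc HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux)"' |
  awk '{ print NF }'
# prints 14, instead of the 9 logical fields the combined format actually has
```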
the FPAT variable is what I learned about. instead of describing the separator between fields, it lets you describe the fields themselves as "field patterns", i.e., with "regex" :magic_spongebob_rainbow:
FPAT is specific to GNU awk (gawk), docs here. awk itself is part of POSIX, but FPAT is a gawk extension and not in the POSIX spec.
blatantly copying from the docs, assume the following input
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
FPAT = "([^,]+)|(\"[^\"]+\")"
this FPAT expression alternates between two patterns:
[^,]+: a non-empty run of non-comma characters
\"[^\"]+\": starts with ", then a non-empty run of non-" characters, ends with "
and this program should parse the entire record:
BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
{
print "NF = ", NF
for (i = 1; i <= NF; i++) {
printf("$%d = <%s>\n", i, $i)
}
}
and this should give us
NF = 7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>
assume the same program as above, with only the FPAT variable changed as specified; each input below then produces the output shown.
FPAT = "([^,]+)|(\"[^\"]+\")"
Robbins,Arnold,"1234 ""A Pretty Street"", NE",MyTown,MyState,12345-6789,USA
NF = 8
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 ""A Pretty Street"">
$4 = < NE">
$5 = <MyTown>
$6 = <MyState>
$7 = <12345-6789>
$8 = <USA>
FPAT = "([^,]+)|(\"[^\"]+\")|(\\([^\\)]+\\))"
Robbins,Arnold,"1234 (A Pretty Street), NE",MyTown,(MyState,12345-6789),USA
NF = 6
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 (A Pretty Street), NE">
$4 = <MyTown>
$5 = <(MyState,12345-6789)>
$6 = <USA>
hmm; it seems to prefer the longest match? but how does it know that , is also a delimiter and not part of the field?
elementary. a "field pattern" replaces the "field separator": when FPAT is set, fields are whatever matches the pattern, and everything between matches is treated as separator.
FPAT = "[A-Z]+"
AAA hahaha BBB xxyyzz CCC
NF = 3
$1 = <AAA>
$2 = <BBB>
$3 = <CCC>
okay, and how about the order?
fp1 = "(\"[^\"]*\")"
fp2 = "(\\([^\\)]*\\))"
FPAT = fp1 "|" fp2
ABC"def(ghi"jkl)MNO
NF = 1
$1 = <"def(ghi">
and what if we did
FPAT = fp2 "|" fp1
still the same output.
NF = 1
$1 = <"def(ghi">
this effectively means the order of the alternation doesn't matter; the leftmost match in the input wins. and what if we have two patterns that match at the same position?
fp1 = "[AB]+"
fp2 = "[ABCD]+"
FPAT = fp1 "|" fp2
zzzAABBCCDDzzz
NF = 1
$1 = <AABBCCDD>
again, it doesn't matter if we flip to FPAT = fp2 "|" fp1. at the same starting position, it's the longest match that wins.
summarizing: the leftmost match wins, and ties at the same position go to the longest match.
wait a minute, isn't this just how posix regex matching works (leftmost-longest)? ugh. such a waste of time /s.
from this, you can derive how to parse the nginx upstreaminfo access log (and, by extension, combined). here, i am printing out the proxy_upstream_name field after parsing each line.
BEGIN {
    fp1 = "\\[[^]]*]"   # [...] bracketed fields
    fp2 = "\"[^\"]*\""  # "..." quoted fields
    fp3 = "[^ ]+"       # bare fields: a run of non-space characters
    FPAT = fp1 "|" fp2 "|" fp3
    expected_field_count = 18
}
NF != expected_field_count { next }
{
    # $2 is the literal "-" between remote_addr and remote_user
    remote_addr = $1
    remote_user = $3
    time_local = $4
    request = $5
    status = $6
    body_bytes_sent = $7
    http_referer = $8
    http_user_agent = $9
    request_length = $10
    request_time = $11
    proxy_upstream_name = $12
    proxy_alternative_upstream_name = $13
    upstream_addr = $14
    upstream_response_length = $15
    upstream_response_time = $16
    upstream_status = $17
    req_id = $18
    # additional fields
    # host=$19
    # upstream_cache_status=$20
    print proxy_upstream_name
}
here’s the stern command i used for tailing all pods’ logs.
stern --context=$cluster -n ingress-nginx $ngx_deployment_name -c controller --since=1s --only-log-lines=true --output=raw |\
rg '"(GET|HEAD|POST|PUT|PATCH|DELETE|OPTIONS)' |\
gawk -f upstreaminfo.awk