parsing the nginx access log with awk

2025-02-24

a few days ago, I was tailing the nginx logs on a kubernetes ingress controller, and I needed to figure out the rate of requests going to each upstream service.

while the rate calculation is another story, I needed to parse the logs first.

the kubernetes nginx ingress controller's log format, called upstreaminfo, extends the “standard” combined access_log format.

log_format combined '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent"';

log_format upstreaminfo
                    '$remote_addr - $remote_user [$time_local] '
                    '"$request" $status $body_bytes_sent '
                    '"$http_referer" "$http_user_agent" '
                    '$request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr '
                    '$upstream_response_length $upstream_response_time $upstream_status $req_id';

you see how upstreaminfo keeps its first part exactly the same as combined? one could assume this is simply to ensure existing parsers do NOT break.

on a side note: i’ve had to reset this log_format back to upstreaminfo $additional_fields because people were configuring their own fields (e.g., cache hit status) right in the middle of the standard ones, breaking existing queries that expected the log format to be upstreaminfo. i digress.

awk’s FPAT

as you can see in the log format above, spaces are NOT a good field separator: a $request typically looks like GET /abc HTTP/1.1, $http_user_agent contains spaces, and so on.
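to see the breakage concretely, here's a made-up combined-format line (every value fabricated) split with awk's default whitespace separator:

```shell
# default FS splits on runs of whitespace, so the bracketed timestamp
# and the quoted user agent each shatter into several fields.
echo '127.0.0.1 - - [24/Feb/2025:10:00:00 +0000] "GET /abc HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"' |
    awk '{ print NF; print $12 }'
# prints NF = 15, and $12 = "Mozilla/5.0 (the user agent, shredded)
```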

enter FPAT, the variable I learned about. instead of describing what separates fields, it lets you describe what a field looks like, using a regex :magic_spongebob_rainbow:

FPAT is specific to GNU awk, docs here. awk itself is in POSIX, but FPAT is a gawk extension, not part of the standard.

basic examples

input1

blatantly copying from the docs, assume the following input

Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
FPAT = "([^,]+)|(\"[^\"]+\")"

this FPAT expression alternates two patterns: [^,]+ (a run of non-comma characters) and "[^"]+" (a double-quoted string, which may itself contain commas).

the program below splits each record with it and prints every field:

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}

and this should give us

NF =  7
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 A Pretty Street, NE">
$4 = <MyTown>
$5 = <MyState>
$6 = <12345-6789>
$7 = <USA>

nested patterns?

input2

assume the same program as above; only the FPAT variable changes as specified.

the following inputs then produce:

FPAT = "([^,]+)|(\"[^\"]+\")"

Robbins,Arnold,"1234 ""A Pretty Street"", NE",MyTown,MyState,12345-6789,USA
NF =  8
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 ""A Pretty Street"">
$4 = < NE">
$5 = <MyTown>
$6 = <MyState>
$7 = <12345-6789>
$8 = <USA>

input3

FPAT = "([^,]+)|(\"[^\"]+\")|(\\([^\\)]+\\))"

Robbins,Arnold,"1234 (A Pretty Street), NE",MyTown,(MyState,12345-6789),USA
NF =  6
$1 = <Robbins>
$2 = <Arnold>
$3 = <"1234 (A Pretty Street), NE">
$4 = <MyTown>
$5 = <(MyState,12345-6789)>
$6 = <USA>

input4

hmm, it seems to prefer the longest match? but how does it know that , is a delimiter and not part of a field?

elementary. a “field pattern” overrides a “field separator”

FPAT = "[A-Z]+"

AAA hahaha BBB xxyyzz CCC
NF =  3
$1 = <AAA>
$2 = <BBB>
$3 = <CCC>

okay, and how about the order?

input5

fp1 = "(\"[^\"]*\")"
fp2 = "(\\([^\\)]*\\))"
FPAT = fp1 "|" fp2

ABC"def(ghi"jkl)MNO
NF =  1
$1 = <"def(ghi">

and what if we did

FPAT = fp2 "|" fp1

still the same output.

NF =  1
$1 = <"def(ghi">

this effectively means it’s the leftmost match in the input that wins, regardless of the order of the alternatives. and what if two patterns match at the same position?

input6

fp1 = "[AB]+"
fp2 = "[ABCD]+"
FPAT = fp1 "|" fp2

zzzAABBCCDDzzz
NF =  1
$1 = <AABBCCDD>

doesn’t matter if we flip FPAT = fp2 "|" fp1 either. among alternatives matching at the same position, the longest match wins.

summarizing,

wait a minute, isn’t this just how POSIX regex works (leftmost-longest matching)? ugh. such a waste of time /s.

awk to parse kubernetes upstreaminfo

you can simply derive how to parse nginx access_log combined from this.

here, i am printing out the proxy_upstream_name variable after parsing the line.

BEGIN {
    fp1 = "\\[[^]]*]"  # bracketed field: [...]
    fp2 = "\"[^\"]*\"" # quoted field: "..."
    fp3 = "[^ ]+"      # bare field: a run of non-space characters
    FPAT = fp1 "|" fp2 "|" fp3

    expected_field_count = 18
}

NF != expected_field_count { next }

{
    remote_addr=$1
    # $2 is the literal "-" between the address and the user
    remote_user=$3
    time_local=$4
    request=$5
    status=$6
    body_bytes_sent=$7
    http_referer=$8
    http_user_agent=$9
    request_length=$10
    request_time=$11
    proxy_upstream_name=$12
    proxy_alternative_upstream_name=$13
    upstream_addr=$14
    upstream_response_length=$15
    upstream_response_time=$16
    upstream_status=$17
    req_id=$18
    # additional fields
    # host=$19
    # upstream_cache_status=$20

    print proxy_upstream_name

}

here’s the stern command i used for tailing all pods’ logs.

stern --context=$cluster -n ingress-nginx $ngx_deployment_name -c controller --since=1s --only-log-lines=true --output=raw |\
    rg '"(GET|HEAD|POST|PUT|PATCH|DELETE|OPTIONS)' |\
    gawk -f upstreaminfo.awk
