Defining & Measuring Software Quality Attributes at Scale

Managing a largely decoupled, microservices-based system is always a challenge. We have all maintained an unpleasant bug tracker at some point or another. Around a year ago, when we witnessed a growth spurt, we felt the need to systematically track our production systems to capture the live performance of our products and any issues our clients might be facing. It is important to know what your end-users actually experience when using your products.

The problem with microservices is that the point of failure is not easily located. A simple solution is a tracking id attached to every request. But if you have already built a production system with hundreds of fully functioning microservices, adding a tracking id requires an unfortunate amount of code changes, which may or may not touch the business logic. Injecting the id at the edge, as sketched below, keeps those changes small.
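A minimal sketch of that idea in nginx (the X-Request-Id header name and the backend upstream are our assumptions; $request_id is a built-in nginx variable since 1.11.0):

# Hedged sketch: tag every incoming request with a unique id at the proxy
location / {
    proxy_set_header X-Request-Id $request_id;  # generated per request by nginx
    proxy_pass http://backend;                  # hypothetical upstream name
}

Each downstream service then only has to copy the header onto its outbound calls and into its logs.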

We also noticed that a majority of issues in our systems were not being reported. End-users frequently decide that it is easier to repeat the activity than to identify and report the issue. A user's first thought is often: “Probably I made a mistake and did not use the product correctly!” or “It was a glitch, just retry!”

“Our job as software engineers should be more than writing code: proactively identifying customer needs and making their experience better.”

Our Solution

Initial Thoughts

If you don’t assign a quantitative value to a parameter, you can’t track it, and you can’t improve it. Any parameter discussed only in qualitative terms is as good as not measured at all.

Quality Improvement (QI) is a branch of the larger umbrella of Quality Assurance. QI emphasizes setting up the right parametric measures during the development and release stages, which eventually help you measure how your system performs in the real world. Actual real-world data gives you insight into how your software is used outside lab conditions (your development center).

We wrote an article about one of our large implementations replacing legacy banking systems, which is expected to work in rural areas of India; see the section “World is not the cozy laboratory, we know that!”. The article emphasizes understanding your last-mile user data and building for it.

Software Quality Attributes

When we first thought of creating such a system at Signzy, the first thing we addressed was identifying the Software Quality Attributes relevant to our systems, which we would then capture and improve upon. It is also advisable to classify them by their criticality to your systems and the impact on your business if a given metric declines.

An exhaustive list of system quality attributes in software engineering can be found here.

We chose the metrics below and created definitions for each. While they are mostly standard, note that the input and measured parameters can be customized to fit your needs. For instance, we changed the calculation methodology (intentionally omitted here) for 4 of the metrics to fit the API and SaaS software we create.
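As a purely illustrative example of what such a definition can look like (the formula and window below are our assumptions, not the omitted methodology):

Reliability (server error rate, %) = (5xx responses / total responses) × 100, measured over a rolling 24-hour window of production traffic.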

To capture the raw data behind these metrics, the nginx access log can be extended to record request, user, and upstream timing details:

# Define a custom log format capturing request, auth, and upstream timing details
log_format addHeaderlog '$remote_addr $remote_user $time_local "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for" "$request_body" '
                        '"$http_authorization" "$http_x_duid" "$http_x_ver" "$upstream_http_x_rqid" '
                        '$request_time $upstream_connect_time $upstream_header_time '
                        '$upstream_response_time "-" "-"';

# Write the captured fields to a dedicated access log using the format above
access_log /var/log/nginx/access-capture.log addHeaderlog;
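After adding the format, the configuration can be validated and reloaded with the standard nginx commands:

$ sudo nginx -t         # validate the updated configuration
$ sudo nginx -s reload  # apply it without downtime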

Correspondingly, the ingest pipeline on Elasticsearch can look like the one below.

// Run the below requests in the Elasticsearch (Kibana Dev Tools) console
// GET fetches your pipeline information
GET _ingest/pipeline/my-pipeline
// PUT creates/updates the pipeline with the body below
// Refer to the Elasticsearch (ELK) 7.2+ documentation for more details
PUT _ingest/pipeline/my-pipeline
{
  "description" : "Pipeline for nginx access.log file",
  "processors" : [
    {
      "grok" : {
        "field" : "message",
        "patterns" : [
          "%{DATA:remoteAddress} %{DATA:remoteUser} %{HAPROXYDATE:requestTimestamp} %{QS:request} %{NUMBER:statusCode} %{NUMBER:bodyBytesSent} %{QS:httpReferrer} %{QS:httpUserAgent} %{QS:httpXForwardedFor} %{QS:requestBody} %{QS:requestAuth} %{QS:httpXDuid} %{QS:httpXVer} %{QS:httpXRqid} %{DATA:requestTime} %{DATA:upstreamConnectTime} %{DATA:upstreamHeaderTime} %{DATA:upstreamResponseTime} %{QS:responseBody} %{QS:greedyData}"
        ],
        "on_failure" : [
          {
            "set" : {
              "field" : "errorMessage",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}
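Before wiring up live traffic, the pipeline can be tested with Elasticsearch's standard _simulate endpoint; paste one raw line from access-capture.log as the message field:

POST _ingest/pipeline/my-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "<one raw line from access-capture.log>"
      }
    }
  ]
}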

Filebeat can be configured to send data to the Elasticsearch ingest pipeline using a configuration similar to the one below.

Edit the configuration file, filebeat.yml:

$ sudo vim /etc/filebeat/filebeat.yml

and insert a configuration like the below:

filebeat.inputs:
- type: log
  enabled: true
  paths:
    # The custom access log written by the addHeaderlog format above
    - /var/log/nginx/access-capture.log
output.elasticsearch:
  hosts: ["log.server:32121"]
  # Route events through the ingest pipeline defined earlier
  pipeline: my-pipeline
setup.kibana:
  host: "log.server:32122"

** Please ensure the logs do not contain any sensitive data.

While shipping logs into a central system provides a fast, efficient, and live method of identifying issues, you must ensure the logs contain no personally identifiable data about your customers. Log processing systems are generally designed for fast processing of telemetry data rather than for security. You will have to ensure security yourself.
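A minimal sketch of one way to enforce this at the source in nginx (reusing the log format above; the variable name $auth_redacted is our own): define a map in the http context that collapses the Authorization header to a placeholder, then log the mapped variable instead of the raw header.

# Hedged sketch: never log credential values, only their presence
map $http_authorization $auth_redacted {
    ""      "-";
    default "REDACTED";
}
# ...and use "$auth_redacted" in place of "$http_authorization"
# in the addHeaderlog format above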

Once the data has arrived in your analytics system, you can run a wide range of analytics and calculations to measure your system’s performance. For example, the simple query below aggregates requests by response code.

Query

GET filebeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "format": "strict_date_optional_time",
              "gte": "now-1d/d",
              "lte": "now/d"
            }
          }
        }
      ],
      "filter": [
        {
          "match_all": {}
        }
      ],
      "should": [],
      "must_not": []
    }
  },
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "statusCode.keyword"
      }
    }
  }
}

Output

{
  "took" : 1656,
  "timed_out" : false,
  "_shards" : {
    "total" : 7,
    "successful" : 7,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "group_by_state" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 3488,
      "buckets" : [
        {
          "key" : "200",
          "doc_count" : 3316229
        },
        {
          "key" : "405",
          "doc_count" : 50592
        },
        {
          "key" : "422",
          "doc_count" : 22977
        },
        {
          "key" : "400",
          "doc_count" : 15295
        },
        {
          "key" : "201",
          "doc_count" : 10839
        },
        {
          "key" : "404",
          "doc_count" : 8353
        },
        {
          "key" : "500",
          "doc_count" : 2305
        },
        {
          "key" : "403",
          "doc_count" : 1929
        },
        {
          "key" : "410",
          "doc_count" : 1626
        },
        {
          "key" : "204",
          "doc_count" : 904
        }
      ]
    }
  }
}
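As a quick worked example from this output: the buckets plus sum_other_doc_count add up to roughly 3.43 million requests for the day, of which 2,305 were 5xx responses, a server error rate of about 0.07%. Numbers like this can feed directly into the quality metrics defined earlier.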

Say you want to see how response times are distributed; a query like the one below instantly buckets requests into response-time ranges that you can filter further.

Query

GET filebeat-*/_search
{
 "size": 0,
 "query": {
   "bool": {
     "must": [{
         "range": {
           "@timestamp": {
             "format": "strict_date_optional_time",
             "gte": "now-1d/d",
             "lte": "now/d"
           }
         }
       }
     ],
     "filter": [
       {
         "match_all": {}
       }
     ],
     "should": [],
     "must_not": []
   }
 },
 "aggs": {
   "requestTime": {
     "range": {
       "script": "(doc['requestTime.keyword'].size()>0 && doc['requestTime.keyword'].value != null && doc['requestTime.keyword'].value != '-' ? Double.parseDouble(doc['requestTime.keyword'].value) : 0)",
       "ranges": [
         {"to": "0.5"},
         {"from": "0.5", "to":"1.0"},
         {"from": "1.0", "to":"1.5"},
         {"from": "1.5", "to":"2.0"},
         {"from": "2.0", "to":"2.5"},
         {"from": "2.5", "to":"3.0"},
         {"from":"3.0"}
       ]
     }

Output

{
  "took" : 1832,
  "timed_out" : false,
  "_shards" : {
    "total" : 7,
    "successful" : 7,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "requestTime" : {
      "buckets" : [
        {
          "key" : "*-0.5",
          "to" : 0.5,
          "doc_count" : 2775431
        },
        {
          "key" : "0.5-1.0",
          "from" : 0.5,
          "to" : 1.0,
          "doc_count" : 62018
        },
        {
          "key" : "1.0-1.5",
          "from" : 1.0,
          "to" : 1.5,
          "doc_count" : 25563
        },
        {
          "key" : "1.5-2.0",
          "from" : 1.5,
          "to" : 2.0,
          "doc_count" : 98898
        },
        {
          "key" : "2.0-2.5",
          "from" : 2.0,
          "to" : 2.5,
          "doc_count" : 108635
        },
        {
          "key" : "2.5-3.0",
          "from" : 2.5,
          "to" : 3.0,
          "doc_count" : 61629
        },
        {
          "key" : "3.0-*",
          "from" : 3.0,
          "doc_count" : 314262
        }
      ]
    }
  }
}

You may also use any of the other aggregations provided by Elasticsearch, for instance percentile aggregations on numeric values, as sketched below.
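A minimal sketch, reusing the same requestTime parsing script as in the range query above:

GET filebeat-*/_search
{
  "size": 0,
  "aggs": {
    "requestTimePercentiles": {
      "percentiles": {
        "script": "(doc['requestTime.keyword'].size()>0 && doc['requestTime.keyword'].value != '-' ? Double.parseDouble(doc['requestTime.keyword'].value) : 0)",
        "percents": [50, 90, 95, 99]
      }
    }
  }
}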

Always use production data for quality improvement measurements; that is the only way to see what your users see.

Results

We successfully identified and fixed 96% of our production bugs and learned a great deal about our legacy codebase. It was an eye-opening experience for our tech team and continues to help us improve our code quality.

Use Cases

  1. Detecting and fixing production anomalies (see the sketch after this list)
  2. Monitoring system changes after every deployment
  3. Growing our test case scenarios with observed patterns
  4. Visualizing our production load and accordingly scaling our system
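As a hedged sketch covering the first two use cases (field names as defined in the pipeline above), a date histogram of 5xx responses per hour makes anomalies and post-deployment regressions easy to spot:

GET filebeat-*/_search
{
  "size": 0,
  "query": {
    "prefix": { "statusCode.keyword": "5" }
  },
  "aggs": {
    "errorsPerHour": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1h"
      }
    }
  }
}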

Insights

It is important to create an executive report detailing how you are doing on your quality attributes. As always, assign intensities to your statuses; this helps you derive a holistic view of your system’s quality. A methodology of statuses and intensities can give you guidelines to answer questions like:

  1. What is the current status of each metric?
  2. Is it progressing, stalled, or compromised?
  3. What default intensity values are assigned to the critical, important, and good-to-have metrics?
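For illustration only (these values are hypothetical, not our production weights): a critical metric might carry an intensity of 5, an important one 3, and a good-to-have one 1. Scoring each status (e.g. progressing = 1, stalled = 0, compromised = -1) and summing intensity × status across all metrics yields a single holistic quality score you can track from report to report.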


As experts say, find bugs early before they bite you back, and rightly so.

The article below discusses the importance of identifying issues as early as possible and how costly they become when they are not.

https://deanondelivery.com/product-managers-do-you-know-how-much-your-bugs-cost-72b6e36e7684?gi=140ff3e30546

About Signzy

Signzy is a market-leading platform redefining the speed, accuracy, and experience with which financial institutions onboard customers and businesses through the digital medium. The company’s award-winning no-code GO platform delivers seamless, end-to-end, and multi-channel onboarding journeys while offering customizable workflows. In addition, it gives these players access to an aggregated marketplace of 240+ bespoke APIs that can be easily added to any workflow with simple widgets.

Signzy enables more than ten million end-customer and business onboardings every month at a success rate of 99%, while reducing the time to market from 6 months to 3-4 weeks. It works with more than 240 FIs globally, including the 4 largest banks in India and a top-3 acquiring bank in the US, and has robust global partnerships with Mastercard and Microsoft. The company’s product team is based in Bengaluru, with a strong presence in Mumbai, New York, and Dubai.

Visit www.signzy.com for more information about us.

You can reach out to our team at reachout@signzy.com

Written By:

Ankur and Prakriti