Source Lines of Code

The number of source lines of code in a file.
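As a rough illustration of what the metric counts, the sketch below tallies the lines of a C-like file that contain code, skipping blank lines and comments. It is an assumption about the counting rule, not the rule the measurement tool applies; the function name `count_sloc` and the sample input are hypothetical, and comment markers inside string literals are deliberately not handled.

```python
def count_sloc(text: str) -> int:
    """Count lines holding at least one character of code.

    Assumes C-style comments (// and /* */). Comment markers that
    appear inside string literals are not recognized as literals.
    """
    sloc = 0
    in_block_comment = False
    for line in text.splitlines():
        code_chars = []
        i = 0
        while i < len(line):
            if in_block_comment:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)          # comment continues on next line
                else:
                    in_block_comment = False
                    i = end + 2
            elif line.startswith("//", i):
                break                      # rest of the line is a comment
            elif line.startswith("/*", i):
                in_block_comment = True
                i += 2
            else:
                code_chars.append(line[i])
                i += 1
        if "".join(code_chars).strip():
            sloc += 1
    return sloc


SAMPLE = """/* header */
#include <stdio.h>

int main(void) {
    // say hi
    puts("hi"); /* trailing comment */
    return 0;
}
"""

print(count_sloc(SAMPLE))  # 5
```

The sample has eight physical lines, of which three are blank or comment-only, so five are counted.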

Interpretation

The attribute that source lines of code is expected to quantify is size. A file that is large is likely to have a high value for the source lines of code metric.

Evidence

Source Lines of Code has been empirically validated to be associated with historical vulnerabilities in software in the following peer-reviewed research study:

Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista [2]

The empirical evidence overwhelmingly supports the notion that a source code file with high source lines of code is more likely to contain a security vulnerability.

Implications

The security implication(s) of a file having high source lines of code could be one or more of the following:

Size may make the file unwieldy to review, increasing the potential for latent vulnerabilities.
Size may also make the file difficult to comprehend and subsequently change, making it more likely for developers to introduce vulnerabilities.

Mitigations

In theory, the way to minimize the source lines of code of a file is to have source code files with no code in them at all. However, this theoretical mitigation is impractical at best and meaningless at worst. Therefore, the risk of latent vulnerabilities in a file with high source lines of code could be mitigated using the following suggestion:

Modularize the file to distribute the source code contained within it across multiple smaller files.

Implementation

In our implementation of the metric, we use SciTools Understand™ to collect the source lines of code metric from source code files.

The source code of the implementation of the metric will be made available on GitHub. If you need to collect the metric from your project, the implementation will also be made available as a container image on Docker Hub.

Languages

The metric implementation is limited to projects written in C/C++, C#, Ada, Basic, Fortran, Java, Jovial, Pascal, PL/M, Python, VHDL, COBOL, and Web languages.

Example(s)

In this section, we present examples of the metric collected from popular open-source software projects.

Chromium

In this subsection, we present examples of the metric collected from Chromium, the open-source project behind the Google Chrome web browser.

The metric examples presented here were collected at commit 6b9bf768231f on the master branch of the Chromium source code repository.

Summary

Chromium Source Lines of Code Distribution — Figure 1.1

Chromium Source Lines of Code Discriminatory — Figure 1.2

Shown in Figure 1.1 is the distribution of the metric collected from source code files in the Chromium project. Shown in Figure 1.2 is the comparison of the distribution of the metric collected from source code files in the Chromium project that were not historically vulnerable and those that were.

Thresholds

The thresholds of the metric in the Chromium project, determined using the approach prescribed by Alves et al. [1], are shown in the table below.

Metric Range	value < 873	873 ≤ value < 1,461	1,461 ≤ value < 3,214	3,214 ≤ value
Risk Level	Low	Medium	High	Critical
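The idea behind such thresholds can be sketched as follows. This is a simplified, single-project reading of the weighting idea in Alves et al. [1], not the implementation used here: it assumes each file is weighted by its own source lines of code and that the thresholds are the smallest metric values at which the cumulative weight reaches 70%, 80%, and 90% of all code. The function name `derive_thresholds` and the toy input are hypothetical. The Percentile values listed for the risky files appear consistent with this kind of weighting.

```python
from collections import Counter


def derive_thresholds(sloc_values, quantiles=(70.0, 80.0, 90.0)):
    """Return, per quantile, the smallest SLOC value whose SLOC-weighted
    cumulative percentage reaches that quantile.

    Assumption: every file contributes its own SLOC as weight, and files
    with equal SLOC share one cumulative percentage.
    """
    total = sum(sloc_values)
    if total <= 0:
        raise ValueError("need at least one file with a positive SLOC")

    # Group ties so that equal metric values are accumulated together.
    weight_by_value = Counter()
    for value in sloc_values:
        weight_by_value[value] += value

    pending = sorted(quantiles)
    thresholds = []
    cumulative = 0
    for value in sorted(weight_by_value):
        cumulative += weight_by_value[value]
        # Compare without dividing to avoid rounding at the boundary.
        while pending and 100 * cumulative >= pending[0] * total:
            thresholds.append(value)
            pending.pop(0)
    return thresholds


# Toy project: 70 files of 10 lines, 5 of 20, 4 of 25, and one of 100
# (1,000 lines in total).
toy = [10] * 70 + [20] * 5 + [25] * 4 + [100]
print(derive_thresholds(toy))  # [10, 20, 25]
```

In the toy project the 10-line files hold exactly 70% of the code, the 20-line files bring the total to 80%, and the 25-line files to 90%, which yields the three thresholds printed.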

Risky Files

The thresholds are used to classify source code files into appropriate risk levels. Shown below are the three smallest and three largest source code files from the Chromium project in each of the three non-trivial risk levels (Medium, High, and Critical).

Medium Risk

Path	Source Lines of Code	Percentile
`components/sync/protocol/proto_visitors.h`	873	70.0231
`chrome/browser/chromeos/file_manager/path_util_unittest.cc`	873	70.0231
`third_party/blink/renderer/platform/network/network_state_notifier_test.cc`	873	70.0231
...
`native_client_sdk/src/libraries/nacl_io/kernel_proxy.cc`	1,459	79.9768
`content/browser/accessibility/dump_accessibility_tree_browsertest.cc`	1,459	79.9768
`gpu/command_buffer/service/gles2_cmd_validation_implementation_autogen.h`	1,460	79.9908

High Risk

Path	Source Lines of Code	Percentile
`third_party/protobuf/src/google/protobuf/extension_set.cc`	1,461	80.0188
`components/sync/engine_impl/sync_scheduler_impl_unittest.cc`	1,461	80.0188
`chrome/browser/content_settings/host_content_settings_map_unittest.cc`	1,462	80.0329
...
`ash/display/display_manager_unittest.cc`	3,157	89.9191
`ash/wm/overview/overview_session_unittest.cc`	3,163	89.9494
`cc/scheduler/scheduler_unittest.cc`	3,172	89.9799

Critical Risk

Path	Source Lines of Code	Percentile
`gpu/command_buffer/client/gles2_cmd_helper_autogen.h`	3,214	90.0107
`content/browser/frame_host/navigation_controller_impl_unittest.cc`	3,223	90.0416
`content/browser/appcache/appcache_update_job_unittest.cc`	3,255	90.0728
...
`third_party/libxml/src/testapi.c`	31,132	98.9888
`third_party/hunspell/fuzz/hunspell_fuzzer_hunspell_dictionary.h`	37,181	99.3454
`third_party/sqlite/amalgamation/sqlite3.c`	68,243	100

OpenBSD

In this subsection, we present examples of the metric collected from the UNIX-like operating system developed by the OpenBSD project.

The metric examples presented here were collected at commit dbdab68da3b on the master branch of the OpenBSD source code repository.

Summary

OpenBSD Source Lines of Code Distribution — Figure 2.1

OpenBSD Source Lines of Code Discriminatory — Figure 2.2

Shown in Figure 2.1 is the distribution of the metric collected from source code files in the OpenBSD project. Shown in Figure 2.2 is the comparison of the distribution of the metric collected from source code files in the OpenBSD project that were not historically vulnerable and those that were.

Thresholds

The thresholds of the metric in the OpenBSD project, determined using the approach prescribed by Alves et al. [1], are shown in the table below.

Metric Range	value < 2,530	2,530 ≤ value < 4,299	4,299 ≤ value < 7,755	7,755 ≤ value
Risk Level	Low	Medium	High	Critical
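The table above amounts to a lookup over three ordered cut points. A minimal sketch of that lookup, using the OpenBSD thresholds from the table, is given below; the names `RISK_LEVELS`, `OPENBSD_THRESHOLDS`, and `risk_level` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right

RISK_LEVELS = ("Low", "Medium", "High", "Critical")

# Lower bounds of the Medium, High, and Critical ranges for OpenBSD,
# copied from the threshold table.
OPENBSD_THRESHOLDS = (2530, 4299, 7755)


def risk_level(sloc, thresholds=OPENBSD_THRESHOLDS):
    """Map a SLOC value to its risk level.

    A value equal to a threshold falls into the higher level, matching
    the "threshold <= value" ranges in the table.
    """
    return RISK_LEVELS[bisect_right(thresholds, sloc)]


print(risk_level(2529))   # Low
print(risk_level(2530))   # Medium
print(risk_level(7712))   # High
print(risk_level(66091))  # Critical
```

Passing a different tuple, for example `(873, 1461, 3214)` for Chromium, applies the same lookup to another project's thresholds.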

Risky Files

The thresholds are used to classify source code files into appropriate risk levels. Shown below are the three smallest and three largest source code files from the OpenBSD project in each of the three non-trivial risk levels (Medium, High, and Critical).

Medium Risk

Path	Source Lines of Code	Percentile
`gnu/llvm/tools/clang/lib/Parse/ParseObjc.cpp`	2,530	70.0137
`gnu/gcc/gcc/cp/semantics.c`	2,532	70.0416
`sys/dev/pci/drm/i915/intel_ddi.c`	2,533	70.0695
...
`gnu/gcc/gcc/cp/call.c`	4,273	79.8920
`sys/dev/microcode/symbol/spectrum24t_cf.h`	4,286	79.9392
`gnu/usr.bin/gcc/gcc/c-common.c`	4,295	79.9865

High Risk

Path	Source Lines of Code	Percentile
`gnu/usr.bin/binutils-2.17/bfd/xcofflink.c`	4,299	80.0338
`gnu/usr.bin/binutils-2.17/bfd/elfxx-ia64.c`	4,318	80.0814
`gnu/llvm/lib/Target/PowerPC/PPCISelDAGToDAG.cpp`	4,319	80.1290
...
`sys/dev/pci/if_em_hw.c`	7,579	89.7582
`sys/dev/microcode/myx/ethp_z8e.h`	7,587	89.8418
`gnu/gcc/gcc/combine.c`	7,712	89.9267

Critical Risk

Path	Source Lines of Code	Percentile
`gnu/llvm/lib/Analysis/ScalarEvolution.cpp`	7,755	90.0121
`gnu/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp`	7,803	90.0981
`gnu/gcc/gcc/config/sh/sh.c`	7,811	90.1841
...
`gnu/usr.bin/binutils-2.17/opcodes/m32c-desc.c`	49,967	98.5500
`sys/dev/microcode/udl/udl_huffman.h`	65,542	99.2720
`gnu/usr.bin/binutils-2.17/opcodes/m32c-opc.c`	66,091	100

Reference(s)

[1] Tiago L. Alves, Christiaan Ypma, and Joost Visser. 2010. Deriving Metric Thresholds From Benchmark Data. In Proceedings of the 26th International Conference on Software Maintenance (ICSM '10). 1-10. https://doi.org/10.1109/ICSM.2010.5609747

[2] Thomas Zimmermann, Nachiappan Nagappan, and Laurie Williams. 2010. Searching for a Needle in a Haystack: Predicting Security Vulnerabilities for Windows Vista. In Proceedings of the 3rd International Conference on Software Testing, Verification and Validation (ICST '10). 421-428. https://doi.org/10.1109/ICST.2010.32
