Task: the paper is attached and is 75 percent done; I just need the research part completed and then added to the final paper provided. Below is what I wanted to be included:
- Data Collection: Based on the methodology and tools identified in Assignment 3, collect or prepare a set of data. Ensure that the data is sufficient in quantity and quality to support the analysis.
- Data Analysis: Analyze the collected data to answer your research questions. This analysis should be methodical and may involve various techniques, such as statistical analysis, pattern recognition, or comparative studies.
- Findings and Interpretation: Interpret the results of your analysis. Discuss what the data reveals about your research questions and the broader topic. Identify any significant findings, unexpected results, or areas for further research.
- Submission: Prepare a presentation with (please upload the PPTs here):
- A detailed account of the data collection process
- The methods and techniques used in your analysis
- Interpretation of the findings, discussing how they relate to your research questions
- Any conclusions or recommendations based on your analysis
I. Introduction
We aim to gain a deeper understanding of the potential, effectiveness, and challenges involved in utilizing Large Language Models (LLMs) to identify and correct software vulnerabilities.
A. Research Questions
- How do Large Language Models (LLMs), such as OpenAI's GPT-4, enhance the detection and correction of software vulnerabilities compared to traditional static analysis tools?
- How effective are LLMs in identifying vulnerabilities in software compared to traditional vulnerability detection methods?
- What challenges and limitations are associated with using Large Language Models (LLMs) for vulnerability detection in software?
II. Preliminary Data
Existing research already provides some insight into the potential of integrating LLMs into common security practices. Compared to traditional static analysis tools, LLMs show an estimated 20-25% higher false positive rate, but an average of 35% higher true positive rate [1]. GPT-4 detected approximately four times more vulnerabilities than conventional tools, with only around 6.67% of its reports being false positives [2]. LLMs can detect vulnerabilities with accuracy of up to 92.65%, outperforming their counterparts as a result of their ability to efficiently analyze natural language data [3]. Another LLM-based model, SecureFalcon, achieves an accuracy of 94% in vulnerability detection [4]. We hope that, upon completion of our research, more conclusive results on the effectiveness of LLMs will be available.
III. Methodology
The purpose of this study is to directly compare the effectiveness of LLMs to traditional static analysis tools in identifying security vulnerabilities. Following a detailed analysis of the collected results, an assessment of each tool's strengths and weaknesses will be conducted and compared to existing research.
A. Tools and Resources
The selection of each tool and resource used in the experiment serves a clearly defined purpose. A common theme in our selection process was to pick one tool that was utilized in at least one of the studies from our preliminary research and one tool that was not. This allows the results of our evaluation to be easily compared to those of existing research while also presenting new data for discussion. Furthermore, we selected resources that are widely used in order to improve reliability and accessibility.
- Large Language Models (LLMs): Because OpenAI models are among the most widely recognized and utilized LLMs, we selected GPT-3.5 Turbo and GPT-4 Turbo for analysis. We will initially employ GPT-3.5 Turbo; however, if the test cases produce inconclusive or unreliable results, or if GPT-3.5 Turbo significantly underperforms against the traditional static analysis tools, we will transition to GPT-4 Turbo. This approach allows us to conduct thorough testing while strategically managing resources, in line with our objective of a cost-efficient methodology.
- Traditional Static Analysis Tools: The two traditional static analysis tools selected for our experiment are Flawfinder and SonarQube. Flawfinder was utilized in related research, which gives us a solid baseline for comparison; SonarQube was not, which allows original results to be presented for supplemental analysis and discussion. Because Flawfinder is only compatible with C/C++, these programming languages will be the focus of our research when choosing datasets. Additionally, both SonarQube and Flawfinder are very commonly used tools. A sketch of how a single test case might be submitted to each class of tool is provided after this list.
- Datasets: Test cases will be pulled from the Software Assurance Reference Dataset (SARD) and the Common Vulnerabilities and Exposures (CVE) database. Both are trusted, reliable sources that receive continuous updates. With such extensive and diverse databases, there should be no difficulty finding test cases relevant to the experiment.
- Vulnerabilities: To narrow down the test-case options in line with the deadline for the final report, we chose five specific vulnerabilities for the data collection process: SQL Injection, Buffer Overflow, Cross-Site Scripting (XSS), Out-of-Bounds Write, and Broken Access Control. These vulnerabilities were strategically selected because of their practicality and real-world significance. Furthermore, a diverse selection of vulnerabilities increases the chance of observing significant differences in detection between LLMs and traditional static analysis tools.
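To make the planned comparison concrete, the following is a minimal sketch (in Python) of how a single C test case could be submitted to GPT-3.5 Turbo through the OpenAI SDK and to Flawfinder through its command-line interface. The prompt wording, the test-case file name, and the CSV parsing are illustrative assumptions rather than part of a finalized procedure; SonarQube, which scans whole projects through its own scanner, is omitted here for brevity.

    import subprocess
    from openai import OpenAI

    PROMPT = (
        "You are a security auditor. State whether the following C/C++ code "
        "contains a vulnerability and name the CWE if possible. Begin your "
        "answer with 'VULNERABLE' or 'NOT VULNERABLE'."
    )

    def ask_llm(source_code: str, model: str = "gpt-3.5-turbo") -> str:
        """Query an OpenAI chat model about a single test case."""
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # keep responses as deterministic as possible
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": source_code},
            ],
        )
        return response.choices[0].message.content

    def run_flawfinder(path: str) -> bool:
        """Return True if Flawfinder reports at least one hit for the file."""
        # --csv (Flawfinder 2.x) prints a header row plus one row per hit;
        # adjust the parsing if your version formats its output differently.
        result = subprocess.run(
            ["flawfinder", "--csv", path],
            capture_output=True, text=True, check=False,
        )
        rows = [line for line in result.stdout.splitlines() if line.strip()]
        return len(rows) > 1

    if __name__ == "__main__":
        case = "sard_case_001.c"  # hypothetical test-case file name
        with open(case, encoding="utf-8") as f:
            code = f.read()
        print("LLM verdict:", ask_llm(code).splitlines()[0])
        print("Flawfinder flagged:", run_flawfinder(case))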
B. Evaluation Metrics
To maintain consistency with existing research, each tool will first be measured using the following base metrics:
- True Positive Rate (TPR): TPR refers to the rate at which the tool is correctly able to detect a vulnerability within the test case.
- False Positive Rate (FPR): FPR represents the rate at which the tool flags a non-vulnerability as a vulnerability.
- True Negative Rate (TNR): TNR is the proportion of actual non-vulnerabilities that are correctly identified as negative by the tool.
- False Negative Rate (FNR): FNR is the rate at which the tool fails to detect an existing vulnerability.
While these base rates will provide a general overview of how each tool performs, additional metrics will be calculated that are more directly relevant to vulnerability detection effectiveness (a sketch of how all of these metrics can be computed from the raw counts follows this list):
- Accuracy: Accuracy measures how often the tool correctly classifies a test case, whether vulnerable or not.
- Precision: Precision is the proportion of cases flagged as vulnerable that actually contain a vulnerability.
- Recall: Recall is the proportion of actual vulnerabilities that the tool is able to detect (equivalent to the TPR).
- F1 Score: The F1 score is the harmonic mean of precision and recall, combining the two into a single measure of predictive performance.
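As a reference for how these metrics relate to the raw detection results, the sketch below computes each metric from a tool's confusion-matrix totals. The counts in the example call are purely illustrative and do not represent collected data.

    def evaluation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
        """Compute the base rates and derived metrics from confusion-matrix counts."""
        tpr = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate (= recall)
        fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
        tnr = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
        fnr = fn / (fn + tp) if (fn + tp) else 0.0  # false negative rate
        total = tp + tn + fp + fn
        accuracy = (tp + tn) / total if total else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
        return {
            "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
            "Accuracy": accuracy, "Precision": precision, "Recall": tpr, "F1": f1,
        }

    # Illustrative example: a tool that catches 18 of 25 vulnerable cases and
    # incorrectly flags 4 of 25 non-vulnerable cases.
    print(evaluation_metrics(tp=18, fp=4, tn=21, fn=7))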
C. Data Collection
The identified LLMs (GPT-3.5 Turbo and GPT-4 Turbo) will be directly compared to the traditional static analysis tools (SonarQube and Flawfinder) in their ability to detect security vulnerabilities in C/C++ test cases drawn from SARD and the CVE database. This comparison will be based on the selected evaluation metrics. We will begin by conducting preliminary testing on two instances of each vulnerability to gain insight into the subsequent procedures; this allows any necessary modifications to be made before excessive resources are exhausted. Our objective is to examine a minimum of ten different cases for each vulnerability, with the option of expanding the testing scope should initial results prove inconclusive. A manual review will then be conducted to validate and verify the accuracy of the collected data, and detailed documentation of the entire procedure will be maintained to ensure transparency and reliability.
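To support the manual review and documentation described above, the sketch below illustrates one possible per-case record format and how a single tool's results could be tallied into the confusion-matrix counts that feed the metric computation shown earlier. The field names and tool labels are placeholders that may change after preliminary testing.

    from collections import Counter
    from dataclasses import dataclass

    @dataclass
    class CaseResult:
        case_id: str        # e.g. a SARD test-case identifier or CVE ID
        vulnerability: str  # one of the five selected vulnerability classes
        tool: str           # "gpt-3.5-turbo", "gpt-4-turbo", "flawfinder", or "sonarqube"
        ground_truth: bool  # does the case actually contain the vulnerability?
        flagged: bool       # did the tool report a vulnerability?

    def tally(results: list[CaseResult], tool: str) -> Counter:
        """Aggregate one tool's per-case results into TP/FP/TN/FN counts."""
        counts = Counter(tp=0, fp=0, tn=0, fn=0)
        for r in results:
            if r.tool != tool:
                continue
            if r.ground_truth and r.flagged:
                counts["tp"] += 1
            elif r.ground_truth and not r.flagged:
                counts["fn"] += 1
            elif not r.ground_truth and r.flagged:
                counts["fp"] += 1
            else:
                counts["tn"] += 1
        return counts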
IV. Conclusion
The main objectives of this project are to compare LLMs with traditional static analysis tools in terms of their ability to effectively detect security vulnerabilities and to validate the potential of LLMs to enhance software security practices. The limited existing research acknowledges this potential, suggesting that LLMs can achieve substantially higher true positive rates, although reported false positive rates vary across studies. Future research on optimizing LLMs for security should be conducted before LLMs are integrated at scale into security workflows.