Let's face it, most current AI Cyber Threat Intelligence tools suck! It's not really the fault of artificial intelligence though; it's the fault of our own expectations, incorrectly set by an over-hyped marketing machine. The right AI and data science concepts applied to the right problems can achieve faster results at a lower cost than was ever before possible. This is why Cybergeist takes a pragmatic approach to using AI technology, and has even more room to embrace it further in the future.
I don't want to divulge the exact secret algorithm sauce that's used behind the Cybergeist scenes, but trust me, it's not magic, it's logic built upon industry experience and experimentation. With that said, it likely does deserve some explanation.
In addition to supervised machine learning, natural language processing, and Generative AI, the Cybergeist algorithm also relies on simple math, human input, and industry experience to make deterministic decisions. These algorithmic and human inputs come in the forms described below.
The source information for Cybergeist primarily comes from unstructured text readers implemented as serverless FaaS processes. These readers consume content from blog posts, historic documents, and threat analysis PDFs that are all in the public domain and TLP:CLEAR (formerly TLP:WHITE). This data is consumed and run through a series of NLP processes to make sense of it for later analysis. We think that focusing on unstructured input allows easy future extensions to consume other streams of human-written data that are not used today, such as 'tweets' or streams of posts to a chat room.
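To make the reader idea concrete, here's a minimal sketch of what one FaaS-style reader invocation might look like. The handler name, event shape, and field names are all hypothetical illustrations, not the real Cybergeist interface:

```python
import json

def reader_handler(event):
    """Hypothetical entry point a serverless platform would invoke once
    per public document: normalise the raw text and hand it downstream
    to the NLP stages."""
    doc = event["document"]              # raw text from a blog/PDF extractor
    source = event.get("source", "unknown")
    # Collapse whitespace so later NLP stages see clean sentences.
    text = " ".join(doc.split())
    return json.dumps({"source": source, "text": text})

out = reader_handler({"document": "Example  threat\nreport  text.",
                      "source": "blog-a"})
print(out)
```

The key design point is that each invocation handles one document, so the platform can fan readers out horizontally as new content appears.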
Since the main process input is unstructured text, a couple of different NLP tools are required, both to make sense of the statements made in a sentence and to identify the subject and relevance of a report itself. This is why you may find 'score' ratings for documents scattered around the place. For a simple conceptual example, consider the hypothetical sentence below, which could have appeared in a threat blog post.
In this report we discuss how the FooRansomwareGroup is exploiting CVE-0000-00001 to target UK financial institutions.
When you apply a pseudo-code rule like the one below, a few interesting statements can be identified and stored for later evaluation:
<subjectNounClass> ...junk... <subjectVerbClass> ...junk... <subjectNounClass>
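A toy Python sketch of how a rule like that might be evaluated. The term inventories here are hypothetical, borrowed from the example sentence above; the real system resolves nouns and verbs from its own term store rather than hard-coded sets:

```python
import re

# Hypothetical term inventories for illustration only.
SUBJECT_NOUNS = {"FooRansomwareGroup", "CVE-0000-00001", "UK financial institutions"}
SUBJECT_VERBS = {"exploiting", "target"}

def extract_assertions(sentence):
    """Scan a sentence for <noun> ...junk... <verb> ...junk... <noun>
    patterns and return the matched triples."""
    # Locate every known term with its position in the sentence.
    hits = []
    for term in SUBJECT_NOUNS | SUBJECT_VERBS:
        for m in re.finditer(re.escape(term), sentence):
            kind = "verb" if term in SUBJECT_VERBS else "noun"
            hits.append((m.start(), kind, term))
    hits.sort()

    # Emit (noun, verb, noun) triples wherever the pattern lines up.
    assertions = []
    for i in range(len(hits) - 2):
        (_, k1, t1), (_, k2, t2), (_, k3, t3) = hits[i:i + 3]
        if (k1, k2, k3) == ("noun", "verb", "noun"):
            assertions.append((t1, t2, t3))
    return assertions

sentence = ("In this report we discuss how the FooRansomwareGroup is exploiting "
            "CVE-0000-00001 to target UK financial institutions.")
print(extract_assertions(sentence))
```

Run against the example sentence, this yields two triples: the group exploiting the CVE, and the CVE being used to target UK financial institutions.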
The above would conceptually yield a list of assertions that Cybergeist can leverage later in the algorithm.
Now these are all useful atomic statements, but as you may suspect, they are prone to a lot of noise and false-positive assertions. This is where the multi-sourcing, scoring, and user feedback kick in. Relying on any single statement would be problematic, so this approach is applied to a large data set built from an aggregate of over a hundred thousand public reports, news articles, and blog posts (including a ton of re-syndicated duplicate content!).
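The multi-sourcing idea can be sketched as a corroboration count: only keep an assertion once enough distinct sources make it. The threshold and data shapes below are illustrative assumptions, not the actual Cybergeist scoring model:

```python
def score_assertions(observations, min_sources=2):
    """Keep assertions corroborated by at least `min_sources` distinct
    sources; the count doubles as a crude confidence score."""
    sources_per_assertion = {}
    for source, assertion in observations:
        sources_per_assertion.setdefault(assertion, set()).add(source)
    return {a: len(s) for a, s in sources_per_assertion.items()
            if len(s) >= min_sources}

# Hypothetical (source, assertion) pairs harvested by the readers.
observations = [
    ("blog-a", ("FooRansomwareGroup", "exploiting", "CVE-0000-00001")),
    ("blog-b", ("FooRansomwareGroup", "exploiting", "CVE-0000-00001")),
    ("news-c", ("FooRansomwareGroup", "exploiting", "CVE-0000-00001")),
    ("blog-d", ("BarBotnet", "targeting", "CVE-0000-00002")),
]

print(score_assertions(observations))
```

Counting distinct sources (rather than raw mentions) also blunts the effect of the re-syndicated duplicate content mentioned above.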
In the back-end systems, we use a data type called a 'term' to describe and anchor the nouns and verbs. Examples of a term could be a Threat Actor name, a Malware family, or a Vulnerability identifier such as a CVE.
Terms are classified for collective use, can have aliases, and also associations to MITRE ATT&CK elements. We're working (slowly) on a MITRE ATT&CK view of our data.
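A rough sketch of what such a term record could look like. The field names and schema here are illustrative guesses, not the real back-end data model:

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """Illustrative 'term' record: a canonical noun/verb plus metadata."""
    name: str
    term_class: str                                # e.g. "ThreatActor", "Malware", "Vulnerability"
    aliases: set = field(default_factory=set)
    attack_ids: set = field(default_factory=set)   # associated MITRE ATT&CK IDs

    def matches(self, text):
        """True if the term's canonical name or any alias appears in the text."""
        return any(candidate in text for candidate in {self.name} | self.aliases)

# Hypothetical example reusing the fictional actor from earlier;
# T1486 (Data Encrypted for Impact) is a real ATT&CK technique ID.
actor = Term("FooRansomwareGroup", "ThreatActor",
             aliases={"FooGroup"}, attack_ids={"T1486"})
print(actor.matches("FooGroup hit another victim"))
```

Alias matching like this is what lets a single canonical term anchor the many names a group or family goes by across different vendor reports.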
Generative AI is used to build specific summary information about classes of terms, such as Threat Actors, Malware, and Vulnerabilities. Right now ChatGPT is being used for the generation, but you may find a few summaries that were generated with Llama 2, since we're continuously experimenting with ways to lower our costs!
We're running Cybergeist on a shoestring budget out of our own shallow pockets today; there are no VCs or rich uncles with private jets involved. We're able to do this by heavily relying on modern cloud services that are available off the shelf at what we consider a reasonable price, so trade-offs are made everywhere because of this, especially in the freshness of data. Caching in the browser, at a CDN, and in the API service is a great help with cost.
One specific approach that we're finding works well is parallel analysis of historic data. As you may imagine, it's common to learn new, obscure names of Threat Actors or Malware families that may have been mentioned in the past, before we knew they were a Threat Actor. To handle this, large parallel batch-processing jobs are undertaken to re-read historic data. This of course causes a spike in costs, but it's a great trade-off vs. the alternative of storing a huuuugggeeee dataset of low-value data.
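The backfill pattern can be sketched like this: fan a re-read of stored documents out in parallel whenever a new term is learned. The document store, worker count, and function names below are stand-ins for illustration; in production this would fan out as FaaS invocations over object storage rather than threads over a dict:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical historic document store.
HISTORIC_DOCS = {
    "doc-1": "NewlyNamedActor was seen exploiting CVE-0000-00001.",
    "doc-2": "Nothing relevant here.",
    "doc-3": "Researchers attribute the campaign to NewlyNamedActor.",
}

def reread(doc_id, text, new_term):
    """Re-scan one historic document for a term we only recently learned about."""
    return doc_id if new_term in text else None

def backfill(new_term, docs=HISTORIC_DOCS, workers=8):
    """Fan the re-read out in parallel and collect the documents that
    mention the new term, instead of keeping a huge pre-built index."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: reread(item[0], item[1], new_term),
                           docs.items())
    return sorted(r for r in results if r)

print(backfill("NewlyNamedActor"))
```

Because each document is scanned independently, the job parallelizes cleanly, which is what makes the temporary cost spike short-lived.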
If you're interested in learning more about Cybergeist and don't yet have a user account, simply click the Login / Sign up button at the top of the page and check the 'Don't have an account? Sign Up' link.
Got questions? We read email (sometimes) [email protected]
Max