How does the Cybergeist AI work?
...Because Your Cyber AI Sucks!

Let's face it, most current AI Cyber Threat Intelligence tools suck! It's not really the fault of artificial intelligence though; it's the fault of our own expectations being set incorrectly by an over-hyped marketing machine. The right AI and data science concepts applied to the right problems can achieve faster results at a lower cost than ever before possible. This is why Cybergeist takes a pragmatic approach to using AI technology, with plenty of room to embrace it further in the future.


I don't want to divulge the exact secret algorithm sauce that's used behind the Cybergeist scenes, but trust me, it's not magic, it's logic built upon industry experience and experimentation. With that said, it likely does deserve some explanation.

Firstly: the non-AI systems and processes

In addition to supervised machine learning, natural language processing, and Generative AI, the Cybergeist algorithm also relies on simple math, human input, and industry experience to make deterministic decisions. These algorithmic and human inputs come in the following forms (there's a conceptual sketch of how they might be combined after the list):

  • Multi-sourcing of scored information about the subject
  • User feedback through a simple up/down vote system, coupled with a questionnaire for crowd sourced opinions
  • User comments & discussions (this is an experiment for now, so don't be surprised if you don't see it live for your account)
  • Source relationships (e.g. if source A and source B are both owned by the same media organisation, don't assume their opinions come from different people)
  • Source bias maps: this is worth a blog post in itself, but consider that if a threat report comes from a vendor selling a product to solve that very issue, there could be some exaggeration
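
To make that a bit more concrete, here's a rough sketch in Python of how those deterministic inputs could be folded into one confidence score. The weights, field names, and the bias map are all made up for illustration; this is not the actual Cybergeist scoring code.

from dataclasses import dataclass

@dataclass
class SourcedScore:
    source_id: str     # which feed/blog/vendor produced the score
    publisher: str     # owning media organisation, used to collapse related sources
    score: float       # 0.0 - 1.0 relevance/severity score from the NLP stages
    upvotes: int = 0   # crowd-sourced feedback
    downvotes: int = 0

# Hypothetical bias map: a vendor that sells a fix for the issue gets a mild discount.
BIAS_DISCOUNT = {"VendorX": 0.8}

def combined_confidence(scores: list[SourcedScore]) -> float:
    """Blend multi-source scores, counting sources owned by the same publisher only once."""
    best_per_publisher: dict[str, float] = {}
    for s in scores:
        vote_adjust = max(1.0 + 0.05 * (s.upvotes - s.downvotes), 0.0)  # small nudge from user votes
        adjusted = min(s.score * BIAS_DISCOUNT.get(s.source_id, 1.0) * vote_adjust, 1.0)
        best_per_publisher[s.publisher] = max(best_per_publisher.get(s.publisher, 0.0), adjusted)
    # Noisy-OR: every additional independent publisher asserting the same thing raises confidence.
    confidence = 1.0
    for value in best_per_publisher.values():
        confidence *= 1.0 - value
    return 1.0 - confidence
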
Where does the raw data come from?

The source information for Cybergeist comes primarily from unstructured text readers implemented as serverless FaaS processes. These readers consume content from blog posts, historic documents and threat analysis PDFs that are all in the public domain and TLP:CLEAR/WHITE. The data is then run through a series of NLP processes to make sense of it for later analysis. We think that focusing on unstructured input makes it easy to extend Cybergeist in the future to consume other streams of human-generated data that aren't used today, such as tweets or posts to a chat room.
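
As a hedged illustration of what one of those readers might look like (not the real code: the event shape and the queue_for_nlp helper are assumptions for the sketch), an HTTP-triggered FaaS function could fetch a public article, strip the HTML, and hand the plain text to the NLP pipeline:

import json
import re
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Very naive HTML-to-text conversion; a real reader would also handle PDFs and more."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def handler(event, context):
    """FaaS-style entry point: fetch one public article and queue its text for NLP."""
    url = event["url"]  # assumed event shape
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(html)
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    queue_for_nlp({"source_url": url, "text": text})  # hypothetical downstream queue
    return {"statusCode": 200, "body": json.dumps({"chars": len(text)})}

def queue_for_nlp(document: dict) -> None:
    """Placeholder for pushing the document onto whatever queue feeds the NLP stages."""
    print(f"queued {document['source_url']} ({len(document['text'])} chars)")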

NLP & ML (Natural Language Processing & Machine Learning)

Since the main process input is unstructured text, a couple of different NLP tools are required, both to make sense of the statements made in each sentence and to identify the subject and relevance of the report itself. This is why you may find 'score' ratings for documents scattered around the place. For a simple conceptual example, consider the hypothetical sentence below, which could have appeared in a threat blog post.


In this report we discuss how the FooRansomwareGroup is exploiting CVE-0000-00001 to target UK financial institutions.


When you consider a pseudo-code rule like the one below, we can identify a few interesting statements that are stored for later evaluation:


<subjectNounClass> ...junk... <subjectVerbClass> ...junk... <subjectNounClass>

The above would conceptually yield a list of assertions that Cybergeist can leverage later in the algorithm (a toy sketch of this style of extraction follows the list).

  • FooRansomwareGroup Exploits CVE-0000-00001
  • FooRansomwareGroup Targets UK
  • FooRansomwareGroup Targets Financial institutions
  • CVE-0000-00001 Impacts Financial institutions
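
Here's a toy version of that extraction in Python. The term dictionary and verb classes are made up for the example, and real extraction leans on proper NLP tooling rather than this naive matching, but it shows the shape of the idea: pair every recognised noun term to the left of a relationship verb with every recognised noun term to its right.

import itertools

# Hypothetical dictionaries of known noun terms and relationship verb classes.
NOUN_TERMS = {"FooRansomwareGroup", "CVE-0000-00001", "UK", "financial institutions"}
VERB_CLASSES = {"exploiting": "Exploits", "target": "Targets", "targets": "Targets"}

def extract_assertions(sentence: str) -> list[tuple[str, str, str]]:
    """Naive <noun> ...junk... <verb> ...junk... <noun> matching over one sentence."""
    assertions = []
    lowered = sentence.lower()
    for token in sentence.split():
        verb = VERB_CLASSES.get(token.strip(".,").lower())
        if not verb:
            continue
        cut = lowered.find(token.lower())
        left = [t for t in NOUN_TERMS if t.lower() in lowered[:cut]]
        right = [t for t in NOUN_TERMS if t.lower() in lowered[cut + len(token):]]
        for subject, obj in itertools.product(left, right):
            assertions.append((subject, verb, obj))
    return assertions

sentence = ("In this report we discuss how the FooRansomwareGroup is exploiting "
            "CVE-0000-00001 to target UK financial institutions.")
for subject, verb, obj in extract_assertions(sentence):
    print(subject, verb, obj)

(Running it prints a few extra, noisier pairs than the tidy list above, which is exactly the point of the next paragraph.)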

Now these are all useful atomic statements, but as you might expect, they are prone to a lot of noise and false-positive assertions, and relying on any single statement could be problematic. This is where the multi-sourcing, scoring and user feedback kick in: the approach is applied to a large enough data set, built from an aggregate of over a hundred thousand public reports, news articles, and blog posts (including a ton of re-syndicated duplicate content!), so that no conclusion rests on a single source.
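
As a small, hedged aside on that re-syndication problem: one cheap trick is to fingerprint documents and only treat an assertion as corroborated once it appears in multiple distinct documents. The fingerprinting and threshold below are invented for the example.

import hashlib
from collections import defaultdict

def fingerprint(text: str) -> str:
    """Cheap duplicate detection: hash the normalised text of the source document."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

def corroborated(assertions: list[dict], min_sources: int = 2) -> set[tuple[str, str, str]]:
    """Keep only assertions seen in at least min_sources distinct (non-duplicate) documents."""
    seen_docs: dict[tuple[str, str, str], set[str]] = defaultdict(set)
    for a in assertions:
        triple = (a["subject"], a["verb"], a["object"])
        seen_docs[triple].add(fingerprint(a["document_text"]))
    return {triple for triple, docs in seen_docs.items() if len(docs) >= min_sources}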

MITRE ATT&CK Mapping

In the back end systems, we use a data type called a 'term' to describe and anchor the nouns or verbs. Examples of a term could be:

  • APT1
  • Gay Furry Hackers
  • Cl0p
  • Microsoft
  • CVE-0000-0000

Terms are classified for collective use, can have aliases, and can also carry associations to MITRE ATT&CK elements. We're working (slowly) on a MITRE ATT&CK view of our data.
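
For illustration only (the field names are guesses, not the real schema), a term might be modelled roughly like this:

from dataclasses import dataclass, field

@dataclass
class Term:
    """A noun/verb anchor used across the back end; the fields here are hypothetical."""
    name: str                                          # e.g. "Cl0p"
    term_class: str                                    # e.g. "ThreatActor", "Malware", "Vendor", "CVE"
    aliases: set[str] = field(default_factory=set)     # e.g. {"Clop"}
    attack_ids: set[str] = field(default_factory=set)  # associated MITRE ATT&CK IDs

# Illustrative association: T1486 (Data Encrypted for Impact) is a technique
# commonly linked to ransomware operations.
clop = Term(name="Cl0p", term_class="ThreatActor", aliases={"Clop"}, attack_ids={"T1486"})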

Description / Summary Generation

Generative AI is used to build specific summary information about classes of terms, such as Threat Actors, Malware, and Vulnerabilities. Right now ChatGPT is being used for the generation, but you may find a few that were generated with Llama 2, since we're continuously experimenting with how to lower our costs!
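
As a rough sketch (the model name, prompt, and client usage are illustrative, not necessarily what runs in production), generating a summary for a term might look like this:

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarise_term(name: str, term_class: str, assertions: list[str]) -> str:
    """Ask a hosted LLM for a short summary of a term, grounded in the extracted assertions."""
    prompt = (
        f"Write a concise, neutral summary of the {term_class} '{name}' "
        f"based only on these assertions:\n- " + "\n- ".join(assertions)
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return response.choices[0].message.content

print(summarise_term(
    "FooRansomwareGroup", "Threat Actor",
    ["FooRansomwareGroup Exploits CVE-0000-00001", "FooRansomwareGroup Targets UK"],
))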

Scaling a solution at low cost

We're running Cybergeist on a shoestring budget out of our own shallow pockets today; there are no VCs or rich uncles with their own private jets involved. We're able to do this by relying heavily on modern cloud services that are available off the shelf at what we consider a reasonable price, so there are trade-offs everywhere because of this, especially in the freshness of data. Caching at multiple levels, in the browser, at a CDN, and in the API service, is a great help with cost.
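
For example (the header values and lifetimes are illustrative, not the actual Cybergeist configuration), an API response can be made cheap to serve repeatedly just by setting cache headers that both the browser and the CDN respect:

from datetime import datetime, timedelta, timezone

def cache_headers(max_age_seconds: int = 3600) -> dict[str, str]:
    """Headers that let the browser and a CDN reuse a response instead of hitting the API."""
    expires = datetime.now(timezone.utc) + timedelta(seconds=max_age_seconds)
    return {
        # public: shared caches may store it; max-age: browsers; s-maxage: CDN-specific override
        "Cache-Control": f"public, max-age={max_age_seconds}, s-maxage={max_age_seconds * 4}",
        "Expires": expires.strftime("%a, %d %b %Y %H:%M:%S GMT"),
    }

print(cache_headers(900))  # e.g. cache a term summary for 15 minutes, accepting some staleness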

One specific element that we're finding works well is parallel analysis of historic data. As you may imagine, it's common to learn new, obscure names of Threat Actors or Malware families that may have been mentioned in the past, before we knew they were a Threat Actor. To handle this, large parallel batch-processing jobs are run to re-read the historic data. This of course causes a spike in costs, but it's a great trade-off vs. the alternative of storing a huuuugggeeee dataset of low-value data.
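
A minimal sketch of that batch re-read (the storage fetch is a placeholder and the worker count is arbitrary):

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_document(url: str) -> str:
    """Placeholder: in reality this would pull cached document text from cheap object storage."""
    return ""

def reprocess_history(document_urls: list[str], new_term: str, workers: int = 32) -> list[str]:
    """Re-read historic documents in parallel and flag the ones that mention a newly learned term."""
    mentions = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch_document, url): url for url in document_urls}
        for future in as_completed(futures):
            if new_term.lower() in future.result().lower():
                mentions.append(futures[future])
    return mentions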

If you're interested in learning more about Cybergeist and don't yet have a user account, simply click the Login / Sign up button at the top of the page and follow the 'Don't have an account? Sign Up' link.


Got questions? We read email (sometimes) [email protected]

Max