
Exploring DarkBERT: AI's Secret Weapon to Tackle the Dark Web

Hi everyone, it's Yaniv Hoffman here, back with another blog. ChatGPT recently became a household name, but did you know that there is a ChatGPT-like model for the dark web? In today's video, I will talk about DarkBERT, an AI language model trained specifically on the elusive and often nefarious Dark Web. Join me to explore how DarkBERT is revolutionizing cybersecurity efforts and combating cybercrime in the hidden corners of the internet.

Large language models (LLMs) have gained immense popularity and continue to evolve with the introduction of new models. These models, such as ChatGPT, are commonly trained on diverse internet sources like articles, websites, books, and social media.

In a groundbreaking development, a group of researchers from South Korea (Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee) introduced DarkBERT, an LLM that stands out by being pretrained on datasets sourced from the dark web. The objective behind DarkBERT's creation is to surpass the capabilities of existing language models on dark web text and provide valuable assistance to threat researchers, law enforcement agencies, and cybersecurity professionals in combating cyber threats.

What exactly is DarkBERT?

DarkBERT is a powerful language model based on the RoBERTa architecture designed to understand and process text. What makes DarkBERT unique is that it has been specifically trained on millions of web pages from the dark web. The dark web is a hidden part of the internet that cannot be accessed through regular web browsers. It is known for hosting anonymous websites and marketplaces where illegal activities like trading stolen data, drugs, and weapons take place.

To train DarkBERT, the researchers accessed the dark web using a special network called Tor. They collected a large amount of raw data from various sources on the dark web, including hacking forums and scamming websites. The collected data was carefully filtered and processed to remove duplicates and ensure a balanced representation of different categories. This refined dark web database was then used to train DarkBERT over a period of about 15 days, using the RoBERTa model as a framework.

In simple terms, DarkBERT is an advanced language model that has been trained on dark web data to better understand and analyze the language used in illegal online activities.

How does it work?

The 'Dark Web' is a part of the internet where user identities are difficult to trace, thanks to sophisticated techniques that hide online activity. To access it, millions of people rely on software called Tor, which provides anonymity and privacy.

DarkBERT is based on the RoBERTa architecture, which was introduced in 2019 and has recently gained renewed attention as researchers discovered its untapped performance potential. With further training on dark web text, the model has shown improved performance beyond what the original 2019 model offered.

Researchers are actively investigating how large language models like ChatGPT can be leveraged to combat cybercrime. The aim is to use these models to detect and address online threats more effectively.


How DarkBERT was constructed: its process and use cases

Step 1: Data Collection

The researchers gathered a massive amount of text from the Dark Web to train DarkBERT. They started by collecting seed addresses from sources like Ahmia and public repositories that provide lists of onion domains. Using these addresses, they crawled the Dark Web and obtained web pages. Each page's HTML title and body were saved as a text file. The researchers also classified the pages based on their primary language, with a focus on English content. They managed to collect approximately 6.1 million pages for training.
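
To make the "HTML title and body saved as text" step more concrete, here is a minimal sketch using only Python's standard library. It is not the researchers' crawler; the class name `TitleBodyExtractor` and the function `page_to_text` are hypothetical names chosen for illustration, and skipping script and style content is an assumption on my part.

```python
from html.parser import HTMLParser


class TitleBodyExtractor(HTMLParser):
    """Collects text found inside <title> and <body>, skipping scripts and styles."""

    SKIPPED = {"script", "style"}  # assumption: such content is not useful text

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._stack = []  # open tags we track: title, body, script, style

    def handle_starttag(self, tag, attrs):
        if tag in {"title", "body"} or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._stack:
            return
        current = self._stack[-1]
        if current == "title":
            self.title_parts.append(text)
        elif current == "body":
            self.body_parts.append(text)
        # text inside script/style is ignored


def page_to_text(html: str) -> str:
    """Return the page title on the first line and the body text on the second."""
    parser = TitleBodyExtractor()
    parser.feed(html)
    title = " ".join(parser.title_parts)
    body = " ".join(parser.body_parts)
    return f"{title}\n{body}"
```

Each crawled page would be passed through something like `page_to_text` and the result written to its own text file.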

Step 2: Data Filtering and Processing

The researchers implemented several measures to ensure that the training data is useful. First, they removed pages with low information density that didn't provide meaningful content. They also balanced the categories of the pages and eliminated any duplicate pages. Additionally, the researchers took steps to address ethical concerns, such as masking or removing sensitive information from the data.
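
Here is a rough sketch of what such a filtering pass could look like. Everything specific in it is an assumption made for illustration: the `MIN_WORDS` threshold, the two regular expressions, and the placeholder strings `ID_EMAIL` and `ID_IP_ADDRESS` are hypothetical, and category balancing is left out to keep the example short.

```python
import hashlib
import re

# Hypothetical patterns for two kinds of sensitive identifiers.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

MIN_WORDS = 5  # hypothetical cutoff for "low information density"


def mask_sensitive(text: str) -> str:
    """Replace e-mail addresses and IPv4 addresses with placeholder tokens."""
    text = EMAIL_RE.sub("ID_EMAIL", text)
    return IPV4_RE.sub("ID_IP_ADDRESS", text)


def filter_pages(pages):
    """Drop very short pages and exact duplicates, then mask identifiers."""
    seen = set()
    kept = []
    for page in pages:
        if len(page.split()) < MIN_WORDS:
            continue  # too little content to be useful
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        seen.add(digest)
        kept.append(mask_sensitive(page))
    return kept
```

Hashing only catches exact duplicates; near-duplicate pages would need a fuzzier comparison, which this sketch does not attempt.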

Step 3: DarkBERT Pretraining

The researchers opted to use an existing model architecture instead of starting from scratch to save computational resources and leverage the knowledge learned by the existing model. They chose the RoBERTa model as the base initialization model for DarkBERT. The Dark Web pretraining text corpus was fed into the RoBERTa model, using the same tokenization vocabulary and separating each page with a separator token. The researchers created two versions of DarkBERT: one with raw text data and the other with preprocessed text. Both versions were pre-trained using PyTorch, a popular deep-learning framework.
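
The "separator token between pages" idea can be illustrated with a few lines of Python. This is only a sketch: `build_corpus` and `chunk_tokens` are hypothetical helpers, and splitting on whitespace is a stand-in for the real subword tokenizer, which is not reproduced here. The string `</s>` is the separator token used by RoBERTa.

```python
SEP_TOKEN = "</s>"  # RoBERTa's separator token string


def build_corpus(pages):
    """Join non-empty pages into one stream, with a separator between pages."""
    cleaned = (" ".join(page.split()) for page in pages if page.strip())
    return f" {SEP_TOKEN} ".join(cleaned)


def chunk_tokens(corpus, block_size):
    """Cut the stream into fixed-size blocks (whitespace tokens as a stand-in)."""
    tokens = corpus.split()
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]
```

In a real pretraining run, the blocks would hold subword token IDs rather than whitespace-separated words.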

Use Cases of DarkBERT:

DarkBERT has shown remarkable capabilities in several cybersecurity-related use cases. Let's explore a few of them.

Monitor Dark Web Forums for Potentially Harmful Threads

Researchers recognized the importance of monitoring dark web forums to identify threads that could pose potential harm. However, manually reviewing these forums is time-consuming, prompting the need for automated processes to assist security experts.

To address this challenge, the researchers concentrated on activities within hacking forums that could lead to significant harm. They created guidelines for annotating noteworthy threads, such as those involving the sharing of confidential data or the distribution of critical malware and vulnerabilities.

DarkBERT, surpassing other language models in precision, recall, and F1 score, emerged as the top performer for identifying important threads on the dark web. Its superior capabilities make it a valuable tool for security professionals in detecting potentially harmful discussions.
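
For readers unfamiliar with these metrics, here is a small sketch of how precision, recall, and F1 are computed for a binary "noteworthy thread" label. The function name and the toy labels in the usage note are made up for illustration and are not results from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

With toy labels `y_true = [1, 1, 1, 0, 0, 0]` and `y_pred = [1, 1, 0, 1, 0, 0]`, there are two true positives, one false positive, and one false negative, so all three metrics come out to 2/3.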

Ransomware Leak Site Detection:

DarkBERT demonstrates its effectiveness in identifying and categorizing ransomware leak sites found on the Dark Web. Cybercriminal groups frequently exploit the Dark Web to disclose confidential information stolen from organizations that resist paying the ransom. By outperforming other language models, DarkBERT improves the process of detecting and classifying these sites, equipping cybersecurity experts with enhanced capabilities to effectively manage the risks associated with such data leaks.

During their study, the researchers gathered data from well-known ransomware groups and examined ransomware leak sites that disclose private information belonging to organizations. In this analysis, DarkBERT exhibited superior performance compared to other language models when it came to identifying and categorizing these sites. This outcome highlights DarkBERT's ability to comprehend the language utilized in underground hacking forums on the dark web.


Identify Keywords Related to Threats on the Dark Web

DarkBERT utilizes the fill-mask function, a built-in capability of BERT-family language models, to precisely recognize keywords linked to illegal activities, such as drug sales on the dark web.

When the term "MDMA" was concealed within a drug sales page, DarkBERT generated relevant words associated with drugs, while alternative models proposed unrelated general terms like various professions.
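
The masking experiment can be sketched as follows. The helpers `mask_keyword` and `top_candidates` are hypothetical; `fill_mask` is assumed to be any callable that takes masked text and returns a list of dictionaries with `token_str` and `score` keys, which is the shape the Hugging Face fill-mask pipeline returns for a single mask. DarkBERT itself is not loaded here.

```python
import re

MASK_TOKEN = "<mask>"  # mask token string used by RoBERTa-style tokenizers


def mask_keyword(text, keyword, mask_token=MASK_TOKEN):
    """Replace the first whole-word occurrence of keyword with the mask token."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", flags=re.IGNORECASE)
    masked, count = pattern.subn(mask_token, text, count=1)
    if count == 0:
        raise ValueError(f"{keyword!r} not found in text")
    return masked


def top_candidates(fill_mask, masked_text, k=5):
    """Return the k highest-scoring replacement words proposed by the model."""
    predictions = fill_mask(masked_text)
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]
```

If the `transformers` library is installed, a callable created with `pipeline("fill-mask", model="roberta-base")` could be passed in as `fill_mask`. That would query the public RoBERTa base model rather than DarkBERT, so it corresponds to the baseline side of the comparison.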

The capability of DarkBERT to identify keywords connected to illicit activities holds significant importance in monitoring and addressing emerging cyber threats.

Is DarkBERT Accessible to the General Public?

Currently, DarkBERT is not accessible to the general public. However, the researchers are open to considering requests for its usage in academic settings. This means that individuals or organizations with academic purposes, such as research or educational projects, can approach the researchers to explore the possibility of using DarkBERT. The availability of DarkBERT for other purposes or to the broader public may be subject to future developments or considerations.

Is it possible for hackers to harness the power of DarkBERT for malicious purposes?

Certainly! DarkBERT, like any powerful technology, has the potential to be used for both positive and negative purposes. Its ability to comprehend and analyze dark web content can assist in cybersecurity efforts and combat cyber threats. However, it's important to consider the ethical implications of its use. The dark web is known for facilitating illegal activities, and by training DarkBERT on dark web data, there is a risk that it could be leveraged for malicious purposes.

For example, individuals with malicious intent could potentially use DarkBERT to improve their tactics in illegal activities, such as evading detection or enhancing their ability to carry out cybercrimes. It's essential to have measures in place to ensure that tools like DarkBERT are used responsibly and in accordance with legal and ethical frameworks.

To mitigate these risks, it is crucial to have appropriate regulations, oversight, and collaboration between researchers, law enforcement agencies, and ethical hackers. This can help ensure that DarkBERT is deployed in ways that align with societal interests, protect individual privacy, and contribute to the overall goal of enhancing cybersecurity.


In short, DarkBERT represents a groundbreaking advancement in leveraging AI language models to tackle the challenges posed by the Dark Web. Its superior performance, specialized training, and unmatched understanding of Dark Web language hold great potential for enhancing cybersecurity efforts, enabling efficient threat detection, and supporting investigations in this hidden domain.

Thank you for joining us on this exploration of DarkBERT. If you found this video informative, don't forget to like and subscribe to our channel and blog for more exciting cyber insights. Stay tuned for our next video, where we delve into the ever-evolving world of cybersecurity. Until then, stay curious and stay secure!

